All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-29  4:27 ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-29  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Paul Menage, KOSAKI Motohiro, linux-kernel, linux-mm

It's helpful to be able to specify an oom_adj value for newly forked
children that do not share memory with the parent.

Before making oom_adj values a characteristic of a task's mm in
2ff05b2b4eac2e63d345fc731ea151a060247f53, it was possible to change the
oom_adj value of a vfork() child prior to execve() without implicitly
changing the oom_adj value of the parent.  With the new behavior, the
oom_adj values of both threads would change since they represent the same
memory.

That change was necessary to fix an oom killer livelock which would occur
when a child would be selected for oom kill prior to execve() and the
task could not be killed because it shared memory with an OOM_DISABLE
parent.  In fact, only the most negative (most immune) oom_adj value for
all threads sharing the same memory would actually be used by the oom
killer, leaving inconsistencies amongst all other threads having
different oom_adj values (and, thus, incorrectly exported
/proc/pid/oom_score values).

This patch adds a new per-process parameter: /proc/pid/oom_adj_child.
This defaults to mirror the value of /proc/pid/oom_adj but may be changed
so that mm's initialized by their children are preferred over the parent
by the oom killer.  Setting oom_adj_child to be less (i.e. more immune)
than the task's oom_adj value itself is governed by the CAP_SYS_RESOURCE
capability.

When a mm is initialized, the initial oom_adj value will be set to the
parent's oom_adj_child.  This allows tasks to elevate the oom_adj value
of a vfork'd child prior to execve() before the execution actually takes
place.

Furthermore, /proc/pid/oom_adj_child is inherited from the task that
forked it.

Cc: Rik van Riel <riel@redhat.com>
Cc: Paul Menage <menage@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   38 ++++++++++++++++----
 fs/proc/base.c                     |   68 ++++++++++++++++++++++++++++++++++++
 include/linux/sched.h              |    1 +
 kernel/fork.c                      |    3 +-
 4 files changed, 101 insertions(+), 9 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -34,10 +34,11 @@ Table of Contents
 
   3	Per-Process Parameters
   3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
-  3.2	/proc/<pid>/oom_score - Display current oom-killer score
-  3.3	/proc/<pid>/io - Display the IO accounting fields
-  3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
-  3.5	/proc/<pid>/mountinfo - Information about mounts
+  3.2	/proc/<pid>/oom_adj_child - Change default oom_adj for children
+  3.3	/proc/<pid>/oom_score - Display current oom-killer score
+  3.4	/proc/<pid>/io - Display the IO accounting fields
+  3.5	/proc/<pid>/coredump_filter - Core dump filtering settings
+  3.6	/proc/<pid>/mountinfo - Information about mounts
 
 
 ------------------------------------------------------------------------------
@@ -1206,7 +1207,28 @@ The task with the highest badness score is then selected and its children
 are killed, process itself will be killed in an OOM situation when it does
 not have children or some of them disabled oom like described above.
 
-3.2 /proc/<pid>/oom_score - Display current oom-killer score
+
+3.2 /proc/<pid>/oom_adj_child - Change default oom_adj for children
+-------------------------------------------------------------------
+
+This file can be used to change the default oom_adj value for children when a
+new mm is initialized.  The oom_adj value for a child's mm is typically the
+task's oom_adj value itself, however this value can be altered by writing to
+this file.
+
+This is particularly helpful when a child is vfork'd and its mm following exec
+should have a higher priority oom_adj value than its parent.  The new mm will
+default to oom_adj_child of the parent task.
+
+oom_adj_child will mirror oom_adj whenever the latter changes for all tasks
+that share its memory.  This avoids having to set both values when simply
+tuning oom_adj and that value should be inherited by all children.
+
+Setting oom_adj_child to be more immune than the task's mm itself (i.e. less
+than oom_adj) is governed by the CAP_SYS_RESOURCE capability.
+
+
+3.3 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 
 This file can be used to check the current score used by the oom-killer is for
@@ -1214,7 +1236,7 @@ any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
 process should be killed in an out-of-memory situation.
 
 
-3.3  /proc/<pid>/io - Display the IO accounting fields
+3.4  /proc/<pid>/io - Display the IO accounting fields
 -------------------------------------------------------
 
 This file contains IO statistics for each running process
@@ -1316,7 +1338,7 @@ those 64-bit counters, process A could see an intermediate result.
 More information about this can be found within the taskstats documentation in
 Documentation/accounting.
 
-3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
+3.5 /proc/<pid>/coredump_filter - Core dump filtering settings
 ---------------------------------------------------------------
 When a process is dumped, all anonymous memory is written to a core file as
 long as the size of the core file isn't limited. But sometimes we don't want
@@ -1360,7 +1382,7 @@ For example:
   $ echo 0x7 > /proc/self/coredump_filter
   $ ./some_program
 
-3.5	/proc/<pid>/mountinfo - Information about mounts
+3.6	/proc/<pid>/mountinfo - Information about mounts
 --------------------------------------------------------
 
 This file contains lines of the form:
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1023,6 +1023,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
 	struct task_struct *task;
+	struct task_struct *g, *p;
 	char buffer[PROC_NUMBUF], *end;
 	int oom_adjust;
 
@@ -1051,6 +1052,12 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		put_task_struct(task);
 		return -EACCES;
 	}
+	read_lock(&tasklist_lock);
+	do_each_thread(g, p) {
+		if (p->mm && p->mm == task->mm)
+			p->oom_adj_child = oom_adjust;
+	} while_each_thread(g, p);
+	read_unlock(&tasklist_lock);
 	task->mm->oom_adj = oom_adjust;
 	task_unlock(task);
 	put_task_struct(task);
@@ -1064,6 +1071,65 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.write		= oom_adjust_write,
 };
 
+static ssize_t oom_adj_child_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[PROC_NUMBUF];
+	size_t len;
+	int oom_adj_child;
+
+	if (!task)
+		return -ESRCH;
+	oom_adj_child = task->oom_adj_child;
+	put_task_struct(task);
+
+	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adj_child);
+
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_adj_child_write(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF], *end;
+	int oom_adj_child;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	oom_adj_child = simple_strtol(buffer, &end, 0);
+	if ((oom_adj_child < OOM_ADJUST_MIN ||
+	     oom_adj_child > OOM_ADJUST_MAX) && oom_adj_child != OOM_DISABLE)
+		return -EINVAL;
+	if (*end == '\n')
+		end++;
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	task_lock(task);
+	if (task->mm && oom_adj_child < task->mm->oom_adj &&
+	    !capable(CAP_SYS_RESOURCE)) {
+		task_unlock(task);
+		put_task_struct(task);
+		return -EINVAL;
+	}
+	task_unlock(task);
+	task->oom_adj_child = oom_adj_child;
+	put_task_struct(task);
+	if (end - buffer == 0)
+		return -EIO;
+	return end - buffer;
+}
+
+static const struct file_operations proc_oom_adj_child_operations = {
+	.read		= oom_adj_child_read,
+	.write		= oom_adj_child_write,
+};
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2548,6 +2614,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -2886,6 +2953,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1198,6 +1198,7 @@ struct task_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
+	s8 oom_adj_child;	/* Default child OOM-kill score adjustment */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
-	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
+	mm->oom_adj = p->oom_adj_child;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);
@@ -679,6 +679,7 @@ good_mm:
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	tsk->oom_adj_child = mm->oom_adj;
 	return 0;
 
 fail_nomem:

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-29  4:27 ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-29  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Paul Menage, KOSAKI Motohiro, linux-kernel, linux-mm

It's helpful to be able to specify an oom_adj value for newly forked
children that do not share memory with the parent.

Before making oom_adj values a characteristic of a task's mm in
2ff05b2b4eac2e63d345fc731ea151a060247f53, it was possible to change the
oom_adj value of a vfork() child prior to execve() without implicitly
changing the oom_adj value of the parent.  With the new behavior, the
oom_adj values of both threads would change since they represent the same
memory.

That change was necessary to fix an oom killer livelock which would occur
when a child would be selected for oom kill prior to execve() and the
task could not be killed because it shared memory with an OOM_DISABLE
parent.  In fact, only the most negative (most immune) oom_adj value for
all threads sharing the same memory would actually be used by the oom
killer, leaving inconsistencies amongst all other threads having
different oom_adj values (and, thus, incorrectly exported
/proc/pid/oom_score values).

This patch adds a new per-process parameter: /proc/pid/oom_adj_child.
This defaults to mirror the value of /proc/pid/oom_adj but may be changed
so that mm's initialized by their children are preferred over the parent
by the oom killer.  Setting oom_adj_child to be less (i.e. more immune)
than the task's oom_adj value itself is governed by the CAP_SYS_RESOURCE
capability.

When a mm is initialized, the initial oom_adj value will be set to the
parent's oom_adj_child.  This allows tasks to elevate the oom_adj value
of a vfork'd child prior to execve() before the execution actually takes
place.

Furthermore, /proc/pid/oom_adj_child is inherited from the task that
forked it.

Cc: Rik van Riel <riel@redhat.com>
Cc: Paul Menage <menage@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   38 ++++++++++++++++----
 fs/proc/base.c                     |   68 ++++++++++++++++++++++++++++++++++++
 include/linux/sched.h              |    1 +
 kernel/fork.c                      |    3 +-
 4 files changed, 101 insertions(+), 9 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -34,10 +34,11 @@ Table of Contents
 
   3	Per-Process Parameters
   3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
-  3.2	/proc/<pid>/oom_score - Display current oom-killer score
-  3.3	/proc/<pid>/io - Display the IO accounting fields
-  3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
-  3.5	/proc/<pid>/mountinfo - Information about mounts
+  3.2	/proc/<pid>/oom_adj_child - Change default oom_adj for children
+  3.3	/proc/<pid>/oom_score - Display current oom-killer score
+  3.4	/proc/<pid>/io - Display the IO accounting fields
+  3.5	/proc/<pid>/coredump_filter - Core dump filtering settings
+  3.6	/proc/<pid>/mountinfo - Information about mounts
 
 
 ------------------------------------------------------------------------------
@@ -1206,7 +1207,28 @@ The task with the highest badness score is then selected and its children
 are killed, process itself will be killed in an OOM situation when it does
 not have children or some of them disabled oom like described above.
 
-3.2 /proc/<pid>/oom_score - Display current oom-killer score
+
+3.2 /proc/<pid>/oom_adj_child - Change default oom_adj for children
+-------------------------------------------------------------------
+
+This file can be used to change the default oom_adj value for children when a
+new mm is initialized.  The oom_adj value for a child's mm is typically the
+task's oom_adj value itself, however this value can be altered by writing to
+this file.
+
+This is particularly helpful when a child is vfork'd and its mm following exec
+should have a higher priority oom_adj value than its parent.  The new mm will
+default to oom_adj_child of the parent task.
+
+oom_adj_child will mirror oom_adj whenever the latter changes for all tasks
+that share its memory.  This avoids having to set both values when simply
+tuning oom_adj and that value should be inherited by all children.
+
+Setting oom_adj_child to be more immune than the task's mm itself (i.e. less
+than oom_adj) is governed by the CAP_SYS_RESOURCE capability.
+
+
+3.3 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 
 This file can be used to check the current score used by the oom-killer is for
@@ -1214,7 +1236,7 @@ any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
 process should be killed in an out-of-memory situation.
 
 
-3.3  /proc/<pid>/io - Display the IO accounting fields
+3.4  /proc/<pid>/io - Display the IO accounting fields
 -------------------------------------------------------
 
 This file contains IO statistics for each running process
@@ -1316,7 +1338,7 @@ those 64-bit counters, process A could see an intermediate result.
 More information about this can be found within the taskstats documentation in
 Documentation/accounting.
 
-3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
+3.5 /proc/<pid>/coredump_filter - Core dump filtering settings
 ---------------------------------------------------------------
 When a process is dumped, all anonymous memory is written to a core file as
 long as the size of the core file isn't limited. But sometimes we don't want
@@ -1360,7 +1382,7 @@ For example:
   $ echo 0x7 > /proc/self/coredump_filter
   $ ./some_program
 
-3.5	/proc/<pid>/mountinfo - Information about mounts
+3.6	/proc/<pid>/mountinfo - Information about mounts
 --------------------------------------------------------
 
 This file contains lines of the form:
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1023,6 +1023,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
 	struct task_struct *task;
+	struct task_struct *g, *p;
 	char buffer[PROC_NUMBUF], *end;
 	int oom_adjust;
 
@@ -1051,6 +1052,12 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		put_task_struct(task);
 		return -EACCES;
 	}
+	read_lock(&tasklist_lock);
+	do_each_thread(g, p) {
+		if (p->mm && p->mm == task->mm)
+			p->oom_adj_child = oom_adjust;
+	} while_each_thread(g, p);
+	read_unlock(&tasklist_lock);
 	task->mm->oom_adj = oom_adjust;
 	task_unlock(task);
 	put_task_struct(task);
@@ -1064,6 +1071,65 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.write		= oom_adjust_write,
 };
 
+static ssize_t oom_adj_child_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[PROC_NUMBUF];
+	size_t len;
+	int oom_adj_child;
+
+	if (!task)
+		return -ESRCH;
+	oom_adj_child = task->oom_adj_child;
+	put_task_struct(task);
+
+	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adj_child);
+
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_adj_child_write(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF], *end;
+	int oom_adj_child;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	oom_adj_child = simple_strtol(buffer, &end, 0);
+	if ((oom_adj_child < OOM_ADJUST_MIN ||
+	     oom_adj_child > OOM_ADJUST_MAX) && oom_adj_child != OOM_DISABLE)
+		return -EINVAL;
+	if (*end == '\n')
+		end++;
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	task_lock(task);
+	if (task->mm && oom_adj_child < task->mm->oom_adj &&
+	    !capable(CAP_SYS_RESOURCE)) {
+		task_unlock(task);
+		put_task_struct(task);
+		return -EINVAL;
+	}
+	task_unlock(task);
+	task->oom_adj_child = oom_adj_child;
+	put_task_struct(task);
+	if (end - buffer == 0)
+		return -EIO;
+	return end - buffer;
+}
+
+static const struct file_operations proc_oom_adj_child_operations = {
+	.read		= oom_adj_child_read,
+	.write		= oom_adj_child_write,
+};
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2548,6 +2614,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -2886,6 +2953,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1198,6 +1198,7 @@ struct task_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
+	s8 oom_adj_child;	/* Default child OOM-kill score adjustment */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
-	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
+	mm->oom_adj = p->oom_adj_child;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);
@@ -679,6 +679,7 @@ good_mm:
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	tsk->oom_adj_child = mm->oom_adj;
 	return 0;
 
 fail_nomem:

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-29  4:27 ` David Rientjes
@ 2009-07-29 23:13   ` Andrew Morton
  -1 siblings, 0 replies; 64+ messages in thread
From: Andrew Morton @ 2009-07-29 23:13 UTC (permalink / raw)
  To: David Rientjes; +Cc: riel, menage, kosaki.motohiro, linux-kernel, linux-mm

On Tue, 28 Jul 2009 21:27:15 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> +static ssize_t oom_adj_child_write(struct file *file, const char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char buffer[PROC_NUMBUF], *end;
> +	int oom_adj_child;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +	oom_adj_child = simple_strtol(buffer, &end, 0);
> +	if ((oom_adj_child < OOM_ADJUST_MIN ||
> +	     oom_adj_child > OOM_ADJUST_MAX) && oom_adj_child != OOM_DISABLE)
> +		return -EINVAL;
> +	if (*end == '\n')
> +		end++;
> +	task = get_proc_task(file->f_path.dentry->d_inode);
> +	if (!task)
> +		return -ESRCH;
> +	task_lock(task);
> +	if (task->mm && oom_adj_child < task->mm->oom_adj &&
> +	    !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	task_unlock(task);
> +	task->oom_adj_child = oom_adj_child;
> +	put_task_struct(task);
> +	if (end - buffer == 0)
> +		return -EIO;
> +	return end - buffer;
> +}

Do we really need to do all that string hacking?  All it does is reads
a plain old integer from userspace.

It's weird that the obfuscated check for zero-length input happens
right at the end of the function, particularly as we couldn't have got
that far anyway, because we'd already have returned -EINVAL.

And even after all that, I suspect the function will permit illogical
input such as "12foo" - which is what strict_strtoul() is for (as
checkpatch points out!).



grumble.  At how many codesites do we read an ascii integer from
userspace?  Thousands, surely.  You'd think we'd have a little function
to do it by now.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-29 23:13   ` Andrew Morton
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Morton @ 2009-07-29 23:13 UTC (permalink / raw)
  To: David Rientjes; +Cc: riel, menage, kosaki.motohiro, linux-kernel, linux-mm

On Tue, 28 Jul 2009 21:27:15 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> +static ssize_t oom_adj_child_write(struct file *file, const char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char buffer[PROC_NUMBUF], *end;
> +	int oom_adj_child;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +	oom_adj_child = simple_strtol(buffer, &end, 0);
> +	if ((oom_adj_child < OOM_ADJUST_MIN ||
> +	     oom_adj_child > OOM_ADJUST_MAX) && oom_adj_child != OOM_DISABLE)
> +		return -EINVAL;
> +	if (*end == '\n')
> +		end++;
> +	task = get_proc_task(file->f_path.dentry->d_inode);
> +	if (!task)
> +		return -ESRCH;
> +	task_lock(task);
> +	if (task->mm && oom_adj_child < task->mm->oom_adj &&
> +	    !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	task_unlock(task);
> +	task->oom_adj_child = oom_adj_child;
> +	put_task_struct(task);
> +	if (end - buffer == 0)
> +		return -EIO;
> +	return end - buffer;
> +}

Do we really need to do all that string hacking?  All it does is reads
a plain old integer from userspace.

It's weird that the obfuscated check for zero-length input happens
right at the end of the function, particularly as we couldn't have got
that far anyway, because we'd already have returned -EINVAL.

And even after all that, I suspect the function will permit illogical
input such as "12foo" - which is what strict_strtoul() is for (as
checkpatch points out!).



grumble.  At how many codesites do we read an ascii integer from
userspace?  Thousands, surely.  You'd think we'd have a little function
to do it by now.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-29 23:13   ` Andrew Morton
@ 2009-07-29 23:25     ` Paul Menage
  -1 siblings, 0 replies; 64+ messages in thread
From: Paul Menage @ 2009-07-29 23:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, riel, kosaki.motohiro, linux-kernel, linux-mm

On Wed, Jul 29, 2009 at 4:13 PM, Andrew Morton<akpm@linux-foundation.org> wrote:
>
> Do we really need to do all that string hacking?  All it does is reads
> a plain old integer from userspace.

It would be nice to have the equivalent of the cgroupfs read_u64 and
write_u64 methods, where you just supply a function that
accepts/returns the appropriate value, and all the buffer munging is
done in the generic code.

Paul

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-29 23:25     ` Paul Menage
  0 siblings, 0 replies; 64+ messages in thread
From: Paul Menage @ 2009-07-29 23:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, riel, kosaki.motohiro, linux-kernel, linux-mm

On Wed, Jul 29, 2009 at 4:13 PM, Andrew Morton<akpm@linux-foundation.org> wrote:
>
> Do we really need to do all that string hacking?  All it does is reads
> a plain old integer from userspace.

It would be nice to have the equivalent of the cgroupfs read_u64 and
write_u64 methods, where you just supply a function that
accepts/returns the appropriate value, and all the buffer munging is
done in the generic code.

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-29  4:27 ` David Rientjes
@ 2009-07-30  2:32   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-07-30  2:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

NAK this. I explain the reason at below.


> It's helpful to be able to specify an oom_adj value for newly forked
> children that do not share memory with the parent.
> 
> Before making oom_adj values a characteristic of a task's mm in
> 2ff05b2b4eac2e63d345fc731ea151a060247f53, it was possible to change the
> oom_adj value of a vfork() child prior to execve() without implicitly
> changing the oom_adj value of the parent.  With the new behavior, the
> oom_adj values of both threads would change since they represent the same
> memory.
> 
> That change was necessary to fix an oom killer livelock which would occur
> when a child would be selected for oom kill prior to execve() and the
> task could not be killed because it shared memory with an OOM_DISABLE
> parent.  In fact, only the most negative (most immune) oom_adj value for
> all threads sharing the same memory would actually be used by the oom
> killer, leaving inconsistencies amongst all other threads having
> different oom_adj values (and, thus, incorrectly exported
> /proc/pid/oom_score values).
> 
> This patch adds a new per-process parameter: /proc/pid/oom_adj_child.

nit: per-thread.

> This defaults to mirror the value of /proc/pid/oom_adj but may be changed
> so that mm's initialized by their children are preferred over the parent
> by the oom killer.  Setting oom_adj_child to be less (i.e. more immune)
> than the task's oom_adj value itself is governed by the CAP_SYS_RESOURCE
> capability.
> 
> When a mm is initialized, the initial oom_adj value will be set to the
> parent's oom_adj_child.  This allows tasks to elevate the oom_adj value
> of a vfork'd child prior to execve() before the execution actually takes
> place.
> 
> Furthermore, /proc/pid/oom_adj_child is inherited from the task that
> forked it.
> 
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Paul Menage <menage@google.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt |   38 ++++++++++++++++----
>  fs/proc/base.c                     |   68 ++++++++++++++++++++++++++++++++++++
>  include/linux/sched.h              |    1 +
>  kernel/fork.c                      |    3 +-
>  4 files changed, 101 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -34,10 +34,11 @@ Table of Contents
>  
>    3	Per-Process Parameters
>    3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
> -  3.2	/proc/<pid>/oom_score - Display current oom-killer score
> -  3.3	/proc/<pid>/io - Display the IO accounting fields
> -  3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> +  3.2	/proc/<pid>/oom_adj_child - Change default oom_adj for children
> +  3.3	/proc/<pid>/oom_score - Display current oom-killer score
> +  3.4	/proc/<pid>/io - Display the IO accounting fields
> +  3.5	/proc/<pid>/coredump_filter - Core dump filtering settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
>  
>  
>  ------------------------------------------------------------------------------
> @@ -1206,7 +1207,28 @@ The task with the highest badness score is then selected and its children
>  are killed, process itself will be killed in an OOM situation when it does
>  not have children or some of them disabled oom like described above.
>  
> -3.2 /proc/<pid>/oom_score - Display current oom-killer score
> +
> +3.2 /proc/<pid>/oom_adj_child - Change default oom_adj for children
> +-------------------------------------------------------------------
> +
> +This file can be used to change the default oom_adj value for children when a
> +new mm is initialized.  The oom_adj value for a child's mm is typically the
> +task's oom_adj value itself, however this value can be altered by writing to
> +this file.
> +
> +This is particularly helpful when a child is vfork'd and its mm following exec
> +should have a higher priority oom_adj value than its parent.  The new mm will
> +default to oom_adj_child of the parent task.
> +
> +oom_adj_child will mirror oom_adj whenever the latter changes for all tasks
> +that share its memory.  This avoids having to set both values when simply
> +tuning oom_adj and that value should be inherited by all children.
> +
> +Setting oom_adj_child to be more immune than the task's mm itself (i.e. less
> +than oom_adj) is governed by the CAP_SYS_RESOURCE capability.
> +
> +
> +3.3 /proc/<pid>/oom_score - Display current oom-killer score
>  -------------------------------------------------------------
>  
>  This file can be used to check the current score used by the oom-killer is for
> @@ -1214,7 +1236,7 @@ any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
>  process should be killed in an out-of-memory situation.
>  
>  
> -3.3  /proc/<pid>/io - Display the IO accounting fields
> +3.4  /proc/<pid>/io - Display the IO accounting fields
>  -------------------------------------------------------
>  
>  This file contains IO statistics for each running process
> @@ -1316,7 +1338,7 @@ those 64-bit counters, process A could see an intermediate result.
>  More information about this can be found within the taskstats documentation in
>  Documentation/accounting.
>  
> -3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
> +3.5 /proc/<pid>/coredump_filter - Core dump filtering settings
>  ---------------------------------------------------------------
>  When a process is dumped, all anonymous memory is written to a core file as
>  long as the size of the core file isn't limited. But sometimes we don't want
> @@ -1360,7 +1382,7 @@ For example:
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>  
> -3.5	/proc/<pid>/mountinfo - Information about mounts
> +3.6	/proc/<pid>/mountinfo - Information about mounts
>  --------------------------------------------------------
>  
>  This file contains lines of the form:
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1023,6 +1023,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  				size_t count, loff_t *ppos)
>  {
>  	struct task_struct *task;
> +	struct task_struct *g, *p;
>  	char buffer[PROC_NUMBUF], *end;
>  	int oom_adjust;
>  
> @@ -1051,6 +1052,12 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> +	read_lock(&tasklist_lock);
> +	do_each_thread(g, p) {
> +		if (p->mm && p->mm == task->mm)
> +			p->oom_adj_child = oom_adjust;
> +	} while_each_thread(g, p);
> +	read_unlock(&tasklist_lock);
>  	task->mm->oom_adj = oom_adjust;
>  	task_unlock(task);
>  	put_task_struct(task);
> @@ -1064,6 +1071,65 @@ static const struct file_operations proc_oom_adjust_operations = {
>  	.write		= oom_adjust_write,
>  };
>  
> +static ssize_t oom_adj_child_read(struct file *file, char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int oom_adj_child;
> +
> +	if (!task)
> +		return -ESRCH;
> +	oom_adj_child = task->oom_adj_child;
> +	put_task_struct(task);
> +
> +	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adj_child);
> +
> +	return simple_read_from_buffer(buf, count, ppos, buffer, len);
> +}
> +
> +static ssize_t oom_adj_child_write(struct file *file, const char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char buffer[PROC_NUMBUF], *end;
> +	int oom_adj_child;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +	oom_adj_child = simple_strtol(buffer, &end, 0);
> +	if ((oom_adj_child < OOM_ADJUST_MIN ||
> +	     oom_adj_child > OOM_ADJUST_MAX) && oom_adj_child != OOM_DISABLE)
> +		return -EINVAL;
> +	if (*end == '\n')
> +		end++;
> +	task = get_proc_task(file->f_path.dentry->d_inode);
> +	if (!task)
> +		return -ESRCH;
> +	task_lock(task);
> +	if (task->mm && oom_adj_child < task->mm->oom_adj &&
> +	    !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	task_unlock(task);
> +	task->oom_adj_child = oom_adj_child;
> +	put_task_struct(task);
> +	if (end - buffer == 0)
> +		return -EIO;
> +	return end - buffer;
> +}
> +
> +static const struct file_operations proc_oom_adj_child_operations = {
> +	.read		= oom_adj_child_read,
> +	.write		= oom_adj_child_write,
> +};
> +
>  #ifdef CONFIG_AUDITSYSCALL
>  #define TMPBUFLEN 21
>  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> @@ -2548,6 +2614,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  	INF("oom_score",  S_IRUGO, proc_oom_score),
>  	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> +	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
>  #ifdef CONFIG_AUDITSYSCALL
>  	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> @@ -2886,6 +2953,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #endif
>  	INF("oom_score", S_IRUGO, proc_oom_score),
>  	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> +	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
>  #ifdef CONFIG_AUDITSYSCALL
>  	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1198,6 +1198,7 @@ struct task_struct {
>  	 * a short time
>  	 */
>  	unsigned char fpu_counter;
> +	s8 oom_adj_child;	/* Default child OOM-kill score adjustment */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	unsigned int btrace_seq;
>  #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	init_rwsem(&mm->mmap_sem);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
> +	mm->oom_adj = p->oom_adj_child;

This code doesn't fix anything.
mm->oom_adj assignment still change vfork() parent process oom_adj value.
(Again, vfork() parent and child use the same mm)

IOW, in vfork case, oom_adj_child parameter doesn't only change child oom_adj,
but also parent oom_adj value.
IOW, oom_adj_child is NOT child effective parameter.


>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
>  	set_mm_counter(mm, file_rss, 0);
> @@ -679,6 +679,7 @@ good_mm:
>  
>  	tsk->mm = mm;
>  	tsk->active_mm = mm;
> +	tsk->oom_adj_child = mm->oom_adj;
>  	return 0;
>  
>  fail_nomem:




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-30  2:32   ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-07-30  2:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

NAK this. I explain the reason at below.


> It's helpful to be able to specify an oom_adj value for newly forked
> children that do not share memory with the parent.
> 
> Before making oom_adj values a characteristic of a task's mm in
> 2ff05b2b4eac2e63d345fc731ea151a060247f53, it was possible to change the
> oom_adj value of a vfork() child prior to execve() without implicitly
> changing the oom_adj value of the parent.  With the new behavior, the
> oom_adj values of both threads would change since they represent the same
> memory.
> 
> That change was necessary to fix an oom killer livelock which would occur
> when a child would be selected for oom kill prior to execve() and the
> task could not be killed because it shared memory with an OOM_DISABLE
> parent.  In fact, only the most negative (most immune) oom_adj value for
> all threads sharing the same memory would actually be used by the oom
> killer, leaving inconsistencies amongst all other threads having
> different oom_adj values (and, thus, incorrectly exported
> /proc/pid/oom_score values).
> 
> This patch adds a new per-process parameter: /proc/pid/oom_adj_child.

nit: per-thread.

> This defaults to mirror the value of /proc/pid/oom_adj but may be changed
> so that mm's initialized by their children are preferred over the parent
> by the oom killer.  Setting oom_adj_child to be less (i.e. more immune)
> than the task's oom_adj value itself is governed by the CAP_SYS_RESOURCE
> capability.
> 
> When a mm is initialized, the initial oom_adj value will be set to the
> parent's oom_adj_child.  This allows tasks to elevate the oom_adj value
> of a vfork'd child prior to execve() before the execution actually takes
> place.
> 
> Furthermore, /proc/pid/oom_adj_child is inherited from the task that
> forked it.
> 
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Paul Menage <menage@google.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt |   38 ++++++++++++++++----
>  fs/proc/base.c                     |   68 ++++++++++++++++++++++++++++++++++++
>  include/linux/sched.h              |    1 +
>  kernel/fork.c                      |    3 +-
>  4 files changed, 101 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -34,10 +34,11 @@ Table of Contents
>  
>    3	Per-Process Parameters
>    3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
> -  3.2	/proc/<pid>/oom_score - Display current oom-killer score
> -  3.3	/proc/<pid>/io - Display the IO accounting fields
> -  3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> +  3.2	/proc/<pid>/oom_adj_child - Change default oom_adj for children
> +  3.3	/proc/<pid>/oom_score - Display current oom-killer score
> +  3.4	/proc/<pid>/io - Display the IO accounting fields
> +  3.5	/proc/<pid>/coredump_filter - Core dump filtering settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
>  
>  
>  ------------------------------------------------------------------------------
> @@ -1206,7 +1207,28 @@ The task with the highest badness score is then selected and its children
>  are killed, process itself will be killed in an OOM situation when it does
>  not have children or some of them disabled oom like described above.
>  
> -3.2 /proc/<pid>/oom_score - Display current oom-killer score
> +
> +3.2 /proc/<pid>/oom_adj_child - Change default oom_adj for children
> +-------------------------------------------------------------------
> +
> +This file can be used to change the default oom_adj value for children when a
> +new mm is initialized.  The oom_adj value for a child's mm is typically the
> +task's oom_adj value itself, however this value can be altered by writing to
> +this file.
> +
> +This is particularly helpful when a child is vfork'd and its mm following exec
> +should have a higher priority oom_adj value than its parent.  The new mm will
> +default to oom_adj_child of the parent task.
> +
> +oom_adj_child will mirror oom_adj whenever the latter changes for all tasks
> +that share its memory.  This avoids having to set both values when simply
> +tuning oom_adj and that value should be inherited by all children.
> +
> +Setting oom_adj_child to be more immune than the task's mm itself (i.e. less
> +than oom_adj) is governed by the CAP_SYS_RESOURCE capability.
> +
> +
> +3.3 /proc/<pid>/oom_score - Display current oom-killer score
>  -------------------------------------------------------------
>  
>  This file can be used to check the current score used by the oom-killer is for
> @@ -1214,7 +1236,7 @@ any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
>  process should be killed in an out-of-memory situation.
>  
>  
> -3.3  /proc/<pid>/io - Display the IO accounting fields
> +3.4  /proc/<pid>/io - Display the IO accounting fields
>  -------------------------------------------------------
>  
>  This file contains IO statistics for each running process
> @@ -1316,7 +1338,7 @@ those 64-bit counters, process A could see an intermediate result.
>  More information about this can be found within the taskstats documentation in
>  Documentation/accounting.
>  
> -3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
> +3.5 /proc/<pid>/coredump_filter - Core dump filtering settings
>  ---------------------------------------------------------------
>  When a process is dumped, all anonymous memory is written to a core file as
>  long as the size of the core file isn't limited. But sometimes we don't want
> @@ -1360,7 +1382,7 @@ For example:
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>  
> -3.5	/proc/<pid>/mountinfo - Information about mounts
> +3.6	/proc/<pid>/mountinfo - Information about mounts
>  --------------------------------------------------------
>  
>  This file contains lines of the form:
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1023,6 +1023,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  				size_t count, loff_t *ppos)
>  {
>  	struct task_struct *task;
> +	struct task_struct *g, *p;
>  	char buffer[PROC_NUMBUF], *end;
>  	int oom_adjust;
>  
> @@ -1051,6 +1052,12 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> +	read_lock(&tasklist_lock);
> +	do_each_thread(g, p) {
> +		if (p->mm && p->mm == task->mm)
> +			p->oom_adj_child = oom_adjust;
> +	} while_each_thread(g, p);
> +	read_unlock(&tasklist_lock);
>  	task->mm->oom_adj = oom_adjust;
>  	task_unlock(task);
>  	put_task_struct(task);
> @@ -1064,6 +1071,65 @@ static const struct file_operations proc_oom_adjust_operations = {
>  	.write		= oom_adjust_write,
>  };
>  
> +static ssize_t oom_adj_child_read(struct file *file, char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int oom_adj_child;
> +
> +	if (!task)
> +		return -ESRCH;
> +	oom_adj_child = task->oom_adj_child;
> +	put_task_struct(task);
> +
> +	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adj_child);
> +
> +	return simple_read_from_buffer(buf, count, ppos, buffer, len);
> +}
> +
> +static ssize_t oom_adj_child_write(struct file *file, const char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char buffer[PROC_NUMBUF], *end;
> +	int oom_adj_child;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +	oom_adj_child = simple_strtol(buffer, &end, 0);
> +	if ((oom_adj_child < OOM_ADJUST_MIN ||
> +	     oom_adj_child > OOM_ADJUST_MAX) && oom_adj_child != OOM_DISABLE)
> +		return -EINVAL;
> +	if (*end == '\n')
> +		end++;
> +	task = get_proc_task(file->f_path.dentry->d_inode);
> +	if (!task)
> +		return -ESRCH;
> +	task_lock(task);
> +	if (task->mm && oom_adj_child < task->mm->oom_adj &&
> +	    !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	task_unlock(task);
> +	task->oom_adj_child = oom_adj_child;
> +	put_task_struct(task);
> +	if (end - buffer == 0)
> +		return -EIO;
> +	return end - buffer;
> +}
> +
> +static const struct file_operations proc_oom_adj_child_operations = {
> +	.read		= oom_adj_child_read,
> +	.write		= oom_adj_child_write,
> +};
> +
>  #ifdef CONFIG_AUDITSYSCALL
>  #define TMPBUFLEN 21
>  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> @@ -2548,6 +2614,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  	INF("oom_score",  S_IRUGO, proc_oom_score),
>  	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> +	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
>  #ifdef CONFIG_AUDITSYSCALL
>  	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> @@ -2886,6 +2953,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #endif
>  	INF("oom_score", S_IRUGO, proc_oom_score),
>  	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> +	REG("oom_adj_child",	S_IRUGO|S_IWUSR, proc_oom_adj_child_operations),
>  #ifdef CONFIG_AUDITSYSCALL
>  	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1198,6 +1198,7 @@ struct task_struct {
>  	 * a short time
>  	 */
>  	unsigned char fpu_counter;
> +	s8 oom_adj_child;	/* Default child OOM-kill score adjustment */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	unsigned int btrace_seq;
>  #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	init_rwsem(&mm->mmap_sem);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
> +	mm->oom_adj = p->oom_adj_child;

This code doesn't fix anything.
mm->oom_adj assignment still change vfork() parent process oom_adj value.
(Again, vfork() parent and child use the same mm)

IOW, in vfork case, oom_adj_child parameter doesn't only change child oom_adj,
but also parent oom_adj value.
IOW, oom_adj_child is NOT child effective parameter.


>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
>  	set_mm_counter(mm, file_rss, 0);
> @@ -679,6 +679,7 @@ good_mm:
>  
>  	tsk->mm = mm;
>  	tsk->active_mm = mm;
> +	tsk->oom_adj_child = mm->oom_adj;
>  	return 0;
>  
>  fail_nomem:



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-30  2:32   ` KOSAKI Motohiro
@ 2009-07-30  7:06     ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-30  7:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Paul Menage, linux-kernel, linux-mm

On Thu, 30 Jul 2009, KOSAKI Motohiro wrote:

> > diff --git a/kernel/fork.c b/kernel/fork.c
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> >  	init_rwsem(&mm->mmap_sem);
> >  	INIT_LIST_HEAD(&mm->mmlist);
> >  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> > -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
> > +	mm->oom_adj = p->oom_adj_child;
> 
> This code doesn't fix anything.
> mm->oom_adj assignment still change vfork() parent process oom_adj value.
> (Again, vfork() parent and child use the same mm)
> 

That's because the oom killer only really considers the highest oom_adj 
value amongst all threads that share the same mm.  Allowing those threads 
to each have different oom_adj values leads (i) to an inconsistency in 
reporting /proc/pid/oom_score for how the oom killer selects a task to 
kill and (ii) the oom killer livelock that it fixes when one thread 
happens to be OOM_DISABLE.

So, yes, changing the oom_adj value for a thread may have side-effects 
on other threads that didn't exist prior to 2.6.31-rc1 because the oom_adj 
value now represents a killable quantity of memory instead of a being a 
characteristic of the task itself.  But we now provide the inheritance 
property in a new way, via /proc/pid/oom_adj_child, that gives you all the 
functionality that the previous way did but without the potential for 
livelock.

> IOW, in vfork case, oom_adj_child parameter doesn't only change child oom_adj,
> but also parent oom_adj value.

Changing oom_adj_child for a task never changes oom_adj for any mm, it 
simply specifies what default value shall be given for a child's newly 
initialized mm.  Chaning oom_adj, on the other hand, will 

> IOW, oom_adj_child is NOT child effective parameter.
> 

It's not meant to be, it's only meant to specify a default value for newly 
initialized mm's of its descendants.  What happens after that is governed 
completely by the child's own /proc/pid/oom_adj.  That's pretty clearly 
explained in Documentation/filesystems/proc.txt.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-30  7:06     ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-30  7:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Paul Menage, linux-kernel, linux-mm

On Thu, 30 Jul 2009, KOSAKI Motohiro wrote:

> > diff --git a/kernel/fork.c b/kernel/fork.c
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> >  	init_rwsem(&mm->mmap_sem);
> >  	INIT_LIST_HEAD(&mm->mmlist);
> >  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> > -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
> > +	mm->oom_adj = p->oom_adj_child;
> 
> This code doesn't fix anything.
> mm->oom_adj assignment still change vfork() parent process oom_adj value.
> (Again, vfork() parent and child use the same mm)
> 

That's because the oom killer only really considers the highest oom_adj 
value amongst all threads that share the same mm.  Allowing those threads 
to each have different oom_adj values leads (i) to an inconsistency in 
reporting /proc/pid/oom_score for how the oom killer selects a task to 
kill and (ii) the oom killer livelock that it fixes when one thread 
happens to be OOM_DISABLE.

So, yes, changing the oom_adj value for a thread may have side-effects 
on other threads that didn't exist prior to 2.6.31-rc1 because the oom_adj 
value now represents a killable quantity of memory instead of a being a 
characteristic of the task itself.  But we now provide the inheritance 
property in a new way, via /proc/pid/oom_adj_child, that gives you all the 
functionality that the previous way did but without the potential for 
livelock.

> IOW, in vfork case, oom_adj_child parameter doesn't only change child oom_adj,
> but also parent oom_adj value.

Changing oom_adj_child for a task never changes oom_adj for any mm, it 
simply specifies what default value shall be given for a child's newly 
initialized mm.  Chaning oom_adj, on the other hand, will 

> IOW, oom_adj_child is NOT child effective parameter.
> 

It's not meant to be, it's only meant to specify a default value for newly 
initialized mm's of its descendants.  What happens after that is governed 
completely by the child's own /proc/pid/oom_adj.  That's pretty clearly 
explained in Documentation/filesystems/proc.txt.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-29  4:27 ` David Rientjes
@ 2009-07-30  9:00   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-30  9:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Tue, 28 Jul 2009 21:27:15 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> It's helpful to be able to specify an oom_adj value for newly forked
> children that do not share memory with the parent.
> 
> Before making oom_adj values a characteristic of a task's mm in
> 2ff05b2b4eac2e63d345fc731ea151a060247f53, it was possible to change the
> oom_adj value of a vfork() child prior to execve() without implicitly
> changing the oom_adj value of the parent.  With the new behavior, the
> oom_adj values of both threads would change since they represent the same
> memory.
> 
> That change was necessary to fix an oom killer livelock which would occur
> when a child would be selected for oom kill prior to execve() and the
> task could not be killed because it shared memory with an OOM_DISABLE
> parent.  In fact, only the most negative (most immune) oom_adj value for
> all threads sharing the same memory would actually be used by the oom
> killer, leaving inconsistencies amongst all other threads having
> different oom_adj values (and, thus, incorrectly exported
> /proc/pid/oom_score values).
> 
> This patch adds a new per-process parameter: /proc/pid/oom_adj_child.
> This defaults to mirror the value of /proc/pid/oom_adj but may be changed
> so that mm's initialized by their children are preferred over the parent
> by the oom killer.  Setting oom_adj_child to be less (i.e. more immune)
> than the task's oom_adj value itself is governed by the CAP_SYS_RESOURCE
> capability.
> 
> When a mm is initialized, the initial oom_adj value will be set to the
> parent's oom_adj_child.  This allows tasks to elevate the oom_adj value
> of a vfork'd child prior to execve() before the execution actually takes
> place.
> 
> Furthermore, /proc/pid/oom_adj_child is inherited from the task that
> forked it.
> 
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Paul Menage <menage@google.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt |   38 ++++++++++++++++----
>  fs/proc/base.c                     |   68 ++++++++++++++++++++++++++++++++++++
>  include/linux/sched.h              |    1 +
>  kernel/fork.c                      |    3 +-
>  4 files changed, 101 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -34,10 +34,11 @@ Table of Contents
>  
>    3	Per-Process Parameters
>    3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
> -  3.2	/proc/<pid>/oom_score - Display current oom-killer score
> -  3.3	/proc/<pid>/io - Display the IO accounting fields
> -  3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> +  3.2	/proc/<pid>/oom_adj_child - Change default oom_adj for children
> +  3.3	/proc/<pid>/oom_score - Display current oom-killer score
> +  3.4	/proc/<pid>/io - Display the IO accounting fields
> +  3.5	/proc/<pid>/coredump_filter - Core dump filtering settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
>  
>  
>  ------------------------------------------------------------------------------
> @@ -1206,7 +1207,28 @@ The task with the highest badness score is then selected and its children
>  are killed, process itself will be killed in an OOM situation when it does
>  not have children or some of them disabled oom like described above.
>  
> -3.2 /proc/<pid>/oom_score - Display current oom-killer score
> +
> +3.2 /proc/<pid>/oom_adj_child - Change default oom_adj for children
> +-------------------------------------------------------------------
> +
> +This file can be used to change the default oom_adj value for children when a
> +new mm is initialized.  The oom_adj value for a child's mm is typically the
> +task's oom_adj value itself, however this value can be altered by writing to
> +this file.
> +
> +This is particularly helpful when a child is vfork'd and its mm following exec
> +should have a higher priority oom_adj value than its parent.  The new mm will
> +default to oom_adj_child of the parent task.
> +
> +oom_adj_child will mirror oom_adj whenever the latter changes for all tasks
> +that share its memory.  This avoids having to set both values when simply
> +tuning oom_adj and that value should be inherited by all children.
> +
> +Setting oom_adj_child to be more immune than the task's mm itself (i.e. less
> +than oom_adj) is governed by the CAP_SYS_RESOURCE capability.
> +
a few comments.

1. IIUC, the name is strange.

At job scheduler, which does this.

if (vfork() == 0) {
	/* do some job */
	execve(.....)
}

Then, when oom_adj_child can be effective is after execve().
IIUC, the _child_ means a process created by vfork().

2. More simple plan is like this, IIUC.

  fix oom-killer's select_bad_process() not to be in deadlock.

rather than this new stupid interface.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-30  9:00   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-30  9:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Tue, 28 Jul 2009 21:27:15 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> It's helpful to be able to specify an oom_adj value for newly forked
> children that do not share memory with the parent.
> 
> Before making oom_adj values a characteristic of a task's mm in
> 2ff05b2b4eac2e63d345fc731ea151a060247f53, it was possible to change the
> oom_adj value of a vfork() child prior to execve() without implicitly
> changing the oom_adj value of the parent.  With the new behavior, the
> oom_adj values of both threads would change since they represent the same
> memory.
> 
> That change was necessary to fix an oom killer livelock which would occur
> when a child would be selected for oom kill prior to execve() and the
> task could not be killed because it shared memory with an OOM_DISABLE
> parent.  In fact, only the most negative (most immune) oom_adj value for
> all threads sharing the same memory would actually be used by the oom
> killer, leaving inconsistencies amongst all other threads having
> different oom_adj values (and, thus, incorrectly exported
> /proc/pid/oom_score values).
> 
> This patch adds a new per-process parameter: /proc/pid/oom_adj_child.
> This defaults to mirror the value of /proc/pid/oom_adj but may be changed
> so that mm's initialized by their children are preferred over the parent
> by the oom killer.  Setting oom_adj_child to be less (i.e. more immune)
> than the task's oom_adj value itself is governed by the CAP_SYS_RESOURCE
> capability.
> 
> When a mm is initialized, the initial oom_adj value will be set to the
> parent's oom_adj_child.  This allows tasks to elevate the oom_adj value
> of a vfork'd child prior to execve() before the execution actually takes
> place.
> 
> Furthermore, /proc/pid/oom_adj_child is inherited from the task that
> forked it.
> 
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Paul Menage <menage@google.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt |   38 ++++++++++++++++----
>  fs/proc/base.c                     |   68 ++++++++++++++++++++++++++++++++++++
>  include/linux/sched.h              |    1 +
>  kernel/fork.c                      |    3 +-
>  4 files changed, 101 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -34,10 +34,11 @@ Table of Contents
>  
>    3	Per-Process Parameters
>    3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
> -  3.2	/proc/<pid>/oom_score - Display current oom-killer score
> -  3.3	/proc/<pid>/io - Display the IO accounting fields
> -  3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> +  3.2	/proc/<pid>/oom_adj_child - Change default oom_adj for children
> +  3.3	/proc/<pid>/oom_score - Display current oom-killer score
> +  3.4	/proc/<pid>/io - Display the IO accounting fields
> +  3.5	/proc/<pid>/coredump_filter - Core dump filtering settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
>  
>  
>  ------------------------------------------------------------------------------
> @@ -1206,7 +1207,28 @@ The task with the highest badness score is then selected and its children
>  are killed, process itself will be killed in an OOM situation when it does
>  not have children or some of them disabled oom like described above.
>  
> -3.2 /proc/<pid>/oom_score - Display current oom-killer score
> +
> +3.2 /proc/<pid>/oom_adj_child - Change default oom_adj for children
> +-------------------------------------------------------------------
> +
> +This file can be used to change the default oom_adj value for children when a
> +new mm is initialized.  The oom_adj value for a child's mm is typically the
> +task's oom_adj value itself, however this value can be altered by writing to
> +this file.
> +
> +This is particularly helpful when a child is vfork'd and its mm following exec
> +should have a higher priority oom_adj value than its parent.  The new mm will
> +default to oom_adj_child of the parent task.
> +
> +oom_adj_child will mirror oom_adj whenever the latter changes for all tasks
> +that share its memory.  This avoids having to set both values when simply
> +tuning oom_adj and that value should be inherited by all children.
> +
> +Setting oom_adj_child to be more immune than the task's mm itself (i.e. less
> +than oom_adj) is governed by the CAP_SYS_RESOURCE capability.
> +
a few comments.

1. IIUC, the name is strange.

At job scheduler, which does this.

if (vfork() == 0) {
	/* do some job */
	execve(.....)
}

Then, when oom_adj_child can be effective is after execve().
IIUC, the _child_ means a process created by vfork().

2. More simple plan is like this, IIUC.

  fix oom-killer's select_bad_process() not to be in deadlock.

rather than this new stupid interface.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-30  9:00   ` KAMEZAWA Hiroyuki
@ 2009-07-30  9:31     ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-30  9:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:

> 1. IIUC, the name is strange.
> 
> At job scheduler, which does this.
> 
> if (vfork() == 0) {
> 	/* do some job */
> 	execve(.....)
> }
> 
> Then, when oom_adj_child can be effective is after execve().
> IIUC, the _child_ means a process created by vfork().
> 

It's certainly a difficult thing to name and I don't claim that "child" is 
completely accurate since, as you said, vfork'd tasks are also children 
of the parent yet they share the same oom_adj value since it's an 
attribute of the shared mm.

If you have suggestions for a better name, I'd happily ack it.

> 2. More simple plan is like this, IIUC.
> 
>   fix oom-killer's select_bad_process() not to be in deadlock.
> 

Alternate ideas?

> rather than this new stupid interface.
> 

Well, thank you.  Regardless of whether you think it's stupid or not, it 
doesn't allow you to livelock the kernel in a very trivial way when the 
oom killer gets invoked prior to execve() and the parent is OOM_DISABLE.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-30  9:31     ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-30  9:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:

> 1. IIUC, the name is strange.
> 
> At job scheduler, which does this.
> 
> if (vfork() == 0) {
> 	/* do some job */
> 	execve(.....)
> }
> 
> Then, when oom_adj_child can be effective is after execve().
> IIUC, the _child_ means a process created by vfork().
> 

It's certainly a difficult thing to name and I don't claim that "child" is 
completely accurate since, as you said, vfork'd tasks are also children 
of the parent yet they share the same oom_adj value since it's an 
attribute of the shared mm.

If you have suggestions for a better name, I'd happily ack it.

> 2. More simple plan is like this, IIUC.
> 
>   fix oom-killer's select_bad_process() not to be in deadlock.
> 

Alternate ideas?

> rather than this new stupid interface.
> 

Well, thank you.  Regardless of whether you think it's stupid or not, it 
doesn't allow you to livelock the kernel in a very trivial way when the 
oom killer gets invoked prior to execve() and the parent is OOM_DISABLE.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-30  9:31     ` David Rientjes
@ 2009-07-30 10:02       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-30 10:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009 02:31:04 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:
> 
> > 1. IIUC, the name is strange.
> > 
> > At job scheduler, which does this.
> > 
> > if (vfork() == 0) {
> > 	/* do some job */
> > 	execve(.....)
> > }
> > 
> > Then, when oom_adj_child can be effective is after execve().
> > IIUC, the _child_ means a process created by vfork().
> > 
> 
> It's certainly a difficult thing to name and I don't claim that "child" is 
> completely accurate since, as you said, vfork'd tasks are also children 
> of the parent yet they share the same oom_adj value since it's an 
> attribute of the shared mm.
> 
> If you have suggestions for a better name, I'd happily ack it.
> 

Simply, reset_oom_adj_at_new_mm_context or some.

> > 2. More simple plan is like this, IIUC.
> > 
> >   fix oom-killer's select_bad_process() not to be in deadlock.
> > 
> 
> Alternate ideas?
> 
At brief thiking.

1. move oom_adj from mm_struct to signal struct. or somewhere.
   (see copy_signal())
   Then,
    - all threads in a process will have the same oom_adj.
    - vfork()'ed thread will inherit its parent's oom_adj.   
    - vfork()'ed thread can override oom_adj of its own.

    In other words, oom_adj is shared when CLONE_PARENT is not set.

2. rename  mm_struct's oom_adj as shadow_oom_adj.

   update this shadow_oom_adj as the highest oom_adj among
   the values all threads share this mm_struct have.
   This update is done when
   - mm_init()
   - oom_adj is written.

   User's 
   # echo XXXX > /proc/<x>/oom_adj
   is not necessary to be very very fast.

   I don't think a process which calls vfork() is multi-threaded.

3. use shadow_oom_adj in select_bad_process().



> > rather than this new stupid interface.
> > 
> 
> Well, thank you.  Regardless of whether you think it's stupid or not, it 
> doesn't allow you to livelock the kernel in a very trivial way when the 
> oom killer gets invoked prior to execve() and the parent is OOM_DISABLE.
> 


just plz consider more.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-30 10:02       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-30 10:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009 02:31:04 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:
> 
> > 1. IIUC, the name is strange.
> > 
> > At job scheduler, which does this.
> > 
> > if (vfork() == 0) {
> > 	/* do some job */
> > 	execve(.....)
> > }
> > 
> > Then, when oom_adj_child can be effective is after execve().
> > IIUC, the _child_ means a process created by vfork().
> > 
> 
> It's certainly a difficult thing to name and I don't claim that "child" is 
> completely accurate since, as you said, vfork'd tasks are also children 
> of the parent yet they share the same oom_adj value since it's an 
> attribute of the shared mm.
> 
> If you have suggestions for a better name, I'd happily ack it.
> 

Simply, reset_oom_adj_at_new_mm_context or some.

> > 2. More simple plan is like this, IIUC.
> > 
> >   fix oom-killer's select_bad_process() not to be in deadlock.
> > 
> 
> Alternate ideas?
> 
At brief thiking.

1. move oom_adj from mm_struct to signal struct. or somewhere.
   (see copy_signal())
   Then,
    - all threads in a process will have the same oom_adj.
    - vfork()'ed thread will inherit its parent's oom_adj.   
    - vfork()'ed thread can override oom_adj of its own.

    In other words, oom_adj is shared when CLONE_PARENT is not set.

2. rename  mm_struct's oom_adj as shadow_oom_adj.

   update this shadow_oom_adj as the highest oom_adj among
   the values all threads share this mm_struct have.
   This update is done when
   - mm_init()
   - oom_adj is written.

   User's 
   # echo XXXX > /proc/<x>/oom_adj
   is not necessary to be very very fast.

   I don't think a process which calls vfork() is multi-threaded.

3. use shadow_oom_adj in select_bad_process().



> > rather than this new stupid interface.
> > 
> 
> Well, thank you.  Regardless of whether you think it's stupid or not, it 
> doesn't allow you to livelock the kernel in a very trivial way when the 
> oom killer gets invoked prior to execve() and the parent is OOM_DISABLE.
> 


just plz consider more.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-30 10:02       ` KAMEZAWA Hiroyuki
@ 2009-07-30 19:05         ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-30 19:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > If you have suggestions for a better name, I'd happily ack it.
> > 
> 
> Simply, reset_oom_adj_at_new_mm_context or some.
> 

I think it's preferred to keep the name relatively short which is an 
unfortuante requirement in this case.  I also prefer to start the name 
with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
alphabetically.

> > > 2. More simple plan is like this, IIUC.
> > > 
> > >   fix oom-killer's select_bad_process() not to be in deadlock.
> > > 
> > 
> > Alternate ideas?
> > 
> At brief thiking.
> 
> 1. move oom_adj from mm_struct to signal struct. or somewhere.
>    (see copy_signal())
>    Then,
>     - all threads in a process will have the same oom_adj.
>     - vfork()'ed thread will inherit its parent's oom_adj.   
>     - vfork()'ed thread can override oom_adj of its own.
> 
>     In other words, oom_adj is shared when CLONE_PARENT is not set.
> 

Hmm, didn't we talk about signal_struct already?  The problem with that 
approach is that oom_adj values represent a killable quantity of memory, 
so having multiple threads sharing the same mm_struct with one set to 
OOM_DISABLE and the other at +15 will still livelock because the oom 
killer can't kill either.

> 2. rename  mm_struct's oom_adj as shadow_oom_adj.
> 
>    update this shadow_oom_adj as the highest oom_adj among
>    the values all threads share this mm_struct have.
>    This update is done when
>    - mm_init()
>    - oom_adj is written.
> 
>    User's 
>    # echo XXXX > /proc/<x>/oom_adj
>    is not necessary to be very very fast.
> 
>    I don't think a process which calls vfork() is multi-threaded.
> 
> 3. use shadow_oom_adj in select_bad_process().
> 

Ideas 2 & 3 here seem to be a single proposal.  The problem is that it 
still leaves /proc/pid/oom_score to be inconsistent with the badness 
scoring that the oom killer will eventually use since if it oom kills one 
task, it must kill all tasks sharing the same mm_struct to lead to future 
memory freeing.

Additionally, if you were to set one thread to OOM_DISABLE, storing the 
highest oom_adj value in mm_struct isn't going to help because 
oom_kill_task() will still require a tasklist scan to ensure no threads 
sharing the mm_struct are OOM_DISABLE and the livelock persists.

In other words, the issue here is larger than the inheritance of the 
oom_adj value amongst children, it addresses a livelock that neither of 
your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
/proc/pid/oom_score) consistent with how the oom killer behaves.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-30 19:05         ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-30 19:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > If you have suggestions for a better name, I'd happily ack it.
> > 
> 
> Simply, reset_oom_adj_at_new_mm_context or some.
> 

I think it's preferred to keep the name relatively short which is an 
unfortuante requirement in this case.  I also prefer to start the name 
with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
alphabetically.

> > > 2. More simple plan is like this, IIUC.
> > > 
> > >   fix oom-killer's select_bad_process() not to be in deadlock.
> > > 
> > 
> > Alternate ideas?
> > 
> At brief thiking.
> 
> 1. move oom_adj from mm_struct to signal struct. or somewhere.
>    (see copy_signal())
>    Then,
>     - all threads in a process will have the same oom_adj.
>     - vfork()'ed thread will inherit its parent's oom_adj.   
>     - vfork()'ed thread can override oom_adj of its own.
> 
>     In other words, oom_adj is shared when CLONE_PARENT is not set.
> 

Hmm, didn't we talk about signal_struct already?  The problem with that 
approach is that oom_adj values represent a killable quantity of memory, 
so having multiple threads sharing the same mm_struct with one set to 
OOM_DISABLE and the other at +15 will still livelock because the oom 
killer can't kill either.

> 2. rename  mm_struct's oom_adj as shadow_oom_adj.
> 
>    update this shadow_oom_adj as the highest oom_adj among
>    the values all threads share this mm_struct have.
>    This update is done when
>    - mm_init()
>    - oom_adj is written.
> 
>    User's 
>    # echo XXXX > /proc/<x>/oom_adj
>    is not necessary to be very very fast.
> 
>    I don't think a process which calls vfork() is multi-threaded.
> 
> 3. use shadow_oom_adj in select_bad_process().
> 

Ideas 2 & 3 here seem to be a single proposal.  The problem is that it 
still leaves /proc/pid/oom_score to be inconsistent with the badness 
scoring that the oom killer will eventually use since if it oom kills one 
task, it must kill all tasks sharing the same mm_struct to lead to future 
memory freeing.

Additionally, if you were to set one thread to OOM_DISABLE, storing the 
highest oom_adj value in mm_struct isn't going to help because 
oom_kill_task() will still require a tasklist scan to ensure no threads 
sharing the mm_struct are OOM_DISABLE and the livelock persists.

In other words, the issue here is larger than the inheritance of the 
oom_adj value amongst children, it addresses a livelock that neither of 
your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
/proc/pid/oom_score) consistent with how the oom killer behaves.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-30 19:05         ` David Rientjes
@ 2009-07-31  0:33           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-31  0:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009 12:05:30 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:
> 
> > > If you have suggestions for a better name, I'd happily ack it.
> > > 
> > 
> > Simply, reset_oom_adj_at_new_mm_context or some.
> > 
> 
> I think it's preferred to keep the name relatively short which is an 
> unfortuante requirement in this case.  I also prefer to start the name 
> with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
> alphabetically.
> 
But misleading name is bad.



> > > > 2. More simple plan is like this, IIUC.
> > > > 
> > > >   fix oom-killer's select_bad_process() not to be in deadlock.
> > > > 
> > > 
> > > Alternate ideas?
> > > 
> > At brief thiking.
> > 
> > 1. move oom_adj from mm_struct to signal struct. or somewhere.
> >    (see copy_signal())
> >    Then,
> >     - all threads in a process will have the same oom_adj.
> >     - vfork()'ed thread will inherit its parent's oom_adj.   
> >     - vfork()'ed thread can override oom_adj of its own.
> > 
> >     In other words, oom_adj is shared when CLONE_PARENT is not set.
> > 
> 
> Hmm, didn't we talk about signal_struct already?  The problem with that 
> approach is that oom_adj values represent a killable quantity of memory, 
> so having multiple threads sharing the same mm_struct with one set to 
> OOM_DISABLE and the other at +15 will still livelock because the oom 
> killer can't kill either.
>
> > 2. rename  mm_struct's oom_adj as shadow_oom_adj.
> > 
> >    update this shadow_oom_adj as the highest oom_adj among
> >    the values all threads share this mm_struct have.
> >    This update is done when
> >    - mm_init()
> >    - oom_adj is written.
> > 
> >    User's 
> >    # echo XXXX > /proc/<x>/oom_adj
> >    is not necessary to be very very fast.
> > 
> >    I don't think a process which calls vfork() is multi-threaded.
> > 
> > 3. use shadow_oom_adj in select_bad_process().
> > 
> 
> Ideas 2 & 3 here seem to be a single proposal.  The problem is that it 
> still leaves /proc/pid/oom_score to be inconsistent with the badness 
> scoring that the oom killer will eventually use since if it oom kills one 
> task, it must kill all tasks sharing the same mm_struct to lead to future 
> memory freeing.
> 
yes.

> Additionally, if you were to set one thread to OOM_DISABLE, storing the 
> highest oom_adj value in mm_struct isn't going to help because 
> oom_kill_task() will still require a tasklist scan to ensure no threads 
> sharing the mm_struct are OOM_DISABLE and the livelock persists.
> 

Why don't you think select_bad_process()-> oom_kill_task() implementation is bad ?
IMHO, it's bad manner to fix an os-implementation problem by adding _new_ user
interface which is hard to understand.


> In other words, the issue here is larger than the inheritance of the 
> oom_adj value amongst children, it addresses a livelock that neither of 
> your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
> /proc/pid/oom_score) consistent with how the oom killer behaves.

This oom_adj_child itself is not related to livelock problem. Don't make
the problem bigger than it is.
oom_adj_child itself is just a problem how to handle vfork().

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31  0:33           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-31  0:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Thu, 30 Jul 2009 12:05:30 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:
> 
> > > If you have suggestions for a better name, I'd happily ack it.
> > > 
> > 
> > Simply, reset_oom_adj_at_new_mm_context or some.
> > 
> 
> I think it's preferred to keep the name relatively short which is an 
> unfortuante requirement in this case.  I also prefer to start the name 
> with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
> alphabetically.
> 
But misleading name is bad.



> > > > 2. More simple plan is like this, IIUC.
> > > > 
> > > >   fix oom-killer's select_bad_process() not to be in deadlock.
> > > > 
> > > 
> > > Alternate ideas?
> > > 
> > At brief thiking.
> > 
> > 1. move oom_adj from mm_struct to signal struct. or somewhere.
> >    (see copy_signal())
> >    Then,
> >     - all threads in a process will have the same oom_adj.
> >     - vfork()'ed thread will inherit its parent's oom_adj.   
> >     - vfork()'ed thread can override oom_adj of its own.
> > 
> >     In other words, oom_adj is shared when CLONE_PARENT is not set.
> > 
> 
> Hmm, didn't we talk about signal_struct already?  The problem with that 
> approach is that oom_adj values represent a killable quantity of memory, 
> so having multiple threads sharing the same mm_struct with one set to 
> OOM_DISABLE and the other at +15 will still livelock because the oom 
> killer can't kill either.
>
> > 2. rename  mm_struct's oom_adj as shadow_oom_adj.
> > 
> >    update this shadow_oom_adj as the highest oom_adj among
> >    the values all threads share this mm_struct have.
> >    This update is done when
> >    - mm_init()
> >    - oom_adj is written.
> > 
> >    User's 
> >    # echo XXXX > /proc/<x>/oom_adj
> >    is not necessary to be very very fast.
> > 
> >    I don't think a process which calls vfork() is multi-threaded.
> > 
> > 3. use shadow_oom_adj in select_bad_process().
> > 
> 
> Ideas 2 & 3 here seem to be a single proposal.  The problem is that it 
> still leaves /proc/pid/oom_score to be inconsistent with the badness 
> scoring that the oom killer will eventually use since if it oom kills one 
> task, it must kill all tasks sharing the same mm_struct to lead to future 
> memory freeing.
> 
yes.

> Additionally, if you were to set one thread to OOM_DISABLE, storing the 
> highest oom_adj value in mm_struct isn't going to help because 
> oom_kill_task() will still require a tasklist scan to ensure no threads 
> sharing the mm_struct are OOM_DISABLE and the livelock persists.
> 

Why don't you think select_bad_process()-> oom_kill_task() implementation is bad ?
IMHO, it's bad manner to fix an os-implementation problem by adding _new_ user
interface which is hard to understand.


> In other words, the issue here is larger than the inheritance of the 
> oom_adj value amongst children, it addresses a livelock that neither of 
> your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
> /proc/pid/oom_score) consistent with how the oom killer behaves.

This oom_adj_child itself is not related to livelock problem. Don't make
the problem bigger than it is.
oom_adj_child itself is just a problem how to handle vfork().

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-30  7:06     ` David Rientjes
@ 2009-07-31  6:47       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-07-31  6:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

Hi

> On Thu, 30 Jul 2009, KOSAKI Motohiro wrote:
> 
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> > >  	init_rwsem(&mm->mmap_sem);
> > >  	INIT_LIST_HEAD(&mm->mmlist);
> > >  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> > > -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
> > > +	mm->oom_adj = p->oom_adj_child;
> > 
> > This code doesn't fix anything.
> > mm->oom_adj assignment still change vfork() parent process oom_adj value.
> > (Again, vfork() parent and child use the same mm)
> > 
> 
> That's because the oom killer only really considers the highest oom_adj 
> value amongst all threads that share the same mm.  Allowing those threads 
> to each have different oom_adj values leads (i) to an inconsistency in 
> reporting /proc/pid/oom_score for how the oom killer selects a task to 
> kill and (ii) the oom killer livelock that it fixes when one thread 
> happens to be OOM_DISABLE.

I agree both. again I only disagree ABI breakage regression and
stupid new /proc interface.
Paul already pointed out this issue can be fixed without ABI change.


> So, yes, changing the oom_adj value for a thread may have side-effects 
> on other threads that didn't exist prior to 2.6.31-rc1 because the oom_adj 
> value now represents a killable quantity of memory instead of a being a 
> characteristic of the task itself.  But we now provide the inheritance 
> property in a new way, via /proc/pid/oom_adj_child, that gives you all the 
> functionality that the previous way did but without the potential for 
> livelock.

maybe, I should say my stand-point obviously. I don't dislike your
per-process oom_adj concept.
I only oppose vfork breakage.

if you feel my stand point is double standard, I need explain me more.
So, I don't think per-process oom_adj makes any regression on _real_ world.
but vfork()'s one is real world issue.

I think they are totally different thing.


And, May I explay why I think your oom_adj_child is wrong idea?
The fact is: new feature introducing never fix regression. yes, some
application use new interface and disappear the problem. but other
application still hit the problem. that's not correct development style
in kernel.


> 
> > IOW, in vfork case, oom_adj_child parameter doesn't only change child oom_adj,
> > but also parent oom_adj value.
> 
> Changing oom_adj_child for a task never changes oom_adj for any mm, it 
> simply specifies what default value shall be given for a child's newly 
> initialized mm.  Chaning oom_adj, on the other hand, will 

Ah, ok. I miunderstood.
However, We can fix this issue without new interface, isn't it?


> > IOW, oom_adj_child is NOT child effective parameter.
> > 
> 
> It's not meant to be, it's only meant to specify a default value for newly 
> initialized mm's of its descendants.  What happens after that is governed 
> completely by the child's own /proc/pid/oom_adj.  That's pretty clearly 
> explained in Documentation/filesystems/proc.txt.




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31  6:47       ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-07-31  6:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

Hi

> On Thu, 30 Jul 2009, KOSAKI Motohiro wrote:
> 
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -426,7 +426,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> > >  	init_rwsem(&mm->mmap_sem);
> > >  	INIT_LIST_HEAD(&mm->mmlist);
> > >  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> > > -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
> > > +	mm->oom_adj = p->oom_adj_child;
> > 
> > This code doesn't fix anything.
> > mm->oom_adj assignment still change vfork() parent process oom_adj value.
> > (Again, vfork() parent and child use the same mm)
> > 
> 
> That's because the oom killer only really considers the highest oom_adj 
> value amongst all threads that share the same mm.  Allowing those threads 
> to each have different oom_adj values leads (i) to an inconsistency in 
> reporting /proc/pid/oom_score for how the oom killer selects a task to 
> kill and (ii) the oom killer livelock that it fixes when one thread 
> happens to be OOM_DISABLE.

I agree both. again I only disagree ABI breakage regression and
stupid new /proc interface.
Paul already pointed out this issue can be fixed without ABI change.


> So, yes, changing the oom_adj value for a thread may have side-effects 
> on other threads that didn't exist prior to 2.6.31-rc1 because the oom_adj 
> value now represents a killable quantity of memory instead of a being a 
> characteristic of the task itself.  But we now provide the inheritance 
> property in a new way, via /proc/pid/oom_adj_child, that gives you all the 
> functionality that the previous way did but without the potential for 
> livelock.

maybe, I should say my stand-point obviously. I don't dislike your
per-process oom_adj concept.
I only oppose vfork breakage.

if you feel my stand point is double standard, I need explain me more.
So, I don't think per-process oom_adj makes any regression on _real_ world.
but vfork()'s one is real world issue.

I think they are totally different thing.


And, May I explay why I think your oom_adj_child is wrong idea?
The fact is: new feature introducing never fix regression. yes, some
application use new interface and disappear the problem. but other
application still hit the problem. that's not correct development style
in kernel.


> 
> > IOW, in vfork case, oom_adj_child parameter doesn't only change child oom_adj,
> > but also parent oom_adj value.
> 
> Changing oom_adj_child for a task never changes oom_adj for any mm, it 
> simply specifies what default value shall be given for a child's newly 
> initialized mm.  Chaning oom_adj, on the other hand, will 

Ah, ok. I miunderstood.
However, We can fix this issue without new interface, isn't it?


> > IOW, oom_adj_child is NOT child effective parameter.
> > 
> 
> It's not meant to be, it's only meant to specify a default value for newly 
> initialized mm's of its descendants.  What happens after that is governed 
> completely by the child's own /proc/pid/oom_adj.  That's pretty clearly 
> explained in Documentation/filesystems/proc.txt.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31  0:33           ` KAMEZAWA Hiroyuki
@ 2009-07-31  6:50             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-07-31  6:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Thu, 30 Jul 2009 12:05:30 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:
> > 
> > > > If you have suggestions for a better name, I'd happily ack it.
> > > > 
> > > 
> > > Simply, reset_oom_adj_at_new_mm_context or some.
> > > 
> > 
> > I think it's preferred to keep the name relatively short which is an 
> > unfortuante requirement in this case.  I also prefer to start the name 
> > with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
> > alphabetically.
> > 
> But misleading name is bad.
> 
> 
> 
> > > > > 2. More simple plan is like this, IIUC.
> > > > > 
> > > > >   fix oom-killer's select_bad_process() not to be in deadlock.
> > > > > 
> > > > 
> > > > Alternate ideas?
> > > > 
> > > At brief thiking.
> > > 
> > > 1. move oom_adj from mm_struct to signal struct. or somewhere.
> > >    (see copy_signal())
> > >    Then,
> > >     - all threads in a process will have the same oom_adj.
> > >     - vfork()'ed thread will inherit its parent's oom_adj.   
> > >     - vfork()'ed thread can override oom_adj of its own.
> > > 
> > >     In other words, oom_adj is shared when CLONE_PARENT is not set.
> > > 
> > 
> > Hmm, didn't we talk about signal_struct already?  The problem with that 
> > approach is that oom_adj values represent a killable quantity of memory, 
> > so having multiple threads sharing the same mm_struct with one set to 
> > OOM_DISABLE and the other at +15 will still livelock because the oom 
> > killer can't kill either.
> >
> > > 2. rename  mm_struct's oom_adj as shadow_oom_adj.
> > > 
> > >    update this shadow_oom_adj as the highest oom_adj among
> > >    the values all threads share this mm_struct have.
> > >    This update is done when
> > >    - mm_init()
> > >    - oom_adj is written.
> > > 
> > >    User's 
> > >    # echo XXXX > /proc/<x>/oom_adj
> > >    is not necessary to be very very fast.
> > > 
> > >    I don't think a process which calls vfork() is multi-threaded.
> > > 
> > > 3. use shadow_oom_adj in select_bad_process().
> > > 
> > 
> > Ideas 2 & 3 here seem to be a single proposal.  The problem is that it 
> > still leaves /proc/pid/oom_score to be inconsistent with the badness 
> > scoring that the oom killer will eventually use since if it oom kills one 
> > task, it must kill all tasks sharing the same mm_struct to lead to future 
> > memory freeing.
> > 
> yes.
> 
> > Additionally, if you were to set one thread to OOM_DISABLE, storing the 
> > highest oom_adj value in mm_struct isn't going to help because 
> > oom_kill_task() will still require a tasklist scan to ensure no threads 
> > sharing the mm_struct are OOM_DISABLE and the livelock persists.
> > 
> 
> Why don't you think select_bad_process()-> oom_kill_task() implementation is bad ?
> IMHO, it's bad manner to fix an os-implementation problem by adding _new_ user
> interface which is hard to understand.
> 
> 
> > In other words, the issue here is larger than the inheritance of the 
> > oom_adj value amongst children, it addresses a livelock that neither of 
> > your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
> > /proc/pid/oom_score) consistent with how the oom killer behaves.
> 
> This oom_adj_child itself is not related to livelock problem. Don't make
> the problem bigger than it is.
> oom_adj_child itself is just a problem how to handle vfork().


I made my proposal patch today.
this patch have following charactatistics.

o per-process oom_adj (by signal_struct)
o don't live-lock


Please comment.



Patch against 2.6.31-rc4

===========================
Subject: [PATCH] move oom_adj to task->signal

test program
----------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE 128

void oom_adj_print(void)
{
	FILE* file;
	char buf[BUF_SIZE];

	file = fopen("/proc/self/oom_adj", "r");
	if (!file) {
		perror("fopen");
		exit(1);
	}

	fscanf(file, "%s\n", buf);
	printf("%s\n", buf);

	fclose(file);
}

void oom_adj_write(int value)
{
	FILE* file;
	size_t ret;


	file = fopen("/proc/self/oom_adj", "w");
	if (!file) {
		perror("fopen");
		exit(1);
	}

	ret = fprintf(file, "%d", value);
	if (!ret) {
		perror("fprintf");
		exit(1);
	}

	fclose(file);
}

int main(void)
{
	int status;

	oom_adj_print();
	oom_adj_write(1);
	oom_adj_print();

	printf("vfork\n");
	if (vfork() == 0) {
		/* child */
		oom_adj_print();
		oom_adj_write(2);
		oom_adj_print();
		_exit(0);
	}
	wait(&status);
	oom_adj_print();

	return 0;
}

test result:
---------------------------------
% ./a.out
0
1
vfork
1
2
1



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 fs/proc/base.c           |    7 ++++---
 include/linux/mm_types.h |    3 ++-
 include/linux/oom.h      |    1 +
 include/linux/sched.h    |    2 ++
 kernel/exit.c            |    2 ++
 kernel/fork.c            |    2 ++
 mm/oom_kill.c            |   14 +++++++++-----
 7 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3ce5ae9..c64499e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1008,7 +1008,7 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
 		return -ESRCH;
 	task_lock(task);
 	if (task->mm)
-		oom_adjust = task->mm->oom_adj;
+		oom_adjust = task->signal->oom_adj;
 	else
 		oom_adjust = OOM_DISABLE;
 	task_unlock(task);
@@ -1046,12 +1046,13 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		put_task_struct(task);
 		return -EINVAL;
 	}
-	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
 		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->mm->oom_adj = oom_adjust;
+	task->signal->oom_adj = oom_adjust;
+	task->mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7acc843..f93f97f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -240,7 +240,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+	s8 oom_adj_cached;	/* mirror from signal_struct->oom_adj.
+				   in vfork case, multiple processes use the same mm. */
 
 	cpumask_t cpu_vm_mask;
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
index a7979ba..a219480 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -3,6 +3,7 @@
 
 /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
 #define OOM_DISABLE (-17)
+#define OOM_CACHE_DEFAULT (15)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3ab08e4..e10b12b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,8 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/exit.c b/kernel/exit.c
index 869dc22..c741a45 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
 #include <linux/perf_counter.h>
+#include <linux/oom.h>
 #include <trace/events/sched.h>
 
 #include <asm/uaccess.h>
@@ -688,6 +689,7 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
+	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b42695..b7cb474 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -62,6 +62,7 @@
 #include <linux/fs_struct.h>
 #include <linux/magic.h>
 #include <linux/perf_counter.h>
+#include <linux/oom.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -426,6 +427,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
+	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 175a67a..eae2d78 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,7 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
-	int oom_adj;
+	s8 oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -66,7 +66,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		task_unlock(p);
 		return 0;
 	}
-	oom_adj = mm->oom_adj;
+
+	if (mm->oom_adj_cached < p->signal->oom_adj)
+		mm->oom_adj_cached = p->signal->oom_adj;
+	oom_adj = mm->oom_adj_cached;
 	if (oom_adj == OOM_DISABLE) {
 		task_unlock(p);
 		return 0;
@@ -307,7 +310,8 @@ static void dump_tasks(const struct mem_cgroup *mem)
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
+		       p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -350,7 +354,7 @@ static int oom_kill_task(struct task_struct *p)
 
 	task_lock(p);
 	mm = p->mm;
-	if (!mm || mm->oom_adj == OOM_DISABLE) {
+	if (!mm || p->signal->oom_adj == OOM_DISABLE) {
 		task_unlock(p);
 		return 1;
 	}
@@ -381,7 +385,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
 			current->comm, gfp_mask, order,
-			current->mm ? current->mm->oom_adj : OOM_DISABLE);
+			current->mm ? current->signal->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();
-- 
1.6.0.GIT





^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31  6:50             ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-07-31  6:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Thu, 30 Jul 2009 12:05:30 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Thu, 30 Jul 2009, KAMEZAWA Hiroyuki wrote:
> > 
> > > > If you have suggestions for a better name, I'd happily ack it.
> > > > 
> > > 
> > > Simply, reset_oom_adj_at_new_mm_context or some.
> > > 
> > 
> > I think it's preferred to keep the name relatively short which is an 
> > unfortuante requirement in this case.  I also prefer to start the name 
> > with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
> > alphabetically.
> > 
> But misleading name is bad.
> 
> 
> 
> > > > > 2. More simple plan is like this, IIUC.
> > > > > 
> > > > >   fix oom-killer's select_bad_process() not to be in deadlock.
> > > > > 
> > > > 
> > > > Alternate ideas?
> > > > 
> > > At brief thiking.
> > > 
> > > 1. move oom_adj from mm_struct to signal struct. or somewhere.
> > >    (see copy_signal())
> > >    Then,
> > >     - all threads in a process will have the same oom_adj.
> > >     - vfork()'ed thread will inherit its parent's oom_adj.   
> > >     - vfork()'ed thread can override oom_adj of its own.
> > > 
> > >     In other words, oom_adj is shared when CLONE_PARENT is not set.
> > > 
> > 
> > Hmm, didn't we talk about signal_struct already?  The problem with that 
> > approach is that oom_adj values represent a killable quantity of memory, 
> > so having multiple threads sharing the same mm_struct with one set to 
> > OOM_DISABLE and the other at +15 will still livelock because the oom 
> > killer can't kill either.
> >
> > > 2. rename  mm_struct's oom_adj as shadow_oom_adj.
> > > 
> > >    update this shadow_oom_adj as the highest oom_adj among
> > >    the values all threads share this mm_struct have.
> > >    This update is done when
> > >    - mm_init()
> > >    - oom_adj is written.
> > > 
> > >    User's 
> > >    # echo XXXX > /proc/<x>/oom_adj
> > >    is not necessary to be very very fast.
> > > 
> > >    I don't think a process which calls vfork() is multi-threaded.
> > > 
> > > 3. use shadow_oom_adj in select_bad_process().
> > > 
> > 
> > Ideas 2 & 3 here seem to be a single proposal.  The problem is that it 
> > still leaves /proc/pid/oom_score to be inconsistent with the badness 
> > scoring that the oom killer will eventually use since if it oom kills one 
> > task, it must kill all tasks sharing the same mm_struct to lead to future 
> > memory freeing.
> > 
> yes.
> 
> > Additionally, if you were to set one thread to OOM_DISABLE, storing the 
> > highest oom_adj value in mm_struct isn't going to help because 
> > oom_kill_task() will still require a tasklist scan to ensure no threads 
> > sharing the mm_struct are OOM_DISABLE and the livelock persists.
> > 
> 
> Why don't you think select_bad_process()-> oom_kill_task() implementation is bad ?
> IMHO, it's bad manner to fix an os-implementation problem by adding _new_ user
> interface which is hard to understand.
> 
> 
> > In other words, the issue here is larger than the inheritance of the 
> > oom_adj value amongst children, it addresses a livelock that neither of 
> > your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
> > /proc/pid/oom_score) consistent with how the oom killer behaves.
> 
> This oom_adj_child itself is not related to livelock problem. Don't make
> the problem bigger than it is.
> oom_adj_child itself is just a problem how to handle vfork().


I made my proposal patch today.
this patch have following charactatistics.

o per-process oom_adj (by signal_struct)
o don't live-lock


Please comment.



Patch against 2.6.31-rc4

===========================
Subject: [PATCH] move oom_adj to task->signal

test program
----------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE 128

void oom_adj_print(void)
{
	FILE* file;
	char buf[BUF_SIZE];

	file = fopen("/proc/self/oom_adj", "r");
	if (!file) {
		perror("fopen");
		exit(1);
	}

	fscanf(file, "%s\n", buf);
	printf("%s\n", buf);

	fclose(file);
}

void oom_adj_write(int value)
{
	FILE* file;
	size_t ret;


	file = fopen("/proc/self/oom_adj", "w");
	if (!file) {
		perror("fopen");
		exit(1);
	}

	ret = fprintf(file, "%d", value);
	if (!ret) {
		perror("fprintf");
		exit(1);
	}

	fclose(file);
}

int main(void)
{
	int status;

	oom_adj_print();
	oom_adj_write(1);
	oom_adj_print();

	printf("vfork\n");
	if (vfork() == 0) {
		/* child */
		oom_adj_print();
		oom_adj_write(2);
		oom_adj_print();
		_exit(0);
	}
	wait(&status);
	oom_adj_print();

	return 0;
}

test result:
---------------------------------
% ./a.out
0
1
vfork
1
2
1



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 fs/proc/base.c           |    7 ++++---
 include/linux/mm_types.h |    3 ++-
 include/linux/oom.h      |    1 +
 include/linux/sched.h    |    2 ++
 kernel/exit.c            |    2 ++
 kernel/fork.c            |    2 ++
 mm/oom_kill.c            |   14 +++++++++-----
 7 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3ce5ae9..c64499e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1008,7 +1008,7 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
 		return -ESRCH;
 	task_lock(task);
 	if (task->mm)
-		oom_adjust = task->mm->oom_adj;
+		oom_adjust = task->signal->oom_adj;
 	else
 		oom_adjust = OOM_DISABLE;
 	task_unlock(task);
@@ -1046,12 +1046,13 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		put_task_struct(task);
 		return -EINVAL;
 	}
-	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
 		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->mm->oom_adj = oom_adjust;
+	task->signal->oom_adj = oom_adjust;
+	task->mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7acc843..f93f97f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -240,7 +240,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+	s8 oom_adj_cached;	/* mirror from signal_struct->oom_adj.
+				   in vfork case, multiple processes use the same mm. */
 
 	cpumask_t cpu_vm_mask;
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
index a7979ba..a219480 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -3,6 +3,7 @@
 
 /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
 #define OOM_DISABLE (-17)
+#define OOM_CACHE_DEFAULT (15)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3ab08e4..e10b12b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,8 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/exit.c b/kernel/exit.c
index 869dc22..c741a45 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
 #include <linux/perf_counter.h>
+#include <linux/oom.h>
 #include <trace/events/sched.h>
 
 #include <asm/uaccess.h>
@@ -688,6 +689,7 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
+	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b42695..b7cb474 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -62,6 +62,7 @@
 #include <linux/fs_struct.h>
 #include <linux/magic.h>
 #include <linux/perf_counter.h>
+#include <linux/oom.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -426,6 +427,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
+	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 175a67a..eae2d78 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,7 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
-	int oom_adj;
+	s8 oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -66,7 +66,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		task_unlock(p);
 		return 0;
 	}
-	oom_adj = mm->oom_adj;
+
+	if (mm->oom_adj_cached < p->signal->oom_adj)
+		mm->oom_adj_cached = p->signal->oom_adj;
+	oom_adj = mm->oom_adj_cached;
 	if (oom_adj == OOM_DISABLE) {
 		task_unlock(p);
 		return 0;
@@ -307,7 +310,8 @@ static void dump_tasks(const struct mem_cgroup *mem)
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
+		       p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -350,7 +354,7 @@ static int oom_kill_task(struct task_struct *p)
 
 	task_lock(p);
 	mm = p->mm;
-	if (!mm || mm->oom_adj == OOM_DISABLE) {
+	if (!mm || p->signal->oom_adj == OOM_DISABLE) {
 		task_unlock(p);
 		return 1;
 	}
@@ -381,7 +385,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
 			current->comm, gfp_mask, order,
-			current->mm ? current->mm->oom_adj : OOM_DISABLE);
+			current->mm ? current->signal->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();
-- 
1.6.0.GIT




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31  6:47       ` KOSAKI Motohiro
@ 2009-07-31  9:31         ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31  9:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Paul Menage, linux-kernel, linux-mm

On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:

> > That's because the oom killer only really considers the highest oom_adj 
> > value amongst all threads that share the same mm.  Allowing those threads 
> > to each have different oom_adj values leads (i) to an inconsistency in 
> > reporting /proc/pid/oom_score for how the oom killer selects a task to 
> > kill and (ii) the oom killer livelock that it fixes when one thread 
> > happens to be OOM_DISABLE.
> 
> I agree both. again I only disagree ABI breakage regression and
> stupid new /proc interface.

Let's state the difference in behavior as of 2.6.31-rc1: applications can 
no longer change the oom_adj value of a vfork() child prior to exec() 
without it also affecting the parent.  I agree that was previously 
allowed.  And it was that very allowance that LEADS TO THE LIVELOCK 
because they both share a VM and it was possible for the oom killer to 
select the one of the threads while the other was OOM_DISABLE.

This is an extremely simple livelock to trigger, AND YOU DON'T EVEN NEED 
CAP_SYS_RESOURCE TO DO IT.  Consider a job scheduler that superuser has 
set to OOM_DISABLE because of its necessity to the system.  Imagine if 
that job scheduler vfork's a child and sets its inherited oom_adj value of 
OOM_DISABLE to something higher so that the machine doesn't panic on 
exec() when the child spikes in memory usage when the application first 
starts.

Now imagine that either there are no other user threads or the job 
scheduler itself has allocated more pages than any other thread.  Or, more 
simply, imagine that it sets the child's oom_adj value to a higher 
priority than other threads based on some heuristic.  Regardless, if the 
system becomes oom before the exec() can happen and before the new VM is 
attached to the child, the machine livelocks.

That happens because of two things:

 - the oom killer uses the oom_adj value to adjust the oom_score for a
   task, and that score is mainly based on the size of each thread's VM,
   and

 - the oom killer cannot kill a thread that shares a VM with an
   OOM_DISABLE thread because it will not lead to future memory freeing.

So the preferred solution for complete consistency and to fix the livelock 
is to make the oom_adj value a characteristic of the VM, because THAT'S 
WHAT IT ACTS ON.  The effective oom_adj value for a thread is always equal 
to the highest oom_adj value of any thread sharing its VM.

Do we really want to keep this inconsistency around forever in the kernel 
so that /proc/pid/oom_score actually means NOTHING because another thread 
sharing the memory has a different oom_adj?  Or do we want to take the 
opportunity to fix a broken userspace model that leads to a livelock to 
fix it and move on with a consistent interface and, with oom_adj_child, 
all the functionality you had before.

And you and KAMEZAWA-san can continue to call my patches stupid, but 
that's not adding anything to your argument.

> Paul already pointed out this issue can be fixed without ABI change.
> 

I'm unaware of any viable solution that has been proposed, sorry.

> if you feel my stand point is double standard, I need explain me more.
> So, I don't think per-process oom_adj makes any regression on _real_ world.

Wrong, our machines have livelocked because of the exact scenario I 
described above.

> but vfork()'s one is real world issue.
> 

And it's based on a broken assumption that oom_adj values actually mean 
anything independent of other threads sharing the same memory.  That's a 
completely false assumption.  Applications that are tuning oom_adj value 
will rely on oom_scores, which are currently false if oom_adj differs 
amongst those threads, and should be written to how the oom killer uses 
the value.

> And, May I explay why I think your oom_adj_child is wrong idea?
> The fact is: new feature introducing never fix regression. yes, some
> application use new interface and disappear the problem. but other
> application still hit the problem. that's not correct development style
> in kernel.
> 

So you're proposing that we forever allow /proc/pid/oom_score to be 
completely wrong for pid without any knowledge to userspace?  That we 
falsely advertise what it represents and allow userspace to believe that 
changing oom_adj for a thread sharing memory with other threads actually 
changes how the oom killer selects tasks?

Please.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31  9:31         ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31  9:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Paul Menage, linux-kernel, linux-mm

On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:

> > That's because the oom killer only really considers the highest oom_adj 
> > value amongst all threads that share the same mm.  Allowing those threads 
> > to each have different oom_adj values leads (i) to an inconsistency in 
> > reporting /proc/pid/oom_score for how the oom killer selects a task to 
> > kill and (ii) the oom killer livelock that it fixes when one thread 
> > happens to be OOM_DISABLE.
> 
> I agree both. again I only disagree ABI breakage regression and
> stupid new /proc interface.

Let's state the difference in behavior as of 2.6.31-rc1: applications can 
no longer change the oom_adj value of a vfork() child prior to exec() 
without it also affecting the parent.  I agree that was previously 
allowed.  And it was that very allowance that LEADS TO THE LIVELOCK 
because they both share a VM and it was possible for the oom killer to 
select the one of the threads while the other was OOM_DISABLE.

This is an extremely simple livelock to trigger, AND YOU DON'T EVEN NEED 
CAP_SYS_RESOURCE TO DO IT.  Consider a job scheduler that superuser has 
set to OOM_DISABLE because of its necessity to the system.  Imagine if 
that job scheduler vfork's a child and sets its inherited oom_adj value of 
OOM_DISABLE to something higher so that the machine doesn't panic on 
exec() when the child spikes in memory usage when the application first 
starts.

Now imagine that either there are no other user threads or the job 
scheduler itself has allocated more pages than any other thread.  Or, more 
simply, imagine that it sets the child's oom_adj value to a higher 
priority than other threads based on some heuristic.  Regardless, if the 
system becomes oom before the exec() can happen and before the new VM is 
attached to the child, the machine livelocks.

That happens because of two things:

 - the oom killer uses the oom_adj value to adjust the oom_score for a
   task, and that score is mainly based on the size of each thread's VM,
   and

 - the oom killer cannot kill a thread that shares a VM with an
   OOM_DISABLE thread because it will not lead to future memory freeing.

So the preferred solution for complete consistency and to fix the livelock 
is to make the oom_adj value a characteristic of the VM, because THAT'S 
WHAT IT ACTS ON.  The effective oom_adj value for a thread is always equal 
to the highest oom_adj value of any thread sharing its VM.

Do we really want to keep this inconsistency around forever in the kernel 
so that /proc/pid/oom_score actually means NOTHING because another thread 
sharing the memory has a different oom_adj?  Or do we want to take the 
opportunity to fix a broken userspace model that leads to a livelock to 
fix it and move on with a consistent interface and, with oom_adj_child, 
all the functionality you had before.

And you and KAMEZAWA-san can continue to call my patches stupid, but 
that's not adding anything to your argument.

> Paul already pointed out this issue can be fixed without ABI change.
> 

I'm unaware of any viable solution that has been proposed, sorry.

> if you feel my stand point is double standard, I need explain me more.
> So, I don't think per-process oom_adj makes any regression on _real_ world.

Wrong, our machines have livelocked because of the exact scenario I 
described above.

> but vfork()'s one is real world issue.
> 

And it's based on a broken assumption that oom_adj values actually mean 
anything independent of other threads sharing the same memory.  That's a 
completely false assumption.  Applications that are tuning oom_adj value 
will rely on oom_scores, which are currently false if oom_adj differs 
amongst those threads, and should be written to how the oom killer uses 
the value.

> And, May I explay why I think your oom_adj_child is wrong idea?
> The fact is: new feature introducing never fix regression. yes, some
> application use new interface and disappear the problem. but other
> application still hit the problem. that's not correct development style
> in kernel.
> 

So you're proposing that we forever allow /proc/pid/oom_score to be 
completely wrong for pid without any knowledge to userspace?  That we 
falsely advertise what it represents and allow userspace to believe that 
changing oom_adj for a thread sharing memory with other threads actually 
changes how the oom killer selects tasks?

Please.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31  0:33           ` KAMEZAWA Hiroyuki
@ 2009-07-31  9:36             ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31  9:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Fri, 31 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > > Simply, reset_oom_adj_at_new_mm_context or some.
> > > 
> > 
> > I think it's preferred to keep the name relatively short which is an 
> > unfortuante requirement in this case.  I also prefer to start the name 
> > with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
> > alphabetically.
> > 
> But misleading name is bad.
> 

Can you help think of any names that start with oom_adj_* and are 
relatively short?  I'd happily ack it.

> Why don't you think select_bad_process()-> oom_kill_task() implementation is bad ?

It livelocks if a thread is chosen and passed to oom_kill_task() while 
another per-thread oom_adj value is OOM_DISABLE for a thread sharing the 
same memory.

> IMHO, it's bad manner to fix an os-implementation problem by adding _new_ user
> interface which is hard to understand.
> 

How else do you propose the oom killer use oom_adj values on a per-thread 
basis without considering other threads sharing the same memory?  It does 
no good for the oom killer to kill a thread if another one sharing its 
memory is OOM_DISABLE because it can't kill all of them.  That means the 
memory cannot be freed and it must choose another task, but the end result 
is that it needlessly killed others.  That's not a desirable result, 
sorry.

> > In other words, the issue here is larger than the inheritance of the 
> > oom_adj value amongst children, it addresses a livelock that neither of 
> > your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
> > /proc/pid/oom_score) consistent with how the oom killer behaves.
> 
> This oom_adj_child itself is not related to livelock problem. Don't make
> the problem bigger than it is.
> oom_adj_child itself is just a problem how to handle vfork().
> 

Right, that's why it wasn't part of my original patchset which fixed the 
livelock.  We had seen that others had use cases where they still needed 
to set a newly initialized mm to have a starting oom_adj value different 
from its parent.  That's entirely understandable, and that's why I 
proposed oom_adj_child.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31  9:36             ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31  9:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Fri, 31 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > > Simply, reset_oom_adj_at_new_mm_context or some.
> > > 
> > 
> > I think it's preferred to keep the name relatively short which is an 
> > unfortuante requirement in this case.  I also prefer to start the name 
> > with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed 
> > alphabetically.
> > 
> But misleading name is bad.
> 

Can you help think of any names that start with oom_adj_* and are 
relatively short?  I'd happily ack it.

> Why don't you think select_bad_process()-> oom_kill_task() implementation is bad ?

It livelocks if a thread is chosen and passed to oom_kill_task() while 
another per-thread oom_adj value is OOM_DISABLE for a thread sharing the 
same memory.

> IMHO, it's bad manner to fix an os-implementation problem by adding _new_ user
> interface which is hard to understand.
> 

How else do you propose the oom killer use oom_adj values on a per-thread 
basis without considering other threads sharing the same memory?  It does 
no good for the oom killer to kill a thread if another one sharing its 
memory is OOM_DISABLE because it can't kill all of them.  That means the 
memory cannot be freed and it must choose another task, but the end result 
is that it needlessly killed others.  That's not a desirable result, 
sorry.

> > In other words, the issue here is larger than the inheritance of the 
> > oom_adj value amongst children, it addresses a livelock that neither of 
> > your approaches solve.  The fix actually makes /proc/pid/oom_adj (and 
> > /proc/pid/oom_score) consistent with how the oom killer behaves.
> 
> This oom_adj_child itself is not related to livelock problem. Don't make
> the problem bigger than it is.
> oom_adj_child itself is just a problem how to handle vfork().
> 

Right, that's why it wasn't part of my original patchset which fixed the 
livelock.  We had seen that others had use cases where they still needed 
to set a newly initialized mm to have a starting oom_adj value different 
from its parent.  That's entirely understandable, and that's why I 
proposed oom_adj_child.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31  9:36             ` David Rientjes
@ 2009-07-31 10:49               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-31 10:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Paul Menage,
	KOSAKI Motohiro, linux-kernel, linux-mm

David Rientjes wrote:
> On Fri, 31 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
>> > > Simply, reset_oom_adj_at_new_mm_context or some.
>> > >
>> >
>> > I think it's preferred to keep the name relatively short which is an
>> > unfortuante requirement in this case.  I also prefer to start the name
>> > with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed
>> > alphabetically.
>> >
>> But misleading name is bad.
>>
>
> Can you help think of any names that start with oom_adj_* and are
> relatively short?  I'd happily ack it.
>
There have been traditional name "effective" as uid and euid.

 then,  per thread oom_adj as oom_adj
        per proc   oom_adj as effective_oom_adj

is an natural way as Unix, I think.



>> Why don't you think select_bad_process()-> oom_kill_task()
>> implementation is bad ?
>
> It livelocks if a thread is chosen and passed to oom_kill_task() while
> another per-thread oom_adj value is OOM_DISABLE for a thread sharing the
> same memory.
>
I say "why don't modify buggy selection logic?"

Why we have to scan all threads ?
As fs/proc/readdir does, you can scan only "process group leader".

per-thread scan itself is buggy because now we have per-process
effective-oom-adj.


>> IMHO, it's bad manner to fix an os-implementation problem by adding
>> _new_ user
>> interface which is hard to understand.
>>
>
> How else do you propose the oom killer use oom_adj values on a per-thread
> basis without considering other threads sharing the same memory?
As I wrote.
   per-process(signal struct) or per-thread oom_adj and add
   mm->effecitve_oom_adj

task scanning isn't necessary to do per-thread scan and you can scan
only process-group-leader. What's bad ?
If oom_score is problem, plz fix it to show effective_oom_score.

If you can wait until the end of August, plz wait. I'll do some.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31 10:49               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-31 10:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Paul Menage,
	KOSAKI Motohiro, linux-kernel, linux-mm

David Rientjes wrote:
> On Fri, 31 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
>> > > Simply, reset_oom_adj_at_new_mm_context or some.
>> > >
>> >
>> > I think it's preferred to keep the name relatively short which is an
>> > unfortuante requirement in this case.  I also prefer to start the name
>> > with "oom_adj" so it appears alongside /proc/pid/oom_adj when listed
>> > alphabetically.
>> >
>> But misleading name is bad.
>>
>
> Can you help think of any names that start with oom_adj_* and are
> relatively short?  I'd happily ack it.
>
There have been traditional name "effective" as uid and euid.

 then,  per thread oom_adj as oom_adj
        per proc   oom_adj as effective_oom_adj

is an natural way as Unix, I think.



>> Why don't you think select_bad_process()-> oom_kill_task()
>> implementation is bad ?
>
> It livelocks if a thread is chosen and passed to oom_kill_task() while
> another per-thread oom_adj value is OOM_DISABLE for a thread sharing the
> same memory.
>
I say "why don't modify buggy selection logic?"

Why we have to scan all threads ?
As fs/proc/readdir does, you can scan only "process group leader".

per-thread scan itself is buggy because now we have per-process
effective-oom-adj.


>> IMHO, it's bad manner to fix an os-implementation problem by adding
>> _new_ user
>> interface which is hard to understand.
>>
>
> How else do you propose the oom killer use oom_adj values on a per-thread
> basis without considering other threads sharing the same memory?
As I wrote.
   per-process(signal struct) or per-thread oom_adj and add
   mm->effecitve_oom_adj

task scanning isn't necessary to do per-thread scan and you can scan
only process-group-leader. What's bad ?
If oom_score is problem, plz fix it to show effective_oom_score.

If you can wait until the end of August, plz wait. I'll do some.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31 10:49               ` KAMEZAWA Hiroyuki
@ 2009-07-31 19:18                 ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31 19:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Fri, 31 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > Can you help think of any names that start with oom_adj_* and are
> > relatively short?  I'd happily ack it.
> >
> There have been traditional name "effective" as uid and euid.
> 
>  then,  per thread oom_adj as oom_adj
>         per proc   oom_adj as effective_oom_adj
> 
> is an natural way as Unix, I think.
> 

I don't think effective_oom_adj is a suitable name replacement for 
oom_adj_child since it doesn't imply that the value is a no-op for the 
thread itself and only serves a purpose when an mm is initialized for a 
child.

> > It livelocks if a thread is chosen and passed to oom_kill_task() while
> > another per-thread oom_adj value is OOM_DISABLE for a thread sharing the
> > same memory.
> >
> I say "why don't modify buggy selection logic?"
> 
> Why we have to scan all threads ?
> As fs/proc/readdir does, you can scan only "process group leader".
> 
> per-thread scan itself is buggy because now we have per-process
> effective-oom-adj.
> 

Without my patches to change oom_adj from task_struct to mm_struct, you'd 
need to scan all tasks and not just the tgids because their oom_adj values 
can differ amongst threads in the same thread group.  So while it may now 
be possible to shorten the scan as a result of my approach, it isn't a 
solution itself to the problem.

> > How else do you propose the oom killer use oom_adj values on a per-thread
> > basis without considering other threads sharing the same memory?
> As I wrote.
>    per-process(signal struct) or per-thread oom_adj and add
>    mm->effecitve_oom_adj
> 
> task scanning isn't necessary to do per-thread scan and you can scan
> only process-group-leader. What's bad ?
> If oom_score is problem, plz fix it to show effective_oom_score.
> 

When only using (and showing) mm->effective_oom_adj for a task, userspace 
will not be able to adjust /proc/pid/oom_score with /proc/pid/oom_adj 
as Documentation/filesystems/proc.txt says you can for a thread unless it 
exceeds effective_oom_adj.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31 19:18                 ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31 19:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Fri, 31 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > Can you help think of any names that start with oom_adj_* and are
> > relatively short?  I'd happily ack it.
> >
> There have been traditional name "effective" as uid and euid.
> 
>  then,  per thread oom_adj as oom_adj
>         per proc   oom_adj as effective_oom_adj
> 
> is an natural way as Unix, I think.
> 

I don't think effective_oom_adj is a suitable name replacement for 
oom_adj_child since it doesn't imply that the value is a no-op for the 
thread itself and only serves a purpose when an mm is initialized for a 
child.

> > It livelocks if a thread is chosen and passed to oom_kill_task() while
> > another per-thread oom_adj value is OOM_DISABLE for a thread sharing the
> > same memory.
> >
> I say "why don't modify buggy selection logic?"
> 
> Why we have to scan all threads ?
> As fs/proc/readdir does, you can scan only "process group leader".
> 
> per-thread scan itself is buggy because now we have per-process
> effective-oom-adj.
> 

Without my patches to change oom_adj from task_struct to mm_struct, you'd 
need to scan all tasks and not just the tgids because their oom_adj values 
can differ amongst threads in the same thread group.  So while it may now 
be possible to shorten the scan as a result of my approach, it isn't a 
solution itself to the problem.

> > How else do you propose the oom killer use oom_adj values on a per-thread
> > basis without considering other threads sharing the same memory?
> As I wrote.
>    per-process(signal struct) or per-thread oom_adj and add
>    mm->effecitve_oom_adj
> 
> task scanning isn't necessary to do per-thread scan and you can scan
> only process-group-leader. What's bad ?
> If oom_score is problem, plz fix it to show effective_oom_score.
> 

When only using (and showing) mm->effective_oom_adj for a task, userspace 
will not be able to adjust /proc/pid/oom_score with /proc/pid/oom_adj 
as Documentation/filesystems/proc.txt says you can for a thread unless it 
exceeds effective_oom_adj.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31  6:50             ` KOSAKI Motohiro
@ 2009-07-31 19:38               ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31 19:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 3ce5ae9..c64499e 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1008,7 +1008,7 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
>  		return -ESRCH;
>  	task_lock(task);
>  	if (task->mm)
> -		oom_adjust = task->mm->oom_adj;
> +		oom_adjust = task->signal->oom_adj;
>  	else
>  		oom_adjust = OOM_DISABLE;
>  	task_unlock(task);

This may display a /proc/pid/oom_adj that is radically different from 
task->mm->oom_adj_cached without knowledge to userspace and you can't 
simply display task->mm>oom_adj_cached here because it gets reset on every 
write to /proc/pid/oom_adj.

> @@ -1046,12 +1046,13 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  		put_task_struct(task);
>  		return -EINVAL;
>  	}
> -	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
>  		task_unlock(task);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->mm->oom_adj = oom_adjust;
> +	task->signal->oom_adj = oom_adjust;
> +	task->mm->oom_adj_cached = OOM_CACHE_DEFAULT;
>  	task_unlock(task);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7acc843..f93f97f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -240,7 +240,8 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> -	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> +	s8 oom_adj_cached;	/* mirror from signal_struct->oom_adj.
> +				   in vfork case, multiple processes use the same mm. */
>  
>  	cpumask_t cpu_vm_mask;
>  
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index a7979ba..a219480 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -3,6 +3,7 @@
>  
>  /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
>  #define OOM_DISABLE (-17)
> +#define OOM_CACHE_DEFAULT (15)
>  /* inclusive */
>  #define OOM_ADJUST_MIN (-16)
>  #define OOM_ADJUST_MAX 15
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3ab08e4..e10b12b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -629,6 +629,8 @@ struct signal_struct {
>  	unsigned audit_tty;
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
> +
> +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
>  };
>  
>  /* Context switch must be unlocked if interrupts are to be enabled */

I don't believe oom_adj is an appropriate use of signal_struct, sorry.

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 869dc22..c741a45 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -688,6 +689,7 @@ static void exit_mm(struct task_struct * tsk)
>  	enter_lazy_tlb(mm, current);
>  	/* We don't want this task to be frozen prematurely */
>  	clear_freeze_flag(tsk);
> +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
>  	task_unlock(tsk);
>  	mm_update_next_owner(mm);
>  	mmput(mm);

This is similiar to an early proposal that wanted to keep an array of 
oom_adj values for tasks attached to the mm in mm_struct.  The problem is 
that you're obviously losing information about all threads attached to the 
mm any time one of the threads exits or writes to /proc/pid/oom_adj.  That 
information can only be regenerated with a tasklist scan.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b42695..b7cb474 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -426,6 +427,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	init_rwsem(&mm->mmap_sem);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
>  	set_mm_counter(mm, file_rss, 0);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 175a67a..eae2d78 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -58,7 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
> -	int oom_adj;
> +	s8 oom_adj;
>  
>  	task_lock(p);
>  	mm = p->mm;
> @@ -66,7 +66,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		task_unlock(p);
>  		return 0;
>  	}
> -	oom_adj = mm->oom_adj;
> +
> +	if (mm->oom_adj_cached < p->signal->oom_adj)
> +		mm->oom_adj_cached = p->signal->oom_adj;

This conditional will never be true since mm->oom_adj_cached is 
initialized to 15, which is the upper bound on which p->signal->oom_adj 
may ever be, so mm->oom_adj_cached never gets changed from 
OOM_CACHE_DEFAULT.

Thus, this patch doesn't even work, and you probably would have noticed 
that if you'd checked /proc/pid/oom_score for any pid.

Even if mm->oom_adj_cached _was_ properly updated here, 
/proc/pid/oom_score would be out of sync with more negative oom_adj values 
for threads sharing the mm_struct since it calls badness() for only a 
single thread.

> +	oom_adj = mm->oom_adj_cached;
>  	if (oom_adj == OOM_DISABLE) {
>  		task_unlock(p);
>  		return 0;
> @@ -350,7 +354,7 @@ static int oom_kill_task(struct task_struct *p)
>  
>  	task_lock(p);
>  	mm = p->mm;
> -	if (!mm || mm->oom_adj == OOM_DISABLE) {
> +	if (!mm || p->signal->oom_adj == OOM_DISABLE) {
>  		task_unlock(p);
>  		return 1;
>  	}

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-07-31 19:38               ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-07-31 19:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 3ce5ae9..c64499e 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1008,7 +1008,7 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
>  		return -ESRCH;
>  	task_lock(task);
>  	if (task->mm)
> -		oom_adjust = task->mm->oom_adj;
> +		oom_adjust = task->signal->oom_adj;
>  	else
>  		oom_adjust = OOM_DISABLE;
>  	task_unlock(task);

This may display a /proc/pid/oom_adj that is radically different from 
task->mm->oom_adj_cached without knowledge to userspace and you can't 
simply display task->mm>oom_adj_cached here because it gets reset on every 
write to /proc/pid/oom_adj.

> @@ -1046,12 +1046,13 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  		put_task_struct(task);
>  		return -EINVAL;
>  	}
> -	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
>  		task_unlock(task);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->mm->oom_adj = oom_adjust;
> +	task->signal->oom_adj = oom_adjust;
> +	task->mm->oom_adj_cached = OOM_CACHE_DEFAULT;
>  	task_unlock(task);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7acc843..f93f97f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -240,7 +240,8 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> -	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> +	s8 oom_adj_cached;	/* mirror from signal_struct->oom_adj.
> +				   in vfork case, multiple processes use the same mm. */
>  
>  	cpumask_t cpu_vm_mask;
>  
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index a7979ba..a219480 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -3,6 +3,7 @@
>  
>  /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
>  #define OOM_DISABLE (-17)
> +#define OOM_CACHE_DEFAULT (15)
>  /* inclusive */
>  #define OOM_ADJUST_MIN (-16)
>  #define OOM_ADJUST_MAX 15
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3ab08e4..e10b12b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -629,6 +629,8 @@ struct signal_struct {
>  	unsigned audit_tty;
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
> +
> +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
>  };
>  
>  /* Context switch must be unlocked if interrupts are to be enabled */

I don't believe oom_adj is an appropriate use of signal_struct, sorry.

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 869dc22..c741a45 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -688,6 +689,7 @@ static void exit_mm(struct task_struct * tsk)
>  	enter_lazy_tlb(mm, current);
>  	/* We don't want this task to be frozen prematurely */
>  	clear_freeze_flag(tsk);
> +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
>  	task_unlock(tsk);
>  	mm_update_next_owner(mm);
>  	mmput(mm);

This is similiar to an early proposal that wanted to keep an array of 
oom_adj values for tasks attached to the mm in mm_struct.  The problem is 
that you're obviously losing information about all threads attached to the 
mm any time one of the threads exits or writes to /proc/pid/oom_adj.  That 
information can only be regenerated with a tasklist scan.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b42695..b7cb474 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -426,6 +427,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	init_rwsem(&mm->mmap_sem);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
>  	set_mm_counter(mm, file_rss, 0);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 175a67a..eae2d78 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -58,7 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
> -	int oom_adj;
> +	s8 oom_adj;
>  
>  	task_lock(p);
>  	mm = p->mm;
> @@ -66,7 +66,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		task_unlock(p);
>  		return 0;
>  	}
> -	oom_adj = mm->oom_adj;
> +
> +	if (mm->oom_adj_cached < p->signal->oom_adj)
> +		mm->oom_adj_cached = p->signal->oom_adj;

This conditional will never be true since mm->oom_adj_cached is 
initialized to 15, which is the upper bound on which p->signal->oom_adj 
may ever be, so mm->oom_adj_cached never gets changed from 
OOM_CACHE_DEFAULT.

Thus, this patch doesn't even work, and you probably would have noticed 
that if you'd checked /proc/pid/oom_score for any pid.

Even if mm->oom_adj_cached _was_ properly updated here, 
/proc/pid/oom_score would be out of sync with more negative oom_adj values 
for threads sharing the mm_struct since it calls badness() for only a 
single thread.

> +	oom_adj = mm->oom_adj_cached;
>  	if (oom_adj == OOM_DISABLE) {
>  		task_unlock(p);
>  		return 0;
> @@ -350,7 +354,7 @@ static int oom_kill_task(struct task_struct *p)
>  
>  	task_lock(p);
>  	mm = p->mm;
> -	if (!mm || mm->oom_adj == OOM_DISABLE) {
> +	if (!mm || p->signal->oom_adj == OOM_DISABLE) {
>  		task_unlock(p);
>  		return 1;
>  	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31 19:18                 ` David Rientjes
@ 2009-08-01  1:10                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-01  1:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Paul Menage,
	KOSAKI Motohiro, linux-kernel, linux-mm

David Rientjes wrote:

>> > It livelocks if a thread is chosen and passed to oom_kill_task() while
>> > another per-thread oom_adj value is OOM_DISABLE for a thread sharing
>> the
>> > same memory.
>> >
>> I say "why don't modify buggy selection logic?"
>>
>> Why we have to scan all threads ?
>> As fs/proc/readdir does, you can scan only "process group leader".
>>
>> per-thread scan itself is buggy because now we have per-process
>> effective-oom-adj.
>>
>
> Without my patches to change oom_adj from task_struct to mm_struct, you'd
> need to scan all tasks and not just the tgids because their oom_adj values
> can differ amongst threads in the same thread group.  So while it may now
> be possible to shorten the scan as a result of my approach, it isn't a
> solution itself to the problem.

Did I said "revert your patch in -rc" even once ?
livelock-avoidance itself is good work, thank you.
All my suggestion is based on your patch already in rc4.
Summarizing I think now .....
  - rename mm->oom_adj as mm->effective_oom_adj
  - re-add per-thread oom_adj
  - update mm->effective_oom_adj based on per-thread oom_adj
  - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
    or show 2 values in /proc/pid/oom_adj
  - rewrite documentation about oom_score.
   " it's calclulated from  _process's_ memory usage and oom_adj of
    all threads which shares a memor  context".
   This behavior is not changed from old implemtation, anyway.
 - If necessary, rewrite oom_kill itself to scan only thread group
   leader. It's a way to go regardless of  vfork problem.



>
>> > How else do you propose the oom killer use oom_adj values on a
>> per-thread
>> > basis without considering other threads sharing the same memory?
>> As I wrote.
>>    per-process(signal struct) or per-thread oom_adj and add
>>    mm->effecitve_oom_adj
>>
>> task scanning isn't necessary to do per-thread scan and you can scan
>> only process-group-leader. What's bad ?
>> If oom_score is problem, plz fix it to show effective_oom_score.
>>
>
> When only using (and showing) mm->effective_oom_adj for a task, userspace
> will not be able to adjust /proc/pid/oom_score with /proc/pid/oom_adj
> as Documentation/filesystems/proc.txt says you can for a thread unless it
> exceeds effective_oom_adj.>

Is it different from old behavior ?
I think documentation is wrong. It should say "you should think of
multi-thread effect to oom_adj/oom_score".

Thanks,
-Kame

> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-01  1:10                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-01  1:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Paul Menage,
	KOSAKI Motohiro, linux-kernel, linux-mm

David Rientjes wrote:

>> > It livelocks if a thread is chosen and passed to oom_kill_task() while
>> > another per-thread oom_adj value is OOM_DISABLE for a thread sharing
>> the
>> > same memory.
>> >
>> I say "why don't modify buggy selection logic?"
>>
>> Why we have to scan all threads ?
>> As fs/proc/readdir does, you can scan only "process group leader".
>>
>> per-thread scan itself is buggy because now we have per-process
>> effective-oom-adj.
>>
>
> Without my patches to change oom_adj from task_struct to mm_struct, you'd
> need to scan all tasks and not just the tgids because their oom_adj values
> can differ amongst threads in the same thread group.  So while it may now
> be possible to shorten the scan as a result of my approach, it isn't a
> solution itself to the problem.

Did I said "revert your patch in -rc" even once ?
livelock-avoidance itself is good work, thank you.
All my suggestion is based on your patch already in rc4.
Summarizing I think now .....
  - rename mm->oom_adj as mm->effective_oom_adj
  - re-add per-thread oom_adj
  - update mm->effective_oom_adj based on per-thread oom_adj
  - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
    or show 2 values in /proc/pid/oom_adj
  - rewrite documentation about oom_score.
   " it's calclulated from  _process's_ memory usage and oom_adj of
    all threads which shares a memor  context".
   This behavior is not changed from old implemtation, anyway.
 - If necessary, rewrite oom_kill itself to scan only thread group
   leader. It's a way to go regardless of  vfork problem.



>
>> > How else do you propose the oom killer use oom_adj values on a
>> per-thread
>> > basis without considering other threads sharing the same memory?
>> As I wrote.
>>    per-process(signal struct) or per-thread oom_adj and add
>>    mm->effecitve_oom_adj
>>
>> task scanning isn't necessary to do per-thread scan and you can scan
>> only process-group-leader. What's bad ?
>> If oom_score is problem, plz fix it to show effective_oom_score.
>>
>
> When only using (and showing) mm->effective_oom_adj for a task, userspace
> will not be able to adjust /proc/pid/oom_score with /proc/pid/oom_adj
> as Documentation/filesystems/proc.txt says you can for a thread unless it
> exceeds effective_oom_adj.>

Is it different from old behavior ?
I think documentation is wrong. It should say "you should think of
multi-thread effect to oom_adj/oom_score".

Thanks,
-Kame

> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-01  1:10                   ` KAMEZAWA Hiroyuki
@ 2009-08-01 20:26                     ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-08-01 20:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Sat, 1 Aug 2009, KAMEZAWA Hiroyuki wrote:

> Summarizing I think now .....
>   - rename mm->oom_adj as mm->effective_oom_adj
>   - re-add per-thread oom_adj
>   - update mm->effective_oom_adj based on per-thread oom_adj
>   - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
>     or show 2 values in /proc/pid/oom_adj
>   - rewrite documentation about oom_score.
>    " it's calclulated from  _process's_ memory usage and oom_adj of
>     all threads which shares a memor  context".
>    This behavior is not changed from old implemtation, anyway.
>  - If necessary, rewrite oom_kill itself to scan only thread group
>    leader. It's a way to go regardless of  vfork problem.
> 

Ok, so you've abandoned the signal_struct proposal and now want to add it 
back to task_struct with an effective member in mm_struct by changing the 
documentation.  Hmm.

This solves the livelock problem by adding additional tunables, but 
doesn't match how the documentation describes the use case for 
/proc/pid/oom_adj.  Your argument is that the behavior of that value can't 
change: that it must be per-thread.  And that allowance leads to one of 
two inconsistent scenarios:

 - /proc/pid/oom_score is inconsistent when tuning /proc/pid/oom_adj if it
   relies on the per-thread oom_adj; it now really represents nothing but
   an incorrect value if other threads share that memory and misleads the
   user on how the oom killer chooses victims, or

 - /proc/pid/oom_score is inconsistent when the thread that set the
   effective per-mm oom_adj exits and it is now obsolete since you have
   no way to determine what the next effective oom_adj value shall be.

Determining the next effective per-mm oom_adj isn't possible when the only 
threads sharing the mm remaining have different per-thread oom_adj values.  
That's a horribly inconsistent state to be getting into because it allows 
oom_score to change when a thread exits, which is completely unknown to 
userspace, OR is allows the effective per-mm oom_adj to be different from 
all threads sharing the same memory (and, thus, /proc/pid/oom_score not 
being representative of any thread's /proc/pid/oom_adj).

> I think documentation is wrong. It should say "you should think of
> multi-thread effect to oom_adj/oom_score".
> 

It's more likely than not that applications were probably written to the 
way the documentation described the two files: that is, adjust 
/proc/pid/oom_score by tuning /proc/pid/oom_adj instead of relying on an 
undocumented implementation detail concerning the tuning of oom_adj for a 
vfork'd child prior to exec().  The user is probably unaware of the oom 
killer's implementation and simply interprets a higher oom_score as a more 
likely candidate for oom kill.  My patches preserve that in all scenarios 
without altering the documentation or adding additional files that would 
be required to leave the oom_adj value itself in an inconsistent state as 
you propose.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-01 20:26                     ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-08-01 20:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Sat, 1 Aug 2009, KAMEZAWA Hiroyuki wrote:

> Summarizing I think now .....
>   - rename mm->oom_adj as mm->effective_oom_adj
>   - re-add per-thread oom_adj
>   - update mm->effective_oom_adj based on per-thread oom_adj
>   - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
>     or show 2 values in /proc/pid/oom_adj
>   - rewrite documentation about oom_score.
>    " it's calclulated from  _process's_ memory usage and oom_adj of
>     all threads which shares a memor  context".
>    This behavior is not changed from old implemtation, anyway.
>  - If necessary, rewrite oom_kill itself to scan only thread group
>    leader. It's a way to go regardless of  vfork problem.
> 

Ok, so you've abandoned the signal_struct proposal and now want to add it 
back to task_struct with an effective member in mm_struct by changing the 
documentation.  Hmm.

This solves the livelock problem by adding additional tunables, but 
doesn't match how the documentation describes the use case for 
/proc/pid/oom_adj.  Your argument is that the behavior of that value can't 
change: that it must be per-thread.  And that allowance leads to one of 
two inconsistent scenarios:

 - /proc/pid/oom_score is inconsistent when tuning /proc/pid/oom_adj if it
   relies on the per-thread oom_adj; it now really represents nothing but
   an incorrect value if other threads share that memory and misleads the
   user on how the oom killer chooses victims, or

 - /proc/pid/oom_score is inconsistent when the thread that set the
   effective per-mm oom_adj exits and it is now obsolete since you have
   no way to determine what the next effective oom_adj value shall be.

Determining the next effective per-mm oom_adj isn't possible when the only 
threads sharing the mm remaining have different per-thread oom_adj values.  
That's a horribly inconsistent state to be getting into because it allows 
oom_score to change when a thread exits, which is completely unknown to 
userspace, OR is allows the effective per-mm oom_adj to be different from 
all threads sharing the same memory (and, thus, /proc/pid/oom_score not 
being representative of any thread's /proc/pid/oom_adj).

> I think documentation is wrong. It should say "you should think of
> multi-thread effect to oom_adj/oom_score".
> 

It's more likely than not that applications were probably written to the 
way the documentation described the two files: that is, adjust 
/proc/pid/oom_score by tuning /proc/pid/oom_adj instead of relying on an 
undocumented implementation detail concerning the tuning of oom_adj for a 
vfork'd child prior to exec().  The user is probably unaware of the oom 
killer's implementation and simply interprets a higher oom_score as a more 
likely candidate for oom kill.  My patches preserve that in all scenarios 
without altering the documentation or adding additional files that would 
be required to leave the oom_adj value itself in an inconsistent state as 
you propose.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-01 20:26                     ` David Rientjes
@ 2009-08-03  1:42                       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  1:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Sat, 1 Aug 2009 13:26:52 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Sat, 1 Aug 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Summarizing I think now .....
> >   - rename mm->oom_adj as mm->effective_oom_adj
> >   - re-add per-thread oom_adj
> >   - update mm->effective_oom_adj based on per-thread oom_adj
> >   - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
> >     or show 2 values in /proc/pid/oom_adj
> >   - rewrite documentation about oom_score.
> >    " it's calclulated from  _process's_ memory usage and oom_adj of
> >     all threads which shares a memor  context".
> >    This behavior is not changed from old implemtation, anyway.
> >  - If necessary, rewrite oom_kill itself to scan only thread group
> >    leader. It's a way to go regardless of  vfork problem.
> > 
> 
> Ok, so you've abandoned the signal_struct proposal and now want to add it 
per-signal is also ok, just I didn't write.

> back to task_struct with an effective member in mm_struct by changing the 
> documentation.  Hmm.
> 
> This solves the livelock problem by adding additional tunables, but 
> doesn't match how the documentation describes the use case for 
> /proc/pid/oom_adj.  Your argument is that the behavior of that value can't 
> change: that it must be per-thread.  And that allowance leads to one of 
> two inconsistent scenarios:
> 
>  - /proc/pid/oom_score is inconsistent when tuning /proc/pid/oom_adj if it
>    relies on the per-thread oom_adj; it now really represents nothing but
>    an incorrect value if other threads share that memory and misleads the
>    user on how the oom killer chooses victims, or

What's why I said to show effective_oom_adj if necessary..

> 
>  - /proc/pid/oom_score is inconsistent when the thread that set the
>    effective per-mm oom_adj exits and it is now obsolete since you have
>    no way to determine what the next effective oom_adj value shall be.
> 
plz re-caluculate it. it's not a big job if done in lazy way.


> Determining the next effective per-mm oom_adj isn't possible when the only 
> threads sharing the mm remaining have different per-thread oom_adj values.  
> That's a horribly inconsistent state to be getting into because it allows 
> oom_score to change when a thread exits, which is completely unknown to 
> userspace, OR is allows the effective per-mm oom_adj to be different from 
> all threads sharing the same memory (and, thus, /proc/pid/oom_score not 
> being representative of any thread's /proc/pid/oom_adj).
> 
A _sane_ user will just set oom_adj to thread-group-leader.
Do you think users are too fool to set per-thread oom_adj independently ?
No problems in real world.


> > I think documentation is wrong. It should say "you should think of
> > multi-thread effect to oom_adj/oom_score".
> > 
> 
> It's more likely than not that applications were probably written to the 
> way the documentation described the two files: that is, adjust 
> /proc/pid/oom_score by tuning /proc/pid/oom_adj instead of relying on an 
> undocumented implementation detail concerning the tuning of oom_adj for a 
> vfork'd child prior to exec().  The user is probably unaware of the oom 
> killer's implementation and simply interprets a higher oom_score as a more 
> likely candidate for oom kill.  My patches preserve that in all scenarios 
> without altering the documentation or adding additional files that would 
> be required to leave the oom_adj value itself in an inconsistent state as 
> you propose.
> 
No.  My understanding is this.

 - oom_adj is designed considering vfork(), of course. then. per-thread.
 - oom_score has been incorrect in multi-threaded system. The user will not
   be affected.
 - you fixed livelock but breaks the feature.


Thanks,
-Kame
  





^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03  1:42                       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  1:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Sat, 1 Aug 2009 13:26:52 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Sat, 1 Aug 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Summarizing I think now .....
> >   - rename mm->oom_adj as mm->effective_oom_adj
> >   - re-add per-thread oom_adj
> >   - update mm->effective_oom_adj based on per-thread oom_adj
> >   - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
> >     or show 2 values in /proc/pid/oom_adj
> >   - rewrite documentation about oom_score.
> >    " it's calclulated from  _process's_ memory usage and oom_adj of
> >     all threads which shares a memor  context".
> >    This behavior is not changed from old implemtation, anyway.
> >  - If necessary, rewrite oom_kill itself to scan only thread group
> >    leader. It's a way to go regardless of  vfork problem.
> > 
> 
> Ok, so you've abandoned the signal_struct proposal and now want to add it 
per-signal is also ok, just I didn't write.

> back to task_struct with an effective member in mm_struct by changing the 
> documentation.  Hmm.
> 
> This solves the livelock problem by adding additional tunables, but 
> doesn't match how the documentation describes the use case for 
> /proc/pid/oom_adj.  Your argument is that the behavior of that value can't 
> change: that it must be per-thread.  And that allowance leads to one of 
> two inconsistent scenarios:
> 
>  - /proc/pid/oom_score is inconsistent when tuning /proc/pid/oom_adj if it
>    relies on the per-thread oom_adj; it now really represents nothing but
>    an incorrect value if other threads share that memory and misleads the
>    user on how the oom killer chooses victims, or

What's why I said to show effective_oom_adj if necessary..

> 
>  - /proc/pid/oom_score is inconsistent when the thread that set the
>    effective per-mm oom_adj exits and it is now obsolete since you have
>    no way to determine what the next effective oom_adj value shall be.
> 
plz re-caluculate it. it's not a big job if done in lazy way.


> Determining the next effective per-mm oom_adj isn't possible when the only 
> threads sharing the mm remaining have different per-thread oom_adj values.  
> That's a horribly inconsistent state to be getting into because it allows 
> oom_score to change when a thread exits, which is completely unknown to 
> userspace, OR is allows the effective per-mm oom_adj to be different from 
> all threads sharing the same memory (and, thus, /proc/pid/oom_score not 
> being representative of any thread's /proc/pid/oom_adj).
> 
A _sane_ user will just set oom_adj to thread-group-leader.
Do you think users are too fool to set per-thread oom_adj independently ?
No problems in real world.


> > I think documentation is wrong. It should say "you should think of
> > multi-thread effect to oom_adj/oom_score".
> > 
> 
> It's more likely than not that applications were probably written to the 
> way the documentation described the two files: that is, adjust 
> /proc/pid/oom_score by tuning /proc/pid/oom_adj instead of relying on an 
> undocumented implementation detail concerning the tuning of oom_adj for a 
> vfork'd child prior to exec().  The user is probably unaware of the oom 
> killer's implementation and simply interprets a higher oom_score as a more 
> likely candidate for oom kill.  My patches preserve that in all scenarios 
> without altering the documentation or adding additional files that would 
> be required to leave the oom_adj value itself in an inconsistent state as 
> you propose.
> 
No.  My understanding is this.

 - oom_adj is designed considering vfork(), of course. then. per-thread.
 - oom_score has been incorrect in multi-threaded system. The user will not
   be affected.
 - you fixed livelock but breaks the feature.


Thanks,
-Kame
  




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03  1:42                       ` KAMEZAWA Hiroyuki
@ 2009-08-03  7:59                         ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-08-03  7:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009, KAMEZAWA Hiroyuki wrote:

> >  - /proc/pid/oom_score is inconsistent when tuning /proc/pid/oom_adj if it
> >    relies on the per-thread oom_adj; it now really represents nothing but
> >    an incorrect value if other threads share that memory and misleads the
> >    user on how the oom killer chooses victims, or
> 
> What's why I said to show effective_oom_adj if necessary..
> 

Right, but which of the following two behaviors do you believe the 
majority of today's user applications are written to use?

 (1) /proc/pid/oom_score represents the badness heuristic that the oom
     killer uses to determine which task to kill, or

 (2) /proc/pid/oom_adj can be adjusted after vfork() and prior to exec()
     to represent the oom preference of the child without simultaneously
     changing the oom preference of the parent.

The two are at a complete contrast and cannot co-exist.  I favor behavior 
(1), which is why my patches make it consistent in _all_ cases, since it 
is more likely than not that the majority of user applications use that 
behavior if, for no other reason, than it is the DOCUMENTED reason.

If you feel that's an unreasonable conclusion, then please say that so 
your argument can be judged based on your interpretation of that behavior 
which I believe most others would disagree with.  Otherwise, our 
discussion will continue to go in circles.

> >  - /proc/pid/oom_score is inconsistent when the thread that set the
> >    effective per-mm oom_adj exits and it is now obsolete since you have
> >    no way to determine what the next effective oom_adj value shall be.
> > 
> plz re-caluculate it. it's not a big job if done in lazy way.
> 

You can't recalculate it if all the remaining threads have a different 
oom_adj value than the effective oom_adj value from the thread that is now 
exited.  There is no assumption that, for instance, the most negative 
oom_adj value shall then be used.  Imagine the effective oom_adj value 
being +15 and a thread sharing the same memory has an oom_adj value of 
-16.  Under no reasonable circumstance should the oom preference of the 
entire thread then change to -16 just because its the side-effect of a 
thread exiting.

That's the _entire_ reason why we need consistency in oom_adj values so 
that userspace is aware of how the oom killer really works and chooses 
tasks.  I understand that it differs from the previously allowed behavior, 
but those userspace applications need to be fixed if, for no other reason, 
they are now consistent with how the oom killer kills tasks.  I think 
that's a very worthwhile goal and the cost of moving to a new interface 
such as /proc/pid/oom_adj_child to have the same inheritance property that 
was available in the past is justified.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03  7:59                         ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-08-03  7:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009, KAMEZAWA Hiroyuki wrote:

> >  - /proc/pid/oom_score is inconsistent when tuning /proc/pid/oom_adj if it
> >    relies on the per-thread oom_adj; it now really represents nothing but
> >    an incorrect value if other threads share that memory and misleads the
> >    user on how the oom killer chooses victims, or
> 
> What's why I said to show effective_oom_adj if necessary..
> 

Right, but which of the following two behaviors do you believe the 
majority of today's user applications are written to use?

 (1) /proc/pid/oom_score represents the badness heuristic that the oom
     killer uses to determine which task to kill, or

 (2) /proc/pid/oom_adj can be adjusted after vfork() and prior to exec()
     to represent the oom preference of the child without simultaneously
     changing the oom preference of the parent.

The two are at a complete contrast and cannot co-exist.  I favor behavior 
(1), which is why my patches make it consistent in _all_ cases, since it 
is more likely than not that the majority of user applications use that 
behavior if, for no other reason, than it is the DOCUMENTED reason.

If you feel that's an unreasonable conclusion, then please say that so 
your argument can be judged based on your interpretation of that behavior 
which I believe most others would disagree with.  Otherwise, our 
discussion will continue to go in circles.

> >  - /proc/pid/oom_score is inconsistent when the thread that set the
> >    effective per-mm oom_adj exits and it is now obsolete since you have
> >    no way to determine what the next effective oom_adj value shall be.
> > 
> plz re-caluculate it. it's not a big job if done in lazy way.
> 

You can't recalculate it if all the remaining threads have a different 
oom_adj value than the effective oom_adj value from the thread that is now 
exited.  There is no assumption that, for instance, the most negative 
oom_adj value shall then be used.  Imagine the effective oom_adj value 
being +15 and a thread sharing the same memory has an oom_adj value of 
-16.  Under no reasonable circumstance should the oom preference of the 
entire thread then change to -16 just because its the side-effect of a 
thread exiting.

That's the _entire_ reason why we need consistency in oom_adj values so 
that userspace is aware of how the oom killer really works and chooses 
tasks.  I understand that it differs from the previously allowed behavior, 
but those userspace applications need to be fixed if, for no other reason, 
they are now consistent with how the oom killer kills tasks.  I think 
that's a very worthwhile goal and the cost of moving to a new interface 
such as /proc/pid/oom_adj_child to have the same inheritance property that 
was available in the past is justified.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03  7:59                         ` David Rientjes
@ 2009-08-03  8:02                           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  8:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009 00:59:18 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> > >  - /proc/pid/oom_score is inconsistent when the thread that set the
> > >    effective per-mm oom_adj exits and it is now obsolete since you have
> > >    no way to determine what the next effective oom_adj value shall be.
> > > 
> > plz re-caluculate it. it's not a big job if done in lazy way.
> > 
> 
> You can't recalculate it if all the remaining threads have a different 
> oom_adj value than the effective oom_adj value from the thread that is now 
> exited.  

Then, crazy google apps pass different oom_adjs to each thread ?
And, threads other than thread-group-leader modifies its oom_adj.

Hmm, interesting.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03  8:02                           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  8:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009 00:59:18 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> > >  - /proc/pid/oom_score is inconsistent when the thread that set the
> > >    effective per-mm oom_adj exits and it is now obsolete since you have
> > >    no way to determine what the next effective oom_adj value shall be.
> > > 
> > plz re-caluculate it. it's not a big job if done in lazy way.
> > 
> 
> You can't recalculate it if all the remaining threads have a different 
> oom_adj value than the effective oom_adj value from the thread that is now 
> exited.  

Then, crazy google apps pass different oom_adjs to each thread ?
And, threads other than thread-group-leader modifies its oom_adj.

Hmm, interesting.


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03  8:02                           ` KAMEZAWA Hiroyuki
@ 2009-08-03  8:08                             ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-08-03  8:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009, KAMEZAWA Hiroyuki wrote:

> > You can't recalculate it if all the remaining threads have a different 
> > oom_adj value than the effective oom_adj value from the thread that is now 
> > exited.  
> 
> Then, crazy google apps pass different oom_adjs to each thread ?
> And, threads other than thread-group-leader modifies its oom_adj.
> 

Nope, but I'm afraid you've just made my point for me: it shows that 
oom_adj really isn't sanely used as a per-thread attribute and actually 
only represents a preference on oom killing a quantity of memory in all 
other cases other than vfork() -> change /proc/pid-of-child/oom_adj -> 
exec() for which we now appropriately have /proc/pid/oom_adj_child for.

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03  8:08                             ` David Rientjes
  0 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2009-08-03  8:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009, KAMEZAWA Hiroyuki wrote:

> > You can't recalculate it if all the remaining threads have a different 
> > oom_adj value than the effective oom_adj value from the thread that is now 
> > exited.  
> 
> Then, crazy google apps pass different oom_adjs to each thread ?
> And, threads other than thread-group-leader modifies its oom_adj.
> 

Nope, but I'm afraid you've just made my point for me: it shows that 
oom_adj really isn't sanely used as a per-thread attribute and actually 
only represents a preference on oom killing a quantity of memory in all 
other cases other than vfork() -> change /proc/pid-of-child/oom_adj -> 
exec() for which we now appropriately have /proc/pid/oom_adj_child for.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03  8:08                             ` David Rientjes
@ 2009-08-03  8:45                               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  8:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009 01:08:42 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 3 Aug 2009, KAMEZAWA Hiroyuki wrote:
> 
> > > You can't recalculate it if all the remaining threads have a different 
> > > oom_adj value than the effective oom_adj value from the thread that is now 
> > > exited.  
> > 
> > Then, crazy google apps pass different oom_adjs to each thread ?
> > And, threads other than thread-group-leader modifies its oom_adj.
> > 
> 
> Nope, but I'm afraid you've just made my point for me: it shows that 
> oom_adj really isn't sanely used as a per-thread attribute and actually 
> only represents a preference on oom killing a quantity of memory in all 
> other cases other than vfork() -> change /proc/pid-of-child/oom_adj -> 
> exec() for which we now appropriately have /proc/pid/oom_adj_child for.
> 
Maybe you're man I can't persuade. but making progress will be necessary.

The most ugly thing which annoies me is this part. 

@@ -679,6 +679,7 @@ good_mm:                    #  a label in copy_mm.
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	tsk->oom_adj_child = mm->oom_adj;
 	return 0;

Why ?

I wonder oom_adj_exec "change oom_adj to this value when execve() is called"
is much more straightforward, simple and easy to understand than oom_adj_child.

"just inherit at fork, change at exec" is an usual manner, I think.
If oom_adj_exec rather than oom_adj_child, I won't complain, more.


Thanks,
-Kame




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03  8:45                               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  8:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Paul Menage, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 3 Aug 2009 01:08:42 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 3 Aug 2009, KAMEZAWA Hiroyuki wrote:
> 
> > > You can't recalculate it if all the remaining threads have a different 
> > > oom_adj value than the effective oom_adj value from the thread that is now 
> > > exited.  
> > 
> > Then, crazy google apps pass different oom_adjs to each thread ?
> > And, threads other than thread-group-leader modifies its oom_adj.
> > 
> 
> Nope, but I'm afraid you've just made my point for me: it shows that 
> oom_adj really isn't sanely used as a per-thread attribute and actually 
> only represents a preference on oom killing a quantity of memory in all 
> other cases other than vfork() -> change /proc/pid-of-child/oom_adj -> 
> exec() for which we now appropriately have /proc/pid/oom_adj_child for.
> 
Maybe you're man I can't persuade. but making progress will be necessary.

The most ugly thing which annoies me is this part. 

@@ -679,6 +679,7 @@ good_mm:                    #  a label in copy_mm.
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	tsk->oom_adj_child = mm->oom_adj;
 	return 0;

Why ?

I wonder oom_adj_exec "change oom_adj to this value when execve() is called"
is much more straightforward, simple and easy to understand than oom_adj_child.

"just inherit at fork, change at exec" is an usual manner, I think.
If oom_adj_exec rather than oom_adj_child, I won't complain, more.


Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03  8:45                               ` KAMEZAWA Hiroyuki
@ 2009-08-03  8:55                                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  8:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Paul Menage,
	KOSAKI Motohiro, linux-kernel, linux-mm

On Mon, 3 Aug 2009 17:45:19 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> "just inherit at fork, change at exec" is an usual manner, I think.
> If oom_adj_exec rather than oom_adj_child, I won't complain, more.
> 
But this/(and yours) requires users to rewrite their apps.
Then, breaks current API.
please fight with other guardians.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03  8:55                                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-03  8:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Paul Menage,
	KOSAKI Motohiro, linux-kernel, linux-mm

On Mon, 3 Aug 2009 17:45:19 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> "just inherit at fork, change at exec" is an usual manner, I think.
> If oom_adj_exec rather than oom_adj_child, I won't complain, more.
> 
But this/(and yours) requires users to rewrite their apps.
Then, breaks current API.
please fight with other guardians.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31  9:31         ` David Rientjes
@ 2009-08-03 11:58           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 11:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

> On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:
> 
> > > That's because the oom killer only really considers the highest oom_adj 
> > > value amongst all threads that share the same mm.  Allowing those threads 
> > > to each have different oom_adj values leads (i) to an inconsistency in 
> > > reporting /proc/pid/oom_score for how the oom killer selects a task to 
> > > kill and (ii) the oom killer livelock that it fixes when one thread 
> > > happens to be OOM_DISABLE.
> > 
> > I agree both. again I only disagree ABI breakage regression and
> > stupid new /proc interface.
> 
> Let's state the difference in behavior as of 2.6.31-rc1: applications can 
> no longer change the oom_adj value of a vfork() child prior to exec() 
> without it also affecting the parent.  I agree that was previously 
> allowed.  And it was that very allowance that LEADS TO THE LIVELOCK 
> because they both share a VM and it was possible for the oom killer to 
> select the one of the threads while the other was OOM_DISABLE.
> 
> This is an extremely simple livelock to trigger, AND YOU DON'T EVEN NEED 
> CAP_SYS_RESOURCE TO DO IT.  Consider a job scheduler that superuser has 
> set to OOM_DISABLE because of its necessity to the system.  Imagine if 
> that job scheduler vfork's a child and sets its inherited oom_adj value of 
> OOM_DISABLE to something higher so that the machine doesn't panic on 
> exec() when the child spikes in memory usage when the application first 
> starts.
> 
> Now imagine that either there are no other user threads or the job 
> scheduler itself has allocated more pages than any other thread.  Or, more 
> simply, imagine that it sets the child's oom_adj value to a higher 
> priority than other threads based on some heuristic.  Regardless, if the 
> system becomes oom before the exec() can happen and before the new VM is 
> attached to the child, the machine livelocks.
> 
> That happens because of two things:
> 
>  - the oom killer uses the oom_adj value to adjust the oom_score for a
>    task, and that score is mainly based on the size of each thread's VM,
>    and
> 
>  - the oom killer cannot kill a thread that shares a VM with an
>    OOM_DISABLE thread because it will not lead to future memory freeing.
> 
> So the preferred solution for complete consistency and to fix the livelock 
> is to make the oom_adj value a characteristic of the VM, because THAT'S 
> WHAT IT ACTS ON.  The effective oom_adj value for a thread is always equal 
> to the highest oom_adj value of any thread sharing its VM.
> 
> Do we really want to keep this inconsistency around forever in the kernel 
> so that /proc/pid/oom_score actually means NOTHING because another thread 
> sharing the memory has a different oom_adj?  Or do we want to take the 
> opportunity to fix a broken userspace model that leads to a livelock to 
> fix it and move on with a consistent interface and, with oom_adj_child, 
> all the functionality you had before.
> 
> And you and KAMEZAWA-san can continue to call my patches stupid, but 
> that's not adding anything to your argument.

Then, your patch will got full reverting ;)
I wouldn't hope this... please.


> 
> > Paul already pointed out this issue can be fixed without ABI change.
> > 
> 
> I'm unaware of any viable solution that has been proposed, sorry.

Please see my another mail. it's contain the patch.


> > if you feel my stand point is double standard, I need explain me more.
> > So, I don't think per-process oom_adj makes any regression on _real_ world.
> 
> Wrong, our machines have livelocked because of the exact scenario I 
> described above.

Hua?
David, per-process oom_adj was made _your_ patch. Do you propose
NAK yourself patch?

maybe, We made any miscommunication?

> > but vfork()'s one is real world issue.
> > 
> 
> And it's based on a broken assumption that oom_adj values actually mean 
> anything independent of other threads sharing the same memory.  That's a 
> completely false assumption.  Applications that are tuning oom_adj value 
> will rely on oom_scores, which are currently false if oom_adj differs 
> amongst those threads, and should be written to how the oom killer uses 
> the value.

No. another process have another process value is valid assumption.
sharing struct_mm is deeply implementaion detail. it shouldn't be
exposed userland.

Why do you think false assumption? In UNIX/Linux programming
is frequently used following idiom.

	if (fork() == 0) {
		setting_something_process_property();
		execve("new-command");
	}

vfork() is also frequently used. It is allowed long time common practice.
I don't think we can says "hey, you are silly!" to application developer.


> 
> > And, May I explay why I think your oom_adj_child is wrong idea?
> > The fact is: new feature introducing never fix regression. yes, some
> > application use new interface and disappear the problem. but other
> > application still hit the problem. that's not correct development style
> > in kernel.
> > 
> 
> So you're proposing that we forever allow /proc/pid/oom_score to be 
> completely wrong for pid without any knowledge to userspace?  That we 
> falsely advertise what it represents and allow userspace to believe that 
> changing oom_adj for a thread sharing memory with other threads actually 
> changes how the oom killer selects tasks?

No. perhaps no doublly.

1) In my patch, oom_score is also per-process value. all thread have the same
   oom_score.
   It's clear meaning.
2) In almost case, oom_score display collect value because oom_adj is per-process
   value too. 
   Yes, there is one exception. vfork() and change oom_adj'ed process might display 
   wrong value. but I don't think it is serious problem because vfork() process call
   exec() soon.
   Administrator never recognize this difference.

> Please.

David, I hope you join to fix this regression. I can't believe we
can't fix this issue honestly.






^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03 11:58           ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 11:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm

> On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:
> 
> > > That's because the oom killer only really considers the highest oom_adj 
> > > value amongst all threads that share the same mm.  Allowing those threads 
> > > to each have different oom_adj values leads (i) to an inconsistency in 
> > > reporting /proc/pid/oom_score for how the oom killer selects a task to 
> > > kill and (ii) the oom killer livelock that it fixes when one thread 
> > > happens to be OOM_DISABLE.
> > 
> > I agree both. again I only disagree ABI breakage regression and
> > stupid new /proc interface.
> 
> Let's state the difference in behavior as of 2.6.31-rc1: applications can 
> no longer change the oom_adj value of a vfork() child prior to exec() 
> without it also affecting the parent.  I agree that was previously 
> allowed.  And it was that very allowance that LEADS TO THE LIVELOCK 
> because they both share a VM and it was possible for the oom killer to 
> select the one of the threads while the other was OOM_DISABLE.
> 
> This is an extremely simple livelock to trigger, AND YOU DON'T EVEN NEED 
> CAP_SYS_RESOURCE TO DO IT.  Consider a job scheduler that superuser has 
> set to OOM_DISABLE because of its necessity to the system.  Imagine if 
> that job scheduler vfork's a child and sets its inherited oom_adj value of 
> OOM_DISABLE to something higher so that the machine doesn't panic on 
> exec() when the child spikes in memory usage when the application first 
> starts.
> 
> Now imagine that either there are no other user threads or the job 
> scheduler itself has allocated more pages than any other thread.  Or, more 
> simply, imagine that it sets the child's oom_adj value to a higher 
> priority than other threads based on some heuristic.  Regardless, if the 
> system becomes oom before the exec() can happen and before the new VM is 
> attached to the child, the machine livelocks.
> 
> That happens because of two things:
> 
>  - the oom killer uses the oom_adj value to adjust the oom_score for a
>    task, and that score is mainly based on the size of each thread's VM,
>    and
> 
>  - the oom killer cannot kill a thread that shares a VM with an
>    OOM_DISABLE thread because it will not lead to future memory freeing.
> 
> So the preferred solution for complete consistency and to fix the livelock 
> is to make the oom_adj value a characteristic of the VM, because THAT'S 
> WHAT IT ACTS ON.  The effective oom_adj value for a thread is always equal 
> to the highest oom_adj value of any thread sharing its VM.
> 
> Do we really want to keep this inconsistency around forever in the kernel 
> so that /proc/pid/oom_score actually means NOTHING because another thread 
> sharing the memory has a different oom_adj?  Or do we want to take the 
> opportunity to fix a broken userspace model that leads to a livelock to 
> fix it and move on with a consistent interface and, with oom_adj_child, 
> all the functionality you had before.
> 
> And you and KAMEZAWA-san can continue to call my patches stupid, but 
> that's not adding anything to your argument.

Then, your patch will got full reverting ;)
I wouldn't hope this... please.


> 
> > Paul already pointed out this issue can be fixed without ABI change.
> > 
> 
> I'm unaware of any viable solution that has been proposed, sorry.

Please see my another mail. it's contain the patch.


> > if you feel my stand point is double standard, I need explain me more.
> > So, I don't think per-process oom_adj makes any regression on _real_ world.
> 
> Wrong, our machines have livelocked because of the exact scenario I 
> described above.

Hua?
David, per-process oom_adj was made _your_ patch. Do you propose
NAK yourself patch?

maybe, We made any miscommunication?

> > but vfork()'s one is real world issue.
> > 
> 
> And it's based on a broken assumption that oom_adj values actually mean 
> anything independent of other threads sharing the same memory.  That's a 
> completely false assumption.  Applications that are tuning oom_adj value 
> will rely on oom_scores, which are currently false if oom_adj differs 
> amongst those threads, and should be written to how the oom killer uses 
> the value.

No. another process have another process value is valid assumption.
sharing struct_mm is deeply implementaion detail. it shouldn't be
exposed userland.

Why do you think false assumption? In UNIX/Linux programming
is frequently used following idiom.

	if (fork() == 0) {
		setting_something_process_property();
		execve("new-command");
	}

vfork() is also frequently used. It is allowed long time common practice.
I don't think we can says "hey, you are silly!" to application developer.


> 
> > And, May I explay why I think your oom_adj_child is wrong idea?
> > The fact is: new feature introducing never fix regression. yes, some
> > application use new interface and disappear the problem. but other
> > application still hit the problem. that's not correct development style
> > in kernel.
> > 
> 
> So you're proposing that we forever allow /proc/pid/oom_score to be 
> completely wrong for pid without any knowledge to userspace?  That we 
> falsely advertise what it represents and allow userspace to believe that 
> changing oom_adj for a thread sharing memory with other threads actually 
> changes how the oom killer selects tasks?

No. perhaps no doublly.

1) In my patch, oom_score is also per-process value. all thread have the same
   oom_score.
   It's clear meaning.
2) In almost case, oom_score display collect value because oom_adj is per-process
   value too. 
   Yes, there is one exception. vfork() and change oom_adj'ed process might display 
   wrong value. but I don't think it is serious problem because vfork() process call
   exec() soon.
   Administrator never recognize this difference.

> Please.

David, I hope you join to fix this regression. I can't believe we
can't fix this issue honestly.





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03 11:58           ` KOSAKI Motohiro
@ 2009-08-03 12:12             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm


One mistake.

> > > And, May I explay why I think your oom_adj_child is wrong idea?
> > > The fact is: new feature introducing never fix regression. yes, some
> > > application use new interface and disappear the problem. but other
> > > application still hit the problem. that's not correct development style
> > > in kernel.
> > > 
> > 
> > So you're proposing that we forever allow /proc/pid/oom_score to be 
> > completely wrong for pid without any knowledge to userspace?  That we 
> > falsely advertise what it represents and allow userspace to believe that 
> > changing oom_adj for a thread sharing memory with other threads actually 
> > changes how the oom killer selects tasks?
> 
> No. perhaps no doublly.
> 
> 1) In my patch, oom_score is also per-process value. all thread have the same
>    oom_score.
>    It's clear meaning.

it's wrong explanation. oom_score is calculated from the same oom_adj.
but it have each different oom_score. sorry my confused.



> 2) In almost case, oom_score display collect value because oom_adj is per-process
>    value too. 
>    Yes, there is one exception. vfork() and change oom_adj'ed process might display 
>    wrong value. but I don't think it is serious problem because vfork() process call
>    exec() soon.
>    Administrator never recognize this difference.
> 
> > Please.
> 
> David, I hope you join to fix this regression. I can't believe we
> can't fix this issue honestly.
> 
> 
> 




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03 12:12             ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Paul Menage,
	linux-kernel, linux-mm


One mistake.

> > > And, May I explay why I think your oom_adj_child is wrong idea?
> > > The fact is: new feature introducing never fix regression. yes, some
> > > application use new interface and disappear the problem. but other
> > > application still hit the problem. that's not correct development style
> > > in kernel.
> > > 
> > 
> > So you're proposing that we forever allow /proc/pid/oom_score to be 
> > completely wrong for pid without any knowledge to userspace?  That we 
> > falsely advertise what it represents and allow userspace to believe that 
> > changing oom_adj for a thread sharing memory with other threads actually 
> > changes how the oom killer selects tasks?
> 
> No. perhaps no doublly.
> 
> 1) In my patch, oom_score is also per-process value. all thread have the same
>    oom_score.
>    It's clear meaning.

it's wrong explanation. oom_score is calculated from the same oom_adj.
but it have each different oom_score. sorry my confused.



> 2) In almost case, oom_score display collect value because oom_adj is per-process
>    value too. 
>    Yes, there is one exception. vfork() and change oom_adj'ed process might display 
>    wrong value. but I don't think it is serious problem because vfork() process call
>    exec() soon.
>    Administrator never recognize this difference.
> 
> > Please.
> 
> David, I hope you join to fix this regression. I can't believe we
> can't fix this issue honestly.
> 
> 
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-07-31 19:38               ` David Rientjes
@ 2009-08-03 12:16                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:
> 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 3ce5ae9..c64499e 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1008,7 +1008,7 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
> >  		return -ESRCH;
> >  	task_lock(task);
> >  	if (task->mm)
> > -		oom_adjust = task->mm->oom_adj;
> > +		oom_adjust = task->signal->oom_adj;
> >  	else
> >  		oom_adjust = OOM_DISABLE;
> >  	task_unlock(task);
> 
> This may display a /proc/pid/oom_adj that is radically different from 
> task->mm->oom_adj_cached without knowledge to userspace and you can't 
> simply display task->mm>oom_adj_cached here because it gets reset on every 
> write to /proc/pid/oom_adj.

Is this necessary?
different value was only happen when vfork process change oom_adj. but
vfork()ed process call exec() soon.
userland monitor program don't confuse.

Plus, if you can post any troublesome testcase, I can fix it.
this is not essential code in this patch.



> > @@ -1046,12 +1046,13 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> >  		put_task_struct(task);
> >  		return -EINVAL;
> >  	}
> > -	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> > +	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> >  		task_unlock(task);
> >  		put_task_struct(task);
> >  		return -EACCES;
> >  	}
> > -	task->mm->oom_adj = oom_adjust;
> > +	task->signal->oom_adj = oom_adjust;
> > +	task->mm->oom_adj_cached = OOM_CACHE_DEFAULT;
> >  	task_unlock(task);
> >  	put_task_struct(task);
> >  	if (end - buffer == 0)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 7acc843..f93f97f 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -240,7 +240,8 @@ struct mm_struct {
> >  
> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >  
> > -	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> > +	s8 oom_adj_cached;	/* mirror from signal_struct->oom_adj.
> > +				   in vfork case, multiple processes use the same mm. */
> >  
> >  	cpumask_t cpu_vm_mask;
> >  
> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index a7979ba..a219480 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -3,6 +3,7 @@
> >  
> >  /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
> >  #define OOM_DISABLE (-17)
> > +#define OOM_CACHE_DEFAULT (15)
> >  /* inclusive */
> >  #define OOM_ADJUST_MIN (-16)
> >  #define OOM_ADJUST_MAX 15
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 3ab08e4..e10b12b 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -629,6 +629,8 @@ struct signal_struct {
> >  	unsigned audit_tty;
> >  	struct tty_audit_buf *tty_audit_buf;
> >  #endif
> > +
> > +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> >  };
> >  
> >  /* Context switch must be unlocked if interrupts are to be enabled */
> 
> I don't believe oom_adj is an appropriate use of signal_struct, sorry.

Real world and life aren't so simple... and linux too.
At least, regression is definitely worst result. any other way
is better than it.


> 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 869dc22..c741a45 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -688,6 +689,7 @@ static void exit_mm(struct task_struct * tsk)
> >  	enter_lazy_tlb(mm, current);
> >  	/* We don't want this task to be frozen prematurely */
> >  	clear_freeze_flag(tsk);
> > +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
> >  	task_unlock(tsk);
> >  	mm_update_next_owner(mm);
> >  	mmput(mm);
> 
> This is similiar to an early proposal that wanted to keep an array of 
> oom_adj values for tasks attached to the mm in mm_struct.  The problem is 
> that you're obviously losing information about all threads attached to the 
> mm any time one of the threads exits or writes to /proc/pid/oom_adj.  That 
> information can only be regenerated with a tasklist scan.

I agree this. but I can't imazine this darkside. Can you please explain this?
maybe, we can improve this implementaion obviously. reset at each thread exitting
is caused by my lazy and small patch preference.
a bit line adding can change this.

> 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9b42695..b7cb474 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -426,6 +427,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> >  	init_rwsem(&mm->mmap_sem);
> >  	INIT_LIST_HEAD(&mm->mmlist);
> >  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> > +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
> >  	mm->core_state = NULL;
> >  	mm->nr_ptes = 0;
> >  	set_mm_counter(mm, file_rss, 0);
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 175a67a..eae2d78 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -58,7 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  	unsigned long points, cpu_time, run_time;
> >  	struct mm_struct *mm;
> >  	struct task_struct *child;
> > -	int oom_adj;
> > +	s8 oom_adj;
> >  
> >  	task_lock(p);
> >  	mm = p->mm;
> > @@ -66,7 +66,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  		task_unlock(p);
> >  		return 0;
> >  	}
> > -	oom_adj = mm->oom_adj;
> > +
> > +	if (mm->oom_adj_cached < p->signal->oom_adj)
> > +		mm->oom_adj_cached = p->signal->oom_adj;
> 
> This conditional will never be true since mm->oom_adj_cached is 
> initialized to 15, which is the upper bound on which p->signal->oom_adj 
> may ever be, so mm->oom_adj_cached never gets changed from 
> OOM_CACHE_DEFAULT.

Ah, good catch. this condition shold be reserve. thanks.


> Thus, this patch doesn't even work, and you probably would have noticed 
> that if you'd checked /proc/pid/oom_score for any pid.
> 
> Even if mm->oom_adj_cached _was_ properly updated here, 
> /proc/pid/oom_score would be out of sync with more negative oom_adj values 
> for threads sharing the mm_struct since it calls badness() for only a 
> single thread.

Hm, I don't think this is serious problem. but I can fix it easily
because badness() isn't fast-path.



> 
> > +	oom_adj = mm->oom_adj_cached;
> >  	if (oom_adj == OOM_DISABLE) {
> >  		task_unlock(p);
> >  		return 0;
> > @@ -350,7 +354,7 @@ static int oom_kill_task(struct task_struct *p)
> >  
> >  	task_lock(p);
> >  	mm = p->mm;
> > -	if (!mm || mm->oom_adj == OOM_DISABLE) {
> > +	if (!mm || p->signal->oom_adj == OOM_DISABLE) {
> >  		task_unlock(p);
> >  		return 1;
> >  	}




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03 12:16                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Fri, 31 Jul 2009, KOSAKI Motohiro wrote:
> 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 3ce5ae9..c64499e 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1008,7 +1008,7 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
> >  		return -ESRCH;
> >  	task_lock(task);
> >  	if (task->mm)
> > -		oom_adjust = task->mm->oom_adj;
> > +		oom_adjust = task->signal->oom_adj;
> >  	else
> >  		oom_adjust = OOM_DISABLE;
> >  	task_unlock(task);
> 
> This may display a /proc/pid/oom_adj that is radically different from 
> task->mm->oom_adj_cached without knowledge to userspace and you can't 
> simply display task->mm>oom_adj_cached here because it gets reset on every 
> write to /proc/pid/oom_adj.

Is this necessary?
different value was only happen when vfork process change oom_adj. but
vfork()ed process call exec() soon.
userland monitor program don't confuse.

Plus, if you can post any troublesome testcase, I can fix it.
this is not essential code in this patch.



> > @@ -1046,12 +1046,13 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> >  		put_task_struct(task);
> >  		return -EINVAL;
> >  	}
> > -	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> > +	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> >  		task_unlock(task);
> >  		put_task_struct(task);
> >  		return -EACCES;
> >  	}
> > -	task->mm->oom_adj = oom_adjust;
> > +	task->signal->oom_adj = oom_adjust;
> > +	task->mm->oom_adj_cached = OOM_CACHE_DEFAULT;
> >  	task_unlock(task);
> >  	put_task_struct(task);
> >  	if (end - buffer == 0)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 7acc843..f93f97f 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -240,7 +240,8 @@ struct mm_struct {
> >  
> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >  
> > -	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> > +	s8 oom_adj_cached;	/* mirror from signal_struct->oom_adj.
> > +				   in vfork case, multiple processes use the same mm. */
> >  
> >  	cpumask_t cpu_vm_mask;
> >  
> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index a7979ba..a219480 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -3,6 +3,7 @@
> >  
> >  /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
> >  #define OOM_DISABLE (-17)
> > +#define OOM_CACHE_DEFAULT (15)
> >  /* inclusive */
> >  #define OOM_ADJUST_MIN (-16)
> >  #define OOM_ADJUST_MAX 15
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 3ab08e4..e10b12b 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -629,6 +629,8 @@ struct signal_struct {
> >  	unsigned audit_tty;
> >  	struct tty_audit_buf *tty_audit_buf;
> >  #endif
> > +
> > +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> >  };
> >  
> >  /* Context switch must be unlocked if interrupts are to be enabled */
> 
> I don't believe oom_adj is an appropriate use of signal_struct, sorry.

Real world and life aren't so simple... and linux too.
At least, regression is definitely worst result. any other way
is better than it.


> 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 869dc22..c741a45 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -688,6 +689,7 @@ static void exit_mm(struct task_struct * tsk)
> >  	enter_lazy_tlb(mm, current);
> >  	/* We don't want this task to be frozen prematurely */
> >  	clear_freeze_flag(tsk);
> > +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
> >  	task_unlock(tsk);
> >  	mm_update_next_owner(mm);
> >  	mmput(mm);
> 
> This is similiar to an early proposal that wanted to keep an array of 
> oom_adj values for tasks attached to the mm in mm_struct.  The problem is 
> that you're obviously losing information about all threads attached to the 
> mm any time one of the threads exits or writes to /proc/pid/oom_adj.  That 
> information can only be regenerated with a tasklist scan.

I agree this. but I can't imazine this darkside. Can you please explain this?
maybe, we can improve this implementaion obviously. reset at each thread exitting
is caused by my lazy and small patch preference.
a bit line adding can change this.

> 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9b42695..b7cb474 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -426,6 +427,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> >  	init_rwsem(&mm->mmap_sem);
> >  	INIT_LIST_HEAD(&mm->mmlist);
> >  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> > +	mm->oom_adj_cached = OOM_CACHE_DEFAULT;
> >  	mm->core_state = NULL;
> >  	mm->nr_ptes = 0;
> >  	set_mm_counter(mm, file_rss, 0);
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 175a67a..eae2d78 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -58,7 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  	unsigned long points, cpu_time, run_time;
> >  	struct mm_struct *mm;
> >  	struct task_struct *child;
> > -	int oom_adj;
> > +	s8 oom_adj;
> >  
> >  	task_lock(p);
> >  	mm = p->mm;
> > @@ -66,7 +66,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  		task_unlock(p);
> >  		return 0;
> >  	}
> > -	oom_adj = mm->oom_adj;
> > +
> > +	if (mm->oom_adj_cached < p->signal->oom_adj)
> > +		mm->oom_adj_cached = p->signal->oom_adj;
> 
> This conditional will never be true since mm->oom_adj_cached is 
> initialized to 15, which is the upper bound on which p->signal->oom_adj 
> may ever be, so mm->oom_adj_cached never gets changed from 
> OOM_CACHE_DEFAULT.

Ah, good catch. this condition shold be reserve. thanks.


> Thus, this patch doesn't even work, and you probably would have noticed 
> that if you'd checked /proc/pid/oom_score for any pid.
> 
> Even if mm->oom_adj_cached _was_ properly updated here, 
> /proc/pid/oom_score would be out of sync with more negative oom_adj values 
> for threads sharing the mm_struct since it calls badness() for only a 
> single thread.

Hm, I don't think this is serious problem. but I can fix it easily
because badness() isn't fast-path.



> 
> > +	oom_adj = mm->oom_adj_cached;
> >  	if (oom_adj == OOM_DISABLE) {
> >  		task_unlock(p);
> >  		return 0;
> > @@ -350,7 +354,7 @@ static int oom_kill_task(struct task_struct *p)
> >  
> >  	task_lock(p);
> >  	mm = p->mm;
> > -	if (!mm || mm->oom_adj == OOM_DISABLE) {
> > +	if (!mm || p->signal->oom_adj == OOM_DISABLE) {
> >  		task_unlock(p);
> >  		return 1;
> >  	}



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03  8:55                                 ` KAMEZAWA Hiroyuki
@ 2009-08-03 12:19                                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Mon, 3 Aug 2009 17:45:19 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > "just inherit at fork, change at exec" is an usual manner, I think.
> > If oom_adj_exec rather than oom_adj_child, I won't complain, more.
> > 
> But this/(and yours) requires users to rewrite their apps.
> Then, breaks current API.
> please fight with other guardians.

Definitely, I never agree regressionful ABI change ;)
At least, I still think it can be fixable.





^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03 12:19                                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Mon, 3 Aug 2009 17:45:19 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > "just inherit at fork, change at exec" is an usual manner, I think.
> > If oom_adj_exec rather than oom_adj_child, I won't complain, more.
> > 
> But this/(and yours) requires users to rewrite their apps.
> Then, breaks current API.
> please fight with other guardians.

Definitely, I never agree regressionful ABI change ;)
At least, I still think it can be fixable.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-01 20:26                     ` David Rientjes
@ 2009-08-03 12:21                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Sat, 1 Aug 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Summarizing I think now .....
> >   - rename mm->oom_adj as mm->effective_oom_adj
> >   - re-add per-thread oom_adj
> >   - update mm->effective_oom_adj based on per-thread oom_adj
> >   - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
> >     or show 2 values in /proc/pid/oom_adj
> >   - rewrite documentation about oom_score.
> >    " it's calclulated from  _process's_ memory usage and oom_adj of
> >     all threads which shares a memor  context".
> >    This behavior is not changed from old implemtation, anyway.
> >  - If necessary, rewrite oom_kill itself to scan only thread group
> >    leader. It's a way to go regardless of  vfork problem.
> > 
> 
> Ok, so you've abandoned the signal_struct proposal and now want to add it 
> back to task_struct with an effective member in mm_struct by changing the 
> documentation.  Hmm.

Oops, please see From line. The page was made from me ;)



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03 12:21                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

> On Sat, 1 Aug 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Summarizing I think now .....
> >   - rename mm->oom_adj as mm->effective_oom_adj
> >   - re-add per-thread oom_adj
> >   - update mm->effective_oom_adj based on per-thread oom_adj
> >   - if necessary, plz add read-only /proc/pid/effective_oom_adj file.
> >     or show 2 values in /proc/pid/oom_adj
> >   - rewrite documentation about oom_score.
> >    " it's calclulated from  _process's_ memory usage and oom_adj of
> >     all threads which shares a memor  context".
> >    This behavior is not changed from old implemtation, anyway.
> >  - If necessary, rewrite oom_kill itself to scan only thread group
> >    leader. It's a way to go regardless of  vfork problem.
> > 
> 
> Ok, so you've abandoned the signal_struct proposal and now want to add it 
> back to task_struct with an effective member in mm_struct by changing the 
> documentation.  Hmm.

Oops, please see From line. The page was made from me ;)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-03  7:59                         ` David Rientjes
@ 2009-08-03 12:32                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

Hi

Sorry for queue jumping. I have one question.


> > >  - /proc/pid/oom_score is inconsistent when the thread that set the
> > >    effective per-mm oom_adj exits and it is now obsolete since you have
> > >    no way to determine what the next effective oom_adj value shall be.
> > > 
> > plz re-caluculate it. it's not a big job if done in lazy way.
> > 
> 
> You can't recalculate it if all the remaining threads have a different 
> oom_adj value than the effective oom_adj value from the thread that is now 
> exited.  There is no assumption that, for instance, the most negative 
> oom_adj value shall then be used.  Imagine the effective oom_adj value 
> being +15 and a thread sharing the same memory has an oom_adj value of 
> -16.  Under no reasonable circumstance should the oom preference of the 
> entire thread then change to -16 just because its the side-effect of a 
> thread exiting.

Why do we need recaluculate AT thread exiting time?
it is only used when oom_score is readed or actual OOM happend.
both those are slow-path.


> 
> That's the _entire_ reason why we need consistency in oom_adj values so 
> that userspace is aware of how the oom killer really works and chooses 
> tasks.  I understand that it differs from the previously allowed behavior, 
> but those userspace applications need to be fixed if, for no other reason, 
> they are now consistent with how the oom killer kills tasks.  I think 
> that's a very worthwhile goal and the cost of moving to a new interface 
> such as /proc/pid/oom_adj_child to have the same inheritance property that 
> was available in the past is justified.




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03 12:32                           ` KOSAKI Motohiro
  0 siblings, 0 replies; 64+ messages in thread
From: KOSAKI Motohiro @ 2009-08-03 12:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Paul Menage, linux-kernel, linux-mm

Hi

Sorry for queue jumping. I have one question.


> > >  - /proc/pid/oom_score is inconsistent when the thread that set the
> > >    effective per-mm oom_adj exits and it is now obsolete since you have
> > >    no way to determine what the next effective oom_adj value shall be.
> > > 
> > plz re-caluculate it. it's not a big job if done in lazy way.
> > 
> 
> You can't recalculate it if all the remaining threads have a different 
> oom_adj value than the effective oom_adj value from the thread that is now 
> exited.  There is no assumption that, for instance, the most negative 
> oom_adj value shall then be used.  Imagine the effective oom_adj value 
> being +15 and a thread sharing the same memory has an oom_adj value of 
> -16.  Under no reasonable circumstance should the oom preference of the 
> entire thread then change to -16 just because its the side-effect of a 
> thread exiting.

Why do we need recaluculate AT thread exiting time?
it is only used when oom_score is readed or actual OOM happend.
both those are slow-path.


> 
> That's the _entire_ reason why we need consistency in oom_adj values so 
> that userspace is aware of how the oom killer really works and chooses 
> tasks.  I understand that it differs from the previously allowed behavior, 
> but those userspace applications need to be fixed if, for no other reason, 
> they are now consistent with how the oom killer kills tasks.  I think 
> that's a very worthwhile goal and the cost of moving to a new interface 
> such as /proc/pid/oom_adj_child to have the same inheritance property that 
> was available in the past is justified.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
  2009-08-01 20:26                     ` David Rientjes
@ 2009-08-03 16:17                       ` Paul Menage
  -1 siblings, 0 replies; 64+ messages in thread
From: Paul Menage @ 2009-08-03 16:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Sat, Aug 1, 2009 at 1:26 PM, David Rientjes<rientjes@google.com> wrote:
>
> It's more likely than not that applications were probably written to the
> way the documentation described the two files: that is, adjust
> /proc/pid/oom_score by tuning /proc/pid/oom_adj

I'd actually be pretty surprised if anyone was really doing that -
don't forget that the oom_score is something that varies dynamically
depending on things like the VM size of the process, its running time,
the VM sizes of its children, etc. So tuning oom_adj based on
oom_score will be rapidly out of date. AFAIK, oom_score was added
initially as a way to debug the OOM killer, and oom_adj was added
later as an additional knob. My suspicion is that any automated users
of oom_adj are working along the lines of

http://www.google.com/codesearch/p?hl=en&sa=N&cd=4&ct=rc#X7-oBZ_RyNM/src/server/memory/base/oommanager.cpp&q=oom_score

which just uses the values -16, 0 or 15, depending on whether the
process is critical, important or expendable.

Paul

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch -mm v2] mm: introduce oom_adj_child
@ 2009-08-03 16:17                       ` Paul Menage
  0 siblings, 0 replies; 64+ messages in thread
From: Paul Menage @ 2009-08-03 16:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Sat, Aug 1, 2009 at 1:26 PM, David Rientjes<rientjes@google.com> wrote:
>
> It's more likely than not that applications were probably written to the
> way the documentation described the two files: that is, adjust
> /proc/pid/oom_score by tuning /proc/pid/oom_adj

I'd actually be pretty surprised if anyone was really doing that -
don't forget that the oom_score is something that varies dynamically
depending on things like the VM size of the process, its running time,
the VM sizes of its children, etc. So tuning oom_adj based on
oom_score will be rapidly out of date. AFAIK, oom_score was added
initially as a way to debug the OOM killer, and oom_adj was added
later as an additional knob. My suspicion is that any automated users
of oom_adj are working along the lines of

http://www.google.com/codesearch/p?hl=en&sa=N&cd=4&ct=rc#X7-oBZ_RyNM/src/server/memory/base/oommanager.cpp&q=oom_score

which just uses the values -16, 0 or 15, depending on whether the
process is critical, important or expendable.

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2009-08-03 16:17 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-29  4:27 [patch -mm v2] mm: introduce oom_adj_child David Rientjes
2009-07-29  4:27 ` David Rientjes
2009-07-29 23:13 ` Andrew Morton
2009-07-29 23:13   ` Andrew Morton
2009-07-29 23:25   ` Paul Menage
2009-07-29 23:25     ` Paul Menage
2009-07-30  2:32 ` KOSAKI Motohiro
2009-07-30  2:32   ` KOSAKI Motohiro
2009-07-30  7:06   ` David Rientjes
2009-07-30  7:06     ` David Rientjes
2009-07-31  6:47     ` KOSAKI Motohiro
2009-07-31  6:47       ` KOSAKI Motohiro
2009-07-31  9:31       ` David Rientjes
2009-07-31  9:31         ` David Rientjes
2009-08-03 11:58         ` KOSAKI Motohiro
2009-08-03 11:58           ` KOSAKI Motohiro
2009-08-03 12:12           ` KOSAKI Motohiro
2009-08-03 12:12             ` KOSAKI Motohiro
2009-07-30  9:00 ` KAMEZAWA Hiroyuki
2009-07-30  9:00   ` KAMEZAWA Hiroyuki
2009-07-30  9:31   ` David Rientjes
2009-07-30  9:31     ` David Rientjes
2009-07-30 10:02     ` KAMEZAWA Hiroyuki
2009-07-30 10:02       ` KAMEZAWA Hiroyuki
2009-07-30 19:05       ` David Rientjes
2009-07-30 19:05         ` David Rientjes
2009-07-31  0:33         ` KAMEZAWA Hiroyuki
2009-07-31  0:33           ` KAMEZAWA Hiroyuki
2009-07-31  6:50           ` KOSAKI Motohiro
2009-07-31  6:50             ` KOSAKI Motohiro
2009-07-31 19:38             ` David Rientjes
2009-07-31 19:38               ` David Rientjes
2009-08-03 12:16               ` KOSAKI Motohiro
2009-08-03 12:16                 ` KOSAKI Motohiro
2009-07-31  9:36           ` David Rientjes
2009-07-31  9:36             ` David Rientjes
2009-07-31 10:49             ` KAMEZAWA Hiroyuki
2009-07-31 10:49               ` KAMEZAWA Hiroyuki
2009-07-31 19:18               ` David Rientjes
2009-07-31 19:18                 ` David Rientjes
2009-08-01  1:10                 ` KAMEZAWA Hiroyuki
2009-08-01  1:10                   ` KAMEZAWA Hiroyuki
2009-08-01 20:26                   ` David Rientjes
2009-08-01 20:26                     ` David Rientjes
2009-08-03  1:42                     ` KAMEZAWA Hiroyuki
2009-08-03  1:42                       ` KAMEZAWA Hiroyuki
2009-08-03  7:59                       ` David Rientjes
2009-08-03  7:59                         ` David Rientjes
2009-08-03  8:02                         ` KAMEZAWA Hiroyuki
2009-08-03  8:02                           ` KAMEZAWA Hiroyuki
2009-08-03  8:08                           ` David Rientjes
2009-08-03  8:08                             ` David Rientjes
2009-08-03  8:45                             ` KAMEZAWA Hiroyuki
2009-08-03  8:45                               ` KAMEZAWA Hiroyuki
2009-08-03  8:55                               ` KAMEZAWA Hiroyuki
2009-08-03  8:55                                 ` KAMEZAWA Hiroyuki
2009-08-03 12:19                                 ` KOSAKI Motohiro
2009-08-03 12:19                                   ` KOSAKI Motohiro
2009-08-03 12:32                         ` KOSAKI Motohiro
2009-08-03 12:32                           ` KOSAKI Motohiro
2009-08-03 12:21                     ` KOSAKI Motohiro
2009-08-03 12:21                       ` KOSAKI Motohiro
2009-08-03 16:17                     ` Paul Menage
2009-08-03 16:17                       ` Paul Menage

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.