All of lore.kernel.org
 help / color / mirror / Atom feed
* + oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch added to -mm tree
@ 2009-06-02  5:57 akpm
  2009-07-15  0:52 ` Paul Menage
  0 siblings, 1 reply; 28+ messages in thread
From: akpm @ 2009-06-02  5:57 UTC (permalink / raw)
  To: mm-commits; +Cc: rientjes, mel, npiggin, riel


The patch titled
     oom: move oom_adj value from task_struct to mm_struct
has been added to the -mm tree.  Its filename is
     oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find
out what to do about this

The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/

------------------------------------------------------
Subject: oom: move oom_adj value from task_struct to mm_struct
From: David Rientjes <rientjes@google.com>

The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm.  If a task were to be killed while attached to an mm that could not be
freed because another thread were set to OOM_DISABLE, it would have
needlessly been terminated since there is no potential for future memory
freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct.  This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already.  They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/filesystems/proc.txt |   15 +++++++----
 fs/proc/base.c                     |   19 ++++++++++++---
 include/linux/mm_types.h           |    2 +
 include/linux/sched.h              |    1 
 mm/oom_kill.c                      |   34 +++++++++++++++++----------
 5 files changed, 50 insertions(+), 21 deletions(-)

diff -puN Documentation/filesystems/proc.txt~oom-move-oom_adj-value-from-task_struct-to-mm_struct Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/Documentation/filesystems/proc.txt
@@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
 ------------------------------------------------------
 
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
+This file can be used to adjust the score used to select which processes should
+be killed in an out-of-memory situation.  The oom_adj value is a characteristic
+of the task's mm, so all threads that share an mm with pid will have the same
+oom_adj value.  A high value will increase the likelihood of this process being
+killed by the oom-killer.  Valid values are in the range -16 to +15 as
+explained below and a special value of -17, which disables oom-killing
+altogether for threads sharing pid's mm.
 
 The process to be killed in an out-of-memory situation is selected among all others
 based on its badness score. This value equals the original memory size of the process
@@ -1021,6 +1023,9 @@ the parent's score if they do not share 
 are the prime candidates to be killed. Having only one 'hungry' child will make
 parent less preferable than the child.
 
+/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
+oom-killing already.
+
 /proc/<pid>/oom_score shows process' current badness score.
 
 The following heuristics are then applied:
diff -puN fs/proc/base.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct fs/proc/base.c
--- a/fs/proc/base.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/fs/proc/base.c
@@ -1006,7 +1006,12 @@ static ssize_t oom_adjust_read(struct fi
 
 	if (!task)
 		return -ESRCH;
-	oom_adjust = task->oomkilladj;
+	task_lock(task);
+	if (task->mm)
+		oom_adjust = task->mm->oom_adj;
+	else
+		oom_adjust = OOM_DISABLE;
+	task_unlock(task);
 	put_task_struct(task);
 
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1035,11 +1040,19 @@ static ssize_t oom_adjust_write(struct f
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
-	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
+	task_lock(task);
+	if (!task->mm) {
+		task_unlock(task);
+		put_task_struct(task);
+		return -EINVAL;
+	}
+	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->oomkilladj = oom_adjust;
+	task->mm->oom_adj = oom_adjust;
+	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
diff -puN include/linux/mm_types.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct include/linux/mm_types.h
--- a/include/linux/mm_types.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/include/linux/mm_types.h
@@ -240,6 +240,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+
 	cpumask_t cpu_vm_mask;
 
 	/* Architecture-specific MM context */
diff -puN include/linux/sched.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct include/linux/sched.h
--- a/include/linux/sched.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/include/linux/sched.h
@@ -1151,7 +1151,6 @@ struct task_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
-	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
diff -puN mm/oom_kill.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct mm/oom_kill.c
--- a/mm/oom_kill.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/mm/oom_kill.c
@@ -58,6 +58,7 @@ unsigned long badness(struct task_struct
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
+	int oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -65,6 +66,7 @@ unsigned long badness(struct task_struct
 		task_unlock(p);
 		return 0;
 	}
+	oom_adj = mm->oom_adj;
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -148,15 +150,15 @@ unsigned long badness(struct task_struct
 		points /= 8;
 
 	/*
-	 * Adjust the score by oomkilladj.
+	 * Adjust the score by oom_adj.
 	 */
-	if (p->oomkilladj) {
-		if (p->oomkilladj > 0) {
+	if (oom_adj) {
+		if (oom_adj > 0) {
 			if (!points)
 				points = 1;
-			points <<= p->oomkilladj;
+			points <<= oom_adj;
 		} else
-			points >>= -(p->oomkilladj);
+			points >>= -(oom_adj);
 	}
 
 #ifdef DEBUG
@@ -251,8 +253,12 @@ static struct task_struct *select_bad_pr
 			*ppoints = ULONG_MAX;
 		}
 
-		if (p->oomkilladj == OOM_DISABLE)
+		task_lock(p);
+		if (p->mm && p->mm->oom_adj == OOM_DISABLE) {
+			task_unlock(p);
 			continue;
+		}
+		task_unlock(p);
 
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -304,8 +310,7 @@ static void dump_tasks(const struct mem_
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
-		       p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -367,8 +372,12 @@ static int oom_kill_task(struct task_str
 	 * Don't kill the process if any threads are set to OOM_DISABLE
 	 */
 	do_each_thread(g, q) {
-		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+		task_lock(q);
+		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
+			task_unlock(q);
 			return 1;
+		}
+		task_unlock(q);
 	} while_each_thread(g, q);
 
 	__oom_kill_task(p, 1);
@@ -393,10 +402,11 @@ static int oom_kill_process(struct task_
 	struct task_struct *c;
 
 	if (printk_ratelimit()) {
-		printk(KERN_WARNING "%s invoked oom-killer: "
-			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
-			current->comm, gfp_mask, order, current->oomkilladj);
 		task_lock(current);
+		printk(KERN_WARNING "%s invoked oom-killer: "
+			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
+			current->comm, gfp_mask, order,
+			current->mm ? current->mm->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();
_

Patches currently in -mm which might be from rientjes@google.com are

origin.patch
linux-next.patch
cpusets-restructure-the-function-cpuset_update_task_memory_state.patch
cpusets-update-tasks-page-slab-spread-flags-in-time.patch
cpusetmm-update-tasks-mems_allowed-in-time.patch
cpusetmm-update-tasks-mems_allowed-in-time-fix.patch
cpusetmm-update-tasks-mems_allowed-in-time-cleanup.patch
page-allocator-use-a-pre-calculated-value-instead-of-num_online_nodes-in-fast-paths-do-not-override-definition-of-node_set_online-with-macro.patch
mm-setup_per_zone_inactive_ratio-do-not-call-for-int_sqrt-if-not-needed.patch
mm-setup_per_zone_inactive_ratio-fix-comment-and-make-it-__init.patch
page-allocator-warn-if-__gfp_nofail-is-used-for-a-large-allocation.patch
mm-pm-freezer-disable-oom-killer-when-tasks-are-frozen.patch
oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch
oom-avoid-unnecessary-mm-locking-and-scanning-for-oom_disable.patch
memcg-add-file-based-rss-accounting.patch
memcg-add-file-based-rss-accounting-fix-mem_cgroup_update_mapped_file_stat-oops.patch


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: + oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch  added to -mm tree
  2009-06-02  5:57 + oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch added to -mm tree akpm
@ 2009-07-15  0:52 ` Paul Menage
  2009-07-15  1:05   ` [PATCH] copy over oom_adj value at fork time Rik van Riel
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Menage @ 2009-07-15  0:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: mm-commits, rientjes, mel, npiggin, riel

On Mon, Jun 1, 2009 at 10:57 PM, <akpm@linux-foundation.org> wrote:
>
> The patch titled
>     oom: move oom_adj value from task_struct to mm_struct
> has been added to the -mm tree.  Its filename is
>     oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch

This patch represents an API/semantics change that's invisible from
userspace, unless userspace specifically checks for the old and new
behaviours to see which applies.

Is it intentional that it's now impossible to create a tree of
processes that will inherit a non-zero oom_adj? It appears to get
reset whenever you exec a child process.

Old behaviour:

[root@localhost ~]# echo -1 > /proc/$$/oom_adj
[root@localhost ~]# cat /proc/$$/oom_adj
-1
[root@localhost ~]# bash
[root@localhost ~]# cat /proc/$$/oom_adj
-1

New behaviour:

[root@localhost ~]# echo -1 > /proc/$$/oom_adj
[root@localhost ~]# cat /proc/$$/oom_adj
-1
[root@localhost ~]# bash
[root@localhost ~]# cat /proc/$$/oom_adj
0

>
> Before you just go and hit "reply", please:
>   a) Consider who else should be cc'ed
>   b) Prefer to cc a suitable mailing list as well
>   c) Ideally: find the original patch on the mailing list and do a
>      reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>
> See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find
> out what to do about this
>
> The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/
>
> ------------------------------------------------------
> Subject: oom: move oom_adj value from task_struct to mm_struct
> From: David Rientjes <rientjes@google.com>
>
> The per-task oom_adj value is a characteristic of its mm more than the
> task itself since it's not possible to oom kill any thread that shares the
> mm.  If a task were to be killed while attached to an mm that could not be
> freed because another thread were set to OOM_DISABLE, it would have
> needlessly been terminated since there is no potential for future memory
> freeing.
>
> This patch moves oomkilladj (now more appropriately named oom_adj) from
> struct task_struct to struct mm_struct.  This requires task_lock() on a
> task to check its oom_adj value to protect against exec, but it's already
> necessary to take the lock when dereferencing the mm to find the total VM
> size for the badness heuristic.
>
> This fixes a livelock if the oom killer chooses a task and another thread
> sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
> because oom_kill_task() repeatedly returns 1 and refuses to kill the
> chosen task while select_bad_process() will repeatedly choose the same
> task during the next retry.
>
> Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
> oom_kill_task() to check for threads sharing the same memory will be
> removed in the next patch in this series where it will no longer be
> necessary.
>
> Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
> these threads are immune from oom killing already.  They simply report an
> oom_adj value of OOM_DISABLE.
>
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  Documentation/filesystems/proc.txt |   15 +++++++----
>  fs/proc/base.c                     |   19 ++++++++++++---
>  include/linux/mm_types.h           |    2 +
>  include/linux/sched.h              |    1
>  mm/oom_kill.c                      |   34 +++++++++++++++++----------
>  5 files changed, 50 insertions(+), 21 deletions(-)
>
> diff -puN Documentation/filesystems/proc.txt~oom-move-oom_adj-value-from-task_struct-to-mm_struct Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt~oom-move-oom_adj-value-from-task_struct-to-mm_struct
> +++ a/Documentation/filesystems/proc.txt
> @@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
>  ------------------------------------------------------
>
> -This file can be used to adjust the score used to select which processes
> -should be killed in an  out-of-memory  situation.  Giving it a high score will
> -increase the likelihood of this process being killed by the oom-killer.  Valid
> -values are in the range -16 to +15, plus the special value -17, which disables
> -oom-killing altogether for this process.
> +This file can be used to adjust the score used to select which processes should
> +be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> +of the task's mm, so all threads that share an mm with pid will have the same
> +oom_adj value.  A high value will increase the likelihood of this process being
> +killed by the oom-killer.  Valid values are in the range -16 to +15 as
> +explained below and a special value of -17, which disables oom-killing
> +altogether for threads sharing pid's mm.
>
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1021,6 +1023,9 @@ the parent's score if they do not share
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>
> +/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> +oom-killing already.
> +
>  /proc/<pid>/oom_score shows process' current badness score.
>
>  The following heuristics are then applied:
> diff -puN fs/proc/base.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct fs/proc/base.c
> --- a/fs/proc/base.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct
> +++ a/fs/proc/base.c
> @@ -1006,7 +1006,12 @@ static ssize_t oom_adjust_read(struct fi
>
>        if (!task)
>                return -ESRCH;
> -       oom_adjust = task->oomkilladj;
> +       task_lock(task);
> +       if (task->mm)
> +               oom_adjust = task->mm->oom_adj;
> +       else
> +               oom_adjust = OOM_DISABLE;
> +       task_unlock(task);
>        put_task_struct(task);
>
>        len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1035,11 +1040,19 @@ static ssize_t oom_adjust_write(struct f
>        task = get_proc_task(file->f_path.dentry->d_inode);
>        if (!task)
>                return -ESRCH;
> -       if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
> +       task_lock(task);
> +       if (!task->mm) {
> +               task_unlock(task);
> +               put_task_struct(task);
> +               return -EINVAL;
> +       }
> +       if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +               task_unlock(task);
>                put_task_struct(task);
>                return -EACCES;
>        }
> -       task->oomkilladj = oom_adjust;
> +       task->mm->oom_adj = oom_adjust;
> +       task_unlock(task);
>        put_task_struct(task);
>        if (end - buffer == 0)
>                return -EIO;
> diff -puN include/linux/mm_types.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct include/linux/mm_types.h
> --- a/include/linux/mm_types.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct
> +++ a/include/linux/mm_types.h
> @@ -240,6 +240,8 @@ struct mm_struct {
>
>        unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>
> +       s8 oom_adj;     /* OOM kill score adjustment (bit shift) */
> +
>        cpumask_t cpu_vm_mask;
>
>        /* Architecture-specific MM context */
> diff -puN include/linux/sched.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct include/linux/sched.h
> --- a/include/linux/sched.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct
> +++ a/include/linux/sched.h
> @@ -1151,7 +1151,6 @@ struct task_struct {
>         * a short time
>         */
>        unsigned char fpu_counter;
> -       s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>        unsigned int btrace_seq;
>  #endif
> diff -puN mm/oom_kill.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct mm/oom_kill.c
> --- a/mm/oom_kill.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct
> +++ a/mm/oom_kill.c
> @@ -58,6 +58,7 @@ unsigned long badness(struct task_struct
>        unsigned long points, cpu_time, run_time;
>        struct mm_struct *mm;
>        struct task_struct *child;
> +       int oom_adj;
>
>        task_lock(p);
>        mm = p->mm;
> @@ -65,6 +66,7 @@ unsigned long badness(struct task_struct
>                task_unlock(p);
>                return 0;
>        }
> +       oom_adj = mm->oom_adj;
>
>        /*
>         * The memory size of the process is the basis for the badness.
> @@ -148,15 +150,15 @@ unsigned long badness(struct task_struct
>                points /= 8;
>
>        /*
> -        * Adjust the score by oomkilladj.
> +        * Adjust the score by oom_adj.
>         */
> -       if (p->oomkilladj) {
> -               if (p->oomkilladj > 0) {
> +       if (oom_adj) {
> +               if (oom_adj > 0) {
>                        if (!points)
>                                points = 1;
> -                       points <<= p->oomkilladj;
> +                       points <<= oom_adj;
>                } else
> -                       points >>= -(p->oomkilladj);
> +                       points >>= -(oom_adj);
>        }
>
>  #ifdef DEBUG
> @@ -251,8 +253,12 @@ static struct task_struct *select_bad_pr
>                        *ppoints = ULONG_MAX;
>                }
>
> -               if (p->oomkilladj == OOM_DISABLE)
> +               task_lock(p);
> +               if (p->mm && p->mm->oom_adj == OOM_DISABLE) {
> +                       task_unlock(p);
>                        continue;
> +               }
> +               task_unlock(p);
>
>                points = badness(p, uptime.tv_sec);
>                if (points > *ppoints || !chosen) {
> @@ -304,8 +310,7 @@ static void dump_tasks(const struct mem_
>                }
>                printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>                       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -                      get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> -                      p->comm);
> +                      get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
>                task_unlock(p);
>        } while_each_thread(g, p);
>  }
> @@ -367,8 +372,12 @@ static int oom_kill_task(struct task_str
>         * Don't kill the process if any threads are set to OOM_DISABLE
>         */
>        do_each_thread(g, q) {
> -               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +               task_lock(q);
> +               if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
> +                       task_unlock(q);
>                        return 1;
> +               }
> +               task_unlock(q);
>        } while_each_thread(g, q);
>
>        __oom_kill_task(p, 1);
> @@ -393,10 +402,11 @@ static int oom_kill_process(struct task_
>        struct task_struct *c;
>
>        if (printk_ratelimit()) {
> -               printk(KERN_WARNING "%s invoked oom-killer: "
> -                       "gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> -                       current->comm, gfp_mask, order, current->oomkilladj);
>                task_lock(current);
> +               printk(KERN_WARNING "%s invoked oom-killer: "
> +                       "gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> +                       current->comm, gfp_mask, order,
> +                       current->mm ? current->mm->oom_adj : OOM_DISABLE);
>                cpuset_print_task_mems_allowed(current);
>                task_unlock(current);
>                dump_stack();
> _
>
> Patches currently in -mm which might be from rientjes@google.com are
>
> origin.patch
> linux-next.patch
> cpusets-restructure-the-function-cpuset_update_task_memory_state.patch
> cpusets-update-tasks-page-slab-spread-flags-in-time.patch
> cpusetmm-update-tasks-mems_allowed-in-time.patch
> cpusetmm-update-tasks-mems_allowed-in-time-fix.patch
> cpusetmm-update-tasks-mems_allowed-in-time-cleanup.patch
> page-allocator-use-a-pre-calculated-value-instead-of-num_online_nodes-in-fast-paths-do-not-override-definition-of-node_set_online-with-macro.patch
> mm-setup_per_zone_inactive_ratio-do-not-call-for-int_sqrt-if-not-needed.patch
> mm-setup_per_zone_inactive_ratio-fix-comment-and-make-it-__init.patch
> page-allocator-warn-if-__gfp_nofail-is-used-for-a-large-allocation.patch
> mm-pm-freezer-disable-oom-killer-when-tasks-are-frozen.patch
> oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch
> oom-avoid-unnecessary-mm-locking-and-scanning-for-oom_disable.patch
> memcg-add-file-based-rss-accounting.patch
> memcg-add-file-based-rss-accounting-fix-mem_cgroup_update_mapped_file_stat-oops.patch
>
> --
> To unsubscribe from this list: send the line "unsubscribe mm-commits" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] copy over oom_adj value at fork time
  2009-07-15  0:52 ` Paul Menage
@ 2009-07-15  1:05   ` Rik van Riel
  2009-07-15  1:15     ` Paul Menage
  2009-07-15  1:49     ` KOSAKI Motohiro
  0 siblings, 2 replies; 28+ messages in thread
From: Rik van Riel @ 2009-07-15  1:05 UTC (permalink / raw)
  To: Paul Menage; +Cc: linux-kernel, akpm, rientjes, mel, npiggin

After moving the oom_adj value from the task struct to the
mm_struct, the oom_adj value was no longer properly inherited
by child processes.

Copying over the oom_adj value at fork time fixes that bug.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Paul Menage <manage@google.com>
---
Paul, does this fix your bug?

 kernel/fork.c |    1 +
 1 file changed, 1 insertion(+)

Index: mmotm/kernel/fork.c
===================================================================
--- mmotm.orig/kernel/fork.c	2009-07-14 20:58:01.000000000 -0400
+++ mmotm/kernel/fork.c	2009-07-14 21:00:40.000000000 -0400
@@ -435,6 +435,7 @@ static struct mm_struct * mm_init(struct
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
+	mm->oom_adj = current->mm->oom_adj;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-15  1:05   ` [PATCH] copy over oom_adj value at fork time Rik van Riel
@ 2009-07-15  1:15     ` Paul Menage
  2009-07-15  2:01       ` KOSAKI Motohiro
                         ` (2 more replies)
  2009-07-15  1:49     ` KOSAKI Motohiro
  1 sibling, 3 replies; 28+ messages in thread
From: Paul Menage @ 2009-07-15  1:15 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, akpm, rientjes, mel, npiggin

On Tue, Jul 14, 2009 at 6:05 PM, Rik van Riel<riel@redhat.com> wrote:
> After moving the oom_adj value from the task struct to the
> mm_struct, the oom_adj value was no longer properly inherited
> by child processes.
>
> Copying over the oom_adj value at fork time fixes that bug.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Paul Menage <manage@google.com>
> ---
> Paul, does this fix your bug?

It fixes that specific problem. (Although the patch title should
probably be "copy over oom_adj value at execve time")

But ironically, with this fix applied the main part of the original
change (force all threads in a process to share a single oom_adj
value) will start to break my code - it's no longer possible to have
the regular threads in a process be oom-immune, then vfork() and set a
non-disabled oom_adj in the child, since this will set it for the
entire process. (Our job scheduler does something like this, in order
to have the launcher be OOM immune and the running jobs be at various
oom_adj levels depending on their priority).

It's probably something that I can work around, but it seems like the
kind of API change that could break unsuspecting users.

Paul

>
>  kernel/fork.c |    1 +
>  1 file changed, 1 insertion(+)
>
> Index: mmotm/kernel/fork.c
> ===================================================================
> --- mmotm.orig/kernel/fork.c    2009-07-14 20:58:01.000000000 -0400
> +++ mmotm/kernel/fork.c 2009-07-14 21:00:40.000000000 -0400
> @@ -435,6 +435,7 @@ static struct mm_struct * mm_init(struct
>        init_rwsem(&mm->mmap_sem);
>        INIT_LIST_HEAD(&mm->mmlist);
>        mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> +       mm->oom_adj = current->mm->oom_adj;
>        mm->core_state = NULL;
>        mm->nr_ptes = 0;
>        set_mm_counter(mm, file_rss, 0);
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-15  1:05   ` [PATCH] copy over oom_adj value at fork time Rik van Riel
  2009-07-15  1:15     ` Paul Menage
@ 2009-07-15  1:49     ` KOSAKI Motohiro
  1 sibling, 0 replies; 28+ messages in thread
From: KOSAKI Motohiro @ 2009-07-15  1:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Paul Menage, linux-kernel, akpm, rientjes, mel, npiggin

> After moving the oom_adj value from the task struct to the
> mm_struct, the oom_adj value was no longer properly inherited
> by child processes.
> 
> Copying over the oom_adj value at fork time fixes that bug.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Paul Menage <manage@google.com>
> ---
> Paul, does this fix your bug?
> 
>  kernel/fork.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> Index: mmotm/kernel/fork.c
> ===================================================================
> --- mmotm.orig/kernel/fork.c	2009-07-14 20:58:01.000000000 -0400
> +++ mmotm/kernel/fork.c	2009-07-14 21:00:40.000000000 -0400
> @@ -435,6 +435,7 @@ static struct mm_struct * mm_init(struct
>  	init_rwsem(&mm->mmap_sem);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> +	mm->oom_adj = current->mm->oom_adj;

Maybe, We need to check "(current->mm) != NULL" as mm->flags.



>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
>  	set_mm_counter(mm, file_rss, 0);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-15  1:15     ` Paul Menage
@ 2009-07-15  2:01       ` KOSAKI Motohiro
  2009-07-15  2:12         ` [PATCH -mm] copy over oom_adj value at fork time (v2) Rik van Riel
  2009-07-15  3:48       ` [PATCH] copy over oom_adj value at fork time David Rientjes
  2009-07-16 23:29       ` Paul Menage
  2 siblings, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2009-07-15  2:01 UTC (permalink / raw)
  To: Paul Menage
  Cc: kosaki.motohiro, Rik van Riel, linux-kernel, akpm, rientjes, mel,
	npiggin

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="UTF-8", Size: 2401 bytes --]

> On Tue, Jul 14, 2009 at 6:05 PM, Rik van Riel<riel@redhat.com> wrote:
> > After moving the oom_adj value from the task struct to the
> > mm_struct, the oom_adj value was no longer properly inherited
> > by child processes.
> >
> > Copying over the oom_adj value at fork time fixes that bug.
> >
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > Reported-by: Paul Menage <manage@google.com>
> > ---
> > Paul, does this fix your bug?
> 
> It fixes that specific problem. (Although the patch title should
> probably be "copy over oom_adj value at execve time")

AFAIK mm_init() is called from both fork() and execve().


> But ironically, with this fix applied the main part of the original
> change (force all threads in a process to share a single oom_adj
> value) will start to break my code - it's no longer possible to have
> the regular threads in a process be oom-immune, then vfork() and set a
> non-disabled oom_adj in the child, since this will set it for the
> entire process. (Our job scheduler does something like this, in order
> to have the launcher be OOM immune and the running jobs be at various
> oom_adj levels depending on their priority).

Fair point.
Nobody want per-thread oom_adj, but your vfork() issue seems real regression.

David, can we make more clever implementation?

> 
> It's probably something that I can work around, but it seems like the
> kind of API change that could break unsuspecting users.
> 
> Paul
> 
> >
> >  kernel/fork.c |    1 +
> >  1 file changed, 1 insertion(+)
> >
> > Index: mmotm/kernel/fork.c
> > ===================================================================
> > --- mmotm.orig/kernel/fork.c    2009-07-14 20:58:01.000000000 -0400
> > +++ mmotm/kernel/fork.c 2009-07-14 21:00:40.000000000 -0400
> > @@ -435,6 +435,7 @@ static struct mm_struct * mm_init(struct
> >        init_rwsem(&mm->mmap_sem);
> >        INIT_LIST_HEAD(&mm->mmlist);
> >        mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> > +       mm->oom_adj = current->mm->oom_adj;
> >        mm->core_state = NULL;
> >        mm->nr_ptes = 0;
> >        set_mm_counter(mm, file_rss, 0);
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH -mm] copy over oom_adj value at fork time (v2)
  2009-07-15  2:01       ` KOSAKI Motohiro
@ 2009-07-15  2:12         ` Rik van Riel
  2009-07-15  3:42           ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2009-07-15  2:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Paul Menage, kosaki.motohiro, linux-kernel, akpm, rientjes, mel, npiggin

After moving the oom_adj value from the task struct to the
mm_struct, the oom_adj value was no longer properly inherited
by child processes.

Copying over the oom_adj value at fork time fixes that bug.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Paul Menage <manage@google.com>
---
v2: test for current->mm before dereferencing it (Kosaki Motohiro)

 kernel/fork.c |    1 +
 1 file changed, 1 insertion(+)

Index: mmotm/kernel/fork.c
===================================================================
--- mmotm.orig/kernel/fork.c	2009-07-14 20:58:01.000000000 -0400
+++ mmotm/kernel/fork.c	2009-07-14 22:09:23.000000000 -0400
@@ -435,6 +435,7 @@ static struct mm_struct * mm_init(struct
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
+	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -mm] copy over oom_adj value at fork time (v2)
  2009-07-15  2:12         ` [PATCH -mm] copy over oom_adj value at fork time (v2) Rik van Riel
@ 2009-07-15  3:42           ` David Rientjes
  0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-07-15  3:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Paul Menage, linux-kernel, akpm, mel, npiggin

On Tue, 14 Jul 2009, Rik van Riel wrote:

> After moving the oom_adj value from the task struct to the
> mm_struct, the oom_adj value was no longer properly inherited
> by child processes.
> 
> Copying over the oom_adj value at fork time fixes that bug.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Paul Menage <manage@google.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-15  1:15     ` Paul Menage
  2009-07-15  2:01       ` KOSAKI Motohiro
@ 2009-07-15  3:48       ` David Rientjes
  2009-07-16 23:29       ` Paul Menage
  2 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-07-15  3:48 UTC (permalink / raw)
  To: Paul Menage; +Cc: Rik van Riel, linux-kernel, akpm, mel, npiggin

On Tue, 14 Jul 2009, Paul Menage wrote:

> But ironically, with this fix applied the main part of the original
> change (force all threads in a process to share a single oom_adj
> value) will start to break my code - it's no longer possible to have
> the regular threads in a process be oom-immune, then vfork() and set a
> non-disabled oom_adj in the child, since this will set it for the
> entire process.

My patch to make oom_adj be a characteristic of the mm_struct and not the 
task_struct makes it consistent, the scenario you describe above would 
still have an OOM_DISABLE'd child even though /proc/pid-of-child/oom_adj 
is not.  That's the bug my patch addressed.

Without my change, the oom killer will endlessly loop if the child and not 
the parent is chosen to be killed because parent->mm->oom_adj == 
OOM_DISABLE.  The inheritance property is secondary to the semantics of 
oom_adj and should be fixed with Rik's patch.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-15  1:15     ` Paul Menage
  2009-07-15  2:01       ` KOSAKI Motohiro
  2009-07-15  3:48       ` [PATCH] copy over oom_adj value at fork time David Rientjes
@ 2009-07-16 23:29       ` Paul Menage
  2009-07-17  9:34         ` David Rientjes
  2 siblings, 1 reply; 28+ messages in thread
From: Paul Menage @ 2009-07-16 23:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, akpm, rientjes, mel, npiggin

On Tue, Jul 14, 2009 at 6:15 PM, Paul Menage<menage@google.com> wrote:
>
> But ironically, with this fix applied the main part of the original
> change (force all threads in a process to share a single oom_adj
> value) will start to break my code - it's no longer possible to have
> the regular threads in a process be oom-immune, then vfork() and set a
> non-disabled oom_adj in the child, since this will set it for the
> entire process. (Our job scheduler does something like this, in order
> to have the launcher be OOM immune and the running jobs be at various
> oom_adj levels depending on their priority).
>

How about if instead of having the oom_adj be per-mm, we kept an array
of counters in the mm, tracking how many users were at each oom_adj
level; the OOM killer could then use the level of the mm's highest
oom_adj user when deciding how to calculate the badness of a thread
using that mm.

That would preserve the previous semantics of letting a spawned child
inherit a per-thread oom_adj value, while avoiding the specific
problem of the OOM killer getting livelocked (that David's patch
originally addressed) and the more general case of the inconsistency
in determining the oom_adj level of an mm depending on which thread
you look at.

For the very common case where all users of the mm are at the same
oom_adj level, this array could be left unallocated as NULL, and the
thread's own oom_adj used.

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-16 23:29       ` Paul Menage
@ 2009-07-17  9:34         ` David Rientjes
  2009-07-17 20:03           ` Paul Menage
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-07-17  9:34 UTC (permalink / raw)
  To: Paul Menage; +Cc: Rik van Riel, linux-kernel, akpm, mel, npiggin

On Thu, 16 Jul 2009, Paul Menage wrote:

> How about if instead of having the oom_adj be per-mm, we kept an array
> of counters in the mm, tracking how many users were at each oom_adj
> level; the OOM killer could then use the level of the mm's highest
> oom_adj user when deciding how to calculate the badness of a thread
> using that mm.
> 

That would lead to the same inconsistencies that we had before: consider 
two tasks sharing the same mm_struct, taskA and taskB.  It was previously 
possible for taskA to have an oom_adj value of -15 and taskB to have an 
oom_adj value of +15.  This would cause /proc/pid/oom_score to be very 
small for taskA and oom_score would be very large for taskB.  With your 
proposal, taskB's badness score would implicitly be very small, yet it is 
reported to userspace as very high.

The only way to workaround that is by using the highest oom_adj user for 
the mm_struct from the array in reporting /proc/pid/oom_score, as well.  
But that would lead to /proc/pid/oom_adj not affecting oom_score at all, 
which isn't consistent.

I think you'll find that having oom_adj values purely be an attribute of 
the memory it represents is the cleanest solution since it most accurately 
describes how the oom killer interprets it when deciding on which task to 
kill.

> That would preserve the previous semantics of letting a spawned child
> inherit a per-thread oom_adj value, while avoiding the specific
> problem of the OOM killer getting livelocked (that David's patch
> originally addressed) and the more general case of the inconsistency
> in determining the oom_adj level of an mm depending on which thread
> you look at.
> 

Right, it's still a little strange that changing /proc/pid/oom_adj for one 
thread will change it for another if they share memory, even if they are 
in different thread groups, but that shouldn't happen if the admin 
understands that the oom killer must kill _all_ threads sharing memory 
with the target to lead to future memory freeing.

The inheritance issue should be fixed with Rik's patch with the exception 
of vfork -> change /proc/pid-of-child/oom_adj -> execve.  If scripts were 
written to do that with the old behavior, they'll have to adjust to change 
oom_adj _after_ the execve to avoid changing the oom_adj value of the 
vfork parent.  If there is no execve, or we're just doing CLONE_VM, then 
the child shares memory with the parent and, thus, their oom_adj values 
will be the same.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-17  9:34         ` David Rientjes
@ 2009-07-17 20:03           ` Paul Menage
  2009-07-17 23:12             ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Menage @ 2009-07-17 20:03 UTC (permalink / raw)
  To: David Rientjes; +Cc: Rik van Riel, linux-kernel, akpm, mel, npiggin

On Fri, Jul 17, 2009 at 2:34 AM, David Rientjes<rientjes@google.com> wrote:
>
> The only way to workaround that is by using the highest oom_adj user for
> the mm_struct from the array in reporting /proc/pid/oom_score, as well.

That sounds fine to me.

> But that would lead to /proc/pid/oom_adj not affecting oom_score at all,
> which isn't consistent.

Isn't consistent with what? It's perfectly consistent with saying "the
oom_score of a task is based on the highest oom_adj value of any task
sharing the same mm". Admittedly it's not 100% consistent with the old
semantics, but I'm having trouble imagining a scenario where someone
was relying on the changed semantics.

But taking a completely different approach, is there a reason that we
couldn't have just moved the do_each_thread() check for OOM_DISABLED
out of oom_kill_task() and into select_bad_process() at the point
where we've decided that the thread in question is a better victim
than the current victim? That would fix the OOM livelock while
allowing us to keep exactly the same oom_adj/oom_score semantics as in
previous kernels.

> The inheritance issue should be fixed with Rik's patch with the exception
> of vfork -> change /proc/pid-of-child/oom_adj -> execve.  If scripts were
> written to do that with the old behavior, they'll have to adjust to change
> oom_adj _after_ the execve

Think about what you're suggesting here - execve() replaces your code
with whatever you're execing, so unless that code is also written to
handle oom_adj (which for something like a generic job scheduler, the
exec'd code is unlikely to do) you're stuck.

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-17 20:03           ` Paul Menage
@ 2009-07-17 23:12             ` David Rientjes
  2009-07-20 21:52               ` Paul Menage
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-07-17 23:12 UTC (permalink / raw)
  To: Paul Menage; +Cc: Rik van Riel, linux-kernel, akpm, mel, npiggin

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3452 bytes --]

On Fri, 17 Jul 2009, Paul Menage wrote:

> > The only way to workaround that is by using the highest oom_adj user for
> > the mm_struct from the array in reporting /proc/pid/oom_score, as well.
> 
> That sounds fine to me.
> 

It's better than what we did before my patchset, which would be report an 
erroneous oom_score if another thread attached to the same mm_struct had a 
less immune (more positive) oom_adj score.

With my patches in 2.6.31-rc3, there is no ambiguity and /proc/pid/oom_adj 
and /proc/pid/oom_score are consistent for all threads.  Because of that, 
they now accurately reflect in all circumstances how the oom killer 
chooses a victim, which should be very important to users who are tuning 
oom_adj values in the first place.

> > But that would lead to /proc/pid/oom_adj not affecting oom_score at all,
> > which isn't consistent.
> 
> Isn't consistent with what? It's perfectly consistent with saying "the
> oom_score of a task is based on the highest oom_adj value of any task
> sharing the same mm". Admittedly it's not 100% consistent with the old
> semantics, but I'm having trouble imagining a scenario where someone
> was relying on the changed semantics.
> 

I'm sure that most users who tune /proc/pid/oom_adj for a group of tasks, 
especially when tuning them in terms of killing priority in oom conditions 
as opposed to simply using OOM_DISABLE, rely on /proc/pid/oom_score to 
determine when a task should be killed because it is using an egregious 
amount of memory.  That's why oom_score is exported to userspace; 
otherwise, tuning oom_adj for anything other than OOM_DISABLE would be 
difficult.

> But taking a completely different approach, is there a reason that we
> couldn't have just moved the do_each_thread() check for OOM_DISABLED
> out of oom_kill_task() and into select_bad_process() at the point
> where we've decided that the thread in question is a better victim
> than the current victim? That would fix the OOM livelock while
> allowing us to keep exactly the same oom_adj/oom_score semantics as in
> previous kernels.
> 

My patches don't merely address threads that have an oom_adj value of 
OOM_DISABLE while others sharing the same memory do not, they address a 
consistency issue whereas those threads may all share memory but have 
radically different oom_adj values: that means that only the highest 
oom_adj value amongst them is actually used by the oom killer and all 
other oom_adj values set by the user for those threads are wrong.

> > The inheritance issue should be fixed with Rik's patch with the exception
> > of vfork -> change /proc/pid-of-child/oom_adj -> execve.  If scripts were
> > written to do that with the old behavior, they'll have to adjust to change
> > oom_adj _after_ the execve
> 
> Think about what you're suggesting here - execve() replaces your code
> with whatever you're execing, so unless that code is also written to
> handle oom_adj (which for something like a generic job scheduler, the
> exec'd code is unlikely to do) you're stuck.
> 

The burden to adjust /proc/pid/oom_adj has always been on the admin, the 
only thing the kernel needs to provide is an inheritance property (as 
added with Rik's patch) to have any consistency with a priority based 
killing approach, i.e. forked children should have the same oom_adj value 
to begin with to not be more immune or a higher priority to kill than the 
parent unless explicitly changed by the admin.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-17 23:12             ` David Rientjes
@ 2009-07-20 21:52               ` Paul Menage
  2009-07-21  3:40                 ` David Rientjes
  2009-07-21  7:19                 ` KOSAKI Motohiro
  0 siblings, 2 replies; 28+ messages in thread
From: Paul Menage @ 2009-07-20 21:52 UTC (permalink / raw)
  To: David Rientjes; +Cc: Rik van Riel, linux-kernel, akpm, mel, npiggin

On Fri, Jul 17, 2009 at 4:12 PM, David Rientjes<rientjes@google.com> wrote:
>
> My patches don't merely address threads that have an oom_adj value of
> OOM_DISABLE while others sharing the same memory do not, they address a
> consistency issue whereas those threads may all share memory but have
> radically different oom_adj values: that means that only the highest
> oom_adj value amongst them is actually used by the oom killer and all
> other oom_adj values set by the user for those threads are wrong.

But these patches change an API that has existed for at least 4 years
(i.e. before git history exists) and the change is known to break some
apps that relied on the existing API. Should there be more
justification than to fix an inconsistency that only you seem to be
claiming is a problem? (Yes, I know it also fixes the OOM killer
livelock, but that could be done without breaking the API).

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-20 21:52               ` Paul Menage
@ 2009-07-21  3:40                 ` David Rientjes
  2009-07-21  7:19                 ` KOSAKI Motohiro
  1 sibling, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-07-21  3:40 UTC (permalink / raw)
  To: Paul Menage; +Cc: Rik van Riel, linux-kernel, akpm, mel, npiggin

On Mon, 20 Jul 2009, Paul Menage wrote:

> > My patches don't merely address threads that have an oom_adj value of
> > OOM_DISABLE while others sharing the same memory do not, they address a
> > consistency issue whereas those threads may all share memory but have
> > radically different oom_adj values: that means that only the highest
> > oom_adj value amongst them is actually used by the oom killer and all
> > other oom_adj values set by the user for those threads are wrong.
> 
> But these patches change an API that has existed for at least 4 years
> (i.e. before git history exists) and the change is known to break some
> apps that relied on the existing API.

Such applications (none have been mentioned) should be modified to comply 
with how the oom killer uses oom_adj values, which is identical for all 
tasks sharing the same mm_struct.  That prevents inconsistent states such 
as being able to elevate /proc/pid/oom_adj to 16 so that pid is targeted 
first because of its massive /proc/pid/oom_score when, in actuality, it 
may be OOM_DISABLE because it shares memory with an OOM_DISABLE parent and 
cannot be killed.  It is inappropriate for the kernel to report completely 
erroneous oom scores, which the old behavior allowed, so the end result is 
your userspace will need to be fixed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-20 21:52               ` Paul Menage
  2009-07-21  3:40                 ` David Rientjes
@ 2009-07-21  7:19                 ` KOSAKI Motohiro
  2009-07-21 16:06                   ` Paul Menage
  1 sibling, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2009-07-21  7:19 UTC (permalink / raw)
  To: Paul Menage
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, linux-kernel,
	akpm, mel, npiggin

> On Fri, Jul 17, 2009 at 4:12 PM, David Rientjes<rientjes@google.com> wrote:
> >
> > My patches don't merely address threads that have an oom_adj value of
> > OOM_DISABLE while others sharing the same memory do not, they address a
> > consistency issue whereas those threads may all share memory but have
> > radically different oom_adj values: that means that only the highest
> > oom_adj value amongst them is actually used by the oom killer and all
> > other oom_adj values set by the user for those threads are wrong.
> 
> But these patches change an API that has existed for at least 4 years
> (i.e. before git history exists) and the change is known to break some
> apps that relied on the existing API. Should there be more
> justification than to fix an inconsistency that only you seem to be
> claiming is a problem? (Yes, I know it also fixes the OOM killer
> livelock, but that could be done without breaking the API).

Paul, I'd like to clarily this duscussion. Doesn't Rik's patch fix your vfork issue or
you worry about generic ABI breaking issue?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-21  7:19                 ` KOSAKI Motohiro
@ 2009-07-21 16:06                   ` Paul Menage
  2009-07-21 19:54                     ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Menage @ 2009-07-21 16:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, linux-kernel, akpm, mel, npiggin

On Tue, Jul 21, 2009 at 12:19 AM, KOSAKI
Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
>
> Paul, I'd like to clarily this duscussion. Doesn't Rik's patch fix your vfork issue or
> you worry about generic ABI breaking issue?
>

No, Rik's patch just fixes the lack of inheritability that David
introduced originally. It doesn't address the problem that the
intention of the patches is to disallow separate processes sharing the
same VM from having different oom_adj scores, which breaks the
previous ability to vfork() or clone(CLONE_VM) a child and set its
oom_adj to a non-disabled value prior to execve().

I think that in the case of our job scheduler, we can probably change
it to do a full fork() rather than clone(), at a little extra cost
(copying the entire mm rather than just cloning a thread). Our job
scheduler is generally a small process, so this shouldn't be too
expensive. But for large processes, the overhead of a full mm copy is
bigger.

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-21 16:06                   ` Paul Menage
@ 2009-07-21 19:54                     ` David Rientjes
  2009-07-21 19:57                       ` Paul Menage
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-07-21 19:54 UTC (permalink / raw)
  To: Paul Menage
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, akpm, mel, npiggin

On Tue, 21 Jul 2009, Paul Menage wrote:

> No, Rik's patch just fixes the lack of inheritability that David
> introduced originally. It doesn't address the problem that the
> intention of the patches is to disallow separate processes sharing the
> same VM from having different oom_adj scores, which breaks the
> previous ability to vfork() or clone(CLONE_VM) a child and set its
> oom_adj to a non-disabled value prior to execve().
> 

That's exactly the scenario my patches were addressing, actually.  It was 
possible to get an oom killer livelock if a thread was constantly chosen 
when sharing memory with an OOM_DISABLE task.  So your example is a recipe 
for livelock without my patches since the parent is OOM_DISABLE and the 
child is not, yet they share an ->mm so neither can be killed.  If the 
child is repeatedly chosen prior to execve, no memory freeing is possible 
unless another task happens to exceed the badness score of the child.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-21 19:54                     ` David Rientjes
@ 2009-07-21 19:57                       ` Paul Menage
  2009-07-21 20:00                         ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Menage @ 2009-07-21 19:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, akpm, mel, npiggin

On Tue, Jul 21, 2009 at 12:54 PM, David Rientjes<rientjes@google.com> wrote:
>
> That's exactly the scenario my patches were addressing, actually.  It was
> possible to get an oom killer livelock if a thread was constantly chosen
> when sharing memory with an OOM_DISABLE task.  So your example is a recipe
> for livelock without my patches since the parent is OOM_DISABLE and the
> child is not, yet they share an ->mm so neither can be killed.  If the
> child is repeatedly chosen prior to execve, no memory freeing is possible
> unless another task happens to exceed the badness score of the child.


Agreed, but the same livelock can be fixed in ways that don't break
the API. (E.g. check for the victim being OOM_DISABLED in
select_bad_process() when we find a new "worst" candidate).

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-21 19:57                       ` Paul Menage
@ 2009-07-21 20:00                         ` David Rientjes
  2009-07-21 20:03                           ` Paul Menage
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-07-21 20:00 UTC (permalink / raw)
  To: Paul Menage
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, akpm, mel, npiggin

On Tue, 21 Jul 2009, Paul Menage wrote:

> Agreed, but the same livelock can be fixed in ways that don't break
> the API. (E.g. check for the victim being OOM_DISABLED in
> select_bad_process() when we find a new "worst" candidate).
> 

And allow /proc/pid-of-child/oom_score to represent a possible candidate 
(and, additionally, a hint at the oom killing priority) when it really 
isn't?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-21 20:00                         ` David Rientjes
@ 2009-07-21 20:03                           ` Paul Menage
  2009-07-23 23:21                             ` KOSAKI Motohiro
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Menage @ 2009-07-21 20:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, akpm, mel, npiggin

On Tue, Jul 21, 2009 at 1:00 PM, David Rientjes<rientjes@google.com> wrote:
> On Tue, 21 Jul 2009, Paul Menage wrote:
>
>> Agreed, but the same livelock can be fixed in ways that don't break
>> the API. (E.g. check for the victim being OOM_DISABLED in
>> select_bad_process() when we find a new "worst" candidate).
>>
>
> And allow /proc/pid-of-child/oom_score to represent a possible candidate
> (and, additionally, a hint at the oom killing priority) when it really
> isn't?
>

It's the API that's existed for years with no complaints, AFAICS.

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-21 20:03                           ` Paul Menage
@ 2009-07-23 23:21                             ` KOSAKI Motohiro
  2009-07-23 23:29                               ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2009-07-23 23:21 UTC (permalink / raw)
  To: Paul Menage
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, linux-kernel,
	akpm, mel, npiggin

> On Tue, Jul 21, 2009 at 1:00 PM, David Rientjes<rientjes@google.com> wrote:
> > On Tue, 21 Jul 2009, Paul Menage wrote:
> >
> >> Agreed, but the same livelock can be fixed in ways that don't break
> >> the API. (E.g. check for the victim being OOM_DISABLED in
> >> select_bad_process() when we find a new "worst" candidate).
> >>
> >
> > And allow /proc/pid-of-child/oom_score to represent a possible candidate
> > (and, additionally, a hint at the oom killing priority) when it really
> > isn't?

Oops, I really don't like  /proc/pid-of-child/oom_score. it is very strange.

> 
> It's the API that's existed for years with no complaints, AFAICS.

I think thead and vfork() should be separeted on this discussion.
I agree vfork() regression should be fixed. but I don't think anyone
hope per-thread oom score. 

Of cource, if simple reverting is best way, I don't oppose this.... ;-)




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-23 23:21                             ` KOSAKI Motohiro
@ 2009-07-23 23:29                               ` David Rientjes
  2009-07-24  1:10                                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-07-23 23:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Paul Menage, Rik van Riel, linux-kernel, akpm, mel, npiggin

On Fri, 24 Jul 2009, KOSAKI Motohiro wrote:

> > It's the API that's existed for years with no complaints, AFAICS.
> 
> I think thead and vfork() should be separeted on this discussion.
> I agree vfork() regression should be fixed. but I don't think anyone
> hope per-thread oom score. 
> 
> Of cource, if simple reverting is best way, I don't oppose this.... ;-)
> 

Simply reverting it isn't an option unless you fix the underlying livelock 
problem that my patches originally addressed and no viable alternative has 
been proposed.

On the other hand, I think adding /proc/pid/oom_adj_child would solve the 
userspace issue.  oom_adj_child would be the oom_adj value that is used by 
the newly initialized mm_struct; this would allow the vfork'd child to 
share the same oom_score with the parent and then be replaced with an 
alternate score on execve().

oom_adj_child would only be allowed to be greater (i.e. the more preferred 
victim) than oom_adj for any given thread since the oom killer tries to 
kill a child of the selected task first if it doesn't share memory.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-23 23:29                               ` David Rientjes
@ 2009-07-24  1:10                                 ` KOSAKI Motohiro
  2009-07-24  9:27                                   ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2009-07-24  1:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Paul Menage, Rik van Riel, linux-kernel, akpm,
	mel, npiggin

> On Fri, 24 Jul 2009, KOSAKI Motohiro wrote:
> 
> > > It's the API that's existed for years with no complaints, AFAICS.
> > 
> > I think thead and vfork() should be separeted on this discussion.
> > I agree vfork() regression should be fixed. but I don't think anyone
> > hope per-thread oom score. 
> > 
> > Of cource, if simple reverting is best way, I don't oppose this.... ;-)
> > 
> 
> Simply reverting it isn't an option unless you fix the underlying livelock 
> problem that my patches originally addressed and no viable alternative has 
> been proposed.

I disagree.
I agree with old behavior have one bug. but it doesn't provide any regression
allowing reason although old behavior is totally suck.

David, We already know your patch break real application. it is never acceptable.


> On the other hand, I think adding /proc/pid/oom_adj_child would solve the 
> userspace issue.  oom_adj_child would be the oom_adj value that is used by 
> the newly initialized mm_struct; this would allow the vfork'd child to 
> share the same oom_score with the parent and then be replaced with an 
> alternate score on execve().

Not solve. "please rewrite your application" isn't good solution.

Hm...
This is just idea, Does moving oom_adj from mm_struct to signal_struct solve
this problem?
I mean vfork() share mm_struct, but doesn't share signal_struct.




> 
> oom_adj_child would only be allowed to be greater (i.e. the more preferred 
> victim) than oom_adj for any given thread since the oom killer tries to 
> kill a child of the selected task first if it doesn't share memory.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-24  1:10                                 ` KOSAKI Motohiro
@ 2009-07-24  9:27                                   ` David Rientjes
  2009-07-28  6:09                                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2009-07-24  9:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Paul Menage, Rik van Riel, linux-kernel, Andrew Morton,
	Mel Gorman, Nick Piggin

On Fri, 24 Jul 2009, KOSAKI Motohiro wrote:

> > Simply reverting it isn't an option unless you fix the underlying livelock 
> > problem that my patches originally addressed and no viable alternative has 
> > been proposed.
> 
> I disagree.
> I agree with old behavior have one bug. but it doesn't provide any regression
> allowing reason although old behavior is totally suck.
> 

I don't understand most of this, sorry.  I think what you're saying is 
that you don't fix one bug by introducing another.

The "regression" here is that changing /proc/pid/oom_adj for a vfork'd 
child would change the oom_adj value of the parent as well.  That is 
actually the behavior that leads to the livelock where the oom killer 
would repeatedly select a child and it could not be killed because it 
shares memory with an OOM_DISABLE parent.  That would cause the oom killer 
to be called by the page allocator infinitely without ever freeing memory.

That behavior is unacceptable, so I disagree that reverting my patches is 
an option.

I suggested a workaround by introducing /proc/pid/oom_adj_child which 
applications would need to use instead of oom_adj after vfork() and prior 
to execve() (if such open source applications exist in the first place).

> Not solve. "please rewrite your application" isn't good solution.
> 

They'd need to use the new interface because the old behavior would lead 
to a kernel livelock because it allowed tasks sharing memory to be 
oom disabled and enabled at the same time.  That seems like a very good 
reason to fix the application, otherwise it may livelock the kernel if its 
ooms before exec.  The behavior you're defending is the SOURCE of the 
livelock.

> Hm...
> This is just idea, Does moving oom_adj from mm_struct to signal_struct solve
> this problem?
> I mean vfork() share mm_struct, but doesn't share signal_struct.
> 

oom_adj values are not a characteristic of signals, they are a trait of 
memory.  They specify how the oom killer should favor (or disable) amounts 
of memory in oom conditions.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-24  9:27                                   ` David Rientjes
@ 2009-07-28  6:09                                     ` KOSAKI Motohiro
  2009-07-28  6:14                                       ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2009-07-28  6:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Paul Menage, Rik van Riel, linux-kernel,
	Andrew Morton, Mel Gorman, Nick Piggin

> On Fri, 24 Jul 2009, KOSAKI Motohiro wrote:
> 
> > > Simply reverting it isn't an option unless you fix the underlying livelock 
> > > problem that my patches originally addressed and no viable alternative has 
> > > been proposed.
> > 
> > I disagree.
> > I agree with old behavior have one bug. but it doesn't provide any regression
> > allowing reason although old behavior is totally suck.
> > 
> 
> I don't understand most of this, sorry.  I think what you're saying is 
> that you don't fix one bug by introducing another.
> 
> The "regression" here is that changing /proc/pid/oom_adj for a vfork'd 
> child would change the oom_adj value of the parent as well.  That is 
> actually the behavior that leads to the livelock where the oom killer 
> would repeatedly select a child and it could not be killed because it 
> shares memory with an OOM_DISABLE parent.  That would cause the oom killer 
> to be called by the page allocator infinitely without ever freeing memory.

Actually, if we assume the administrator is really stupid, he can mark
all processes as OOM_DISABLE. it makes livelock anyway.
ITOH, we never seen this livelock on vfork()ed application.

More important thing is: Documentation/filesysmtem/proc/txt says
oom_adj is process property and vfork()ed parent and child are definitelly
another process.





> That behavior is unacceptable, so I disagree that reverting my patches is 
> an option.
> 
> I suggested a workaround by introducing /proc/pid/oom_adj_child which 
> applications would need to use instead of oom_adj after vfork() and prior 
> to execve() (if such open source applications exist in the first place).
> 
> > Not solve. "please rewrite your application" isn't good solution.
> > 
> 
> They'd need to use the new interface because the old behavior would lead 
> to a kernel livelock because it allowed tasks sharing memory to be 
> oom disabled and enabled at the same time.  That seems like a very good 
> reason to fix the application, otherwise it may livelock the kernel if its 
> ooms before exec.  The behavior you're defending is the SOURCE of the 
> livelock.
> 
> > Hm...
> > This is just idea, Does moving oom_adj from mm_struct to signal_struct solve
> > this problem?
> > I mean vfork() share mm_struct, but doesn't share signal_struct.
> > 
> 
> oom_adj values are not a characteristic of signals, they are a trait of 
> memory.  They specify how the oom killer should favor (or disable) amounts 
> of memory in oom conditions.

That's ok. nobody think struct signal is signal related structure. almost member are
signal unrelated already.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] copy over oom_adj value at fork time
  2009-07-28  6:09                                     ` KOSAKI Motohiro
@ 2009-07-28  6:14                                       ` David Rientjes
  0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2009-07-28  6:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Paul Menage, Rik van Riel, linux-kernel, Andrew Morton,
	Mel Gorman, Nick Piggin

On Tue, 28 Jul 2009, KOSAKI Motohiro wrote:

> Actually, if we assume the administrator is really stupid, he can mark
> all processes as OOM_DISABLE. it makes livelock anyway.

Wrong, the oom killer will panic the machine if there are no eligible 
tasks to kill.  That's actually a more desirable result in most 
circumstances over livelocking.

> ITOH, we never seen this livelock on vfork()ed application.
> 

I have, which is why I wrote my patches.

> More important thing is: Documentation/filesysmtem/proc/txt says
> oom_adj is process property and vfork()ed parent and child are definitelly
> another process.
> 

I've proposed adding /proc/pid/oom_adj_child so that newly initialized 
mm's start with a default oom_adj value.  Paul suggested that it be a 
per-task characteristic, so that's what I'm looking at implementing.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* + oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch added to -mm tree
@ 2009-05-11 21:20 akpm
  0 siblings, 0 replies; 28+ messages in thread
From: akpm @ 2009-05-11 21:20 UTC (permalink / raw)
  To: mm-commits; +Cc: rientjes, a.p.ziljstra, cl, dave, gregkh, mel, npiggin, san


The patch titled
     oom: move oom_adj value from task_struct to mm_struct
has been added to the -mm tree.  Its filename is
     oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find
out what to do about this

The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/

------------------------------------------------------
Subject: oom: move oom_adj value from task_struct to mm_struct
From: David Rientjes <rientjes@google.com>

The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm.  If a task were to be killed while attached to an mm that could not be
freed because another thread were set to OOM_DISABLE, it would have
needlessly been terminated since there is no potential for future memory
freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct.  This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already.  They simply report an
oom_adj value of OOM_DISABLE.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: San Mehat <san@android.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Ziljstra <a.p.ziljstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/filesystems/proc.txt        |   15 ++++++---
 drivers/staging/android/lowmemorykiller.c |   12 +++----
 fs/proc/base.c                            |   19 ++++++++++-
 include/linux/mm_types.h                  |    2 +
 include/linux/sched.h                     |    1 
 mm/oom_kill.c                             |   32 ++++++++++++--------
 6 files changed, 54 insertions(+), 27 deletions(-)

diff -puN Documentation/filesystems/proc.txt~oom-move-oom_adj-value-from-task_struct-to-mm_struct Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/Documentation/filesystems/proc.txt
@@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
 ------------------------------------------------------
 
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
+This file can be used to adjust the score used to select which processes should
+be killed in an out-of-memory situation.  The oom_adj value is a characteristic
+of the task's mm, so all threads that share an mm with pid will have the same
+oom_adj value.  A high value will increase the liklihood of this process being
+killed by the oom-killer.  Valid values are in the range -16 to +15 as
+explained below and a special value of -17, which disables oom-killing
+altogether for threads sharing pid's mm.
 
 The process to be killed in an out-of-memory situation is selected among all others
 based on its badness score. This value equals the original memory size of the process
@@ -1021,6 +1023,9 @@ the parent's score if they do not share 
 are the prime candidates to be killed. Having only one 'hungry' child will make
 parent less preferable than the child.
 
+/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
+oom-killing already.
+
 /proc/<pid>/oom_score shows process' current badness score.
 
 The following heuristics are then applied:
diff -puN drivers/staging/android/lowmemorykiller.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/drivers/staging/android/lowmemorykiller.c
@@ -73,17 +73,17 @@ static int lowmem_shrink(int nr_to_scan,
 		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
 	read_lock(&tasklist_lock);
 	for_each_process(p) {
-		if(p->oomkilladj >= 0 && p->mm) {
+		if(p->mm->oom_adj >= 0 && p->mm) {
 			tasksize = get_mm_rss(p->mm);
-			if(nr_to_scan > 0 && tasksize > 0 && p->oomkilladj >= min_adj) {
+			if(nr_to_scan > 0 && tasksize > 0 && p->mm->oom_adj >= min_adj) {
 				if(selected == NULL ||
-				   p->oomkilladj > selected->oomkilladj ||
-				   (p->oomkilladj == selected->oomkilladj &&
+				   p->mm->oom_adj > selected->mm->oom_adj ||
+				   (p->mm->oom_adj == selected->mm->oom_adj &&
 				    tasksize > selected_tasksize)) {
 					selected = p;
 					selected_tasksize = tasksize;
 					lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
-					             p->pid, p->comm, p->oomkilladj, tasksize);
+					             p->pid, p->comm, p->mm->oom_adj, tasksize);
 				}
 			}
 			rem += tasksize;
@@ -92,7 +92,7 @@ static int lowmem_shrink(int nr_to_scan,
 	if(selected != NULL) {
 		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",
 		             selected->pid, selected->comm,
-		             selected->oomkilladj, selected_tasksize);
+		             selected->mm->oom_adj, selected_tasksize);
 		force_sig(SIGKILL, selected);
 		rem -= selected_tasksize;
 	}
diff -puN fs/proc/base.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct fs/proc/base.c
--- a/fs/proc/base.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/fs/proc/base.c
@@ -1006,7 +1006,12 @@ static ssize_t oom_adjust_read(struct fi
 
 	if (!task)
 		return -ESRCH;
-	oom_adjust = task->oomkilladj;
+	task_lock(task);
+	if (task->mm)
+		oom_adjust = task->mm->oom_adj;
+	else
+		oom_adjust = OOM_DISABLE;
+	task_unlock(task);
 	put_task_struct(task);
 
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1035,11 +1040,19 @@ static ssize_t oom_adjust_write(struct f
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
-	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
+	task_lock(task);
+	if (!task->mm) {
+		task_unlock(task);
+		put_task_struct(task);
+		return -EINVAL;
+	}
+	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->oomkilladj = oom_adjust;
+	task->mm->oom_adj = oom_adjust;
+	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
diff -puN include/linux/mm_types.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct include/linux/mm_types.h
--- a/include/linux/mm_types.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/include/linux/mm_types.h
@@ -232,6 +232,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+
 	cpumask_t cpu_vm_mask;
 
 	/* Architecture-specific MM context */
diff -puN include/linux/sched.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct include/linux/sched.h
--- a/include/linux/sched.h~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/include/linux/sched.h
@@ -1150,7 +1150,6 @@ struct task_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
-	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
diff -puN mm/oom_kill.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct mm/oom_kill.c
--- a/mm/oom_kill.c~oom-move-oom_adj-value-from-task_struct-to-mm_struct
+++ a/mm/oom_kill.c
@@ -58,6 +58,7 @@ unsigned long badness(struct task_struct
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
+	int oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -65,6 +66,7 @@ unsigned long badness(struct task_struct
 		task_unlock(p);
 		return 0;
 	}
+	oom_adj = mm->oom_adj;
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -148,15 +150,15 @@ unsigned long badness(struct task_struct
 		points /= 8;
 
 	/*
-	 * Adjust the score by oomkilladj.
+	 * Adjust the score by oom_adj.
 	 */
-	if (p->oomkilladj) {
-		if (p->oomkilladj > 0) {
+	if (oom_adj) {
+		if (oom_adj > 0) {
 			if (!points)
 				points = 1;
-			points <<= p->oomkilladj;
+			points <<= oom_adj;
 		} else
-			points >>= -(p->oomkilladj);
+			points >>= -(oom_adj);
 	}
 
 #ifdef DEBUG
@@ -251,8 +253,10 @@ static struct task_struct *select_bad_pr
 			*ppoints = ULONG_MAX;
 		}
 
-		if (p->oomkilladj == OOM_DISABLE)
+		task_lock(p);
+		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
 			continue;
+		task_unlock(p);
 
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
-		       p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -367,8 +370,12 @@ static int oom_kill_task(struct task_str
 	 * Don't kill the process if any threads are set to OOM_DISABLE
 	 */
 	do_each_thread(g, q) {
-		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+		task_lock(q);
+		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
+			task_unlock(q);
 			return 1;
+		}
+		task_unlock(q);
 	} while_each_thread(g, q);
 
 	__oom_kill_task(p, 1);
@@ -393,10 +400,11 @@ static int oom_kill_process(struct task_
 	struct task_struct *c;
 
 	if (printk_ratelimit()) {
-		printk(KERN_WARNING "%s invoked oom-killer: "
-			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
-			current->comm, gfp_mask, order, current->oomkilladj);
 		task_lock(current);
+		printk(KERN_WARNING "%s invoked oom-killer: "
+			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
+			current->comm, gfp_mask, order,
+			current->mm ? current->mm->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();
_

Patches currently in -mm which might be from rientjes@google.com are

linux-next.patch
cpusets-restructure-the-function-cpuset_update_task_memory_state.patch
cpusets-update-tasks-page-slab-spread-flags-in-time.patch
cpusetmm-update-tasks-mems_allowed-in-time.patch
cpusetmm-update-tasks-mems_allowed-in-time-fix.patch
cpusetmm-update-tasks-mems_allowed-in-time-cleanup.patch
page-allocator-use-a-pre-calculated-value-instead-of-num_online_nodes-in-fast-paths-do-not-override-definition-of-node_set_online-with-macro.patch
mm-setup_per_zone_inactive_ratio-do-not-call-for-int_sqrt-if-not-needed.patch
mm-setup_per_zone_inactive_ratio-fix-comment-and-make-it-__init.patch
mm-invoke-oom-killer-for-__gfp_nofail.patch
oom-fix-possible-oom_dump_tasks-null-pointer.patch
oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch
memcg-add-file-based-rss-accounting.patch
memcg-add-file-based-rss-accounting-fix-mem_cgroup_update_mapped_file_stat-oops.patch


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2009-07-28  6:14 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-02  5:57 + oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch added to -mm tree akpm
2009-07-15  0:52 ` Paul Menage
2009-07-15  1:05   ` [PATCH] copy over oom_adj value at fork time Rik van Riel
2009-07-15  1:15     ` Paul Menage
2009-07-15  2:01       ` KOSAKI Motohiro
2009-07-15  2:12         ` [PATCH -mm] copy over oom_adj value at fork time (v2) Rik van Riel
2009-07-15  3:42           ` David Rientjes
2009-07-15  3:48       ` [PATCH] copy over oom_adj value at fork time David Rientjes
2009-07-16 23:29       ` Paul Menage
2009-07-17  9:34         ` David Rientjes
2009-07-17 20:03           ` Paul Menage
2009-07-17 23:12             ` David Rientjes
2009-07-20 21:52               ` Paul Menage
2009-07-21  3:40                 ` David Rientjes
2009-07-21  7:19                 ` KOSAKI Motohiro
2009-07-21 16:06                   ` Paul Menage
2009-07-21 19:54                     ` David Rientjes
2009-07-21 19:57                       ` Paul Menage
2009-07-21 20:00                         ` David Rientjes
2009-07-21 20:03                           ` Paul Menage
2009-07-23 23:21                             ` KOSAKI Motohiro
2009-07-23 23:29                               ` David Rientjes
2009-07-24  1:10                                 ` KOSAKI Motohiro
2009-07-24  9:27                                   ` David Rientjes
2009-07-28  6:09                                     ` KOSAKI Motohiro
2009-07-28  6:14                                       ` David Rientjes
2009-07-15  1:49     ` KOSAKI Motohiro
  -- strict thread matches above, loose matches on Subject: below --
2009-05-11 21:20 + oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch added to -mm tree akpm

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.