linux-kernel.vger.kernel.org archive mirror
* [PATCH] Revert oom rewrite series
@ 2010-11-14  5:07 KOSAKI Motohiro
  2010-11-14 19:32 ` Linus Torvalds
  2010-11-14 21:58 ` David Rientjes
  0 siblings, 2 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  5:07 UTC
  To: LKML, Linus Torvalds
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

Linus,

Please apply this. This patch reverts the oom commits merged since v2.6.35.

Briefly, "oom: badness heuristic rewrite" was merged by mistake.
It has not passed our design review or our code review, and multiple bug
reports have popped up since. I believe every patch should pass a use-case
review and a code review :-/

The problem is that DavidR's patches don't reflect real-world use cases at
all, and they break them. He can argue that userland is wrong, but such an
excuse doesn't solve the real-world issue. It makes no sense.

I hope every developer keeps developing honestly. Googlers are NOT an
exception.


David, at least the rss-based oom score has passed our design review.
So if you resubmit that part, we will ack it. Please remember that.
Also, I can accept the oom_score_adj feature if you can remove the
incompatibility issue. OK?
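
As a rough illustration of that incompatibility (a sketch only, not code from
this patch): the pre-rewrite badness() treats oom_adj as a bit shift on the
score, while the reverted oom_adjust_write() maps oom_adj linearly onto
oom_score_adj. The constants and arithmetic are taken from the hunks in this
patch; the helper names oom_adj_to_score_adj() and apply_oom_adj_shift() are
made up for illustration.

```c
#include <stdio.h>

/* Constants as defined in include/linux/oom.h in this patch. */
#define OOM_DISABLE        (-17)
#define OOM_ADJUST_MAX     15
#define OOM_SCORE_ADJ_MAX  1000

/*
 * Hypothetical helper: the linear scaling that the reverted
 * oom_adjust_write() applied when converting oom_adj to oom_score_adj.
 */
int oom_adj_to_score_adj(int oom_adj)
{
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

/*
 * Hypothetical helper: the pre-rewrite meaning of oom_adj, a bit shift
 * applied to the badness points at the end of badness().
 */
unsigned long apply_oom_adj_shift(unsigned long points, int oom_adj)
{
	if (oom_adj > 0) {
		if (!points)
			points = 1;
		return points << oom_adj;
	}
	if (oom_adj < 0)
		return points >> -oom_adj;
	return points;
}
```

For example, oom_adj = 8 multiplies the old score by 256, but after the
rewrite it becomes oom_score_adj = 470, an additive offset on a 0..1000
scale. An existing oom_adj setting therefore no longer means the same thing.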

Linus, if you want to check the patch, please use the following command.
  % git diff a63d83f427fbce97a6cea0db2e64b0eb8435cd10^ mm/oom_kill.c include/linux/oom.h fs/proc/base.c


Thanks.

--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts the following commits. They broke an ABI and caused multiple
end-user complaints.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's mm"
be212960618ddcdb9526ce2cb73fd081fd3e90ea Revert "oom: rewrite error handling for oom_adj and oom_score_adj tunab
1b17c41599c594c7d11ef415a92d47c205fe89ea Revert "oom: fix locking for oom_adj and oom_score_adj"

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/feature-removal-schedule.txt |   25 ---
 Documentation/filesystems/proc.txt         |   97 ++++-----
 fs/exec.c                                  |    5 -
 fs/proc/base.c                             |  176 ++--------------
 include/linux/memcontrol.h                 |    8 -
 include/linux/mm_types.h                   |    2 -
 include/linux/oom.h                        |   19 +--
 include/linux/sched.h                      |    3 +-
 kernel/exit.c                              |    3 -
 kernel/fork.c                              |   16 +--
 mm/memcontrol.c                            |   28 +---
 mm/oom_kill.c                              |  323 ++++++++++++++--------------
 12 files changed, 227 insertions(+), 478 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index d8f36f9..9af16b9 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -166,31 +166,6 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated.  The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure.  With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness() score linearly.  This interface will replace
-	/proc/<pid>/oom_adj.
-
-	A warning will be emitted to the kernel log if an application uses this
-	deprecated interface.  After it is printed once, future warnings will be
-	suppressed until the kernel is rebooted.
-
----------------------------
-
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..030e3a1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,8 +33,7 @@ Table of Contents
   2	Modifying System Parameters
 
   3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
-								score
+  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1246,64 +1245,42 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
---------------------------------------------------------------------------------
-
-These file can be used to adjust the badness heuristic used to select which
-process gets killed in out of memory conditions.
-
-The badness heuristic assigns a value to each candidate task ranging from 0
-(never kill) to 1000 (always kill) to determine which process is targeted.  The
-units are roughly a proportion along that range of allowed memory the process
-may allocate from based on an estimation of its current memory and swap use.
-For example, if a task is using all allowed memory, its badness score will be
-1000.  If it is using half of its allowed memory, its score will be 500.
-
-There is an additional factor included in the badness score: root
-processes are given 3% extra memory over other tasks.
-
-The amount of "allowed" memory depends on the context in which the oom killer
-was called.  If it is due to the memory assigned to the allocating task's cpuset
-being exhausted, the allowed memory represents the set of mems assigned to that
-cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
-memory represents the set of mempolicy nodes.  If it is due to a memory
-limit (or swap limit) being reached, the allowed memory is that configured
-limit.  Finally, if it is due to the entire system being out of memory, the
-allowed memory represents all allocatable resources.
-
-The value of /proc/<pid>/oom_score_adj is added to the badness score before it
-is used to determine which task to kill.  Acceptable values range from -1000
-(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
-polarize the preference for oom killing either by always preferring a certain
-task or completely disabling it.  The lowest possible value, -1000, is
-equivalent to disabling oom killing entirely for that task since it will always
-report a badness score of 0.
-
-Consequently, it is very simple for userspace to define the amount of memory to
-consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
-example, is roughly equivalent to allowing the remainder of tasks sharing the
-same system, cpuset, mempolicy, or memory controller resources to use at least
-50% more memory.  A value of -500, on the other hand, would be roughly
-equivalent to discounting 50% of the task's allowed memory from being considered
-as scoring against the task.
-
-For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
-be used to tune the badness score.  Its acceptable values range from -16
-(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
-(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
-scaled linearly with /proc/<pid>/oom_score_adj.
-
-Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
-other with its scaled value.
-
-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with seperate address spaces instead, if possible.  This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
+3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
+------------------------------------------------------
+
+This file can be used to adjust the score used to select which processes
+should be killed in an  out-of-memory  situation.  Giving it a high score will
+increase the likelihood of this process being killed by the oom-killer.  Valid
+values are in the range -16 to +15, plus the special value -17, which disables
+oom-killing altogether for this process.
+
+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ 	or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked process does not belong
+ 	to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+	points <<= oom_adj when it is positive and
+	points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then selected and its children
+are killed, process itself will be killed in an OOM situation when it does
+not have children or some of them disabled oom like described above.
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
diff --git a/fs/exec.c b/fs/exec.c
index 99d33a1..47986fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -54,7 +54,6 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
-#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -766,10 +765,6 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm = mm;
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
-	if (old_mm && tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-		atomic_dec(&old_mm->oom_disable_count);
-		atomic_inc(&tsk->mm->oom_disable_count);
-	}
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
 	if (old_mm) {
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..ed7d18e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -63,7 +63,6 @@
 #include <linux/namei.h>
 #include <linux/mnt_namespace.h>
 #include <linux/mm.h>
-#include <linux/swap.h>
 #include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/stacktrace.h>
@@ -431,11 +430,12 @@ static const struct file_operations proc_lstats_operations = {
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points = 0;
+	struct timespec uptime;
 
+	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = oom_badness(task, NULL, NULL,
-					totalram_pages + total_swap_pages);
+		points = badness(task, NULL, NULL, uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1025,74 +1025,36 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	memset(buffer, 0, sizeof(buffer));
 	if (count > sizeof(buffer) - 1)
 		count = sizeof(buffer) - 1;
-	if (copy_from_user(buffer, buf, count)) {
-		err = -EFAULT;
-		goto out;
-	}
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
 
 	err = strict_strtol(strstrip(buffer), 0, &oom_adjust);
 	if (err)
-		goto out;
+		return -EINVAL;
 	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
-	     oom_adjust != OOM_DISABLE) {
-		err = -EINVAL;
-		goto out;
-	}
+	     oom_adjust != OOM_DISABLE)
+		return -EINVAL;
 
 	task = get_proc_task(file->f_path.dentry->d_inode);
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
+	if (!task)
+		return -ESRCH;
 	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_task_lock;
+		put_task_struct(task);
+		return -ESRCH;
 	}
 
 	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	if (oom_adjust != task->signal->oom_adj) {
-		if (oom_adjust == OOM_DISABLE)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_adj == OOM_DISABLE)
-			atomic_dec(&task->mm->oom_disable_count);
+		unlock_task_sighand(task, &flags);
+		put_task_struct(task);
+		return -EACCES;
 	}
 
-	/*
-	 * Warn that /proc/pid/oom_adj is deprecated, see
-	 * Documentation/feature-removal-schedule.txt.
-	 */
-	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
-			"please use /proc/%d/oom_score_adj instead.\n",
-			current->comm, task_pid_nr(current),
-			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
-	/*
-	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
-	 * value is always attainable.
-	 */
-	if (task->signal->oom_adj == OOM_ADJUST_MAX)
-		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
-	else
-		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
-								-OOM_DISABLE;
-err_sighand:
+
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
 	put_task_struct(task);
-out:
-	return err < 0 ? err : count;
+
+	return count;
 }
 
 static const struct file_operations proc_oom_adjust_operations = {
@@ -1101,106 +1063,6 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.llseek		= generic_file_llseek,
 };
 
-static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
-					size_t count, loff_t *ppos)
-{
-	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
-	char buffer[PROC_NUMBUF];
-	int oom_score_adj = OOM_SCORE_ADJ_MIN;
-	unsigned long flags;
-	size_t len;
-
-	if (!task)
-		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		oom_score_adj = task->signal->oom_score_adj;
-		unlock_task_sighand(task, &flags);
-	}
-	put_task_struct(task);
-	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
-	return simple_read_from_buffer(buf, count, ppos, buffer, len);
-}
-
-static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
-					size_t count, loff_t *ppos)
-{
-	struct task_struct *task;
-	char buffer[PROC_NUMBUF];
-	unsigned long flags;
-	long oom_score_adj;
-	int err;
-
-	memset(buffer, 0, sizeof(buffer));
-	if (count > sizeof(buffer) - 1)
-		count = sizeof(buffer) - 1;
-	if (copy_from_user(buffer, buf, count)) {
-		err = -EFAULT;
-		goto out;
-	}
-
-	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
-	if (err)
-		goto out;
-	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
-			oom_score_adj > OOM_SCORE_ADJ_MAX) {
-		err = -EINVAL;
-		goto out;
-	}
-
-	task = get_proc_task(file->f_path.dentry->d_inode);
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_task_lock;
-	}
-
-	if (oom_score_adj < task->signal->oom_score_adj &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	if (oom_score_adj != task->signal->oom_score_adj) {
-		if (oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&task->mm->oom_disable_count);
-	}
-	task->signal->oom_score_adj = oom_score_adj;
-	/*
-	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
-	 * always attainable.
-	 */
-	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		task->signal->oom_adj = OOM_DISABLE;
-	else
-		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
-							OOM_SCORE_ADJ_MAX;
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
-	put_task_struct(task);
-out:
-	return err < 0 ? err : count;
-}
-
-static const struct file_operations proc_oom_score_adj_operations = {
-	.read		= oom_score_adj_read,
-	.write		= oom_score_adj_write,
-	.llseek		= default_llseek,
-};
-
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2779,7 +2641,6 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
-	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -3115,7 +2976,6 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
-	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..b13fc2a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,8 +124,6 @@ static inline bool mem_cgroup_disabled(void)
 void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -305,12 +303,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
-static inline
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
-{
-	return 0;
-}
-
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb7288a..cb57d65 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -310,8 +310,6 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
-	/* How many tasks sharing this mm are OOM_DISABLE */
-	atomic_t oom_disable_count;
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..40e5e3a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,27 +1,14 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
-/*
- * /proc/<pid>/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- *
- * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
- */
+/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
-/*
- * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
- * pid.
- */
-#define OOM_SCORE_ADJ_MIN	(-1000)
-#define OOM_SCORE_ADJ_MAX	1000
-
 #ifdef __KERNEL__
 
-#include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/nodemask.h>
 
@@ -40,8 +27,6 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-			const nodemask_t *nodemask, unsigned long totalpages);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
@@ -66,8 +51,6 @@ static inline void oom_killer_enable(void)
 extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
 		      const nodemask_t *nodemask, unsigned long uptime);
 
-extern struct task_struct *find_lock_task_mm(struct task_struct *p);
-
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d0036e5..a35acb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,8 +624,7 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
-	int oom_adj;		/* OOM kill score adjustment (bit shift) */
-	int oom_score_adj;	/* OOM kill score adjustment */
+	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
index 21aa7b3..c806406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,7 +50,6 @@
 #include <linux/perf_event.h>
 #include <trace/events/sched.h>
 #include <linux/hw_breakpoint.h>
-#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -696,8 +695,6 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_dec(&mm->oom_disable_count);
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..cca5e8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -65,7 +65,6 @@
 #include <linux/perf_event.h>
 #include <linux/posix-timers.h>
 #include <linux/user-return-notifier.h>
-#include <linux/oom.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -489,7 +488,6 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
-	atomic_set(&mm->oom_disable_count, 0);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
@@ -743,8 +741,6 @@ good_mm:
 	/* Initializing for Swap token stuff */
 	mm->token_priority = 0;
 	mm->last_interval = 0;
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_inc(&mm->oom_disable_count);
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
@@ -906,7 +902,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 
 	sig->oom_adj = current->signal->oom_adj;
-	sig->oom_score_adj = current->signal->oom_score_adj;
 
 	mutex_init(&sig->cred_guard_mutex);
 
@@ -1305,13 +1300,8 @@ bad_fork_cleanup_io:
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
-	if (p->mm) {
-		task_lock(p);
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&p->mm->oom_disable_count);
-		task_unlock(p);
+	if (p->mm)
 		mmput(p->mm);
-	}
 bad_fork_cleanup_signal:
 	if (!(clone_flags & CLONE_THREAD))
 		free_signal_struct(p->signal);
@@ -1704,10 +1694,6 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
 			active_mm = current->active_mm;
 			current->mm = new_mm;
 			current->active_mm = new_mm;
-			if (current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-				atomic_dec(&mm->oom_disable_count);
-				atomic_inc(&new_mm->oom_disable_count);
-			}
 			activate_mm(active_mm, new_mm);
 			new_mm = mm;
 		}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..c628370 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -47,7 +47,6 @@
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
-#include <linux/oom.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -917,13 +916,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
 	int ret;
 	struct mem_cgroup *curr = NULL;
-	struct task_struct *p;
 
-	p = find_lock_task_mm(task);
-	if (!p)
-		return 0;
-	curr = try_get_mem_cgroup_from_mm(p->mm);
-	task_unlock(p);
+	task_lock(task);
+	curr = try_get_mem_cgroup_from_mm(task->mm);
+	task_unlock(task);
 	if (!curr)
 		return 0;
 	/*
@@ -1297,24 +1293,6 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 }
 
 /*
- * Return the memory (and swap, if configured) limit for a memcg.
- */
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
-{
-	u64 limit;
-	u64 memsw;
-
-	limit = res_counter_read_u64(&memcg->res, RES_LIMIT) +
-			total_swap_pages;
-	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-	/*
-	 * If memsw is finite and limits the amount of swap space available
-	 * to this memcg, return that limit.
-	 */
-	return min(limit, memsw);
-}
-
-/*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
  * that to reclaim free pages from.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7dcca55..f251ddb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,8 +4,6 @@
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
- *  Copyright (C)  2010  Google, Inc.
- *	Rewritten by David Rientjes
  *
  *  The routines in this file are used to kill a process when
  *  we're seriously out of memory. This gets called from __alloc_pages()
@@ -36,6 +34,7 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
+/* #define DEBUG */
 
 #ifdef CONFIG_NUMA
 /**
@@ -106,7 +105,7 @@ static void boost_dying_task_prio(struct task_struct *p,
  * pointer.  Return p, or any of its subthreads with a valid ->mm, with
  * task_lock() held.
  */
-struct task_struct *find_lock_task_mm(struct task_struct *p)
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
 {
 	struct task_struct *t = p;
 
@@ -121,8 +120,8 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 }
 
 /* return true if the task is not adequate as candidate victim task. */
-static bool oom_unkillable_task(struct task_struct *p,
-		const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
+			   const nodemask_t *nodemask)
 {
 	if (is_global_init(p))
 		return true;
@@ -141,82 +140,137 @@ static bool oom_unkillable_task(struct task_struct *p,
 }
 
 /**
- * oom_badness - heuristic function to determine which candidate task to kill
+ * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
- * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
  *
- * The heuristic for determining which task to kill is made to be as simple and
- * predictable as possible.  The goal is to return the highest value for the
- * task consuming the most memory to avoid subsequent oom failures.
+ * The formula used is relatively simple and documented inline in the
+ * function. The main rationale is that we want to select a good task
+ * to kill when we run out of memory.
+ *
+ * Good in this context means that:
+ * 1) we lose the minimum amount of work done
+ * 2) we recover a large amount of memory
+ * 3) we don't kill anything innocent of eating tons of memory
+ * 4) we want to kill the minimum amount of processes (one)
+ * 5) we try to kill the process the user expects us to kill, this
+ *    algorithm has been meticulously tuned to meet the principle
+ *    of least surprise ... (be careful when you change it)
  */
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+		      const nodemask_t *nodemask, unsigned long uptime)
 {
-	int points;
+	unsigned long points, cpu_time, run_time;
+	struct task_struct *child;
+	struct task_struct *c, *t;
+	int oom_adj = p->signal->oom_adj;
+	struct task_cputime task_time;
+	unsigned long utime;
+	unsigned long stime;
 
 	if (oom_unkillable_task(p, mem, nodemask))
 		return 0;
+	if (oom_adj == OOM_DISABLE)
+		return 0;
 
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 0;
 
 	/*
-	 * Shortcut check for a thread sharing p->mm that is OOM_SCORE_ADJ_MIN
-	 * so the entire heuristic doesn't need to be executed for something
-	 * that cannot be killed.
+	 * The memory size of the process is the basis for the badness.
 	 */
-	if (atomic_read(&p->mm->oom_disable_count)) {
-		task_unlock(p);
-		return 0;
-	}
+	points = p->mm->total_vm;
+	task_unlock(p);
 
 	/*
-	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
-	 * priority for oom killing.
+	 * swapoff can easily use up all memory, so kill those first.
 	 */
-	if (p->flags & PF_OOM_ORIGIN) {
-		task_unlock(p);
-		return 1000;
-	}
+	if (p->flags & PF_OOM_ORIGIN)
+		return ULONG_MAX;
 
 	/*
-	 * The memory controller may have a limit of 0 bytes, so avoid a divide
-	 * by zero, if necessary.
+	 * Processes which fork a lot of child processes are likely
+	 * a good choice. We add half the vmsize of the children if they
+	 * have an own mm. This prevents forking servers to flood the
+	 * machine with an endless amount of children. In case a single
+	 * child is eating the vast majority of memory, adding only half
+	 * to the parents will make the child our kill candidate of choice.
 	 */
-	if (!totalpages)
-		totalpages = 1;
+	t = p;
+	do {
+		list_for_each_entry(c, &t->children, sibling) {
+			child = find_lock_task_mm(c);
+			if (child) {
+				if (child->mm != p->mm)
+					points += child->mm->total_vm/2 + 1;
+				task_unlock(child);
+			}
+		}
+	} while_each_thread(p, t);
 
 	/*
-	 * The baseline for the badness score is the proportion of RAM that each
-	 * task's rss and swap space use.
+	 * CPU time is in tens of seconds and run time is in thousands
+         * of seconds. There is no particular reason for this other than
+         * that it turned out to work very well in practice.
 	 */
-	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
-			totalpages;
-	task_unlock(p);
+	thread_group_cputime(p, &task_time);
+	utime = cputime_to_jiffies(task_time.utime);
+	stime = cputime_to_jiffies(task_time.stime);
+	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
+
+
+	if (uptime >= p->start_time.tv_sec)
+		run_time = (uptime - p->start_time.tv_sec) >> 10;
+	else
+		run_time = 0;
+
+	if (cpu_time)
+		points /= int_sqrt(cpu_time);
+	if (run_time)
+		points /= int_sqrt(int_sqrt(run_time));
 
 	/*
-	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-	 * implementation used by LSMs.
+	 * Niced processes are most likely less important, so double
+	 * their badness points.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
-		points -= 30;
+	if (task_nice(p) > 0)
+		points *= 2;
 
 	/*
-	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
-	 * either completely disable oom killing or always prefer a certain
-	 * task.
+	 * Superuser processes are usually more important, so we make it
+	 * less likely that we kill those.
 	 */
-	points += p->signal->oom_score_adj;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
+		points /= 4;
 
 	/*
-	 * Never return 0 for an eligible task that may be killed since it's
-	 * possible that no single user task uses more than 0.1% of memory and
-	 * no single admin tasks uses more than 3.0%.
+	 * We don't want to kill a process with direct hardware access.
+	 * Not only could that mess up the hardware, but usually users
+	 * tend to only have this flag set on applications they think
+	 * of as important.
 	 */
-	if (points <= 0)
-		return 1;
-	return (points < 1000) ? points : 1000;
+	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
+		points /= 4;
+
+	/*
+	 * Adjust the score by oom_adj.
+	 */
+	if (oom_adj) {
+		if (oom_adj > 0) {
+			if (!points)
+				points = 1;
+			points <<= oom_adj;
+		} else
+			points >>= -(oom_adj);
+	}
+
+#ifdef DEBUG
+	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
+	p->pid, p->comm, points);
+#endif
+	return points;
 }
 
 /*
@@ -224,20 +278,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+				    gfp_t gfp_mask, nodemask_t *nodemask)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	bool cpuset_limited = false;
-	int nid;
-
-	/* Default to all available memory */
-	*totalpages = totalram_pages + total_swap_pages;
 
-	if (!zonelist)
-		return CONSTRAINT_NONE;
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
 	 * to kill current.We have to random task kill in this case.
@@ -247,37 +293,26 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 		return CONSTRAINT_NONE;
 
 	/*
-	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
-	 * the page allocator means a mempolicy is in effect.  Cpuset policy
-	 * is enforced in get_page_from_freelist().
+	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
+	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+	 * feature. mempolicy is an only user of nodemask here.
+	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
-		*totalpages = total_swap_pages;
-		for_each_node_mask(nid, *nodemask)
-			*totalpages += node_spanned_pages(nid);
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
 		return CONSTRAINT_MEMORY_POLICY;
-	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			cpuset_limited = true;
+			return CONSTRAINT_CPUSET;
 
-	if (cpuset_limited) {
-		*totalpages = total_swap_pages;
-		for_each_node_mask(nid, cpuset_current_mems_allowed)
-			*totalpages += node_spanned_pages(nid);
-		return CONSTRAINT_CPUSET;
-	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+				gfp_t gfp_mask, nodemask_t *nodemask)
 {
-	*totalpages = totalram_pages + total_swap_pages;
 	return CONSTRAINT_NONE;
 }
 #endif
@@ -288,16 +323,17 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned int *ppoints,
-		unsigned long totalpages, struct mem_cgroup *mem,
-		const nodemask_t *nodemask)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+		struct mem_cgroup *mem, const nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
+	struct timespec uptime;
 	*ppoints = 0;
 
+	do_posix_clock_monotonic_gettime(&uptime);
 	for_each_process(p) {
-		unsigned int points;
+		unsigned long points;
 
 		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
@@ -329,11 +365,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = 1000;
+			*ppoints = ULONG_MAX;
 		}
 
-		points = oom_badness(p, mem, nodemask, totalpages);
-		if (points > *ppoints) {
+		points = badness(p, mem, nodemask, uptime.tv_sec);
+		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
 		}
@@ -345,24 +381,27 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 /**
  * dump_tasks - dump current memory state of all system tasks
  * @mem: current's memory controller, if constrained
- * @nodemask: nodemask passed to page allocator for mempolicy ooms
  *
- * Dumps the current memory state of all eligible tasks.  Tasks not in the same
- * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
- * are not shown.
+ * Dumps the current memory state of all system tasks, excluding kernel threads.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * value, oom_score_adj value, and name.
+ * score, and name.
+ *
+ * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
+ * shown.
  *
  * Call with tasklist_lock read-locked.
  */
-static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static void dump_tasks(const struct mem_cgroup *mem)
 {
 	struct task_struct *p;
 	struct task_struct *task;
 
-	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
+	       "name\n");
 	for_each_process(p) {
-		if (oom_unkillable_task(p, mem, nodemask))
+		if (p->flags & PF_KTHREAD)
+			continue;
+		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
 
 		task = find_lock_task_mm(p);
@@ -375,69 +414,43 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 			continue;
 		}
 
-		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d %s\n",
 			task->pid, task_uid(task), task->tgid,
 			task->mm->total_vm, get_mm_rss(task->mm),
-			task_cpu(task), task->signal->oom_adj,
-			task->signal->oom_score_adj, task->comm);
+			task_cpu(task), task->signal->oom_adj, task->comm);
 		task_unlock(task);
 	}
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
-			struct mem_cgroup *mem, const nodemask_t *nodemask)
+							struct mem_cgroup *mem)
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d, oom_score_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj,
-		current->signal->oom_score_adj);
+		"oom_adj=%d\n",
+		current->comm, gfp_mask, order, current->signal->oom_adj);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
 	dump_stack();
 	mem_cgroup_print_oom_info(mem, p);
 	show_mem();
 	if (sysctl_oom_dump_tasks)
-		dump_tasks(mem, nodemask);
+		dump_tasks(mem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 {
-	struct task_struct *q;
-	struct mm_struct *mm;
-
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 1;
 
-	/* mm cannot be safely dereferenced after task_unlock(p) */
-	mm = p->mm;
-
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(p), p->comm, K(p->mm->total_vm),
 		K(get_mm_counter(p->mm, MM_ANONPAGES)),
 		K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
-	/*
-	 * Kill all processes sharing p->mm in other thread groups, if any.
-	 * They don't get access to memory reserves or a higher scheduler
-	 * priority, though, to avoid depletion of all memory or task
-	 * starvation.  This prevents mm->mmap_sem livelock when an oom killed
-	 * task cannot exit because it requires the semaphore and its contended
-	 * by another thread trying to allocate memory itself.  That thread will
-	 * now get access to memory reserves since it has a pending fatal
-	 * signal.
-	 */
-	for_each_process(q)
-		if (q->mm == mm && !same_thread_group(q, p)) {
-			task_lock(q);	/* Protect ->comm from prctl() */
-			pr_err("Kill process %d (%s) sharing same memory\n",
-				task_pid_nr(q), q->comm);
-			task_unlock(q);
-			force_sig(SIGKILL, q);
-		}
 
 	set_tsk_thread_flag(p, TIF_MEMDIE);
 	force_sig(SIGKILL, p);
@@ -454,17 +467,17 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 #undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned int points, unsigned long totalpages,
-			    struct mem_cgroup *mem, nodemask_t *nodemask,
-			    const char *message)
+			    unsigned long points, struct mem_cgroup *mem,
+			    nodemask_t *nodemask, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
-		dump_header(p, gfp_mask, order, mem, nodemask);
+		dump_header(p, gfp_mask, order, mem);
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -477,7 +490,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}
 
 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);
 
@@ -487,15 +500,14 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	do_posix_clock_monotonic_gettime(&uptime);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
+			unsigned long child_points;
 
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, mem, nodemask,
-								totalpages);
+			/* badness() returns 0 if the thread is unkillable */
+			child_points = badness(child, mem, nodemask,
+					       uptime.tv_sec);
 			if (child_points > victim_points) {
 				victim = child;
 				victim_points = child_points;
@@ -510,7 +522,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
 static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
-				int order, const nodemask_t *nodemask)
+				int order)
 {
 	if (likely(!sysctl_panic_on_oom))
 		return;
@@ -524,7 +536,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 			return;
 	}
 	read_lock(&tasklist_lock);
-	dump_header(NULL, gfp_mask, order, NULL, nodemask);
+	dump_header(NULL, gfp_mask, order, NULL);
 	read_unlock(&tasklist_lock);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
@@ -533,19 +545,17 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
-	unsigned long limit;
-	unsigned int points = 0;
+	unsigned long points = 0;
 	struct task_struct *p;
 
-	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
-	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
+	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, limit, mem, NULL);
+	p = select_bad_process(&points, mem, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -669,11 +679,9 @@ static void clear_system_oom(void)
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
-	const nodemask_t *mpol_mask;
 	struct task_struct *p;
-	unsigned long totalpages;
 	unsigned long freed = 0;
-	unsigned int points;
+	unsigned long points;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;
 
@@ -697,40 +705,41 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
-						&totalpages);
-	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
-	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
+	if (zonelist)
+		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	check_panic_on_oom(constraint, gfp_mask, order);
 
 	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
-	    current->mm && !atomic_read(&current->mm->oom_disable_count)) {
+	    (current->signal->oom_adj != OOM_DISABLE)) {
 		/*
 		 * oom_kill_process() needs tasklist_lock held.  If it returns
 		 * non-zero, current could not be killed so we must fallback to
 		 * the tasklist scan.
 		 */
-		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
-				NULL, nodemask,
+		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+				nodemask,
 				"Out of memory (oom_kill_allocating_task)"))
 			goto out;
 	}
 
 retry:
-	p = select_bad_process(&points, totalpages, NULL, mpol_mask);
+	p = select_bad_process(&points, NULL,
+			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+								 NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
+		dump_header(NULL, gfp_mask, order, NULL);
 		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
-				nodemask, "Out of memory"))
+	if (oom_kill_process(p, gfp_mask, order, points, NULL, nodemask,
+			     "Out of memory"))
 		goto retry;
 	killed = 1;
 out:
-- 
1.6.5.2
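
For readers following the restored badness() hunk above, the CPU-time and run-time scaling can be sketched in userspace as below. This is only an illustration under stated assumptions: isqrt() and scale_by_time() are hypothetical names standing in for the kernel's int_sqrt() and the inline code in the patch, and the inputs are plain seconds rather than jiffies and timespec values.

```c
/*
 * Hypothetical stand-in for the kernel's int_sqrt(): largest r with
 * r * r <= x. The division form avoids overflow for large x.
 */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) <= x / (r + 1))
		r++;
	return r;
}

/*
 * Hypothetical helper mirroring the restored heuristic: CPU time is
 * counted in units of roughly ten seconds (>> 3) and run time in units
 * of roughly a thousand seconds (>> 10); each non-zero value shrinks
 * the score.
 */
static unsigned long scale_by_time(unsigned long points,
				   unsigned long cpu_seconds,
				   unsigned long run_seconds)
{
	unsigned long cpu_time = cpu_seconds >> 3;
	unsigned long run_time = run_seconds >> 10;

	if (cpu_time)
		points /= isqrt(cpu_time);
	if (run_time)
		points /= isqrt(isqrt(run_time));
	return points;
}
```

Under these assumptions, a task with 128 seconds of CPU time has its score divided by 4, and one that has been running for 82944 seconds has it divided by a further 3, so long-running, CPU-heavy tasks become less likely victims.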




^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-14  5:07 [PATCH] Revert oom rewrite series KOSAKI Motohiro
@ 2010-11-14 19:32 ` Linus Torvalds
  2010-11-15  0:54   ` KOSAKI Motohiro
  2010-11-23 23:51   ` KOSAKI Motohiro
  2010-11-14 21:58 ` David Rientjes
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2010-11-14 19:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, David Rientjes, Andrew Morton, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

2010/11/13 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>
> Please apply this. this patch revert commits of oom changes since v2.6.35.

I'm not getting involved in this whole flame-war. You need to convince
Andrew, who has been the person everything went through.

                    Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-14  5:07 [PATCH] Revert oom rewrite series KOSAKI Motohiro
  2010-11-14 19:32 ` Linus Torvalds
@ 2010-11-14 21:58 ` David Rientjes
  2010-11-15 23:33   ` Bodo Eggert
  1 sibling, 1 reply; 37+ messages in thread
From: David Rientjes @ 2010-11-14 21:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Linus Torvalds, Andrew Morton, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:

> Linus,
> 
> Please apply this. this patch revert commits of oom changes since v2.6.35.
> 
> briefly says, "oom: badness heuristic rewrite" was merges by mistaken.
> It haven't been passed our design nor code review. then multiple bug reports
> has been popped up. I believe evey patches should pass a usecase and a code
> review :-/
> 

That's inaccurate, there haven't been multiple bug reports popping up 
since the rewrite; in fact, there hasn't been a single bug report.

There have been two changes to the oom killer since the rewrite:

 - we now kill all threads that share the ->mm of the oom killed task,
   since we can't free any memory without them exiting as well, and

 - we count threads that are immune from oom kill attached to an ->mm so 
   we can avoid needlessly killing tasks that aren't immune themselves but 
   have other threads sharing the ->mm that are.

Both of those changes were needed in the old oom killer as well, they have 
nothing to do with the rewrite.

Also, stating that the new heuristic doesn't address CAP_SYS_RESOURCE
appropriately isn't a bug report, it's the desired behavior.  I eliminated
all of the arbitrary heuristics in the old heuristic that we had to
remove internally as well so that it is as predictable as possible and achieves
the oom killer's sole goal: to kill the most memory-hogging task that is 
eligible to allow memory allocations in the current context to succeed.  
CAP_SYS_RESOURCE threads have full control over their oom killing priority 
by /proc/pid/oom_score_adj and need no consideration in the heuristic by 
default since it otherwise allows for the probability that multiple tasks 
will need to be killed when a CAP_SYS_RESOURCE thread uses an egregious 
amount of memory.

> The problem is, DavidR patches don't refrect real world usecase at all
> and breaking them. He can talk about the userland is wrong. but such
> excuse doesn't solve real world issue. it makes no sense.
> 

As mentioned just a few minutes ago in another thread, there is no 
userspace breakage with the rewrite and you're only complaining here about 
the deprecation of /proc/pid/oom_adj for a period of two years.  Until 
it's removed in 2012 or later, it maps to the linear scale that 
oom_score_adj uses rather than its old exponential scale that was 
unusable for prioritization because of (1) the extremely low resolution, 
and (2) the arbitrary heuristics that preceded it.
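
The linear mapping mentioned above can be sketched as follows. This is a minimal illustration, assuming the constants and scaling match what 2.6.36 uses when a value is written to /proc/pid/oom_adj; the helper name oom_adj_to_score_adj() is hypothetical and does not exist in the kernel.

```c
/* Constants assumed to match include/linux/oom.h in 2.6.36. */
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MAX	1000

/*
 * Hypothetical helper: scale a legacy oom_adj value in [-17, 15]
 * onto the oom_score_adj range [-1000, 1000]. The maximum is
 * special-cased so that oom_adj == 15 still means "always prefer".
 */
static int oom_adj_to_score_adj(int oom_adj)
{
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

Under this assumption each oom_adj step is worth roughly 59 points, i.e. about 5.9% of the memory the task is allowed to use, instead of a doubling or halving of the score.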

You've proposed various forms of your revert (this is the fifth one) and 
I've responded in a very respectful and technical way each time even 
though you have repeatedly called me stupid.  Linus is under the 
impression that this is some kind of flamewar when in reality it's only a 
desperate attempt of yours to start one, this kind of thing just really 
bounces off of me on a personal level.  I will, however, continue to 
remain professional.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-14 19:32 ` Linus Torvalds
@ 2010-11-15  0:54   ` KOSAKI Motohiro
  2010-11-15  2:19     ` Andrew Morton
  2010-11-23 23:51   ` KOSAKI Motohiro
  1 sibling, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-15  0:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, LKML, David Rientjes, Andrew Morton, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

> 2010/11/13 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> >
> > Please apply this. this patch revert commits of oom changes since v2.6.35.
> 
> I'm not getting involved in this whole flame-war. You need to convince
> Andrew, who has been the person everything went through.

I wonder why he is keeping such a deep silence. But _I_ strongly don't want to
ignore bug reports and userland complaints. I hope to fix any bug as far as my
development time allows.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15  0:54   ` KOSAKI Motohiro
@ 2010-11-15  2:19     ` Andrew Morton
       [not found]       ` <AANLkTik_SDaiu2eQsJ9+4ywLR5K5V1Od-hwop6gwas3F@mail.gmail.com>
  2010-11-15  6:57       ` KOSAKI Motohiro
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2010-11-15  2:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, LKML, David Rientjes, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Mon, 15 Nov 2010 09:54:14 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > 2010/11/13 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> > >
> > > Please apply this. this patch revert commits of oom changes since v2.6.35.
> > 
> > I'm not getting involved in this whole flame-war. You need to convince
> > Andrew, who has been the person everything went through.
> 
> I wonder why he is keeping such a deep silence.

Nothing to say, really.  Seems each time we're told about a bug or a
regression, David either fixes the bug or points out why it wasn't a
bug or why it wasn't a regression or how it was a deliberate behaviour
change for the better.

I just haven't seen any solid reason to be concerned about the state of
the current oom-killer, sorry.

I'm concerned that you're concerned!  A lot.  When someone such as
yourself is unhappy with part of MM then I sit up and pay attention. 
But after all this time I simply don't understand the technical issues
which you're seeing here.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
       [not found]       ` <AANLkTik_SDaiu2eQsJ9+4ywLR5K5V1Od-hwop6gwas3F@mail.gmail.com>
@ 2010-11-15  4:41         ` Figo.zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Figo.zhang @ 2010-11-15  4:41 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes
  Cc: figo zhang, KOSAKI Motohiro, Linus Torvalds, LKML, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, linux-mm

 >Nothing to say, really.  Seems each time we're told about a bug or a
 >regression, David either fixes the bug or points out why it wasn't a
 >bug or why it wasn't a regression or how it was a deliberate behaviour
 >change for the better.

 >I just haven't seen any solid reason to be concerned about the state of
 >the current oom-killer, sorry.

 >I'm concerned that you're concerned!  A lot.  When someone such as
 >yourself is unhappy with part of MM then I sit up and pay attention.
 >But after all this time I simply don't understand the technical issues
 >which you're seeing here.

Let's just talk about the oom-killer's technical issues.

I doubt the value of a new rewrite when the author cannot provide any evidence
or experimental results. Why did you do that? What is the prominent change
in your new algorithm?

As KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with
ZERO explanation".

David just said to please use the userspace tunable oom_score_adj for
protection. But may I ask some questions:


1. What is the innovation in your new algorithm? The old one offered the
same kind of user tunable, oom_adj.

2. If a server such as a db-server or financial-server has many important
processes (such as root or hardware access processes) that want protection,
you leave it to the administrator to find out which processes should be
protected. You will drive the financial-server administrator crazy, and
lose so much money!! ^~^

3. I saw your email on LKML, where you just said
"I have repeatedly said that the oom killer no longer kills KDE when run
on my desktop in the presence of a memory hogging task that was written
specifically to oom the machine."
http://thread.gmane.org/gmane.linux.kernel.mm/48998


So you only tested your new oom_killer algorithm on your desktop with KDE.
Have you provided the details of how you did the test? Can anyone repeat
the experiment and get the same result as in your comment?


As KOSAKI Motohiro said, in the real world there are at least 5-6 kinds of
workloads: brain simulation, embedded, desktop, web server, db server, hpc,
finance. Different workloads certainly make a big impact. Have you done those
experiments?


I think that technology should be based on experiment, not on imagination.


Best,
Figo.zhang

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15  2:19     ` Andrew Morton
       [not found]       ` <AANLkTik_SDaiu2eQsJ9+4ywLR5K5V1Od-hwop6gwas3F@mail.gmail.com>
@ 2010-11-15  6:57       ` KOSAKI Motohiro
  2010-11-15 10:34         ` David Rientjes
  2010-11-23  7:16         ` KOSAKI Motohiro
  1 sibling, 2 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-15  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Linus Torvalds, LKML, David Rientjes, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

> On Mon, 15 Nov 2010 09:54:14 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > 2010/11/13 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> > > >
> > > > Please apply this. this patch revert commits of oom changes since v2.6.35.
> > > 
> > > I'm not getting involved in this whole flame-war. You need to convince
> > > Andrew, who has been the person everything went through.
> > 
> > I wonder why he is keeping such a deep silence.
> 
> Nothing to say, really.  Seems each time we're told about a bug or a
> regression, David either fixes the bug or points out why it wasn't a
> bug or why it wasn't a regression or how it was a deliberate behaviour
> change for the better.

Of course, I disagree. He seems to think the number of emails is more meaningful
than what they say, but that's incorrect and makes no sense. Why not? Also, he
has to argue logically. "Hey, I think it's not a bug" makes no sense.
Such a claim doesn't solve anything. Userland is still unhappy. Why not?
I want quick action.

I would like to suggest that they join and contribute to a distro kernel
maintenance team. Many community-based distributions welcome developers.
And bugfix work would teach them a lot: which usecases are frequently used,
which bug reports are frequently raised, etc.

That said, if anyone wants to change a userland ABI, be careful. They have
to investigate userland usecases carefully and, again, carefully avoid breaking
them. If someone thinks "hey, it's no big matter, rewriting userland can solve
the issue", I strongly disagree. They don't understand why forcing all userland
applications to be rewritten is harmful.



> I just haven't seen any solid reason to be concerned about the state of
> the current oom-killer, sorry.

You can't say "I haven't seen". I always Cc'ed you.


> I'm concerned that you're concerned!  A lot.  When someone such as
> yourself is unhappy with part of MM then I sit up and pay attention. 
> But after all this time I simply don't understand the technical issues
> which you're seeing here.

You should have read the patch descriptions and e-mails which I sent.


1) About two months ago, Dave Hansen observed a strange OOM issue because he
   has a big machine and none of its processes is very big. Thus, eventually
   all processes got oom-score=0 and the oom-killer didn't work.

   https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383

   DavidR changed the oom-score to +1 in such a situation.

   http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455

   But that is completely bogus. If all processes have score=1, the oom-killer
   falls back to a purely random killer. I expected this problem and explained
   that his patch has it half a year ago, but he hasn't fixed it yet.

2) Also half a year ago, I explained that oom_adj is used by multiple
   applications, and we can't break them. But DavidR didn't fix it.

3) Also about four months ago, kamezawa-san and I pointed out that his patch
   doesn't work on memcg. That also hasn't been fixed.


On the other hand, you can't explain what worth the OOM rewrite patch has,
because there is none. It is only "powerful"(TM) for Google, but
it has zero worth for everyone else. This is just a technical
issue. Bah.



And I just don't understand why some people try to remove or obsolete
oom_adj. It's just eight lines of code and it's used by multiple applications.
There is no reason to break userland at all.
--------------------------------------------------------
	/*
	 * Adjust the score by oom_adj.
	 */
	if (oom_adj) {
		if (oom_adj > 0) {
			if (!points)
				points = 1;
			points <<= oom_adj;
		} else
			points >>= -(oom_adj);
	}
--------------------------------------------------------
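
To make the behaviour of that snippet concrete, here is a standalone sketch of the same adjustment step. It is an illustration only: apply_oom_adj() is a hypothetical wrapper, not a kernel function, and it assumes oom_adj stays within its documented range of -16 to +15.

```c
/*
 * Hypothetical userspace copy of the oom_adj adjustment: each positive
 * step doubles the badness score, each negative step halves it.
 */
static unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
	if (oom_adj > 0) {
		/* A zero score would stay zero when shifted, so bump it. */
		if (!points)
			points = 1;
		points <<= oom_adj;
	} else if (oom_adj < 0) {
		points >>= -oom_adj;
	}
	return points;
}
```

With a starting score of 1000, an oom_adj of +1 gives 2000, -1 gives 500, and -10 already shifts the score down to 0, which shows how coarse the exponential scale is.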


If you still have questions, please ask me. Maybe I can answer all of
your questions.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15  6:57       ` KOSAKI Motohiro
@ 2010-11-15 10:34         ` David Rientjes
  2010-11-15 23:31           ` Jesper Juhl
  2010-11-23  7:16           ` KOSAKI Motohiro
  2010-11-23  7:16         ` KOSAKI Motohiro
  1 sibling, 2 replies; 37+ messages in thread
From: David Rientjes @ 2010-11-15 10:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Mon, 15 Nov 2010, KOSAKI Motohiro wrote:

> Of course, I disagree. He seems to think the number of emails is more meaningful
> than what they say, but that's incorrect and makes no sense. Why not? Also, he
> has to argue logically. "Hey, I think it's not a bug" makes no sense.
> Such a claim doesn't solve anything. Userland is still unhappy. Why not?
> I want quick action.
> 

If there are pending complaints or bugs that I haven't addressed, please 
bring them to my attention.  To date, I know of no issues that have been 
raised that I have not addressed; you're always free to disagree with my 
position, but in the end you may find that when the kernel moves in a 
different direction that you should begin to accept it.

> That said, if anyone wants to change a userland ABI, be careful. They have
> to investigate userland usecases carefully and, again, carefully avoid breaking
> them. If someone thinks "hey, it's no big matter, rewriting userland can solve
> the issue", I strongly disagree. They don't understand why forcing all userland
> applications to be rewritten is harmful.
> 

You may remember that the initial version of my rewrite replaced oom_adj 
entirely with the new oom_score_adj semantics.  Others suggested that it 
be separated into a new tunable and the old tunable deprecated for a 
lengthy period of time.  I accepted that criticism and understood the 
drawbacks of replacing the tunable immediately and followed those 
suggestions.  I disagree with you that the deprecation of oom_adj for a 
period of two years is as dramatic as you imply and I disagree that users 
are experiencing problems with the linear scale that it now operates on 
versus the old exponential scale.

> 1) About two months ago, Dave Hansen observed a strange OOM issue because he
>    has a big machine and none of its processes is very big. Thus, eventually
>    all processes got oom-score=0 and the oom-killer didn't work.
> 
>    https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
> 
>    DavidR changed the oom-score to +1 in such a situation.
> 
>    http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
> 
>    But that is completely bogus. If all processes have score=1, the oom-killer
>    falls back to a purely random killer. I expected this problem and explained
>    that his patch has it half a year ago, but he hasn't fixed it yet.
> 

The resolution with which the oom killer considers memory is at 0.1% of 
system RAM at its highest (smaller when you have a memory controller, 
cpuset, or mempolicy constrained oom).  It considers a task within 0.1% of 
memory of another task to have equal "badness" to kill, we don't break 
ties in between that resolution -- it all depends on which one shows up in 
the tasklist first.  If you disagree with that resolution, which I support 
as being high enough, then you may certainly propose a patch to make it 
even finer at 0.01%, 0.001%, etc.  It would only change oom_badness() to 
range between [0,10000], [0,100000], etc.
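
A minimal sketch of that resolution argument, written in Python rather than
kernel C. It is a simplified model and not the actual oom_badness(): the
function name `badness`, the 16 GiB machine, and the task sizes are assumptions
chosen for illustration, and real scoring also accounts for swap,
oom_score_adj, and the constraint in effect.

```python
def badness(task_bytes, total_bytes, scale=1000):
    """Score a task as its share of memory, in units of 1/scale.

    scale=1000 models a resolution of 0.1% of RAM; scale=10000 models 0.01%.
    """
    return task_bytes * scale // total_bytes


MIB = 1024 ** 2
TOTAL = 16 * 1024 ** 3          # hypothetical machine with 16 GiB of RAM

# Two tasks whose sizes differ by less than 0.1% of RAM.
score_a = badness(400 * MIB, TOTAL)
score_b = badness(405 * MIB, TOTAL)

# The same two tasks at a tenfold finer resolution.
fine_a = badness(400 * MIB, TOTAL, scale=10000)
fine_b = badness(405 * MIB, TOTAL, scale=10000)
```

In this model both tasks score 24 at the default scale, so the one found first
in the tasklist would be picked; at a scale of 10000 they score 244 and 247
and the tie disappears.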

> 2) Also half years ago, I did explained oom_adj is used from multiple 
>    applications. And we can't break them. But DavidR didn't fix.
> 

And we didn't.  oom_adj is still there and maps linearly to oom_score_adj; 
you just can't show a single application where that mapping breaks because 
it was based on an actual calculation.
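
For readers who want to see what a linear mapping of this kind looks like,
here is a rough Python sketch. The constants (-17, 15, 1000) and the
truncating division are my recollection of the 2.6.36-era write handler for
oom_adj and should be treated as assumptions, not as a quotation of the
kernel source; the function name is hypothetical.

```python
OOM_DISABLE = -17          # lowest oom_adj value: never kill this task
OOM_ADJUST_MAX = 15        # highest oom_adj value
OOM_SCORE_ADJ_MAX = 1000   # highest oom_score_adj value


def oom_adj_to_score_adj(oom_adj):
    """Map an oom_adj value in [-17, 15] onto the [-1000, 1000] scale."""
    if not OOM_DISABLE <= oom_adj <= OOM_ADJUST_MAX:
        raise ValueError("oom_adj out of range")
    # Special-case the top value so the maximum stays attainable.
    if oom_adj == OOM_ADJUST_MAX:
        return OOM_SCORE_ADJ_MAX
    # Emulate C integer division, which truncates toward zero.
    scaled = oom_adj * OOM_SCORE_ADJ_MAX
    quotient = abs(scaled) // -OOM_DISABLE
    return quotient if scaled >= 0 else -quotient
```

Under these assumptions, -17 maps to -1000, 0 maps to 0, 8 maps to 470, and
15 maps to 1000.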

If you would like to cite these "multiple" applications that need to be 
converted to use oom_score_adj (I know of udev), please let me know and 
if they're open-source applications then I will commit to submitting 
patches for them myself.  I believe the two year window is sufficient for 
everyone else, though.

> 3) Also about four month ago, I and kamezawa-san pointed out his patch
>    don't work on memcg. It also haven't been fixed.
> 

I don't know what you're referring to here, sorry.

> In the other hand, You can't explain what worth OOM-rewritten patch has. 
> Because there is nothing. It is only "powerful"(TM) for Google. but 
> instead It has zero worth for every other people. Here is just technical 
> issue. Bah.
> 

Please see my reply to Figo.zhang where I enumerate the four reasons why 
the new userspace tunable is more powerful than oom_adj.


At this point, I can only speculate that your distaste for the new oom 
killer is one of disposition; it seems like every time you reply to an
email (or, more regularly, just repost your revert) that you come into it 
with the attitude that my response cannot possibly be correct and that the 
way you see things is exactly as they should be.  If you were to consider 
other people's opinions, however, you may find some common ground that can 
be met.  I certainly did that when I introduced oom_score_adj instead of 
replacing oom_adj immediately.  I also did it when I removed the forkbomb
detector from the rewrite.  I also did it when considering swap in the 
heuristic when it initially was only rss.  Andrew is in the position where 
he has to make a judgment call on what should be included and what 
shouldn't and it should be pretty darn clear after you post your revert 
the first time, then the second time, then the third time, then the fourth 
time, and now the fifth time.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15 10:34         ` David Rientjes
@ 2010-11-15 23:31           ` Jesper Juhl
  2010-11-16  0:06             ` David Rientjes
  2010-11-16  0:13             ` Valdis.Kletnieks
  2010-11-23  7:16           ` KOSAKI Motohiro
  1 sibling, 2 replies; 37+ messages in thread
From: Jesper Juhl @ 2010-11-15 23:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, Linus Torvalds, LKML, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

On Mon, 15 Nov 2010, David Rientjes wrote:

[...]
> If you would like to cite these "multiple" applications that need to be 
> converted to use oom_score_adj (I know of udev), please let me know and 
> if they're open-source applications then I will commit to submitting 
> patches for them myself.  I believe the two year window is sufficient for 
> everyone else, though.
[...]

I'm not going into the debate about whether or not deprecating one tunable 
for two years is sufficient or not. I'm simply going to mention one app 
that I know of that needs to be converted to use "oom_score_adj" on my 
box :

[jj@dragon ~]$ uname -a
Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
[jj@dragon ~]$ dmesg | grep oom_adj
start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
[jj@dragon ~]$ /usr/lib/kde4/libexec/start_kdeinit --version

Qt: 4.7.1
KDE: 4.5.3 (KDE 4.5.3)



-- 
Jesper Juhl <jj@chaosbits.net>            http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-14 21:58 ` David Rientjes
@ 2010-11-15 23:33   ` Bodo Eggert
  2010-11-15 23:50     ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: Bodo Eggert @ 2010-11-15 23:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, Linus Torvalds, Andrew Morton, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

On Sun, 14 Nov 2010, David Rientjes wrote:

> Also, stating that the new heuristic doesn't address CAP_SYS_RESOURCE
> appropriately isn't a bug report, it's the desired behavior.  I eliminated
> all of the arbitrary heuristics in the old heuristic that we had to
> remove internally as well so that it is as predictable as possible and achieves
> the oom killer's sole goal: to kill the most memory-hogging task that is
> eligible to allow memory allocations in the current context to succeed.

> CAP_SYS_RESOURCE threads have full control over their oom killing priority
> by /proc/pid/oom_score_adj

, but unless they are written in the last months and designed for linux
and if the author took some time to research each external process 
invocation, they can not be aware of this possibility.

Besides that, if each process is supposed to change the default, the 
default is wrong.

> and need no consideration in the heuristic by
> default since it otherwise allows for the probability that multiple tasks
> will need to be killed when a CAP_SYS_RESOURCE thread uses an egregious
> amount of memory.

If it happens to use an egregious amount of memory, it SHOULD score
enough to get killed.

>> The problem is, DavidR patches don't refrect real world usecase at all
>> and breaking them. He can talk about the userland is wrong. but such
>> excuse doesn't solve real world issue. it makes no sense.
>
> As mentioned just a few minutes ago in another thread, there is no
> userspace breakage with the rewrite and you're only complaining here about
> the deprecation of /proc/pid/oom_adj for a period of two years.  Until
> it's removed in 2012 or later, it maps to the linear scale that
> oom_score_adj uses rather than its old exponential scale that was
> unusable for prioritization because of (1) the extremely low resolution,
> and (2) the arbitrary heuristics that preceded it.

1) The exponential scale did have a low resolution.

2) The heuristics were developed using much brain power and much
    trial-and-error. You are going back to basics, and some people
    are not convinced that this is better. I googled and I did not
    find a discussion about how and why the new score was designed
    this way.
    looking at the output of:
    cd /proc; for a in [0-9]*; do
      echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
    done|grep -v ^0|sort -n |less
    , I'm not convinced, either.
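
The shell loop in point 2 can also be sketched in Python, which makes it easier
to post-process the result. This is only an illustrative equivalent: the
function name `oom_scores` and the `proc_root` parameter are hypothetical, and
it reports only the first argument of each command line, as the perl filter
does.

```python
import os


def oom_scores(proc_root="/proc"):
    """Return sorted (score, pid, argv0) tuples for tasks with a non-zero score."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue                      # skip non-process entries such as "self"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "cmdline"), "rb") as f:
                # cmdline is NUL-separated; keep only the first argument
                argv0 = f.read().split(b"\0", 1)[0].decode(errors="replace")
        except (OSError, ValueError):
            continue                      # the process exited or is unreadable
        if score != 0:
            rows.append((score, int(entry), argv0))
    return sorted(rows)
```

Calling `oom_scores()` on a live system lists the lowest scores first, like
the `sort -n` in the shell version.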

PS) Mapping an exponential value to a linear score is bad. E.g. A
     oom_adj of 8 should make an 1-MB-process as likely to kill as
     a 256-MB-process with oom_adj=0.
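
To make the arithmetic behind that example concrete, here is a small Python
sketch of an exponential adjustment, where each oom_adj step doubles or halves
the score. It is a simplified model that assumes a plain bitshift and ignores
every other heuristic; the function name is hypothetical.

```python
def exponential_adjust(points, oom_adj):
    """Shift a base score left for positive oom_adj, right for negative."""
    if oom_adj > 0:
        # A zero score would stay zero when shifted, so start from at least 1.
        return max(points, 1) << oom_adj
    return points >> -oom_adj


small_boosted = exponential_adjust(1, 8)    # 1 MB task with oom_adj=8
large_plain = exponential_adjust(256, 0)    # 256 MB task with oom_adj=0
```

In this model both values come out as 256, which is the equivalence described
in the PS above.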

PS2) Because I saw this in your presentation PDF: (@udev-people)
     The -17 score of udevd is wrong, since it will even prevent
     the OOM killer from working correctly if it grows to 100 MB:

     Its default OOM score is 13, while root's shell is at 190
     and some KDE processes are at 200 000. It will not get killed
     under normal circumstances.

     If udevd grows enough to score 190 as well, it has a bug
     that causes it to eat memory and it needs to be killed. Having
     a -17 oom_adj, it will cause the system to fail instead.
     Considering udevd's size, an adj of -1 or -2 should be enough on
     embedded systems, while desktop systems should not need it.
     If you are worried about udevd getting killed, protect it using
     a wrapper.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15 23:33   ` Bodo Eggert
@ 2010-11-15 23:50     ` David Rientjes
  2010-11-17  0:06       ` Bodo Eggert
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2010-11-15 23:50 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: KOSAKI Motohiro, LKML, Linus Torvalds, Andrew Morton, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

On Tue, 16 Nov 2010, Bodo Eggert wrote:

> > CAP_SYS_RESOURCE threads have full control over their oom killing priority
> > by /proc/pid/oom_score_adj
> 
> , but unless they are written in the last months and designed for linux
> and if the author took some time to research each external process invocation,
> they can not be aware of this possibility.
> 

You're clearly wrong, CAP_SYS_RESOURCE has been required to modify oom_adj 
for over five years (as long as the git history).  8fb4fc68, merged into 
2.6.20, allowed tasks to raise their own oom_adj but not decrease it.  
That is unchanged by the rewrite.

> Besides that, if each process is supposed to change the default, the default
> is wrong.
> 

That doesn't make any sense; if you want to protect a thread from the oom
killer you're going to need to modify oom_score_adj, the kernel can't know 
what you perceive as being vital.  Having CAP_SYS_RESOURCE alone does not 
imply that, it only allows unbounded access to resources.  That's 
completely orthogonal to the goal of the oom killer heuristic, which is to 
find the most memory-hogging task to kill.

> 1) The exponential scale did have a low resolution.
> 
> 2) The heuristics were developed using much brain power and much
>    trial-and-error. You are going back to basics, and some people
>    are not convinced that this is better. I googled and I did not
>    find a discussion about how and why the new score was designed
>    this way.
>    looking at the output of:
>    cd /proc; for a in [0-9]*; do
>      echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
>    done|grep -v ^0|sort -n |less
>    , I 'm not convinced, too.
> 

The old heuristics were a mixture of arbitrary values that didn't adjust 
scores based on a unit and would often cause the incorrect task to be 
targeted because there was no clear goal being achieved.  The new 
heuristic has a solid goal: to identify and kill the most memory-hogging 
task that is eligible given the context in which the oom occurs.  If you 
disagree with that goal and want any of the old heuristics reintroduced,
please show that it makes sense in the oom killer.

> PS) Mapping an exponential value to a linear score is bad. E.g. A
>     oom_adj of 8 should make an 1-MB-process as likely to kill as
>     a 256-MB-process with oom_adj=0.
> 

To show that, you would have to show that an application that exists today 
uses an oom_adj for something other than polarization and is based on a 
calculation of allowable memory usage.  It simply doesn't exist.

> PS2) Because I saw this in your presentation PDF: (@udev-people)
>     The -17 score of udevd is wrong, since it will even prevent
>     the OOM killer from working correctly if it grows to 100 MB:
> 

Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any 
thread they deem fit and that includes applications that lower their own
oom_score_adj.  The kernel isn't going to prohibit users from setting 
their own oom_score_adj.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15 23:31           ` Jesper Juhl
@ 2010-11-16  0:06             ` David Rientjes
  2010-11-16 10:04               ` Martin Knoblauch
  2010-11-16  0:13             ` Valdis.Kletnieks
  1 sibling, 1 reply; 37+ messages in thread
From: David Rientjes @ 2010-11-16  0:06 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: KOSAKI Motohiro, Andrew Morton, Linus Torvalds, LKML, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

On Tue, 16 Nov 2010, Jesper Juhl wrote:

> [jj@dragon ~]$ uname -a
> Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
> [jj@dragon ~]$ dmesg | grep oom_adj
> start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
> [jj@dragon ~]$ /usr/lib/kde4/libexec/start_kdeinit --version
> 
> Qt: 4.7.1
> KDE: 4.5.3 (KDE 4.5.3)
> 

Thanks for the report!  I'll get involved with kde-devel and send a patch 
to remove this dependency on newer kernels to expedite the process.

 [ Others with reports of deprecated use of oom_adj can contact me 
   privately and I'll find the parties of interest to avoid topics 
   unrelated to the kernel itself on LKML. ]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15 23:31           ` Jesper Juhl
  2010-11-16  0:06             ` David Rientjes
@ 2010-11-16  0:13             ` Valdis.Kletnieks
  2010-11-16  6:43               ` David Rientjes
                                 ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Valdis.Kletnieks @ 2010-11-16  0:13 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: David Rientjes, KOSAKI Motohiro, Andrew Morton, Linus Torvalds,
	LKML, Ying Han, Bodo Eggert, Mandeep Singh Baines, Figo.zhang

On Tue, 16 Nov 2010 00:31:00 +0100, Jesper Juhl said:

> I'm not going into the debate about whether or not deprecating one tunable 
> for two years is sufficient or not. I'm simply going to mention one app 
> that I know of that needs to be converted to use "oom_score_adj" on my 
> box :
> 
> [jj@dragon ~]$ uname -a
> Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
> [jj@dragon ~]$ dmesg | grep oom_adj
> start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.

Make that 2 common apps:

% uname -a
Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
% dmesg | grep oom
[   89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
% rpm -q openssh
openssh-5.6p1-16.fc15.x86_64

5.6p1 is the latest-n-greatest released version on www.openssh.org, so somebody
probably needs to rattle their chain...


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16  0:13             ` Valdis.Kletnieks
@ 2010-11-16  6:43               ` David Rientjes
  2010-11-16 11:03               ` Alan Cox
  2010-11-16 15:15               ` Alejandro Riveira Fernández
  2 siblings, 0 replies; 37+ messages in thread
From: David Rientjes @ 2010-11-16  6:43 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Jesper Juhl, KOSAKI Motohiro, Andrew Morton, Linus Torvalds,
	LKML, Ying Han, Bodo Eggert, Mandeep Singh Baines, Figo.zhang

On Mon, 15 Nov 2010, Valdis.Kletnieks@vt.edu wrote:

> Make that 2 common apps:
> 
> % uname -a
> Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
> % dmesg | grep oom
> [   89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
> % rpm -q openssh
> openssh-5.6p1-16.fc15.x86_64
> 
> 5.6p1 is the latest-n-greatest released version on www.openssh.org, so somebody
> probably needs to rattle their chain...
> 

Thanks, Darren Tucker fixed this a few hours after I reported it on the 
openssh bugzilla, the patch is at 
https://bugzilla.mindrot.org/show_bug.cgi?id=1838 -- it uses oom_score_adj 
if it exists and then falls back to oom_adj if running on an older kernel.
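
The fallback pattern described above can be sketched as follows. This is not
the openssh patch itself (which is written in C); it is a hedged Python
illustration in which the function names and the `proc_root` parameter are
hypothetical, and the values assume the goal is to exempt the task from the
oom killer entirely.

```python
import os

# Values that exempt a task from the oom killer on each interface.
OOM_SCORE_ADJ_MIN = "-1000"   # for /proc/<pid>/oom_score_adj
OOM_DISABLE = "-17"           # for /proc/<pid>/oom_adj (deprecated)


def pick_oom_tunable(pid="self", proc_root="/proc"):
    """Return (path, value) for the newest interface present, or None."""
    base = os.path.join(proc_root, str(pid))
    for name, value in (("oom_score_adj", OOM_SCORE_ADJ_MIN),
                        ("oom_adj", OOM_DISABLE)):
        path = os.path.join(base, name)
        if os.path.exists(path):
            return path, value
    return None


def protect_from_oom(pid="self", proc_root="/proc"):
    """Write the exempting value and return the path used, or None."""
    choice = pick_oom_tunable(pid, proc_root)
    if choice is None:
        return None
    path, value = choice
    # On a real /proc, lowering the value requires CAP_SYS_RESOURCE.
    with open(path, "w") as f:
        f.write(value)
    return path
```

Probing for the file rather than comparing kernel version numbers keeps a
single binary working on both old and new kernels, and it never touches the
deprecated file when the new one is available.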

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16  0:06             ` David Rientjes
@ 2010-11-16 10:04               ` Martin Knoblauch
  2010-11-16 10:33                 ` Alessandro Suardi
  0 siblings, 1 reply; 37+ messages in thread
From: Martin Knoblauch @ 2010-11-16 10:04 UTC (permalink / raw)
  To: David Rientjes; +Cc: LKML

CC trimmed for sanity ...

----- Original Message ----

> From: David Rientjes <rientjes@google.com>
> To: Jesper Juhl <jj@chaosbits.net>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>; Andrew Morton 
><akpm@linux-foundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; 
>LKML <linux-kernel@vger.kernel.org>; Ying Han <yinghan@google.com>; Bodo Eggert 
><7eggert@web.de>; Mandeep@yahoo.com
> Sent: Tue, November 16, 2010 1:06:21 AM
> Subject: Re: [PATCH] Revert oom rewrite series
> 
> 
> Thanks for the report!  I'll get  involved with kde-devel and send a patch 
> to remove this dependency on newer  kernels to expedite the process.
> 
>  [ Others with reports of deprecated use  of oom_adj can contact me 
>    privately and I'll find the parties of  interest to avoid topics 
>    unrelated to the kernel itself on LKML.  ]
David,

 another one for your collection. You asked for it :-) This is CentOS-5.5 
running on top of kernel 2.6.36, likely out of initrd:

$ dmesg | grep deprecated
[    2.430330] nash-hotplug (67): /proc/67/oom_adj is deprecated, please use /proc/67/oom_score_adj instead.

Cheers
Martin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16 10:04               ` Martin Knoblauch
@ 2010-11-16 10:33                 ` Alessandro Suardi
  0 siblings, 0 replies; 37+ messages in thread
From: Alessandro Suardi @ 2010-11-16 10:33 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: David Rientjes, LKML

On Tue, Nov 16, 2010 at 11:04 AM, Martin Knoblauch
<spamtrap@knobisoft.de> wrote:
> CC trimmed for sanity ...
>
> ----- Original Message ----
>
>> From: David Rientjes <rientjes@google.com>
>> To: Jesper Juhl <jj@chaosbits.net>
>> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>; Andrew Morton
>><akpm@linux-foundation.org>; Linus Torvalds <torvalds@linux-foundation.org>;
>>LKML <linux-kernel@vger.kernel.org>; Ying Han <yinghan@google.com>; Bodo Eggert
>><7eggert@web.de>; Mandeep@yahoo.com
>> Sent: Tue, November 16, 2010 1:06:21 AM
>> Subject: Re: [PATCH] Revert oom rewrite series
>>
>>
>> Thanks for the report!  I'll get  involved with kde-devel and send a patch
>> to remove this dependency on newer  kernels to expedite the process.
>>
>>  [ Others with reports of deprecated use  of oom_adj can contact me
>>    privately and I'll find the parties of  interest to avoid topics
>>    unrelated to the kernel itself on LKML.  ]
> David,
>
>  another one for your collection. You asked for it :-) This is CentOS-5.5
> running on top of kernel 2.6.36, likely out of initrd:
>
> $ dmesg | grep deprecated
> [    2.430330] nash-hotplug (67): /proc/67/oom_adj is deprecated, please use /proc/67/oom_score_adj instead.

...and another, on Fedora 14, 2.6.37-rc1-git11:

auditd (2583): /proc/2583/oom_adj is deprecated, please use /proc/2583/oom_score_adj instead.

Cheers,

--alessandro

 "There's always a siren singing you to shipwreck"

   (Radiohead, "There There")

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16  0:13             ` Valdis.Kletnieks
  2010-11-16  6:43               ` David Rientjes
@ 2010-11-16 11:03               ` Alan Cox
  2010-11-16 13:03                 ` Florian Mickler
  2010-11-16 15:15               ` Alejandro Riveira Fernández
  2 siblings, 1 reply; 37+ messages in thread
From: Alan Cox @ 2010-11-16 11:03 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Jesper Juhl, David Rientjes, KOSAKI Motohiro, Andrew Morton,
	Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

> 5.6p1 is the latest-n-greatest released version on www.openssh.org, so somebody
> probably needs to rattle their chain...

But current openssh needs to support old kernels.

This is why this kind of obsoleting doesn't work well. It's not "update
your app" so much as "drop support for older stuff or start doing
complicated crap dependent on version"

and it's why for tiny amounts of code it is the *wrong* thing to force
obsolete stuff especially when it still doesn't seem to have been
properly marked for deprecation in the first place.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16 11:03               ` Alan Cox
@ 2010-11-16 13:03                 ` Florian Mickler
  2010-11-16 14:55                   ` Alan Cox
  0 siblings, 1 reply; 37+ messages in thread
From: Florian Mickler @ 2010-11-16 13:03 UTC (permalink / raw)
  To: Alan Cox
  Cc: Valdis.Kletnieks, Jesper Juhl, David Rientjes, KOSAKI Motohiro,
	Andrew Morton, Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Tue, 16 Nov 2010 11:03:10 +0000
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > 5.6p1 is the latest-n-greatest released version on www.openssh.org, so somebody
> > probably needs to rattle their chain...
> 
> But current openssh needs to support old kernels.
> 
> This is why this kind of obsoleting doesn't work well. It's not "update
> your app" so much as "drop support for older stuff or start doing
> complicated crap dependant on version"
> 
> and it's why for tiny amounts of code it is the *wrong* thing to force
> obsolete stuff especially when it still doesn't seem to have been
> properly marked for deprecation in the first place.
> 

How does one mark it appropriately?
The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below)
added it to feature-removal-schedule.txt; a patch for
Documentation/ABI has also been provided in the meantime, if I'm not
mistaken.

And there is already a patch for openssh:
https://bugzilla.mindrot.org/show_bug.cgi?id=1838

Regards,
Flo

commit 51b1bd2ace1595b72956224deda349efa880b693
Author: David Rientjes <rientjes@google.com>
Date:   Mon Aug 9 17:19:47 2010 -0700

    oom: deprecate oom_adj tunable
    
    /proc/pid/oom_adj is now deprecated so that that it may eventually be
    removed.  The target date for removal is August 2012.
    
    A warning will be printed to the kernel log if a task attempts to use this
    interface.  Future warning will be suppressed until the kernel is rebooted
    to prevent spamming the kernel log.
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Cc: Nick Piggin <npiggin@suse.de>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Balbir Singh <balbir@in.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16 13:03                 ` Florian Mickler
@ 2010-11-16 14:55                   ` Alan Cox
  2010-11-16 20:57                     ` David Rientjes
  2010-11-17  4:04                     ` Valdis.Kletnieks
  0 siblings, 2 replies; 37+ messages in thread
From: Alan Cox @ 2010-11-16 14:55 UTC (permalink / raw)
  To: Florian Mickler
  Cc: Valdis.Kletnieks, Jesper Juhl, David Rientjes, KOSAKI Motohiro,
	Andrew Morton, Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

> How does one mark it apropriately?
> The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below) 
> added it to feature-removal-schedule.txt, a patch for
> Documentation/ABI has also been provided in the meantime, if i'm not
> mistaken. 

Yes - so why is it spewing crap, annoying users and trying to irritate
application authors? It's not 2012 yet.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16  0:13             ` Valdis.Kletnieks
  2010-11-16  6:43               ` David Rientjes
  2010-11-16 11:03               ` Alan Cox
@ 2010-11-16 15:15               ` Alejandro Riveira Fernández
  2 siblings, 0 replies; 37+ messages in thread
From: Alejandro Riveira Fernández @ 2010-11-16 15:15 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Jesper Juhl, David Rientjes, KOSAKI Motohiro, Andrew Morton,
	Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Mon, 15 Nov 2010 19:13:15 -0500
Valdis.Kletnieks@vt.edu wrote:

> On Tue, 16 Nov 2010 00:31:00 +0100, Jesper Juhl said:
> 
> > I'm not going into the debate about whether or not deprecating one tunable 
> > for two years is sufficient or not. I'm simply going to mention one app 
> > that I know of that needs to be converted to use "oom_score_adj" on my 
> > box :
> > 
> > [jj@dragon ~]$ uname -a
> > Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
> > [jj@dragon ~]$ dmesg | grep oom_adj
> > start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
> 
> Make that 2 common apps:
> 
> % uname -a
> Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
> % dmesg | grep oom
> [   89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
> % rpm -q openssh
> openssh-5.6p1-16.fc15.x86_64
> 
> 5.6p1 is the latest-n-greatest released version on www.openssh.org, so somebody
> probably needs to rattle their chain...

$ dmesg | grep deprecated
[    1.473365] udevd (662): /proc/662/oom_adj is deprecated, please use /proc/662/oom_score_adj instead.
$ apt-cache policy udev
udev:
  Installed: 151-12.3
  Candidate: 151-12.3
  Version table:
 *** 151-12.3 0
        500 http://es.archive.ubuntu.com/ubuntu/ lucid-proposed/main Packages
        100 /var/lib/dpkg/status
     151-12.2 0
        500 http://es.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     151-12 0
        500 http://es.archive.ubuntu.com/ubuntu/ lucid/main Packages
$ uname -a
Linux varda 2.6.36-00001-g90d39e9 #145 SMP PREEMPT Wed Oct 20 23:27:44 CEST 2010 x86_64 GNU/Linux

 Ubuntu 10.04 LTS

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16 14:55                   ` Alan Cox
@ 2010-11-16 20:57                     ` David Rientjes
  2010-11-16 21:01                       ` Fabio Comolli
  2010-11-17  4:04                     ` Valdis.Kletnieks
  1 sibling, 1 reply; 37+ messages in thread
From: David Rientjes @ 2010-11-16 20:57 UTC (permalink / raw)
  To: Alan Cox
  Cc: Florian Mickler, Valdis.Kletnieks, Jesper Juhl, KOSAKI Motohiro,
	Andrew Morton, Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Tue, 16 Nov 2010, Alan Cox wrote:

> Yes - so why is it spewing crap, annoying users and trying to irritate
> application authors. It's not 2012 yet.
> 

It's a WARN_ON_ONCE() so it will only spew a single line as a reminder 
that the application needs to be updated; would you prefer that to be 
suppressed until a year before removal, for example?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-16 20:57                     ` David Rientjes
@ 2010-11-16 21:01                       ` Fabio Comolli
  0 siblings, 0 replies; 37+ messages in thread
From: Fabio Comolli @ 2010-11-16 21:01 UTC (permalink / raw)
  To: David Rientjes; +Cc: LKML

[CC: list trimmed again for sanity]

Another one:

[   34.709156] chromium-browse (1439): /proc/1480/oom_adj is deprecated, please use /proc/1480/oom_score_adj instead.

2.6.37-rc2 - archlinux - package chromium-browser-ppa from AUR




On Tue, Nov 16, 2010 at 9:57 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 16 Nov 2010, Alan Cox wrote:
>
>> Yes - so why is it spewing crap, annoying users and trying to irritate
>> application authors. It's not 2012 yet.
>>
>
> It's a WARN_ON_ONCE() so it will only spew a single line as a reminder
> that the application needs to be updated; would you prefer that to be
> suppressed until a year before removal, for example?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Revert oom rewrite series
  2010-11-15 23:50     ` David Rientjes
@ 2010-11-17  0:06       ` Bodo Eggert
  2010-11-17  0:25         ` David Rientjes
  2010-11-17  0:48         ` Mandeep Singh Baines
  0 siblings, 2 replies; 37+ messages in thread
From: Bodo Eggert @ 2010-11-17  0:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Bodo Eggert, KOSAKI Motohiro, LKML, Linus Torvalds,
	Andrew Morton, Ying Han, Bodo Eggert, Mandeep Singh Baines,
	Figo.zhang

On Mon, 15 Nov 2010, David Rientjes wrote:
> On Tue, 16 Nov 2010, Bodo Eggert wrote:

> > > CAP_SYS_RESOURCE threads have full control over their oom killing priority
> > > by /proc/pid/oom_score_adj
> > 
> > , but unless they are written in the last months and designed for linux
> > and if the author took some time to research each external process invocation,
> > they can not be aware of this possibility.
> > 
> 
> You're clearly wrong, CAP_SYS_RESOURCE has been required to modify oom_adj 
> for over five years (as long as the git history).  8fb4fc68, merged into 
> 2.6.20, allowed tasks to raise their own oom_adj but not decrease it.  
> That is unchanged by the rewrite.

You are misunderstanding me. It was allowed to do this, but it did not need 
to do it yet. It was enough to be a well-written POSIX application without 
linux-specific OOM hacks for some specific kernel versions.

> > Besides that, if each process is supposed to change the default, the default
> > is wrong.
> 
> That doesn't make any sense, if want to protect a thread from the oom 
> killer you're going to need to modify oom_score_adj, the kernel can't know 
> what you perceive as being vital.  Having CAP_SYS_RESOURCE alone does not 
> imply that, it only allows unbounded access to resources.  That's 
> completely orthogonal to the goal of the oom killer heuristic, which is to 
> find the most memory-hogging task to kill.

The old oom killer's task was to guess the best victim to kill. For me, it 
did a good job (but the system kept thrashing for too long until it kicked
the offender). Looking at CAP_SYS_RESOURCE was one way to recognize 
important processes.

> > 1) The exponential scale did have a low resolution.
> > 
> > 2) The heuristics were developed using much brain power and much
> >    trial-and-error. You are going back to basics, and some people
> >    are not convinced that this is better. I googled and I did not
> >    find a discussion about how and why the new score was designed
> >    this way.
> >    looking at the output of:
> >    cd /proc; for a in [0-9]*; do
> >      echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
> >    done|grep -v ^0|sort -n |less
> >    , I 'm not convinced, too.
> > 
> 
> The old heuristics were a mixture of arbitrary values that didn't adjust 
> scores based on a unit and would often cause the incorrect task to be 
> targeted because there was no clear goal being achieved.  The new 
> heuristic has a solid goal: to identify and kill the most memory-hogging 
> task that is eligible given the context in which the oom occurs.  If you 
> disagree with that goal and want any of the old heuristics reintroduced, 
> please show that it makes sense in the oom killer.

The first old OOM killer did the same as you promise the current one does,
except for your bugfixes. That's why it killed the wrong applications and
all the heuristics were added until the complaints stopped.

Of course I have not yet tested your OOM killer; maybe it really is better.
Heuristics tend to rot and you did much work to make it right.

I don't want the old OOM killer back, but I don't want you to fall
into the same pits that the pre-old OOM killer fell into.

> > PS) Mapping an exponential value to a linear score is bad. E.g. A
> >     oom_adj of 8 should make an 1-MB-process as likely to kill as
> >     a 256-MB-process with oom_adj=0.
> > 
> 
> To show that, you would have to show that an application that exists today 
> uses an oom_adj for something other than polarization and is based on a 
> calculation of allowable memory usage.  It simply doesn't exist.

No such application should exist because the OOM killer should DTRT.
oom_adj was supposed to let the sysadmin lower his mission-critical
DB's score to be just lower than the less-important tasks, or to
point the kernel to his ever-faulty and easily-restarted browser.
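
The scale difference behind the PS above can be sketched in a few lines.
The old tunable was applied as a bit shift on the badness score
(points <<= oom_adj), while the rewrite scales oom_adj linearly onto
oom_score_adj. The linear mapping below (oom_adj * 1000 / 17) is an
assumption derived from the documented ranges, the function names are
hypothetical, and all other old heuristics are ignored:

```python
# Illustrative sketch only. Function names are hypothetical, and the linear
# mapping (oom_adj * 1000 / 17) is an assumption based on the documented
# ranges of oom_adj (-17..15) and oom_score_adj (-1000..1000).

OOM_DISABLE = -17          # special oom_adj value: never kill this task
OOM_SCORE_ADJ_MAX = 1000   # upper bound of oom_score_adj


def old_points(mem_mb: int, oom_adj: int) -> int:
    """Old scale: oom_adj shifts the memory-based score left or right."""
    if oom_adj >= 0:
        return mem_mb << oom_adj
    return mem_mb >> -oom_adj


def linear_score_adj(oom_adj: int) -> int:
    """Assumed linear mapping of oom_adj onto oom_score_adj (truncating)."""
    return int(oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE)


if __name__ == "__main__":
    # Under the old scale a 1 MB task with oom_adj=8 ties a 256 MB task.
    print(old_points(1, 8), old_points(256, 0))
    # Under the assumed linear mapping the same oom_adj becomes a fixed
    # offset on a 0..1000 scale, independent of the task's size.
    print(linear_score_adj(8))
```

Under these assumptions, the same oom_adj that used to multiply a task's
score by 256 instead adds a fixed offset of roughly 47% of allowed memory.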

> > PS2) Because I saw this in your presentation PDF: (@udev-people)
> >     The -17 score of udevd is wrong, since it will even prevent
> >     the OOM killer from working correctly if it grows to 100 MB:
> > 
> 
> Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any 
> thread they deem fit and that includes applications that lower its own 
> oom_score_adj.  The kernel isn't going to prohibit users from setting 
> their own oom_score_adj.

My point is: The udev people should not prevent the OOM killer 
unconditionally, it has an important task in case something goes wrong.
I just didn't want to start a new thread at that time of day.
-- 
How do I set my laser printer on stun?


* Re: [PATCH] Revert oom rewrite series
  2010-11-17  0:06       ` Bodo Eggert
@ 2010-11-17  0:25         ` David Rientjes
  2010-11-17  0:48         ` Mandeep Singh Baines
  1 sibling, 0 replies; 37+ messages in thread
From: David Rientjes @ 2010-11-17  0:25 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: KOSAKI Motohiro, LKML, Linus Torvalds, Andrew Morton, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

On Wed, 17 Nov 2010, Bodo Eggert wrote:

> The old oom killer's task was to guess the best victim to kill. For me, it 
> did a good job (but the system kept thrashing for too long until it kicked
> the offender). Looking at CAP_SYS_RESOURCE was one way to recognize 
> important processes.
> 

CAP_SYS_RESOURCE does not imply the task is important.

There's a problem when the kernel is oom; killing a thread that is getting 
work done is one of the most serious remedies the kernel will ever do to 
allow forward progress.  In almost all scenarios (except in some cpuset or 
memcg configurations), it's a userspace configuration issue that exhausts 
memory and the VM finds no other alternative.  CAP_SYS_RESOURCE threads 
have access to unbounded amounts of resources and thus can use an 
extremely large amount of memory very quickly and at a detriment to other 
threads that may be as important or more important.  Considering them any 
different is an unsubstantiated and undefined behavior that should not be 
considered in the heuristic _unless_ the administrator or the task itself 
tells the kernel via oom_score_adj of its priority.

> > The old heuristics were a mixture of arbitrary values that didn't adjust 
> > scores based on a unit and would often cause the incorrect task to be 
> > targeted because there was no clear goal being achieved.  The new 
> > heuristic has a solid goal: to identify and kill the most memory-hogging 
> > task that is eligible given the context in which the oom occurs.  If you 
> > disagree with that goal and want any of the old heuristics reintroduced, 
> > please show that it makes sense in the oom killer.
> 
> The first old OOM killer did the same as you promise the current one does,
> except for your bugfixes. That's why it killed the wrong applications and
> all the heuristics were added until the complaints stopped.
> 

No, the old oom killer did not always kill the application that used the 
most memory; it considered other factors with arbitrary point 
deductions such as nice level, runtime, CAP_SYS_RAWIO, CAP_SYS_RESOURCE, 
etc.  We had to remove those heuristics internally in older kernels as 
well because they would often allow a task to run away with a massive amount 
of memory because of leaks and kill everything else on the system before 
targeting the appropriate task.  At that point, the system was left with 
barely anything running and no work was getting done.

> Of course I have not yet tested your OOM killer; maybe it really is better.
> Heuristics tend to rot and you did much work to make it right.
> 
> I don't want the old OOM killer back, but I don't want you to fall
> into the same pits as the pre-old OOM killer used to do.
> 

Thanks, and that's why I'm trying to avoid additional heuristics such as 
CAP_SYS_RESOURCE where the priority is _implied_ rather than _proven_.  If 
CAP_SYS_RESOURCE were defined to mean a task is preferred to stay alive, 
then I'd have no argument; it isn't.

> > > PS) Mapping an exponential value to a linear score is bad. E.g. A
> > >     oom_adj of 8 should make an 1-MB-process as likely to kill as
> > >     a 256-MB-process with oom_adj=0.
> > > 
> > 
> > To show that, you would have to show that an application that exists today 
> > uses an oom_adj for something other than polarization and is based on a 
> > calculation of allowable memory usage.  It simply doesn't exist.
> 
> No such application should exist because the OOM killer should DTRT.
> oom_adj was supposed to let the sysadmin lower his mission-critical
> DB's score to be just lower than the less-important tasks, or to
> point the kernel to his ever-faulty and easily-restarted browser.
> 

oom_score_adj allows us to define when an application is using more 
memory than expected and is often helpful in cpuset, memcg, or mempolicy 
constrained cases as well.  We'd like to be able to say that 30% of 
available memory should be discounted from a particular task that is 
expected to use 30% more memory than others without getting preferred.  
oom_score_adj can do that, oom_adj could not.
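
As a rough illustration of the 30% discount described above, here is a
minimal sketch of a proportional score with a linear adjustment. It follows
the proc.txt wording in the patch below (0 to 1000 as a proportion of
allowed memory, plus oom_score_adj) and is not the kernel's actual
oom_badness(); the function name and the page counts are made up for the
example:

```python
# Minimal sketch, not the kernel implementation. It assumes the score is the
# task's share of allowed memory on a 0..1000 scale with oom_score_adj added
# and the result clamped to that range.

OOM_SCORE_ADJ_MIN = -1000  # documented as disabling oom killing for the task


def badness_sketch(used_pages: int, allowed_pages: int,
                   oom_score_adj: int = 0) -> int:
    """Return a hypothetical badness score in the range 0..1000."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0
    points = used_pages * 1000 // allowed_pages  # proportion of allowed memory
    points += oom_score_adj                      # linear userspace adjustment
    return max(0, min(points, 1000))


if __name__ == "__main__":
    # A task expected to use 30% more memory than its peers gets -300.
    # Using 40% of memory, it then scores like an unadjusted 10% task.
    print(badness_sketch(400, 1000, oom_score_adj=-300))
    print(badness_sketch(100, 1000))
```

With these numbers both calls give the same score, which is the sense in
which 30% of available memory is discounted from the adjusted task.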


* Re: [PATCH] Revert oom rewrite series
  2010-11-17  0:06       ` Bodo Eggert
  2010-11-17  0:25         ` David Rientjes
@ 2010-11-17  0:48         ` Mandeep Singh Baines
  1 sibling, 0 replies; 37+ messages in thread
From: Mandeep Singh Baines @ 2010-11-17  0:48 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: David Rientjes, KOSAKI Motohiro, LKML, Linus Torvalds,
	Andrew Morton, Ying Han, Bodo Eggert, Figo.zhang

Bodo Eggert (7eggert@gmx.de) wrote:
> On Mon, 15 Nov 2010, David Rientjes wrote:
> > On Tue, 16 Nov 2010, Bodo Eggert wrote:
> 
> > > > CAP_SYS_RESOURCE threads have full control over their oom killing priority
> > > > by /proc/pid/oom_score_adj
> > > 
> > > , but unless they are written in the last months and designed for linux
> > > and if the author took some time to research each external process invocation,
> > > they can not be aware of this possibility.
> > > 
> > 
> > You're clearly wrong, CAP_SYS_RESOURCE has been required to modify oom_adj 
> > for over five years (as long as the git history).  8fb4fc68, merged into 
> > 2.6.20, allowed tasks to raise their own oom_adj but not decrease it.  
> > That is unchanged by the rewrite.
> 
> You are misunderstanding me. It was allowed to do this, but it did not need 
> to do it yet. It was enough to be a well-written POSIX application without 
> linux-specific OOM hacks for some specific kernel versions.
> 
> > > Besides that, if each process is supposed to change the default, the default
> > > is wrong.
> > 
> > That doesn't make any sense, if want to protect a thread from the oom 
> > killer you're going to need to modify oom_score_adj, the kernel can't know 
> > what you perceive as being vital.  Having CAP_SYS_RESOURCE alone does not 
> > imply that, it only allows unbounded access to resources.  That's 
> > completely orthogonal to the goal of the oom killer heuristic, which is to 
> > find the most memory-hogging task to kill.
> 
> The old oom killer's task was to guess the best victim to kill. For me, it 
> did a good job (but the system kept thrashing for too long until it kicked

Here's a patch I've been working on to control thrashing.

http://lkml.org/lkml/2010/10/28/289

It works well for our app, a web browser: we'd rather OOM quickly and kill
a browser tab than thrash for a few minutes and then OOM. I'm still working
on a more generally useful solution.

> the offender). Looking at CAP_SYS_RESOURCE was one way to recognize 
> important processes.
> 
> > > 1) The exponential scale did have a low resolution.
> > > 
> > > 2) The heuristics were developed using much brain power and much
> > >    trial-and-error. You are going back to basics, and some people
> > >    are not convinced that this is better. I googled and I did not
> > >    find a discussion about how and why the new score was designed
> > >    this way.
> > >    looking at the output of:
> > >    cd /proc; for a in [0-9]*; do
> > >      echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
> > >    done|grep -v ^0|sort -n |less
> > >    , I 'm not convinced, too.
> > > 
> > 
> > The old heuristics were a mixture of arbitrary values that didn't adjust 
> > scores based on a unit and would often cause the incorrect task to be 
> > targeted because there was no clear goal being achieved.  The new 
> > heuristic has a solid goal: to identify and kill the most memory-hogging 
> > task that is eligible given the context in which the oom occurs.  If you 
> > > disagree with that goal and want any of the old heuristics reintroduced, 
> > please show that it makes sense in the oom killer.
> 
> The first old OOM killer did the same as you promise the current one does,
> except for your bugfixes. That's why it killed the wrong applications and
> all the heuristics were added until the complaints stopped.
> 
> Of course I have not yet tested your OOM killer; maybe it really is better.
> Heuristics tend to rot and you did much work to make it right.
> 
> I don't want the old OOM killer back, but I don't want you to fall
> into the same pits as the pre-old OOM killer used to do.
> 
> > > PS) Mapping an exponential value to a linear score is bad. E.g. A
> > >     oom_adj of 8 should make an 1-MB-process as likely to kill as
> > >     a 256-MB-process with oom_adj=0.
> > > 
> > 
> > To show that, you would have to show that an application that exists today 
> > uses an oom_adj for something other than polarization and is based on a 
> > calculation of allowable memory usage.  It simply doesn't exist.
> 
> No such application should exist because the OOM killer should DTRT.
> oom_adj was supposed to let the sysadmin lower his mission-critical
> DB's score to be just lower than the less-important tasks, or to
> point the kernel to his ever-faulty and easily-restarted browser.
> 
> > > PS2) Because I saw this in your presentation PDF: (@udev-people)
> > >     The -17 score of udevd is wrong, since it will even prevent
> > >     the OOM killer from working correctly if it grows to 100 MB:
> > > 
> > 
> > Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any 
> > thread they deem fit and that includes applications that lower its own 
> > oom_score_adj.  The kernel isn't going to prohibit users from setting 
> > their own oom_score_adj.
> 
> My point is: The udev people should not prevent the OOM killer 
> unconditionally, it has an important task in case something goes wrong.
> I just didn't want to start a new thread at that time of day.
> -- 
> How do I set my laser printer on stun?


* Re: [PATCH] Revert oom rewrite series
  2010-11-16 14:55                   ` Alan Cox
  2010-11-16 20:57                     ` David Rientjes
@ 2010-11-17  4:04                     ` Valdis.Kletnieks
  1 sibling, 0 replies; 37+ messages in thread
From: Valdis.Kletnieks @ 2010-11-17  4:04 UTC (permalink / raw)
  To: Alan Cox
  Cc: Florian Mickler, Jesper Juhl, David Rientjes, KOSAKI Motohiro,
	Andrew Morton, Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Tue, 16 Nov 2010 14:55:51 GMT, Alan Cox said:
> > How does one mark it appropriately?
> > The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below) 
> > added it to feature-removal-schedule.txt, a patch for
> > Documentation/ABI has also been provided in the meantime, if i'm not
> > mistaken. 
> 
> Yes - so why is it spewing crap, annoying users and trying to irritate
> application authors. It's not 2012 yet.

Aug 2012 is only 6 kernel releases or so away....

Presumably the whinging is so we start tracking down the offending userspace
and getting it fixed before 2012 gets here.  Sticking the warning in just one
or two kernel releases before it becomes official leads to "I can't run the new
kernel because my userspace isn't patched yet".  We really can't win here:
if we don't whinge, stuff doesn't get tracked down and fixed; if we do whinge,
that gets people upset too.


* Re: [PATCH] Revert oom rewrite series
  2010-11-15 10:34         ` David Rientjes
  2010-11-15 23:31           ` Jesper Juhl
@ 2010-11-23  7:16           ` KOSAKI Motohiro
  2010-11-28  1:45             ` David Rientjes
  1 sibling, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23  7:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Linus Torvalds, LKML, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

Sorry for the delay.

> On Mon, 15 Nov 2010, KOSAKI Motohiro wrote:
> 
> > Of course, I denied. He seems to think number of email is meaningful than
> > how talk about. but it's incorrect and makes no sense. Why not? Also, He
> > have to talk about logically. "Hey, I think it's not bug" makes no sense.
> > Such claim don't solve anything. userland is still unhappy. Why not?
> > I want to quickly action.
> 
> If there are pending complaints or bugs that I haven't addressed, please 
> bring them to my attention.  To date, I know of no issues that have been 
> raised that I have not addressed; you're always free to disagree with my 
> position, but in the end you may find that when the kernel moves in a 
> different direction that you should begin to accept it.

I can't understand this. Why would I need to ignore userland folks? WHY?
I have no reason to dismiss userland complaints. I would rather spare 
userland folks pain than kernel developers.


> 
> > That said, If anyone want to change userland ABI, Be carefully. They have
> > to investigate userland usecase carefully and avoid to break them carefully 
> > again. If someone think "hey, It's no big matter. userland rewritten can solve
> > an issue", I strongly disagree. they don't understand why all of userland 
> > applications rewritten is harmful.
>
> You may remember that the initial version of my rewrite replaced oom_adj 
> entirely with the new oom_score_adj semantics.  Others suggested that it 
> be separated into a new tunable and the old tunable deprecated for a 
> lengthy period of time.  I accepted that criticism and understood the 
> drawbacks of replacing the tunable immediately and followed those 
> suggestions.  I disagree with you that the deprecation of oom_adj for a 
> period of two years is as dramatic as you imply and I disagree that users 
> are experiencing problems with the linear scale that it now operates on 
> versus the old exponential scale.

Yes and no. People wanted a separate tunable AND wanted the old one not to break.


> 
> > 1) About two month ago, Dave hansen observed strange OOM issue because he
> >    has a big machine and ALL process are not so big. thus, eventually all 
> >    process got oom-score=0 and oom-killer didn't work.
> > 
> >    https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
> > 
> >    DavidR changed oom-score to +1 in such situation. 
> > 
> >    http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
> > 
> >    But it is completely bogus. If all process have score=1, oom-killer fall
> >    back to purely random killer. I expected and explained his patch has
> >    its problem at half years ago. but he didn't fix yet.
> > 
> 
> The resolution with which the oom killer considers memory is at 0.1% of 
> system RAM at its highest (smaller when you have a memory controller, 
> cpuset, or mempolicy constrained oom).  It considers a task within 0.1% of 
> memory of another task to have equal "badness" to kill, we don't break 
> ties in between that resolution -- it all depends on which one shows up in 
> the tasklist first.  If you disagree with that resolution, which I support 
> as being high enough, then you may certainly propose a patch to make it 
> even finer at 0.01%, 0.001%, etc.  It would only change oom_badness() to 
> range between [0,10000], [0,100000], etc.

No.
Think of Moore's Law. A ratio-based value will not keep working in the future
anyway. Ten years ago I used a desktop machine with 20 MB of memory, and now
I'm using 2 GB. Memory sizes keep growing, and bash isn't growing nearly as fast.
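
The scaling concern can be put in numbers. The sketch below assumes the
0.1% resolution described in the quoted text (a score of rss * 1000 / total,
truncated to an integer) and uses hypothetical helper names; it is an
illustration of the argument, not kernel code:

```python
# Sketch of the resolution argument, assuming score = rss * 1000 // total.
# Helper names are hypothetical.


def raw_score(rss_mb: int, total_mb: int) -> int:
    """Memory-proportional score at 0.1% resolution (truncating)."""
    return rss_mb * 1000 // total_mb


def smallest_nonzero_mb(total_mb: int) -> int:
    """Smallest RSS in MB that still yields a non-zero score (ceil(total/1000))."""
    return -(-total_mb // 1000)


if __name__ == "__main__":
    # The same 1 MB process on a 20 MB machine and on a 2 GB machine.
    print(raw_score(1, 20), raw_score(1, 2048))
    # The threshold below which every task scores 0 grows with RAM size.
    for total in (20, 2048, 65536):
        print(total, smallest_nonzero_mb(total))
```

Under this assumption, the amount of memory a task can hold while still
scoring 0 grows in step with total RAM, which is the case where many small
processes on a big machine become indistinguishable to the heuristic.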


> 
> > 2) Also half years ago, I did explained oom_adj is used from multiple 
> >    applications. And we can't break them. But DavidR didn't fix.
> > 
> 
> And we didn't.  oom_adj is still there and maps linearly to oom_score_adj; 
> you just can't show a single application where that mapping breaks because 
> it was based on an actual calculation.
> 
> If you would like to cite these "multiple" applications that need to be 
> converted to use oom_score_adj (I know of udev), please let me know and 
> if they're open-source applications then I will commit to submitting 
> patches for them myself.  I believe the two year window is sufficient for 
> everyone else, though.

If you want that, you have to change userland first, and do it yourself. Don't
claim that anyone else should do the work for you.


> > 3) Also about four month ago, I and kamezawa-san pointed out his patch
> >    don't work on memcg. It also haven't been fixed.
> 
> I don't know what you're referring to here, sorry.

You should have read my patch. Even if you don't use memcg, we do.



>    As kamezawa-san pointed out, this breaks cgroup and lxc environments.
>    He said,
> 	> Assume 2 processes A, B which have oom_score_adj of 300 and 0
> 	> And A uses 200M, B uses 1G of memory under 4G system
> 	>
> 	> Under the system.
> 	> 	A's score = (200M *1000)/4G + 300 = 350
> 	> 	B's score = (1G * 1000)/4G = 250.
> 	>
> 	> In the cpuset, it has 2G of memory.
> 	> 	A's score = (200M * 1000)/2G + 300 = 400
> 	> 	B's score = (1G * 1000)/2G = 500
> 	>
> 	> This priority-inversion doesn't happen in the current system.
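
The arithmetic in the example above can be reproduced with a short sketch.
It assumes the simplified formula used in the quoted text (memory share on a
0..1000 scale plus oom_score_adj) with 1G = 1024M, so the exact values differ
slightly from the rounded 350 and 400; the function name is hypothetical:

```python
# Reproduces the quoted priority-inversion example under the simplified
# formula score = used * 1000 // allowed + oom_score_adj (an assumption
# taken from the quoted text, not the full kernel heuristic).


def score(used_mb: int, allowed_mb: int, oom_score_adj: int) -> int:
    """Hypothetical helper: proportional score plus linear adjustment."""
    return used_mb * 1000 // allowed_mb + oom_score_adj


if __name__ == "__main__":
    a_used, a_adj = 200, 300   # task A: 200M, oom_score_adj = 300
    b_used, b_adj = 1024, 0    # task B: 1G,   oom_score_adj = 0

    for label, allowed in (("system (4G)", 4096), ("cpuset (2G)", 2048)):
        a = score(a_used, allowed, a_adj)
        b = score(b_used, allowed, b_adj)
        victim = "A" if a > b else "B"
        print(f"{label}: A={a} B={b} -> kill {victim}")
```

System-wide, A scores higher and is picked first; inside the 2G cpuset the
order flips and B is picked, because the fixed oom_score_adj offset does not
scale with the size of the allowed memory.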



> 
> > In the other hand, You can't explain what worth OOM-rewritten patch has. 
> > Because there is nothing. It is only "powerful"(TM) for Google. but 
> > instead It has zero worth for every other people. Here is just technical 
> > issue. Bah.
> > 
> 
> Please see my reply to Figo.zhang where I enumerate the four reasons why 
> the new userspace tunable is more powerful than oom_adj.

I'm NOT interested in *powerful* crap. Please DON'T talk about which is more
powerful. I can only say that it's useful only for you.



> At this point, I can only speculate that your distaste for the new oom 
> killer is one of disposition; it seems like everytime you reply to an 
> email (or, more regularly, just repost your revert) that you come into it 
> with the attitude that my response cannot possibly be correct and that the 
> way you see things is exactly as they should be.  If you were to consider 
> other people's opinions, however, you may find some common ground that can 
> be met.  I certainly did that when I introduced oom_score_adj instead of 
> replacing oom_adj immediately.  I also did it when I removed the forkbomb 
> detector from the rewrite.  I also did it when considering swap in the 
> heuristic when it initially was only rss.  Andrew is in the position where 
> he has to make a judgment call on what should be included and what 
> shouldn't and it should be pretty darn clear after you post your revert 
> the first time, then the second time, then the third time, then the fourth 
> time, and now the fifth time.





* Re: [PATCH] Revert oom rewrite series
  2010-11-15  6:57       ` KOSAKI Motohiro
  2010-11-15 10:34         ` David Rientjes
@ 2010-11-23  7:16         ` KOSAKI Motohiro
  1 sibling, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23  7:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Linus Torvalds, LKML, David Rientjes, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

> If you still have a question, please ask me. maybe I can answer all of 
> your question.

Zero questions?  If so, I'll resend the revert to Linus.


Actually, I tend not to listen to the shouting. It isn't discussion; it's
only crappy shouting. Googlers have to think about why nobody agrees with
their claims. ZERO people, even though more than 20 people discussed it with
them. DavidR seems to keep flaming, but I don't care. He has to learn that
flaming doesn't solve ANYTHING.

They also have to learn the correct way to discuss, the difference between
discussion and shouting, and why we have to learn userland workloads and
avoid any breakage. I'm angry that Googlers frequently break the kernel
and frequently ignore userland complaints.





* Re: [PATCH] Revert oom rewrite series
  2010-11-14 19:32 ` Linus Torvalds
  2010-11-15  0:54   ` KOSAKI Motohiro
@ 2010-11-23 23:51   ` KOSAKI Motohiro
  1 sibling, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23 23:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, LKML, David Rientjes, Andrew Morton, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

> 2010/11/13 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> >
> > Please apply this. this patch revert commits of oom changes since v2.6.35.
> 
> I'm not getting involved in this whole flame-war. You need to convince
> Andrew, who has been the person everything went through.

I did.

Therefore, I will resend the patch to you. Thanks.


--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts the following commits. They broke an ABI and prompted multiple
end-user complaints.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's mm"
be212960618ddcdb9526ce2cb73fd081fd3e90ea Revert "oom: rewrite error handling for oom_adj and oom_score_adj tunab
1b17c41599c594c7d11ef415a92d47c205fe89ea Revert "oom: fix locking for oom_adj and oom_score_adj"

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/feature-removal-schedule.txt |   25 ---
 Documentation/filesystems/proc.txt         |   97 ++++-----
 fs/exec.c                                  |    5 -
 fs/proc/base.c                             |  176 ++--------------
 include/linux/memcontrol.h                 |    8 -
 include/linux/mm_types.h                   |    2 -
 include/linux/oom.h                        |   19 +--
 include/linux/sched.h                      |    3 +-
 kernel/exit.c                              |    3 -
 kernel/fork.c                              |   16 +--
 mm/memcontrol.c                            |   28 +---
 mm/oom_kill.c                              |  323 ++++++++++++++--------------
 12 files changed, 227 insertions(+), 478 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index d8f36f9..9af16b9 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -166,31 +166,6 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated.  The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure.  With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness() score linearly.  This interface will replace
-	/proc/<pid>/oom_adj.
-
-	A warning will be emitted to the kernel log if an application uses this
-	deprecated interface.  After it is printed once, future warnings will be
-	suppressed until the kernel is rebooted.
-
----------------------------
-
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..030e3a1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,8 +33,7 @@ Table of Contents
   2	Modifying System Parameters
 
   3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
-								score
+  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1246,64 +1245,42 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
---------------------------------------------------------------------------------
-
-These file can be used to adjust the badness heuristic used to select which
-process gets killed in out of memory conditions.
-
-The badness heuristic assigns a value to each candidate task ranging from 0
-(never kill) to 1000 (always kill) to determine which process is targeted.  The
-units are roughly a proportion along that range of allowed memory the process
-may allocate from based on an estimation of its current memory and swap use.
-For example, if a task is using all allowed memory, its badness score will be
-1000.  If it is using half of its allowed memory, its score will be 500.
-
-There is an additional factor included in the badness score: root
-processes are given 3% extra memory over other tasks.
-
-The amount of "allowed" memory depends on the context in which the oom killer
-was called.  If it is due to the memory assigned to the allocating task's cpuset
-being exhausted, the allowed memory represents the set of mems assigned to that
-cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
-memory represents the set of mempolicy nodes.  If it is due to a memory
-limit (or swap limit) being reached, the allowed memory is that configured
-limit.  Finally, if it is due to the entire system being out of memory, the
-allowed memory represents all allocatable resources.
-
-The value of /proc/<pid>/oom_score_adj is added to the badness score before it
-is used to determine which task to kill.  Acceptable values range from -1000
-(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
-polarize the preference for oom killing either by always preferring a certain
-task or completely disabling it.  The lowest possible value, -1000, is
-equivalent to disabling oom killing entirely for that task since it will always
-report a badness score of 0.
-
-Consequently, it is very simple for userspace to define the amount of memory to
-consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
-example, is roughly equivalent to allowing the remainder of tasks sharing the
-same system, cpuset, mempolicy, or memory controller resources to use at least
-50% more memory.  A value of -500, on the other hand, would be roughly
-equivalent to discounting 50% of the task's allowed memory from being considered
-as scoring against the task.
-
-For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
-be used to tune the badness score.  Its acceptable values range from -16
-(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
-(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
-scaled linearly with /proc/<pid>/oom_score_adj.
-
-Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
-other with its scaled value.
-
-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with seperate address spaces instead, if possible.  This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
+3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
+------------------------------------------------------
+
+This file can be used to adjust the score used to select which processes
+should be killed in an  out-of-memory  situation.  Giving it a high score will
+increase the likelihood of this process being killed by the oom-killer.  Valid
+values are in the range -16 to +15, plus the special value -17, which disables
+oom-killing altogether for this process.
+
+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ 	or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked process does not belong
+ 	to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+	points <<= oom_adj when it is positive and
+	points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then selected and its children
+are killed. The process itself is killed in an OOM situation only when it has
+no children, or when some of them have disabled oom killing as described above.
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
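As a rough illustration of the scoring that the restored section 3.1 text
describes, here is a user-space sketch of the arithmetic in badness(). It is
not the kernel code: struct sketch_task, sketch_int_sqrt() and
sketch_badness() are hypothetical names, CPU time and run time are passed in
as plain seconds (an assumption standing in for the jiffies and timespec
handling), and the locking, thread iteration and unkillable-task checks are
left out.

```c
#include <stddef.h>

#define SKETCH_OOM_DISABLE (-17)	/* same value as OOM_DISABLE */

/* Hypothetical flattened view of the inputs badness() looks at. */
struct sketch_task {
	unsigned long total_vm;		/* pages mapped by the task */
	const unsigned long *child_vm;	/* total_vm of children with their own mm */
	size_t nchildren;
	unsigned long cpu_secs;		/* utime + stime, in seconds (assumption) */
	unsigned long run_secs;		/* uptime - start time, in seconds */
	int nice;
	int has_admin_or_resource;	/* CAP_SYS_ADMIN or CAP_SYS_RESOURCE */
	int has_rawio;			/* CAP_SYS_RAWIO */
	int oom_adj;
};

/* Floor of the square root; a slow stand-in for the kernel's int_sqrt(). */
static unsigned long sketch_int_sqrt(unsigned long x)
{
	unsigned long r = 0;

	while (r + 1 <= x / (r + 1))
		r++;
	return r;
}

static unsigned long sketch_badness(const struct sketch_task *t)
{
	unsigned long points, cpu_time, run_time;
	size_t i;

	if (t->oom_adj == SKETCH_OOM_DISABLE)
		return 0;

	/* Memory size is the baseline; each child adds half its size plus one. */
	points = t->total_vm;
	for (i = 0; i < t->nchildren; i++)
		points += t->child_vm[i] / 2 + 1;

	/* CPU time in units of 8 seconds, run time in units of 1024 seconds. */
	cpu_time = t->cpu_secs >> 3;
	run_time = t->run_secs >> 10;
	if (cpu_time)
		points /= sketch_int_sqrt(cpu_time);
	if (run_time)
		points /= sketch_int_sqrt(sketch_int_sqrt(run_time));

	if (t->nice > 0)
		points *= 2;
	if (t->has_admin_or_resource)
		points /= 4;
	if (t->has_rawio)
		points /= 4;

	/* oom_adj is a bit shift: left when positive, right when negative. */
	if (t->oom_adj > 0) {
		if (!points)
			points = 1;
		points <<= t->oom_adj;
	} else if (t->oom_adj < 0) {
		points >>= -t->oom_adj;
	}
	return points;
}
```

With total_vm = 100000 pages, 800 seconds of CPU time and 16384 seconds of
run time, the sketch gives 100000 / 10 / 2 = 5000 points. An oom_adj of 3
shifts that to 40000, and an oom_adj of -2 shifts it to 1250.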
diff --git a/fs/exec.c b/fs/exec.c
index 99d33a1..47986fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -54,7 +54,6 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
-#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -766,10 +765,6 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm = mm;
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
-	if (old_mm && tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-		atomic_dec(&old_mm->oom_disable_count);
-		atomic_inc(&tsk->mm->oom_disable_count);
-	}
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
 	if (old_mm) {
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..ed7d18e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -63,7 +63,6 @@
 #include <linux/namei.h>
 #include <linux/mnt_namespace.h>
 #include <linux/mm.h>
-#include <linux/swap.h>
 #include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/stacktrace.h>
@@ -431,11 +430,12 @@ static const struct file_operations proc_lstats_operations = {
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points = 0;
+	struct timespec uptime;
 
+	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = oom_badness(task, NULL, NULL,
-					totalram_pages + total_swap_pages);
+		points = badness(task, NULL, NULL, uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1025,74 +1025,36 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	memset(buffer, 0, sizeof(buffer));
 	if (count > sizeof(buffer) - 1)
 		count = sizeof(buffer) - 1;
-	if (copy_from_user(buffer, buf, count)) {
-		err = -EFAULT;
-		goto out;
-	}
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
 
 	err = strict_strtol(strstrip(buffer), 0, &oom_adjust);
 	if (err)
-		goto out;
+		return -EINVAL;
 	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
-	     oom_adjust != OOM_DISABLE) {
-		err = -EINVAL;
-		goto out;
-	}
+	     oom_adjust != OOM_DISABLE)
+		return -EINVAL;
 
 	task = get_proc_task(file->f_path.dentry->d_inode);
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
+	if (!task)
+		return -ESRCH;
 	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_task_lock;
+		put_task_struct(task);
+		return -ESRCH;
 	}
 
 	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	if (oom_adjust != task->signal->oom_adj) {
-		if (oom_adjust == OOM_DISABLE)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_adj == OOM_DISABLE)
-			atomic_dec(&task->mm->oom_disable_count);
+		unlock_task_sighand(task, &flags);
+		put_task_struct(task);
+		return -EACCES;
 	}
 
-	/*
-	 * Warn that /proc/pid/oom_adj is deprecated, see
-	 * Documentation/feature-removal-schedule.txt.
-	 */
-	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
-			"please use /proc/%d/oom_score_adj instead.\n",
-			current->comm, task_pid_nr(current),
-			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
-	/*
-	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
-	 * value is always attainable.
-	 */
-	if (task->signal->oom_adj == OOM_ADJUST_MAX)
-		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
-	else
-		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
-								-OOM_DISABLE;
-err_sighand:
+
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
 	put_task_struct(task);
-out:
-	return err < 0 ? err : count;
+
+	return count;
 }
 
 static const struct file_operations proc_oom_adjust_operations = {
@@ -1101,106 +1063,6 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.llseek		= generic_file_llseek,
 };
 
-static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
-					size_t count, loff_t *ppos)
-{
-	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
-	char buffer[PROC_NUMBUF];
-	int oom_score_adj = OOM_SCORE_ADJ_MIN;
-	unsigned long flags;
-	size_t len;
-
-	if (!task)
-		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		oom_score_adj = task->signal->oom_score_adj;
-		unlock_task_sighand(task, &flags);
-	}
-	put_task_struct(task);
-	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
-	return simple_read_from_buffer(buf, count, ppos, buffer, len);
-}
-
-static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
-					size_t count, loff_t *ppos)
-{
-	struct task_struct *task;
-	char buffer[PROC_NUMBUF];
-	unsigned long flags;
-	long oom_score_adj;
-	int err;
-
-	memset(buffer, 0, sizeof(buffer));
-	if (count > sizeof(buffer) - 1)
-		count = sizeof(buffer) - 1;
-	if (copy_from_user(buffer, buf, count)) {
-		err = -EFAULT;
-		goto out;
-	}
-
-	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
-	if (err)
-		goto out;
-	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
-			oom_score_adj > OOM_SCORE_ADJ_MAX) {
-		err = -EINVAL;
-		goto out;
-	}
-
-	task = get_proc_task(file->f_path.dentry->d_inode);
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_task_lock;
-	}
-
-	if (oom_score_adj < task->signal->oom_score_adj &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	if (oom_score_adj != task->signal->oom_score_adj) {
-		if (oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&task->mm->oom_disable_count);
-	}
-	task->signal->oom_score_adj = oom_score_adj;
-	/*
-	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
-	 * always attainable.
-	 */
-	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		task->signal->oom_adj = OOM_DISABLE;
-	else
-		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
-							OOM_SCORE_ADJ_MAX;
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
-	put_task_struct(task);
-out:
-	return err < 0 ? err : count;
-}
-
-static const struct file_operations proc_oom_score_adj_operations = {
-	.read		= oom_score_adj_read,
-	.write		= oom_score_adj_write,
-	.llseek		= default_llseek,
-};
-
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2779,7 +2641,6 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
-	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -3115,7 +2976,6 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
-	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..b13fc2a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,8 +124,6 @@ static inline bool mem_cgroup_disabled(void)
 void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -305,12 +303,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
-static inline
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
-{
-	return 0;
-}
-
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb7288a..cb57d65 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -310,8 +310,6 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
-	/* How many tasks sharing this mm are OOM_DISABLE */
-	atomic_t oom_disable_count;
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..40e5e3a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,27 +1,14 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
-/*
- * /proc/<pid>/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- *
- * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
- */
+/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
-/*
- * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
- * pid.
- */
-#define OOM_SCORE_ADJ_MIN	(-1000)
-#define OOM_SCORE_ADJ_MAX	1000
-
 #ifdef __KERNEL__
 
-#include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/nodemask.h>
 
@@ -40,8 +27,6 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-			const nodemask_t *nodemask, unsigned long totalpages);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
@@ -66,8 +51,6 @@ static inline void oom_killer_enable(void)
 extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
 		      const nodemask_t *nodemask, unsigned long uptime);
 
-extern struct task_struct *find_lock_task_mm(struct task_struct *p);
-
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d0036e5..a35acb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,8 +624,7 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
-	int oom_adj;		/* OOM kill score adjustment (bit shift) */
-	int oom_score_adj;	/* OOM kill score adjustment */
+	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
index 21aa7b3..c806406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,7 +50,6 @@
 #include <linux/perf_event.h>
 #include <trace/events/sched.h>
 #include <linux/hw_breakpoint.h>
-#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -696,8 +695,6 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_dec(&mm->oom_disable_count);
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..cca5e8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -65,7 +65,6 @@
 #include <linux/perf_event.h>
 #include <linux/posix-timers.h>
 #include <linux/user-return-notifier.h>
-#include <linux/oom.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -489,7 +488,6 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
-	atomic_set(&mm->oom_disable_count, 0);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
@@ -743,8 +741,6 @@ good_mm:
 	/* Initializing for Swap token stuff */
 	mm->token_priority = 0;
 	mm->last_interval = 0;
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_inc(&mm->oom_disable_count);
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
@@ -906,7 +902,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 
 	sig->oom_adj = current->signal->oom_adj;
-	sig->oom_score_adj = current->signal->oom_score_adj;
 
 	mutex_init(&sig->cred_guard_mutex);
 
@@ -1305,13 +1300,8 @@ bad_fork_cleanup_io:
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
-	if (p->mm) {
-		task_lock(p);
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&p->mm->oom_disable_count);
-		task_unlock(p);
+	if (p->mm)
 		mmput(p->mm);
-	}
 bad_fork_cleanup_signal:
 	if (!(clone_flags & CLONE_THREAD))
 		free_signal_struct(p->signal);
@@ -1704,10 +1694,6 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
 			active_mm = current->active_mm;
 			current->mm = new_mm;
 			current->active_mm = new_mm;
-			if (current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-				atomic_dec(&mm->oom_disable_count);
-				atomic_inc(&new_mm->oom_disable_count);
-			}
 			activate_mm(active_mm, new_mm);
 			new_mm = mm;
 		}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..c628370 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -47,7 +47,6 @@
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
-#include <linux/oom.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -917,13 +916,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
 	int ret;
 	struct mem_cgroup *curr = NULL;
-	struct task_struct *p;
 
-	p = find_lock_task_mm(task);
-	if (!p)
-		return 0;
-	curr = try_get_mem_cgroup_from_mm(p->mm);
-	task_unlock(p);
+	task_lock(task);
+	curr = try_get_mem_cgroup_from_mm(task->mm);
+	task_unlock(task);
 	if (!curr)
 		return 0;
 	/*
@@ -1297,24 +1293,6 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 }
 
 /*
- * Return the memory (and swap, if configured) limit for a memcg.
- */
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
-{
-	u64 limit;
-	u64 memsw;
-
-	limit = res_counter_read_u64(&memcg->res, RES_LIMIT) +
-			total_swap_pages;
-	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-	/*
-	 * If memsw is finite and limits the amount of swap space available
-	 * to this memcg, return that limit.
-	 */
-	return min(limit, memsw);
-}
-
-/*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
  * that to reclaim free pages from.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7dcca55..f251ddb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,8 +4,6 @@
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
- *  Copyright (C)  2010  Google, Inc.
- *	Rewritten by David Rientjes
  *
  *  The routines in this file are used to kill a process when
  *  we're seriously out of memory. This gets called from __alloc_pages()
@@ -36,6 +34,7 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
+/* #define DEBUG */
 
 #ifdef CONFIG_NUMA
 /**
@@ -106,7 +105,7 @@ static void boost_dying_task_prio(struct task_struct *p,
  * pointer.  Return p, or any of its subthreads with a valid ->mm, with
  * task_lock() held.
  */
-struct task_struct *find_lock_task_mm(struct task_struct *p)
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
 {
 	struct task_struct *t = p;
 
@@ -121,8 +120,8 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 }
 
 /* return true if the task is not adequate as candidate victim task. */
-static bool oom_unkillable_task(struct task_struct *p,
-		const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
+			   const nodemask_t *nodemask)
 {
 	if (is_global_init(p))
 		return true;
@@ -141,82 +140,137 @@ static bool oom_unkillable_task(struct task_struct *p,
 }
 
 /**
- * oom_badness - heuristic function to determine which candidate task to kill
+ * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
- * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
  *
- * The heuristic for determining which task to kill is made to be as simple and
- * predictable as possible.  The goal is to return the highest value for the
- * task consuming the most memory to avoid subsequent oom failures.
+ * The formula used is relatively simple and documented inline in the
+ * function. The main rationale is that we want to select a good task
+ * to kill when we run out of memory.
+ *
+ * Good in this context means that:
+ * 1) we lose the minimum amount of work done
+ * 2) we recover a large amount of memory
+ * 3) we don't kill anything innocent of eating tons of memory
+ * 4) we want to kill the minimum amount of processes (one)
+ * 5) we try to kill the process the user expects us to kill, this
+ *    algorithm has been meticulously tuned to meet the principle
+ *    of least surprise ... (be careful when you change it)
  */
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+		      const nodemask_t *nodemask, unsigned long uptime)
 {
-	int points;
+	unsigned long points, cpu_time, run_time;
+	struct task_struct *child;
+	struct task_struct *c, *t;
+	int oom_adj = p->signal->oom_adj;
+	struct task_cputime task_time;
+	unsigned long utime;
+	unsigned long stime;
 
 	if (oom_unkillable_task(p, mem, nodemask))
 		return 0;
+	if (oom_adj == OOM_DISABLE)
+		return 0;
 
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 0;
 
 	/*
-	 * Shortcut check for a thread sharing p->mm that is OOM_SCORE_ADJ_MIN
-	 * so the entire heuristic doesn't need to be executed for something
-	 * that cannot be killed.
+	 * The memory size of the process is the basis for the badness.
 	 */
-	if (atomic_read(&p->mm->oom_disable_count)) {
-		task_unlock(p);
-		return 0;
-	}
+	points = p->mm->total_vm;
+	task_unlock(p);
 
 	/*
-	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
-	 * priority for oom killing.
+	 * swapoff can easily use up all memory, so kill those first.
 	 */
-	if (p->flags & PF_OOM_ORIGIN) {
-		task_unlock(p);
-		return 1000;
-	}
+	if (p->flags & PF_OOM_ORIGIN)
+		return ULONG_MAX;
 
 	/*
-	 * The memory controller may have a limit of 0 bytes, so avoid a divide
-	 * by zero, if necessary.
+	 * Processes which fork a lot of child processes are likely
+	 * a good choice. We add half the vmsize of the children if they
+	 * have their own mm. This prevents forking servers from flooding
+	 * the machine with an endless number of children. In case a single
+	 * child is eating the vast majority of memory, adding only half
+	 * to the parent will make the child our kill candidate of choice.
 	 */
-	if (!totalpages)
-		totalpages = 1;
+	t = p;
+	do {
+		list_for_each_entry(c, &t->children, sibling) {
+			child = find_lock_task_mm(c);
+			if (child) {
+				if (child->mm != p->mm)
+					points += child->mm->total_vm/2 + 1;
+				task_unlock(child);
+			}
+		}
+	} while_each_thread(p, t);
 
 	/*
-	 * The baseline for the badness score is the proportion of RAM that each
-	 * task's rss and swap space use.
+	 * CPU time is in tens of seconds and run time is in thousands
+	 * of seconds. There is no particular reason for this other than
+	 * that it turned out to work very well in practice.
 	 */
-	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
-			totalpages;
-	task_unlock(p);
+	thread_group_cputime(p, &task_time);
+	utime = cputime_to_jiffies(task_time.utime);
+	stime = cputime_to_jiffies(task_time.stime);
+	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
+
+
+	if (uptime >= p->start_time.tv_sec)
+		run_time = (uptime - p->start_time.tv_sec) >> 10;
+	else
+		run_time = 0;
+
+	if (cpu_time)
+		points /= int_sqrt(cpu_time);
+	if (run_time)
+		points /= int_sqrt(int_sqrt(run_time));
 
 	/*
-	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-	 * implementation used by LSMs.
+	 * Niced processes are most likely less important, so double
+	 * their badness points.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
-		points -= 30;
+	if (task_nice(p) > 0)
+		points *= 2;
 
 	/*
-	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
-	 * either completely disable oom killing or always prefer a certain
-	 * task.
+	 * Superuser processes are usually more important, so we make it
+	 * less likely that we kill those.
 	 */
-	points += p->signal->oom_score_adj;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
+		points /= 4;
 
 	/*
-	 * Never return 0 for an eligible task that may be killed since it's
-	 * possible that no single user task uses more than 0.1% of memory and
-	 * no single admin tasks uses more than 3.0%.
+	 * We don't want to kill a process with direct hardware access.
+	 * Not only could that mess up the hardware, but usually users
+	 * tend to only have this flag set on applications they think
+	 * of as important.
 	 */
-	if (points <= 0)
-		return 1;
-	return (points < 1000) ? points : 1000;
+	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
+		points /= 4;
+
+	/*
+	 * Adjust the score by oom_adj.
+	 */
+	if (oom_adj) {
+		if (oom_adj > 0) {
+			if (!points)
+				points = 1;
+			points <<= oom_adj;
+		} else
+			points >>= -(oom_adj);
+	}
+
+#ifdef DEBUG
+	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
+	       p->pid, p->comm, points);
+#endif
+	return points;
 }
 
 /*
@@ -224,20 +278,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+				    gfp_t gfp_mask, nodemask_t *nodemask)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	bool cpuset_limited = false;
-	int nid;
-
-	/* Default to all available memory */
-	*totalpages = totalram_pages + total_swap_pages;
 
-	if (!zonelist)
-		return CONSTRAINT_NONE;
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
 	 * to kill current.We have to random task kill in this case.
@@ -247,37 +293,26 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 		return CONSTRAINT_NONE;
 
 	/*
-	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
-	 * the page allocator means a mempolicy is in effect.  Cpuset policy
-	 * is enforced in get_page_from_freelist().
+	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
+	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+	 * feature. mempolicy is the only user of nodemask here.
+	 * Check whether mempolicy's nodemask contains all N_HIGH_MEMORY nodes.
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
-		*totalpages = total_swap_pages;
-		for_each_node_mask(nid, *nodemask)
-			*totalpages += node_spanned_pages(nid);
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
 		return CONSTRAINT_MEMORY_POLICY;
-	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			cpuset_limited = true;
+			return CONSTRAINT_CPUSET;
 
-	if (cpuset_limited) {
-		*totalpages = total_swap_pages;
-		for_each_node_mask(nid, cpuset_current_mems_allowed)
-			*totalpages += node_spanned_pages(nid);
-		return CONSTRAINT_CPUSET;
-	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+				gfp_t gfp_mask, nodemask_t *nodemask)
 {
-	*totalpages = totalram_pages + total_swap_pages;
 	return CONSTRAINT_NONE;
 }
 #endif
@@ -288,16 +323,17 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned int *ppoints,
-		unsigned long totalpages, struct mem_cgroup *mem,
-		const nodemask_t *nodemask)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+		struct mem_cgroup *mem, const nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
+	struct timespec uptime;
 	*ppoints = 0;
 
+	do_posix_clock_monotonic_gettime(&uptime);
 	for_each_process(p) {
-		unsigned int points;
+		unsigned long points;
 
 		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
@@ -329,11 +365,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = 1000;
+			*ppoints = ULONG_MAX;
 		}
 
-		points = oom_badness(p, mem, nodemask, totalpages);
-		if (points > *ppoints) {
+		points = badness(p, mem, nodemask, uptime.tv_sec);
+		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
 		}
@@ -345,24 +381,27 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 /**
  * dump_tasks - dump current memory state of all system tasks
  * @mem: current's memory controller, if constrained
- * @nodemask: nodemask passed to page allocator for mempolicy ooms
  *
- * Dumps the current memory state of all eligible tasks.  Tasks not in the same
- * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
- * are not shown.
+ * Dumps the current memory state of all system tasks, excluding kernel threads.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * value, oom_score_adj value, and name.
+ * score, and name.
+ *
+ * If @mem is non-NULL, only tasks that are members of the mem_cgroup are
+ * shown.
  *
  * Call with tasklist_lock read-locked.
  */
-static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static void dump_tasks(const struct mem_cgroup *mem)
 {
 	struct task_struct *p;
 	struct task_struct *task;
 
-	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
+	       "name\n");
 	for_each_process(p) {
-		if (oom_unkillable_task(p, mem, nodemask))
+		if (p->flags & PF_KTHREAD)
+			continue;
+		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
 
 		task = find_lock_task_mm(p);
@@ -375,69 +414,43 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 			continue;
 		}
 
-		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d %s\n",
 			task->pid, task_uid(task), task->tgid,
 			task->mm->total_vm, get_mm_rss(task->mm),
-			task_cpu(task), task->signal->oom_adj,
-			task->signal->oom_score_adj, task->comm);
+			task_cpu(task), task->signal->oom_adj, task->comm);
 		task_unlock(task);
 	}
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
-			struct mem_cgroup *mem, const nodemask_t *nodemask)
+							struct mem_cgroup *mem)
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d, oom_score_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj,
-		current->signal->oom_score_adj);
+		"oom_adj=%d\n",
+		current->comm, gfp_mask, order, current->signal->oom_adj);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
 	dump_stack();
 	mem_cgroup_print_oom_info(mem, p);
 	show_mem();
 	if (sysctl_oom_dump_tasks)
-		dump_tasks(mem, nodemask);
+		dump_tasks(mem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 {
-	struct task_struct *q;
-	struct mm_struct *mm;
-
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 1;
 
-	/* mm cannot be safely dereferenced after task_unlock(p) */
-	mm = p->mm;
-
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(p), p->comm, K(p->mm->total_vm),
 		K(get_mm_counter(p->mm, MM_ANONPAGES)),
 		K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
-	/*
-	 * Kill all processes sharing p->mm in other thread groups, if any.
-	 * They don't get access to memory reserves or a higher scheduler
-	 * priority, though, to avoid depletion of all memory or task
-	 * starvation.  This prevents mm->mmap_sem livelock when an oom killed
-	 * task cannot exit because it requires the semaphore and its contended
-	 * by another thread trying to allocate memory itself.  That thread will
-	 * now get access to memory reserves since it has a pending fatal
-	 * signal.
-	 */
-	for_each_process(q)
-		if (q->mm == mm && !same_thread_group(q, p)) {
-			task_lock(q);	/* Protect ->comm from prctl() */
-			pr_err("Kill process %d (%s) sharing same memory\n",
-				task_pid_nr(q), q->comm);
-			task_unlock(q);
-			force_sig(SIGKILL, q);
-		}
 
 	set_tsk_thread_flag(p, TIF_MEMDIE);
 	force_sig(SIGKILL, p);
@@ -454,17 +467,17 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 #undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned int points, unsigned long totalpages,
-			    struct mem_cgroup *mem, nodemask_t *nodemask,
-			    const char *message)
+			    unsigned long points, struct mem_cgroup *mem,
+			    nodemask_t *nodemask, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
-		dump_header(p, gfp_mask, order, mem, nodemask);
+		dump_header(p, gfp_mask, order, mem);
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -477,7 +490,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}
 
 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);
 
@@ -487,15 +500,14 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	do_posix_clock_monotonic_gettime(&uptime);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
+			unsigned long child_points;
 
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, mem, nodemask,
-								totalpages);
+			/* badness() returns 0 if the thread is unkillable */
+			child_points = badness(child, mem, nodemask,
+					       uptime.tv_sec);
 			if (child_points > victim_points) {
 				victim = child;
 				victim_points = child_points;
@@ -510,7 +522,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
 static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
-				int order, const nodemask_t *nodemask)
+				int order)
 {
 	if (likely(!sysctl_panic_on_oom))
 		return;
@@ -524,7 +536,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 			return;
 	}
 	read_lock(&tasklist_lock);
-	dump_header(NULL, gfp_mask, order, NULL, nodemask);
+	dump_header(NULL, gfp_mask, order, NULL);
 	read_unlock(&tasklist_lock);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
@@ -533,19 +545,17 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
-	unsigned long limit;
-	unsigned int points = 0;
+	unsigned long points = 0;
 	struct task_struct *p;
 
-	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
-	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
+	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, limit, mem, NULL);
+	p = select_bad_process(&points, mem, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -669,11 +679,9 @@ static void clear_system_oom(void)
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
-	const nodemask_t *mpol_mask;
 	struct task_struct *p;
-	unsigned long totalpages;
 	unsigned long freed = 0;
-	unsigned int points;
+	unsigned long points;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;
 
@@ -697,40 +705,41 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
-						&totalpages);
-	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
-	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
+	if (zonelist)
+		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	check_panic_on_oom(constraint, gfp_mask, order);
 
 	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
-	    current->mm && !atomic_read(&current->mm->oom_disable_count)) {
+	    (current->signal->oom_adj != OOM_DISABLE)) {
 		/*
 		 * oom_kill_process() needs tasklist_lock held.  If it returns
 		 * non-zero, current could not be killed so we must fallback to
 		 * the tasklist scan.
 		 */
-		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
-				NULL, nodemask,
+		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+				nodemask,
 				"Out of memory (oom_kill_allocating_task)"))
 			goto out;
 	}
 
 retry:
-	p = select_bad_process(&points, totalpages, NULL, mpol_mask);
+	p = select_bad_process(&points, NULL,
+			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+								 NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
+		dump_header(NULL, gfp_mask, order, NULL);
 		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
-				nodemask, "Out of memory"))
+	if (oom_kill_process(p, gfp_mask, order, points, NULL, nodemask,
+			     "Out of memory"))
 		goto retry;
 	killed = 1;
 out:
-- 
1.6.5.2









* Re: [PATCH] Revert oom rewrite series
  2010-11-23  7:16           ` KOSAKI Motohiro
@ 2010-11-28  1:45             ` David Rientjes
  2010-11-30 13:04               ` KOSAKI Motohiro
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2010-11-28  1:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Tue, 23 Nov 2010, KOSAKI Motohiro wrote:

> > You may remember that the initial version of my rewrite replaced oom_adj 
> > entirely with the new oom_score_adj semantics.  Others suggested that it 
> > be separated into a new tunable and the old tunable deprecated for a 
> > lengthy period of time.  I accepted that criticism and understood the 
> > drawbacks of replacing the tunable immediately and followed those 
> > suggestions.  I disagree with you that the deprecation of oom_adj for a 
> > period of two years is as dramatic as you imply and I disagree that users 
> > are experiencing problems with the linear scale that it now operates on 
> > versus the old exponential scale.
> 
> Yes and no. People wanted it separated AND wanted the old one left unbroken.
> 

You're arguing on behalf of applications that don't exist.

> > > 1) About two months ago, Dave Hansen observed a strange OOM issue: he
> > >    has a big machine and none of its processes are very big, so
> > >    eventually every process got oom-score=0 and the oom-killer didn't work.
> > > 
> > >    https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
> > > 
> > >    DavidR changed the oom-score to +1 in that situation.
> > > 
> > >    http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
> > > 
> > >    But that is completely bogus. If every process has score=1, the
> > >    oom-killer falls back to being a purely random killer. I expected this
> > >    and explained the problem with his patch half a year ago, but he hasn't
> > >    fixed it yet.
> > > 
> > > 
> > 
> > The resolution with which the oom killer considers memory is at 0.1% of 
> > system RAM at its highest (smaller when you have a memory controller, 
> > cpuset, or mempolicy constrained oom).  It considers a task within 0.1% of 
> > memory of another task to have equal "badness" to kill, we don't break 
> > ties in between that resolution -- it all depends on which one shows up in 
> > the tasklist first.  If you disagree with that resolution, which I support 
> > as being high enough, then you may certainly propose a patch to make it 
> > even finer at 0.01%, 0.001%, etc.  It would only change oom_badness() to 
> > range between [0,10000], [0,100000], etc.
> 
> No.
> Think of Moore's Law. A ratio-based value won't keep working in the future anyway.
> 10 years ago I used a desktop machine with 20 MB of memory, and now I'm using 2 GB.
> The amount of memory keeps growing, and the size of bash isn't growing as fast.
> 

If you'd like to suggest an increase to the upper bound of the badness 
score, please do so, although I don't think we need to break ties amongst 
tasks that differ by less than 0.1% of the system's capacity.
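
To make the 0.1% resolution concrete, here is a minimal userspace sketch in
C. It is not the kernel's oom_badness(); proportional_score() is a
hypothetical helper that only assumes the score is a task's page count
scaled against the total page count into the range [0,1000].

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: scale a task's memory footprint
 * (in pages) to the 0..1000 range, i.e. units of 0.1% of total_pages.
 */
unsigned int proportional_score(uint64_t task_pages, uint64_t total_pages)
{
	if (total_pages == 0)
		return 0;
	if (task_pages > total_pages)
		task_pages = total_pages;
	/* Integer division drops everything below one 0.1% step. */
	return (unsigned int)(task_pages * 1000 / total_pages);
}
```

With 33554432 pages in total (128G of 4K pages), tasks of 40000 and 60000
pages both score 1 and therefore tie, a 70000-page task scores 2, and a
1000-page task scores 0. Widening the range to [0,10000] would only mean
changing the multiplier.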


* Re: [PATCH] Revert oom rewrite series
  2010-11-28  1:45             ` David Rientjes
@ 2010-11-30 13:04               ` KOSAKI Motohiro
  2010-11-30 20:02                 ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-30 13:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Linus Torvalds, LKML, Ying Han,
	Bodo Eggert, Mandeep Singh Baines, Figo.zhang

> On Tue, 23 Nov 2010, KOSAKI Motohiro wrote:
> 
> > > You may remember that the initial version of my rewrite replaced oom_adj 
> > > entirely with the new oom_score_adj semantics.  Others suggested that it 
> > > be separated into a new tunable and the old tunable deprecated for a 
> > > lengthy period of time.  I accepted that criticism and understood the 
> > > drawbacks of replacing the tunable immediately and followed those 
> > > suggestions.  I disagree with you that the deprecation of oom_adj for a 
> > > period of two years is as dramatic as you imply and I disagree that users 
> > > are experiencing problems with the linear scale that it now operates on 
> > > versus the old exponential scale.
> > 
> > Yes and no. People wanted it separated AND wanted the old one left unbroken.
> > 
> 
> You're arguing on behalf of applications that don't exist.

Why?
You actually got the bug report.


> 
> > > > 1) About two months ago, Dave Hansen observed a strange OOM issue: he
> > > >    has a big machine and none of its processes are very big, so
> > > >    eventually every process got oom-score=0 and the oom-killer didn't work.
> > > > 
> > > >    https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
> > > > 
> > > >    DavidR changed the oom-score to +1 in that situation.
> > > > 
> > > >    http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
> > > > 
> > > >    But that is completely bogus. If every process has score=1, the
> > > >    oom-killer falls back to being a purely random killer. I expected this
> > > >    and explained the problem with his patch half a year ago, but he hasn't
> > > >    fixed it yet.
> > > > 
> > > > 
> > > 
> > > The resolution with which the oom killer considers memory is at 0.1% of 
> > > system RAM at its highest (smaller when you have a memory controller, 
> > > cpuset, or mempolicy constrained oom).  It considers a task within 0.1% of 
> > > memory of another task to have equal "badness" to kill, we don't break 
> > > ties in between that resolution -- it all depends on which one shows up in 
> > > the tasklist first.  If you disagree with that resolution, which I support 
> > > as being high enough, then you may certainly propose a patch to make it 
> > > even finer at 0.01%, 0.001%, etc.  It would only change oom_badness() to 
> > > range between [0,10000], [0,100000], etc.
> > 
> > No.
> > Think of Moore's Law. A ratio-based value won't keep working in the future anyway.
> > 10 years ago I used a desktop machine with 20 MB of memory, and now I'm using 2 GB.
> > The amount of memory keeps growing, and the size of bash isn't growing as fast.
> > 
> 
> If you'd like to suggest an increase to the upper bound of the badness 
> score, please do so, although I don't think we need to break ties amongst 
> tasks that differ by less than 0.1% of the system's capacity.

No, I don't want that. I dislike a proportional score.



* Re: [PATCH] Revert oom rewrite series
  2010-11-30 13:04               ` KOSAKI Motohiro
@ 2010-11-30 20:02                 ` David Rientjes
  0 siblings, 0 replies; 37+ messages in thread
From: David Rientjes @ 2010-11-30 20:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Linus Torvalds, LKML, Ying Han, Bodo Eggert,
	Mandeep Singh Baines, Figo.zhang

On Tue, 30 Nov 2010, KOSAKI Motohiro wrote:

> > > > You may remember that the initial version of my rewrite replaced oom_adj 
> > > > entirely with the new oom_score_adj semantics.  Others suggested that it 
> > > > be separated into a new tunable and the old tunable deprecated for a 
> > > > lengthy period of time.  I accepted that criticism and understood the 
> > > > drawbacks of replacing the tunable immediately and followed those 
> > > > suggestions.  I disagree with you that the deprecation of oom_adj for a 
> > > > period of two years is as dramatic as you imply and I disagree that users 
> > > > are experiencing problems with the linear scale that it now operates on 
> > > > versus the old exponential scale.
> > > 
> > > Yes and no. People wanted it separated AND wanted the old one left unbroken.
> > > 
> > 
> > You're arguing on behalf of applications that don't exist.
> 
> Why?
> You actually got the bug report.
> 

There have never been any bug reports related to applications using 
oom_score_adj and being impacted by its linear mapping onto oom_adj's 
exponential scale.  That's because no users prior to the rewrite were 
using oom_adj scores that were based on either the expected memory usage 
of the application or the capacity of the machine.
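
For readers comparing the two scales, the sketch below contrasts them in
userspace C. It is an illustration under assumptions, not the kernel source:
the function names are hypothetical, the constants (-17, 15, 1000) are the
tunable limits as I understand them, and the linear conversion assumes
oom_adj is simply rescaled from [-17,15] onto [-1000,1000].

```c
#define OOM_DISABLE		(-17)	/* assumed lower limit of oom_adj */
#define OOM_ADJUST_MAX		15	/* assumed upper limit of oom_adj */
#define OOM_SCORE_ADJ_MAX	1000	/* assumed upper limit of oom_score_adj */

/* Hypothetical: linear rescaling of an oom_adj value to oom_score_adj. */
int oom_adj_to_score_adj(int oom_adj)
{
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}

/*
 * Hypothetical: the old exponential behaviour, where oom_adj acted as a
 * bitshift on the badness points. OOM_DISABLE itself was a special case
 * handled before this point, so it is not modelled here.
 */
unsigned long old_exponential_scale(unsigned long points, int oom_adj)
{
	if (oom_adj > 0) {
		if (!points)
			points = 1;
		return points << oom_adj;
	}
	return points >> -oom_adj;
}
```

Under these assumptions oom_adj=8 becomes oom_score_adj=470, a fixed offset
of 47% of available memory, whereas under the old scale oom_adj=3 multiplied
a score of 100 to 800 and oom_adj=-3 divided it down to 12.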


* Re: [PATCH] Revert oom rewrite series
  2010-11-15 10:57           ` Alan Cox
  2010-11-15 20:54             ` David Rientjes
@ 2010-11-23  7:16             ` KOSAKI Motohiro
  1 sibling, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23  7:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: kosaki.motohiro, David Rientjes, Figo.zhang, Figo.zhang, lkml,
	linux-mm, Andrew Morton, Linus Torvalds


sorry for the delay.

> > The goal was to make the oom killer heuristic as predictable as possible 
> > and to kill the most memory-hogging task to avoid having to recall it and 
> > needlessly kill several tasks.
> 
> Meta question - why is that a good thing. In a desktop environment it's
> frequently wrong, in a server environment it is often wrong. We had this
> before where people spend months fiddling with the vm and make it work
> slightly differently and it suits their workload, then other workloads go
> downhill. Then the cycle repeats.
> 
> > You have full control over disabling a task from being considered with 
> > oom_score_adj just like you did with oom_adj.  Since oom_adj is 
> > deprecated for two years, you can even use the old interface until then.
> 
> Which changeset added it to the Documentation directory as deprecated ?

It's insufficient.
a63d83f427fbce97a6cea0db2e64b0eb8435cd10 (oom: badness heuristic rewrite)
introduced a lot of incompatibilities in oom_adj and oom_score.
Therefore I suggest a full revert, then resubmitting patches that
cherry-pick the painless pieces.




* Re: [PATCH] Revert oom rewrite series
  2010-11-15 10:57           ` Alan Cox
@ 2010-11-15 20:54             ` David Rientjes
  2010-11-23  7:16             ` KOSAKI Motohiro
  1 sibling, 0 replies; 37+ messages in thread
From: David Rientjes @ 2010-11-15 20:54 UTC (permalink / raw)
  To: Alan Cox
  Cc: Figo.zhang, KOSAKI Motohiro, Figo.zhang, lkml, linux-mm,
	Andrew Morton, Linus Torvalds

On Mon, 15 Nov 2010, Alan Cox wrote:

> > The goal was to make the oom killer heuristic as predictable as possible 
> > and to kill the most memory-hogging task to avoid having to recall it and 
> > needlessly kill several tasks.
> 
> Meta question - why is that a good thing. In a desktop environment it's
> frequently wrong, in a server environment it is often wrong. We had this
> before where people spend months fiddling with the vm and make it work
> slightly differently and it suits their workload, then other workloads go
> downhill. Then the cycle repeats.
> 

Most of the arbitrary heuristics were removed from oom_badness(), things 
like nice level, runtime, CAP_SYS_RESOURCE, etc., so that we only consider 
the rss and swap usage of each application in comparison to each other 
when deciding which task to kill.  We give root tasks a 3% bonus since 
they tend to be more important to the productivity or uptime of the 
machine, which did exist -- albeit with a more dramatic impact -- in the 
old heuristic.

You'll find that the new heuristic always kills the task consuming the 
most rss unless influenced by userspace via the tunables (or unless a 
root task is within 3% of it).

We always want to kill the most memory-hogging task because it avoids 
needlessly killing additional tasks when we must immediately recall the 
oom killer because we continue to allocate memory.  If that task happens 
to be of vital importance to userspace, then the user has full control 
over tuning the oom killer priorities in such circumstances.
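
The heuristic described above can be sketched in userspace C as follows.
This is a simplified illustration, not the kernel's oom_badness():
sketch_badness() and its parameters are hypothetical, the score is assumed
to be in units of 0.1% of total_pages, the root bonus is taken as 30 units
(3%), and oom_score_adj=-1000 is assumed to mean "never kill".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: score a task by its rss plus swap as a proportion
 * of total_pages, give root tasks a 3% bonus, then apply the userspace
 * adjustment. Returns 0 only for tasks that must not be killed.
 */
int sketch_badness(uint64_t rss_pages, uint64_t swap_pages, bool is_root,
		   int oom_score_adj, uint64_t total_pages)
{
	long points;

	if (total_pages == 0 || oom_score_adj == -1000)
		return 0;

	points = (long)((rss_pages + swap_pages) * 1000 / total_pages);
	if (is_root)
		points -= 30;		/* 3% bonus for root tasks */
	points += oom_score_adj;	/* userspace tuning, same units */

	/* Eligible tasks always get a non-zero score. */
	return points < 1 ? 1 : (int)points;
}
```

With 1000000 pages in total, a user task holding 500000 pages scores 500 and
a root task of the same size scores 470, so a user task with 480000 pages
(score 480) would be chosen ahead of that root task.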

> > You have full control over disabling a task from being considered with 
> > oom_score_adj just like you did with oom_adj.  Since oom_adj is 
> > deprecated for two years, you can even use the old interface until then.
> 
> Which changeset added it to the Documentation directory as deprecated ?
> 

51b1bd2a was the actual change that deprecated it, which was a direct 
follow-up to a63d83f4 which actually obsoleted it.


* Re: [PATCH] Revert oom rewrite series
  2010-11-15 10:14         ` David Rientjes
@ 2010-11-15 10:57           ` Alan Cox
  2010-11-15 20:54             ` David Rientjes
  2010-11-23  7:16             ` KOSAKI Motohiro
  0 siblings, 2 replies; 37+ messages in thread
From: Alan Cox @ 2010-11-15 10:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Figo.zhang, KOSAKI Motohiro, Figo.zhang, lkml, linux-mm,
	Andrew Morton, Linus Torvalds

> The goal was to make the oom killer heuristic as predictable as possible 
> and to kill the most memory-hogging task to avoid having to recall it and 
> needlessly kill several tasks.

Meta question - why is that a good thing. In a desktop environment it's
frequently wrong, in a server environment it is often wrong. We had this
before where people spend months fiddling with the vm and make it work
slightly differently and it suits their workload, then other workloads go
downhill. Then the cycle repeats.

> You have full control over disabling a task from being considered with 
> oom_score_adj just like you did with oom_adj.  Since oom_adj is 
> deprecated for two years, you can even use the old interface until then.

Which changeset added it to the Documentation directory as deprecated ?

Alan


* Re: [PATCH] Revert oom rewrite series
  2010-11-15  3:26       ` [PATCH] Revert oom rewrite series Figo.zhang
@ 2010-11-15 10:14         ` David Rientjes
  2010-11-15 10:57           ` Alan Cox
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2010-11-15 10:14 UTC (permalink / raw)
  To: Figo.zhang
  Cc: KOSAKI Motohiro, Figo.zhang, lkml, linux-mm, Andrew Morton,
	Linus Torvalds

On Mon, 15 Nov 2010, Figo.zhang wrote:

> I doubt the value of a rewrite when the author cannot provide any evidence or
> experimental results. Why did you do it? What is the prominent change in your
> new algorithm?
> 
> As KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with ZERO
> explanation".
> 
> David just said to please use the userspace tunable oom_score_adj for
> protection. But may I ask some questions:
> 
> 1. What is the innovation in your new algorithm? The old one offers the same
> kind of user tunable, oom_adj.
> 

The goal was to make the oom killer heuristic as predictable as possible 
and to kill the most memory-hogging task to avoid having to recall it and 
needlessly kill several tasks.

The goal behind oom_score_adj vs. oom_adj was for several reasons, as 
pointed out before:

 - give it a unit (proportion of available memory), oom_adj had no unit,

 - allow it to work on a linear scale for more control over 
   prioritization, oom_adj had an exponential scale,

 - give it a much higher resolution so it can be fine-tuned, it works with 
   a granularity of 0.1% of memory (~128M on a 128G machine), and

 - allow it to describe the oom killing priority of a task regardless of 
   its cpuset attachment, mempolicy, or memcg, or when their respective
   limits change.
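
As a rough illustration of the unit and granularity points above, the
following userspace C sketch converts an oom_score_adj value into bytes.
The helper names are hypothetical and the only assumption is that one unit
equals 0.1% of whatever memory is currently available to the task.

```c
#include <stdint.h>

/* Hypothetical: bytes represented by one oom_score_adj unit (0.1%). */
uint64_t adj_unit_bytes(uint64_t available_bytes)
{
	return available_bytes / 1000;
}

/*
 * Hypothetical: bytes that a given oom_score_adj adds to (positive) or
 * discounts from (negative) a task's footprint. The same adj value keeps
 * its meaning as a proportion when available_bytes changes.
 */
int64_t adj_to_bytes(int oom_score_adj, uint64_t available_bytes)
{
	return (int64_t)available_bytes * oom_score_adj / 1000;
}
```

On a 128G machine one unit is 137438953 bytes, roughly the ~128M quoted
above. An oom_score_adj of -500 discounts 64G there and 32G if the task's
cpuset or memcg is later limited to 64G, without rewriting the tunable.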

> 2. If a server such as a db-server or financial-server has huge, important
> processes (such as root or hardware-access processes) that need protection,
> you leave it to the administrator to find out which processes should be
> protected. You will drive the financial-server administrator crazy, and
> lose a lot of money!! ^~^
> 

You have full control over disabling a task from being considered with 
oom_score_adj just like you did with oom_adj.  Since oom_adj is 
deprecated for two years, you can even use the old interface until then.

> 3. I saw your email on LKML, where you said
> "I have repeatedly said that the oom killer no longer kills KDE when run on my
> desktop in the presence of a memory hogging task that was written specifically
> to oom the machine."
> http://thread.gmane.org/gmane.linux.kernel.mm/48998
> 
> So you only tested your new oom_killer algorithm on your desktop with KDE.
> Have you provided the details of how you ran the test? Can anyone repeat
> the experiment and get the same result you describe?
> 

Xorg tends to be killed less because of the change to the heuristic's 
baseline, which is now based on rss and swap instead of total_vm.  This is 
separate from the issues you list above, but is a benefit to the oom 
killer that desktop users especially will notice.  I, personally, am 
interested more in the server market and that's why I looked for a more 
robust userspace tunable that would still be applicable when things like 
cpusets have a node added or removed.


* Re: [PATCH] Revert oom rewrite series
  2010-11-14 21:33     ` David Rientjes
@ 2010-11-15  3:26       ` Figo.zhang
  2010-11-15 10:14         ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: Figo.zhang @ 2010-11-15  3:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Figo.zhang, lkml, linux-mm, Andrew Morton,
	Linus Torvalds

 >Nothing to say, really.  Seems each time we're told about a bug or a
 >regression, David either fixes the bug or points out why it wasn't a
 >bug or why it wasn't a regression or how it was a deliberate behaviour
 >change for the better.

 >I just haven't seen any solid reason to be concerned about the state of
 >the current oom-killer, sorry.

 >I'm concerned that you're concerned!  A lot.  When someone such as
 >yourself is unhappy with part of MM then I sit up and pay attention.
 >But after all this time I simply don't understand the technical issues
 >which you're seeing here.

We are just talking about the oom-killer's technical issues.

I doubt the value of a rewrite when the author cannot provide any evidence 
or experimental results. Why did you do it? What is the prominent change 
in your new algorithm?

As KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with 
ZERO explanation".

David just said to please use the userspace tunable oom_score_adj for 
protection. But may I ask some questions:

1. What is the innovation in your new algorithm? The old one offers the 
same kind of user tunable, oom_adj.

2. If a server such as a db-server or financial-server has huge, important 
processes (such as root or hardware-access processes) that need protection, 
you leave it to the administrator to find out which processes should be 
protected. You will drive the financial-server administrator crazy, and 
lose a lot of money!! ^~^

3. I saw your email on LKML, where you said
"I have repeatedly said that the oom killer no longer kills KDE when run 
on my desktop in the presence of a memory hogging task that was written 
specifically to oom the machine."
http://thread.gmane.org/gmane.linux.kernel.mm/48998

So you only tested your new oom_killer algorithm on your desktop with KDE. 
Have you provided the details of how you ran the test? Can anyone repeat 
the experiment and get the same result you describe?

As KOSAKI Motohiro said, in the real world we have to think through 5-6 
kinds of workload: embedded, desktop, web server, db server, hpc, finance. 
Different workloads certainly make a big impact. Have you done those
experiments?

I think technology should be based on experiment, not imagination.


Best,
Figo.zhang






end of thread, other threads:[~2010-11-30 20:02 UTC | newest]

Thread overview: 37+ messages
2010-11-14  5:07 [PATCH] Revert oom rewrite series KOSAKI Motohiro
2010-11-14 19:32 ` Linus Torvalds
2010-11-15  0:54   ` KOSAKI Motohiro
2010-11-15  2:19     ` Andrew Morton
     [not found]       ` <AANLkTik_SDaiu2eQsJ9+4ywLR5K5V1Od-hwop6gwas3F@mail.gmail.com>
2010-11-15  4:41         ` Figo.zhang
2010-11-15  6:57       ` KOSAKI Motohiro
2010-11-15 10:34         ` David Rientjes
2010-11-15 23:31           ` Jesper Juhl
2010-11-16  0:06             ` David Rientjes
2010-11-16 10:04               ` Martin Knoblauch
2010-11-16 10:33                 ` Alessandro Suardi
2010-11-16  0:13             ` Valdis.Kletnieks
2010-11-16  6:43               ` David Rientjes
2010-11-16 11:03               ` Alan Cox
2010-11-16 13:03                 ` Florian Mickler
2010-11-16 14:55                   ` Alan Cox
2010-11-16 20:57                     ` David Rientjes
2010-11-16 21:01                       ` Fabio Comolli
2010-11-17  4:04                     ` Valdis.Kletnieks
2010-11-16 15:15               ` Alejandro Riveira Fernández
2010-11-23  7:16           ` KOSAKI Motohiro
2010-11-28  1:45             ` David Rientjes
2010-11-30 13:04               ` KOSAKI Motohiro
2010-11-30 20:02                 ` David Rientjes
2010-11-23  7:16         ` KOSAKI Motohiro
2010-11-23 23:51   ` KOSAKI Motohiro
2010-11-14 21:58 ` David Rientjes
2010-11-15 23:33   ` Bodo Eggert
2010-11-15 23:50     ` David Rientjes
2010-11-17  0:06       ` Bodo Eggert
2010-11-17  0:25         ` David Rientjes
2010-11-17  0:48         ` Mandeep Singh Baines
  -- strict thread matches above, loose matches on Subject: below --
2010-11-10 15:14 [PATCH v3]mm/oom-kill: direct hardware access processes should get bonus Figo.zhang
2010-11-10 15:24 ` Figo.zhang
2010-11-14  5:21   ` KOSAKI Motohiro
2010-11-14 21:33     ` David Rientjes
2010-11-15  3:26       ` [PATCH] Revert oom rewrite series Figo.zhang
2010-11-15 10:14         ` David Rientjes
2010-11-15 10:57           ` Alan Cox
2010-11-15 20:54             ` David Rientjes
2010-11-23  7:16             ` KOSAKI Motohiro
