linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
@ 2010-08-25  9:42 KOSAKI Motohiro
  2010-08-25  9:42 ` [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
  2010-08-25 10:25 ` [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() David Rientjes
  0 siblings, 2 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2010-08-25  9:42 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki,
	David Rientjes, Minchan Kim
  Cc: kosaki.motohiro

The current oom_score_adj is completely broken because it is strongly bound
to Google's use case and ignores all others.

1) Priority inversion
   As kamezawa-san pointed out, this breaks cgroup and LXC environments.
   He said,
	> Assume 2 processes A, B which have oom_score_adj of 300 and 0,
	> and A uses 200M, B uses 1G of memory under a 4G system.
	>
	> Under the whole system:
	> 	A's score = (200M * 1000)/4G + 300 = 350
	> 	B's score = (1G * 1000)/4G = 250.
	>
	> In the cpuset, which has 2G of memory:
	> 	A's score = (200M * 1000)/2G + 300 = 400
	> 	B's score = (1G * 1000)/2G = 500
	>
	> This priority inversion doesn't happen in the current system.

2) Ratio-based points don't work on large machines
   oom_score_adj normalizes the oom score to the 0-1000 range,
   but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means
   1GB. That is not suitable for a tuning parameter.
   As I said, a proportional tuning parameter carries a
   scalability risk.

3) No reason to implement ABI breakage.
   old tuning parameter meaning)
	oom-score = oom-base-score x 2^oom_adj
   new tuning parameter meaning)
	oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
   but "oom_score_adj / (totalram + totalswap)" can be calculated in
   userland too, because both totalram and totalswap are exposed via
   /proc. So there is no reason to introduce a funny new equation.

4) totalram-based normalization assumes a flat memory model.
   For example, the machine may be an asymmetric NUMA system; fat-node memory
   and thin-node memory might deserve different weights.
   In other words, totalram-based priority is just one possible policy. A fixed,
   workload-dependent policy probably shouldn't be embedded in the kernel.

Therefore, this patch removes this *UGLY* total_pages stuff completely. Googlers
can calculate it in userland!
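
(Illustration only, not part of this patch: a minimal userland sketch of that
proportional normalization, expressing a memory quantity in bytes as 0-1000
points of total RAM + swap. The byte-quantity argument and the use of sysinfo()
are assumptions made for the example; error handling is minimal.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/sysinfo.h>

int main(int argc, char **argv)
{
	struct sysinfo si;
	unsigned long long bytes, total;

	if (argc != 2)
		return 1;
	/* memory quantity (bytes) to convert into old-style 0-1000 points */
	bytes = strtoull(argv[1], NULL, 0);

	if (sysinfo(&si) != 0)
		return 1;
	total = (unsigned long long)(si.totalram + si.totalswap) * si.mem_unit;

	/* the proportional normalization the kernel no longer applies */
	printf("%llu\n", bytes * 1000 / total);
	return 0;
}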

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c        |   33 ++---------
 include/linux/oom.h   |   16 +-----
 include/linux/sched.h |    2 +-
 mm/oom_kill.c         |  142 ++++++++++++++++++++-----------------------------
 4 files changed, 67 insertions(+), 126 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a1c43e7..90ba487 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -434,8 +434,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = oom_badness(task, NULL, NULL,
-					totalram_pages + total_swap_pages);
+		points = oom_badness(task, NULL, NULL);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1056,15 +1055,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 			current->comm, task_pid_nr(current),
 			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
-	/*
-	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
-	 * value is always attainable.
-	 */
-	if (task->signal->oom_adj == OOM_ADJUST_MAX)
-		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
-	else
-		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
-								-OOM_DISABLE;
+
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 
@@ -1081,8 +1072,8 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 					size_t count, loff_t *ppos)
 {
 	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
-	char buffer[PROC_NUMBUF];
-	int oom_score_adj = OOM_SCORE_ADJ_MIN;
+	char buffer[21];
+	long oom_score_adj = 0;
 	unsigned long flags;
 	size_t len;
 
@@ -1093,7 +1084,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 		unlock_task_sighand(task, &flags);
 	}
 	put_task_struct(task);
-	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+	len = snprintf(buffer, sizeof(buffer), "%ld\n", oom_score_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
@@ -1101,7 +1092,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
 	struct task_struct *task;
-	char buffer[PROC_NUMBUF];
+	char buffer[21];
 	unsigned long flags;
 	long oom_score_adj;
 	int err;
@@ -1115,9 +1106,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
 	if (err)
 		return -EINVAL;
-	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
-			oom_score_adj > OOM_SCORE_ADJ_MAX)
-		return -EINVAL;
 
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
@@ -1134,15 +1122,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 	}
 
 	task->signal->oom_score_adj = oom_score_adj;
-	/*
-	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
-	 * always attainable.
-	 */
-	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		task->signal->oom_adj = OOM_DISABLE;
-	else
-		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
-							OOM_SCORE_ADJ_MAX;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 	return count;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..21006dc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -12,13 +12,6 @@
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
-/*
- * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
- * pid.
- */
-#define OOM_SCORE_ADJ_MIN	(-1000)
-#define OOM_SCORE_ADJ_MAX	1000
-
 #ifdef __KERNEL__
 
 #include <linux/sched.h>
@@ -40,8 +33,9 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-			const nodemask_t *nodemask, unsigned long totalpages);
+/* The badness from the OOM killer */
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+				 const nodemask_t *nodemask);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
@@ -62,10 +56,6 @@ static inline void oom_killer_enable(void)
 	oom_killer_disabled = false;
 }
 
-/* The badness from the OOM killer */
-extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long uptime);
-
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
 /* sysctls */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..5e61d60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -622,7 +622,7 @@ struct signal_struct {
 #endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
-	int oom_score_adj;	/* OOM kill score adjustment */
+	long oom_score_adj;	/* OOM kill score adjustment */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fc81cb2..b242ddc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -143,55 +143,41 @@ static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
- * @totalpages: total present RAM allowed for page allocation
  *
  * The heuristic for determining which task to kill is made to be as simple and
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+			  const nodemask_t *nodemask)
 {
-	int points;
+	unsigned long points;
+	unsigned long points_orig;
+	int oom_adj = p->signal->oom_adj;
+	long oom_score_adj = p->signal->oom_score_adj;
 
-	if (oom_unkillable_task(p, mem, nodemask))
-		return 0;
 
-	p = find_lock_task_mm(p);
-	if (!p)
+	if (oom_unkillable_task(p, mem, nodemask))
 		return 0;
-
-	/*
-	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
-	 * need to be executed for something that cannot be killed.
-	 */
-	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-		task_unlock(p);
+	if (oom_adj == OOM_DISABLE)
 		return 0;
-	}
 
 	/*
 	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
 	 * priority for oom killing.
 	 */
-	if (p->flags & PF_OOM_ORIGIN) {
-		task_unlock(p);
-		return 1000;
-	}
+	if (p->flags & PF_OOM_ORIGIN)
+		return ULONG_MAX;
 
-	/*
-	 * The memory controller may have a limit of 0 bytes, so avoid a divide
-	 * by zero, if necessary.
-	 */
-	if (!totalpages)
-		totalpages = 1;
+	p = find_lock_task_mm(p);
+	if (!p)
+		return 0;
 
 	/*
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss and swap space use.
 	 */
-	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
-			totalpages;
+	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS));
 	task_unlock(p);
 
 	/*
@@ -202,15 +188,25 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 		points -= 30;
 
 	/*
-	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
-	 * either completely disable oom killing or always prefer a certain
-	 * task.
+	 * Adjust the score by oom_adj and oom_score_adj.
 	 */
-	points += p->signal->oom_score_adj;
+	points_orig = points;
+	points += oom_score_adj;
+	if ((oom_score_adj > 0) && (points < points_orig))
+		points = ULONG_MAX;	/* may be overflow */
+	if ((oom_score_adj < 0) && (points > points_orig))
+		points = 1;		/* may be underflow */
+
+	if (oom_adj) {
+		if (oom_adj > 0) {
+			if (!points)
+				points = 1;
+			points <<= oom_adj;
+		} else
+			points >>= -(oom_adj);
+	}
 
-	if (points < 0)
-		return 0;
-	return (points < 1000) ? points : 1000;
+	return points;
 }
 
 /*
@@ -218,17 +214,11 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+			gfp_t gfp_mask, nodemask_t *nodemask)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	bool cpuset_limited = false;
-	int nid;
-
-	/* Default to all available memory */
-	*totalpages = totalram_pages + total_swap_pages;
 
 	if (!zonelist)
 		return CONSTRAINT_NONE;
@@ -245,33 +235,21 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 	 * the page allocator means a mempolicy is in effect.  Cpuset policy
 	 * is enforced in get_page_from_freelist().
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
-		*totalpages = total_swap_pages;
-		for_each_node_mask(nid, *nodemask)
-			*totalpages += node_spanned_pages(nid);
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
 		return CONSTRAINT_MEMORY_POLICY;
-	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			cpuset_limited = true;
+			return CONSTRAINT_CPUSET;
 
-	if (cpuset_limited) {
-		*totalpages = total_swap_pages;
-		for_each_node_mask(nid, cpuset_current_mems_allowed)
-			*totalpages += node_spanned_pages(nid);
-		return CONSTRAINT_CPUSET;
-	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+					gfp_t gfp_mask, nodemask_t *nodemask)
 {
-	*totalpages = totalram_pages + total_swap_pages;
 	return CONSTRAINT_NONE;
 }
 #endif
@@ -282,16 +260,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned int *ppoints,
-		unsigned long totalpages, struct mem_cgroup *mem,
-		const nodemask_t *nodemask)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+					      struct mem_cgroup *mem,
+					      const nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
 	*ppoints = 0;
 
 	for_each_process(p) {
-		unsigned int points;
+		unsigned long points;
 
 		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
@@ -323,10 +301,10 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = 1000;
+			*ppoints = ULONG_MAX;
 		}
 
-		points = oom_badness(p, mem, nodemask, totalpages);
+		points = oom_badness(p, mem, nodemask);
 		if (points > *ppoints) {
 			chosen = p;
 			*ppoints = points;
@@ -371,7 +349,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			continue;
 		}
 
-		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5ld %s\n",
 			task->pid, task_uid(task), task->tgid,
 			task->mm->total_vm, get_mm_rss(task->mm),
 			task_cpu(task), task->signal->oom_adj,
@@ -385,7 +363,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d, oom_score_adj=%d\n",
+		"oom_adj=%d, oom_score_adj=%ld\n",
 		current->comm, gfp_mask, order, current->signal->oom_adj,
 		current->signal->oom_score_adj);
 	cpuset_print_task_mems_allowed(current);
@@ -426,14 +404,13 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 #undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned int points, unsigned long totalpages,
-			    struct mem_cgroup *mem, nodemask_t *nodemask,
-			    const char *message)
+			    unsigned long points, struct mem_cgroup *mem,
+			    nodemask_t *nodemask, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	unsigned long victim_points = 0;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -449,7 +426,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}
 
 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);
 
@@ -461,13 +438,12 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
+			unsigned long child_points;
 
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */
-			child_points = oom_badness(child, mem, nodemask,
-								totalpages);
+			child_points = oom_badness(child, mem, nodemask);
 			if (child_points > victim_points) {
 				victim = child;
 				victim_points = child_points;
@@ -505,19 +481,17 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
-	unsigned long limit;
-	unsigned int points = 0;
+	unsigned long points = 0;
 	struct task_struct *p;
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
-	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, limit, mem, NULL);
+	p = select_bad_process(&points, mem, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -642,9 +616,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	struct task_struct *p;
-	unsigned long totalpages;
 	unsigned long freed = 0;
-	unsigned int points;
+	unsigned long points;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;
 
@@ -668,8 +641,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
-						&totalpages);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	check_panic_on_oom(constraint, gfp_mask, order);
 
 	read_lock(&tasklist_lock);
@@ -681,14 +653,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		 * non-zero, current could not be killed so we must fallback to
 		 * the tasklist scan.
 		 */
-		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+		if (!oom_kill_process(current, gfp_mask, order, 0,
 				NULL, nodemask,
 				"Out of memory (oom_kill_allocating_task)"))
 			goto out;
 	}
 
 retry:
-	p = select_bad_process(&points, totalpages, NULL,
+	p = select_bad_process(&points, NULL,
 			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
 								 NULL);
 	if (PTR_ERR(p) == -1UL)
@@ -701,7 +673,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, NULL,
 				nodemask, "Out of memory"))
 		goto retry;
 	killed = 1;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable"
  2010-08-25  9:42 [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() KOSAKI Motohiro
@ 2010-08-25  9:42 ` KOSAKI Motohiro
  2010-08-25 10:27   ` David Rientjes
  2010-08-25 10:25 ` [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2010-08-25  9:42 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki,
	David Rientjes, Minchan Kim
  Cc: kosaki.motohiro

oom_adj is not only a kernel knob; it is also used as an
application interface.
So adding a new knob is not a good reason to deprecate it.

Also, after the previous patch, oom_score_adj can no longer be used to set
OOM_DISABLE. We still need the "echo -17 > /proc/<pid>/oom_adj" idiom.

This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.
---
 Documentation/feature-removal-schedule.txt |   25 -------------------------
 Documentation/filesystems/proc.txt         |    3 ---
 fs/proc/base.c                             |    8 --------
 include/linux/oom.h                        |    3 ---
 4 files changed, 0 insertions(+), 39 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 842aa9d..aff4d11 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -151,31 +151,6 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
-What:	/proc/<pid>/oom_adj
-When:	August 2012
-Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
-	badness heuristic used to determine which task to kill when the kernel
-	is out of memory.
-
-	The badness heuristic has since been rewritten since the introduction of
-	this tunable such that its meaning is deprecated.  The value was
-	implemented as a bitshift on a score generated by the badness()
-	function that did not have any precise units of measure.  With the
-	rewrite, the score is given as a proportion of available memory to the
-	task allocating pages, so using a bitshift which grows the score
-	exponentially is, thus, impossible to tune with fine granularity.
-
-	A much more powerful interface, /proc/<pid>/oom_score_adj, was
-	introduced with the oom killer rewrite that allows users to increase or
-	decrease the badness() score linearly.  This interface will replace
-	/proc/<pid>/oom_adj.
-
-	A warning will be emitted to the kernel log if an application uses this
-	deprecated interface.  After it is printed once, future warnings will be
-	suppressed until the kernel is rebooted.
-
----------------------------
-
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a6aca87..cf1295c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1285,9 +1285,6 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
 Caveat: when a parent task is selected, the oom killer will sacrifice any first
 generation children with seperate address spaces instead, if possible.  This
 avoids servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 90ba487..55a16f2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1046,14 +1046,6 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		return -EACCES;
 	}
 
-	/*
-	 * Warn that /proc/pid/oom_adj is deprecated, see
-	 * Documentation/feature-removal-schedule.txt.
-	 */
-	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
-			"please use /proc/%d/oom_score_adj instead.\n",
-			current->comm, task_pid_nr(current),
-			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
 
 	unlock_task_sighand(task, &flags);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 21006dc..394f2e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,9 +2,6 @@
 #define __INCLUDE_LINUX_OOM_H
 
 /*
- * /proc/<pid>/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- *
  * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
  */
 #define OOM_DISABLE (-17)
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-25  9:42 [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() KOSAKI Motohiro
  2010-08-25  9:42 ` [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
@ 2010-08-25 10:25 ` David Rientjes
  2010-08-26  0:39   ` KAMEZAWA Hiroyuki
  2010-08-30  2:58   ` KOSAKI Motohiro
  1 sibling, 2 replies; 15+ messages in thread
From: David Rientjes @ 2010-08-25 10:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki,
	Minchan Kim

On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:

> The current oom_score_adj is completely broken because it is strongly bound
> to Google's use case and ignores all others.
> 

That's wrong, we don't even use this heuristic yet and there is nothing, 
in any way, that is specific to Google.

> 1) Priority inversion
>    As kamezawa-san pointed out, this breaks cgroup and LXC environments.
>    He said,
> 	> Assume 2 processes A, B which have oom_score_adj of 300 and 0,
> 	> and A uses 200M, B uses 1G of memory under a 4G system.
> 	>
> 	> Under the whole system:
> 	> 	A's score = (200M * 1000)/4G + 300 = 350
> 	> 	B's score = (1G * 1000)/4G = 250.
> 	>
> 	> In the cpuset, which has 2G of memory:
> 	> 	A's score = (200M * 1000)/2G + 300 = 400
> 	> 	B's score = (1G * 1000)/2G = 500
> 	>
> 	> This priority inversion doesn't happen in the current system.
> 

You continually bring this up, and I've answered it three times, but 
you've never responded to it before and completely ignore it.  I really 
hope and expect that you'll participate more in the development process 
and not continue to reiterate your talking points when you have no answer 
to my response.

You're wrong, especially with regard to cpusets, which were formerly part 
of the heuristic itself.

Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means 
of isolation and attach a set of resources (memory, in this case) for 
those tasks to use.  The user who does this is fully aware of the set of 
tasks being bound, there is no mystery or unexpected results when doing 
so.  So when you set an oom_score_adj for a task, you don't necessarily 
need to be aware of the set of resources it has available, which is 
dynamic and an attribute of the system or cgroup, but rather the priority 
of that task in competition with other tasks for the same resources.

_That_ is what is important in having a userspace influence on a badness 
heuristic: how those badness scores compare relative to other tasks that 
share the same resources.  That's how a task is chosen for oom kill, not 
because of a static formula such as you're introducing here that outputs a 
value (and, thus, a priority) regardless of the context in which the task 
is bound.

That also means that the same task is not necessarily killed in a 
cpuset-constrained oom compared to a system-wide oom.  If you bias a task 
by 30% of available memory, which Kame did in his example above, it's 
entirely plausible that task A should be killed because its actual usage 
is only 1/20th of the machine.  When its cpuset is oom, and the admin has 
specifically bound that task to only 2G of memory, we'd naturally want to 
kill the memory hogger, that is using 50% of the total memory available to 
it.

> 2) Ratio-based points don't work on large machines
>    oom_score_adj normalizes the oom score to the 0-1000 range,
>    but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means
>    1GB. That is not suitable for a tuning parameter.
>    As I said, a proportional tuning parameter carries a
>    scalability risk.
> 

So you'd rather use the range of oom_adj from -17 to +15 instead of 
oom_score_adj from -1000 to +1000 where each point is 68GB?  I don't 
understand your point here as to why oom_score_adj is worse.

But, yes, in reality we don't really care about the granularity so much 
that we need to prioritize a task using 512MB more memory than another to 
break the tie on a 1TB machine, 1/2048th of its memory.

> 3) No reason to implement ABI breakage.
>    old tuning parameter meaning)
> 	oom-score = oom-base-score x 2^oom_adj

Everybody knows this is useless beyond polarizing a task for kill or 
making it immune.

>    new tuning parameter meaning)
> 	oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)

This, on the other hand, has an actual unit (proportion of available 
memory) that can be used to prioritize tasks amongst those competing for 
the same set of shared resources and remains constant even when a task 
changes cpuset, its memcg limit changes, etc.

And your equation is wrong, it's

	((rss + swap) / (available ram + swap)) + oom_score_adj

which is completely different from what you think it is.

>    but "oom_score_adj / (totalram + totalswap)" can be calculated in
>    userland too. beucase both totalram and totalswap has been exporsed by
>    /proc. So no reason to introduce funny new equation.
> 

Yup, it definitely can, which is why as I mentioned to Kame (who doesn't 
have strong feelings either way, even though you quote him as having these 
strong objections) that you can easily convert oom_score_adj into a 
stand-alone memory quantity (biasing or forgiving 512MB of a task's 
memory, for example) in the context it is currently attached to with 
simple arithmetic in userspace.  That's why oom_score_adj is powerful.
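
As an illustrative conversion (the memcg and machine sizes here are assumed
purely for the example), the memory quantity that a negative oom_score_adj
forgives is

	-oom_score_adj * (available ram + swap) / 1000

so -250 forgives 512M of a task's usage inside a 2G memcg, but 4G of usage when
a 16G machine as a whole is oom; the same arithmetic in reverse picks an
oom_score_adj for a desired memory quantity.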

> 4) totalram-based normalization assumes a flat memory model.
>    For example, the machine may be an asymmetric NUMA system; fat-node memory
>    and thin-node memory might deserve different weights.
>    In other words, totalram-based priority is just one possible policy. A fixed,
>    workload-dependent policy probably shouldn't be embedded in the kernel.
> 

I don't know what this means, and this was your criticism before I changed 
the denominator during the revision of the patchset, so it's probably 
obsoleted.  oom_score_adj always operates based on the proportion of 
memory available to the application which is how the new oom killer 
determines which tasks to kill: relative to the importance (if defined by 
userspace) and memory usage compared to other tasks competing for it.

> Therefore, this patch removes this *UGLY* total_pages stuff completely. Googlers
> can calculate it in userland!
> 

None of this is specific to Google.  Users who actually set up 
their machines to use mempolicies, cpusets, or memcgs actually do want a 
powerful interface from userspace to tune the priorities in terms of both 
business goals and also importance of the task.  That is done much more 
powerfully now with oom_score_adj than the previous implementation.  Users 
who don't use these cgroups, especially desktop users, can see 
oom_score_adj in terms of a memory quantity that remains static: they 
aren't going to encounter changing memcg limits, cpuset mems, etc.

That said, I really don't know why you keep mentioning "Google this" and 
"Google that" when the company I'm working for is really irrelevant to 
this discussion.

With that, I respectfully nack your patch.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable"
  2010-08-25  9:42 ` [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
@ 2010-08-25 10:27   ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2010-08-25 10:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki,
	Minchan Kim

On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:

> oom_adj is not only a kernel knob; it is also used as an
> application interface.
> So adding a new knob is not a good reason to deprecate it.
> 
> Also, after the previous patch, oom_score_adj can no longer be used to set
> OOM_DISABLE. We still need the "echo -17 > /proc/<pid>/oom_adj" idiom.
> 
> This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.

Since I nacked the parent patch of this, I implicitly nack this one as 
well, since oom_score_adj shouldn't be going anywhere.  The way to disable 
oom killing for a task via the new interface, /proc/pid/oom_score_adj, is 
to write OOM_SCORE_ADJ_MIN to it, as specified in the documentation.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-25 10:25 ` [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() David Rientjes
@ 2010-08-26  0:39   ` KAMEZAWA Hiroyuki
  2010-08-26  0:52     ` David Rientjes
  2010-08-30  2:58   ` KOSAKI Motohiro
  1 sibling, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-26  0:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	Minchan Kim

On Wed, 25 Aug 2010 03:25:25 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> 
> > 3) No reason to implement ABI breakage.
> >    old tuning parameter mean)
> > 	oom-score = oom-base-score x 2^oom_adj
> 
> Everybody knows this is useless beyond polarizing a task for kill or 
> making it immune.
> 
> >    new tuning parameter mean)
> > 	oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
> 
> This, on the other hand, has an actual unit (proportion of available 
> memory) that can be used to prioritize tasks amongst those competing for 
> the same set of shared resources and remains constant even when a task 
> changes cpuset, its memcg limit changes, etc.
> 
> And your equation is wrong, it's
> 
> 	((rss + swap) / (available ram + swap)) + oom_score_adj
> 
> which is completely different from what you think it is.
> 

I'm now trying to write a userspace tool to calculate this, for me.
Then, could you update documentation ? 
==
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------

This file can be used to check the current score used by the oom-killer is for
any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
process should be killed in an out-of-memory situation.
==

add some documentation like:
==
(For system monitoring tool developers, not for usual users.)
The oom_score calculation is implementation dependent and can be modified without
any notice. But the current logic is

oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj

A proc's rss and swap can be obtained from /proc/<pid>/statm, and "available ram + swap"
depends on the situation.
If the whole system is under oom,
	available ram  == /proc/meminfo's MemTotal
	available swap == in most cases == /proc/meminfo's SwapTotal
When you use memory cgroups,
	when swap is limited,   available ram + swap == the memory cgroup's memsw limit.
	when swap is unlimited, available ram + swap == the memory cgroup's memory limit +
							SwapTotal

So please be careful that the oom_score order among tasks depends on the
situation. Assume 2 processes A, B which have oom_score_adj of 300 and 0,
and A uses 200M, B uses 1G of memory under a 4G system.

 Under the 4G system:
 	A's score = (200M * 1000)/4G + 300 = 350
 	B's score = (1G * 1000)/4G = 250.

 In the memory cgroup, which has 2G of resources:
	A's score = (200M * 1000)/2G + 300 = 400
	B's score = (1G * 1000)/2G = 500

You shouldn't depend on /proc/<pid>/oom_score if you have to handle OOM under
cgroups and cpusets. But the logic is simple.
==

If you don't want to, I'll add the text and a sample tool to cgroup/memory.txt.
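
(A rough sketch of such a tool, for the system-wide oom case only; per-task swap
and the memcg/cpuset cases are left out, and the /proc paths and field layout
used here are assumptions for the example.)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/sysinfo.h>

int main(int argc, char **argv)
{
	char path[64];
	unsigned long size, rss;
	long adj;
	unsigned long long used, avail;
	struct sysinfo si;
	FILE *f;

	if (argc != 2)
		return 1;

	/* rss in pages: second field of /proc/<pid>/statm */
	snprintf(path, sizeof(path), "/proc/%s/statm", argv[1]);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%lu %lu", &size, &rss) != 2)
		return 1;
	fclose(f);

	/* the task's own adjustment */
	snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", argv[1]);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%ld", &adj) != 1)
		return 1;
	fclose(f);

	/* system-wide case: available ram + swap == MemTotal + SwapTotal */
	if (sysinfo(&si) != 0)
		return 1;
	avail = (unsigned long long)(si.totalram + si.totalswap) * si.mem_unit;
	used = (unsigned long long)rss * sysconf(_SC_PAGESIZE);

	/* (task's proportion of memory) + oom_score_adj */
	printf("%lld\n", (long long)(used * 1000 / avail) + adj);
	return 0;
}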


Thanks,
-Kame










^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-26  0:39   ` KAMEZAWA Hiroyuki
@ 2010-08-26  0:52     ` David Rientjes
  2010-08-26  1:03       ` KAMEZAWA Hiroyuki
  2010-08-26  1:11       ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 15+ messages in thread
From: David Rientjes @ 2010-08-26  0:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	Minchan Kim

On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:

> I'm now trying to write a userspace tool to calculate this, for me.
> Then, could you update documentation ? 
> ==
> 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> -------------------------------------------------------------
> 
> This file can be used to check the current score used by the oom-killer is for
> any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
> process should be killed in an out-of-memory situation.
> ==
> 

You'll want to look at section 3.1 of Documentation/filesystems/proc.txt, 
which describes /proc/pid/oom_score_adj, not 3.2.

> add some documentation like:
> ==
> (For system monitoring tool developers, not for usual users.)
> The oom_score calculation is implementation dependent and can be modified without
> any notice. But the current logic is
> 
> oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj
> 

I'd hesitate to state the formula outside of the implementation and 
instead focus on the semantics of oom_score_adj (as a proportion of 
available memory compared to other tasks), which I tried doing in section 
3.1.  Then, the userspace tool need only be concerned about the units of 
oom_score_adj rather than whether rss, swap, or later extensions such as 
shm are added.

Thanks for working on this, Kame!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-26  0:52     ` David Rientjes
@ 2010-08-26  1:03       ` KAMEZAWA Hiroyuki
  2010-08-26  1:11       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-26  1:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	Minchan Kim

On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:
> 
> > I'm now trying to write a userspace tool to calculate this, for me.
> > Then, could you update documentation ? 
> > ==
> > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > -------------------------------------------------------------
> > 
> > This file can be used to check the current score used by the oom-killer is for
> > any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
> > process should be killed in an out-of-memory situation.
> > ==
> > 
> 
> You'll want to look at section 3.1 of Documentation/filesystems/proc.txt, 
> which describes /proc/pid/oom_score_adj, not 3.2.
> 
> > add a some documentation like:
> > ==
> > (For system monitoring tool developpers, not for usual users.)
> > oom_score calculation is implemnentation dependent and can be modified without
> > any caution. But current logic is
> > 
> > oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj
> > 
> 
> I'd hesitate to state the formula outside of the implementation and 
> instead focus on the semantics of oom_score_adj (as a proportion of 
> available memory compared to other tasks), which I tried doing in section 
> 3.1.  Then, the userspace tool need only be concerned about the units of 
> oom_score_adj rather than whether rss, swap, or later extensions such as 
> shm are added.
> 
> Thanks for working on this, Kame!
> 
BTW, why don't you subtract the amount of hugepages?

The old code used
	"totalrampages - hugepage" as available memory.

IIUC, the number of hugepages is not accounted in mm->rss, so isn't it
better to subtract the # of hugepages?
Hmm... or does it make no difference?

Thanks,
-Kame







^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-26  0:52     ` David Rientjes
  2010-08-26  1:03       ` KAMEZAWA Hiroyuki
@ 2010-08-26  1:11       ` KAMEZAWA Hiroyuki
  2010-08-26  2:50         ` David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-26  1:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	Minchan Kim

On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:
> 
> > I'm now trying to write a userspace tool to calculate this, for me.
> > Then, could you update documentation ? 
> > ==
> > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > -------------------------------------------------------------
> > 
> > This file can be used to check the current score used by the oom-killer is for
> > any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
> > process should be killed in an out-of-memory situation.
> > ==
> > 
> 
> You'll want to look at section 3.1 of Documentation/filesystems/proc.txt, 
> which describes /proc/pid/oom_score_adj, not 3.2.
> 
> > add a some documentation like:
> > ==
> > (For system monitoring tool developpers, not for usual users.)
> > oom_score calculation is implemnentation dependent and can be modified without
> > any caution. But current logic is
> > 
> > oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj
> > 
> 
> I'd hesitate to state the formula outside of the implementation and 
> instead focus on the semantics of oom_score_adj (as a proportion of 
> available memory compared to other tasks), which I tried doing in section 
> 3.1.  Then, the userspace tool need only be concerned about the units of 
> oom_score_adj rather than whether rss, swap, or later extensions such as 
> shm are added.
> 
Hmm. I'll add text like the following to cgroup/memory.txt. O.K.?

==
Notes on oom_score and oom_score_adj.

oom_score is calculated as
	oom_score = (task's proportion of memory) + oom_score_adj.

Then, when you use oom_score_adj to control the order of priority of oom,
you should know about the amount of memory you can use.
So, an approximate oom_score under memcg can be

 memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
		+ oom_score_adj.

And yes, this can be affected by the hierarchy control of memcg, and the
calculation will be more complicated. See the oom_disable feature as well.
==
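
As a quick numeric check of the rescaling above, using the figures from the
earlier example in this thread (task A: 200M used, oom_score_adj 300, 4G system,
2G memcg limit):

	system-wide oom_score = (200M * 1000)/4G + 300 = 350
	memcg_oom_score       = (350 - 300) * 4G/2G + 300 = 400

which matches the score computed directly against the 2G limit.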

Thanks,
-Kame













^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-26  1:11       ` KAMEZAWA Hiroyuki
@ 2010-08-26  2:50         ` David Rientjes
  2010-08-26  3:20           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2010-08-26  2:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	Minchan Kim

On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:

> Hmm. I'll add text like the following to cgroup/memory.txt. O.K.?
> 
> ==
> Notes on oom_score and oom_score_adj.
> 
> oom_score is calculated as
> 	oom_score = (task's proportion of memory) + oom_score_adj.
> 

I'd replace "memory" with "memory limit (or memsw limit)" so it's clear 
we're talking about the amount of memory available to the task.

> Then, when you use oom_score_adj to control the order of priority of oom,
> you should know about the amount of memory you can use.

Hmm, you need to know the amount of memory that you can use iff you know 
the memcg limit and it's a static value.  Otherwise, you only need to know 
the "memory usage of your application relative to others in the same 
cgroup."  An oom_score_adj of +300 adds 30% of that memcg's limit to the 
task, allowing all other tasks to use 30% more memory than that task while 
it is still the one killed.  An oom_score_adj of -300 allows that task to use 30% 
more memory than other tasks without getting killed.  These don't need to 
know the actual limit.

> So, an approximate oom_score under memcg can be
> 
>  memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
> 		+ oom_score_adj.
> 

Right, that's the exact score within the memcg.

But, I still wouldn't encourage a formula like this because the memcg 
limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change 
out from under us.  So it's more important to define oom_score_adj in the 
user's mind as a proportion of memory available to be added (either 
positively or negatively) to its memory use when comparing it to other 
tasks.  The point is that the memcg limit isn't interesting in this 
formula, it's more important to understand the priority of the task 
_compared_ to other tasks memory usage in that memcg.

It probably would be helpful, though, to know that if a vital system task 
uses 1G, for instance, in a 4G memcg, an oom_score_adj of -250 will 
disable oom killing for it.  If that task leaks memory or becomes 
significantly large, for whatever reason, it could be killed, but we _can_ 
discount the 1G in comparison to other tasks as the "cost of doing 
business" when it comes to vital system tasks:

	(memory usage) * (memory+swap limit / system memory)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-26  2:50         ` David Rientjes
@ 2010-08-26  3:20           ` KAMEZAWA Hiroyuki
  2010-08-26  3:52             ` David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-26  3:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	Minchan Kim

On Wed, 25 Aug 2010 19:50:22 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:
> 
> > Hmm. I'll add a text like following to cgroup/memory.txt. O.K. ?
> > 
> > ==
> > Notes on oom_score and oom_score_adj.
> > 
> > oom_score is calculated as
> > 	oom_score = (taks's proportion of memory) + oom_score_adj.
> > 
> 
> I'd replace "memory" with "memory limit (or memsw limit)" so it's clear 
> we're talking about the amount of memory available to the task.
> 
ok.

> > Then, when you use oom_score_adj to control the order of priority of oom,
> > you should know about the amount of memory you can use.
> 
> Hmm, you need to know the amount of memory that you can use iff you know 
> the memcg limit and it's a static value.  Otherwise, you only need to know 
> the "memory usage of your application relative to others in the same 
> cgroup."  An oom_score_adj of +300 adds 30% of that memcg's limit to the 
> task, allowing all other tasks to use 30% more memory than that task with 
> it still be killed.  An oom_score_adj of -300 allows that task to use 30% 
> more memory than other tasks without getting killed.  These don't need to 
> know the actual limit.
> 

Hmm. What's complicated is oom_score_adj's behavior.


> > So, an approximate oom_score under memcg can be
> > 
> >  memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
> > 		+ oom_score_adj.
> > 
> 
> Right, that's the exact score within the memcg.
> 
> But, I still wouldn't encourage a formula like this because the memcg 
> limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change 
> out from under us.  So it's more important to define oom_score_adj in the 
> user's mind as a proportion of memory available to be added (either 
> positively or negatively) to its memory use when comparing it to other 
> tasks.  The point is that the memcg limit isn't interesting in this 
> formula, it's more important to understand the priority of the task 
> _compared_ to other tasks memory usage in that memcg.
> 

Yes. For defining/understanding priority, oom_score_adj is that.
But its priority isn't static.

> It probably would be helpful, though, to know that if a vital system task 
> uses 1G, for instance, in a 4G memcg, an oom_score_adj of -250 will 
> disable oom killing for it.  

yes.

> If that task leaks memory or becomes 
> significantly large, for whatever reason, it could be killed, but we _can_ 
> discount the 1G in comparison to other tasks as the "cost of doing 
> business" when it comes to vital system tasks:
> 
> 	(memory usage) * (memory+swap limit / system memory)
> 

Yes. Under an 8G system, -250 will allow ignoring 2G of usage.

== How about this text? ==

When you set a task's oom_score_adj, it can be given priority not to be oom-killed.
oom_score_adj gives a priority proportional to the memory limitation.

Assume you set oom_score_adj to -250.

Under a 4G memory limit, it gets a 25% bonus... a 1G memory bonus for avoiding OOM.
Under an 8G memory limit, it gets a 25% bonus... a 2G memory bonus for avoiding OOM.

So, what bonus a task can get depends on the context of the OOM. If you use
oom_score_adj and want to give a bonus to a task, setting it with regard to the
minimum memory limitation the task runs under will work well.
==

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-26  3:20           ` KAMEZAWA Hiroyuki
@ 2010-08-26  3:52             ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2010-08-26  3:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	Minchan Kim

On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:

> > Hmm, you need to know the amount of memory that you can use iff you know 
> > the memcg limit and it's a static value.  Otherwise, you only need to know 
> > the "memory usage of your application relative to others in the same 
> > cgroup."  An oom_score_adj of +300 adds 30% of that memcg's limit to the 
> > task, allowing all other tasks to use 30% more memory than that task with 
> > it still be killed.  An oom_score_adj of -300 allows that task to use 30% 
> > more memory than other tasks without getting killed.  These don't need to 
> > know the actual limit.
> > 
> 
> Hmm. What's complicated is oom_score_adj's behavior.
> 

I understand it's a little tricky when dealing with memcg-constrained oom 
conditions versus system-wide oom conditions.  Think of it this way: if 
the system is oom, then every memcg, every cpuset, and every mempolicy is 
also oom.  That doesn't imply that something in every memcg, every cpuset, 
or every mempolicy must be killed, however.  What cgroup happens to be 
penalized in this scenario isn't necessarily the scope of oom_score_adj's 
purpose.  oom_score_adj certainly does have a stronger influence over a 
task's priority when it's a system oom and not a memcg oom because the 
size of available memory is different, but that's fine: we set positive 
and negative oom_score_adj values for a reason based on the application, 
and that's not necessarily (but can be) a function of the memcg or system 
capacity.  Again, oom_score_adj is only meaningful when considered 
relative to other candidate tasks since the badness score itself is 
considered relative to other candidate tasks.

You can have multiple tasks that have +1000 oom_score_adj values (or 
multiple tasks that have +15 oom_adj values).  Only one will be killed and 
it's dependent only on the ordering of the tasklist.  That isn't an 
exception case; it only means that we prevented needless oom killing.

> > > So, an approximate oom_score under memcg can be
> > > 
> > >  memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
> > > 		+ oom_score_adj.
> > > 
> > 
> > Right, that's the exact score within the memcg.
> > 
> > But, I still wouldn't encourage a formula like this because the memcg 
> > limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change 
> > out from under us.  So it's more important to define oom_score_adj in the 
> > user's mind as a proportion of memory available to be added (either 
> > positively or negatively) to its memory use when comparing it to other 
> > tasks.  The point is that the memcg limit isn't interesting in this 
> > formula, it's more important to understand the priority of the task 
> > _compared_ to other tasks memory usage in that memcg.
> > 
> 
> Yes. For defining/understanding priority, oom_score_adj is that.
> But its priority isn't static.
> 

If the memcg limit changes because we're attaching more tasks, yes, we may 
want to change its oom_score_adj relative to those tasks.  So 
oom_score_adj is a function of the attached tasks and its allowed set of 
resources in comparison to them, not the limit itself.

> > If that tasks leaks memory or becomes 
> > significantly large, for whatever reason, it could be killed, but we _can_ 
> > discount the 1G in comparison to other tasks as the "cost of doing 
> > business" when it comes to vital system tasks:
> > 
> > 	(memory usage) * (memory+swap limit / system memory)
> > 
> 
> Yes. Under an 8G system, -250 will allow ignoring 2G of usage.
> 

Yeah, that conversion could be useful if the system RAM capacity or memcg 
limit, etc, remains static.

> == How about this text ? ==
> 
> When you set a task's oom_score_adj, it can be given priority not to be oom-killed.

And the reverse, it can get priority to be killed :)

> oom_score_adj gives a priority proportional to the memory limitation.
> 
> Assume you set oom_score_adj to -250.
> 
> Under a 4G memory limit, it gets a 25% bonus... a 1G memory bonus for avoiding OOM.
> Under an 8G memory limit, it gets a 25% bonus... a 2G memory bonus for avoiding OOM.
> 
> So, what bonus a task can get depends on the context of the OOM. If you use
> oom_score_adj and want to give a bonus to a task, setting it with regard to the
> minimum memory limitation the task runs under will work well.

Very nice, and the "bonus" there is what the task can safely use in 
comparison to any other task competing for the same resources without 
getting selected itself because of that memory.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-25 10:25 ` [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() David Rientjes
  2010-08-26  0:39   ` KAMEZAWA Hiroyuki
@ 2010-08-30  2:58   ` KOSAKI Motohiro
  2010-09-01 22:06     ` David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2010-08-30  2:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Minchan Kim

> On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:
> 
> > Current oom_score_adj is completely broken because It is strongly bound
> > google usecase and ignore other all.
> > 
> 
> That's wrong, we don't even use this heuristic yet and there is nothing, 
> in any way, that is specific to Google.

Please show us some evidence. A big mouth is not a good way to persuade us.
I asked you to "COMMUNICATE WITH REAL WORLD USERS"; did you really do this?

> 
> > 1) Priority inversion
> >    As kamezawa-san pointed out, This break cgroup and lxr environment.
> >    He said,
> > 	> Assume 2 proceses A, B which has oom_score_adj of 300 and 0
> > 	> And A uses 200M, B uses 1G of memory under 4G system
> > 	>
> > 	> Under the system.
> > 	> 	A's socre = (200M *1000)/4G + 300 = 350
> > 	> 	B's score = (1G * 1000)/4G = 250.
> > 	>
> > 	> In the cpuset, it has 2G of memory.
> > 	> 	A's score = (200M * 1000)/2G + 300 = 400
> > 	> 	B's socre = (1G * 1000)/2G = 500
> > 	>
> > 	> This priority-inversion don't happen in current system.
> > 
> 
> You continually bring this up, and I've answered it three times, but 
> you've never responded to it before and completely ignore it.  

Yes, I ignored it. Don't talk about your dreams; I want to see a concrete use case.
As I repeatedly said, I don't care about you while you ignore real-world end users.
DEVELOPERS WHO DON'T CARE ABOUT STABILIZATION ARE NOT KIND TO END USERS; THEY ARE
HARMFUL. WE HAVE NO MERCY WHILE YOU CONTINUE THIS IMMORAL DEVELOPMENT.

I'm waiting one more day. Pray! I hope someone will join this discussion and
explain a real use case instead of you. We don't ignore end users. But if nobody
except you responds to this, then, since I don't care about your nack, I will
definitely send this patch to mainline.



> I really 
> hope and expect that you'll participate more in the development process 
> and not continue to reiterate your talking points when you have no answer 
> to my response.
> 
> You're wrong, especially with regard to cpusets, which were formerly part 
> of the heuristic itself.
> 
> Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means 
> of isolation and attach a set of resources (memory, in this case) for 
> those tasks to use.  The user who does this is fully aware of the set of 
> tasks being bound, there is no mystery or unexpected results when doing 
> so.  So when you set an oom_score_adj for a task, you don't necessarily 
> need to be aware of the set of resources it has available, which is 
> dynamic and an attribute of the system or cgroup, but rather the priority 
> of that task in competition with other tasks for the same resources.

That is YOUR policy. The problem is that IT'S NOT GENERIC.

> 
> _That_ is what is important in having a userspace influence on a badness 
> heuristic: how those badness scores compare relative to other tasks that 
> share the same resources.  That's how a task is chosen for oom kill, not 
> because of a static formula such as you're introducing here that outputs a 
> value (and, thus, a priority) regardless of the context in which the task 
> is bound.
> 
> That also means that the same task is not necessarily killed in a 
> cpuset-constrained oom compared to a system-wide oom.  If you bias a task 
> by 30% of available memory, which Kame did in his example above, it's 
> entirely plausible that task A should be killed because its actual usage 
> is only 1/20th of the machine.  When its cpuset is oom, and the admin has 
> specifically bound that task to only 2G of memory, we'd naturally want to 
> kill the memory hogger, that is using 50% of the total memory available to 
> it.

I agree that your implementation works fine if admins share your policy.
What I object to is the assumption that every admin in the world does.
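
For reference, here is a tiny sketch of the arithmetic in the example quoted
above (illustrative only: the score() helper and the rounded memory sizes come
from that example, not from the actual kernel code):

#include <stdio.h>

/* Simplified score from the quoted example: usage normalized to 0..1000
 * of the memory available in the current context, plus oom_score_adj. */
static long score(long used_mb, long avail_mb, long adj)
{
	return used_mb * 1000 / avail_mb + adj;
}

int main(void)
{
	/* A: 200M used, oom_score_adj +300;  B: 1G used, oom_score_adj 0 */
	printf("whole system (4G): A=%ld B=%ld\n",
	       score(200, 4000, 300), score(1000, 4000, 0));	/* 350 vs 250 */
	printf("cpuset (2G):       A=%ld B=%ld\n",
	       score(200, 2000, 300), score(1000, 2000, 0));	/* 400 vs 500 */
	/* The ordering flips: A is selected system-wide, B inside the cpuset. */
	return 0;
}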



> > 2) Ratio base point don't works large machine
> >    oom_score_adj normalize oom-score to 0-1000 range.
> >    but if the machine has 1TB memory, 1 point (i.e. 0.1%) mean
> >    1GB. this is no suitable for tuning parameter.
> >    As I said, proposional value oriented tuning parameter has
> >    scalability risk.
> > 
> 
> So you'd rather use the range of oom_adj from -17 to +15 instead of 
> oom_score_adj from -1000 to +1000 where each point is 68GB?  I don't 
> understand your point here as to why oom_score_adj is worse.

No. As I said,
 - If you want to solve a minority issue, you must not regress the majority of
   users.
 - If you want to solve a major issue with a big change, investigate world-wide
   use cases carefully and reflect them.

oom_score_adj has been pointed out to overlook a lot of use cases, so I request
either 1) redesigning it entirely, or 2) dropping the existing code change.



> But, yes, in reality we don't really care about the granularity so much 
> that we need to prioritize a task using 512MB more memory than another to 
> break the tie on a 1TB machine, 1/2048th of its memory.
> 
> > 3) No reason to implement ABI breakage.
> >    old tuning parameter mean)
> > 	oom-score = oom-base-score x 2^oom_adj
> 
> Everybody knows this is useless beyond polarizing a task for kill or 
> making it immune.
> 
> >    new tuning parameter mean)
> > 	oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
> 
> This, on the other hand, has an actual unit (proportion of available 
> memory) that can be used to prioritize tasks amongst those competing for 
> the same set of shared resources and remains constant even when a task 
> changes cpuset, its memcg limit changes, etc.
> 
> And your equation is wrong, it's
> 
> 	((rss + swap) / (available ram + swap)) + oom_score_adj
> 
> which is completely different from what you think it is.

Your equation can be rewritten as

	(rss + swap) + oom_score_adj x (available ram + swap)
	------------------------------------------------------
	                (available ram + swap)

That is, the equivalent adjustment can still be calculated in userland.
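
A quick numeric check of that rewriting (hypothetical values, and the
simplified form as quoted above, without the 0..1000 scaling):

#include <stdio.h>

int main(void)
{
	double rss = 200.0, swap = 56.0;	/* arbitrary example values */
	double avail = 4096.0;			/* "available ram + swap" */
	double adj = 0.3;			/* oom_score_adj, as a fraction here */

	double form_a = (rss + swap) / avail + adj;
	double form_b = ((rss + swap) + adj * avail) / avail;

	/* Both lines print the same number: the two forms are identical. */
	printf("%f\n%f\n", form_a, form_b);
	return 0;
}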



> >    but "oom_score_adj / (totalram + totalswap)" can be calculated in
> >    userland too. beucase both totalram and totalswap has been exporsed by
> >    /proc. So no reason to introduce funny new equation.
> > 
> 
> Yup, it definitely can, which is why as I mentioned to Kame (who doesn't 
> have strong feelings either way, even though you quote him as having these 
> strong objections) that you can easily convert oom_score_adj into a 
> stand-alone memory quantity (biasing or forgiving 512MB of a task's 
> memory, for example) in the context it is currently attached to with 
> simple arithmetic in userspace.  That's why oom_score_adj is powerful.

I already said I disagree. 
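
(For reference, the userland arithmetic both sides keep referring to is of this
kind; a sketch only, with a hypothetical 512MB bias, using system-wide totals
from sysinfo(2) and therefore ignoring any cpuset or memcg restriction:)

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;
	long long bias_bytes = 512LL << 20;	/* bias (or forgive) 512MB */
	long long total;
	long adj;

	if (sysinfo(&si) != 0)
		return 1;
	total = ((long long)si.totalram + si.totalswap) * si.mem_unit;

	/* One oom_score_adj point is 1/1000 of (ram + swap), so a fixed
	 * memory quantity converts to a score adjustment like this: */
	adj = (long)(bias_bytes * 1000 / total);
	printf("echo %ld > /proc/<pid>/oom_score_adj\n", adj);
	return 0;
}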


> 
> > 4) totalram based normalization assume flat memory model.
> >    example, the machine is assymmetric numa. fat node memory and thin
> >    node memory might have another wight value.
> >    In other word, totalram based priority is a one of policy. Fixed and
> >    workload depended policy shouldn't be embedded in kernel. probably.
> > 
> 
> I don't know what this means, and this was your criticism before I changed 
> the denominator during the revision of the patchset, so it's probably 
> obsoleted.  oom_score_adj always operates based on the proportion of 
> memory available to the application which is how the new oom killer 
> determines which tasks to kill: relative to the importance (if defined by 
> userspace) and memory usage compared to other tasks competing for it.

I already explained the asymmetric NUMA issue in the past. Again, don't assume
your own policy and your own machines when you want to change kernel core code.



> 
> > Then, this patch remove *UGLY* total_pages suck completely. Googler
> > can calculate it at userland!
> > 
> 
> Nothing specific about any of this to Google.  Users who actually setup 
> their machines to use mempolicies, cpusets, or memcgs actually do want a 
> powerful interface from userspace to tune the priorities in terms of both 
> business goals and also importance of the task.  That is done much more 
> powerfully now with oom_score_adj than the previous implementation.  Users 
> who don't use these cgroups, especially desktop users, can see 
> oom_score_adj in terms of a memory quantity that remains static: they 
> aren't going to encounter changing memcg limits, cpuset mems, etc.
> 
> That said, I really don't know why you keep mentioning "Google this" and 
> "Google that" when the company I'm working for is really irrelevant to 
> this discussion.

Please don't talk about what you imagine. You have to talk about concrete use cases.



> With that, I respectfully nack your patch.

Sorry, I don't care about this NAK. Please fix yourself.

Thanks.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-08-30  2:58   ` KOSAKI Motohiro
@ 2010-09-01 22:06     ` David Rientjes
  2010-09-08  2:44       ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2010-09-01 22:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki,
	Minchan Kim

On Mon, 30 Aug 2010, KOSAKI Motohiro wrote:

> > > Current oom_score_adj is completely broken because It is strongly bound
> > > google usecase and ignore other all.
> > > 
> > 
> > That's wrong, we don't even use this heuristic yet and there is nothing, 
> > in any way, that is specific to Google.
> 
> Please show us evidence. Making big claims is not the way to persuade us.

Evidence that Google isn't using this currently?

> I asked you to "COMMUNICATE WITH REAL-WORLD USERS"; have you actually done that?
> 

We are certainly looking forward to using this when 2.6.36 is released 
since we work with both cpusets and memcg.

> > > 1) Priority inversion
> > >    As kamezawa-san pointed out, This break cgroup and lxr environment.
> > >    He said,
> > > 	> Assume 2 proceses A, B which has oom_score_adj of 300 and 0
> > > 	> And A uses 200M, B uses 1G of memory under 4G system
> > > 	>
> > > 	> Under the system.
> > > 	> 	A's socre = (200M *1000)/4G + 300 = 350
> > > 	> 	B's score = (1G * 1000)/4G = 250.
> > > 	>
> > > 	> In the cpuset, it has 2G of memory.
> > > 	> 	A's score = (200M * 1000)/2G + 300 = 400
> > > 	> 	B's socre = (1G * 1000)/2G = 500
> > > 	>
> > > 	> This priority-inversion don't happen in current system.
> > > 
> > 
> > You continually bring this up, and I've answered it three times, but 
> > you've never responded to it before and completely ignore it.  
> 
> Yes, I ignored it. Don't talk about your dreams; I want to see a concrete use case.
> As I repeatedly said, I don't care about your arguments while you ignore real-world end users.
> DEVELOPERS WHO DON'T CARE ABOUT STABILIZATION ARE NOT KIND TO END USERS; THEY ARE
> HARMFUL. WE HAVE NO MERCY WHILE YOU CONTINUE THIS IRRESPONSIBLE DEVELOPMENT.
> 

I'm not ignoring any user with this change, oom_score_adj is an extremely 
powerful interface for users who want to use it.  I'm sorry that it's not 
as simple to use as you may like.

Basically, it comes down to this: few users actually tune their oom 
killing priority, period.  That's partly because they accept the oom 
killer's heuristics to kill a memory-hogging task or use panic_on_oom, or 
because the old interface, /proc/pid/oom_adj, had no unit and no logical 
way of using it other than polarizing it (either +15 or -17).

For those users who do change their oom killing priority, few are using 
cpusets or memcg.  Yes, the priority changes depending on the context of 
the oom, but for users who don't use these cgroups the oom_score_adj unit 
is static since the amount of system memory (the only oom constraint) is 
static.

Now, for the users of both oom_score_adj and cpusets or memcg (in the 
future this will include Google), these users are interested in oom 
killing priority relative to other tasks attached to the same set of 
resources.  For our particular use case, we attach an aggregate of tasks 
to a cgroup and have a preference on the order in which those tasks are 
killed whenever that cgroup limit is exhausted.  We also care about 
protecting vital system tasks so that they aren't targeted before others 
are killed, such as job schedulers.

I think the key point you're missing in our use case is that we don't 
necessarily care about the system-wide oom condition when we're running with 
cpusets or memcg.  We can protect tasks with negative oom_score_adj, but 
we don't care about close tiebreakers on which cpuset or memcg is penalized 
when the entire system is out of memory.  If that's the case, each cpuset 
and memcg is also, by definition, out of memory, so they are all subject 
to the oom killer.  This is equivalent to having several tasks with an 
oom_score_adj of +1000 (or oom_adj of +15) and only one getting killed 
based on the order of the tasklist.

So there is actually no "bug" or "regression" in this behavior (especially 
since the old oom killer had inversion as well because it factored cpuset 
placement into the heuristic score) and describing it as such is 
misleading.  It's actually a very powerful interface for those who choose 
to use it and accurately reflects the way the oom killer chooses tasks: 
relative to other eligible tasks competing for the same set of resources.

> > I don't know what this means, and this was your criticism before I changed 
> > the denominator during the revision of the patchset, so it's probably 
> > obsoleted.  oom_score_adj always operates based on the proportion of 
> > memory available to the application which is how the new oom killer 
> > determines which tasks to kill: relative to the importance (if defined by 
> > userspace) and memory usage compared to other tasks competing for it.
> 
> I already explained the asymmetric NUMA issue in the past. Again, don't assume
> your own policy and your own machines when you want to change kernel core code.
> 

Please explain it again, I don't see what the asymmetry in NUMA node size 
has to do with either mempolicy or cpuset ooms.  The administrator is 
fully aware of the sizes of these nodes at the time he or she attaches 
them.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-09-01 22:06     ` David Rientjes
@ 2010-09-08  2:44       ` KOSAKI Motohiro
  2010-09-08  3:12         ` David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08  2:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Minchan Kim

> On Mon, 30 Aug 2010, KOSAKI Motohiro wrote:
> 
> > > > Current oom_score_adj is completely broken because It is strongly bound
> > > > google usecase and ignore other all.
> > > > 
> > > 
> > > That's wrong, we don't even use this heuristic yet and there is nothing, 
> > > in any way, that is specific to Google.
> > 
> > Please show us evidence. Making big claims is not the way to persuade us.
> 
> Evidence that Google isn't using this currently?

Of course; there is simply zero justification. Who asked about Google's usage?
Every developer has their own debugging and machine-administration patches.
Haven't you wondered why we don't push those upstream? A single user's usage is
not a good reason to bloat the kernel; it needs to be generalized.


> 
> > I asked you to "COMMUNICATE WITH REAL-WORLD USERS"; have you actually done that?
> > 
> 
> We are certainly looking forward to using this when 2.6.36 is released 
> since we work with both cpusets and memcg.

Could you please be serious? We are not building a sand castle, we are building
a kernel; you have to understand the difference. A feature trial with zero users
is no justification for breaking userland.



> > > > 1) Priority inversion
> > > >    As kamezawa-san pointed out, This break cgroup and lxr environment.
> > > >    He said,
> > > > 	> Assume 2 proceses A, B which has oom_score_adj of 300 and 0
> > > > 	> And A uses 200M, B uses 1G of memory under 4G system
> > > > 	>
> > > > 	> Under the system.
> > > > 	> 	A's socre = (200M *1000)/4G + 300 = 350
> > > > 	> 	B's score = (1G * 1000)/4G = 250.
> > > > 	>
> > > > 	> In the cpuset, it has 2G of memory.
> > > > 	> 	A's score = (200M * 1000)/2G + 300 = 400
> > > > 	> 	B's socre = (1G * 1000)/2G = 500
> > > > 	>
> > > > 	> This priority-inversion don't happen in current system.
> > > > 
> > > 
> > > You continually bring this up, and I've answered it three times, but 
> > > you've never responded to it before and completely ignore it.  
> > 
> > Yes, I ignored it. Don't talk about your dreams; I want to see a concrete use case.
> > As I repeatedly said, I don't care about your arguments while you ignore real-world end users.
> > DEVELOPERS WHO DON'T CARE ABOUT STABILIZATION ARE NOT KIND TO END USERS; THEY ARE
> > HARMFUL. WE HAVE NO MERCY WHILE YOU CONTINUE THIS IRRESPONSIBLE DEVELOPMENT.
> > 
> 
> I'm not ignoring any user with this change, oom_score_adj is an extremely 
> powerful interface for users who want to use it.  I'm sorry that it's not 
> as simple to use as you may like.
> 
> Basically, it comes down to this: few users actually tune their oom 
> killing priority, period.  That's partly because they accept the oom 
> killer's heuristics to kill a memory-hogging task or use panic_on_oom, or 
> because the old interface, /proc/pid/oom_adj, had no unit and no logical 
> way of using it other than polarizing it (either +15 or -17).

Unrelated. We already have the oom notifier; there is no reason to add a new knob
even if oom_adj is bad. We don't need to break userland applications.



> 
> For those users who do change their oom killing priority, few are using 
> cpusets or memcg.  Yes, the priority changes depending on the context of 
> the oom, but for users who don't use these cgroups the oom_score_adj unit 
> is static since the amount of system memory (the only oom constraint) is 
> static.
> 
> Now, for the users of both oom_score_adj and cpusets or memcg (in the 
> future this will include Google), these users are interested in oom 
> killing priority relative to other tasks attached to the same set of 
> resources.  For our particular use case, we attach an aggregate of tasks 
> to a cgroup and have a preference on the order in which those tasks are 
> killed whenever that cgroup limit is exhausted.  We also care about 
> protecting vital system tasks so that they aren't targeted before others 
> are killed, such as job schedulers.

The memcg limit is already exposed via /cgroup/memory.limit_in_bytes.
This is clearly userland's role.

The fact is, my patch is more powerful than yours, because your patch hard-codes
one oom management policy while mine does not; it can be customized to fit each
customer.

More importantly, we already have the oom notifier, and it is the most powerful
infrastructure: any oom policy can be constructed on top of it freely, without
being restricted to the kernel's implementation.
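
A minimal sketch of the in-kernel hook, register_oom_notifier() (illustrative
module only, and one possible reading of "oom notifier": this hook lets a
callback free memory and report how much it released just before the oom
killer runs):

#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/oom.h>

/* Called from out_of_memory() before a victim is chosen; *parm is the
 * running count of pages the notifier chain managed to free. */
static int example_oom_notify(struct notifier_block *nb,
			      unsigned long unused, void *parm)
{
	unsigned long *freed = parm;

	/* A site-specific policy could drop caches it owns here and add
	 * the number of released pages to *freed.  Stubbed out. */
	*freed += 0;
	return NOTIFY_OK;
}

static struct notifier_block example_oom_nb = {
	.notifier_call = example_oom_notify,
};

static int __init example_init(void)
{
	return register_oom_notifier(&example_oom_nb);
}

static void __exit example_exit(void)
{
	unregister_oom_notifier(&example_oom_nb);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");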


The fact is, your new interface doesn't match HPC, server, banking, or embedded
systems, AFAIK. At least, I couldn't find such a use case in my work experience
in those areas. Also, I now do journalism and so have some connections in Linux
user groups, but I didn't get positive feedback on your interface; instead I got
some negative feedback. Of course, I don't know the whole world, which is why I
asked you repeatedly for real-world use cases. But I've got no response. You
need to prove that your feature improves the world.



> I think the key point you're missing in our use case is that we don't 
> necessarily care about the system-wide oom condition when we're running with 
> cpusets or memcg.  We can protect tasks with negative oom_score_adj, but 
> we don't care about close tiebreakers on which cpuset or memcg is penalized 
> when the entire system is out of memory.  If that's the case, each cpuset 
> and memcg is also, by definition, out of memory, so they are all subject 
> to the oom killer.  This is equivalent to having several tasks with an 
> oom_score_adj of +1000 (or oom_adj of +15) and only one getting killed 
> based on the order of the tasklist.

Unrelated. You are still talking about your own policy. Why do we need to care
about it? I'm not preventing you from using your policy.



> So there is actually no "bug" or "regression" in this behavior (especially 
> since the old oom killer had inversion as well because it factored cpuset 
> placement into the heuristic score) and describing it as such is 
> misleading.  It's actually a very powerful interface for those who choose 
> to use it and accurately reflects the way the oom killer chooses tasks: 
> relative to other eligible tasks competing for the same set of resources.

Well, this is clearly a bug. oom_adj's behavior was changed, and it was deprecated
by mistake, so the latest kernel prints a pointless warning at every boot.

That said, be serious! Otherwise, GO AWAY.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()
  2010-09-08  2:44       ` KOSAKI Motohiro
@ 2010-09-08  3:12         ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2010-09-08  3:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Oleg Nesterov, KAMEZAWA Hiroyuki,
	Minchan Kim

On Wed, 8 Sep 2010, KOSAKI Motohiro wrote:

> Of course; there is simply zero justification. Who asked about Google's usage?
> Every developer has their own debugging and machine-administration patches.
> Haven't you wondered why we don't push those upstream? A single user's usage is
> not a good reason to bloat the kernel; it needs to be generalized.
> 

/proc/pid/oom_adj was introduced before cpusets or memcg, so we were 
dealing with a static amount of system resources that would not change out 
from underneath an application.  Although that's not a defense of using a 
bitshift on a heuristic that included many arbitrary selections, it was 
reasonable to make it only a scalar that didn't consider the amount of 
resources that an application was allowed to access.

As time moved on, cgroups such as cpusets and memcg were introduced (and 
mempolicies became much more popular as larger NUMA machines became more 
popular in the industry) that bound or restricted the amount of memory 
that an aggregate of tasks could access.  Each of those methods may change 
the amount of memory resources that an application has available to it at 
any time without knowledge of that application, job scheduler, or system 
daemon.  Therefore, it's important to define a more powerful oom_adj 
mechanism that can properly attribute the oom killing priority of a task 
in comparison to others that are bound to the same resources without 
causing a regression on users who don't use those mechanisms that allow 
that working set size to be dynamic.  That's what oom_score_adj does.

> > We are certainly looking forward to using this when 2.6.36 is released 
> > since we work with both cpusets and memcg.
> 
> Could you please be serious? We are not building a sand castle, we are building
> a kernel; you have to understand the difference. A feature trial with zero users
> is no justification for breaking userland.
> 

With respect to memory isolation and the oom killer, we want to follow the 
upstream behavior as much as possible.  We are looking forward to this 
interface being available in 2.6.36 and will begin to actively use it once 
it is released.  So although there is no existing user today, there will 
be when 2.6.36 is released.  I also hope that other users of cpusets or 
memcg will find it helpful to define oom killing priority with a unit 
rather than a bitshift and that understands the dynamic nature of resource 
isolation that they both imply.

> > Basically, it comes down to this: few users actually tune their oom 
> > killing priority, period.  That's partly because they accept the oom 
> > killer's heuristics to kill a memory-hogging task or use panic_on_oom, or 
> > because the old interface, /proc/pid/oom_adj, had no unit and no logical 
> > way of using it other than polarizing it (either +15 or -17).
> 
> Unrelated. We already have the oom notifier; there is no reason to add a new knob
> even if oom_adj is bad. We don't need to break userland applications.
> 

There is no generic oom notifier solution other than what is implemented 
in the memory controller, and that comes at a cost of roughly 1% of system 
RAM since that's the amount of metadata that the memcg requires.  
oom_score_adj works for cpusets, memcg, and mempolicies.

Now you may insist that the fact that very few users actually use 
/proc/pid/oom_adj for _anything_ other than -17, -15 or +15 is unrelated, 
but in fact it provides a very nice context for your objections that it 
introduces all sorts of regressions and bugs that will break everybody's 
system.  Show me a single example of an application that tunes its oom_adj 
value based on any formula whatsoever that includes its own anticipated 
RAM usage or system capacity (which is required to make the bitshift make 
any sense).
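
(Illustrative only: with the old bitshift semantics, where a positive oom_adj
doubles the score per point, an application that wanted a fixed-factor bias
would have to do something like the following, which is why values other than
the extremes were rarely useful:)

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical goal: "count me as 1.5x my real size". */
	double target_factor = 1.5;

	/* score = base << oom_adj (or >> for negative values), so the
	 * nearest expressible bias is a power of two: */
	int oom_adj = (int)lround(log2(target_factor));

	/* log2(1.5) ~= 0.58 rounds to 1, i.e. the task is doubled;
	 * nothing between "unchanged" and "doubled" can be expressed. */
	printf("oom_adj = %d (factor %.1f requested, %d applied)\n",
	       oom_adj, target_factor, 1 << oom_adj);
	return 0;
}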

> More importantly, we already have the oom notifier, and it is the most powerful
> infrastructure: any oom policy can be constructed on top of it freely, without
> being restricted to the kernel's implementation.
> 

A generic oom notifier that isn't dependent on the memory controller would be
very nice to have in the kernel.  Unfortunately, many users cannot afford the
1% of memory penalty that it comes with.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-09-08  3:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-25  9:42 [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() KOSAKI Motohiro
2010-08-25  9:42 ` [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable" KOSAKI Motohiro
2010-08-25 10:27   ` David Rientjes
2010-08-25 10:25 ` [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness() David Rientjes
2010-08-26  0:39   ` KAMEZAWA Hiroyuki
2010-08-26  0:52     ` David Rientjes
2010-08-26  1:03       ` KAMEZAWA Hiroyuki
2010-08-26  1:11       ` KAMEZAWA Hiroyuki
2010-08-26  2:50         ` David Rientjes
2010-08-26  3:20           ` KAMEZAWA Hiroyuki
2010-08-26  3:52             ` David Rientjes
2010-08-30  2:58   ` KOSAKI Motohiro
2010-09-01 22:06     ` David Rientjes
2010-09-08  2:44       ` KOSAKI Motohiro
2010-09-08  3:12         ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).