All of lore.kernel.org
 help / color / mirror / Atom feed
* [0/10] 3rd pile of OOM patch series
@ 2010-06-08 11:53 ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:53 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

This is 3rd pile of collection of OOM patches.
I think they don't immix anyone objected patch. but please double check.


=====================================================================
 Documentation/sysctl/vm.txt |    2 +-
 fs/proc/base.c              |    4 +-
 include/linux/mempolicy.h   |   13 +++-
 include/linux/oom.h         |    8 ++
 kernel/sysctl.c             |    4 +-
 mm/mempolicy.c              |   44 ++++++++++++
 mm/oom_kill.c               |  156 ++++++++++++++++++++++---------------------
 7 files changed, 146 insertions(+), 85 deletions(-)




^ permalink raw reply	[flat|nested] 50+ messages in thread

* [0/10] 3rd pile of OOM patch series
@ 2010-06-08 11:53 ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:53 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

This is 3rd pile of collection of OOM patches.
I think they don't immix anyone objected patch. but please double check.


=====================================================================
 Documentation/sysctl/vm.txt |    2 +-
 fs/proc/base.c              |    4 +-
 include/linux/mempolicy.h   |   13 +++-
 include/linux/oom.h         |    8 ++
 kernel/sysctl.c             |    4 +-
 mm/mempolicy.c              |   44 ++++++++++++
 mm/oom_kill.c               |  156 ++++++++++++++++++++++---------------------
 7 files changed, 146 insertions(+), 85 deletions(-)



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 01/10] oom: don't try to kill oom_unkillable child
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 11:54   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:54 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now, badness() doesn't care neigher CPUSET nor mempolicy. Then
if the victim child process is oom_unkillable()==1, __out_of_memory()
can makes kernel hang eventually.

This patch fixes it.


[remark: this is needed to fold "oom: sacrifice child with highest
badness score for parent"]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d49d542..0d7397b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -387,9 +387,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 			      int verbose)
 {
-	if (oom_unkillable(p, mem))
-		return 1;
-
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 1;
@@ -440,6 +437,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (c->mm == p->mm)
 				continue;
+			if (oom_unkillable(c, mem, nodemask))
+				continue;
 
 			/* badness() returns 0 if the thread is unkillable */
 			cpoints = badness(c, uptime.tv_sec);
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 01/10] oom: don't try to kill oom_unkillable child
@ 2010-06-08 11:54   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:54 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now, badness() doesn't care neigher CPUSET nor mempolicy. Then
if the victim child process is oom_unkillable()==1, __out_of_memory()
can makes kernel hang eventually.

This patch fixes it.


[remark: this is needed to fold "oom: sacrifice child with highest
badness score for parent"]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d49d542..0d7397b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -387,9 +387,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 			      int verbose)
 {
-	if (oom_unkillable(p, mem))
-		return 1;
-
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 1;
@@ -440,6 +437,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (c->mm == p->mm)
 				continue;
+			if (oom_unkillable(c, mem, nodemask))
+				continue;
 
 			/* badness() returns 0 if the thread is unkillable */
 			cpoints = badness(c, uptime.tv_sec);
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 02/10] oom: remove verbose argument from __oom_kill_process()
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 11:55   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:55 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now, verbose argument is unused. This patch remove it.

[remark: this patch is needed to fold to "oom: remove PF_EXITING
check completely"]

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0d7397b..3c83fba 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -384,20 +384,18 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
  * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
  * set.
  */
-static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
-			      int verbose)
+static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
 {
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 1;
 
-	if (verbose)
-		printk(KERN_ERR "Killed process %d (%s) "
-		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		       task_pid_nr(p), p->comm,
-		       K(p->mm->total_vm),
-		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
-		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
+	printk(KERN_ERR "Killed process %d (%s) "
+	       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+	       task_pid_nr(p), p->comm,
+	       K(p->mm->total_vm),
+	       K(get_mm_counter(p->mm, MM_ANONPAGES)),
+	       K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
 	/*
@@ -437,7 +435,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (c->mm == p->mm)
 				continue;
-			if (oom_unkillable(c, mem, nodemask))
+			if (oom_unkillable(c, mem))
 				continue;
 
 			/* badness() returns 0 if the thread is unkillable */
@@ -449,7 +447,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	} while_each_thread(p, t);
 
-	return __oom_kill_process(victim, mem, 1);
+	return __oom_kill_process(victim, mem);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 02/10] oom: remove verbose argument from __oom_kill_process()
@ 2010-06-08 11:55   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:55 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now, verbose argument is unused. This patch remove it.

[remark: this patch is needed to fold to "oom: remove PF_EXITING
check completely"]

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0d7397b..3c83fba 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -384,20 +384,18 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
  * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
  * set.
  */
-static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
-			      int verbose)
+static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
 {
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 1;
 
-	if (verbose)
-		printk(KERN_ERR "Killed process %d (%s) "
-		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		       task_pid_nr(p), p->comm,
-		       K(p->mm->total_vm),
-		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
-		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
+	printk(KERN_ERR "Killed process %d (%s) "
+	       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+	       task_pid_nr(p), p->comm,
+	       K(p->mm->total_vm),
+	       K(get_mm_counter(p->mm, MM_ANONPAGES)),
+	       K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
 	/*
@@ -437,7 +435,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (c->mm == p->mm)
 				continue;
-			if (oom_unkillable(c, mem, nodemask))
+			if (oom_unkillable(c, mem))
 				continue;
 
 			/* badness() returns 0 if the thread is unkillable */
@@ -449,7 +447,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	} while_each_thread(p, t);
 
-	return __oom_kill_process(victim, mem, 1);
+	return __oom_kill_process(victim, mem);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 03/10] oom: rename badness() to oom_badness()
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 11:56   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:56 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

badness() is wrong name because it's too generic name.

rename it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |    4 +---
 include/linux/oom.h |    2 ++
 mm/oom_kill.c       |   10 +++++-----
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2222102..c099f03 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -427,8 +427,6 @@ static const struct file_operations proc_lstats_operations = {
 
 #endif
 
-/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points = 0;
@@ -437,7 +435,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = badness(task, uptime.tv_sec);
+		points = oom_badness(task, uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5376623..effb223 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -14,6 +14,7 @@
 
 struct zonelist;
 struct notifier_block;
+struct task_struct;
 
 /*
  * Types of limitations to the nodes from which allocations may occur
@@ -29,6 +30,7 @@ extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask);
+extern unsigned long oom_badness(struct task_struct *p, unsigned long uptime);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c83fba..80492ff 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -66,7 +66,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
 }
 
 /**
- * badness - calculate a numeric value for how bad this task has been
+ * oom_badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
  * @uptime: current uptime in seconds
  *
@@ -84,7 +84,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
  *    of least surprise ... (be careful when you change it)
  */
 
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long oom_badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
 	struct task_struct *c;
@@ -302,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
 
-		points = badness(p, uptime.tv_sec);
+		points = oom_badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -438,8 +438,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			if (oom_unkillable(c, mem))
 				continue;
 
-			/* badness() returns 0 if the thread is unkillable */
-			cpoints = badness(c, uptime.tv_sec);
+			/* oom_badness() returns 0 if the thread is unkillable */
+			cpoints = oom_badness(c, uptime.tv_sec);
 			if (cpoints > victim_points) {
 				victim = c;
 				victim_points = cpoints;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 03/10] oom: rename badness() to oom_badness()
@ 2010-06-08 11:56   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:56 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

badness() is wrong name because it's too generic name.

rename it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |    4 +---
 include/linux/oom.h |    2 ++
 mm/oom_kill.c       |   10 +++++-----
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2222102..c099f03 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -427,8 +427,6 @@ static const struct file_operations proc_lstats_operations = {
 
 #endif
 
-/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points = 0;
@@ -437,7 +435,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = badness(task, uptime.tv_sec);
+		points = oom_badness(task, uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5376623..effb223 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -14,6 +14,7 @@
 
 struct zonelist;
 struct notifier_block;
+struct task_struct;
 
 /*
  * Types of limitations to the nodes from which allocations may occur
@@ -29,6 +30,7 @@ extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask);
+extern unsigned long oom_badness(struct task_struct *p, unsigned long uptime);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c83fba..80492ff 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -66,7 +66,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
 }
 
 /**
- * badness - calculate a numeric value for how bad this task has been
+ * oom_badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
  * @uptime: current uptime in seconds
  *
@@ -84,7 +84,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
  *    of least surprise ... (be careful when you change it)
  */
 
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long oom_badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
 	struct task_struct *c;
@@ -302,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
 
-		points = badness(p, uptime.tv_sec);
+		points = oom_badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -438,8 +438,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			if (oom_unkillable(c, mem))
 				continue;
 
-			/* badness() returns 0 if the thread is unkillable */
-			cpoints = badness(c, uptime.tv_sec);
+			/* oom_badness() returns 0 if the thread is unkillable */
+			cpoints = oom_badness(c, uptime.tv_sec);
 			if (cpoints > victim_points) {
 				victim = c;
 				victim_points = cpoints;
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 04/10] oom: move sysctl declarations to oom.h
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 11:57   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:57 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

The three oom killer sysctl variables (sysctl_oom_dump_tasks,
sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
declared in include/linux/oom.h rather than kernel/sysctl.c.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/oom.h |    6 ++++++
 kernel/sysctl.c     |    4 +---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index effb223..14eb449 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -45,5 +45,11 @@ static inline void oom_killer_enable(void)
 {
 	oom_killer_disabled = false;
 }
+
+/* sysctls */
+extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_allocating_task;
+extern int sysctl_panic_on_oom;
+
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ac884..4dfd618 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -55,6 +55,7 @@
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -87,9 +88,6 @@
 /* External variables not in a header file. */
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
-extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 04/10] oom: move sysctl declarations to oom.h
@ 2010-06-08 11:57   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:57 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

The three oom killer sysctl variables (sysctl_oom_dump_tasks,
sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
declared in include/linux/oom.h rather than kernel/sysctl.c.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/oom.h |    6 ++++++
 kernel/sysctl.c     |    4 +---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index effb223..14eb449 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -45,5 +45,11 @@ static inline void oom_killer_enable(void)
 {
 	oom_killer_disabled = false;
 }
+
+/* sysctls */
+extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_allocating_task;
+extern int sysctl_panic_on_oom;
+
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ac884..4dfd618 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -55,6 +55,7 @@
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -87,9 +88,6 @@
 /* External variables not in a header file. */
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
-extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 05/10] oom: enable oom tasklist dump by default
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 11:58   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:58 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
very helpful information in diagnosing why a user's task has been killed.
It emits useful information such as each eligible thread's memory usage
that can determine why the system is oom, so it should be enabled by
default.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/sysctl/vm.txt |    2 +-
 mm/oom_kill.c               |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 5fdbb61..3936b61 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -511,7 +511,7 @@ information may not be desired.
 If this is set to non-zero, this information is shown whenever the
 OOM killer actually kills a memory-hogging task.
 
-The default value is 0.
+The default value is 1 (enabled).
 
 ==============================================================
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 80492ff..b88172c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -31,7 +31,7 @@
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
+int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 05/10] oom: enable oom tasklist dump by default
@ 2010-06-08 11:58   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:58 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
very helpful information in diagnosing why a user's task has been killed.
It emits useful information such as each eligible thread's memory usage
that can determine why the system is oom, so it should be enabled by
default.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/sysctl/vm.txt |    2 +-
 mm/oom_kill.c               |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 5fdbb61..3936b61 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -511,7 +511,7 @@ information may not be desired.
 If this is set to non-zero, this information is shown whenever the
 OOM killer actually kills a memory-hogging task.
 
-The default value is 0.
+The default value is 1 (enabled).
 
 ==============================================================
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 80492ff..b88172c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -31,7 +31,7 @@
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
+int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 06/10] oom: cleanup has_intersects_mems_allowed()
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 11:59   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:59 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now has_intersects_mems_allowed() has own thread iterate logic, but
it should use while_each_thread().

It slightly improve the code readability.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b88172c..8376ad1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,16 +38,14 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /*
  * Is all threads of the target process nodes overlap ours?
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static int has_intersects_mems_allowed(struct task_struct *p)
 {
-	struct task_struct *t;
+	struct task_struct *t = p;
 
-	t = tsk;
 	do {
 		if (cpuset_mems_allowed_intersects(current, t))
 			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
+	} while_each_thread(p, t);
 
 	return 0;
 }
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 06/10] oom: cleanup has_intersects_mems_allowed()
@ 2010-06-08 11:59   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:59 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now has_intersects_mems_allowed() has own thread iterate logic, but
it should use while_each_thread().

It slightly improve the code readability.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b88172c..8376ad1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,16 +38,14 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /*
  * Is all threads of the target process nodes overlap ours?
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static int has_intersects_mems_allowed(struct task_struct *p)
 {
-	struct task_struct *t;
+	struct task_struct *t = p;
 
-	t = tsk;
 	do {
 		if (cpuset_mems_allowed_intersects(current, t))
 			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
+	} while_each_thread(p, t);
 
 	return 0;
 }
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 07/10] oom: kill useless debug print
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 11:59   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:59 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now, all of oom developers usually are using sysctl_oom_dump_tasks.
Redundunt useless debug print can be removed.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8376ad1..e7d3a5d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -33,7 +33,6 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */
 
 /*
  * Is all threads of the target process nodes overlap ours?
@@ -201,10 +200,6 @@ unsigned long oom_badness(struct task_struct *p, unsigned long uptime)
 			points >>= -(oom_adj);
 	}
 
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
-	p->pid, p->comm, points);
-#endif
 	return points;
 }
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 07/10] oom: kill useless debug print
@ 2010-06-08 11:59   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:59 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Now, all of oom developers usually are using sysctl_oom_dump_tasks.
Redundunt useless debug print can be removed.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8376ad1..e7d3a5d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -33,7 +33,6 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */
 
 /*
  * Is all threads of the target process nodes overlap ours?
@@ -201,10 +200,6 @@ unsigned long oom_badness(struct task_struct *p, unsigned long uptime)
 			points >>= -(oom_adj);
 	}
 
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
-	p->pid, p->comm, points);
-#endif
 	return points;
 }
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/10] oom: use send_sig() instead force_sig()
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 12:01   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 12:01 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Oleg, I parsed your mention mean following patch, correct?


===========================================
Oleg pointed out oom_kill.c has force_sig() abuse. force_sig() mean 
ignore signal mask. but SIGKILL itself is not maskable.
So, we can use send_sig() sefely.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e7d3a5d..599f977 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
 
-	force_sig(SIGKILL, p);
+	send_sig(SIGKILL, p, 1);
 
 	return 0;
 }
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/10] oom: use send_sig() instead force_sig()
@ 2010-06-08 12:01   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 12:01 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

Oleg, I parsed your mention mean following patch, correct?


===========================================
Oleg pointed out oom_kill.c has force_sig() abuse. force_sig() mean 
ignore signal mask. but SIGKILL itself is not maskable.
So, we can use send_sig() sefely.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e7d3a5d..599f977 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
 
-	force_sig(SIGKILL, p);
+	send_sig(SIGKILL, p, 1);
 
 	return 0;
 }
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/10] oom: filter tasks not sharing the same cpuset
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 12:02   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 12:02 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.  To use a sane heuristic, we must ensure that killing a
task would likely free memory for current and avoid needlessly killing
others at all costs just because their potential memory freeing is
unknown.  It is better to kill current than another task needlessly.

kosaki: a historical interlude...

We applied the exactly same patch in 2005:

	: commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce
	: Author: Paul Jackson <pj@sgi.com>
	: Date:   Tue Sep 6 15:18:13 2005 -0700
	:
	: [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset
	:
	: Now the real motivation for this cpuset mem_exclusive patch series seems
	: trivial.
	:
	: This patch keeps a task in or under one mem_exclusive cpuset from provoking an
	: oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
	: interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
	: containment, there is little to gain from oom killing a task under a
	: non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
	: allocation must come from disjoint memory nodes.
	:
	: This patch enables configuring a system so that a runaway job under one
	: mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
	: that might be using very high compute and memory resources for a prolonged
	: time.

And we changed it to current logic in 2006

	: commit 7887a3da753e1ba8244556cc9a2b38c815bfe256
	: Author: Nick Piggin <npiggin@suse.de>
	: Date:   Mon Sep 25 23:31:29 2006 -0700
	:
	: [PATCH] oom: cpuset hint
	:
	: cpuset_excl_nodes_overlap does not always indicate that killing a task will
	: not free any memory we for us.  For example, we may be asking for an
	: allocation from _anywhere_ in the machine, or the task in question may be
	: pinning memory that is outside its cpuset.  Fix this by just causing
	: cpuset_excl_nodes_overlap to reduce the badness rather than disallow it.

And we haven't get the explanation why this patch doesn't reintroduced
an old issue. but I don't refuse a patch if it have multiple ack.

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [add to
care of oom_kill_allocating_task case and dump_tasks]
---
 mm/oom_kill.c |   16 +++++++---------
 1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 599f977..f45ac18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,7 +35,7 @@ int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
  */
 static int has_intersects_mems_allowed(struct task_struct *p)
 {
@@ -181,14 +181,6 @@ unsigned long oom_badness(struct task_struct *p, unsigned long uptime)
 		points /= 4;
 
 	/*
-	 * If p's nodes don't overlap ours, it may still help to kill p
-	 * because p may have allocated or otherwise mapped memory on
-	 * this node before. However it will be less likely.
-	 */
-	if (!has_intersects_mems_allowed(p))
-		points /= 8;
-
-	/*
 	 * Adjust the score by oom_adj.
 	 */
 	if (oom_adj) {
@@ -259,6 +251,10 @@ static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
 	if (p->signal->oom_adj == OOM_DISABLE)
 		return 1;
 
+	/* If p's nodes don't overlap ours, it may not help to kill p. */
+	if (!has_intersects_mems_allowed(p))
+		return 1;
+
 	return 0;
 }
 
@@ -336,6 +332,8 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
+		if (!has_intersects_mems_allowed(p))
+			continue;
 
 		task = find_lock_task_mm(p);
 		if (!task)
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/10] oom: filter tasks not sharing the same cpuset
@ 2010-06-08 12:02   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 12:02 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.  To use a sane heuristic, we must ensure that killing a
task would likely free memory for current and avoid needlessly killing
others at all costs just because their potential memory freeing is
unknown.  It is better to kill current than another task needlessly.

kosaki: a historical interlude...

We applied the exactly same patch in 2005:

	: commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce
	: Author: Paul Jackson <pj@sgi.com>
	: Date:   Tue Sep 6 15:18:13 2005 -0700
	:
	: [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset
	:
	: Now the real motivation for this cpuset mem_exclusive patch series seems
	: trivial.
	:
	: This patch keeps a task in or under one mem_exclusive cpuset from provoking an
	: oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
	: interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
	: containment, there is little to gain from oom killing a task under a
	: non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
	: allocation must come from disjoint memory nodes.
	:
	: This patch enables configuring a system so that a runaway job under one
	: mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
	: that might be using very high compute and memory resources for a prolonged
	: time.

And we changed it to current logic in 2006

	: commit 7887a3da753e1ba8244556cc9a2b38c815bfe256
	: Author: Nick Piggin <npiggin@suse.de>
	: Date:   Mon Sep 25 23:31:29 2006 -0700
	:
	: [PATCH] oom: cpuset hint
	:
	: cpuset_excl_nodes_overlap does not always indicate that killing a task will
	: not free any memory we for us.  For example, we may be asking for an
	: allocation from _anywhere_ in the machine, or the task in question may be
	: pinning memory that is outside its cpuset.  Fix this by just causing
	: cpuset_excl_nodes_overlap to reduce the badness rather than disallow it.

And we haven't get the explanation why this patch doesn't reintroduced
an old issue. but I don't refuse a patch if it have multiple ack.

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [add to
care of oom_kill_allocating_task case and dump_tasks]
---
 mm/oom_kill.c |   16 +++++++---------
 1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 599f977..f45ac18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,7 +35,7 @@ int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
  */
 static int has_intersects_mems_allowed(struct task_struct *p)
 {
@@ -181,14 +181,6 @@ unsigned long oom_badness(struct task_struct *p, unsigned long uptime)
 		points /= 4;
 
 	/*
-	 * If p's nodes don't overlap ours, it may still help to kill p
-	 * because p may have allocated or otherwise mapped memory on
-	 * this node before. However it will be less likely.
-	 */
-	if (!has_intersects_mems_allowed(p))
-		points /= 8;
-
-	/*
 	 * Adjust the score by oom_adj.
 	 */
 	if (oom_adj) {
@@ -259,6 +251,10 @@ static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
 	if (p->signal->oom_adj == OOM_DISABLE)
 		return 1;
 
+	/* If p's nodes don't overlap ours, it may not help to kill p. */
+	if (!has_intersects_mems_allowed(p))
+		return 1;
+
 	return 0;
 }
 
@@ -336,6 +332,8 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
+		if (!has_intersects_mems_allowed(p))
+			continue;
 
 		task = find_lock_task_mm(p);
 		if (!task)
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/10] oom: select task from tasklist for mempolicy ooms
  2010-06-08 11:53 ` KOSAKI Motohiro
@ 2010-06-08 12:04   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 12:04 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee
that current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging
task, or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mempolicy.h |   13 +++++-
 mm/mempolicy.c            |   44 ++++++++++++++++++++
 mm/oom_kill.c             |   98 +++++++++++++++++++++++++-------------------
 3 files changed, 112 insertions(+), 43 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 7b9ef6b..9c84270 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+					  const nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+	return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+						 const nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 296eef1..4cf5302 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 }
 #endif
 
+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy.  Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy.  For 'perferred' or 'local'
+ * policy, always return true since it may allocate elsewhere on fallback.
+ *
+ * Takes task_lock(tsk) to prevent freeing of its mempolicy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+				   const nodemask_t *mask)
+{
+	struct mempolicy *mempolicy;
+	bool ret = true;
+
+	if (!mask)
+		return ret;
+	task_lock(tsk);
+	mempolicy = tsk->mempolicy;
+	if (!mempolicy)
+		goto out;
+
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		/*
+		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
+		 * allocate from, they may fallback to other nodes when oom.
+		 * Thus, it's possible for tsk to have allocated memory from
+		 * nodes in mask.
+		 */
+		break;
+	case MPOL_BIND:
+	case MPOL_INTERLEAVE:
+		ret = nodes_intersects(mempolicy->v.nodes, *mask);
+		break;
+	default:
+		BUG();
+	}
+ out:
+	task_unlock(tsk);
+	return ret;
+}
+
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f45ac18..1dff3c3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -36,17 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
+ * @p: task struct of which task to consider
+ * @nodemask: nodemask passed to page allocator for mempolicy ooms
  */
-static int has_intersects_mems_allowed(struct task_struct *p)
+static bool has_intersects_mems_allowed(struct task_struct *p,
+				       const nodemask_t *nodemask)
 {
 	struct task_struct *t = p;
 
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
-			return 1;
+		if (nodemask) {
+			/*
+			 * If this is a mempolicy constrained oom, tsk's
+			 * cpuset is irrelevant.  Only return true if its
+			 * mempolicy intersects current, otherwise it may be
+			 * needlessly killed.
+			 */
+			if (mempolicy_nodemask_intersects(t, nodemask))
+				return true;
+		} else {
+			/*
+			 * This is not a mempolicy constrained oom, so only
+			 * check the mems of tsk's cpuset.
+			 */
+			if (cpuset_mems_allowed_intersects(current, t))
+				return true;
+		}
 	} while_each_thread(p, t);
 
-	return 0;
+	return false;
 }
 
 static struct task_struct *find_lock_task_mm(struct task_struct *p)
@@ -239,7 +258,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
-static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
+static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem,
+			  nodemask_t *nodemask)
 {
 	/* skip the init task and kthreads */
 	if (is_global_init(p) || (p->flags & PF_KTHREAD))
@@ -252,7 +272,7 @@ static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
 		return 1;
 
 	/* If p's nodes don't overlap ours, it may not help to kill p. */
-	if (!has_intersects_mems_allowed(p))
+	if (!has_intersects_mems_allowed(p, nodemask))
 		return 1;
 
 	return 0;
@@ -265,7 +285,8 @@ static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+					      struct mem_cgroup *mem,
+					      nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -276,7 +297,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		if (oom_unkillable(p, mem))
+		if (oom_unkillable(p, mem, nodemask))
 			continue;
 
 		/*
@@ -314,7 +335,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  *
  * Call with tasklist_lock read-locked.
  */
-static void dump_tasks(const struct mem_cgroup *mem)
+static void dump_tasks(const struct mem_cgroup *mem, nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *task;
@@ -332,7 +353,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-		if (!has_intersects_mems_allowed(p))
+		if (!has_intersects_mems_allowed(p, nodemask))
 			continue;
 
 		task = find_lock_task_mm(p);
@@ -353,7 +374,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
-							struct mem_cgroup *mem)
+			struct mem_cgroup *mem, nodemask_t *nodemask)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
 		"oom_adj=%d\n",
@@ -365,7 +386,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 	mem_cgroup_print_oom_info(mem, p);
 	show_mem();
 	if (sysctl_oom_dump_tasks)
-		dump_tasks(mem);
+		dump_tasks(mem, nodemask);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -404,7 +425,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    nodemask_t *nodemask, const char *message)
 {
 	struct task_struct *c;
 	struct task_struct *t = p;
@@ -413,7 +434,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	struct timespec uptime;
 
 	if (printk_ratelimit())
-		dump_header(p, gfp_mask, order, mem);
+		dump_header(p, gfp_mask, order, mem, nodemask);
 
 	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
 	       message, task_pid_nr(p), p->comm, points);
@@ -426,7 +447,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (c->mm == p->mm)
 				continue;
-			if (oom_unkillable(c, mem))
+			if (oom_unkillable(c, mem, nodemask))
 				continue;
 
 			/* oom_badness() returns 0 if the thread is unkillable */
@@ -451,11 +472,11 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 		panic("out of memory(memcg). panic_on_oom is selected.\n");
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -567,33 +588,33 @@ static void clear_system_oom(void)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	unsigned long points;
 
 	if (sysctl_oom_kill_allocating_task)
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+		if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
+				      "Out of memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, nodemask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
 
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		dump_header(NULL, gfp_mask, order, NULL);
+		dump_header(NULL, gfp_mask, order, NULL, nodemask);
 		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, NULL, nodemask,
 			     "Out of memory"))
 		goto retry;
 }
@@ -603,6 +624,7 @@ retry:
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
@@ -622,7 +644,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 
 	if (sysctl_panic_on_oom == 2) {
 		read_lock(&tasklist_lock);
-		dump_header(NULL, gfp_mask, order, NULL);
+		dump_header(NULL, gfp_mask, order, NULL, nodemask);
 		read_unlock(&tasklist_lock);
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
 	}
@@ -633,26 +655,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	if (zonelist)
 		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
-	read_lock(&tasklist_lock);
-
-	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
-	case CONSTRAINT_NONE:
-		if (sysctl_panic_on_oom) {
-			dump_header(NULL, gfp_mask, order, NULL);
-			read_unlock(&tasklist_lock);
-			panic("out of memory. panic_on_oom is selected\n");
-		}
-		/* Fall-through */
-	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
-		break;
+	if (constraint != CONSTRAINT_MEMORY_POLICY)
+		nodemask = NULL;
+	if ((constraint == CONSTRAINT_NONE) && sysctl_panic_on_oom) {
+		read_lock(&tasklist_lock);
+		dump_header(NULL, gfp_mask, order, NULL, nodemask);
+		read_unlock(&tasklist_lock);
+		panic("out of memory. panic_on_oom is selected\n");
 	}
 
+	read_lock(&tasklist_lock);
+	__out_of_memory(gfp_mask, order, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -672,6 +685,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 void pagefault_out_of_memory(void)
 {
 	if (try_set_system_oom()) {
+		/* unknown gfp_mask and order */
 		out_of_memory(NULL, 0, 0, NULL);
 		clear_system_oom();
 	}
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/10] oom: select task from tasklist for mempolicy ooms
@ 2010-06-08 12:04   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 12:04 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim
  Cc: kosaki.motohiro

From: David Rientjes <rientjes@google.com>

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee
that current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging
task, or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mempolicy.h |   13 +++++-
 mm/mempolicy.c            |   44 ++++++++++++++++++++
 mm/oom_kill.c             |   98 +++++++++++++++++++++++++-------------------
 3 files changed, 112 insertions(+), 43 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 7b9ef6b..9c84270 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+					  const nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+	return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+						 const nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 296eef1..4cf5302 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 }
 #endif
 
+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy.  Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy.  For 'perferred' or 'local'
+ * policy, always return true since it may allocate elsewhere on fallback.
+ *
+ * Takes task_lock(tsk) to prevent freeing of its mempolicy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+				   const nodemask_t *mask)
+{
+	struct mempolicy *mempolicy;
+	bool ret = true;
+
+	if (!mask)
+		return ret;
+	task_lock(tsk);
+	mempolicy = tsk->mempolicy;
+	if (!mempolicy)
+		goto out;
+
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		/*
+		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
+		 * allocate from, they may fallback to other nodes when oom.
+		 * Thus, it's possible for tsk to have allocated memory from
+		 * nodes in mask.
+		 */
+		break;
+	case MPOL_BIND:
+	case MPOL_INTERLEAVE:
+		ret = nodes_intersects(mempolicy->v.nodes, *mask);
+		break;
+	default:
+		BUG();
+	}
+ out:
+	task_unlock(tsk);
+	return ret;
+}
+
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f45ac18..1dff3c3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -36,17 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
+ * @p: task struct of which task to consider
+ * @nodemask: nodemask passed to page allocator for mempolicy ooms
  */
-static int has_intersects_mems_allowed(struct task_struct *p)
+static bool has_intersects_mems_allowed(struct task_struct *p,
+				       const nodemask_t *nodemask)
 {
 	struct task_struct *t = p;
 
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
-			return 1;
+		if (nodemask) {
+			/*
+			 * If this is a mempolicy constrained oom, tsk's
+			 * cpuset is irrelevant.  Only return true if its
+			 * mempolicy intersects current, otherwise it may be
+			 * needlessly killed.
+			 */
+			if (mempolicy_nodemask_intersects(t, nodemask))
+				return true;
+		} else {
+			/*
+			 * This is not a mempolicy constrained oom, so only
+			 * check the mems of tsk's cpuset.
+			 */
+			if (cpuset_mems_allowed_intersects(current, t))
+				return true;
+		}
 	} while_each_thread(p, t);
 
-	return 0;
+	return false;
 }
 
 static struct task_struct *find_lock_task_mm(struct task_struct *p)
@@ -239,7 +258,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
-static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
+static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem,
+			  nodemask_t *nodemask)
 {
 	/* skip the init task and kthreads */
 	if (is_global_init(p) || (p->flags & PF_KTHREAD))
@@ -252,7 +272,7 @@ static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
 		return 1;
 
 	/* If p's nodes don't overlap ours, it may not help to kill p. */
-	if (!has_intersects_mems_allowed(p))
+	if (!has_intersects_mems_allowed(p, nodemask))
 		return 1;
 
 	return 0;
@@ -265,7 +285,8 @@ static int oom_unkillable(struct task_struct *p, struct mem_cgroup *mem)
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+					      struct mem_cgroup *mem,
+					      nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -276,7 +297,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		if (oom_unkillable(p, mem))
+		if (oom_unkillable(p, mem, nodemask))
 			continue;
 
 		/*
@@ -314,7 +335,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  *
  * Call with tasklist_lock read-locked.
  */
-static void dump_tasks(const struct mem_cgroup *mem)
+static void dump_tasks(const struct mem_cgroup *mem, nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *task;
@@ -332,7 +353,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-		if (!has_intersects_mems_allowed(p))
+		if (!has_intersects_mems_allowed(p, nodemask))
 			continue;
 
 		task = find_lock_task_mm(p);
@@ -353,7 +374,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
-							struct mem_cgroup *mem)
+			struct mem_cgroup *mem, nodemask_t *nodemask)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
 		"oom_adj=%d\n",
@@ -365,7 +386,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 	mem_cgroup_print_oom_info(mem, p);
 	show_mem();
 	if (sysctl_oom_dump_tasks)
-		dump_tasks(mem);
+		dump_tasks(mem, nodemask);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -404,7 +425,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    nodemask_t *nodemask, const char *message)
 {
 	struct task_struct *c;
 	struct task_struct *t = p;
@@ -413,7 +434,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	struct timespec uptime;
 
 	if (printk_ratelimit())
-		dump_header(p, gfp_mask, order, mem);
+		dump_header(p, gfp_mask, order, mem, nodemask);
 
 	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
 	       message, task_pid_nr(p), p->comm, points);
@@ -426,7 +447,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 			if (c->mm == p->mm)
 				continue;
-			if (oom_unkillable(c, mem))
+			if (oom_unkillable(c, mem, nodemask))
 				continue;
 
 			/* oom_badness() returns 0 if the thread is unkillable */
@@ -451,11 +472,11 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 		panic("out of memory(memcg). panic_on_oom is selected.\n");
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -567,33 +588,33 @@ static void clear_system_oom(void)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	unsigned long points;
 
 	if (sysctl_oom_kill_allocating_task)
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+		if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
+				      "Out of memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, nodemask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
 
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		dump_header(NULL, gfp_mask, order, NULL);
+		dump_header(NULL, gfp_mask, order, NULL, nodemask);
 		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, NULL, nodemask,
 			     "Out of memory"))
 		goto retry;
 }
@@ -603,6 +624,7 @@ retry:
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
@@ -622,7 +644,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 
 	if (sysctl_panic_on_oom == 2) {
 		read_lock(&tasklist_lock);
-		dump_header(NULL, gfp_mask, order, NULL);
+		dump_header(NULL, gfp_mask, order, NULL, nodemask);
 		read_unlock(&tasklist_lock);
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
 	}
@@ -633,26 +655,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	if (zonelist)
 		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
-	read_lock(&tasklist_lock);
-
-	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
-	case CONSTRAINT_NONE:
-		if (sysctl_panic_on_oom) {
-			dump_header(NULL, gfp_mask, order, NULL);
-			read_unlock(&tasklist_lock);
-			panic("out of memory. panic_on_oom is selected\n");
-		}
-		/* Fall-through */
-	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
-		break;
+	if (constraint != CONSTRAINT_MEMORY_POLICY)
+		nodemask = NULL;
+	if ((constraint == CONSTRAINT_NONE) && sysctl_panic_on_oom) {
+		read_lock(&tasklist_lock);
+		dump_header(NULL, gfp_mask, order, NULL, nodemask);
+		read_unlock(&tasklist_lock);
+		panic("out of memory. panic_on_oom is selected\n");
 	}
 
+	read_lock(&tasklist_lock);
+	__out_of_memory(gfp_mask, order, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -672,6 +685,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 void pagefault_out_of_memory(void)
 {
 	if (try_set_system_oom()) {
+		/* unknown gfp_mask and order */
 		out_of_memory(NULL, 0, 0, NULL);
 		clear_system_oom();
 	}
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/10] oom: use send_sig() instead force_sig()
  2010-06-08 12:01   ` KOSAKI Motohiro
@ 2010-06-08 18:41     ` Oleg Nesterov
  -1 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-08 18:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On 06/08, KOSAKI Motohiro wrote:
>
> Oleg pointed out oom_kill.c has force_sig() abuse. force_sig() mean
> ignore signal mask. but SIGKILL itself is not maskable.

Yes. And we have other reasons to avoid force_sig(). It should be used
only for synchronous signals.

But,

> @@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
>  	p->rt.time_slice = HZ;
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
>
> -	force_sig(SIGKILL, p);
> +	send_sig(SIGKILL, p, 1);

This is not right, we need send_sig(SIGKILL, p, 0). Better yet,
send_sig_info(SIGKILL, SEND_SIG_NOINFO). I think send_sig() should
die.

The reason is that si_fromuser() must be true, otherwise we can't kill
the SIGNAL_UNKILLABLE (sub-namespace inits) tasks.

Oh. This reminds me, we really need the trivial (but annoying) cleanups
here. The usage of SEND_SIG_ constants is messy, and they should be
renamed at least.

And in fact, we need the new one which acts like SEND_SIG_FORCED but
si_fromuser(). We do not want to allocate the memory when the caller
is oom_kill or zap_pid_ns_processes().

OK. I'll send the simple patch which adds the new helper with the
comment. send_sigkill() or kernel_kill_task(), or do you see a
better name?

Oleg.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/10] oom: use send_sig() instead force_sig()
@ 2010-06-08 18:41     ` Oleg Nesterov
  0 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-08 18:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On 06/08, KOSAKI Motohiro wrote:
>
> Oleg pointed out oom_kill.c has force_sig() abuse. force_sig() mean
> ignore signal mask. but SIGKILL itself is not maskable.

Yes. And we have other reasons to avoid force_sig(). It should be used
only for synchronous signals.

But,

> @@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
>  	p->rt.time_slice = HZ;
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
>
> -	force_sig(SIGKILL, p);
> +	send_sig(SIGKILL, p, 1);

This is not right, we need send_sig(SIGKILL, p, 0). Better yet,
send_sig_info(SIGKILL, SEND_SIG_NOINFO). I think send_sig() should
die.

The reason is that si_fromuser() must be true, otherwise we can't kill
the SIGNAL_UNKILLABLE (sub-namespace inits) tasks.

Oh. This reminds me, we really need the trivial (but annoying) cleanups
here. The usage of SEND_SIG_ constants is messy, and they should be
renamed at least.

And in fact, we need the new one which acts like SEND_SIG_FORCED but
si_fromuser(). We do not want to allocate the memory when the caller
is oom_kill or zap_pid_ns_processes().

OK. I'll send the simple patch which adds the new helper with the
comment. send_sigkill() or kernel_kill_task(), or do you see a
better name?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 07/10] oom: kill useless debug print
  2010-06-08 11:59   ` KOSAKI Motohiro
@ 2010-06-08 19:01     ` David Rientjes
  -1 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now, all of oom developers usually are using sysctl_oom_dump_tasks.
> Redundunt useless debug print can be removed.
> 

This is already removed with my heuristic rewrite as you well know, why 
are you constantly trying to get in the way of that work?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 07/10] oom: kill useless debug print
@ 2010-06-08 19:01     ` David Rientjes
  0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now, all of oom developers usually are using sysctl_oom_dump_tasks.
> Redundunt useless debug print can be removed.
> 

This is already removed with my heuristic rewrite as you well know, why 
are you constantly trying to get in the way of that work?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/10] oom: filter tasks not sharing the same cpuset
  2010-06-08 12:02   ` KOSAKI Motohiro
@ 2010-06-08 19:05     ` David Rientjes
  -1 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> From: David Rientjes <rientjes@google.com>
> 
> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 

You're not a maintainer, as I obviously have to point out to you often 
enough.  I've repeatedly asked you to work with me in reviewing my oom 
killer rewrite on linux-mm, yet you seldom offer any valuable feedback 
other than a simple "nack".  I don't consider any of your patchset here to 
be more applicable than my patchset, which has been developed over the 
course of several months, and your lack of participation in the process is 
really quite shocking to me.

In case nobody has told you before: Andrew maintains this code and these 
patches will be going through the -mm tree.  You are not a maintainer of 
it (or any other kernel code), so please act within your role of kernel 
hacker and review patches as people propose them by offering your 
constructive feedback.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/10] oom: filter tasks not sharing the same cpuset
@ 2010-06-08 19:05     ` David Rientjes
  0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> From: David Rientjes <rientjes@google.com>
> 
> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 

You're not a maintainer, as I obviously have to point out to you often 
enough.  I've repeatedly asked you to work with me in reviewing my oom 
killer rewrite on linux-mm, yet you seldom offer any valuable feedback 
other than a simple "nack".  I don't consider any of your patchset here to 
be more applicable than my patchset, which has been developed over the 
course of several months, and your lack of participation in the process is 
really quite shocking to me.

In case nobody has told you before: Andrew maintains this code and these 
patches will be going through the -mm tree.  You are not a maintainer of 
it (or any other kernel code), so please act within your role of kernel 
hacker and review patches as people propose them by offering your 
constructive feedback.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 06/10] oom: cleanup has_intersects_mems_allowed()
  2010-06-08 11:59   ` KOSAKI Motohiro
@ 2010-06-08 19:07     ` David Rientjes
  -1 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now has_intersects_mems_allowed() has own thread iterate logic, but
> it should use while_each_thread().
> 
> It slightly improve the code readability.
> 

These cleanups should be done on top of my oom killer rewrite instead, 
please work with others in their work instead of getting in the way of it 
time and time again.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 06/10] oom: cleanup has_intersects_mems_allowed()
@ 2010-06-08 19:07     ` David Rientjes
  0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now has_intersects_mems_allowed() has own thread iterate logic, but
> it should use while_each_thread().
> 
> It slightly improve the code readability.
> 

These cleanups should be done on top of my oom killer rewrite instead, 
please work with others in their work instead of getting in the way of it 
time and time again.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/10] oom: rename badness() to oom_badness()
  2010-06-08 11:56   ` KOSAKI Motohiro
@ 2010-06-08 19:09     ` David Rientjes
  -1 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> badness() is wrong name because it's too generic name.
> 
> rename it.
> 

This is already renamed in my heuristic rewrite which we all agree, in 
principle, is needed.  I'm really confused that you're trying to take 
little bits and pieces of my work like this, using my name for the 
function in this case, for example, and not working with other developers 
in reviewing their work.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/10] oom: rename badness() to oom_badness()
@ 2010-06-08 19:09     ` David Rientjes
  0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> badness() is wrong name because it's too generic name.
> 
> rename it.
> 

This is already renamed in my heuristic rewrite which we all agree, in 
principle, is needed.  I'm really confused that you're trying to take 
little bits and pieces of my work like this, using my name for the 
function in this case, for example, and not working with other developers 
in reviewing their work.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 02/10] oom: remove verbose argument from __oom_kill_process()
  2010-06-08 11:55   ` KOSAKI Motohiro
@ 2010-06-08 19:09     ` David Rientjes
  -1 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now, verbose argument is unused. This patch remove it.
> 

This is already done in my oom killer rewrite, please work with other 
developers in developing their work instead of acting as a maintainer, 
which you aren't.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 02/10] oom: remove verbose argument from __oom_kill_process()
@ 2010-06-08 19:09     ` David Rientjes
  0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now, verbose argument is unused. This patch remove it.
> 

This is already done in my oom killer rewrite, please work with other 
developers in developing their work instead of acting as a maintainer, 
which you aren't.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/10] oom: don't try to kill oom_unkillable child
  2010-06-08 11:54   ` KOSAKI Motohiro
@ 2010-06-08 19:10     ` David Rientjes
  -1 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now, badness() doesn't care neigher CPUSET nor mempolicy. Then
> if the victim child process is oom_unkillable()==1, __out_of_memory()
> can makes kernel hang eventually.
> 
> This patch fixes it.
> 
> 
> [remark: this is needed to fold "oom: sacrifice child with highest
> badness score for parent"]
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/oom_kill.c |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d49d542..0d7397b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -387,9 +387,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
>  			      int verbose)
>  {
> -	if (oom_unkillable(p, mem))
> -		return 1;
> -
>  	p = find_lock_task_mm(p);
>  	if (!p)
>  		return 1;
> @@ -440,6 +437,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  
>  			if (c->mm == p->mm)
>  				continue;
> +			if (oom_unkillable(c, mem, nodemask))
> +				continue;
>  
>  			/* badness() returns 0 if the thread is unkillable */
>  			cpoints = badness(c, uptime.tv_sec);

This doesn't apply to anything.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/10] oom: don't try to kill oom_unkillable child
@ 2010-06-08 19:10     ` David Rientjes
  0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2010-06-08 19:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> Now, badness() doesn't care neigher CPUSET nor mempolicy. Then
> if the victim child process is oom_unkillable()==1, __out_of_memory()
> can makes kernel hang eventually.
> 
> This patch fixes it.
> 
> 
> [remark: this is needed to fold "oom: sacrifice child with highest
> badness score for parent"]
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/oom_kill.c |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d49d542..0d7397b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -387,9 +387,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
>  			      int verbose)
>  {
> -	if (oom_unkillable(p, mem))
> -		return 1;
> -
>  	p = find_lock_task_mm(p);
>  	if (!p)
>  		return 1;
> @@ -440,6 +437,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  
>  			if (c->mm == p->mm)
>  				continue;
> +			if (oom_unkillable(c, mem, nodemask))
> +				continue;
>  
>  			/* badness() returns 0 if the thread is unkillable */
>  			cpoints = badness(c, uptime.tv_sec);

This doesn't apply to anything.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 0/1] signals: introduce send_sigkill() helper
  2010-06-08 18:41     ` Oleg Nesterov
@ 2010-06-10  0:59       ` Oleg Nesterov
  -1 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-10  0:59 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro, Roland McGrath
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On 06/08, Oleg Nesterov wrote:
>
> > @@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > -	force_sig(SIGKILL, p);
> > +	send_sig(SIGKILL, p, 1);
>
> This is not right, we need send_sig(SIGKILL, p, 0). Better yet,
> send_sig_info(SIGKILL, SEND_SIG_NOINFO). I think send_sig() should
> die.
>
> The reason is that si_fromuser() must be true, otherwise we can't kill
> the SIGNAL_UNKILLABLE (sub-namespace inits) tasks.
>
> Oh. This reminds me, we really need the trivial (but annoying) cleanups
> here. The usage of SEND_SIG_ constants is messy, and they should be
> renamed at least.
>
> And in fact, we need the new one which acts like SEND_SIG_FORCED but
> si_fromuser(). We do not want to allocate the memory when the caller
> is oom_kill or zap_pid_ns_processes().

I tried to make some simple cleanups right now, but this really needs
time and discussion.

So. If we are going to remove force_sig() in mm/oom_kill.c (and I think
we should), I'd like to add the trivial helper first.

Oleg.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 0/1] signals: introduce send_sigkill() helper
@ 2010-06-10  0:59       ` Oleg Nesterov
  0 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-10  0:59 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro, Roland McGrath
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

On 06/08, Oleg Nesterov wrote:
>
> > @@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > -	force_sig(SIGKILL, p);
> > +	send_sig(SIGKILL, p, 1);
>
> This is not right, we need send_sig(SIGKILL, p, 0). Better yet,
> send_sig_info(SIGKILL, SEND_SIG_NOINFO). I think send_sig() should
> die.
>
> The reason is that si_fromuser() must be true, otherwise we can't kill
> the SIGNAL_UNKILLABLE (sub-namespace inits) tasks.
>
> Oh. This reminds me, we really need the trivial (but annoying) cleanups
> here. The usage of SEND_SIG_ constants is messy, and they should be
> renamed at least.
>
> And in fact, we need the new one which acts like SEND_SIG_FORCED but
> si_fromuser(). We do not want to allocate the memory when the caller
> is oom_kill or zap_pid_ns_processes().

I tried to make some simple cleanups right now, but this really needs
time and discussion.

So. If we are going to remove force_sig() in mm/oom_kill.c (and I think
we should), I'd like to add the trivial helper first.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/1] signals: introduce send_sigkill() helper
  2010-06-10  0:59       ` Oleg Nesterov
@ 2010-06-10  1:00         ` Oleg Nesterov
  -1 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-10  1:00 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro, Roland McGrath
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

Cleanup, no functional changes.

There are a lot of buggy SIGKILL users in kernel. For example, almost
every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
that the task has the valid ->sighand, and in general it should be used
only for synchronous signals. send_sig(SIGKILL, p, 1) or
send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
immediately obvious.

The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
but we do not want to use this directly, because we can optimize this
case later. For example, zap_pid_ns_processes() allocates sigqueue for
each process in namespace, this is unneeded.

Introduce the trivial send_sigkill() helper on top of send_sig_info()
and change zap_pid_ns_processes() as an example.

Note: we need more cleanups here, this is only the first change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h  |    9 +++++++++
 kernel/pid_namespace.c |    8 +-------
 2 files changed, 10 insertions(+), 7 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2053,6 +2053,15 @@ static inline int kill_cad_pid(int sig, 
 #define SEND_SIG_PRIV	((struct siginfo *) 1)
 #define SEND_SIG_FORCED	((struct siginfo *) 2)
 
+static inline int send_sigkill(struct task_struct * p)
+{
+	/*
+	 * Kills any user-space task, even SIGNAL_UNKILLABLE.
+	 * We use SEND_SIG_NOINFO to make si_fromuser() true.
+	 */
+	return send_sig_info(SIGKILL, SEND_SIG_NOINFO, p);
+}
+
 /*
  * True if we are on the alternate signal stack.
  */
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -160,15 +160,9 @@ void zap_pid_ns_processes(struct pid_nam
 	nr = next_pidmap(pid_ns, 1);
 	while (nr > 0) {
 		rcu_read_lock();
-
-		/*
-		 * Any nested-container's init processes won't ignore the
-		 * SEND_SIG_NOINFO signal, see send_signal()->si_fromuser().
-		 */
 		task = pid_task(find_vpid(nr), PIDTYPE_PID);
 		if (task)
-			send_sig_info(SIGKILL, SEND_SIG_NOINFO, task);
-
+			send_sigkill(task);
 		rcu_read_unlock();
 
 		nr = next_pidmap(pid_ns, nr);


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/1] signals: introduce send_sigkill() helper
@ 2010-06-10  1:00         ` Oleg Nesterov
  0 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-10  1:00 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro, Roland McGrath
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

Cleanup, no functional changes.

There are a lot of buggy SIGKILL users in kernel. For example, almost
every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
that the task has the valid ->sighand, and in general it should be used
only for synchronous signals. send_sig(SIGKILL, p, 1) or
send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
immediately obvious.

The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
but we do not want to use this directly, because we can optimize this
case later. For example, zap_pid_ns_processes() allocates sigqueue for
each process in namespace, this is unneeded.

Introduce the trivial send_sigkill() helper on top of send_sig_info()
and change zap_pid_ns_processes() as an example.

Note: we need more cleanups here, this is only the first change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h  |    9 +++++++++
 kernel/pid_namespace.c |    8 +-------
 2 files changed, 10 insertions(+), 7 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2053,6 +2053,15 @@ static inline int kill_cad_pid(int sig, 
 #define SEND_SIG_PRIV	((struct siginfo *) 1)
 #define SEND_SIG_FORCED	((struct siginfo *) 2)
 
+static inline int send_sigkill(struct task_struct * p)
+{
+	/*
+	 * Kills any user-space task, even SIGNAL_UNKILLABLE.
+	 * We use SEND_SIG_NOINFO to make si_fromuser() true.
+	 */
+	return send_sig_info(SIGKILL, SEND_SIG_NOINFO, p);
+}
+
 /*
  * True if we are on the alternate signal stack.
  */
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -160,15 +160,9 @@ void zap_pid_ns_processes(struct pid_nam
 	nr = next_pidmap(pid_ns, 1);
 	while (nr > 0) {
 		rcu_read_lock();
-
-		/*
-		 * Any nested-container's init processes won't ignore the
-		 * SEND_SIG_NOINFO signal, see send_signal()->si_fromuser().
-		 */
 		task = pid_task(find_vpid(nr), PIDTYPE_PID);
 		if (task)
-			send_sig_info(SIGKILL, SEND_SIG_NOINFO, task);
-
+			send_sigkill(task);
 		rcu_read_unlock();
 
 		nr = next_pidmap(pid_ns, nr);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
  2010-06-10  1:00         ` Oleg Nesterov
@ 2010-06-11  0:40           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-11  0:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, KOSAKI Motohiro, Roland McGrath,
	Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	Nick Piggin, Minchan Kim

On Thu, 10 Jun 2010 03:00:23 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> Cleanup, no functional changes.
> 
> There are a lot of buggy SIGKILL users in kernel. For example, almost
> every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> that the task has the valid ->sighand, and in general it should be used
> only for synchronous signals. send_sig(SIGKILL, p, 1) or
> send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> immediately obvious.
> 
> The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
> but we do not want to use this directly, because we can optimize this
> case later. For example, zap_pid_ns_processes() allocates sigqueue for
> each process in namespace, this is unneeded.
> 
> Introduce the trivial send_sigkill() helper on top of send_sig_info()
> and change zap_pid_ns_processes() as an example.
> 
> Note: we need more cleanups here, this is only the first change.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
@ 2010-06-11  0:40           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-11  0:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, KOSAKI Motohiro, Roland McGrath,
	Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	Nick Piggin, Minchan Kim

On Thu, 10 Jun 2010 03:00:23 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> Cleanup, no functional changes.
> 
> There are a lot of buggy SIGKILL users in kernel. For example, almost
> every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> that the task has the valid ->sighand, and in general it should be used
> only for synchronous signals. send_sig(SIGKILL, p, 1) or
> send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> immediately obvious.
> 
> The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
> but we do not want to use this directly, because we can optimize this
> case later. For example, zap_pid_ns_processes() allocates sigqueue for
> each process in namespace, this is unneeded.
> 
> Introduce the trivial send_sigkill() helper on top of send_sig_info()
> and change zap_pid_ns_processes() as an example.
> 
> Note: we need more cleanups here, this is only the first change.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/10] oom: use send_sig() instead force_sig()
  2010-06-08 18:41     ` Oleg Nesterov
@ 2010-06-13 11:24       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Luis Claudio R. Goncalves, LKML, linux-mm,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim

> On 06/08, KOSAKI Motohiro wrote:
> >
> > Oleg pointed out oom_kill.c has force_sig() abuse. force_sig() mean
> > ignore signal mask. but SIGKILL itself is not maskable.
> 
> Yes. And we have other reasons to avoid force_sig(). It should be used
> only for synchronous signals.
> 
> But,
> 
> > @@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > -	force_sig(SIGKILL, p);
> > +	send_sig(SIGKILL, p, 1);
> 
> This is not right, we need send_sig(SIGKILL, p, 0). Better yet,
> send_sig_info(SIGKILL, SEND_SIG_NOINFO). I think send_sig() should
> die.
> 
> The reason is that si_fromuser() must be true, otherwise we can't kill
> the SIGNAL_UNKILLABLE (sub-namespace inits) tasks.

Thanks. I am not signal expert. 
To be honest, current special siginfo arguments have a bit unclear meanings
to me ;)
current definition (following) doesn't teach anything.

sched.h
=====================
/* These can be the second arg to send_sig_info/send_group_sig_info.  */
#define SEND_SIG_NOINFO ((struct siginfo *) 0)
#define SEND_SIG_PRIV   ((struct siginfo *) 1)
#define SEND_SIG_FORCED ((struct siginfo *) 2)


If anyone write exact meanings, I'm really really glad.



> Oh. This reminds me, we really need the trivial (but annoying) cleanups
> here. The usage of SEND_SIG_ constants is messy, and they should be
> renamed at least.
> 
> And in fact, we need the new one which acts like SEND_SIG_FORCED but
> si_fromuser(). We do not want to allocate the memory when the caller
> is oom_kill or zap_pid_ns_processes().
> 
> OK. I'll send the simple patch which adds the new helper with the
> comment. send_sigkill() or kernel_kill_task(), or do you see a
> better name?

Very thanks. both name are pretty good to me.




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/10] oom: use send_sig() instead force_sig()
@ 2010-06-13 11:24       ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Luis Claudio R. Goncalves, LKML, linux-mm,
	David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin,
	Minchan Kim

> On 06/08, KOSAKI Motohiro wrote:
> >
> > Oleg pointed out oom_kill.c has force_sig() abuse. force_sig() mean
> > ignore signal mask. but SIGKILL itself is not maskable.
> 
> Yes. And we have other reasons to avoid force_sig(). It should be used
> only for synchronous signals.
> 
> But,
> 
> > @@ -399,7 +399,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem)
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > -	force_sig(SIGKILL, p);
> > +	send_sig(SIGKILL, p, 1);
> 
> This is not right, we need send_sig(SIGKILL, p, 0). Better yet,
> send_sig_info(SIGKILL, SEND_SIG_NOINFO). I think send_sig() should
> die.
> 
> The reason is that si_fromuser() must be true, otherwise we can't kill
> the SIGNAL_UNKILLABLE (sub-namespace inits) tasks.

Thanks. I am not signal expert. 
To be honest, current special siginfo arguments have a bit unclear meanings
to me ;)
current definition (following) doesn't teach anything.

sched.h
=====================
/* These can be the second arg to send_sig_info/send_group_sig_info.  */
#define SEND_SIG_NOINFO ((struct siginfo *) 0)
#define SEND_SIG_PRIV   ((struct siginfo *) 1)
#define SEND_SIG_FORCED ((struct siginfo *) 2)


If anyone write exact meanings, I'm really really glad.



> Oh. This reminds me, we really need the trivial (but annoying) cleanups
> here. The usage of SEND_SIG_ constants is messy, and they should be
> renamed at least.
> 
> And in fact, we need the new one which acts like SEND_SIG_FORCED but
> si_fromuser(). We do not want to allocate the memory when the caller
> is oom_kill or zap_pid_ns_processes().
> 
> OK. I'll send the simple patch which adds the new helper with the
> comment. send_sigkill() or kernel_kill_task(), or do you see a
> better name?

Very thanks. both name are pretty good to me.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
  2010-06-10  1:00         ` Oleg Nesterov
@ 2010-06-13 11:24           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Andrew Morton, Roland McGrath,
	Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

> Cleanup, no functional changes.
> 
> There are a lot of buggy SIGKILL users in kernel. For example, almost
> every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> that the task has the valid ->sighand, and in general it should be used
> only for synchronous signals. send_sig(SIGKILL, p, 1) or
> send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> immediately obvious.
> 
> The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
> but we do not want to use this directly, because we can optimize this
> case later. For example, zap_pid_ns_processes() allocates sigqueue for
> each process in namespace, this is unneeded.
> 
> Introduce the trivial send_sigkill() helper on top of send_sig_info()
> and change zap_pid_ns_processes() as an example.
> 
> Note: we need more cleanups here, this is only the first change.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Great.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
@ 2010-06-13 11:24           ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Andrew Morton, Roland McGrath,
	Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

> Cleanup, no functional changes.
> 
> There are a lot of buggy SIGKILL users in kernel. For example, almost
> every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> that the task has the valid ->sighand, and in general it should be used
> only for synchronous signals. send_sig(SIGKILL, p, 1) or
> send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> immediately obvious.
> 
> The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
> but we do not want to use this directly, because we can optimize this
> case later. For example, zap_pid_ns_processes() allocates sigqueue for
> each process in namespace, this is unneeded.
> 
> Introduce the trivial send_sigkill() helper on top of send_sig_info()
> and change zap_pid_ns_processes() as an example.
> 
> Note: we need more cleanups here, this is only the first change.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Great.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
  2010-06-10  1:00         ` Oleg Nesterov
@ 2010-06-13 15:29           ` Oleg Nesterov
  -1 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-13 15:29 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro, Roland McGrath
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

Andrew, please drop

	signals-introduce-send_sigkill-helper.patch

I am stupid.

On 06/10, Oleg Nesterov wrote:
>
> Cleanup, no functional changes.
>
> There are a lot of buggy SIGKILL users in kernel. For example, almost
> every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> that the task has the valid ->sighand, and in general it should be used
> only for synchronous signals. send_sig(SIGKILL, p, 1) or
> send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> immediately obvious.
>
> The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)

No, SEND_SIG_NOINFO doesn't work too. Oh, can't understand what I was
thinking about. current is the random task, but send_signal() checks
if the caller is from-parent-ns.

> Note: we need more cleanups here, this is only the first change.

We need the cleanups first. Until then oom-killer has to use force_sig()
if we want to kill the SIGNAL_UNKILLABLE tasks too.

Oleg.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
@ 2010-06-13 15:29           ` Oleg Nesterov
  0 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2010-06-13 15:29 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro, Roland McGrath
  Cc: Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

Andrew, please drop

	signals-introduce-send_sigkill-helper.patch

I am stupid.

On 06/10, Oleg Nesterov wrote:
>
> Cleanup, no functional changes.
>
> There are a lot of buggy SIGKILL users in kernel. For example, almost
> every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> that the task has the valid ->sighand, and in general it should be used
> only for synchronous signals. send_sig(SIGKILL, p, 1) or
> send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> immediately obvious.
>
> The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)

No, SEND_SIG_NOINFO doesn't work too. Oh, can't understand what I was
thinking about. current is the random task, but send_signal() checks
if the caller is from-parent-ns.

> Note: we need more cleanups here, this is only the first change.

We need the cleanups first. Until then oom-killer has to use force_sig()
if we want to kill the SIGNAL_UNKILLABLE tasks too.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
  2010-06-13 15:29           ` Oleg Nesterov
@ 2010-06-16 10:00             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-16 10:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Andrew Morton, Roland McGrath,
	Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

> Andrew, please drop
> 
> 	signals-introduce-send_sigkill-helper.patch
> 
> I am stupid.
> 
> On 06/10, Oleg Nesterov wrote:
> >
> > Cleanup, no functional changes.
> >
> > There are a lot of buggy SIGKILL users in kernel. For example, almost
> > every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> > that the task has the valid ->sighand, and in general it should be used
> > only for synchronous signals. send_sig(SIGKILL, p, 1) or
> > send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> > immediately obvious.
> >
> > The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
> 
> No, SEND_SIG_NOINFO doesn't work too. Oh, can't understand what I was
> thinking about. current is the random task, but send_signal() checks
> if the caller is from-parent-ns.
> 
> > Note: we need more cleanups here, this is only the first change.
> 
> We need the cleanups first. Until then oom-killer has to use force_sig()
> if we want to kill the SIGNAL_UNKILLABLE tasks too.

This definitely needed. OOM-Killer is not racist ;)

Thanks.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/1] signals: introduce send_sigkill() helper
@ 2010-06-16 10:00             ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2010-06-16 10:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Andrew Morton, Roland McGrath,
	Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
	KAMEZAWA Hiroyuki, Nick Piggin, Minchan Kim

> Andrew, please drop
> 
> 	signals-introduce-send_sigkill-helper.patch
> 
> I am stupid.
> 
> On 06/10, Oleg Nesterov wrote:
> >
> > Cleanup, no functional changes.
> >
> > There are a lot of buggy SIGKILL users in kernel. For example, almost
> > every force_sig(SIGKILL) is wrong. force_sig() is not safe, it assumes
> > that the task has the valid ->sighand, and in general it should be used
> > only for synchronous signals. send_sig(SIGKILL, p, 1) or
> > send_xxx(SEND_SIG_FORCED/SEND_SIG_PRIV) is not right too but this is not
> > immediately obvious.
> >
> > The only way to correctly send SIGKILL is send_sig_info(SEND_SIG_NOINFO)
> 
> No, SEND_SIG_NOINFO doesn't work too. Oh, can't understand what I was
> thinking about. current is the random task, but send_signal() checks
> if the caller is from-parent-ns.
> 
> > Note: we need more cleanups here, this is only the first change.
> 
> We need the cleanups first. Until then oom-killer has to use force_sig()
> if we want to kill the SIGNAL_UNKILLABLE tasks too.

This definitely needed. OOM-Killer is not racist ;)

Thanks.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2010-06-16 10:01 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-08 11:53 [0/10] 3rd pile of OOM patch series KOSAKI Motohiro
2010-06-08 11:53 ` KOSAKI Motohiro
2010-06-08 11:54 ` [PATCH 01/10] oom: don't try to kill oom_unkillable child KOSAKI Motohiro
2010-06-08 11:54   ` KOSAKI Motohiro
2010-06-08 19:10   ` David Rientjes
2010-06-08 19:10     ` David Rientjes
2010-06-08 11:55 ` [PATCH 02/10] oom: remove verbose argument from __oom_kill_process() KOSAKI Motohiro
2010-06-08 11:55   ` KOSAKI Motohiro
2010-06-08 19:09   ` David Rientjes
2010-06-08 19:09     ` David Rientjes
2010-06-08 11:56 ` [PATCH 03/10] oom: rename badness() to oom_badness() KOSAKI Motohiro
2010-06-08 11:56   ` KOSAKI Motohiro
2010-06-08 19:09   ` David Rientjes
2010-06-08 19:09     ` David Rientjes
2010-06-08 11:57 ` [PATCH 04/10] oom: move sysctl declarations to oom.h KOSAKI Motohiro
2010-06-08 11:57   ` KOSAKI Motohiro
2010-06-08 11:58 ` [PATCH 05/10] oom: enable oom tasklist dump by default KOSAKI Motohiro
2010-06-08 11:58   ` KOSAKI Motohiro
2010-06-08 11:59 ` [PATCH 06/10] oom: cleanup has_intersects_mems_allowed() KOSAKI Motohiro
2010-06-08 11:59   ` KOSAKI Motohiro
2010-06-08 19:07   ` David Rientjes
2010-06-08 19:07     ` David Rientjes
2010-06-08 11:59 ` [PATCH 07/10] oom: kill useless debug print KOSAKI Motohiro
2010-06-08 11:59   ` KOSAKI Motohiro
2010-06-08 19:01   ` David Rientjes
2010-06-08 19:01     ` David Rientjes
2010-06-08 12:01 ` [PATCH 08/10] oom: use send_sig() instead force_sig() KOSAKI Motohiro
2010-06-08 12:01   ` KOSAKI Motohiro
2010-06-08 18:41   ` Oleg Nesterov
2010-06-08 18:41     ` Oleg Nesterov
2010-06-10  0:59     ` [PATCH 0/1] signals: introduce send_sigkill() helper Oleg Nesterov
2010-06-10  0:59       ` Oleg Nesterov
2010-06-10  1:00       ` [PATCH 1/1] " Oleg Nesterov
2010-06-10  1:00         ` Oleg Nesterov
2010-06-11  0:40         ` KAMEZAWA Hiroyuki
2010-06-11  0:40           ` KAMEZAWA Hiroyuki
2010-06-13 11:24         ` KOSAKI Motohiro
2010-06-13 11:24           ` KOSAKI Motohiro
2010-06-13 15:29         ` Oleg Nesterov
2010-06-13 15:29           ` Oleg Nesterov
2010-06-16 10:00           ` KOSAKI Motohiro
2010-06-16 10:00             ` KOSAKI Motohiro
2010-06-13 11:24     ` [PATCH 08/10] oom: use send_sig() instead force_sig() KOSAKI Motohiro
2010-06-13 11:24       ` KOSAKI Motohiro
2010-06-08 12:02 ` [PATCH 09/10] oom: filter tasks not sharing the same cpuset KOSAKI Motohiro
2010-06-08 12:02   ` KOSAKI Motohiro
2010-06-08 19:05   ` David Rientjes
2010-06-08 19:05     ` David Rientjes
2010-06-08 12:04 ` [PATCH 10/10] oom: select task from tasklist for mempolicy ooms KOSAKI Motohiro
2010-06-08 12:04   ` KOSAKI Motohiro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.