All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-05-31  9:33 ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:33 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: Oleg Nesterov <oleg@redhat.com>
Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads

select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
is not true due to use_mm().

Change the code to check PF_KTHREAD.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 709aedf..070b713 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		/*
-		 * skip kernel threads and tasks which have already released
-		 * their mm.
-		 */
+		/* skip the tasks which have already released their mm. */
 		if (!p->mm)
 			continue;
-		/* skip the init task */
-		if (is_global_init(p))
+		/* skip the init task and kthreads */
+		if (is_global_init(p) || (p->flags & PF_KTHREAD))
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-- 
1.6.5.2





^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-05-31  9:33 ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:33 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: Oleg Nesterov <oleg@redhat.com>
Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads

select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
is not true due to use_mm().

Change the code to check PF_KTHREAD.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 709aedf..070b713 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		/*
-		 * skip kernel threads and tasks which have already released
-		 * their mm.
-		 */
+		/* skip the tasks which have already released their mm. */
 		if (!p->mm)
 			continue;
-		/* skip the init task */
-		if (is_global_init(p))
+		/* skip the init task and kthreads */
+		if (is_global_init(p) || (p->flags & PF_KTHREAD))
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-- 
1.6.5.2




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
  2010-05-31  9:33 ` KOSAKI Motohiro
@ 2010-05-31  9:35   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:35 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: Oleg Nesterov <oleg@redhat.com>
Subject: oom: select_bad_process: PF_EXITING check should take ->mm into account

select_bad_process() checks PF_EXITING to detect the task which is going
to release its memory, but the logic is very wrong.

	- a single process P with the dead group leader disables
	  select_bad_process() completely, it will always return
	  ERR_PTR() while P can live forever

	- if the PF_EXITING task has already released its ->mm
	  it doesn't make sense to expect it is goiing to free
	  more memory (except task_struct/etc)

Change the code to ignore the PF_EXITING tasks without ->mm.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase to latest -mm]
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 070b713..c87a6f4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -287,7 +287,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (p->flags & PF_EXITING) {
+		if ((p->flags & PF_EXITING) && p->mm) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
@ 2010-05-31  9:35   ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:35 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: Oleg Nesterov <oleg@redhat.com>
Subject: oom: select_bad_process: PF_EXITING check should take ->mm into account

select_bad_process() checks PF_EXITING to detect the task which is going
to release its memory, but the logic is very wrong.

	- a single process P with the dead group leader disables
	  select_bad_process() completely, it will always return
	  ERR_PTR() while P can live forever

	- if the PF_EXITING task has already released its ->mm
	  it doesn't make sense to expect it is goiing to free
	  more memory (except task_struct/etc)

Change the code to ignore the PF_EXITING tasks without ->mm.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase to latest -mm]
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 070b713..c87a6f4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -287,7 +287,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (p->flags & PF_EXITING) {
+		if ((p->flags & PF_EXITING) && p->mm) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-05-31  9:33 ` KOSAKI Motohiro
@ 2010-05-31  9:36   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:36 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: Oleg Nesterov <oleg@redhat.com>
Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives

Almost all ->mm == NUL checks in oom_kill.c are wrong.

The current code assumes that the task without ->mm has already
released its memory and ignores the process. However this is not
necessarily true when this process is multithreaded, other live
sub-threads can use this ->mm.

- Remove the "if (!p->mm)" check in select_bad_process(), it is
  just wrong.

- Add the new helper, find_lock_task_mm(), which finds the live
  thread which uses the memory and takes task_lock() to pin ->mm

- change oom_badness() to use this helper instead of just checking
  ->mm != NULL.

- As David pointed out, select_bad_process() must never choose the
  task without ->mm, but no matter what badness() returns the
  task can be chosen if nothing else has been found yet.

Note! This patch is not enough, we need more changes.

	- badness() was fixed, but oom_kill_task() still ignores
	  the task without ->mm

This will be addressed later.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
latest -mm and remove some obsoleted description]
---
 mm/oom_kill.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c87a6f4..162af2e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -52,6 +52,19 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
 	return 0;
 }
 
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+	struct task_struct *t = p;
+	do {
+		task_lock(t);
+		if (likely(t->mm))
+			return t;
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return NULL;
+}
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -74,7 +87,6 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
-	struct mm_struct *mm;
 	struct task_struct *child;
 	int oom_adj = p->signal->oom_adj;
 	struct task_cputime task_time;
@@ -84,17 +96,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	if (oom_adj == OOM_DISABLE)
 		return 0;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
+	p = find_lock_task_mm(p);
+	if (!p)
 		return 0;
-	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = p->mm->total_vm;
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -117,7 +126,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 */
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
-		if (child->mm != mm && child->mm)
+		if (child->mm != p->mm && child->mm)
 			points += child->mm->total_vm/2 + 1;
 		task_unlock(child);
 	}
@@ -256,9 +265,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		/* skip the tasks which have already released their mm. */
-		if (!p->mm)
-			continue;
 		/* skip the init task and kthreads */
 		if (is_global_init(p) || (p->flags & PF_KTHREAD))
 			continue;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
@ 2010-05-31  9:36   ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:36 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: Oleg Nesterov <oleg@redhat.com>
Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives

Almost all ->mm == NUL checks in oom_kill.c are wrong.

The current code assumes that the task without ->mm has already
released its memory and ignores the process. However this is not
necessarily true when this process is multithreaded, other live
sub-threads can use this ->mm.

- Remove the "if (!p->mm)" check in select_bad_process(), it is
  just wrong.

- Add the new helper, find_lock_task_mm(), which finds the live
  thread which uses the memory and takes task_lock() to pin ->mm

- change oom_badness() to use this helper instead of just checking
  ->mm != NULL.

- As David pointed out, select_bad_process() must never choose the
  task without ->mm, but no matter what badness() returns the
  task can be chosen if nothing else has been found yet.

Note! This patch is not enough, we need more changes.

	- badness() was fixed, but oom_kill_task() still ignores
	  the task without ->mm

This will be addressed later.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
latest -mm and remove some obsoleted description]
---
 mm/oom_kill.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c87a6f4..162af2e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -52,6 +52,19 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
 	return 0;
 }
 
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+	struct task_struct *t = p;
+	do {
+		task_lock(t);
+		if (likely(t->mm))
+			return t;
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return NULL;
+}
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -74,7 +87,6 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
-	struct mm_struct *mm;
 	struct task_struct *child;
 	int oom_adj = p->signal->oom_adj;
 	struct task_cputime task_time;
@@ -84,17 +96,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	if (oom_adj == OOM_DISABLE)
 		return 0;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
+	p = find_lock_task_mm(p);
+	if (!p)
 		return 0;
-	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = p->mm->total_vm;
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -117,7 +126,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 */
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
-		if (child->mm != mm && child->mm)
+		if (child->mm != p->mm && child->mm)
 			points += child->mm->total_vm/2 + 1;
 		task_unlock(child);
 	}
@@ -256,9 +265,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		/* skip the tasks which have already released their mm. */
-		if (!p->mm)
-			continue;
 		/* skip the init task and kthreads */
 		if (is_global_init(p) || (p->flags & PF_KTHREAD))
 			continue;
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too
  2010-05-31  9:33 ` KOSAKI Motohiro
@ 2010-05-31  9:37   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:37 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too

child point calclation use find_lock_task_mm() too.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 162af2e..30d9da0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -87,6 +87,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
+	struct task_struct *c;
 	struct task_struct *child;
 	int oom_adj = p->signal->oom_adj;
 	struct task_cputime task_time;
@@ -124,11 +125,13 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * child is eating the vast majority of memory, adding only half
 	 * to the parents will make the child our kill candidate of choice.
 	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != p->mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
+	list_for_each_entry(c, &p->children, sibling) {
+		child = find_lock_task_mm(c);
+		if (child) {
+			if (child->mm != p->mm)
+				points += child->mm->total_vm/2 + 1;
+			task_unlock(child);
+		}
 	}
 
 	/*
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too
@ 2010-05-31  9:37   ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:37 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too

child point calclation use find_lock_task_mm() too.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 162af2e..30d9da0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -87,6 +87,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
+	struct task_struct *c;
 	struct task_struct *child;
 	int oom_adj = p->signal->oom_adj;
 	struct task_cputime task_time;
@@ -124,11 +125,13 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * child is eating the vast majority of memory, adding only half
 	 * to the parents will make the child our kill candidate of choice.
 	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != p->mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
+	list_for_each_entry(c, &p->children, sibling) {
+		child = find_lock_task_mm(c);
+		if (child) {
+			if (child->mm != p->mm)
+				points += child->mm->total_vm/2 + 1;
+			task_unlock(child);
+		}
 	}
 
 	/*
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
  2010-05-31  9:33 ` KOSAKI Motohiro
@ 2010-05-31  9:38   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:38 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too

__oom_kill_task also use find_lock_task_mm(). because if sysctl_oom_kill_allocating_task
is true, __out_of_memory() don't call select_bad_process().

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 30d9da0..f6aa3fc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -394,12 +394,11 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
 		return;
 	}
 
-	task_lock(p);
-	if (!p->mm) {
+	p = find_lock_task_mm(p);
+	if (!p) {
 		WARN_ON(1);
 		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
 			task_pid_nr(p), p->comm);
-		task_unlock(p);
 		return;
 	}
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
@ 2010-05-31  9:38   ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31  9:38 UTC (permalink / raw)
  To: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin
  Cc: kosaki.motohiro

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too

__oom_kill_task also use find_lock_task_mm(). because if sysctl_oom_kill_allocating_task
is true, __out_of_memory() don't call select_bad_process().

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 30d9da0..f6aa3fc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -394,12 +394,11 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
 		return;
 	}
 
-	task_lock(p);
-	if (!p->mm) {
+	p = find_lock_task_mm(p);
+	if (!p) {
 		WARN_ON(1);
 		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
 			task_pid_nr(p), p->comm);
-		task_unlock(p);
 		return;
 	}
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
  2010-05-31  9:35   ` KOSAKI Motohiro
@ 2010-05-31 16:43     ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-05-31 16:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

Thanks a lot Kosaki for doing this!

I still can't find the time to play with this code :/

On 05/31, KOSAKI Motohiro wrote:
>
> select_bad_process() checks PF_EXITING to detect the task which is going
> to release its memory, but the logic is very wrong.
>
> 	- a single process P with the dead group leader disables
> 	  select_bad_process() completely, it will always return
> 	  ERR_PTR() while P can live forever
>
> 	- if the PF_EXITING task has already released its ->mm
> 	  it doesn't make sense to expect it is goiing to free
> 	  more memory (except task_struct/etc)
>
> Change the code to ignore the PF_EXITING tasks without ->mm.
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -287,7 +287,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		 * the process of exiting and releasing its resources.
>  		 * Otherwise we could get an easy OOM deadlock.
>  		 */
> -		if (p->flags & PF_EXITING) {
> +		if ((p->flags & PF_EXITING) && p->mm) {

(strictly speaking, this change is needed after 3/5 which removes the
 top-level "if (!p->mm)" check in select_bad_process).


I'd like to add a note... with or without this, we have problems
with the coredump. A thread participating in the coredumping
(group-leader in this case) can have PF_EXITING && mm, but this doesn't
mean it is going to exit soon, and the dumper can use a lot more memory.

Otoh, if select_bad_process() chooses the thread which dumps the core,
SIGKILL can't stop it. This should be fixed in do_coredump() paths, this
is the long-standing problem.

And, as it was already discussed, we only check the group-leader here.
But I can't suggest something better.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
@ 2010-05-31 16:43     ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-05-31 16:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

Thanks a lot Kosaki for doing this!

I still can't find the time to play with this code :/

On 05/31, KOSAKI Motohiro wrote:
>
> select_bad_process() checks PF_EXITING to detect the task which is going
> to release its memory, but the logic is very wrong.
>
> 	- a single process P with the dead group leader disables
> 	  select_bad_process() completely, it will always return
> 	  ERR_PTR() while P can live forever
>
> 	- if the PF_EXITING task has already released its ->mm
> 	  it doesn't make sense to expect it is goiing to free
> 	  more memory (except task_struct/etc)
>
> Change the code to ignore the PF_EXITING tasks without ->mm.
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -287,7 +287,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		 * the process of exiting and releasing its resources.
>  		 * Otherwise we could get an easy OOM deadlock.
>  		 */
> -		if (p->flags & PF_EXITING) {
> +		if ((p->flags & PF_EXITING) && p->mm) {

(strictly speaking, this change is needed after 3/5 which removes the
 top-level "if (!p->mm)" check in select_bad_process).


I'd like to add a note... with or without this, we have problems
with the coredump. A thread participating in the coredumping
(group-leader in this case) can have PF_EXITING && mm, but this doesn't
mean it is going to exit soon, and the dumper can use a lot more memory.

Otoh, if select_bad_process() chooses the thread which dumps the core,
SIGKILL can't stop it. This should be fixed in do_coredump() paths, this
is the long-standing problem.

And, as it was already discussed, we only check the group-leader here.
But I can't suggest something better.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too
  2010-05-31  9:37   ` KOSAKI Motohiro
@ 2010-05-31 16:56     ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-05-31 16:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 05/31, KOSAKI Motohiro wrote:
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -87,6 +87,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
>  unsigned long badness(struct task_struct *p, unsigned long uptime)
>  {
>  	unsigned long points, cpu_time, run_time;
> +	struct task_struct *c;
>  	struct task_struct *child;
>  	int oom_adj = p->signal->oom_adj;
>  	struct task_cputime task_time;
> @@ -124,11 +125,13 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 * child is eating the vast majority of memory, adding only half
>  	 * to the parents will make the child our kill candidate of choice.
>  	 */
> -	list_for_each_entry(child, &p->children, sibling) {
> -		task_lock(child);
> -		if (child->mm != p->mm && child->mm)
> -			points += child->mm->total_vm/2 + 1;
> -		task_unlock(child);
> +	list_for_each_entry(c, &p->children, sibling) {
> +		child = find_lock_task_mm(c);
> +		if (child) {
> +			if (child->mm != p->mm)
> +				points += child->mm->total_vm/2 + 1;
> +			task_unlock(child);
> +		}

Acked-by: Oleg Nesterov <oleg@redhat.com>




And, I think we need another patch on top of this one. Note that
this list_for_each_entry(p->children) can only see the tasks forked
by p, it can't see other children forked by its sub-threads.

IOW, we need

	do {
		list_for_each_entry(c, &t->children, sibling) {
			...
		}
	} while_each_thread(p, t);

Probably the same for oom_kill_process().

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too
@ 2010-05-31 16:56     ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-05-31 16:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 05/31, KOSAKI Motohiro wrote:
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -87,6 +87,7 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
>  unsigned long badness(struct task_struct *p, unsigned long uptime)
>  {
>  	unsigned long points, cpu_time, run_time;
> +	struct task_struct *c;
>  	struct task_struct *child;
>  	int oom_adj = p->signal->oom_adj;
>  	struct task_cputime task_time;
> @@ -124,11 +125,13 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 * child is eating the vast majority of memory, adding only half
>  	 * to the parents will make the child our kill candidate of choice.
>  	 */
> -	list_for_each_entry(child, &p->children, sibling) {
> -		task_lock(child);
> -		if (child->mm != p->mm && child->mm)
> -			points += child->mm->total_vm/2 + 1;
> -		task_unlock(child);
> +	list_for_each_entry(c, &p->children, sibling) {
> +		child = find_lock_task_mm(c);
> +		if (child) {
> +			if (child->mm != p->mm)
> +				points += child->mm->total_vm/2 + 1;
> +			task_unlock(child);
> +		}

Acked-by: Oleg Nesterov <oleg@redhat.com>




And, I think we need another patch on top of this one. Note that
this list_for_each_entry(p->children) can only see the tasks forked
by p, it can't see other children forked by its sub-threads.

IOW, we need

	do {
		list_for_each_entry(c, &t->children, sibling) {
			...
		}
	} while_each_thread(p, t);

Probably the same for oom_kill_process().

What do you think?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too
  2010-05-31 16:56     ` Oleg Nesterov
@ 2010-05-31 23:48       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31 23:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> And, I think we need another patch on top of this one. Note that
> this list_for_each_entry(p->children) can only see the tasks forked
> by p, it can't see other children forked by its sub-threads.
> 
> IOW, we need
> 
> 	do {
> 		list_for_each_entry(c, &t->children, sibling) {
> 			...
> 		}
> 	} while_each_thread(p, t);
> 
> Probably the same for oom_kill_process().
> 
> What do you think?

Makes perfectly sense. I have to do it!

Thanks good reviewing!





^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too
@ 2010-05-31 23:48       ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-05-31 23:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> And, I think we need another patch on top of this one. Note that
> this list_for_each_entry(p->children) can only see the tasks forked
> by p, it can't see other children forked by its sub-threads.
> 
> IOW, we need
> 
> 	do {
> 		list_for_each_entry(c, &t->children, sibling) {
> 			...
> 		}
> 	} while_each_thread(p, t);
> 
> Probably the same for oom_kill_process().
> 
> What do you think?

Makes perfectly sense. I have to do it!

Thanks good reviewing!




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-05-31  9:33 ` KOSAKI Motohiro
@ 2010-06-01  0:54   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-01  0:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	Nick Piggin

On Mon, 31 May 2010 18:33:06 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-01  0:54   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-01  0:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	Nick Piggin

On Mon, 31 May 2010 18:33:06 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-05-31  9:36   ` KOSAKI Motohiro
@ 2010-06-01  0:57     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-01  0:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	Nick Piggin

On Mon, 31 May 2010 18:36:34 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
> 
> Almost all ->mm == NUL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> This will be addressed later.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
> latest -mm and remove some obsoleted description]

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
@ 2010-06-01  0:57     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-01  0:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	Nick Piggin

On Mon, 31 May 2010 18:36:34 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
> 
> Almost all ->mm == NUL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> This will be addressed later.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
> latest -mm and remove some obsoleted description]

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
  2010-05-31  9:38   ` KOSAKI Motohiro
@ 2010-06-01  1:02     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-01  1:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	Nick Piggin

On Mon, 31 May 2010 18:38:14 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
> 
> __oom_kill_task also use find_lock_task_mm(). because if sysctl_oom_kill_allocating_task
> is true, __out_of_memory() don't call select_bad_process().
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
@ 2010-06-01  1:02     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 110+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-01  1:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	Nick Piggin

On Mon, 31 May 2010 18:38:14 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
> 
> __oom_kill_task also use find_lock_task_mm(). because if sysctl_oom_kill_allocating_task
> is true, __out_of_memory() don't call select_bad_process().
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
  2010-05-31 16:43     ` Oleg Nesterov
@ 2010-06-01  1:10       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  1:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

Hi

> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -287,7 +287,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  		 * the process of exiting and releasing its resources.
> >  		 * Otherwise we could get an easy OOM deadlock.
> >  		 */
> > -		if (p->flags & PF_EXITING) {
> > +		if ((p->flags & PF_EXITING) && p->mm) {
> 
> (strictly speaking, this change is needed after 3/5 which removes the
>  top-level "if (!p->mm)" check in select_bad_process).
> 
> 
> I'd like to add a note... with or without this, we have problems
> with the coredump. A thread participating in the coredumping
> (group-leader in this case) can have PF_EXITING && mm, but this doesn't
> mean it is going to exit soon, and the dumper can use a lot more memory.

Sure. I think coredump sould do nothing if oom occur.
So, merely making PF_COREDUMP is bad idea? I mean


task-flags		allocator
------------------------------------------------
none			N/A
TIF_MEMDIE		allow to use emergency memory.
			don't call page reclaim.
PF_COREDUMP		N/A
TIF_MEMDIE+PF_COREDUMP	disallow to use emergency memory.
			don't call page reclaim.

In other word, coredump path makes allocation failure if the task
marked as TIF_MEMDIE.
And, userland oom helper should be marked PF_OOM_ORIGIN perhaps.


> Otoh, if select_bad_process() chooses the thread which dumps the core,
> SIGKILL can't stop it. This should be fixed in do_coredump() paths, this
> is the long-standing problem.
> 
> And, as it was already discussed, we only check the group-leader here.
> But I can't suggest something better.

I guess signal_group_exit() is enough in practical case. I mean
exit(2) is only used by pthread_exit(3), so practically the last thread
in the process don't die by using exit(2).

I don't say signal_group_exit() is no side-effect. but I guess originally
intention was testing during _process_ exiting. 

Am I missing something?




^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
@ 2010-06-01  1:10       ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  1:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

Hi

> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -287,7 +287,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  		 * the process of exiting and releasing its resources.
> >  		 * Otherwise we could get an easy OOM deadlock.
> >  		 */
> > -		if (p->flags & PF_EXITING) {
> > +		if ((p->flags & PF_EXITING) && p->mm) {
> 
> (strictly speaking, this change is needed after 3/5 which removes the
>  top-level "if (!p->mm)" check in select_bad_process).
> 
> 
> I'd like to add a note... with or without this, we have problems
> with the coredump. A thread participating in the coredumping
> (group-leader in this case) can have PF_EXITING && mm, but this doesn't
> mean it is going to exit soon, and the dumper can use a lot more memory.

Sure. I think coredump sould do nothing if oom occur.
So, merely making PF_COREDUMP is bad idea? I mean


task-flags		allocator
------------------------------------------------
none			N/A
TIF_MEMDIE		allow to use emergency memory.
			don't call page reclaim.
PF_COREDUMP		N/A
TIF_MEMDIE+PF_COREDUMP	disallow to use emergency memory.
			don't call page reclaim.

In other word, coredump path makes allocation failure if the task
marked as TIF_MEMDIE.
And, userland oom helper should be marked PF_OOM_ORIGIN perhaps.


> Otoh, if select_bad_process() chooses the thread which dumps the core,
> SIGKILL can't stop it. This should be fixed in do_coredump() paths, this
> is the long-standing problem.
> 
> And, as it was already discussed, we only check the group-leader here.
> But I can't suggest something better.

I guess signal_group_exit() is enough in practical case. I mean
exit(2) is only used by pthread_exit(3), so practically the last thread
in the process don't die by using exit(2).

I don't say signal_group_exit() is no side-effect. but I guess originally
intention was testing during _process_ exiting. 

Am I missing something?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
  2010-06-01  1:10       ` KOSAKI Motohiro
@ 2010-06-01 20:18         ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-01 20:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 06/01, KOSAKI Motohiro wrote:
>
> > I'd like to add a note... with or without this, we have problems
> > with the coredump. A thread participating in the coredumping
> > (group-leader in this case) can have PF_EXITING && mm, but this doesn't
> > mean it is going to exit soon, and the dumper can use a lot more memory.
>
> Sure. I think coredump sould do nothing if oom occur.
> So, merely making PF_COREDUMP is bad idea? I mean
>
> task-flags		allocator
> ------------------------------------------------
> none			N/A
> TIF_MEMDIE		allow to use emergency memory.
> 			don't call page reclaim.
> PF_COREDUMP		N/A
> TIF_MEMDIE+PF_COREDUMP	disallow to use emergency memory.
> 			don't call page reclaim.
>
> In other word, coredump path makes allocation failure if the task
> marked as TIF_MEMDIE.

Perhaps... But where should TIF_MEMDIE go this case? Let me clarify.

Two threads, group-leader L and its sub-thread T. T dumps the code.
In this case both threads have ->mm != NULL, L has PF_EXITING.

The first problem is, select_bad_process() always return -1 in this
case (even if the caller is T, this doesn't matter).

The second problem is that we should add TIF_MEMDIE to T, not L.

This is more or less easy. For simplicity, let's suppose we removed
this PF_EXITING check from select_bad_process().

Otoh, if we make do_coredump() interruptible (and we should do this
in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
needed? Afaics we always send SIGKILL along with TIF_MEMDIE.

> > And, as it was already discussed, we only check the group-leader here.
> > But I can't suggest something better.
>
> I guess signal_group_exit() is enough in practical case.

Unlike SIGNAL_GROUP_EXIT check, signal_group_exit() can also mean
exec. This is probably correct. If we see the task inside de_thread()
he is going to free its old mm soon.

The problem is this check doesn't cover the case when a single-threaded
task exits (even if it does sys_exit_group). And it is not enough to
remove the thread_group_empty-case-optimization from do_group_exit(),
it can call sys_exit() instead.

But anyway I agree, select_bad_process can probably check

	signal_group_exit() || (PF_EXITINF && thread_group_empty())

And in that case it is better to remove the "&& p->mm" part of the
current check.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
@ 2010-06-01 20:18         ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-01 20:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 06/01, KOSAKI Motohiro wrote:
>
> > I'd like to add a note... with or without this, we have problems
> > with the coredump. A thread participating in the coredumping
> > (group-leader in this case) can have PF_EXITING && mm, but this doesn't
> > mean it is going to exit soon, and the dumper can use a lot more memory.
>
> Sure. I think coredump sould do nothing if oom occur.
> So, merely making PF_COREDUMP is bad idea? I mean
>
> task-flags		allocator
> ------------------------------------------------
> none			N/A
> TIF_MEMDIE		allow to use emergency memory.
> 			don't call page reclaim.
> PF_COREDUMP		N/A
> TIF_MEMDIE+PF_COREDUMP	disallow to use emergency memory.
> 			don't call page reclaim.
>
> In other word, coredump path makes allocation failure if the task
> marked as TIF_MEMDIE.

Perhaps... But where should TIF_MEMDIE go this case? Let me clarify.

Two threads, group-leader L and its sub-thread T. T dumps the code.
In this case both threads have ->mm != NULL, L has PF_EXITING.

The first problem is, select_bad_process() always return -1 in this
case (even if the caller is T, this doesn't matter).

The second problem is that we should add TIF_MEMDIE to T, not L.

This is more or less easy. For simplicity, let's suppose we removed
this PF_EXITING check from select_bad_process().

Otoh, if we make do_coredump() interruptible (and we should do this
in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
needed? Afaics we always send SIGKILL along with TIF_MEMDIE.

> > And, as it was already discussed, we only check the group-leader here.
> > But I can't suggest something better.
>
> I guess signal_group_exit() is enough in practical case.

Unlike SIGNAL_GROUP_EXIT check, signal_group_exit() can also mean
exec. This is probably correct. If we see the task inside de_thread()
he is going to free its old mm soon.

The problem is this check doesn't cover the case when a single-threaded
task exits (even if it does sys_exit_group). And it is not enough to
remove the thread_group_empty-case-optimization from do_group_exit(),
it can call sys_exit() instead.

But anyway I agree, select_bad_process can probably check

	signal_group_exit() || (PF_EXITINF && thread_group_empty())

And in that case it is better to remove the "&& p->mm" part of the
current check.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-05-31  9:33 ` KOSAKI Motohiro
@ 2010-06-01 20:36   ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

This is already pushed in my oom killer rewrite as patch 14/18 "check 
PF_KTHREAD instead of !mm to skip kthreads".

This does not need to be merged immediately since it's not vital: use_mm() 
is only temporary state and these kthreads will once again be excluded 
when they call unuse_mm().  The worst case scenario here is that the oom 
killer will erroneously select one of these kthreads which cannot die and 
will need to reselect another task on its next call.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-01 20:36   ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

This is already pushed in my oom killer rewrite as patch 14/18 "check 
PF_KTHREAD instead of !mm to skip kthreads".

This does not need to be merged immediately since it's not vital: use_mm() 
is only temporary state and these kthreads will once again be excluded 
when they call unuse_mm().  The worst case scenario here is that the oom 
killer will erroneously select one of these kthreads which cannot die and 
will need to reselect another task on its next call.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
  2010-05-31  9:35   ` KOSAKI Motohiro
@ 2010-06-01 20:39     ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: PF_EXITING check should take ->mm into account
> 
> select_bad_process() checks PF_EXITING to detect the task which is going
> to release its memory, but the logic is very wrong.
> 
> 	- a single process P with the dead group leader disables
> 	  select_bad_process() completely, it will always return
> 	  ERR_PTR() while P can live forever
> 
> 	- if the PF_EXITING task has already released its ->mm
> 	  it doesn't make sense to expect it is goiing to free
> 	  more memory (except task_struct/etc)
> 
> Change the code to ignore the PF_EXITING tasks without ->mm.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase to latest -mm]

This is already pushed in my oom killer rewrite as patch 13/18 "oom: avoid 
race for oom killed tasks detaching mm prior to exit".

It's not vital to merge now because causing the oom killer to temporarily 
become a no-op before it can fully exit even though it has already 
detached its memory only delays killing another task until it exits and 
there's nothing in the way of that exiting while it's still under 
PF_EXITING.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account
@ 2010-06-01 20:39     ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: PF_EXITING check should take ->mm into account
> 
> select_bad_process() checks PF_EXITING to detect the task which is going
> to release its memory, but the logic is very wrong.
> 
> 	- a single process P with the dead group leader disables
> 	  select_bad_process() completely, it will always return
> 	  ERR_PTR() while P can live forever
> 
> 	- if the PF_EXITING task has already released its ->mm
> 	  it doesn't make sense to expect it is goiing to free
> 	  more memory (except task_struct/etc)
> 
> Change the code to ignore the PF_EXITING tasks without ->mm.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase to latest -mm]

This is already pushed in my oom killer rewrite as patch 13/18 "oom: avoid 
race for oom killed tasks detaching mm prior to exit".

It's not vital to merge now because causing the oom killer to temporarily 
become a no-op before it can fully exit even though it has already 
detached its memory only delays killing another task until it exits and 
there's nothing in the way of that exiting while it's still under 
PF_EXITING.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-05-31  9:36   ` KOSAKI Motohiro
@ 2010-06-01 20:42     ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
> 
> Almost all ->mm == NUL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> This will be addressed later.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
> latest -mm and remove some obsoleted description]

This is already pushed as part of my oom killer rewrite in patch 15/18 
"oom: introduce find_lock_task_mm to fix !mm false positives".

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
@ 2010-06-01 20:42     ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
> 
> Almost all ->mm == NUL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> This will be addressed later.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
> latest -mm and remove some obsoleted description]

This is already pushed as part of my oom killer rewrite in patch 15/18 
"oom: introduce find_lock_task_mm to fix !mm false positives".

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
  2010-05-31  9:38   ` KOSAKI Motohiro
@ 2010-06-01 20:44     ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
> 
> __oom_kill_task also use find_lock_task_mm(). because if sysctl_oom_kill_allocating_task
> is true, __out_of_memory() don't call select_bad_process().
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

This code is removed as part of my oom killer rewrite as patch 12/18 "oom: 
remove unnecessary code and cleanup".

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
@ 2010-06-01 20:44     ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 20:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Mon, 31 May 2010, KOSAKI Motohiro wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH 5/5] oom: __oom_kill_task() must use find_lock_task_mm() too
> 
> __oom_kill_task also use find_lock_task_mm(). because if sysctl_oom_kill_allocating_task
> is true, __out_of_memory() don't call select_bad_process().
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

This code is removed as part of my oom killer rewrite as patch 12/18 "oom: 
remove unnecessary code and cleanup".

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-01 20:36   ` David Rientjes
@ 2010-06-01 21:20     ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-01 21:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/01, David Rientjes wrote:
>
> On Mon, 31 May 2010, KOSAKI Motohiro wrote:
>
> > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> > is not true due to use_mm().
> >
> > Change the code to check PF_KTHREAD.
>
> This is already pushed in my oom killer rewrite as patch 14/18 "check
> PF_KTHREAD instead of !mm to skip kthreads".
>
> This does not need to be merged immediately since it's not vital: use_mm()
> is only temporary state and these kthreads will once again be excluded
> when they call unuse_mm().  The worst case scenario here is that the oom
> killer will erroneously select one of these kthreads which cannot die

It can't die but force_sig() does bad things which shouldn't be done
with workqueue thread. Note that it removes SIG_IGN, sets
SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.

But yes, I agree, the problem is minor. But nevertheless it is bug,
the longstanding bug with the simple fix. Why should we "hide" this fix
inside the long series of non-trivial patches which rewrite oom-killer?
And it is completely orthogonal to other changes.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-01 21:20     ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-01 21:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/01, David Rientjes wrote:
>
> On Mon, 31 May 2010, KOSAKI Motohiro wrote:
>
> > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> > is not true due to use_mm().
> >
> > Change the code to check PF_KTHREAD.
>
> This is already pushed in my oom killer rewrite as patch 14/18 "check
> PF_KTHREAD instead of !mm to skip kthreads".
>
> This does not need to be merged immediately since it's not vital: use_mm()
> is only temporary state and these kthreads will once again be excluded
> when they call unuse_mm().  The worst case scenario here is that the oom
> killer will erroneously select one of these kthreads which cannot die

It can't die but force_sig() does bad things which shouldn't be done
with workqueue thread. Note that it removes SIG_IGN, sets
SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.

But yes, I agree, the problem is minor. But nevertheless it is bug,
the longstanding bug with the simple fix. Why should we "hide" this fix
inside the long series of non-trivial patches which rewrite oom-killer?
And it is completely orthogonal to other changes.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-01 21:20     ` Oleg Nesterov
@ 2010-06-01 21:26       ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 21:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Tue, 1 Jun 2010, Oleg Nesterov wrote:

> But yes, I agree, the problem is minor. But nevertheless it is bug,
> the longstanding bug with the simple fix. Why should we "hide" this fix
> inside the long series of non-trivial patches which rewrite oom-killer?
> And it is completely orthogonal to other changes.
> 

Again, the question is whether or not the fix is rc material or not, 
otherwise there's no difference in the route that it gets upstream: the 
patch is duplicated in both series.  If you feel that this minor issue 
(which has never been reported in at least the last three years and 
doesn't have any side effects other than a couple of millisecond delay 
until unuse_mm() when the oom killer will kill something else) should be 
addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-01 21:26       ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-01 21:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Tue, 1 Jun 2010, Oleg Nesterov wrote:

> But yes, I agree, the problem is minor. But nevertheless it is bug,
> the longstanding bug with the simple fix. Why should we "hide" this fix
> inside the long series of non-trivial patches which rewrite oom-killer?
> And it is completely orthogonal to other changes.
> 

Again, the question is whether or not the fix is rc material or not, 
otherwise there's no difference in the route that it gets upstream: the 
patch is duplicated in both series.  If you feel that this minor issue 
(which has never been reported in at least the last three years and 
doesn't have any side effects other than a couple of millisecond delay 
until unuse_mm() when the oom killer will kill something else) should be 
addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [PATCH] oom: remove PF_EXITING check completely
  2010-06-01 20:18         ` Oleg Nesterov
@ 2010-06-02 13:54           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> On 06/01, KOSAKI Motohiro wrote:
> >
> > > I'd like to add a note... with or without this, we have problems
> > > with the coredump. A thread participating in the coredumping
> > > (group-leader in this case) can have PF_EXITING && mm, but this doesn't
> > > mean it is going to exit soon, and the dumper can use a lot more memory.
> >
> > Sure. I think coredump sould do nothing if oom occur.
> > So, merely making PF_COREDUMP is bad idea? I mean
> >
> > task-flags		allocator
> > ------------------------------------------------
> > none			N/A
> > TIF_MEMDIE		allow to use emergency memory.
> > 			don't call page reclaim.
> > PF_COREDUMP		N/A
> > TIF_MEMDIE+PF_COREDUMP	disallow to use emergency memory.
> > 			don't call page reclaim.
> >
> > In other word, coredump path makes allocation failure if the task
> > marked as TIF_MEMDIE.
> 
> Perhaps... But where should TIF_MEMDIE go this case? Let me clarify.
> 
> Two threads, group-leader L and its sub-thread T. T dumps the code.
> In this case both threads have ->mm != NULL, L has PF_EXITING.
> 
> The first problem is, select_bad_process() always return -1 in this
> case (even if the caller is T, this doesn't matter).
> 
> The second problem is that we should add TIF_MEMDIE to T, not L.
> 
> This is more or less easy. For simplicity, let's suppose we removed
> this PF_EXITING check from select_bad_process().

Today, I've thought to make some bandaid patches for this issue. but
yes, I've reached the same conclusion.

If we think multithread and core dump situation, all fixes are just
bandaid. We can't remove deadlock chance completely.

The deadlock is certenaly worst result, then, minor PF_EXITING optimization
doesn't have so much worth.


==============================================================
Subject: [PATCH] oom: remove PF_EXITING check completely

PF_EXITING is wrong check if the task have multiple threads. This patch
removes it.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
---
 mm/oom_kill.c |   27 ---------------------------
 1 files changed, 0 insertions(+), 27 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e7f0f9..b06f8d1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,24 +302,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
 
-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
-		if ((p->flags & PF_EXITING) && p->mm) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = ULONG_MAX;
-		}
-
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
@@ -436,15 +418,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	if (p->flags & PF_EXITING) {
-		__oom_kill_process(p, mem, 0);
-		return 0;
-	}
-
 	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
 					message, task_pid_nr(p), p->comm, points);
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH] oom: remove PF_EXITING check completely
@ 2010-06-02 13:54           ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> On 06/01, KOSAKI Motohiro wrote:
> >
> > > I'd like to add a note... with or without this, we have problems
> > > with the coredump. A thread participating in the coredumping
> > > (group-leader in this case) can have PF_EXITING && mm, but this doesn't
> > > mean it is going to exit soon, and the dumper can use a lot more memory.
> >
> > Sure. I think coredump sould do nothing if oom occur.
> > So, merely making PF_COREDUMP is bad idea? I mean
> >
> > task-flags		allocator
> > ------------------------------------------------
> > none			N/A
> > TIF_MEMDIE		allow to use emergency memory.
> > 			don't call page reclaim.
> > PF_COREDUMP		N/A
> > TIF_MEMDIE+PF_COREDUMP	disallow to use emergency memory.
> > 			don't call page reclaim.
> >
> > In other word, coredump path makes allocation failure if the task
> > marked as TIF_MEMDIE.
> 
> Perhaps... But where should TIF_MEMDIE go this case? Let me clarify.
> 
> Two threads, group-leader L and its sub-thread T. T dumps the code.
> In this case both threads have ->mm != NULL, L has PF_EXITING.
> 
> The first problem is, select_bad_process() always return -1 in this
> case (even if the caller is T, this doesn't matter).
> 
> The second problem is that we should add TIF_MEMDIE to T, not L.
> 
> This is more or less easy. For simplicity, let's suppose we removed
> this PF_EXITING check from select_bad_process().

Today, I've thought to make some bandaid patches for this issue. but
yes, I've reached the same conclusion.

If we think multithread and core dump situation, all fixes are just
bandaid. We can't remove deadlock chance completely.

The deadlock is certenaly worst result, then, minor PF_EXITING optimization
doesn't have so much worth.


==============================================================
Subject: [PATCH] oom: remove PF_EXITING check completely

PF_EXITING is wrong check if the task have multiple threads. This patch
removes it.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
---
 mm/oom_kill.c |   27 ---------------------------
 1 files changed, 0 insertions(+), 27 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e7f0f9..b06f8d1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,24 +302,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
 
-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
-		if ((p->flags & PF_EXITING) && p->mm) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = ULONG_MAX;
-		}
-
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
@@ -436,15 +418,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	if (p->flags & PF_EXITING) {
-		__oom_kill_process(p, mem, 0);
-		return 0;
-	}
-
 	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
 					message, task_pid_nr(p), p->comm, points);
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH] oom: Make coredump interruptible
  2010-06-01 20:18         ` Oleg Nesterov
@ 2010-06-02 13:54           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Otoh, if we make do_coredump() interruptible (and we should do this
> in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
> needed? Afaics we always send SIGKILL along with TIF_MEMDIE.

How is to make per-process oom flag + interruptible coredump?

this per-process oom flag can be used vmscan shortcut exiting too.
(IOW, It can help DavidR mmap_sem issue)


===========================================================
Subject: [PATCH] oom: Make coredump interruptible

If oom victim process is under core dumping, sending SIGKILL cause
no-op. Unfortunately, coredump need relatively much memory. It mean
OOM vs coredump can makes deadlock.

Then, coredump logic should check the task has received SIGKILL
from OOM.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/binfmt_elf.c       |    5 +++++
 include/linux/sched.h |    3 +++
 mm/oom_kill.c         |    1 +
 3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..aa47979 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2038,6 +2038,11 @@ static int elf_core_dump(struct coredump_params *cprm)
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
+
+			/* Now, The process received OOM. Exit soon! */
+			if (current->signal->oom_victim)
+				stop = 1;
+
 			if (stop)
 				goto end_coredump;
 		}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8485aa2..1c4fa86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,6 +544,9 @@ struct signal_struct {
 	int			notify_count;
 	struct task_struct	*group_exit_task;
 
+	/* true mean the process is OOM-killer victim. */
+	bool			oom_victim;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f33a1b8..39e31bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -400,6 +400,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	p->signal->oom_victim = true;
 
 	force_sig(SIGKILL, p);
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH] oom: Make coredump interruptible
@ 2010-06-02 13:54           ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Otoh, if we make do_coredump() interruptible (and we should do this
> in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
> needed? Afaics we always send SIGKILL along with TIF_MEMDIE.

How is to make per-process oom flag + interruptible coredump?

this per-process oom flag can be used vmscan shortcut exiting too.
(IOW, It can help DavidR mmap_sem issue)


===========================================================
Subject: [PATCH] oom: Make coredump interruptible

If oom victim process is under core dumping, sending SIGKILL cause
no-op. Unfortunately, coredump need relatively much memory. It mean
OOM vs coredump can makes deadlock.

Then, coredump logic should check the task has received SIGKILL
from OOM.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/binfmt_elf.c       |    5 +++++
 include/linux/sched.h |    3 +++
 mm/oom_kill.c         |    1 +
 3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..aa47979 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2038,6 +2038,11 @@ static int elf_core_dump(struct coredump_params *cprm)
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
+
+			/* Now, The process received OOM. Exit soon! */
+			if (current->signal->oom_victim)
+				stop = 1;
+
 			if (stop)
 				goto end_coredump;
 		}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8485aa2..1c4fa86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,6 +544,9 @@ struct signal_struct {
 	int			notify_count;
 	struct task_struct	*group_exit_task;
 
+	/* true mean the process is OOM-killer victim. */
+	bool			oom_victim;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f33a1b8..39e31bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -400,6 +400,7 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	p->signal->oom_victim = true;
 
 	force_sig(SIGKILL, p);
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-01 21:26       ` David Rientjes
@ 2010-06-02 13:54         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Oleg Nesterov, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> On Tue, 1 Jun 2010, Oleg Nesterov wrote:
> 
> > But yes, I agree, the problem is minor. But nevertheless it is bug,
> > the longstanding bug with the simple fix. Why should we "hide" this fix
> > inside the long series of non-trivial patches which rewrite oom-killer?
> > And it is completely orthogonal to other changes.
> > 
> 
> Again, the question is whether or not the fix is rc material or not, 
> otherwise there's no difference in the route that it gets upstream: the 
> patch is duplicated in both series.  If you feel that this minor issue 
> (which has never been reported in at least the last three years and 
> doesn't have any side effects other than a couple of millisecond delay 
> until unuse_mm() when the oom killer will kill something else) should be 
> addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.

Well, we have bugfix-at-first development rule. Why do you refuse our
development process?




^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-02 13:54         ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Oleg Nesterov, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> On Tue, 1 Jun 2010, Oleg Nesterov wrote:
> 
> > But yes, I agree, the problem is minor. But nevertheless it is bug,
> > the longstanding bug with the simple fix. Why should we "hide" this fix
> > inside the long series of non-trivial patches which rewrite oom-killer?
> > And it is completely orthogonal to other changes.
> > 
> 
> Again, the question is whether or not the fix is rc material or not, 
> otherwise there's no difference in the route that it gets upstream: the 
> patch is duplicated in both series.  If you feel that this minor issue 
> (which has never been reported in at least the last three years and 
> doesn't have any side effects other than a couple of millisecond delay 
> until unuse_mm() when the oom killer will kill something else) should be 
> addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.

Well, we have bugfix-at-first development rule. Why do you refuse our
development process?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-05-31  9:33 ` KOSAKI Motohiro
@ 2010-06-02 15:32   ` Minchan Kim
  -1 siblings, 0 replies; 110+ messages in thread
From: Minchan Kim @ 2010-06-02 15:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Mon, May 31, 2010 at 06:33:06PM +0900, KOSAKI Motohiro wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-02 15:32   ` Minchan Kim
  0 siblings, 0 replies; 110+ messages in thread
From: Minchan Kim @ 2010-06-02 15:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Mon, May 31, 2010 at 06:33:06PM +0900, KOSAKI Motohiro wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> Subject: oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 13:54           ` KOSAKI Motohiro
@ 2010-06-02 15:42             ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 15:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin, Roland McGrath

(add Roland)

On 06/02, KOSAKI Motohiro wrote:
>
> > Otoh, if we make do_coredump() interruptible (and we should do this
> > in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
> > needed? Afaics we always send SIGKILL along with TIF_MEMDIE.
>
> How is to make per-process oom flag + interruptible coredump?
>
> this per-process oom flag can be used vmscan shortcut exiting too.
> (IOW, It can help DavidR mmap_sem issue)

Firstly, this solution is not complete. We should make it really
interruptible (from user-space too), but we need more changes for
this (in particular we need to distinguish group-exit/exec cases
from the explicit SIGKILL case). Let's not discuss this here, this
is the different story.


But. I agree very much that it makes sense to add the quick fix
right now. Even if this fix will be superseded by the "proper"
fixes later.

> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -2038,6 +2038,11 @@ static int elf_core_dump(struct coredump_params *cprm)
>  				page_cache_release(page);
>  			} else
>  				stop = !dump_seek(cprm->file, PAGE_SIZE);
> +
> +			/* Now, The process received OOM. Exit soon! */
> +			if (current->signal->oom_victim)
> +				stop = 1;

Agreed, most problems with memory allocations should come from this loop.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -544,6 +544,9 @@ struct signal_struct {
>  	int			notify_count;
>  	struct task_struct	*group_exit_task;
>
> +	/* true mean the process is OOM-killer victim. */
> +	bool			oom_victim;

Well, the new word in signal_struct is not nice. It is better to
set SIGNAL_OOM_XXX in ->signal->flags (this needs ->siglock).

But. I don't think that signal_struct is the right place for the marker.

The thread which actually dumps the core doesn't necessarily belong
to the same thread group, but it can share ->mm with the selected
oom victim.

IOW, we should mark ->mm instead (perhaps mm->flags) or mm->core_state.
This in turn means we need find_lock_task_mm().

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-02 15:42             ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 15:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin, Roland McGrath

(add Roland)

On 06/02, KOSAKI Motohiro wrote:
>
> > Otoh, if we make do_coredump() interruptible (and we should do this
> > in any case), then perhaps the TIF_MEMDIE+PF_COREDUMP is not really
> > needed? Afaics we always send SIGKILL along with TIF_MEMDIE.
>
> How is to make per-process oom flag + interruptible coredump?
>
> this per-process oom flag can be used vmscan shortcut exiting too.
> (IOW, It can help DavidR mmap_sem issue)

Firstly, this solution is not complete. We should make it really
interruptible (from user-space too), but we need more changes for
this (in particular we need to distinguish group-exit/exec cases
from the explicit SIGKILL case). Let's not discuss this here, this
is the different story.


But. I agree very much that it makes sense to add the quick fix
right now. Even if this fix will be superseded by the "proper"
fixes later.

> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -2038,6 +2038,11 @@ static int elf_core_dump(struct coredump_params *cprm)
>  				page_cache_release(page);
>  			} else
>  				stop = !dump_seek(cprm->file, PAGE_SIZE);
> +
> +			/* Now, The process received OOM. Exit soon! */
> +			if (current->signal->oom_victim)
> +				stop = 1;

Agreed, most problems with memory allocations should come from this loop.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -544,6 +544,9 @@ struct signal_struct {
>  	int			notify_count;
>  	struct task_struct	*group_exit_task;
>
> +	/* true mean the process is OOM-killer victim. */
> +	bool			oom_victim;

Well, the new word in signal_struct is not nice. It is better to
set SIGNAL_OOM_XXX in ->signal->flags (this needs ->siglock).

But. I don't think that signal_struct is the right place for the marker.

The thread which actually dumps the core doesn't necessarily belong
to the same thread group, but it can share ->mm with the selected
oom victim.

IOW, we should mark ->mm instead (perhaps mm->flags) or mm->core_state.
This in turn means we need find_lock_task_mm().

What do you think?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
  2010-06-02 13:54           ` KOSAKI Motohiro
@ 2010-06-02 15:54             ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 15:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 06/02, KOSAKI Motohiro wrote:
>
> Today, I've thought to make some bandaid patches for this issue. but
> yes, I've reached the same conclusion.
>
> If we think multithread and core dump situation, all fixes are just
> bandaid. We can't remove deadlock chance completely.
>
> The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> doesn't have so much worth.

Agreed! I was always wondering if it really helps in practice.


> Subject: [PATCH] oom: remove PF_EXITING check completely
>
> PF_EXITING is wrong check if the task have multiple threads. This patch
> removes it.
>
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Nick Piggin <npiggin@suse.de>
> ---
>  mm/oom_kill.c |   27 ---------------------------
>  1 files changed, 0 insertions(+), 27 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e7f0f9..b06f8d1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -302,24 +302,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		if (test_tsk_thread_flag(p, TIF_MEMDIE))
>  			return ERR_PTR(-1UL);
>
> -		/*
> -		 * This is in the process of releasing memory so wait for it
> -		 * to finish before killing some other task by mistake.
> -		 *
> -		 * However, if p is the current task, we allow the 'kill' to
> -		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
> -		 * which will allow it to gain access to memory reserves in
> -		 * the process of exiting and releasing its resources.
> -		 * Otherwise we could get an easy OOM deadlock.
> -		 */
> -		if ((p->flags & PF_EXITING) && p->mm) {
> -			if (p != current)
> -				return ERR_PTR(-1UL);
> -
> -			chosen = p;
> -			*ppoints = ULONG_MAX;
> -		}
> -
>  		points = badness(p, uptime.tv_sec);
>  		if (points > *ppoints || !chosen) {
>  			chosen = p;
> @@ -436,15 +418,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem);
>
> -	/*
> -	 * If the task is already exiting, don't alarm the sysadmin or kill
> -	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> -	 */
> -	if (p->flags & PF_EXITING) {
> -		__oom_kill_process(p, mem, 0);
> -		return 0;
> -	}
> -
>  	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
>  					message, task_pid_nr(p), p->comm, points);
>
> --


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
@ 2010-06-02 15:54             ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 15:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 06/02, KOSAKI Motohiro wrote:
>
> Today, I've thought to make some bandaid patches for this issue. but
> yes, I've reached the same conclusion.
>
> If we think multithread and core dump situation, all fixes are just
> bandaid. We can't remove deadlock chance completely.
>
> The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> doesn't have so much worth.

Agreed! I was always wondering if it really helps in practice.


> Subject: [PATCH] oom: remove PF_EXITING check completely
>
> PF_EXITING is wrong check if the task have multiple threads. This patch
> removes it.
>
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Nick Piggin <npiggin@suse.de>
> ---
>  mm/oom_kill.c |   27 ---------------------------
>  1 files changed, 0 insertions(+), 27 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e7f0f9..b06f8d1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -302,24 +302,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		if (test_tsk_thread_flag(p, TIF_MEMDIE))
>  			return ERR_PTR(-1UL);
>
> -		/*
> -		 * This is in the process of releasing memory so wait for it
> -		 * to finish before killing some other task by mistake.
> -		 *
> -		 * However, if p is the current task, we allow the 'kill' to
> -		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
> -		 * which will allow it to gain access to memory reserves in
> -		 * the process of exiting and releasing its resources.
> -		 * Otherwise we could get an easy OOM deadlock.
> -		 */
> -		if ((p->flags & PF_EXITING) && p->mm) {
> -			if (p != current)
> -				return ERR_PTR(-1UL);
> -
> -			chosen = p;
> -			*ppoints = ULONG_MAX;
> -		}
> -
>  		points = badness(p, uptime.tv_sec);
>  		if (points > *ppoints || !chosen) {
>  			chosen = p;
> @@ -436,15 +418,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem);
>
> -	/*
> -	 * If the task is already exiting, don't alarm the sysadmin or kill
> -	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> -	 */
> -	if (p->flags & PF_EXITING) {
> -		__oom_kill_process(p, mem, 0);
> -		return 0;
> -	}
> -
>  	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
>  					message, task_pid_nr(p), p->comm, points);
>
> --

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-05-31  9:36   ` KOSAKI Motohiro
@ 2010-06-02 16:05     ` Minchan Kim
  -1 siblings, 0 replies; 110+ messages in thread
From: Minchan Kim @ 2010-06-02 16:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Mon, May 31, 2010 at 06:36:34PM +0900, KOSAKI Motohiro wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
> 
> Almost all ->mm == NUL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> This will be addressed later.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
> latest -mm and remove some obsoleted description]
Reviewed-by: Minchan Kim <minchan.kim@gmail.com?

Good catch but I have a nitpick. :)

find_lock_task_mm isn't good name of the function, I think. 
As you know, original goal of the function is to find sub-thread of p
which is alive(ie, doesn't release mm). 

task_lock is important for user of the function but minor.

I suggest following as 
/*
 * If we find alive thread of process, it returns task_struct of sub thread.
 * Notice. this function calls task_lock. So caller should call task_unlock.
 */
static struct task_struct *find_alive_subthread(struct task_struct *process)
{
... 
}

I don't forced my suggesion if you suggest much good name.
Regardless of accepting my suggestion, looks good to me.

> ---
>  mm/oom_kill.c |   28 +++++++++++++++++-----------
>  1 files changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index c87a6f4..162af2e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -52,6 +52,19 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>  	return 0;
>  }
>  
> +static struct task_struct *find_lock_task_mm(struct task_struct *p)
> +{
> +	struct task_struct *t = p;
> +	do {
> +		task_lock(t);
> +		if (likely(t->mm))
> +			return t;
> +		task_unlock(t);
> +	} while_each_thread(p, t);
> +
> +	return NULL;
> +}
> +
>  /**
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
> @@ -74,7 +87,6 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>  unsigned long badness(struct task_struct *p, unsigned long uptime)
>  {
>  	unsigned long points, cpu_time, run_time;
> -	struct mm_struct *mm;
>  	struct task_struct *child;
>  	int oom_adj = p->signal->oom_adj;
>  	struct task_cputime task_time;
> @@ -84,17 +96,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	if (oom_adj == OOM_DISABLE)
>  		return 0;
>  
> -	task_lock(p);
> -	mm = p->mm;
> -	if (!mm) {
> -		task_unlock(p);
> +	p = find_lock_task_mm(p);
> +	if (!p)
>  		return 0;
> -	}
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
>  	 */
> -	points = mm->total_vm;
> +	points = p->mm->total_vm;
>  
>  	/*
>  	 * After this unlock we can no longer dereference local variable `mm'
> @@ -117,7 +126,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 */
>  	list_for_each_entry(child, &p->children, sibling) {
>  		task_lock(child);
> -		if (child->mm != mm && child->mm)
> +		if (child->mm != p->mm && child->mm)
>  			points += child->mm->total_vm/2 + 1;
>  		task_unlock(child);
>  	}
> @@ -256,9 +265,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  	for_each_process(p) {
>  		unsigned long points;
>  
> -		/* skip the tasks which have already released their mm. */
> -		if (!p->mm)
> -			continue;
>  		/* skip the init task and kthreads */
>  		if (is_global_init(p) || (p->flags & PF_KTHREAD))
>  			continue;
> -- 
> 1.6.5.2
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
@ 2010-06-02 16:05     ` Minchan Kim
  0 siblings, 0 replies; 110+ messages in thread
From: Minchan Kim @ 2010-06-02 16:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Oleg Nesterov, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Mon, May 31, 2010 at 06:36:34PM +0900, KOSAKI Motohiro wrote:
> From: Oleg Nesterov <oleg@redhat.com>
> Subject: [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives
> 
> Almost all ->mm == NUL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> This will be addressed later.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [rebase
> latest -mm and remove some obsoleted description]
Reviewed-by: Minchan Kim <minchan.kim@gmail.com?

Good catch but I have a nitpick. :)

find_lock_task_mm isn't good name of the function, I think. 
As you know, original goal of the function is to find sub-thread of p
which is alive(ie, doesn't release mm). 

task_lock is important for user of the function but minor.

I suggest following as 
/*
 * If we find alive thread of process, it returns task_struct of sub thread.
 * Notice. this function calls task_lock. So caller should call task_unlock.
 */
static struct task_struct *find_alive_subthread(struct task_struct *process)
{
... 
}

I don't forced my suggesion if you suggest much good name.
Regardless of accepting my suggestion, looks good to me.

> ---
>  mm/oom_kill.c |   28 +++++++++++++++++-----------
>  1 files changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index c87a6f4..162af2e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -52,6 +52,19 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>  	return 0;
>  }
>  
> +static struct task_struct *find_lock_task_mm(struct task_struct *p)
> +{
> +	struct task_struct *t = p;
> +	do {
> +		task_lock(t);
> +		if (likely(t->mm))
> +			return t;
> +		task_unlock(t);
> +	} while_each_thread(p, t);
> +
> +	return NULL;
> +}
> +
>  /**
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
> @@ -74,7 +87,6 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>  unsigned long badness(struct task_struct *p, unsigned long uptime)
>  {
>  	unsigned long points, cpu_time, run_time;
> -	struct mm_struct *mm;
>  	struct task_struct *child;
>  	int oom_adj = p->signal->oom_adj;
>  	struct task_cputime task_time;
> @@ -84,17 +96,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	if (oom_adj == OOM_DISABLE)
>  		return 0;
>  
> -	task_lock(p);
> -	mm = p->mm;
> -	if (!mm) {
> -		task_unlock(p);
> +	p = find_lock_task_mm(p);
> +	if (!p)
>  		return 0;
> -	}
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
>  	 */
> -	points = mm->total_vm;
> +	points = p->mm->total_vm;
>  
>  	/*
>  	 * After this unlock we can no longer dereference local variable `mm'
> @@ -117,7 +126,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 */
>  	list_for_each_entry(child, &p->children, sibling) {
>  		task_lock(child);
> -		if (child->mm != mm && child->mm)
> +		if (child->mm != p->mm && child->mm)
>  			points += child->mm->total_vm/2 + 1;
>  		task_unlock(child);
>  	}
> @@ -256,9 +265,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  	for_each_process(p) {
>  		unsigned long points;
>  
> -		/* skip the tasks which have already released their mm. */
> -		if (!p->mm)
> -			continue;
>  		/* skip the init task and kthreads */
>  		if (is_global_init(p) || (p->flags & PF_KTHREAD))
>  			continue;
> -- 
> 1.6.5.2
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 15:42             ` Oleg Nesterov
@ 2010-06-02 17:29               ` Roland McGrath
  -1 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-02 17:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

Why not just test TIF_MEMDIE?

Thanks,
Roland

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-02 17:29               ` Roland McGrath
  0 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-02 17:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

Why not just test TIF_MEMDIE?

Thanks,
Roland

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 17:29               ` Roland McGrath
@ 2010-06-02 17:53                 ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 17:53 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, Roland McGrath wrote:
>
> Why not just test TIF_MEMDIE?

Because it is per-thread.

when select_bad_process() finds the task P to kill it can participate
in the core dump (sleep in exit_mm), but we should somehow inform the
thread which actually dumps the core: P->mm->core_state->dumper.

Well, we can use TIF_MEMDIE if we chose the right thread, I think.
But perhaps mm->flags |= MMF_OOM is better, it can have other user.
I dunno.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-02 17:53                 ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 17:53 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, Roland McGrath wrote:
>
> Why not just test TIF_MEMDIE?

Because it is per-thread.

when select_bad_process() finds the task P to kill it can participate
in the core dump (sleep in exit_mm), but we should somehow inform the
thread which actually dumps the core: P->mm->core_state->dumper.

Well, we can use TIF_MEMDIE if we chose the right thread, I think.
But perhaps mm->flags |= MMF_OOM is better, it can have other user.
I dunno.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 17:53                 ` Oleg Nesterov
@ 2010-06-02 18:58                   ` Roland McGrath
  -1 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-02 18:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Because it is per-thread.

I see.

> when select_bad_process() finds the task P to kill it can participate
> in the core dump (sleep in exit_mm), but we should somehow inform the
> thread which actually dumps the core: P->mm->core_state->dumper.

Perhaps it should simply do that: if you would choose P to oom-kill, and
P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.

> Well, we can use TIF_MEMDIE if we chose the right thread, I think.
> But perhaps mm->flags |= MMF_OOM is better, it can have other user.
> I dunno.

This is all the quick hack before get around to just making core dumping
fully-interruptible, no?  So we should go with whatever is the simplest
change now.

Perhaps this belongs in another thread as you suggested.  But I wonder what
we might get just from s/TASK_UNINTERRUPTIBLE/TASK_KILLABLE/ in exit_mm.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-02 18:58                   ` Roland McGrath
  0 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-02 18:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Because it is per-thread.

I see.

> when select_bad_process() finds the task P to kill it can participate
> in the core dump (sleep in exit_mm), but we should somehow inform the
> thread which actually dumps the core: P->mm->core_state->dumper.

Perhaps it should simply do that: if you would choose P to oom-kill, and
P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.

> Well, we can use TIF_MEMDIE if we chose the right thread, I think.
> But perhaps mm->flags |= MMF_OOM is better, it can have other user.
> I dunno.

This is all the quick hack before get around to just making core dumping
fully-interruptible, no?  So we should go with whatever is the simplest
change now.

Perhaps this belongs in another thread as you suggested.  But I wonder what
we might get just from s/TASK_UNINTERRUPTIBLE/TASK_KILLABLE/ in exit_mm.


Thanks,
Roland

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 18:58                   ` Roland McGrath
@ 2010-06-02 20:38                     ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 20:38 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, Roland McGrath wrote:
>
> > when select_bad_process() finds the task P to kill it can participate
> > in the core dump (sleep in exit_mm), but we should somehow inform the
> > thread which actually dumps the core: P->mm->core_state->dumper.
>
> Perhaps it should simply do that: if you would choose P to oom-kill, and
> P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.

... to set TIF_MEMDIE which should be checked in elf_core_dump().

Probably yes.

> > Well, we can use TIF_MEMDIE if we chose the right thread, I think.
> > But perhaps mm->flags |= MMF_OOM is better, it can have other user.
> > I dunno.
>
> This is all the quick hack before get around to just making core dumping
> fully-interruptible, no?  So we should go with whatever is the simplest
> change now.

Yes.

> Perhaps this belongs in another thread as you suggested.  But I wonder what
> we might get just from s/TASK_UNINTERRUPTIBLE/TASK_KILLABLE/ in exit_mm.

Oh. This needs more thinking. Definitely the task sleeping in exit_mm()
must not exit until core_state->dumper->thread returns from do_coredump().
If nothing else, the dumper can use its task_struct and it relies on
the stable core_thread->next list. And right now TASK_KILLABLE can't
work anyway, it is possible that fatal_signal_pending() is true.

But perhaps we can do something later. Assuming that do_coredump() is
interruptible, TASK_KILLABLE can make the difference only if the dumper
belongs to another thread-group.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-02 20:38                     ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 20:38 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, Roland McGrath wrote:
>
> > when select_bad_process() finds the task P to kill it can participate
> > in the core dump (sleep in exit_mm), but we should somehow inform the
> > thread which actually dumps the core: P->mm->core_state->dumper.
>
> Perhaps it should simply do that: if you would choose P to oom-kill, and
> P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.

... to set TIF_MEMDIE which should be checked in elf_core_dump().

Probably yes.

> > Well, we can use TIF_MEMDIE if we chose the right thread, I think.
> > But perhaps mm->flags |= MMF_OOM is better, it can have other user.
> > I dunno.
>
> This is all the quick hack before get around to just making core dumping
> fully-interruptible, no?  So we should go with whatever is the simplest
> change now.

Yes.

> Perhaps this belongs in another thread as you suggested.  But I wonder what
> we might get just from s/TASK_UNINTERRUPTIBLE/TASK_KILLABLE/ in exit_mm.

Oh. This needs more thinking. Definitely the task sleeping in exit_mm()
must not exit until core_state->dumper->thread returns from do_coredump().
If nothing else, the dumper can use its task_struct and it relies on
the stable core_thread->next list. And right now TASK_KILLABLE can't
work anyway, it is possible that fatal_signal_pending() is true.

But perhaps we can do something later. Assuming that do_coredump() is
interruptible, TASK_KILLABLE can make the difference only if the dumper
belongs to another thread-group.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
  2010-06-02 15:54             ` Oleg Nesterov
@ 2010-06-02 21:02               ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-02 21:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Wed, 2 Jun 2010, Oleg Nesterov wrote:

> > Today, I've thought to make some bandaid patches for this issue. but
> > yes, I've reached the same conclusion.
> >
> > If we think multithread and core dump situation, all fixes are just
> > bandaid. We can't remove deadlock chance completely.
> >
> > The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> > doesn't have so much worth.
> 
> Agreed! I was always wondering if it really helps in practice.
> 

Nack, this certainly does help in practice, it prevents needlessly killing 
additional tasks when one is exiting and may free memory.  It's much 
better to defer killing something temporarily if an eligible task (i.e. 
one that has a high probability of memory allocations on current's nodes 
or contributing to its memcg) is exiting.

We depend on this check specifically for our use of cpusets, so please 
don't remove it.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
@ 2010-06-02 21:02               ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-02 21:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Wed, 2 Jun 2010, Oleg Nesterov wrote:

> > Today, I've thought to make some bandaid patches for this issue. but
> > yes, I've reached the same conclusion.
> >
> > If we think multithread and core dump situation, all fixes are just
> > bandaid. We can't remove deadlock chance completely.
> >
> > The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> > doesn't have so much worth.
> 
> Agreed! I was always wondering if it really helps in practice.
> 

Nack, this certainly does help in practice, it prevents needlessly killing 
additional tasks when one is exiting and may free memory.  It's much 
better to defer killing something temporarily if an eligible task (i.e. 
one that has a high probability of memory allocations on current's nodes 
or contributing to its memcg) is exiting.

We depend on this check specifically for our use of cpusets, so please 
don't remove it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-02 13:54         ` KOSAKI Motohiro
@ 2010-06-02 21:09           ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-02 21:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Oleg Nesterov, LKML, linux-mm, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:

> > Again, the question is whether or not the fix is rc material or not, 
> > otherwise there's no difference in the route that it gets upstream: the 
> > patch is duplicated in both series.  If you feel that this minor issue 
> > (which has never been reported in at least the last three years and 
> > doesn't have any side effects other than a couple of millisecond delay 
> > until unuse_mm() when the oom killer will kill something else) should be 
> > addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.
> 
> Well, we have bugfix-at-first development rule. Why do you refuse our
> development process?
> 

This isn't a bugfix, it simply prevents a recall to the oom killer after 
the kthread has called unuse_mm().  Please show where any side effects of 
oom killing a kthread, which cannot exit, as a result of use_mm() causes a 
problem _anywhere_.

If that's the definition you have for a "bugfix," then I could certainly 
argue that some of my patches like "oom: filter tasks not sharing the same 
cpuset" is a bugfix because it allows needlessly killing tasks that won't 
free memory for current, or "oom: avoid oom killer for lowmem allocations" 
is a bugfix because it allows killing a task that won't free lowmem, etc.

I agree that this is a nice patch to have to avoid that recall later, 
which is why I merged it into my patchset, but let's please be accurate 
about its impact.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-02 21:09           ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-02 21:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Oleg Nesterov, LKML, linux-mm, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:

> > Again, the question is whether or not the fix is rc material or not, 
> > otherwise there's no difference in the route that it gets upstream: the 
> > patch is duplicated in both series.  If you feel that this minor issue 
> > (which has never been reported in at least the last three years and 
> > doesn't have any side effects other than a couple of millisecond delay 
> > until unuse_mm() when the oom killer will kill something else) should be 
> > addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.
> 
> Well, we have bugfix-at-first development rule. Why do you refuse our
> development process?
> 

This isn't a bugfix, it simply prevents a recall to the oom killer after 
the kthread has called unuse_mm().  Please show where any side effects of 
oom killing a kthread, which cannot exit, as a result of use_mm() causes a 
problem _anywhere_.

If that's the definition you have for a "bugfix," then I could certainly 
argue that some of my patches like "oom: filter tasks not sharing the same 
cpuset" is a bugfix because it allows needlessly killing tasks that won't 
free memory for current, or "oom: avoid oom killer for lowmem allocations" 
is a bugfix because it allows killing a task that won't free lowmem, etc.

I agree that this is a nice patch to have to avoid that recall later, 
which is why I merged it into my patchset, but let's please be accurate 
about its impact.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-02 21:09           ` David Rientjes
@ 2010-06-02 21:33             ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 21:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, David Rientjes wrote:
>
> On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:
>
> > > Again, the question is whether or not the fix is rc material or not,
> > > otherwise there's no difference in the route that it gets upstream: the
> > > patch is duplicated in both series.  If you feel that this minor issue
> > > (which has never been reported in at least the last three years and
> > > doesn't have any side effects other than a couple of millisecond delay
> > > until unuse_mm() when the oom killer will kill something else) should be
> > > addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.
> >
> > Well, we have bugfix-at-first development rule. Why do you refuse our
> > development process?
>
> This isn't a bugfix, it simply prevents a recall to the oom killer after
> the kthread has called unuse_mm().  Please show where any side effects of
> oom killing a kthread, which cannot exit, as a result of use_mm() causes a
> problem _anywhere_.

I already showed you the side effects, but you removed this part in your
reply.

>From http://marc.info/?l=linux-kernel&m=127542732121077

	It can't die but force_sig() does bad things which shouldn't be done
	with workqueue thread. Note that it removes SIG_IGN, sets
	SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.

A workqueue thread must not run with SIGNAL_GROUP_EXIT set, SIGKILL
must be ignored, signal_pending() must not be true.

This is bug. It is minor, agreed, currently use_mm() is only used by aio.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-02 21:33             ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-02 21:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, David Rientjes wrote:
>
> On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:
>
> > > Again, the question is whether or not the fix is rc material or not,
> > > otherwise there's no difference in the route that it gets upstream: the
> > > patch is duplicated in both series.  If you feel that this minor issue
> > > (which has never been reported in at least the last three years and
> > > doesn't have any side effects other than a couple of millisecond delay
> > > until unuse_mm() when the oom killer will kill something else) should be
> > > addressed in 2.6.35-rc2, then that's a conversation to be had with Andrew.
> >
> > Well, we have bugfix-at-first development rule. Why do you refuse our
> > development process?
>
> This isn't a bugfix, it simply prevents a recall to the oom killer after
> the kthread has called unuse_mm().  Please show where any side effects of
> oom killing a kthread, which cannot exit, as a result of use_mm() causes a
> problem _anywhere_.

I already showed you the side effects, but you removed this part in your
reply.

>From http://marc.info/?l=linux-kernel&m=127542732121077

	It can't die but force_sig() does bad things which shouldn't be done
	with workqueue thread. Note that it removes SIG_IGN, sets
	SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.

A workqueue thread must not run with SIGNAL_GROUP_EXIT set, SIGKILL
must be ignored, signal_pending() must not be true.

This is bug. It is minor, agreed, currently use_mm() is only used by aio.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-02 21:33             ` Oleg Nesterov
@ 2010-06-02 21:46               ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-02 21:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Wed, 2 Jun 2010, Oleg Nesterov wrote:

> > This isn't a bugfix, it simply prevents a recall to the oom killer after
> > the kthread has called unuse_mm().  Please show where any side effects of
> > oom killing a kthread, which cannot exit, as a result of use_mm() causes a
> > problem _anywhere_.
> 
> I already showed you the side effects, but you removed this part in your
> reply.
> 
> From http://marc.info/?l=linux-kernel&m=127542732121077
> 
> 	It can't die but force_sig() does bad things which shouldn't be done
> 	with workqueue thread. Note that it removes SIG_IGN, sets
> 	SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.
> 
> A workqueue thread must not run with SIGNAL_GROUP_EXIT set, SIGKILL
> must be ignored, signal_pending() must not be true.
> 
> This is bug. It is minor, agreed, currently use_mm() is only used by aio.
> 

It's a problem that would probably never happen in practice because you're 
talking about a race between select_bad_process() and __oom_kill_task() 
which is wide since it iterates the entire tasklist, which workqueue 
threads will be near the beginning of, and there is an extremely small 
chance that the badness score for the mm that it assumed would be 
considered the ideal task to kill.  If you think this is rc material, then 
push it to Andrew and say that.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-02 21:46               ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-02 21:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Wed, 2 Jun 2010, Oleg Nesterov wrote:

> > This isn't a bugfix, it simply prevents a recall to the oom killer after
> > the kthread has called unuse_mm().  Please show where any side effects of
> > oom killing a kthread, which cannot exit, as a result of use_mm() causes a
> > problem _anywhere_.
> 
> I already showed you the side effects, but you removed this part in your
> reply.
> 
> From http://marc.info/?l=linux-kernel&m=127542732121077
> 
> 	It can't die but force_sig() does bad things which shouldn't be done
> 	with workqueue thread. Note that it removes SIG_IGN, sets
> 	SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.
> 
> A workqueue thread must not run with SIGNAL_GROUP_EXIT set, SIGKILL
> must be ignored, signal_pending() must not be true.
> 
> This is bug. It is minor, agreed, currently use_mm() is only used by aio.
> 

It's a problem that would probably never happen in practice because you're 
talking about a race between select_bad_process() and __oom_kill_task() 
which is wide since it iterates the entire tasklist, which workqueue 
threads will be near the beginning of, and there is an extremely small 
chance that the badness score for the mm that it assumed would be 
considered the ideal task to kill.  If you think this is rc material, then 
push it to Andrew and say that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
  2010-06-02 21:02               ` David Rientjes
@ 2010-06-03  4:48                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-03  4:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Oleg Nesterov, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> On Wed, 2 Jun 2010, Oleg Nesterov wrote:
> 
> > > Today, I've thought to make some bandaid patches for this issue. but
> > > yes, I've reached the same conclusion.
> > >
> > > If we think multithread and core dump situation, all fixes are just
> > > bandaid. We can't remove deadlock chance completely.
> > >
> > > The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> > > doesn't have so much worth.
> > 
> > Agreed! I was always wondering if it really helps in practice.
> > 
> 
> Nack, this certainly does help in practice, it prevents needlessly killing 
> additional tasks when one is exiting and may free memory.  It's much 
> better to defer killing something temporarily if an eligible task (i.e. 
> one that has a high probability of memory allocations on current's nodes 
> or contributing to its memcg) is exiting.
> 
> We depend on this check specifically for our use of cpusets, so please 
> don't remove it.

Your claim violate our development process. Oleg pointed this check
doesn't only work well, but also can makes deadlock. So, We certinally
need anything fix. then, I'll remove this check completely at 2.6.35
timeframe.

But this doesn't mean we refuse you make better patch at all. I expect
we can merge very soon if you make such patch.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
@ 2010-06-03  4:48                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-03  4:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Oleg Nesterov, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> On Wed, 2 Jun 2010, Oleg Nesterov wrote:
> 
> > > Today, I've thought to make some bandaid patches for this issue. but
> > > yes, I've reached the same conclusion.
> > >
> > > If we think multithread and core dump situation, all fixes are just
> > > bandaid. We can't remove deadlock chance completely.
> > >
> > > The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> > > doesn't have so much worth.
> > 
> > Agreed! I was always wondering if it really helps in practice.
> > 
> 
> Nack, this certainly does help in practice, it prevents needlessly killing 
> additional tasks when one is exiting and may free memory.  It's much 
> better to defer killing something temporarily if an eligible task (i.e. 
> one that has a high probability of memory allocations on current's nodes 
> or contributing to its memcg) is exiting.
> 
> We depend on this check specifically for our use of cpusets, so please 
> don't remove it.

Your claim violate our development process. Oleg pointed this check
doesn't only work well, but also can makes deadlock. So, We certinally
need anything fix. then, I'll remove this check completely at 2.6.35
timeframe.

But this doesn't mean we refuse you make better patch at all. I expect
we can merge very soon if you make such patch.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
  2010-06-03  4:48                 ` KOSAKI Motohiro
@ 2010-06-03  6:29                   ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-03  6:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Oleg Nesterov, LKML, linux-mm, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:

> > On Wed, 2 Jun 2010, Oleg Nesterov wrote:
> > 
> > > > Today, I've thought to make some bandaid patches for this issue. but
> > > > yes, I've reached the same conclusion.
> > > >
> > > > If we think multithread and core dump situation, all fixes are just
> > > > bandaid. We can't remove deadlock chance completely.
> > > >
> > > > The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> > > > doesn't have so much worth.
> > > 
> > > Agreed! I was always wondering if it really helps in practice.
> > > 
> > 
> > Nack, this certainly does help in practice, it prevents needlessly killing 
> > additional tasks when one is exiting and may free memory.  It's much 
> > better to defer killing something temporarily if an eligible task (i.e. 
> > one that has a high probability of memory allocations on current's nodes 
> > or contributing to its memcg) is exiting.
> > 
> > We depend on this check specifically for our use of cpusets, so please 
> > don't remove it.
> 
> Your claim violate our development process. Oleg pointed this check
> doesn't only work well, but also can makes deadlock. So, We certinally
> need anything fix. then, I'll remove this check completely at 2.6.35
> timeframe.
> 

Show me your deadlock.  I want to see it.  In practice.

We've been using this check specifically for three years and it prevents 
needlessly killing additional tasks when one is already exiting and will 
free its memory.  That's a crucial aspect of using cpusets that run out of 
memory constantly.

Unless you actually have real world experience with using the oom killer 
to affect a memory containment strategy, I don't buy into your overly 
exaggerated claims that these are all bugfixes and these races that you 
have no practical evidence to support actually even matter but speculate 
based on pure code inspection are important.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: remove PF_EXITING check completely
@ 2010-06-03  6:29                   ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-03  6:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Oleg Nesterov, LKML, linux-mm, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:

> > On Wed, 2 Jun 2010, Oleg Nesterov wrote:
> > 
> > > > Today, I've thought to make some bandaid patches for this issue. but
> > > > yes, I've reached the same conclusion.
> > > >
> > > > If we think multithread and core dump situation, all fixes are just
> > > > bandaid. We can't remove deadlock chance completely.
> > > >
> > > > The deadlock is certenaly worst result, then, minor PF_EXITING optimization
> > > > doesn't have so much worth.
> > > 
> > > Agreed! I was always wondering if it really helps in practice.
> > > 
> > 
> > Nack, this certainly does help in practice, it prevents needlessly killing 
> > additional tasks when one is exiting and may free memory.  It's much 
> > better to defer killing something temporarily if an eligible task (i.e. 
> > one that has a high probability of memory allocations on current's nodes 
> > or contributing to its memcg) is exiting.
> > 
> > We depend on this check specifically for our use of cpusets, so please 
> > don't remove it.
> 
> Your claim violate our development process. Oleg pointed this check
> doesn't only work well, but also can makes deadlock. So, We certinally
> need anything fix. then, I'll remove this check completely at 2.6.35
> timeframe.
> 

Show me your deadlock.  I want to see it.  In practice.

We've been using this check specifically for three years and it prevents 
needlessly killing additional tasks when one is already exiting and will 
free its memory.  That's a crucial aspect of using cpusets that run out of 
memory constantly.

Unless you actually have real world experience with using the oom killer 
to affect a memory containment strategy, I don't buy into your overly 
exaggerated claims that these are all bugfixes and these races that you 
have no practical evidence to support actually even matter but speculate 
based on pure code inspection are important.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 20:38                     ` Oleg Nesterov
@ 2010-06-03 14:03                       ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-03 14:03 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, Oleg Nesterov wrote:
>
> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
>
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
>
> Probably yes.

Well, nothing can protect mm->core_state, the dumper owns it. Of course
we can add the locking, but this is not nice.

And again, perhaps MMF_OOMKILLED can be useful anyway.

So, I think this would be the most quick/simple fix for now.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-03 14:03                       ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-03 14:03 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, Oleg Nesterov wrote:
>
> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
>
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
>
> Probably yes.

Well, nothing can protect mm->core_state, the dumper owns it. Of course
we can add the locking, but this is not nice.

And again, perhaps MMF_OOMKILLED can be useful anyway.

So, I think this would be the most quick/simple fix for now.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-02 21:46               ` David Rientjes
@ 2010-06-03 14:27                 ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-03 14:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, David Rientjes wrote:
>
> On Wed, 2 Jun 2010, Oleg Nesterov wrote:
>
> > > This isn't a bugfix, it simply prevents a recall to the oom killer after
> > > the kthread has called unuse_mm().  Please show where any side effects of
> > > oom killing a kthread, which cannot exit, as a result of use_mm() causes a
> > > problem _anywhere_.
> >
> > I already showed you the side effects, but you removed this part in your
> > reply.
> >
> > From http://marc.info/?l=linux-kernel&m=127542732121077
> >
> > 	It can't die but force_sig() does bad things which shouldn't be done
> > 	with workqueue thread. Note that it removes SIG_IGN, sets
> > 	SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.
> >
> > A workqueue thread must not run with SIGNAL_GROUP_EXIT set, SIGKILL
> > must be ignored, signal_pending() must not be true.
> >
> > This is bug. It is minor, agreed, currently use_mm() is only used by aio.
>
> It's a problem that would probably never happen in practice because

No need to convince me this bug is minor. I repeated this every time.
I only argued with the "isn't a bugfix, no side effects".

> considered the ideal task to kill.  If you think this is rc material, then
> push it to Andrew and say that.

No, I don't think it is strictly necessary to push this fix into rc.
But I don't understand why this matters. And in any case, when it comes
to oom, I am in no position to make any authoritative decisions.


David, I don't understand why do you refuse to re-diff your changes
on top of Kosaki's work. If nothing else, this will help to review
your changes.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-03 14:27                 ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-03 14:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/02, David Rientjes wrote:
>
> On Wed, 2 Jun 2010, Oleg Nesterov wrote:
>
> > > This isn't a bugfix, it simply prevents a recall to the oom killer after
> > > the kthread has called unuse_mm().  Please show where any side effects of
> > > oom killing a kthread, which cannot exit, as a result of use_mm() causes a
> > > problem _anywhere_.
> >
> > I already showed you the side effects, but you removed this part in your
> > reply.
> >
> > From http://marc.info/?l=linux-kernel&m=127542732121077
> >
> > 	It can't die but force_sig() does bad things which shouldn't be done
> > 	with workqueue thread. Note that it removes SIG_IGN, sets
> > 	SIGNAL_GROUP_EXIT, makes signal_pending/fatal_signal_pedning true, etc.
> >
> > A workqueue thread must not run with SIGNAL_GROUP_EXIT set, SIGKILL
> > must be ignored, signal_pending() must not be true.
> >
> > This is bug. It is minor, agreed, currently use_mm() is only used by aio.
>
> It's a problem that would probably never happen in practice because

No need to convince me this bug is minor. I repeated this every time.
I only argued with the "isn't a bugfix, no side effects".

> considered the ideal task to kill.  If you think this is rc material, then
> push it to Andrew and say that.

No, I don't think it is strictly necessary to push this fix into rc.
But I don't understand why this matters. And in any case, when it comes
to oom, I am in no position to make any authoritative decisions.


David, I don't understand why do you refuse to re-diff your changes
on top of Kosaki's work. If nothing else, this will help to review
your changes.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-03 14:27                 ` Oleg Nesterov
@ 2010-06-03 20:11                   ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-03 20:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Thu, 3 Jun 2010, Oleg Nesterov wrote:

> David, I don't understand why do you refuse to re-diff your changes
> on top of Kosaki's work. If nothing else, this will help to review
> your changes.
> 

I simply don't have enough time in a day to rebase my rewrite patches on 
top of what Andrew may or may not merge into -mm.  When he merges 
something, that would be different.

I don't think we need to push anything to -mm right now that isn't rc 
material since the rewrite should make it in before the merge window.  If 
there are outstanding fixes that should go into rc (and probably stable 
material as well), those need to be pushed to Andrew immediately.  I 
disagree that I've seen any to date that are immediate fixes.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
@ 2010-06-03 20:11                   ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-03 20:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Thu, 3 Jun 2010, Oleg Nesterov wrote:

> David, I don't understand why do you refuse to re-diff your changes
> on top of Kosaki's work. If nothing else, this will help to review
> your changes.
> 

I simply don't have enough time in a day to rebase my rewrite patches on 
top of what Andrew may or may not merge into -mm.  When he merges 
something, that would be different.

I don't think we need to push anything to -mm right now that isn't rc 
material since the rewrite should make it in before the merge window.  If 
there are outstanding fixes that should go into rc (and probably stable 
material as well), those need to be pushed to Andrew immediately.  I 
disagree that I've seen any to date that are immediate fixes.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 20:38                     ` Oleg Nesterov
@ 2010-06-04 10:54                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-04 10:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, LKML, linux-mm, David Rientjes,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin

> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
> 
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
> 
> Probably yes.

Yep, probably. but can you please allow me additonal explanation?

In multi threaded OOM case, we have two problematic routine, coredump
and vmscan. Roland's idea can only solve the former. 

But I also interest vmscan quickly exit if OOM received. if other threads
get stuck in vmscan for freeing addional pages (this is very typical because
usually every thread call any syscall and eventually call kmalloc etc), 
recovering oom become very slow even if this doesn't makes deadlock.

Unfortunatelly, vmscan need much refactoring before appling this idea.
then, I didn't include such fixes.

I mean I hope to implement per-process OOM flag even if coredump don't
really need it.

So, I created MMF_OOM patch today. It is just for discussion, still.

From f099e1ba6e7b5654b35b468c13e1ae9e4f182ea4 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri, 4 Jun 2010 18:56:56 +0900
Subject: [RFC][PATCH v2] oom: make coredump interruptible

If oom victim process is under core dumping, sending SIGKILL cause
no-op. Unfortunately, coredump need relatively much memory. It mean
OOM vs coredump can makes deadlock.

Then, coredump logic should check the task has received SIGKILL
from OOM.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/binfmt_elf.c       |    4 ++++
 include/linux/sched.h |    1 +
 mm/oom_kill.c         |    3 ++-
 3 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..2aca748 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
+
+			/* The task need to exit ASAP if received OOM. */
+			if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
+				stop = 1;
 			if (stop)
 				goto end_coredump;
 		}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8485aa2..53b7caa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -436,6 +436,7 @@ extern int get_dumpable(struct mm_struct *mm);
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_OOM_KILLED		17	/* Killed by OOM */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2678a04..29850c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -401,7 +401,6 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 		       K(p->mm->total_vm),
 		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
 		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
-	task_unlock(p);
 
 	/*
 	 * We give our sacrificial lamb high priority and access to
@@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	set_bit(MMF_OOM_KILLED, &p->mm->flags);
+	task_unlock(p);
 
 	force_sig(SIGKILL, p);
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-04 10:54                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-04 10:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, LKML, linux-mm, David Rientjes,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin

> On 06/02, Roland McGrath wrote:
> >
> > > when select_bad_process() finds the task P to kill it can participate
> > > in the core dump (sleep in exit_mm), but we should somehow inform the
> > > thread which actually dumps the core: P->mm->core_state->dumper.
> >
> > Perhaps it should simply do that: if you would choose P to oom-kill, and
> > P->mm->core_state!=NULL, then choose P->mm->core_state->dumper instead.
> 
> ... to set TIF_MEMDIE which should be checked in elf_core_dump().
> 
> Probably yes.

Yep, probably. but can you please allow me additonal explanation?

In multi threaded OOM case, we have two problematic routine, coredump
and vmscan. Roland's idea can only solve the former. 

But I also interest vmscan quickly exit if OOM received. if other threads
get stuck in vmscan for freeing addional pages (this is very typical because
usually every thread call any syscall and eventually call kmalloc etc), 
recovering oom become very slow even if this doesn't makes deadlock.

Unfortunatelly, vmscan need much refactoring before appling this idea.
then, I didn't include such fixes.

I mean I hope to implement per-process OOM flag even if coredump don't
really need it.

So, I created MMF_OOM patch today. It is just for discussion, still.

From f099e1ba6e7b5654b35b468c13e1ae9e4f182ea4 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri, 4 Jun 2010 18:56:56 +0900
Subject: [RFC][PATCH v2] oom: make coredump interruptible

If oom victim process is under core dumping, sending SIGKILL cause
no-op. Unfortunately, coredump need relatively much memory. It mean
OOM vs coredump can makes deadlock.

Then, coredump logic should check the task has received SIGKILL
from OOM.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/binfmt_elf.c       |    4 ++++
 include/linux/sched.h |    1 +
 mm/oom_kill.c         |    3 ++-
 3 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..2aca748 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
+
+			/* The task need to exit ASAP if received OOM. */
+			if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
+				stop = 1;
 			if (stop)
 				goto end_coredump;
 		}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8485aa2..53b7caa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -436,6 +436,7 @@ extern int get_dumpable(struct mm_struct *mm);
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_OOM_KILLED		17	/* Killed by OOM */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2678a04..29850c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -401,7 +401,6 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 		       K(p->mm->total_vm),
 		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
 		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
-	task_unlock(p);
 
 	/*
 	 * We give our sacrificial lamb high priority and access to
@@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
+	set_bit(MMF_OOM_KILLED, &p->mm->flags);
+	task_unlock(p);
 
 	force_sig(SIGKILL, p);
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-04 10:54                       ` KOSAKI Motohiro
@ 2010-06-04 11:27                         ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-04 11:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/04, KOSAKI Motohiro wrote:
>
> > ... to set TIF_MEMDIE which should be checked in elf_core_dump().
> >
> > Probably yes.
>
> Yep, probably. but can you please allow me additonal explanation?
>
> In multi threaded OOM case, we have two problematic routine, coredump
> and vmscan. Roland's idea can only solve the former.
>
> But I also interest vmscan quickly exit if OOM received.

Yes, agreed. See another email from me, MMF_ flags looks "obviously
useful" to me.

(I'd suggest you to add a note into the changelog, to explain
 that the new flag makes sense even without coredump problems).

> @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
>  	 */
>  	p->rt.time_slice = HZ;
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
> +	set_bit(MMF_OOM_KILLED, &p->mm->flags);
> +	task_unlock(p);

IIUC, it has find_lock_task() mm above and thus we can trust p->mm ?
(I am asking just in case, I lost the plot a bit).

Ack or Reviewed, whatever your prefer.

Very minor nit.

> @@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
>                               page_cache_release(page);
>                       } else
>                               stop = !dump_seek(cprm->file, PAGE_SIZE);
> +
> +                     /* The task need to exit ASAP if received OOM. */
> +                     if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
> +                             stop = 1;

Perhaps this check makes more sense at the start of the loop,
and there is no need to set "stop = 1" (this var is not visible
outside of "for (;;) {}" anyway). Cosmetic, up to you.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-04 11:27                         ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-04 11:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/04, KOSAKI Motohiro wrote:
>
> > ... to set TIF_MEMDIE which should be checked in elf_core_dump().
> >
> > Probably yes.
>
> Yep, probably. but can you please allow me additonal explanation?
>
> In multi threaded OOM case, we have two problematic routine, coredump
> and vmscan. Roland's idea can only solve the former.
>
> But I also interest vmscan quickly exit if OOM received.

Yes, agreed. See another email from me, MMF_ flags looks "obviously
useful" to me.

(I'd suggest you to add a note into the changelog, to explain
 that the new flag makes sense even without coredump problems).

> @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
>  	 */
>  	p->rt.time_slice = HZ;
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
> +	set_bit(MMF_OOM_KILLED, &p->mm->flags);
> +	task_unlock(p);

IIUC, it has find_lock_task() mm above and thus we can trust p->mm ?
(I am asking just in case, I lost the plot a bit).

Ack or Reviewed, whatever your prefer.

Very minor nit.

> @@ -2038,6 +2038,10 @@ static int elf_core_dump(struct coredump_params *cprm)
>                               page_cache_release(page);
>                       } else
>                               stop = !dump_seek(cprm->file, PAGE_SIZE);
> +
> +                     /* The task need to exit ASAP if received OOM. */
> +                     if (test_bit(MMF_OOM_KILLED, &current->mm->flags))
> +                             stop = 1;

Perhaps this check makes more sense at the start of the loop,
and there is no need to set "stop = 1" (this var is not visible
outside of "for (;;) {}" anyway). Cosmetic, up to you.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-04 11:27                         ` Oleg Nesterov
@ 2010-06-04 11:34                           ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-04 11:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/04, Oleg Nesterov wrote:
>
> (I'd suggest you to add a note into the changelog, to explain
>  that the new flag makes sense even without coredump problems).

And. May I ask you to add another note into the changelog?

> > @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
> >  	 */
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> > +	set_bit(MMF_OOM_KILLED, &p->mm->flags);

I think the changelog should explain that, if we race with fork(),
this flag can't leak into the child's new mm. mm_init() filters
the bits outside of MMF_INIT_MASK.

If we race with exec, it can't leak because mm_alloc() does
memset(0).

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-04 11:34                           ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-04 11:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/04, Oleg Nesterov wrote:
>
> (I'd suggest you to add a note into the changelog, to explain
>  that the new flag makes sense even without coredump problems).

And. May I ask you to add another note into the changelog?

> > @@ -410,6 +409,8 @@ static int __oom_kill_process(struct task_struct *p, struct mem_cgroup *mem,
> >  	 */
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> > +	set_bit(MMF_OOM_KILLED, &p->mm->flags);

I think the changelog should explain that, if we race with fork(),
this flag can't leak into the child's new mm. mm_init() filters
the bits outside of MMF_INIT_MASK.

If we race with exec, it can't leak because mm_alloc() does
memset(0).

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-04 11:27                         ` Oleg Nesterov
@ 2010-06-09 19:53                           ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-09 19:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/04, Oleg Nesterov wrote:
>
> On 06/04, KOSAKI Motohiro wrote:
> >
> > In multi threaded OOM case, we have two problematic routine, coredump
> > and vmscan. Roland's idea can only solve the former.
> >
> > But I also interest vmscan quickly exit if OOM received.
>
> Yes, agreed. See another email from me, MMF_ flags looks "obviously
> useful" to me.

Well. But somehow we forgot about the !coredumping case... Suppose
that select_bad_process() chooses the process P to kill and we have
other processes (not sub-threads) which share the same ->mm.

In that case I am not sure we should blindly set MMF_OOMKILL. Suppose
that we kill P and after that the "out-of-memory" condition goes away.
But its ->mm still has MMF_OOMKILL set, and it is used. Who/when will
clear this flag?

Perhaps something like below makes sense for now.

Oleg.

--- x/fs/exec.c
+++ x/fs/exec.c
@@ -1594,6 +1594,7 @@ static inline int zap_threads(struct tas
 	spin_lock_irq(&tsk->sighand->siglock);
 	if (!signal_group_exit(tsk->signal)) {
 		mm->core_state = core_state;
+		set_bit(MMF_COREDUMP, &mm->flags);
 		nr = zap_process(tsk, exit_code);
 	}
 	spin_unlock_irq(&tsk->sighand->siglock);
--- x/fs/binfmt_elf.c
+++ x/fs/binfmt_elf.c
@@ -2028,6 +2028,9 @@ static int elf_core_dump(struct coredump
 			struct page *page;
 			int stop;
 
+			if (!test_bit(MMF_COREDUMP, &current->mm->flags))
+				goto end_coredump;
+
 			page = get_dump_page(addr);
 			if (page) {
 				void *kaddr = kmap(page);
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
 
+	clear_bit(MMF_COREDUMP, &p->mm->flags);
 	force_sig(SIGKILL, p);
 }
 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-09 19:53                           ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-09 19:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/04, Oleg Nesterov wrote:
>
> On 06/04, KOSAKI Motohiro wrote:
> >
> > In multi threaded OOM case, we have two problematic routine, coredump
> > and vmscan. Roland's idea can only solve the former.
> >
> > But I also interest vmscan quickly exit if OOM received.
>
> Yes, agreed. See another email from me, MMF_ flags looks "obviously
> useful" to me.

Well. But somehow we forgot about the !coredumping case... Suppose
that select_bad_process() chooses the process P to kill and we have
other processes (not sub-threads) which share the same ->mm.

In that case I am not sure we should blindly set MMF_OOMKILL. Suppose
that we kill P and after that the "out-of-memory" condition goes away.
But its ->mm still has MMF_OOMKILL set, and it is used. Who/when will
clear this flag?

Perhaps something like below makes sense for now.

Oleg.

--- x/fs/exec.c
+++ x/fs/exec.c
@@ -1594,6 +1594,7 @@ static inline int zap_threads(struct tas
 	spin_lock_irq(&tsk->sighand->siglock);
 	if (!signal_group_exit(tsk->signal)) {
 		mm->core_state = core_state;
+		set_bit(MMF_COREDUMP, &mm->flags);
 		nr = zap_process(tsk, exit_code);
 	}
 	spin_unlock_irq(&tsk->sighand->siglock);
--- x/fs/binfmt_elf.c
+++ x/fs/binfmt_elf.c
@@ -2028,6 +2028,9 @@ static int elf_core_dump(struct coredump
 			struct page *page;
 			int stop;
 
+			if (!test_bit(MMF_COREDUMP, &current->mm->flags))
+				goto end_coredump;
+
 			page = get_dump_page(addr);
 			if (page) {
 				void *kaddr = kmap(page);
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
 
+	clear_bit(MMF_COREDUMP, &p->mm->flags);
 	force_sig(SIGKILL, p);
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-09 19:53                           ` Oleg Nesterov
@ 2010-06-09 20:41                             ` David Rientjes
  -1 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-09 20:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Roland McGrath, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Wed, 9 Jun 2010, Oleg Nesterov wrote:

> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
>  	p->rt.time_slice = HZ;
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
>  
> +	clear_bit(MMF_COREDUMP, &p->mm->flags);
>  	force_sig(SIGKILL, p);
>  }
>  

This requires task_lock(p).

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-09 20:41                             ` David Rientjes
  0 siblings, 0 replies; 110+ messages in thread
From: David Rientjes @ 2010-06-09 20:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Roland McGrath, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On Wed, 9 Jun 2010, Oleg Nesterov wrote:

> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
>  	p->rt.time_slice = HZ;
>  	set_tsk_thread_flag(p, TIF_MEMDIE);
>  
> +	clear_bit(MMF_COREDUMP, &p->mm->flags);
>  	force_sig(SIGKILL, p);
>  }
>  

This requires task_lock(p).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-09 20:41                             ` David Rientjes
@ 2010-06-09 21:03                               ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-09 21:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Roland McGrath, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/09, David Rientjes wrote:
>
> On Wed, 9 Jun 2010, Oleg Nesterov wrote:
>
> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > +	clear_bit(MMF_COREDUMP, &p->mm->flags);
> >  	force_sig(SIGKILL, p);
> >  }
> >
>
> This requires task_lock(p).

Yes, yes, sure. This is only template. I'll wait for the next mmotm
to send the actual patch on top of recent changes. Unless Kosaki/Roland
have other ideas.

Imho, we really need to fix the coredump/oom problem.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-09 21:03                               ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-09 21:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Roland McGrath, LKML, linux-mm, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/09, David Rientjes wrote:
>
> On Wed, 9 Jun 2010, Oleg Nesterov wrote:
>
> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -414,6 +414,7 @@ static void __oom_kill_task(struct task_
> >  	p->rt.time_slice = HZ;
> >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > +	clear_bit(MMF_COREDUMP, &p->mm->flags);
> >  	force_sig(SIGKILL, p);
> >  }
> >
>
> This requires task_lock(p).

Yes, yes, sure. This is only template. I'll wait for the next mmotm
to send the actual patch on top of recent changes. Unless Kosaki/Roland
have other ideas.

Imho, we really need to fix the coredump/oom problem.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-09 19:53                           ` Oleg Nesterov
@ 2010-06-13 11:24                             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, LKML, linux-mm, David Rientjes,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin

Sorry for the delay.

> On 06/04, Oleg Nesterov wrote:
> >
> > On 06/04, KOSAKI Motohiro wrote:
> > >
> > > In multi threaded OOM case, we have two problematic routine, coredump
> > > and vmscan. Roland's idea can only solve the former.
> > >
> > > But I also interest vmscan quickly exit if OOM received.
> >
> > Yes, agreed. See another email from me, MMF_ flags looks "obviously
> > useful" to me.
> 
> Well. But somehow we forgot about the !coredumping case... Suppose
> that select_bad_process() chooses the process P to kill and we have
> other processes (not sub-threads) which share the same ->mm.

Ah, yes. I think you are correct.


> In that case I am not sure we should blindly set MMF_OOMKILL. Suppose
> that we kill P and after that the "out-of-memory" condition goes away.
> But its ->mm still has MMF_OOMKILL set, and it is used. Who/when will
> clear this flag?
> 
> Perhaps something like below makes sense for now.

Probably, this works. at least I don't find any problems.
But umm... Do you mean we can't implement per-process oom flags?

example,
	1) back to implement signal->oom_victim
	   because We are using SIGKILL for OOM and struct signal
	   naturally represent signal target.
	2) mm->nr_oom_killed_task
	   just avoid simple flag. instead counting number of tasks of
	   oom-killed.

I think both avoid your explained problem. Am I missing something?

But, again, I have no objection to your patch. because I really hope to
fix coredump vs oom issue.




^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-13 11:24                             ` KOSAKI Motohiro
  0 siblings, 0 replies; 110+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, LKML, linux-mm, David Rientjes,
	Andrew Morton, KAMEZAWA Hiroyuki, Nick Piggin

Sorry for the delay.

> On 06/04, Oleg Nesterov wrote:
> >
> > On 06/04, KOSAKI Motohiro wrote:
> > >
> > > In multi threaded OOM case, we have two problematic routine, coredump
> > > and vmscan. Roland's idea can only solve the former.
> > >
> > > But I also interest vmscan quickly exit if OOM received.
> >
> > Yes, agreed. See another email from me, MMF_ flags looks "obviously
> > useful" to me.
> 
> Well. But somehow we forgot about the !coredumping case... Suppose
> that select_bad_process() chooses the process P to kill and we have
> other processes (not sub-threads) which share the same ->mm.

Ah, yes. I think you are correct.


> In that case I am not sure we should blindly set MMF_OOMKILL. Suppose
> that we kill P and after that the "out-of-memory" condition goes away.
> But its ->mm still has MMF_OOMKILL set, and it is used. Who/when will
> clear this flag?
> 
> Perhaps something like below makes sense for now.

Probably, this works. at least I don't find any problems.
But umm... Do you mean we can't implement per-process oom flags?

example,
	1) back to implement signal->oom_victim
	   because We are using SIGKILL for OOM and struct signal
	   naturally represent signal target.
	2) mm->nr_oom_killed_task
	   just avoid simple flag. instead counting number of tasks of
	   oom-killed.

I think both avoid your explained problem. Am I missing something?

But, again, I have no objection to your patch. because I really hope to
fix coredump vs oom issue.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-13 11:24                             ` KOSAKI Motohiro
@ 2010-06-13 15:53                               ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-13 15:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/13, KOSAKI Motohiro wrote:
>
> > On 06/04, Oleg Nesterov wrote:
> > >
> > Perhaps something like below makes sense for now.
>
> Probably, this works. at least I don't find any problems.
> But umm... Do you mean we can't implement per-process oom flags?

Sorry, can't understand what you mean.

> example,
> 	1) back to implement signal->oom_victim
> 	   because We are using SIGKILL for OOM and struct signal
> 	   naturally represent signal target.

Yes, but if this process participates in the coredump, we should find
the right thread, or mark mm or mm->core_state.

In fact, I was never sure that oom-kill should kill the single process.
Perhaps it should kill all tasks using the same ->mm instead. But this
is another story.

> 	2) mm->nr_oom_killed_task
> 	   just avoid simple flag. instead counting number of tasks of
> 	   oom-killed.

again, can't understand.

> I think both avoid your explained problem. Am I missing something?

I guess that I am missing something ;) Please clarify?

> But, again, I have no objection to your patch. because I really hope to
> fix coredump vs oom issue.

Yes, I think this is important. And if we keep the PF_EXITING check in
select_bad_process(), it should be fixed so that at least the coredump
can't fool it. And the "p != current" is obviously not right too.

I'll try to do something next week, the patches should be simple.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-13 15:53                               ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-13 15:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/13, KOSAKI Motohiro wrote:
>
> > On 06/04, Oleg Nesterov wrote:
> > >
> > Perhaps something like below makes sense for now.
>
> Probably, this works. at least I don't find any problems.
> But umm... Do you mean we can't implement per-process oom flags?

Sorry, can't understand what you mean.

> example,
> 	1) back to implement signal->oom_victim
> 	   because We are using SIGKILL for OOM and struct signal
> 	   naturally represent signal target.

Yes, but if this process participates in the coredump, we should find
the right thread, or mark mm or mm->core_state.

In fact, I was never sure that oom-kill should kill the single process.
Perhaps it should kill all tasks using the same ->mm instead. But this
is another story.

> 	2) mm->nr_oom_killed_task
> 	   just avoid simple flag. instead counting number of tasks of
> 	   oom-killed.

again, can't understand.

> I think both avoid your explained problem. Am I missing something?

I guess that I am missing something ;) Please clarify?

> But, again, I have no objection to your patch. because I really hope to
> fix coredump vs oom issue.

Yes, I think this is important. And if we keep the PF_EXITING check in
select_bad_process(), it should be fixed so that at least the coredump
can't fool it. And the "p != current" is obviously not right too.

I'll try to do something next week, the patches should be simple.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
  2010-06-13 15:53                               ` Oleg Nesterov
@ 2010-06-13 17:13                                 ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-13 17:13 UTC (permalink / raw)
  To: Roland McGrath, KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 06/13, Oleg Nesterov wrote:
>
> On 06/13, KOSAKI Motohiro wrote:
> >
> > But, again, I have no objection to your patch. because I really hope to
> > fix coredump vs oom issue.
>
> Yes, I think this is important.

Oh. And another problem, vfork() is not interruptible too. This means
that the user can hide the memory hog from oom-killer. But let's forget
about oom.

Roland, any reason it should be uninterruptible? This doesn't look good
in any case. Perhaps the pseudo-patch below makes sense?

Oleg.

--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -1359,6 +1359,26 @@ struct task_struct * __cpuinit fork_idle
 	return task;
 }
 
+// ---------------------------------------------------
+// THIS SHOULD BE USED BY mm_release/coredump_wait/etc
+// ---------------------------------------------------
+void complete_vfork_done(struct task_struct *tsk)
+{
+	struct completion *vfork = xchg(tsk->vfork_done, NULL);
+	if (vfork)
+		complete(vfork);
+}
+
+static wait_for_vfork_done(struct task_struct *child, struct completion *vfork)
+{
+	if (!wait_for_completion_killable(vfork))
+		return;
+	if (xchg(child->vfork_done, NULL) != NULL)
+		return;
+	// the child has already read ->vfork_done and it should wake us up
+	wait_for_completion(vfork);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
@@ -1433,6 +1453,7 @@ long do_fork(unsigned long clone_flags,
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
 			init_completion(&vfork);
+			get_task_struct(p);
 		}
 
 		audit_finish_fork(p);
@@ -1462,7 +1483,8 @@ long do_fork(unsigned long clone_flags,
 
 		if (clone_flags & CLONE_VFORK) {
 			freezer_do_not_count();
-			wait_for_completion(&vfork);
+			wait_for_vfork_done(p, &vfork);
+			put_task_struct(p),
 			freezer_count();
 			tracehook_report_vfork_done(p, nr);
 		}


^ permalink raw reply	[flat|nested] 110+ messages in thread

* uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
@ 2010-06-13 17:13                                 ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-13 17:13 UTC (permalink / raw)
  To: Roland McGrath, KOSAKI Motohiro
  Cc: LKML, linux-mm, David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki,
	Nick Piggin

On 06/13, Oleg Nesterov wrote:
>
> On 06/13, KOSAKI Motohiro wrote:
> >
> > But, again, I have no objection to your patch. because I really hope to
> > fix coredump vs oom issue.
>
> Yes, I think this is important.

Oh. And another problem, vfork() is not interruptible too. This means
that the user can hide the memory hog from oom-killer. But let's forget
about oom.

Roland, any reason it should be uninterruptible? This doesn't look good
in any case. Perhaps the pseudo-patch below makes sense?

Oleg.

--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -1359,6 +1359,26 @@ struct task_struct * __cpuinit fork_idle
 	return task;
 }
 
+// ---------------------------------------------------
+// THIS SHOULD BE USED BY mm_release/coredump_wait/etc
+// ---------------------------------------------------
+void complete_vfork_done(struct task_struct *tsk)
+{
+	struct completion *vfork = xchg(tsk->vfork_done, NULL);
+	if (vfork)
+		complete(vfork);
+}
+
+static wait_for_vfork_done(struct task_struct *child, struct completion *vfork)
+{
+	if (!wait_for_completion_killable(vfork))
+		return;
+	if (xchg(child->vfork_done, NULL) != NULL)
+		return;
+	// the child has already read ->vfork_done and it should wake us up
+	wait_for_completion(vfork);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
@@ -1433,6 +1453,7 @@ long do_fork(unsigned long clone_flags,
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
 			init_completion(&vfork);
+			get_task_struct(p);
 		}
 
 		audit_finish_fork(p);
@@ -1462,7 +1483,8 @@ long do_fork(unsigned long clone_flags,
 
 		if (clone_flags & CLONE_VFORK) {
 			freezer_do_not_count();
-			wait_for_completion(&vfork);
+			wait_for_vfork_done(p, &vfork);
+			put_task_struct(p),
 			freezer_count();
 			tracehook_report_vfork_done(p, nr);
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-02 20:38                     ` Oleg Nesterov
@ 2010-06-14  0:26                       ` Roland McGrath
  -1 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14  0:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Oh. This needs more thinking. Definitely the task sleeping in exit_mm()
> must not exit until core_state->dumper->thread returns from do_coredump().
> If nothing else, the dumper can use its task_struct and it relies on
> the stable core_thread->next list. And right now TASK_KILLABLE can't
> work anyway, it is possible that fatal_signal_pending() is true.

Yes, I was right to say this should be another thread.  Let's not get into
all this right now.  I think it is mostly orthogonal to the oom_kill issue.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-14  0:26                       ` Roland McGrath
  0 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14  0:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Oh. This needs more thinking. Definitely the task sleeping in exit_mm()
> must not exit until core_state->dumper->thread returns from do_coredump().
> If nothing else, the dumper can use its task_struct and it relies on
> the stable core_thread->next list. And right now TASK_KILLABLE can't
> work anyway, it is possible that fatal_signal_pending() is true.

Yes, I was right to say this should be another thread.  Let's not get into
all this right now.  I think it is mostly orthogonal to the oom_kill issue.


Thanks,
Roland

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
  2010-06-13 15:53                               ` Oleg Nesterov
@ 2010-06-14  0:36                                 ` Roland McGrath
  -1 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14  0:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> > 	1) back to implement signal->oom_victim
> > 	   because We are using SIGKILL for OOM and struct signal
> > 	   naturally represent signal target.
> 
> Yes, but if this process participates in the coredump, we should find
> the right thread, or mark mm or mm->core_state.
> 
> In fact, I was never sure that oom-kill should kill the single process.
> Perhaps it should kill all tasks using the same ->mm instead. But this
> is another story.

Indeed.  But as long as oom_kill acts on process granularity, I don't think
we should have it set an mm-granularity flag.  That calculus changes if a
core dump is actually in progress, since that is already definitely going
to kill all tasks using that mm.  When no dump is in progress, it feels
wrong to leave any state change in mm, since the other mm-sharers were not
affected.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH] oom: Make coredump interruptible
@ 2010-06-14  0:36                                 ` Roland McGrath
  0 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14  0:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> > 	1) back to implement signal->oom_victim
> > 	   because We are using SIGKILL for OOM and struct signal
> > 	   naturally represent signal target.
> 
> Yes, but if this process participates in the coredump, we should find
> the right thread, or mark mm or mm->core_state.
> 
> In fact, I was never sure that oom-kill should kill the single process.
> Perhaps it should kill all tasks using the same ->mm instead. But this
> is another story.

Indeed.  But as long as oom_kill acts on process granularity, I don't think
we should have it set an mm-granularity flag.  That calculus changes if a
core dump is actually in progress, since that is already definitely going
to kill all tasks using that mm.  When no dump is in progress, it feels
wrong to leave any state change in mm, since the other mm-sharers were not
affected.


Thanks,
Roland

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
  2010-06-13 17:13                                 ` Oleg Nesterov
@ 2010-06-14  0:56                                   ` Roland McGrath
  -1 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14  0:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Oh. And another problem, vfork() is not interruptible too. This means
> that the user can hide the memory hog from oom-killer. 

I'm not sure there is really any danger like that, because of the
oom_kill_process "Try to kill a child first" logic.  Eventually the vfork
child will be chosen and killed, and when it finally exits that will
release the vfork wait.  So if the vfork parent is really the culprit,
it will then be subject to oom_kill_process sooner or later.

> But let's forget about oom.

Sure, but it reminds me to mention that vfork mm sharing is another reason
that having oom_kill set some persistent state in the mm seems wrong.  If a
vfork child is chosen for oom_kill and killed, then it's possible that will
relieve the need (e.g. much memory was held indirectly via its fd table or
whatnot else that is not shared with the parent via mm).  So once the child
is dead, there should not be any lingering bits in the parent's mm.

> Roland, any reason it should be uninterruptible? This doesn't look good
> in any case. Perhaps the pseudo-patch below makes sense?

I've long thought that we should make a vfork parent SIGKILL-able.  (Of
course the vfork wait can't be made interruptible by other signals, since
it must never do anything userish like signal handler setup until the child
has died or exec'd.)  I don't know off hand of any problem with your
straightforward change.  But I don't have much confidence that there isn't
any strange gotcha waiting there due to some other kind of implicit
assumption about vfork parent blocks that we are overlooking at the moment.
So I wouldn't change this without more thorough auditing and thinking about
everything related to vfork.

Personally, what I've really been interested in is changing the vfork wait
to use some different kind of blocking entirely.  My real motivation for
that is to let a vfork wait be morphed into and out of TASK_TRACED, so a
debugger can examine its registers and so forth.  That would entail letting
the vfork/clone syscall return fully back to the asm level so it could stop
in a proper state some place like the syscall-exit or notify-resume points.
However, that has other wrinkles on machines like sparc and ia64, where
user_regset access can involve user memory access.  Since we can't allow
those while the user memory is still shared with the child, it might not
really be practical at all.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
@ 2010-06-14  0:56                                   ` Roland McGrath
  0 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14  0:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> Oh. And another problem, vfork() is not interruptible too. This means
> that the user can hide the memory hog from oom-killer. 

I'm not sure there is really any danger like that, because of the
oom_kill_process "Try to kill a child first" logic.  Eventually the vfork
child will be chosen and killed, and when it finally exits that will
release the vfork wait.  So if the vfork parent is really the culprit,
it will then be subject to oom_kill_process sooner or later.

> But let's forget about oom.

Sure, but it reminds me to mention that vfork mm sharing is another reason
that having oom_kill set some persistent state in the mm seems wrong.  If a
vfork child is chosen for oom_kill and killed, then it's possible that will
relieve the need (e.g. much memory was held indirectly via its fd table or
whatnot else that is not shared with the parent via mm).  So once the child
is dead, there should not be any lingering bits in the parent's mm.

> Roland, any reason it should be uninterruptible? This doesn't look good
> in any case. Perhaps the pseudo-patch below makes sense?

I've long thought that we should make a vfork parent SIGKILL-able.  (Of
course the vfork wait can't be made interruptible by other signals, since
it must never do anything userish like signal handler setup until the child
has died or exec'd.)  I don't know off hand of any problem with your
straightforward change.  But I don't have much confidence that there isn't
any strange gotcha waiting there due to some other kind of implicit
assumption about vfork parent blocks that we are overlooking at the moment.
So I wouldn't change this without more thorough auditing and thinking about
everything related to vfork.

Personally, what I've really been interested in is changing the vfork wait
to use some different kind of blocking entirely.  My real motivation for
that is to let a vfork wait be morphed into and out of TASK_TRACED, so a
debugger can examine its registers and so forth.  That would entail letting
the vfork/clone syscall return fully back to the asm level so it could stop
in a proper state some place like the syscall-exit or notify-resume points.
However, that has other wrinkles on machines like sparc and ia64, where
user_regset access can involve user memory access.  Since we can't allow
those while the user memory is still shared with the child, it might not
really be practical at all.


Thanks,
Roland

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
  2010-06-14  0:56                                   ` Roland McGrath
@ 2010-06-14 16:33                                     ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-14 16:33 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/13, Roland McGrath wrote:
>
> > Oh. And another problem, vfork() is not interruptible too. This means
> > that the user can hide the memory hog from oom-killer.
>
> I'm not sure there is really any danger like that, because of the
> oom_kill_process "Try to kill a child first" logic.

But note that oom_kill_process() doesn't kill the children with the
same ->mm. I never understood this code.

Anyway I agree. Even if I am right, this is not very serious problem
from oom-kill pov. To me, the uninterruptible CLONE_VFORK is bad by
itself.

> > But let's forget about oom.
>
> Sure, but it reminds me to mention that vfork mm sharing is another reason
> that having oom_kill set some persistent state in the mm seems wrong.

Yes, yes, this was already discussed a bit. Only if the core dump is in
progress we can touch ->mm or (probably better but needs a bit more locking)
mm->core_state to signal the coredumping thread and (perhaps) for something
else.

> > Roland, any reason it should be uninterruptible? This doesn't look good
> > in any case. Perhaps the pseudo-patch below makes sense?
>
> I've long thought that we should make a vfork parent SIGKILL-able.

Good ;)

> (Of
> course the vfork wait can't be made interruptible by other signals, since
> it must never do anything userish

Yes sure. That is why wait_for_completion_killable(), not _interrutpible.
But I assume you didn't mean that only SIGKILL should interrupt the
parent, any sig_fatal() signal should.

> I don't know off hand of any problem with your
> straightforward change.  But I don't have much confidence that there isn't
> any strange gotcha waiting there due to some other kind of implicit
> assumption about vfork parent blocks that we are overlooking at the moment.
> So I wouldn't change this without more thorough auditing and thinking about
> everything related to vfork.

Agreed. This needs auditing. And CLONE_VFORK can be used with/without all
other CLONE_ flags... Probably we should mostly worry about vfork ==
CLONE_VM | CLONE_VFORK case.

Anyway. ->vfork_done is per-thread. This means that without any changes
do_fork(CLONE_VFORK) can return (to user-mode) before the child's thread
group exits/execs. Perhaps this means we shouldn't worry too much.

> Personally, what I've really been interested in is changing the vfork wait
> to use some different kind of blocking entirely.  My real motivation for
> that is to let a vfork wait be morphed into and out of TASK_TRACED,

I see. I never thought about this, but I think you are right.

Hmm. Even without debugger, the parent doesn't react to SIGSTOP. Say,

	int main(voif)
	{
		if (!vfork())
			pause();
	}

and ^Z won't work obviously. Not good.

This is not trivail I guess. Needs thinking...

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
@ 2010-06-14 16:33                                     ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-14 16:33 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/13, Roland McGrath wrote:
>
> > Oh. And another problem, vfork() is not interruptible too. This means
> > that the user can hide the memory hog from oom-killer.
>
> I'm not sure there is really any danger like that, because of the
> oom_kill_process "Try to kill a child first" logic.

But note that oom_kill_process() doesn't kill the children with the
same ->mm. I never understood this code.

Anyway I agree. Even if I am right, this is not very serious problem
from oom-kill pov. To me, the uninterruptible CLONE_VFORK is bad by
itself.

> > But let's forget about oom.
>
> Sure, but it reminds me to mention that vfork mm sharing is another reason
> that having oom_kill set some persistent state in the mm seems wrong.

Yes, yes, this was already discussed a bit. Only if the core dump is in
progress we can touch ->mm or (probably better but needs a bit more locking)
mm->core_state to signal the coredumping thread and (perhaps) for something
else.

> > Roland, any reason it should be uninterruptible? This doesn't look good
> > in any case. Perhaps the pseudo-patch below makes sense?
>
> I've long thought that we should make a vfork parent SIGKILL-able.

Good ;)

> (Of
> course the vfork wait can't be made interruptible by other signals, since
> it must never do anything userish

Yes sure. That is why wait_for_completion_killable(), not _interrutpible.
But I assume you didn't mean that only SIGKILL should interrupt the
parent, any sig_fatal() signal should.

> I don't know off hand of any problem with your
> straightforward change.  But I don't have much confidence that there isn't
> any strange gotcha waiting there due to some other kind of implicit
> assumption about vfork parent blocks that we are overlooking at the moment.
> So I wouldn't change this without more thorough auditing and thinking about
> everything related to vfork.

Agreed. This needs auditing. And CLONE_VFORK can be used with/without all
other CLONE_ flags... Probably we should mostly worry about vfork ==
CLONE_VM | CLONE_VFORK case.

Anyway. ->vfork_done is per-thread. This means that without any changes
do_fork(CLONE_VFORK) can return (to user-mode) before the child's thread
group exits/execs. Perhaps this means we shouldn't worry too much.

> Personally, what I've really been interested in is changing the vfork wait
> to use some different kind of blocking entirely.  My real motivation for
> that is to let a vfork wait be morphed into and out of TASK_TRACED,

I see. I never thought about this, but I think you are right.

Hmm. Even without debugger, the parent doesn't react to SIGSTOP. Say,

	int main(voif)
	{
		if (!vfork())
			pause();
	}

and ^Z won't work obviously. Not good.

This is not trivail I guess. Needs thinking...

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
  2010-06-14 16:33                                     ` Oleg Nesterov
@ 2010-06-14 19:17                                       ` Roland McGrath
  -1 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14 19:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> But note that oom_kill_process() doesn't kill the children with the
> same ->mm. I never understood this code.

Yes, odd.  This is the first time I've really looked at oom_kill.

> Anyway I agree. Even if I am right, this is not very serious problem
> from oom-kill pov. To me, the uninterruptible CLONE_VFORK is bad by
> itself.

Agreed.

> Yes sure. That is why wait_for_completion_killable(), not _interrutpible.

Right, your code was fine.  I was just being pedantic for the record since
you said "interruptible" in the text.

> But I assume you didn't mean that only SIGKILL should interrupt the
> parent, any sig_fatal() signal should.

Yes.

> Agreed. This needs auditing. And CLONE_VFORK can be used with/without all
> other CLONE_ flags... Probably we should mostly worry about vfork ==
> CLONE_VM | CLONE_VFORK case.

Yes.  I hope it is fine to make clone refuse CLONE_VFORK set without
CLONE_VM in the future as a sanity check.  I don't think any use of
CLONE_VFORK other than the actual vfork use is something we ever intended
to support.

> Anyway. ->vfork_done is per-thread. This means that without any changes
> do_fork(CLONE_VFORK) can return (to user-mode) before the child's thread
> group exits/execs. Perhaps this means we shouldn't worry too much.

You mean some other thread in the parent's group can run in user mode.
Yes.  The real reason for the vfork wait is just that the parent/child will
share the user stack memory, so in practice it's fine if other threads with
other stacks are touching other memory (i.e. it's just the user's problem).

> Hmm. Even without debugger, the parent doesn't react to SIGSTOP. 

Yes.  It's been a long time since I thought about the vfork stuff much.
But I now recall thinking about the SIGSTOP/SIGTSTP issue too.  It does
seem bad.  OTOH, it has lurked there for many years now without complaints.

Note that supporting stop/fatal signals in the normal way means that the
call has to return and pass the syscall-exit tracing point first.  This
means a change in the order of events seen by a debugger.  It also
complicates the subject of PTRACE_EVENT_VFORK_DONE reports, which today
happen before syscall-exit or signal stuff is possible.  For proper
stopping in the normal way, the vfork-wait would be restarted via
sys_restart_syscall or something.  But the way that happens ordinarily is
to get all the way back to user mode and reenter with a normal syscall.
That doesn't touch the user stack itself, but it sure makes one nervous.
It's hard to see how we could ever do that and then prevent normal signals
from being handled before the restart.  (Instead, we'd have the actual
blocking done inside get_signal_to_deliver so we just never get to user
mode until the vfork hold is released, and not actually need to restart.)
So there are multiple cans of worms cascading from a change, even though
the actual work to do the block in a new way might not be very complex.

It all seems kind of doable, at least if we accept a change in the userland
debugger experience of which ptrace reports a vfork parent might make in
what order.  But plenty of hair to worry about.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
@ 2010-06-14 19:17                                       ` Roland McGrath
  0 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-14 19:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> But note that oom_kill_process() doesn't kill the children with the
> same ->mm. I never understood this code.

Yes, odd.  This is the first time I've really looked at oom_kill.

> Anyway I agree. Even if I am right, this is not very serious problem
> from oom-kill pov. To me, the uninterruptible CLONE_VFORK is bad by
> itself.

Agreed.

> Yes sure. That is why wait_for_completion_killable(), not _interrutpible.

Right, your code was fine.  I was just being pedantic for the record since
you said "interruptible" in the text.

> But I assume you didn't mean that only SIGKILL should interrupt the
> parent, any sig_fatal() signal should.

Yes.

> Agreed. This needs auditing. And CLONE_VFORK can be used with/without all
> other CLONE_ flags... Probably we should mostly worry about vfork ==
> CLONE_VM | CLONE_VFORK case.

Yes.  I hope it is fine to make clone refuse CLONE_VFORK set without
CLONE_VM in the future as a sanity check.  I don't think any use of
CLONE_VFORK other than the actual vfork use is something we ever intended
to support.

> Anyway. ->vfork_done is per-thread. This means that without any changes
> do_fork(CLONE_VFORK) can return (to user-mode) before the child's thread
> group exits/execs. Perhaps this means we shouldn't worry too much.

You mean some other thread in the parent's group can run in user mode.
Yes.  The real reason for the vfork wait is just that the parent/child will
share the user stack memory, so in practice it's fine if other threads with
other stacks are touching other memory (i.e. it's just the user's problem).

> Hmm. Even without debugger, the parent doesn't react to SIGSTOP. 

Yes.  It's been a long time since I thought about the vfork stuff much.
But I now recall thinking about the SIGSTOP/SIGTSTP issue too.  It does
seem bad.  OTOH, it has lurked there for many years now without complaints.

Note that supporting stop/fatal signals in the normal way means that the
call has to return and pass the syscall-exit tracing point first.  This
means a change in the order of events seen by a debugger.  It also
complicates the subject of PTRACE_EVENT_VFORK_DONE reports, which today
happen before syscall-exit or signal stuff is possible.  For proper
stopping in the normal way, the vfork-wait would be restarted via
sys_restart_syscall or something.  But the way that happens ordinarily is
to get all the way back to user mode and reenter with a normal syscall.
That doesn't touch the user stack itself, but it sure makes one nervous.
It's hard to see how we could ever do that and then prevent normal signals
from being handled before the restart.  (Instead, we'd have the actual
blocking done inside get_signal_to_deliver so we just never get to user
mode until the vfork hold is released, and not actually need to restart.)
So there are multiple cans of worms cascading from a change, even though
the actual work to do the block in a new way might not be very complex.

It all seems kind of doable, at least if we accept a change in the userland
debugger experience of which ptrace reports a vfork parent might make in
what order.  But plenty of hair to worry about.


Thanks,
Roland

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
  2010-06-14 19:17                                       ` Roland McGrath
@ 2010-06-28 17:33                                         ` Oleg Nesterov
  -1 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-28 17:33 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/14, Roland McGrath wrote:
>
> > Hmm. Even without debugger, the parent doesn't react to SIGSTOP.
>
> Yes.  It's been a long time since I thought about the vfork stuff much.
> But I now recall thinking about the SIGSTOP/SIGTSTP issue too.  It does
> seem bad.  OTOH, it has lurked there for many years now without complaints.
>
> Note that supporting stop/fatal signals in the normal way means that the
> call has to return and pass the syscall-exit tracing point first.  This
> means a change in the order of events seen by a debugger.  It also
> complicates the subject of PTRACE_EVENT_VFORK_DONE reports, which today
> happen before syscall-exit or signal stuff is possible.  For proper
> stopping in the normal way, the vfork-wait would be restarted via
> sys_restart_syscall or something.

Yes. I was thinking about this too.

The parent can play with real_blocked or saved_sigmask to block all
signals except STOP and KILL, use TASK_INTERRUPTIBLE for wait, and
just return ERESTART each time it gets the signal (it should clear
child->vfork_done if fatal_signal_pending).

We should also check PF_KTHREAD though, there are in kernel users
of CLONE_VFORK.

> Bu the way that happens ordinarily is
> to get all the way back to user mode and reenter with a normal syscall.
> That doesn't touch the user stack itself, but it sure makes one nervous.

me too. Especially because I do not really know how !x86 machines
implement this all.

We should also verify that the exiting/stopping parent can never write
to its ->mm. For example, exit_mm() does put_user(tsk->clear_child_tid).
Fortunately we can rely on PF_SIGNALED flag in this case.

Oleg.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
@ 2010-06-28 17:33                                         ` Oleg Nesterov
  0 siblings, 0 replies; 110+ messages in thread
From: Oleg Nesterov @ 2010-06-28 17:33 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

On 06/14, Roland McGrath wrote:
>
> > Hmm. Even without debugger, the parent doesn't react to SIGSTOP.
>
> Yes.  It's been a long time since I thought about the vfork stuff much.
> But I now recall thinking about the SIGSTOP/SIGTSTP issue too.  It does
> seem bad.  OTOH, it has lurked there for many years now without complaints.
>
> Note that supporting stop/fatal signals in the normal way means that the
> call has to return and pass the syscall-exit tracing point first.  This
> means a change in the order of events seen by a debugger.  It also
> complicates the subject of PTRACE_EVENT_VFORK_DONE reports, which today
> happen before syscall-exit or signal stuff is possible.  For proper
> stopping in the normal way, the vfork-wait would be restarted via
> sys_restart_syscall or something.

Yes. I was thinking about this too.

The parent can play with real_blocked or saved_sigmask to block all
signals except STOP and KILL, use TASK_INTERRUPTIBLE for wait, and
just return ERESTART each time it gets the signal (it should clear
child->vfork_done if fatal_signal_pending).

We should also check PF_KTHREAD though, there are in kernel users
of CLONE_VFORK.

> Bu the way that happens ordinarily is
> to get all the way back to user mode and reenter with a normal syscall.
> That doesn't touch the user stack itself, but it sure makes one nervous.

me too. Especially because I do not really know how !x86 machines
implement this all.

We should also verify that the exiting/stopping parent can never write
to its ->mm. For example, exit_mm() does put_user(tsk->clear_child_tid).
Fortunately we can rely on PF_SIGNALED flag in this case.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
  2010-06-28 17:33                                         ` Oleg Nesterov
@ 2010-06-28 18:04                                           ` Roland McGrath
  -1 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-28 18:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> The parent can play with real_blocked or saved_sigmask to block all
> signals except STOP and KILL, use TASK_INTERRUPTIBLE for wait, and
> just return ERESTART each time it gets the signal (it should clear
> child->vfork_done if fatal_signal_pending).

Yes, perhaps.

> We should also check PF_KTHREAD though, there are in kernel users
> of CLONE_VFORK.

There is only __call_usermodehelper, but yes.

> > Bu the way that happens ordinarily is
> > to get all the way back to user mode and reenter with a normal syscall.
> > That doesn't touch the user stack itself, but it sure makes one nervous.
> 
> me too. Especially because I do not really know how !x86 machines
> implement this all.

The only problem I know of off hand is ia64's TIF_RESTORE_RSE (an
arch-specific ptrace detail).  But yes, it would require a careful
reading of all the arch code paths.

> We should also verify that the exiting/stopping parent can never write
> to its ->mm. For example, exit_mm() does put_user(tsk->clear_child_tid).
> Fortunately we can rely on PF_SIGNALED flag in this case.

Right.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible)
@ 2010-06-28 18:04                                           ` Roland McGrath
  0 siblings, 0 replies; 110+ messages in thread
From: Roland McGrath @ 2010-06-28 18:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, linux-mm, David Rientjes, Andrew Morton,
	KAMEZAWA Hiroyuki, Nick Piggin

> The parent can play with real_blocked or saved_sigmask to block all
> signals except STOP and KILL, use TASK_INTERRUPTIBLE for wait, and
> just return ERESTART each time it gets the signal (it should clear
> child->vfork_done if fatal_signal_pending).

Yes, perhaps.

> We should also check PF_KTHREAD though, there are in kernel users
> of CLONE_VFORK.

There is only __call_usermodehelper, but yes.

> > Bu the way that happens ordinarily is
> > to get all the way back to user mode and reenter with a normal syscall.
> > That doesn't touch the user stack itself, but it sure makes one nervous.
> 
> me too. Especially because I do not really know how !x86 machines
> implement this all.

The only problem I know of off hand is ia64's TIF_RESTORE_RSE (an
arch-specific ptrace detail).  But yes, it would require a careful
reading of all the arch code paths.

> We should also verify that the exiting/stopping parent can never write
> to its ->mm. For example, exit_mm() does put_user(tsk->clear_child_tid).
> Fortunately we can rely on PF_SIGNALED flag in this case.

Right.


Thanks,
Roland

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

end of thread, other threads:[~2010-06-28 18:04 UTC | newest]

Thread overview: 110+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-31  9:33 [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads KOSAKI Motohiro
2010-05-31  9:33 ` KOSAKI Motohiro
2010-05-31  9:35 ` [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account KOSAKI Motohiro
2010-05-31  9:35   ` KOSAKI Motohiro
2010-05-31 16:43   ` Oleg Nesterov
2010-05-31 16:43     ` Oleg Nesterov
2010-06-01  1:10     ` KOSAKI Motohiro
2010-06-01  1:10       ` KOSAKI Motohiro
2010-06-01 20:18       ` Oleg Nesterov
2010-06-01 20:18         ` Oleg Nesterov
2010-06-02 13:54         ` [PATCH] oom: remove PF_EXITING check completely KOSAKI Motohiro
2010-06-02 13:54           ` KOSAKI Motohiro
2010-06-02 15:54           ` Oleg Nesterov
2010-06-02 15:54             ` Oleg Nesterov
2010-06-02 21:02             ` David Rientjes
2010-06-02 21:02               ` David Rientjes
2010-06-03  4:48               ` KOSAKI Motohiro
2010-06-03  4:48                 ` KOSAKI Motohiro
2010-06-03  6:29                 ` David Rientjes
2010-06-03  6:29                   ` David Rientjes
2010-06-02 13:54         ` [PATCH] oom: Make coredump interruptible KOSAKI Motohiro
2010-06-02 13:54           ` KOSAKI Motohiro
2010-06-02 15:42           ` Oleg Nesterov
2010-06-02 15:42             ` Oleg Nesterov
2010-06-02 17:29             ` Roland McGrath
2010-06-02 17:29               ` Roland McGrath
2010-06-02 17:53               ` Oleg Nesterov
2010-06-02 17:53                 ` Oleg Nesterov
2010-06-02 18:58                 ` Roland McGrath
2010-06-02 18:58                   ` Roland McGrath
2010-06-02 20:38                   ` Oleg Nesterov
2010-06-02 20:38                     ` Oleg Nesterov
2010-06-03 14:03                     ` Oleg Nesterov
2010-06-03 14:03                       ` Oleg Nesterov
2010-06-04 10:54                     ` KOSAKI Motohiro
2010-06-04 10:54                       ` KOSAKI Motohiro
2010-06-04 11:27                       ` Oleg Nesterov
2010-06-04 11:27                         ` Oleg Nesterov
2010-06-04 11:34                         ` Oleg Nesterov
2010-06-04 11:34                           ` Oleg Nesterov
2010-06-09 19:53                         ` Oleg Nesterov
2010-06-09 19:53                           ` Oleg Nesterov
2010-06-09 20:41                           ` David Rientjes
2010-06-09 20:41                             ` David Rientjes
2010-06-09 21:03                             ` Oleg Nesterov
2010-06-09 21:03                               ` Oleg Nesterov
2010-06-13 11:24                           ` KOSAKI Motohiro
2010-06-13 11:24                             ` KOSAKI Motohiro
2010-06-13 15:53                             ` Oleg Nesterov
2010-06-13 15:53                               ` Oleg Nesterov
2010-06-13 17:13                               ` uninterruptible CLONE_VFORK (Was: oom: Make coredump interruptible) Oleg Nesterov
2010-06-13 17:13                                 ` Oleg Nesterov
2010-06-14  0:56                                 ` Roland McGrath
2010-06-14  0:56                                   ` Roland McGrath
2010-06-14 16:33                                   ` Oleg Nesterov
2010-06-14 16:33                                     ` Oleg Nesterov
2010-06-14 19:17                                     ` Roland McGrath
2010-06-14 19:17                                       ` Roland McGrath
2010-06-28 17:33                                       ` Oleg Nesterov
2010-06-28 17:33                                         ` Oleg Nesterov
2010-06-28 18:04                                         ` Roland McGrath
2010-06-28 18:04                                           ` Roland McGrath
2010-06-14  0:36                               ` [PATCH] oom: Make coredump interruptible Roland McGrath
2010-06-14  0:36                                 ` Roland McGrath
2010-06-14  0:26                     ` Roland McGrath
2010-06-14  0:26                       ` Roland McGrath
2010-06-01 20:39   ` [PATCH 2/5] oom: select_bad_process: PF_EXITING check should take ->mm into account David Rientjes
2010-06-01 20:39     ` David Rientjes
2010-05-31  9:36 ` [PATCH 3/5] oom: introduce find_lock_task_mm() to fix !mm false positives KOSAKI Motohiro
2010-05-31  9:36   ` KOSAKI Motohiro
2010-06-01  0:57   ` KAMEZAWA Hiroyuki
2010-06-01  0:57     ` KAMEZAWA Hiroyuki
2010-06-01 20:42   ` David Rientjes
2010-06-01 20:42     ` David Rientjes
2010-06-02 16:05   ` Minchan Kim
2010-06-02 16:05     ` Minchan Kim
2010-05-31  9:37 ` [PATCH 4/5] oom: the points calculation of child processes must use find_lock_task_mm() too KOSAKI Motohiro
2010-05-31  9:37   ` KOSAKI Motohiro
2010-05-31 16:56   ` Oleg Nesterov
2010-05-31 16:56     ` Oleg Nesterov
2010-05-31 23:48     ` KOSAKI Motohiro
2010-05-31 23:48       ` KOSAKI Motohiro
2010-05-31  9:38 ` [PATCH 5/5] oom: __oom_kill_task() " KOSAKI Motohiro
2010-05-31  9:38   ` KOSAKI Motohiro
2010-06-01  1:02   ` KAMEZAWA Hiroyuki
2010-06-01  1:02     ` KAMEZAWA Hiroyuki
2010-06-01 20:44   ` David Rientjes
2010-06-01 20:44     ` David Rientjes
2010-06-01  0:54 ` [PATCH 1/5] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads KAMEZAWA Hiroyuki
2010-06-01  0:54   ` KAMEZAWA Hiroyuki
2010-06-01 20:36 ` David Rientjes
2010-06-01 20:36   ` David Rientjes
2010-06-01 21:20   ` Oleg Nesterov
2010-06-01 21:20     ` Oleg Nesterov
2010-06-01 21:26     ` David Rientjes
2010-06-01 21:26       ` David Rientjes
2010-06-02 13:54       ` KOSAKI Motohiro
2010-06-02 13:54         ` KOSAKI Motohiro
2010-06-02 21:09         ` David Rientjes
2010-06-02 21:09           ` David Rientjes
2010-06-02 21:33           ` Oleg Nesterov
2010-06-02 21:33             ` Oleg Nesterov
2010-06-02 21:46             ` David Rientjes
2010-06-02 21:46               ` David Rientjes
2010-06-03 14:27               ` Oleg Nesterov
2010-06-03 14:27                 ` Oleg Nesterov
2010-06-03 20:11                 ` David Rientjes
2010-06-03 20:11                   ` David Rientjes
2010-06-02 15:32 ` Minchan Kim
2010-06-02 15:32   ` Minchan Kim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.