* [PATCH 0/8] make vfork killable/restartable/traceable
@ 2011-07-27 16:31 Oleg Nesterov
  2011-07-27 16:32 ` [PATCH 1/8] vfork: introduce complete_vfork_done() Oleg Nesterov
                   ` (13 more replies)
  0 siblings, 14 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:31 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

Hello.

CLONE_VFORK sleeps in TASK_UNINTERRUPTIBLE until the child exits/execs.
This is obviously not good, it is sooo simple to create a task which
doesn't react to SIGKILL/SIGSTOP.
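
As a userspace illustration (not kernel code), the sketch below shows why
the parent is stuck: vfork() suspends the caller until the child calls
_exit() or exec. The child's store to a shared variable is visible to the
parent afterwards, proving the parent really slept the whole time.
Strictly, POSIX only permits _exit()/exec in a vfork() child, so this
relies on Linux behavior; the function name is made up for the example.

```c
/*
 * Userspace illustration (not kernel code).  vfork() suspends the
 * parent until the child _exit()s or execs.  The child borrows the
 * parent's address space, so its store to child_ran is visible to
 * the parent after vfork() returns -- the parent provably slept
 * until the child was done.
 */
#include <sys/types.h>
#include <unistd.h>

static volatile int child_ran;

int vfork_parent_blocks(void)
{
	pid_t pid = vfork();

	if (pid == 0) {		/* child, sharing the parent's memory */
		child_ran = 1;
		_exit(0);
	}

	/* parent: only reached after the child has _exit()ed */
	return child_ran;
}
```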

Questions:

	- do we really need this?

	  I think we do. This really "looks like a bug" in any case,
	  even if nobody ever complained afaik.

	- maybe 1-3 is enough?

	  maybe... but personally I think SIGSTOP/ptrace should work
	  too.

	- is it safe to exit/stop on a !x86 machine???

	  I do not know. Maybe this needs some #ifdefs around
	  wait_for_completion_interruptible(). I am not sure that,
	  say, arch_ptrace_stop() can't abuse the ->mm shared with
	  the child.

	  OTOH. This can happen anyway, do_fork() does ptrace_event()
	  before wait_for_completion().

	- and of course, while I think this is a bugfix, it is a
	  user-visible change.

Please comment.

9/8 is off-topic.

Oleg.

 fs/exec.c                   |   11 ----
 include/linux/sched.h       |    2 +-
 include/linux/thread_info.h |    4 ++
 kernel/fork.c               |  121 ++++++++++++++++++++++++++++++++++---------
 kernel/pid.c                |   13 +++++
 5 files changed, 114 insertions(+), 37 deletions(-)


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/8] vfork: introduce complete_vfork_done()
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
@ 2011-07-27 16:32 ` Oleg Nesterov
  2011-07-27 16:32 ` [PATCH 2/8] vfork: introduce clone_vfork_finish() Oleg Nesterov
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:32 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

No functional changes.

Move the clear-and-complete-vfork_done code into the new trivial
helper, complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h |    1 +
 fs/exec.c             |    8 ++------
 kernel/fork.c         |   17 ++++++++++-------
 3 files changed, 13 insertions(+), 13 deletions(-)

--- 3.1/include/linux/sched.h~1_complete_vfork_done	2011-07-26 15:50:14.000000000 +0200
+++ 3.1/include/linux/sched.h	2011-07-26 17:11:40.000000000 +0200
@@ -2253,6 +2253,7 @@ extern int do_execve(const char *,
 		     const char __user * const __user *,
 		     const char __user * const __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern void complete_vfork_done(struct task_struct *tsk);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
--- 3.1/fs/exec.c~1_complete_vfork_done	2011-07-23 15:26:19.000000000 +0200
+++ 3.1/fs/exec.c	2011-07-26 17:15:10.000000000 +0200
@@ -1885,7 +1885,6 @@ static int coredump_wait(int exit_code, 
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
-	struct completion *vfork_done;
 	int core_waiters = -EBUSY;
 
 	init_completion(&core_state->startup);
@@ -1904,11 +1903,8 @@ static int coredump_wait(int exit_code, 
 	 * Make sure nobody is waiting for us to release the VM,
 	 * otherwise we can deadlock when we wait on each other
 	 */
-	vfork_done = tsk->vfork_done;
-	if (vfork_done) {
-		tsk->vfork_done = NULL;
-		complete(vfork_done);
-	}
+	if (tsk->vfork_done)
+		complete_vfork_done(tsk);
 
 	if (core_waiters)
 		wait_for_completion(&core_state->startup);
--- 3.1/kernel/fork.c~1_complete_vfork_done	2011-07-26 15:50:15.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-26 17:32:22.000000000 +0200
@@ -662,8 +662,6 @@ EXPORT_SYMBOL_GPL(get_task_mm);
  */
 void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 {
-	struct completion *vfork_done = tsk->vfork_done;
-
 	/* Get rid of any futexes when releasing the mm */
 #ifdef CONFIG_FUTEX
 	if (unlikely(tsk->robust_list)) {
@@ -683,11 +681,8 @@ void mm_release(struct task_struct *tsk,
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
 
-	/* notify parent sleeping on vfork() */
-	if (vfork_done) {
-		tsk->vfork_done = NULL;
-		complete(vfork_done);
-	}
+	if (tsk->vfork_done)
+		complete_vfork_done(tsk);
 
 	/*
 	 * If we're exiting normally, clear a user-space tid field if
@@ -1446,6 +1441,14 @@ struct task_struct * __cpuinit fork_idle
 	return task;
 }
 
+void complete_vfork_done(struct task_struct *tsk)
+{
+	struct completion *vfork_done = tsk->vfork_done;
+
+	tsk->vfork_done = NULL;
+	complete(vfork_done);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *



* [PATCH 2/8] vfork: introduce clone_vfork_finish()
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
  2011-07-27 16:32 ` [PATCH 1/8] vfork: introduce complete_vfork_done() Oleg Nesterov
@ 2011-07-27 16:32 ` Oleg Nesterov
  2011-07-27 16:32 ` [PATCH 3/8] vfork: make it killable Oleg Nesterov
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:32 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

No functional changes.

Move the wait-for-vfork_done code from do_fork() into the new
helper, clone_vfork_finish().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/fork.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

--- 3.1/kernel/fork.c~2_wait_for_vfork_done	2011-07-26 17:52:25.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-26 17:52:59.000000000 +0200
@@ -1449,6 +1449,17 @@ void complete_vfork_done(struct task_str
 	complete(vfork_done);
 }
 
+static long clone_vfork_finish(struct task_struct *child,
+				struct completion *vfork_done, long pid)
+{
+	freezer_do_not_count();
+	wait_for_completion(vfork_done);
+	freezer_count();
+
+	ptrace_event(PTRACE_EVENT_VFORK_DONE, pid);
+	return pid;
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
@@ -1536,12 +1547,8 @@ long do_fork(unsigned long clone_flags,
 		if (unlikely(trace))
 			ptrace_event(trace, nr);
 
-		if (clone_flags & CLONE_VFORK) {
-			freezer_do_not_count();
-			wait_for_completion(&vfork);
-			freezer_count();
-			ptrace_event(PTRACE_EVENT_VFORK_DONE, nr);
-		}
+		if (clone_flags & CLONE_VFORK)
+			nr = clone_vfork_finish(p, &vfork, nr);
 	} else {
 		nr = PTR_ERR(p);
 	}



* [PATCH 3/8] vfork: make it killable
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
  2011-07-27 16:32 ` [PATCH 1/8] vfork: introduce complete_vfork_done() Oleg Nesterov
  2011-07-27 16:32 ` [PATCH 2/8] vfork: introduce clone_vfork_finish() Oleg Nesterov
@ 2011-07-27 16:32 ` Oleg Nesterov
  2011-07-29 13:02   ` Matt Fleming
  2011-07-27 16:33 ` [PATCH 4/8] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:32 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

Make vfork() killable.

Change clone_vfork_finish() to do wait_for_completion_killable().
If it fails, we do not return to user mode and never touch the mm
shared with our child.

However, we should clear child->vfork_done before returning.
complete_vfork_done() and clone_vfork_finish() use xchg-and-check
to avoid racing with each other. If clone_vfork_finish() fails
to clear child->vfork_done, it does another wait_for_completion()
to ensure the child finishes the complete() in progress.
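
The xchg-and-check handoff can be sketched in plain userspace C11
atomics; a plain flag stands in for the kernel's struct completion, and
all names here are illustrative, not the kernel's:

```c
/*
 * Userspace sketch of the xchg-and-check handoff, using C11 atomics.
 * Both sides atomically exchange ->vfork_done with NULL; exactly one
 * of them gets the non-NULL pointer back and "owns" the completion,
 * so complete() can never run on a pointer the other side reclaimed.
 * A plain flag stands in for the kernel's struct completion.
 */
#include <stdatomic.h>
#include <stddef.h>

struct completion { int done; };

static _Atomic(struct completion *) vfork_done;

/* child side, as in complete_vfork_done() */
int child_complete(void)
{
	struct completion *c = atomic_exchange(&vfork_done, NULL);

	if (!c)
		return 0;	/* the parent already stole it */
	c->done = 1;		/* complete(c) analog */
	return 1;
}

/* killed-parent side: returns 0 if it lost the race and must wait */
int parent_steal(void)
{
	return atomic_exchange(&vfork_done, NULL) != NULL;
}
```

Whichever side loses the exchange sees NULL and backs off, which is the
property the patch relies on.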

NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK; kernel threads run with all signals ignored, including
SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/fork.c |   27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

--- 3.1/kernel/fork.c~3_make_killable	2011-07-26 19:26:03.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-26 20:23:28.000000000 +0200
@@ -688,7 +688,8 @@ void mm_release(struct task_struct *tsk,
 	 * If we're exiting normally, clear a user-space tid field if
 	 * requested.  We leave this alone when dying by signal, to leave
 	 * the value intact in a core dump, and to save the unnecessary
-	 * trouble otherwise.  Userland only wants this done for a sys_exit.
+	 * trouble, say, a killed vfork parent shouldn't touch this mm.
+	 * Userland only wants this done for a sys_exit.
 	 */
 	if (tsk->clear_child_tid) {
 		if (!(tsk->flags & PF_SIGNALED) &&
@@ -1443,18 +1444,25 @@ struct task_struct * __cpuinit fork_idle
 
 void complete_vfork_done(struct task_struct *tsk)
 {
-	struct completion *vfork_done = tsk->vfork_done;
+	struct completion *vfork_done = xchg(&tsk->vfork_done, NULL);
 
-	tsk->vfork_done = NULL;
-	complete(vfork_done);
+	if (vfork_done)
+		complete(vfork_done);
 }
 
 static long clone_vfork_finish(struct task_struct *child,
 				struct completion *vfork_done, long pid)
 {
-	freezer_do_not_count();
-	wait_for_completion(vfork_done);
-	freezer_count();
+	int killed = wait_for_completion_killable(vfork_done);
+
+	if (killed) {
+		struct completion *steal = xchg(&child->vfork_done, NULL);
+		/* if we race with complete_vfork_done() we have to wait */
+		if (unlikely(!steal))
+			wait_for_completion(vfork_done);
+
+		return -EINTR;
+	}
 
 	ptrace_event(PTRACE_EVENT_VFORK_DONE, pid);
 	return pid;
@@ -1527,6 +1535,7 @@ long do_fork(unsigned long clone_flags,
 			put_user(nr, parent_tidptr);
 
 		if (clone_flags & CLONE_VFORK) {
+			get_task_struct(p);
 			p->vfork_done = &vfork;
 			init_completion(&vfork);
 		}
@@ -1547,8 +1556,10 @@ long do_fork(unsigned long clone_flags,
 		if (unlikely(trace))
 			ptrace_event(trace, nr);
 
-		if (clone_flags & CLONE_VFORK)
+		if (clone_flags & CLONE_VFORK) {
 			nr = clone_vfork_finish(p, &vfork, nr);
+			put_task_struct(p);
+		}
 	} else {
 		nr = PTR_ERR(p);
 	}



* [PATCH 4/8] coredump_wait: don't call complete_vfork_done()
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (2 preceding siblings ...)
  2011-07-27 16:32 ` [PATCH 3/8] vfork: make it killable Oleg Nesterov
@ 2011-07-27 16:33 ` Oleg Nesterov
  2011-07-29 13:02   ` Matt Fleming
  2011-07-27 16:33 ` [PATCH 5/8] introduce find_get_task_by_vpid() Oleg Nesterov
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:33 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done(). zap_threads() should find and kill all tasks
with the same ->mm; this includes our parent if ->vfork_done is set.

mm_release() becomes the only caller, so unexport complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h |    1 -
 fs/exec.c             |    7 -------
 kernel/fork.c         |    3 ++-
 3 files changed, 2 insertions(+), 9 deletions(-)

--- 3.1/include/linux/sched.h~4_coredump_no_vfork_done	2011-07-26 20:23:28.000000000 +0200
+++ 3.1/include/linux/sched.h	2011-07-26 20:28:24.000000000 +0200
@@ -2253,7 +2253,6 @@ extern int do_execve(const char *,
 		     const char __user * const __user *,
 		     const char __user * const __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
-extern void complete_vfork_done(struct task_struct *tsk);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
--- 3.1/fs/exec.c~4_coredump_no_vfork_done	2011-07-26 20:23:28.000000000 +0200
+++ 3.1/fs/exec.c	2011-07-26 20:28:24.000000000 +0200
@@ -1899,13 +1899,6 @@ static int coredump_wait(int exit_code, 
 	if (unlikely(core_waiters < 0))
 		goto fail;
 
-	/*
-	 * Make sure nobody is waiting for us to release the VM,
-	 * otherwise we can deadlock when we wait on each other
-	 */
-	if (tsk->vfork_done)
-		complete_vfork_done(tsk);
-
 	if (core_waiters)
 		wait_for_completion(&core_state->startup);
 fail:
--- 3.1/kernel/fork.c~4_coredump_no_vfork_done	2011-07-26 20:23:28.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-26 20:28:24.000000000 +0200
@@ -647,6 +647,7 @@ struct mm_struct *get_task_mm(struct tas
 }
 EXPORT_SYMBOL_GPL(get_task_mm);
 
+static void complete_vfork_done(struct task_struct *tsk);
 /* Please note the differences between mmput and mm_release.
  * mmput is called whenever we stop holding onto a mm_struct,
  * error success whatever.
@@ -1442,7 +1443,7 @@ struct task_struct * __cpuinit fork_idle
 	return task;
 }
 
-void complete_vfork_done(struct task_struct *tsk)
+static void complete_vfork_done(struct task_struct *tsk)
 {
 	struct completion *vfork_done = xchg(&tsk->vfork_done, NULL);
 



* [PATCH 5/8] introduce find_get_task_by_vpid()
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (3 preceding siblings ...)
  2011-07-27 16:33 ` [PATCH 4/8] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
@ 2011-07-27 16:33 ` Oleg Nesterov
  2011-07-27 16:33 ` [PATCH 6/8] vfork: do not setup child->vfork_done beforehand Oleg Nesterov
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:33 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

Introduce the new helper, find_get_task_by_vpid(), which simply
does find_task_by_vpid() + get_task_struct() under rcu_read_lock().

It has several potential users which currently do this by hand,
and we are going to add clone_vfork_restart(), which needs it too.
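
The find + get pattern generalizes beyond tasks; here is a minimal
userspace sketch with a mutex standing in for rcu_read_lock() and a plain
refcount bump standing in for get_task_struct(). All names are
hypothetical:

```c
/*
 * Userspace sketch of the find + get pattern: the lookup and the
 * reference bump must happen under the same lock (rcu_read_lock()
 * in the kernel, a mutex here), otherwise the object could be
 * freed between "find" and "get".  All names are illustrative.
 */
#include <pthread.h>
#include <stddef.h>

struct obj {
	int id;
	int refcount;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj table[2] = { { 1, 1 }, { 2, 1 } };

static struct obj *find_obj(int id)	/* caller holds table_lock */
{
	for (size_t i = 0; i < 2; i++)
		if (table[i].id == id)
			return &table[i];
	return NULL;
}

struct obj *find_get_obj(int id)
{
	struct obj *o;

	pthread_mutex_lock(&table_lock);
	o = find_obj(id);
	if (o)
		o->refcount++;	/* get_task_struct() analog */
	pthread_mutex_unlock(&table_lock);

	return o;
}
```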

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h |    1 +
 kernel/pid.c          |   13 +++++++++++++
 2 files changed, 14 insertions(+)

--- 3.1/include/linux/sched.h~5_find_get_task	2011-07-26 20:35:55.000000000 +0200
+++ 3.1/include/linux/sched.h	2011-07-26 20:36:10.000000000 +0200
@@ -2107,6 +2107,7 @@ extern struct pid_namespace init_pid_ns;
  */
 
 extern struct task_struct *find_task_by_vpid(pid_t nr);
+extern struct task_struct *find_get_task_by_vpid(pid_t vnr);
 extern struct task_struct *find_task_by_pid_ns(pid_t nr,
 		struct pid_namespace *ns);
 
--- 3.1/kernel/pid.c~5_find_get_task	2011-07-26 15:50:15.000000000 +0200
+++ 3.1/kernel/pid.c	2011-07-26 20:33:18.000000000 +0200
@@ -427,6 +427,19 @@ struct task_struct *find_task_by_vpid(pi
 	return find_task_by_pid_ns(vnr, current->nsproxy->pid_ns);
 }
 
+struct task_struct *find_get_task_by_vpid(pid_t vnr)
+{
+	struct task_struct *task;
+
+	rcu_read_lock();
+	task = find_task_by_vpid(vnr);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	return task;
+}
+
 struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 {
 	struct pid *pid;



* [PATCH 6/8] vfork: do not setup child->vfork_done beforehand
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (4 preceding siblings ...)
  2011-07-27 16:33 ` [PATCH 5/8] introduce find_get_task_by_vpid() Oleg Nesterov
@ 2011-07-27 16:33 ` Oleg Nesterov
  2011-07-27 16:34 ` [PATCH 7/8] vfork: make it stoppable/traceable Oleg Nesterov
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:33 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

do_fork() allocates/initializes "struct completion" on stack and
sets p->vfork_done in advance, before the task becomes runnable.
This way clone_vfork_finish(p) can always trust ->vfork_done.

But, to make this restartable, we need to re-assign ->vfork_done
and wait in sys_restart_syscall(), and we must not do this if the
child has already passed complete_vfork_done().

With this patch we allocate/initialize and set child->vfork_done in
clone_vfork_finish() when we are actually going to wait. To prevent
a race with the child we use the fake VFORK_DONE_NOP marker set by
do_fork().

complete_vfork_done() does nothing if it sees VFORK_DONE_NOP, but
still sets ->vfork_done = NULL.

This way clone_vfork_finish() can do xchg(), and if it returns !NULL
we can be sure the child didn't exit yet, so we can safely wait.

IOW, the logic is:

	- do_fork(CLONE_VFORK)

		child->vfork_done = VFORK_DONE_NOP; // != NULL

	- complete_vfork_done() 	// child

		vfork = xchg(tsk->vfork_done, NULL);

	- clone_vfork_finish()		// parent

		struct completion vfork_done;
		if (xchg(&child->vfork_done, &vfork_done) != NULL)
			 wait_for_completion(&vfork_done);
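
The marker logic can be exercised in isolation. A minimal userspace
sketch, reusing the kernel's names for clarity (the completion struct is
a stand-in and classify() is made up for the example):

```c
/*
 * Userspace check of the VFORK_DONE_NOP pointer tagging.  The
 * marker is the non-NULL, non-dereferenceable value 1, so a single
 * comparison distinguishes "child already done" (NULL), "no waiter
 * yet" (NOP) and "parent is waiting" (a real completion pointer).
 */
#include <stddef.h>

struct completion { int done; };

#define VFORK_DONE_NOP		((struct completion *) 1)
#define vfork_done_valid(vf)	((unsigned long)(vf) > 1)

int classify(struct completion *vf)
{
	if (vf == NULL)
		return 0;	/* complete_vfork_done() already ran */
	if (!vfork_done_valid(vf))
		return 1;	/* NOP marker: nobody waits yet */
	return 2;		/* a real completion to complete() */
}
```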

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/fork.c |   31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

--- 3.1/kernel/fork.c~6_no_vfork_done	2011-07-26 20:28:24.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-26 21:13:56.000000000 +0200
@@ -1443,28 +1443,40 @@ struct task_struct * __cpuinit fork_idle
 	return task;
 }
 
+#define VFORK_DONE_NOP		((struct completion *) 1)
+#define vfork_done_valid(vf)	((unsigned long)(vf) > 1)
+
 static void complete_vfork_done(struct task_struct *tsk)
 {
 	struct completion *vfork_done = xchg(&tsk->vfork_done, NULL);
 
-	if (vfork_done)
+	if (vfork_done_valid(vfork_done))
 		complete(vfork_done);
 }
 
-static long clone_vfork_finish(struct task_struct *child,
-				struct completion *vfork_done, long pid)
+static long clone_vfork_finish(struct task_struct *child, long pid)
 {
-	int killed = wait_for_completion_killable(vfork_done);
+	struct completion vfork_done;
+	int killed;
+
+	init_completion(&vfork_done);
 
+	/* complete_vfork_done() was already called? */
+	if (xchg(&child->vfork_done, &vfork_done) == NULL)
+		goto done;
+
+	killed = wait_for_completion_killable(&vfork_done);
 	if (killed) {
-		struct completion *steal = xchg(&child->vfork_done, NULL);
+		struct completion *steal = xchg(&child->vfork_done,
+							VFORK_DONE_NOP);
 		/* if we race with complete_vfork_done() we have to wait */
 		if (unlikely(!steal))
-			wait_for_completion(vfork_done);
+			wait_for_completion(&vfork_done);
 
 		return -EINTR;
 	}
 
+done:
 	ptrace_event(PTRACE_EVENT_VFORK_DONE, pid);
 	return pid;
 }
@@ -1526,8 +1538,6 @@ long do_fork(unsigned long clone_flags,
 	 * might get invalid after that point, if the thread exits quickly.
 	 */
 	if (!IS_ERR(p)) {
-		struct completion vfork;
-
 		trace_sched_process_fork(current, p);
 
 		nr = task_pid_vnr(p);
@@ -1536,9 +1546,8 @@ long do_fork(unsigned long clone_flags,
 			put_user(nr, parent_tidptr);
 
 		if (clone_flags & CLONE_VFORK) {
+			p->vfork_done = VFORK_DONE_NOP;
 			get_task_struct(p);
-			p->vfork_done = &vfork;
-			init_completion(&vfork);
 		}
 
 		audit_finish_fork(p);
@@ -1558,7 +1567,7 @@ long do_fork(unsigned long clone_flags,
 			ptrace_event(trace, nr);
 
 		if (clone_flags & CLONE_VFORK) {
-			nr = clone_vfork_finish(p, &vfork, nr);
+			nr = clone_vfork_finish(p, nr);
 			put_task_struct(p);
 		}
 	} else {



* [PATCH 7/8] vfork: make it stoppable/traceable
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (5 preceding siblings ...)
  2011-07-27 16:33 ` [PATCH 6/8] vfork: do not setup child->vfork_done beforehand Oleg Nesterov
@ 2011-07-27 16:34 ` Oleg Nesterov
  2011-07-27 16:34 ` [PATCH 8/8] vfork: do not block SIG_DFL/SIG_IGN signals if single-threaded Oleg Nesterov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:34 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

Make vfork() stoppable/traceable.

Change clone_vfork_finish() paths to block all signals except
SIGKILL and SIGSTOP and do wait_for_completion_interruptible().

This means we should restart after the stop/ptrace_attach or a
spurious wakeup; implement clone_vfork_restart() for that.

-ERESTART_RESTARTBLOCK is safe: we can never dequeue a signal
which has a handler, thus we can never return to user space.
Unless the debugger changes regs of course, but this is fine.

Note:

	- This changes the "killable" behavior, the vforking task
	  doesn't react to fatal signals except SIGKILL. See
	  the next patch.

	- The code asks for final cleanups, for example we
	  should move put_task_struct() into clone_vfork_finish()
	  and simplify the usage of ->restart_block. Will be done
	  later.

	- We use ->saved_sigmask to record the original sigmask.
	  This is safe, nobody should play with it, we do not do
	  set_restore_sigmask(). Still it would be cleaner to
	  use restart_block->vfork, but then we should somehow
	  export sigset_t for thread_info.h.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/thread_info.h |    4 ++++
 kernel/fork.c               |   37 +++++++++++++++++++++++++++++++++++--
 2 files changed, 39 insertions(+), 2 deletions(-)

--- 3.1/include/linux/thread_info.h~7_vfork_restart	2011-07-27 15:27:38.000000000 +0200
+++ 3.1/include/linux/thread_info.h	2011-07-27 15:28:43.000000000 +0200
@@ -44,6 +44,10 @@ struct restart_block {
 			unsigned long tv_sec;
 			unsigned long tv_nsec;
 		} poll;
+
+		struct {
+			long pid;
+		} vfork;
 	};
 };
 
--- 3.1/kernel/fork.c~7_vfork_restart	2011-07-27 15:27:38.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-27 16:01:01.000000000 +0200
@@ -1454,18 +1454,24 @@ static void complete_vfork_done(struct t
 		complete(vfork_done);
 }
 
+static long clone_vfork_restart(struct restart_block *);
+
 static long clone_vfork_finish(struct task_struct *child, long pid)
 {
+	struct restart_block *restart = &current_thread_info()->restart_block;
 	struct completion vfork_done;
 	int killed;
 
+	if (!child || child->real_parent != current)
+		goto done;
+
 	init_completion(&vfork_done);
 
 	/* complete_vfork_done() was already called? */
 	if (xchg(&child->vfork_done, &vfork_done) == NULL)
 		goto done;
 
-	killed = wait_for_completion_killable(&vfork_done);
+	killed = wait_for_completion_interruptible(&vfork_done);
 	if (killed) {
 		struct completion *steal = xchg(&child->vfork_done,
 							VFORK_DONE_NOP);
@@ -1473,14 +1479,40 @@ static long clone_vfork_finish(struct ta
 		if (unlikely(!steal))
 			wait_for_completion(&vfork_done);
 
-		return -EINTR;
+		restart->fn = clone_vfork_restart;
+		restart->vfork.pid = pid;
+
+		return -ERESTART_RESTARTBLOCK;
 	}
 
 done:
+	restart->fn = do_no_restart_syscall; /* not really needed */
+	set_current_blocked(&current->saved_sigmask);
 	ptrace_event(PTRACE_EVENT_VFORK_DONE, pid);
 	return pid;
 }
 
+static long clone_vfork_restart(struct restart_block *restart)
+{
+	long pid = restart->vfork.pid;
+	struct task_struct *child = find_get_task_by_vpid(pid);
+	long ret;
+
+	ret = clone_vfork_finish(child, pid);
+	if (child)
+		put_task_struct(child);
+	return ret;
+}
+
+static void clone_vfork_prepare(void)
+{
+	sigset_t vfork_mask;
+
+	current->saved_sigmask = current->blocked;
+	siginitsetinv(&vfork_mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
+	set_current_blocked(&vfork_mask);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
@@ -1567,6 +1599,7 @@ long do_fork(unsigned long clone_flags,
 			ptrace_event(trace, nr);
 
 		if (clone_flags & CLONE_VFORK) {
+			clone_vfork_prepare();
 			nr = clone_vfork_finish(p, nr);
 			put_task_struct(p);
 		}



* [PATCH 8/8] vfork: do not block SIG_DFL/SIG_IGN signals if single-threaded
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (6 preceding siblings ...)
  2011-07-27 16:34 ` [PATCH 7/8] vfork: make it stoppable/traceable Oleg Nesterov
@ 2011-07-27 16:34 ` Oleg Nesterov
  2011-07-27 16:34 ` [PATCH 9/8] kill PF_STARTING Oleg Nesterov
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:34 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

vfork() blocks all signals except SIGKILL and SIGSTOP. This means
it doesn't react to ^Z or other fatal signals.

We can unblock all signals which don't have a handler and solve
this. Unfortunately, without really ugly hacks we can not do
this in the multithreaded case; we can not trust sighand->action[]
otherwise.

Let's do this in the single-threaded case at least. Anyway, I do
not think that vfork() from a multithreaded application is sane.
And even in this case other threads can handle the blocked signals
unless they exit after clone_vfork_prepare().

Note: "sighand->count == 1" doesn't handle the dead-leader case;
this is easy to fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/fork.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- 3.1/kernel/fork.c~8_more_signals	2011-07-27 17:48:53.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-27 17:51:54.000000000 +0200
@@ -1506,10 +1506,26 @@ static long clone_vfork_restart(struct r
 
 static void clone_vfork_prepare(void)
 {
+	struct sighand_struct *sigh = current->sighand;
 	sigset_t vfork_mask;
 
-	current->saved_sigmask = current->blocked;
 	siginitsetinv(&vfork_mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
+	if (atomic_read(&sigh->count) == 1) {
+		__sighandler_t h;
+		int signr;
+		/*
+		 * Nobody can play with ->action[], we can unblock all
+		 * signals which do not have a handler, they can not
+		 * trigger return-to-user-mode.
+		 */
+		for (signr = 1; signr <= _NSIG; ++signr) {
+			h = sigh->action[signr-1].sa.sa_handler;
+			if (h == SIG_DFL || h == SIG_IGN)
+				sigdelset(&vfork_mask, signr);
+		}
+	}
+
+	current->saved_sigmask = current->blocked;
 	set_current_blocked(&vfork_mask);
 }
 



* [PATCH 9/8] kill PF_STARTING
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (7 preceding siblings ...)
  2011-07-27 16:34 ` [PATCH 8/8] vfork: do not block SIG_DFL/SIG_IGN signals if single-threaded Oleg Nesterov
@ 2011-07-27 16:34 ` Oleg Nesterov
  2011-07-27 19:39 ` [PATCH 0/8] make vfork killable/restartable/traceable Linus Torvalds
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-27 16:34 UTC (permalink / raw)
  To: Linus Torvalds, Roland McGrath, Tejun Heo
  Cc: Denys Vlasenko, KOSAKI Motohiro, Matt Fleming, linux-kernel

Previously it was (ab)used by utrace. Then it was wrongly used by
the scheduler code.

Currently it is unused; kill it before it finds a new erroneous
user.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h |    1 -
 kernel/fork.c         |    9 ---------
 2 files changed, 10 deletions(-)

--- 3.1/include/linux/sched.h~9_kill_pf_starting	2011-07-27 17:48:36.000000000 +0200
+++ 3.1/include/linux/sched.h	2011-07-27 17:53:21.000000000 +0200
@@ -1756,7 +1756,6 @@ extern void thread_group_times(struct ta
 /*
  * Per process flags
  */
-#define PF_STARTING	0x00000002	/* being created */
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
 #define PF_VCPU		0x00000010	/* I'm a virtual CPU */
--- 3.1/kernel/fork.c~9_kill_pf_starting	2011-07-27 17:51:54.000000000 +0200
+++ 3.1/kernel/fork.c	2011-07-27 17:54:09.000000000 +0200
@@ -992,7 +992,6 @@ static void copy_flags(unsigned long clo
 
 	new_flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
 	new_flags |= PF_FORKNOEXEC;
-	new_flags |= PF_STARTING;
 	p->flags = new_flags;
 	clear_freeze_flag(p);
 }
@@ -1600,14 +1599,6 @@ long do_fork(unsigned long clone_flags,
 
 		audit_finish_fork(p);
 
-		/*
-		 * We set PF_STARTING at creation in case tracing wants to
-		 * use this to distinguish a fully live task from one that
-		 * hasn't finished SIGSTOP raising yet.  Now we clear it
-		 * and set the child going.
-		 */
-		p->flags &= ~PF_STARTING;
-
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */



* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (8 preceding siblings ...)
  2011-07-27 16:34 ` [PATCH 9/8] kill PF_STARTING Oleg Nesterov
@ 2011-07-27 19:39 ` Linus Torvalds
  2011-07-28 13:59   ` Oleg Nesterov
  2011-07-27 22:38 ` Pedro Alves
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2011-07-27 19:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roland McGrath, Tejun Heo, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel

On Wed, Jul 27, 2011 at 9:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> CLONE_VFORK sleeps in TASK_UNINTERRUPTIBLE until the child exits/execs.
> This is obviously not good, it is sooo simple to create a task which
> doesn't react to SIGKILL/SIGSTOP.

Well, I don't know how bad that is. You just kill the child instead.
That's how vfork has always worked, not just on Linux.

And quite frankly, I think your patches 1-3 are unbelievably ugly. If
it was some simple and straightforward "use
wait_for_completion_killable() instead", I wouldn't mind it. But I
think you made a simple and clean sequence convoluted and annoying.

I *suspect* that the killable() thing could be done more nicely by
moving the vfork_completion into the parent instead, and maybe the
vfork cleanup could just use
"complete(&task->parent->vfork_completion);" instead (so if the parent
goes away, it completes some irrelevant init case instead).

So *if* this can be done while still having straightforward code, I
think it might be worth doing. But patches 1-3 just make me go "not
worth the ugliness, especially since it's not a real problem".

                               Linus


* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (9 preceding siblings ...)
  2011-07-27 19:39 ` [PATCH 0/8] make vfork killable/restartable/traceable Linus Torvalds
@ 2011-07-27 22:38 ` Pedro Alves
  2011-07-29 19:23 ` Tejun Heo
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pedro Alves @ 2011-07-27 22:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, Matt Fleming, linux-kernel

On Wednesday 27 July 2011 17:31:59, Oleg Nesterov wrote:
> Hello.
> 
> CLONE_VFORK sleeps in TASK_UNINTERRUPTIBLE until the child exits/execs.
> This is obviously not good, it is sooo simple to create a task which
> doesn't react to SIGKILL/SIGSTOP.
> 
> Questions:
> 
> 	- do we really need this?
> 
> 	  I think we do. This really "looks like a bug" in any case,
> 	  even if nobody ever complained afaik.
> 
> 	- may be 1-3 is enough?
> 
> 	  may be... but personally I think SIGSTOP/ptrace should work
> 	  too.

Thanks, me too.  If not SIGSTOP, then PTRACE_STOP.  The fact that it
doesn't work has led to the need for workarounds in gdb.  Do you see
any way gdb will be able to check at runtime that its workarounds
won't be needed anymore?

-- 
Pedro Alves


* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-07-27 19:39 ` [PATCH 0/8] make vfork killable/restartable/traceable Linus Torvalds
@ 2011-07-28 13:59   ` Oleg Nesterov
  2011-07-28 14:58     ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-28 13:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland McGrath, Tejun Heo, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel

On 07/27, Linus Torvalds wrote:
>
> On Wed, Jul 27, 2011 at 9:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > CLONE_VFORK sleeps in TASK_INTERRUPTIBLE until the child exits/execs.
> > This is obviously not good, it is sooo simple to create the task which
> > doesn't react to SIGKILL/SIGSTOP.
>
> Well, I don't know how bad that is.

Well, me too. That is why 0/8 starts with "do we really need this?".
I expected you wouldn't be happy ;)

However.

> You just kill the child instead.

Sure. Assuming you know what happened and who should be killed.
Yes, it is not that hard to figure out. Still, vfork() is the
only example which allows creating an unkillable/unstoppable
task in a trivial way.

> And quite frankly, I think your patches 1-3 are unbelievably ugly. If
> it was some simple and straightforward "use
> wait_for_completion_killable() instead", I wouldn't mind it. But I
> think you made a simple and clean sequence convoluted and annoying.

Yes, I agree, this doesn't make the code simpler. The question is
whether they are more ugly than necessary. Maybe... I'll try to think
a bit more.

And just in case. Personally I think that "unstoppable" is worse
than "unkillable". Suppose that you run the "good" application
which doesn't abuse vfork/signals but does something like

	if (!vfork()) {
		do_something();
		execve(...);
	}

In this case ^C always works, even if do_something() blocks for
some reason.

But it is quite possible that ^Z "hangs" just because it races
with vfork().

> I *suspect* that the killable() thing could be done more nicely by
> moving the vfork_completion into the parent instead, and maybe the
> vfork cleanup could just use
> "complete(&task->parent->vfork_completion);" instead

I thought about moving the "vfork_done" thing (in some form) from
child to parent. So far I do not see a clean solution.

For example. If we simply use ->real_parent->vfork_completion, then
yes, we do not need to communicate with the child, the child can rely
on rcu to ensure "struct completion" can't go away. But, this bloats
task_struct a bit, and:

> (so if the parent
> goes away, it completes some irrelevant init case instead).

This assumes /sbin/init can't sleep in CLONE_VFORK. So we need some
complications again.

Not to mention, kthread/kthread_stop should be reworked somehow.

> especially since it's not a real problem

Well. Personally I don't agree.

I'll try to simplify the patches. I am not sure I can do something
really simple.

For example, 3/8 can do

	// called by mm_release()

	void complete_vfork_done(struct task_struct *tsk)
	{
		struct completion *vfork_done;

		task_lock(tsk);
		vfork_done = tsk->vfork_done;
		if (vfork_done) {
			tsk->vfork_done = NULL; // UNNEEDED
			complete(vfork_done);
		}
		task_unlock(tsk);
	}

	// used by do fork instead of wait_for_completion()

	static long wait_for_vfork_done(struct task_struct *child,
					struct completion *vfork_done)
	{
		int killed = wait_for_completion_killable(vfork_done);

		if (killed) {
			task_lock(child);
			child->vfork_done = NULL;
			task_unlock(child);

			return -EINTR;
		}

		return 0;
	}

Does this look "not too ugly" to you or not?

Oleg.



* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-07-28 13:59   ` Oleg Nesterov
@ 2011-07-28 14:58     ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-28 14:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland McGrath, Tejun Heo, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel

On 07/28, Oleg Nesterov wrote:
>
> On 07/27, Linus Torvalds wrote:
> >
> > I *suspect* that the killable() thing could be done more nicely by
> > moving the vfork_completion into the parent instead, and maybe the
> > vfork cleanup could just use
> > "complete(&task->parent->vfork_completion);" instead
>
> I thought about moving the "vfork_done" thing (in some form) from
> child to parent. So far I do not see a clean solution.

Just in case...

We can also do

	- struct completion *vfork_done;
	+ struct completion vfork_done;

in struct task_struct. This can really simplify these changes.
But we still need to get/put the child, and this bloats task_struct.

Oleg.



* Re: [PATCH 3/8] vfork: make it killable
  2011-07-27 16:32 ` [PATCH 3/8] vfork: make it killable Oleg Nesterov
@ 2011-07-29 13:02   ` Matt Fleming
  2011-07-29 14:32     ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Matt Fleming @ 2011-07-29 13:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel

On Wed, 2011-07-27 at 18:32 +0200, Oleg Nesterov wrote:

[...]
 
>  static long clone_vfork_finish(struct task_struct *child,
>  				struct completion *vfork_done, long pid)
>  {
> -	freezer_do_not_count();
> -	wait_for_completion(vfork_done);
> -	freezer_count();
> +	int killed = wait_for_completion_killable(vfork_done);
> +
> +	if (killed) {
> +		struct completion *steal = xchg(&child->vfork_done, NULL);
> +		/* if we race with complete_vfork_done() we have to wait */
> +		if (unlikely(!steal))
> +			wait_for_completion(vfork_done);
> +
> +		return -EINTR;
> +	}

Hmm.. isn't this inherently racy anyway? Why go to the trouble of trying
to handle this race at all?



* Re: [PATCH 4/8] coredump_wait: don't call complete_vfork_done()
  2011-07-27 16:33 ` [PATCH 4/8] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
@ 2011-07-29 13:02   ` Matt Fleming
  2011-07-29 14:25     ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Matt Fleming @ 2011-07-29 13:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel

On Wed, 2011-07-27 at 18:33 +0200, Oleg Nesterov wrote:
> Now that CLONE_VFORK is killable, coredump_wait() no longer needs
> complete_vfork_done(). zap_threads() should find and kill all tasks
> with the same ->mm, this includes our parent if ->vfork_done is set.
> 
> mm_release() becomes the only caller, unexport complete_vfork_done().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Isn't there a subtle change in user-visible behaviour regarding wait()
with this patch?

Before the patch, if a child dumps its core it will wakeup the parent
which can read the status of the child via wait(), whereas with this
patch applied the parent will actually be killed along with the child.

I'm not trying to say which behaviour I think is the correct one, just
that because it is a user-visible change and it is maybe worth a bit
more discussion in the changelog?



* Re: [PATCH 4/8] coredump_wait: don't call complete_vfork_done()
  2011-07-29 13:02   ` Matt Fleming
@ 2011-07-29 14:25     ` Oleg Nesterov
  2011-07-29 15:26       ` Matt Fleming
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-29 14:25 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel

On 07/29, Matt Fleming wrote:
>
> On Wed, 2011-07-27 at 18:33 +0200, Oleg Nesterov wrote:
> > Now that CLONE_VFORK is killable, coredump_wait() no longer needs
> > complete_vfork_done(). zap_threads() should find and kill all tasks
> > with the same ->mm, this includes our parent if ->vfork_done is set.
> >
> > mm_release() becomes the only caller, unexport complete_vfork_done().
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> Isn't there a subtle change in user-visible behaviour regarding wait()
> with this patch?
>
> Before the patch, if a child dumps its core it will wakeup the parent
> which can read the status of the child via wait(), whereas with this
> patch applied the parent will actually be killed along with the child.

No.

Please note that if ->vfork_done != NULL, then ->real_parent shares
->mm with us, by definition of CLONE_VFORK.

In this case, with or without this patch, the parent was already
killed by zap_threads(). It can never do wait() or something else.

However. before 3/8, it was necessary to wakeup the TASK_UNINTERRUPTIBLE
parent, otherwise we deadlock. Once again, it can't do anything,
it will die immediately because of fatal_signal_pending().

After 3/8, zap_process()->signal_wake_up(1) wakes up the KILLABLE
parent, no need to do complete().

Oleg.



* Re: [PATCH 3/8] vfork: make it killable
  2011-07-29 13:02   ` Matt Fleming
@ 2011-07-29 14:32     ` Oleg Nesterov
  2011-07-29 15:32       ` Matt Fleming
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-29 14:32 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel

On 07/29, Matt Fleming wrote:
>
> On Wed, 2011-07-27 at 18:32 +0200, Oleg Nesterov wrote:
>
> [...]
>
> >  static long clone_vfork_finish(struct task_struct *child,
> >  				struct completion *vfork_done, long pid)
> >  {
> > -	freezer_do_not_count();
> > -	wait_for_completion(vfork_done);
> > -	freezer_count();
> > +	int killed = wait_for_completion_killable(vfork_done);
> > +
> > +	if (killed) {
> > +		struct completion *steal = xchg(&child->vfork_done, NULL);
> > +		/* if we race with complete_vfork_done() we have to wait */
> > +		if (unlikely(!steal))
> > +			wait_for_completion(vfork_done);
> > +
> > +		return -EINTR;
> > +	}
>
> Hmm.. isn't this inherently racy anyway? Why go to the trouble of trying
> to handle this race at all?

Suppose the child does xchg() and sees vfork_done != NULL. In this
case the parent shouldn't return from do_fork() until the child
does complete(), since this "struct completion" was allocated on the
parent's stack.

OK, I am starting to agree this looks overcomplicated, task_lock()
can make the code look simpler (see 0/8).

Oleg.



* Re: [PATCH 4/8] coredump_wait: don't call complete_vfork_done()
  2011-07-29 14:25     ` Oleg Nesterov
@ 2011-07-29 15:26       ` Matt Fleming
  0 siblings, 0 replies; 54+ messages in thread
From: Matt Fleming @ 2011-07-29 15:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel

On Fri, 2011-07-29 at 16:25 +0200, Oleg Nesterov wrote:
>
> Please note that if ->vfork_done != NULL, then ->real_parent shares
> ->mm with us, by definition of CLONE_VFORK.
> 
> In this case, with or without this patch, the parent was already
> killed by zap_threads(). It can never do wait() or something else.
> 
> However. before 3/8, it was necessary to wakeup the TASK_UNINTERRUPTIBLE
> parent, otherwise we deadlock. Once again, it can't do anything,
> it will die immediately because of fatal_signal_pending().

D'oh, yes. I see now, thanks!



* Re: [PATCH 3/8] vfork: make it killable
  2011-07-29 14:32     ` Oleg Nesterov
@ 2011-07-29 15:32       ` Matt Fleming
  0 siblings, 0 replies; 54+ messages in thread
From: Matt Fleming @ 2011-07-29 15:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel

On Fri, 2011-07-29 at 16:32 +0200, Oleg Nesterov wrote:
> On 07/29, Matt Fleming wrote:
> >
> > On Wed, 2011-07-27 at 18:32 +0200, Oleg Nesterov wrote:
> >
> > [...]
> >
> > >  static long clone_vfork_finish(struct task_struct *child,
> > >  				struct completion *vfork_done, long pid)
> > >  {
> > > -	freezer_do_not_count();
> > > -	wait_for_completion(vfork_done);
> > > -	freezer_count();
> > > +	int killed = wait_for_completion_killable(vfork_done);
> > > +
> > > +	if (killed) {
> > > +		struct completion *steal = xchg(&child->vfork_done, NULL);
> > > +		/* if we race with complete_vfork_done() we have to wait */
> > > +		if (unlikely(!steal))
> > > +			wait_for_completion(vfork_done);
> > > +
> > > +		return -EINTR;
> > > +	}
> >
> > Hmm.. isn't this inherently racy anyway? Why go to the trouble of trying
> > to handle this race at all?
> 
> Suppose the child does xchg() and sees vfork_done != NULL. In this
> case the parent shouldn't return from do_fork() until the child
> does complete(), this "struct completion" was allocated on parent's
> stack.
> 
> OK, I am starting to agree this looks overcomplicated, task_lock()
> can make the code look simpler (see 0/8).

Yeah, I think the code in 0/8 looks like a better solution.




* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (10 preceding siblings ...)
  2011-07-27 22:38 ` Pedro Alves
@ 2011-07-29 19:23 ` Tejun Heo
  2011-08-12 17:55   ` [PATCH v2 0/3] make vfork killable Oleg Nesterov
       [not found] ` <20110727163610.GJ23793@redhat.com>
  2011-08-10 21:44 ` [PATCH 0/8] make vfork killable/restartable/traceable Pavel Machek
  13 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2011-07-29 19:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel

Hello, Oleg.

On Wed, Jul 27, 2011 at 06:31:59PM +0200, Oleg Nesterov wrote:
> 	- is it safe to exit/stop on !x86 machine???
> 
> 	  I do not know. May be this needs some #ifdef's around
> 	  wait_for_completion_interruptible(). I am not sure that,
> 	  say, arch_ptrace_stop() can't abuse the ->mm shared with
> 	  the child.
>
> 	  OTOH. This can happen anyway, do_fork() does ptrace_event()
> 	  before wait_for_completion().

The only affected architecture would be ia64 thanks to the rolling
register stack.  I suspected it would unroll on fork but couldn't
determine whether it actually does that.  At any rate, it should be
safe as the child isn't allowed to return from the forking function
and arch_ptrace_stop() unrolling the register stack shouldn't affect
the child.  Eh... too much speculation, probably best to ask ia64 ppl.

A wild, possibly bat-shit crazy idea about the whole series:

If the current implementation is too nasty, an alternative approach
could be handling vfork waiting as a type of job control stop.
ie. while a task has vforked children, it's considered to have sticky
STOP state and can't leave get_signal_to_deliver() unless killed.  I
think this could be a bit more natural than trying to allow restart of
vfork().

So, very roughly, something like the following for the parent.

* On vfork(), mark vfork child exists and set TIF_SIGPENDING.

* When entering get_signal_to_deliver(), if vfork child exists, save
  sigmask and block all blockable signals.

* If dequeue_signal() returns 0 && vfork child exists, enter STOP.

* When leaving get_signal_to_deliver(), restore sigmask if saved on
  entry.

Haven't really thought a lot about the details so this might end up
uglier than the current attempt.  :)

Thanks.

-- 
tejun


* mm->oom_disable_count is broken
       [not found]           ` <20110730143426.GA6061@redhat.com>
@ 2011-07-30 15:22             ` Oleg Nesterov
  2011-08-01 11:52               ` KOSAKI Motohiro
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-07-30 15:22 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, Matt Fleming, linux-kernel, Andrey Vagin,
	Frantisek Hrbata, Ying Han

On 07/30, Oleg Nesterov wrote:
>
> So I'd suggest this hack^Wpatch as a quick fix. I still think this
> code is not right, but the patch tries to keep the current logic.

And this reminds me. mm->oom_disable_count looks absolutely broken.
IIRC, I already complained but nobody replied.

What does this counter mean?

	/* How many tasks sharing this mm are OOM_DISABLE */
	atomic_t oom_disable_count;

tasks? processes or threads?

Let's look at oom_adjust_write(),

		if (task->signal->oom_adj == OOM_DISABLE)
			atomic_inc(&task->mm->oom_disable_count);

OK, so it is per-process. No matter how many threads this process
has, mm->oom_disable_count becomes 1 (ignoring CLONE_VM).


However, exit_mm() does:

	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
		atomic_dec(&mm->oom_disable_count);

but this is per-thread! it becomes zero and then negative after
pthread_exit().


copy_process()->copy_mm() seems to think it is per-thread too. But,
bad_fork_cleanup_mm:

	if (p->mm) {
		task_lock(p);
		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
			atomic_dec(&p->mm->oom_disable_count);
		task_unlock(p);
		mmput(p->mm);
	}

Why do we take task_lock()? OK, oom_score_adj_write() does task_lock()
too, but this can't help in the multithreaded case. And why does
copy_mm() check ->oom_score_adj locklessly?

Oleg.



* Re: mm->oom_disable_count is broken
  2011-07-30 15:22             ` mm->oom_disable_count is broken Oleg Nesterov
@ 2011-08-01 11:52               ` KOSAKI Motohiro
  2011-08-29 18:37                 ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: KOSAKI Motohiro @ 2011-08-01 11:52 UTC (permalink / raw)
  To: oleg
  Cc: rientjes, akpm, torvalds, roland, tj, dvlasenk, matt.fleming,
	linux-kernel, avagin, fhrbata, yinghan

(2011/07/31 0:22), Oleg Nesterov wrote:
> On 07/30, Oleg Nesterov wrote:
>>
>> So I'd suggest this hack^Wpatch as a quick fix. I still think this
>> code is not right, but the patch tries to keep the current logic.
> 
> And this reminds me. mm->oom_disable_count looks absolutely broken.
> IIRC, I already complained but nobody replied.
>
> What does this counter mean?
> 
> 	/* How many tasks sharing this mm are OOM_DISABLE */
> 	atomic_t oom_disable_count;
> 
> tasks? processes or threads?
> 
> Lets look at oom_adjust_write(),
> 
> 		if (task->signal->oom_adj == OOM_DISABLE)
> 			atomic_inc(&task->mm->oom_disable_count);
> 
> OK, so it is per-process. No matter how many threads this process
> has, mm->oom_disable_count becomes 1 (ignoring CLONE_VM).
> 
> 
> However, exit_mm() does:
> 
> 	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> 		atomic_dec(&mm->oom_disable_count);
> 
> but this is per-thread! it becomes zero and then negative after
> pthread_exit().
> 
> 
> copy_process()->copy_mm() seems to think it is per-thread too. But,
> bad_fork_cleanup_mm:
> 
> 	if (p->mm) {
> 		task_lock(p);
> 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> 			atomic_dec(&p->mm->oom_disable_count);
> 		task_unlock(p);
> 		mmput(p->mm);
> 	}
> 
> Why do we take task_lock() ? OK, oom_score_adj_write() does task_lock()
> too, but this can't help in the multithreaded case? Why copy_mm() checks
> ->oom_score_adj lockless?

IIRC, I did point out this issue, but nobody replied.
I think ->oom_disable_count is currently broken, but right now I have
no time to audit this stuff. So I'd suggest reverting this code if
nobody fixes it.






* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
                   ` (12 preceding siblings ...)
       [not found] ` <20110727163610.GJ23793@redhat.com>
@ 2011-08-10 21:44 ` Pavel Machek
  2011-08-11 16:09   ` Oleg Nesterov
  13 siblings, 1 reply; 54+ messages in thread
From: Pavel Machek @ 2011-08-10 21:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, Matt Fleming, linux-kernel

Hi!

> CLONE_VFORK sleeps in TASK_INTERRUPTIBLE until the child exits/execs.
> This is obviously not good, it is sooo simple to create the task which
> doesn't react to SIGKILL/SIGSTOP.
> 
> Questions:
> 
> 	- do we really need this?
> 
> 	  I think we do. This really "looks like a bug" in any case,
> 	  even if nobody ever complained afaik.
> 
> 	- may be 1-3 is enough?
> 
> 	  may be... but personally I think SIGSTOP/ptrace should work
> 	  too.
> 
> 	- is it safe to exit/stop on !x86 machine???
> 
> 	  I do not know. May be this needs some #ifdef's around
> 	  wait_for_completion_interruptible(). I am not sure that,
> 	  say, arch_ptrace_stop() can't abuse the ->mm shared with
> 	  the child.
> 
> 	  OTOH. This can happen anyway, do_fork() does ptrace_event()
> 	  before wait_for_completion().
> 
> 	- and of course, while I think this is bugfix, this is user
> 	  visible change.
> 
> Please comment.

I believe we should fix it. I was always claiming "if it does not
react to SIGKILL, it is a kernel bug" and well, this just proved me
wrong...
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-08-10 21:44 ` [PATCH 0/8] make vfork killable/restartable/traceable Pavel Machek
@ 2011-08-11 16:09   ` Oleg Nesterov
  2011-08-11 16:22     ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-11 16:09 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Roland McGrath, Tejun Heo, Denys Vlasenko,
	KOSAKI Motohiro, Matt Fleming, linux-kernel

On 08/10, Pavel Machek wrote:
>
> > CLONE_VFORK sleeps in TASK_INTERRUPTIBLE until the child exits/execs.
> > This is obviously not good, it is sooo simple to create the task which
> > doesn't react to SIGKILL/SIGSTOP.
> >
> > Questions:
> >
> > 	- do we really need this?
> >
> I believe we should fix it. I was always claiming "if it does not
> react to SIGKILL, it is a kernel bug" and well, this just proved me
> wrong...

Agreed.

Just in case... sorry to all for the delay, I am busy. I'll try to
resend the simplified version "soon", at least the "unkillable" part.
And we also have the pending oom issues...

Oleg.



* Re: [PATCH 0/8] make vfork killable/restartable/traceable
  2011-08-11 16:09   ` Oleg Nesterov
@ 2011-08-11 16:22     ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2011-08-11 16:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Machek, Linus Torvalds, Roland McGrath, Denys Vlasenko,
	KOSAKI Motohiro, Matt Fleming, linux-kernel

Hello,

On Thu, Aug 11, 2011 at 06:09:14PM +0200, Oleg Nesterov wrote:
> On 08/10, Pavel Machek wrote:
> >
> > > CLONE_VFORK sleeps in TASK_INTERRUPTIBLE until the child exits/execs.
> > > This is obviously not good, it is sooo simple to create the task which
> > > doesn't react to SIGKILL/SIGSTOP.
> > >
> > > Questions:
> > >
> > > 	- do we really need this?
> > >
> > I believe we should fix it. I was always claiming "if it does not
> > react to SIGKILL, it is a kernel bug" and well, this just proved me
> > wrong...
> 
> Agreed.
> 
> Just in case... sorry to all for delay, I am busy. I'll try to resend
> the simplified version "soon", at least the "unkillable" part. And
> we also have the pending oom issues...

I'm currently trying to handle both vfork and cgroup freezes as jobctl
stop so that all three different stop modes - group stop, vfork wait
and cgroup freeze - share the same characteristics.  I'm currently
somewhat stuck with some freezer and kthread issues I discovered in
the process and it seems like it would take some time.

Thanks.

-- 
tejun


* [PATCH v2 0/3] make vfork killable
  2011-07-29 19:23 ` Tejun Heo
@ 2011-08-12 17:55   ` Oleg Nesterov
  2011-08-12 17:56     ` [PATCH 1/3] vfork: introduce complete_vfork_done() Oleg Nesterov
                       ` (4 more replies)
  0 siblings, 5 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-12 17:55 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Roland McGrath, Denys Vlasenko, KOSAKI Motohiro, Matt Fleming,
	linux-kernel, Pavel Machek

Hi Tejun,

On 07/29, Tejun Heo wrote:
>
> If the current implementation is too nasty,

OK, I agree, the patches I sent don't look very clear/clean.

But,

> an alternative approach
> could be handling vfork waiting as a type of job control stop.

Well, I didn't see the code, but to be honest this doesn't look
like a good idea to me. Firstly, I personally do not think this
has anything to do with job control stop.

And, to me sys_restart_syscall() looks like the natural and simple
approach.

> * When entering get_signal_to_deliver(), if vfork child exists, save
>   sigmask and block all blockable signals.

Oh, I'd like to avoid this. Why should we change get_signal_to_deliver()
paths to help vfork?

> * When leaving get_signal_to_deliver(), restore sigmask if saved on
>   entry.

And I _think_ we would need many more complications. We still need to
communicate with the child, for example. Unless we are going to
add a "struct completion vfork_done" or something into task_struct,
I personally dislike this idea.

> Haven't really thought a lot about the details so this might end up
> uglier than the current attempt.  :)

I _hope_ it is much uglier, but I can be wrong of course ;)

OK. Can't we make the first step at least? Make it killable. I think
this will simplify the next changes anyway.

So, this series makes vfork() killable. No restarts, just
s/wait_for_completion/wait_for_completion_killable/, plus clearing
child->vfork_done if killed.

The only overhead this patch adds to CLONE_VFORK

	- parent does get_task_struct() + put_task_struct()

	- child does task_lock() + task_unlock()

4/3 is off-topic; somehow I irrationally hate PF_STARTING, and it
helps to make the diffstat below better ;)

Oleg.

 fs/exec.c             |   18 +-------------
 include/linux/sched.h |    1 -
 kernel/fork.c         |   61 +++++++++++++++++++++++++++++++-----------------
 3 files changed, 41 insertions(+), 39 deletions(-)




* [PATCH 1/3] vfork: introduce complete_vfork_done()
  2011-08-12 17:55   ` [PATCH v2 0/3] make vfork killable Oleg Nesterov
@ 2011-08-12 17:56     ` Oleg Nesterov
  2011-08-12 17:56     ` [PATCH 2/3] vfork: make it killable Oleg Nesterov
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-12 17:56 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Roland McGrath, Denys Vlasenko, KOSAKI Motohiro, Matt Fleming,
	linux-kernel, Pavel Machek

No functional changes.

Move the clear-and-complete-vfork_done code into the new trivial
helper, complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h |    1 +
 fs/exec.c             |    8 ++------
 kernel/fork.c         |   17 ++++++++++-------
 3 files changed, 13 insertions(+), 13 deletions(-)

--- 3.1/include/linux/sched.h~1_complete_vfork_done	2011-08-12 16:02:17.000000000 +0200
+++ 3.1/include/linux/sched.h	2011-08-12 16:05:40.000000000 +0200
@@ -2254,6 +2254,7 @@ extern int do_execve(const char *,
 		     const char __user * const __user *,
 		     const char __user * const __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern void complete_vfork_done(struct task_struct *tsk);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
--- 3.1/fs/exec.c~1_complete_vfork_done	2011-08-12 16:02:16.000000000 +0200
+++ 3.1/fs/exec.c	2011-08-12 16:05:40.000000000 +0200
@@ -1914,7 +1914,6 @@ static int coredump_wait(int exit_code, 
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
-	struct completion *vfork_done;
 	int core_waiters = -EBUSY;
 
 	init_completion(&core_state->startup);
@@ -1933,11 +1932,8 @@ static int coredump_wait(int exit_code, 
 	 * Make sure nobody is waiting for us to release the VM,
 	 * otherwise we can deadlock when we wait on each other
 	 */
-	vfork_done = tsk->vfork_done;
-	if (vfork_done) {
-		tsk->vfork_done = NULL;
-		complete(vfork_done);
-	}
+	if (tsk->vfork_done)
+		complete_vfork_done(tsk);
 
 	if (core_waiters)
 		wait_for_completion(&core_state->startup);
--- 3.1/kernel/fork.c~1_complete_vfork_done	2011-08-12 16:02:17.000000000 +0200
+++ 3.1/kernel/fork.c	2011-08-12 16:08:19.000000000 +0200
@@ -650,6 +650,14 @@ struct mm_struct *get_task_mm(struct tas
 }
 EXPORT_SYMBOL_GPL(get_task_mm);
 
+void complete_vfork_done(struct task_struct *tsk)
+{
+	struct completion *vfork_done = tsk->vfork_done;
+
+	tsk->vfork_done = NULL;
+	complete(vfork_done);
+}
+
 /* Please note the differences between mmput and mm_release.
  * mmput is called whenever we stop holding onto a mm_struct,
  * error success whatever.
@@ -665,8 +673,6 @@ EXPORT_SYMBOL_GPL(get_task_mm);
  */
 void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 {
-	struct completion *vfork_done = tsk->vfork_done;
-
 	/* Get rid of any futexes when releasing the mm */
 #ifdef CONFIG_FUTEX
 	if (unlikely(tsk->robust_list)) {
@@ -686,11 +692,8 @@ void mm_release(struct task_struct *tsk,
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
 
-	/* notify parent sleeping on vfork() */
-	if (vfork_done) {
-		tsk->vfork_done = NULL;
-		complete(vfork_done);
-	}
+	if (tsk->vfork_done)
+		complete_vfork_done(tsk);
 
 	/*
 	 * If we're exiting normally, clear a user-space tid field if



* [PATCH 2/3] vfork: make it killable
  2011-08-12 17:55   ` [PATCH v2 0/3] make vfork killable Oleg Nesterov
  2011-08-12 17:56     ` [PATCH 1/3] vfork: introduce complete_vfork_done() Oleg Nesterov
@ 2011-08-12 17:56     ` Oleg Nesterov
  2011-08-19 20:33       ` Matt Fleming
  2011-08-12 17:56     ` [PATCH 3/3] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-12 17:56 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Roland McGrath, Denys Vlasenko, KOSAKI Motohiro, Matt Fleming,
	linux-kernel, Pavel Machek

Make vfork() killable.

Change do_fork(CLONE_VFORK) to do wait_for_completion_killable().
If it fails, we do not return to user mode and never touch the ->mm
shared with our child.

However, in this case we must clear child->vfork_done before
returning; we use task_lock() in do_fork()->wait_for_vfork_done()
and in complete_vfork_done() to serialize with each other.

Note: now that we use task_lock() we don't really need a completion;
we could turn task->vfork_done into "task_struct *wake_up_me", but
that would add some complications.

NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK; kernel threads run with all signals ignored, including
SIGKILL/SIGSTOP.

However, this is obviously a user-visible change. Not only can a fatal
signal kill the vforking parent; a sub-thread can do execve() or
exit_group() and kill the thread sleeping in vfork().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/fork.c |   40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

--- 3.1/kernel/fork.c~2_make_vfork_killable	2011-08-12 16:08:19.000000000 +0200
+++ 3.1/kernel/fork.c	2011-08-12 18:50:11.000000000 +0200
@@ -652,10 +652,34 @@ EXPORT_SYMBOL_GPL(get_task_mm);
 
 void complete_vfork_done(struct task_struct *tsk)
 {
-	struct completion *vfork_done = tsk->vfork_done;
+	struct completion *vfork;
 
-	tsk->vfork_done = NULL;
-	complete(vfork_done);
+	task_lock(tsk);
+	vfork = tsk->vfork_done;
+	if (likely(vfork)) {
+		tsk->vfork_done = NULL;
+		complete(vfork);
+	}
+	task_unlock(tsk);
+}
+
+static int wait_for_vfork_done(struct task_struct *child,
+				struct completion *vfork)
+{
+	int killed;
+
+	freezer_do_not_count();
+	killed = wait_for_completion_killable(vfork);
+	freezer_count();
+
+	if (killed) {
+		task_lock(child);
+		child->vfork_done = NULL;
+		task_unlock(child);
+	}
+
+	put_task_struct(child);
+	return killed;
 }
 
 /* Please note the differences between mmput and mm_release.
@@ -699,7 +723,8 @@ void mm_release(struct task_struct *tsk,
 	 * If we're exiting normally, clear a user-space tid field if
 	 * requested.  We leave this alone when dying by signal, to leave
 	 * the value intact in a core dump, and to save the unnecessary
-	 * trouble otherwise.  Userland only wants this done for a sys_exit.
+	 * trouble, say, a killed vfork parent shouldn't touch this mm.
+	 * Userland only wants this done for a sys_exit.
 	 */
 	if (tsk->clear_child_tid) {
 		if (!(tsk->flags & PF_SIGNALED) &&
@@ -1534,6 +1559,7 @@ long do_fork(unsigned long clone_flags,
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
 			init_completion(&vfork);
+			get_task_struct(p);
 		}
 
 		audit_finish_fork(p);
@@ -1553,10 +1579,8 @@ long do_fork(unsigned long clone_flags,
 			ptrace_event(trace, nr);
 
 		if (clone_flags & CLONE_VFORK) {
-			freezer_do_not_count();
-			wait_for_completion(&vfork);
-			freezer_count();
-			ptrace_event(PTRACE_EVENT_VFORK_DONE, nr);
+			if (!wait_for_vfork_done(p, &vfork))
+				ptrace_event(PTRACE_EVENT_VFORK_DONE, nr);
 		}
 	} else {
 		nr = PTR_ERR(p);


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 3/3] coredump_wait: don't call complete_vfork_done()
  2011-08-12 17:55   ` [PATCH v2 0/3] make vfork killable Oleg Nesterov
  2011-08-12 17:56     ` [PATCH 1/3] vfork: introduce complete_vfork_done() Oleg Nesterov
  2011-08-12 17:56     ` [PATCH 2/3] vfork: make it killable Oleg Nesterov
@ 2011-08-12 17:56     ` Oleg Nesterov
  2011-08-17  7:50       ` Tejun Heo
  2011-08-12 17:57     ` [PATCH 4/3] kill PF_STARTING Oleg Nesterov
  2011-08-13 16:18     ` [PATCH v2 0/3] make vfork killable Tejun Heo
  4 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-12 17:56 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Roland McGrath, Denys Vlasenko, KOSAKI Motohiro, Matt Fleming,
	linux-kernel, Pavel Machek

Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done(). zap_threads() should find and kill all tasks
with the same ->mm; this includes our parent if ->vfork_done is set.

mm_release() becomes the only caller, so unexport complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h |    1 -
 fs/exec.c             |   14 ++------------
 kernel/fork.c         |    2 +-
 3 files changed, 3 insertions(+), 14 deletions(-)

--- 3.1/include/linux/sched.h~3_coredump_no_vfork_done	2011-08-12 18:50:11.000000000 +0200
+++ 3.1/include/linux/sched.h	2011-08-12 18:51:13.000000000 +0200
@@ -2254,7 +2254,6 @@ extern int do_execve(const char *,
 		     const char __user * const __user *,
 		     const char __user * const __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
-extern void complete_vfork_done(struct task_struct *tsk);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
--- 3.1/fs/exec.c~3_coredump_no_vfork_done	2011-08-12 18:50:11.000000000 +0200
+++ 3.1/fs/exec.c	2011-08-12 18:51:13.000000000 +0200
@@ -1925,19 +1925,9 @@ static int coredump_wait(int exit_code, 
 		core_waiters = zap_threads(tsk, mm, core_state, exit_code);
 	up_write(&mm->mmap_sem);
 
-	if (unlikely(core_waiters < 0))
-		goto fail;
-
-	/*
-	 * Make sure nobody is waiting for us to release the VM,
-	 * otherwise we can deadlock when we wait on each other
-	 */
-	if (tsk->vfork_done)
-		complete_vfork_done(tsk);
-
-	if (core_waiters)
+	if (core_waiters > 0)
 		wait_for_completion(&core_state->startup);
-fail:
+
 	return core_waiters;
 }
 
--- 3.1/kernel/fork.c~3_coredump_no_vfork_done	2011-08-12 18:50:11.000000000 +0200
+++ 3.1/kernel/fork.c	2011-08-12 18:51:13.000000000 +0200
@@ -650,7 +650,7 @@ struct mm_struct *get_task_mm(struct tas
 }
 EXPORT_SYMBOL_GPL(get_task_mm);
 
-void complete_vfork_done(struct task_struct *tsk)
+static void complete_vfork_done(struct task_struct *tsk)
 {
 	struct completion *vfork;
 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 4/3] kill PF_STARTING
  2011-08-12 17:55   ` [PATCH v2 0/3] make vfork killable Oleg Nesterov
                       ` (2 preceding siblings ...)
  2011-08-12 17:56     ` [PATCH 3/3] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
@ 2011-08-12 17:57     ` Oleg Nesterov
  2011-08-17  7:51       ` Tejun Heo
  2011-08-13 16:18     ` [PATCH v2 0/3] make vfork killable Tejun Heo
  4 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-12 17:57 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Roland McGrath, Denys Vlasenko, KOSAKI Motohiro, Matt Fleming,
	linux-kernel, Pavel Machek

Previously it was (ab)used by utrace. Then it was wrongly used by
the scheduler code.

Currently it is not used; kill it before it finds a new erroneous
user.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 include/linux/sched.h |    1 -
 1 file changed, 1 deletion(-)

--- 3.1/include/linux/sched.h~4_kill_PF_STARTING	2011-08-12 18:51:13.000000000 +0200
+++ 3.1/include/linux/sched.h	2011-08-12 18:52:58.000000000 +0200
@@ -1756,7 +1756,6 @@ extern void thread_group_times(struct ta
 /*
  * Per process flags
  */
-#define PF_STARTING	0x00000002	/* being created */
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
 #define PF_VCPU		0x00000010	/* I'm a virtual CPU */
--- 3.1/kernel/fork.c~4_kill_PF_STARTING	2011-08-12 18:51:13.000000000 +0200
+++ 3.1/kernel/fork.c	2011-08-12 18:52:58.000000000 +0200
@@ -1026,7 +1026,6 @@ static void copy_flags(unsigned long clo
 
 	new_flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
 	new_flags |= PF_FORKNOEXEC;
-	new_flags |= PF_STARTING;
 	p->flags = new_flags;
 	clear_freeze_flag(p);
 }
@@ -1563,15 +1562,6 @@ long do_fork(unsigned long clone_flags,
 		}
 
 		audit_finish_fork(p);
-
-		/*
-		 * We set PF_STARTING at creation in case tracing wants to
-		 * use this to distinguish a fully live task from one that
-		 * hasn't finished SIGSTOP raising yet.  Now we clear it
-		 * and set the child going.
-		 */
-		p->flags &= ~PF_STARTING;
-
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/3] make vfork killable
  2011-08-12 17:55   ` [PATCH v2 0/3] make vfork killable Oleg Nesterov
                       ` (3 preceding siblings ...)
  2011-08-12 17:57     ` [PATCH 4/3] kill PF_STARTING Oleg Nesterov
@ 2011-08-13 16:18     ` Tejun Heo
  2011-08-15 19:42       ` Oleg Nesterov
  2011-08-23 22:01       ` Matt Helsley
  4 siblings, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2011-08-13 16:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel, Pavel Machek

Hello, Oleg.

On Fri, Aug 12, 2011 at 07:55:50PM +0200, Oleg Nesterov wrote:
> > an alternative approach
> > could be handling vfork waiting as a type of job control stop.
> 
> Well, I didn't see the code, but to be honest this doesn't look
> like a good idea to me. Firstly, personally I do not think this
> has anything to do with job control stop.
> 
> And, to me sys_restart_syscall() looks like a very natural
> approach, and a simple one.

I've been playing with this and it goes a bit beyond
implementation simplicity.  Currently, we have three different modes
of stopping a task.

* Regular job control and ptrace.
* vfork wait.
* cgroup freeze.

Currently, all three behave differently, and the latter two use
UNINTERRUPTIBLE sleep, causing rather nasty problems.  If we want to
fix the UNINTERRUPTIBLE sleep problem, we end up introducing a new
user-visible state no matter which way we go - ie. a task will be in a
state which isn't UNINTERRUPTIBLE sleep but still behaves differently
in terms of signal delivery and job control.

What's needed is a different state of being stopped which responds
to all of the kernel's desires (killing and ptracing) but stays stopped
regardless of what the user requests.

There are multiple ways to implement this, and forced syscall restart is
one way to achieve it - ie. while the stop condition is pending, the
syscall is forced to be restarted after interruption and re-enters the
stop.  The downside is that it wouldn't work with cgroup freeze at
all - there's no syscall to restart.

So, what I'm proposing is to basically add another job control state
which is similar to process group stop but controlled by other
parameters like the vfork wait condition or the control group frozen
state.  This allows these stops to be handled in a way very similar to
already established job control states, including interaction with ptrace.

> > * When entering get_signal_to_deliver(), if vfork child exists, save
> >   sigmask and block all blockable signals.
> 
> Oh, I'd like to avoid this. Why should we change get_signal_to_deliver()
> paths to help vfork?

get_signal_to_deliver() may be a misnomer but that's already the place
user tasks go to sleep when they aren't allowed to proceed at the
moment, so it's a logical extension of the existing behavior.

> > * When leaving get_signal_to_deliver(), restore sigmask if saved on
> >   entry.
> 
> And I _think_ we'd need many more complications. We still need to
> communicate with the child, for example. Unless we are going to
> add a "struct completion vfork_done" or something into task_struct,
> personally I dislike this idea.
> 
> > Haven't really thought a lot about the details so this might end up
> > uglier than the current attempt.  :)
> 
> I _hope_ it is much uglier, but I can be wrong of course ;)

Oh, it turns out I didn't need this.  The implementation isn't simple
though, but it gives very uniform behavior across all the different
modes of stops.  ie. both vfork and freeze under ptrace would report
TRAP_STOP w/ flags indicating the stop conditions in effect and would
re-trap if LISTEN is in effect on each state transition, so it can be
handled exactly as the group stop condition.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/3] make vfork killable
  2011-08-13 16:18     ` [PATCH v2 0/3] make vfork killable Tejun Heo
@ 2011-08-15 19:42       ` Oleg Nesterov
  2011-08-16 19:42         ` Tejun Heo
  2011-08-23 22:01       ` Matt Helsley
  1 sibling, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-15 19:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Roland McGrath, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel, Pavel Machek

Hi Tejun,

On 08/13, Tejun Heo wrote:
>
> Hello, Oleg.
>
> On Fri, Aug 12, 2011 at 07:55:50PM +0200, Oleg Nesterov wrote:
> > > an alternative approach
> > > could be handling vfork waiting as a type of job control stop.
> >
> > Well, I didn't see the code, but to be honest this doesn't look
> > like a good idea to me. Firstly, personally I do not think this
> > has anything to do with job control stop.
> >
> > And, to me sys_restart_syscall() looks like a very natural
> > approach, and a simple one.
>
> I've been playing with this and it goes a bit beyond
> implementation simplicity.  Currently, we have three different modes
> of stopping a task.
>
> * Regular job control and ptrace.
> * vfork wait.
> * cgroup freeze.

I still can't understand what exactly you have in mind.

And to me vfork() is closer to nanosleep() than to cgroup_freezer.
(As for cgroup_freezer, I agree it would be nice to reimplement it
 in any case.)

> The downside is that it wouldn't work with cgroup freeze at
> all - there's no syscall to restart.

Sure. Still I don't understand why restart is not suitable for vfork
and why it would be better to unify freezer/vfork.

OK. Let's discuss this later, I hope you will cc me ;)


But what do you think about this series? It is simple, it doesn't
play with restarts. In some sense it even simplifies the code because
it removes one user of ->vfork_done. I don't think these patches can
complicate the further changes you are going to do.

No?

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/3] make vfork killable
  2011-08-15 19:42       ` Oleg Nesterov
@ 2011-08-16 19:42         ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2011-08-16 19:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel, Pavel Machek

Hello, Oleg.

On Mon, Aug 15, 2011 at 09:42:09PM +0200, Oleg Nesterov wrote:
> > * Regular job control and ptrace.
> > * vfork wait.
> > * cgroup freeze.
> 
> I still can't understand what exactly you have in mind.
>
> And to me vfork() is closer to nanosleep() than to cgroup_freezer.
> (As for cgroup_freezer, I agree it would be nice to reimplement it
>  in any case.)

Maybe it would be better to handle vfork() that way.  I'm not sure,
but if we can handle all three waits in the same manner, that would be
a plus too.  I'll think more about it.

> > The downside is that it wouldn't work with cgroup freeze at
> > all - there's no syscall to restart.
> 
> Sure. Still I don't understand why restart is not suitable for vfork
> and why it would be better to unify freezer/vfork.

It's still in the works, but the user-visible behavior is more
consistent both with and without ptrace.  I'm still trying to wrap my
head around consolidating the freezer into it.  Let's see how it pans out.

> OK. Let's discuss this later, I hope you will cc me ;)

Oh, no need to worry about that. :)

> But what do you think about this series? It is simple, it doesn't
> play with restarts. In some sense it even simplifies the code because
> it removes one user of ->vfork_done. I don't think these patches can
> complicate the further changes you are going to do.
> 
> No?

I was too lost in freezer land.  I will review it first thing tomorrow.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] coredump_wait: don't call complete_vfork_done()
  2011-08-12 17:56     ` [PATCH 3/3] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
@ 2011-08-17  7:50       ` Tejun Heo
  2011-08-17 15:11         ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2011-08-17  7:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel, Pavel Machek

Hello, sorry about the delay.

On Fri, Aug 12, 2011 at 07:56:46PM +0200, Oleg Nesterov wrote:
> Now that CLONE_VFORK is killable, coredump_wait() no longer needs
> complete_vfork_done(). zap_threads() should find and kill all tasks
> with the same ->mm; this includes our parent if ->vfork_done is set.
> 
> mm_release() becomes the only caller, so unexport complete_vfork_done().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Yeap, looks pretty good to me.  For all three patches,

  Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/3] kill PF_STARTING
  2011-08-12 17:57     ` [PATCH 4/3] kill PF_STARTING Oleg Nesterov
@ 2011-08-17  7:51       ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2011-08-17  7:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Roland McGrath, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel, Pavel Machek

On Fri, Aug 12, 2011 at 07:57:01PM +0200, Oleg Nesterov wrote:
> Previously it was (ab)used by utrace. Then it was wrongly used by
> the scheduler code.
> 
> Currently it is not used; kill it before it finds a new erroneous
> user.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] coredump_wait: don't call complete_vfork_done()
  2011-08-17  7:50       ` Tejun Heo
@ 2011-08-17 15:11         ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-17 15:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Roland McGrath, Denys Vlasenko, KOSAKI Motohiro,
	Matt Fleming, linux-kernel, Pavel Machek

On 08/17, Tejun Heo wrote:
>
> Yeap, looks pretty good to me.  For all three patches,
>
>   Acked-by: Tejun Heo <tj@kernel.org>

Great, thanks.

Since Linus doesn't object, I hope he agrees with these changes too,

I am going to add this into ptrace branch.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] vfork: make it killable
  2011-08-12 17:56     ` [PATCH 2/3] vfork: make it killable Oleg Nesterov
@ 2011-08-19 20:33       ` Matt Fleming
  2011-08-22 13:35         ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Matt Fleming @ 2011-08-19 20:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, Linus Torvalds, Roland McGrath, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel, Pavel Machek

On Fri, 2011-08-12 at 19:56 +0200, Oleg Nesterov wrote:
> Make vfork() killable.
> 
> Change do_fork(CLONE_VFORK) to do wait_for_completion_killable().
> If it fails, we do not return to user mode and never touch the ->mm
> shared with our child.
> 
> However, in this case we must clear child->vfork_done before
> returning; we use task_lock() in do_fork()->wait_for_vfork_done()
> and in complete_vfork_done() to serialize with each other.

It's probably worth updating the comment above task_lock() in
include/linux/sched.h, to say that it is now used to protect
->vfork_done.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] vfork: make it killable
  2011-08-19 20:33       ` Matt Fleming
@ 2011-08-22 13:35         ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-22 13:35 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Tejun Heo, Linus Torvalds, Roland McGrath, Denys Vlasenko,
	KOSAKI Motohiro, linux-kernel, Pavel Machek

On 08/19, Matt Fleming wrote:
>
> On Fri, 2011-08-12 at 19:56 +0200, Oleg Nesterov wrote:
> > Make vfork() killable.
> >
> > Change do_fork(CLONE_VFORK) to do wait_for_completion_killable().
> > If it fails, we do not return to user mode and never touch the ->mm
> > shared with our child.
> >
> > However, in this case we must clear child->vfork_done before
> > returning; we use task_lock() in do_fork()->wait_for_vfork_done()
> > and in complete_vfork_done() to serialize with each other.
>
> It's probably worth updating the comment above task_lock() in
> include/linux/sched.h, to say that it is now used to protect
> ->vfork_done.

OK, added

	@@ -2335,7 +2335,7 @@ static inline int thread_group_empty(struct task_struct *p)
	  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
	  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
	  * pins the final release of task.io_context.  Also protects ->cpuset and
	- * ->cgroup.subsys[].
	+ * ->cgroup.subsys[]. And ->vfork_done.
	  *
	  * Nests both inside and outside of read_lock(&tasklist_lock).
	  * It must not be nested with write_lock_irq(&tasklist_lock),

to this patch, and pushed 1-4 into ptrace branch.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/3] make vfork killable
  2011-08-13 16:18     ` [PATCH v2 0/3] make vfork killable Tejun Heo
  2011-08-15 19:42       ` Oleg Nesterov
@ 2011-08-23 22:01       ` Matt Helsley
  2011-08-23 22:12         ` Tejun Heo
  1 sibling, 1 reply; 54+ messages in thread
From: Matt Helsley @ 2011-08-23 22:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Linus Torvalds, Roland McGrath, Denys Vlasenko,
	KOSAKI Motohiro, Matt Fleming, linux-kernel, Pavel Machek

On Sat, Aug 13, 2011 at 06:18:14PM +0200, Tejun Heo wrote:
> Hello, Oleg.
> 
> On Fri, Aug 12, 2011 at 07:55:50PM +0200, Oleg Nesterov wrote:
> > > an alternative approach
> > > could be handling vfork waiting as a type of job control stop.
> > 
> > Well, I didn't see the code, but to be honest this doesn't look
> > like a good idea to me. Firstly, personally I do not think this
> > has anything to do with job control stop.
> > 
> > And, to me sys_restart_syscall() looks like a very natural
> > approach, and a simple one.
> 
> I've been playing with this and it goes a bit beyond
> implementation simplicity.  Currently, we have three different modes
> of stopping a task.
> 
> * Regular job control and ptrace.
> * vfork wait.
> * cgroup freeze.
> 
> Currently, all three behave differently, and the latter two use
> UNINTERRUPTIBLE sleep, causing rather nasty problems.  If we want to
> fix the UNINTERRUPTIBLE sleep problem, we end up introducing a new
> user-visible state no matter which way we go - ie. a task will be in a
> state which isn't UNINTERRUPTIBLE sleep but still behaves differently
> in terms of signal delivery and job control.
> 
> What's needed is a different state of being stopped which responds
> to all of the kernel's desires (killing and ptracing) but stays stopped
> regardless of what the user requests.
> 
> There are multiple ways to implement this, and forced syscall restart is
> one way to achieve it - ie. while the stop condition is pending, the
> syscall is forced to be restarted after interruption and re-enters the
> stop.  The downside is that it wouldn't work with cgroup freeze at
> all - there's no syscall to restart.
> 
> So, what I'm proposing is to basically add another job control state
> which is similar to process group stop but controlled by other
> parameters like vfork wait condition or control group frozen state.

Are you proposing one new jobctl state for vfork and another for
the cgroup freezer?

> This allows these stops to be handled in a way very similar to already
> established job control states, including interaction with ptrace.

Interesting. So long as shells and ptracers in containers would not see
jobctl traps/events triggered by tasks outside the container it sounds
fine.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/3] make vfork killable
  2011-08-23 22:01       ` Matt Helsley
@ 2011-08-23 22:12         ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2011-08-23 22:12 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Oleg Nesterov, Linus Torvalds, Roland McGrath, Denys Vlasenko,
	KOSAKI Motohiro, Matt Fleming, linux-kernel, Pavel Machek

Hello, Matt.

On Wed, Aug 24, 2011 at 12:01 AM, Matt Helsley <matthltc@us.ibm.com> wrote:
> Are you proposing one new jobctl state for vfork and another for
> the cgroup freezer?

Dunno how it would actually turn out yet, but no matter what, it would
be transparent to the process itself while being visible to and
manipulable by the ptracer.  Well, that's the idea...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: mm->oom_disable_count is broken
  2011-08-01 11:52               ` KOSAKI Motohiro
@ 2011-08-29 18:37                 ` Oleg Nesterov
  2011-08-29 23:17                   ` David Rientjes
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-29 18:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: rientjes, akpm, torvalds, roland, tj, dvlasenk, matt.fleming,
	linux-kernel, avagin, fhrbata, yinghan

On 08/01, KOSAKI Motohiro wrote:
>
> > And this reminds me. mm->oom_disable_count looks absolutely broken.
> > IIRC, I already complained but nobody replied.
> >
> > [...snip...]
>
> IIRC, I did point out this issue, but nobody replied.
> I think ->oom_disable_count is currently broken, but right now I have no
> time to audit this stuff. So I'd suggest reverting this code if nobody fixes it.

I tend to agree. Of course we can fix oom_disable_count, but I don't
really understand why we want it.

David, could you please explain? I mean, CLONE_VM (without CLONE_THREAD)
is not that common, I think. Does this counter really help in practice?


And, personally, I dislike it because ->oom_disable_count is just another
proof that ->oom_score_adj should be in ->mm, not per-process. IIRC,
you already explained to me why we can't do this, but - sorry - I forgot.
Maybe something with vfork... Could you explain this again?

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: mm->oom_disable_count is broken
  2011-08-29 18:37                 ` Oleg Nesterov
@ 2011-08-29 23:17                   ` David Rientjes
  2011-08-30  7:43                       ` David Rientjes
  2011-08-30 16:17                     ` mm->oom_disable_count is broken Oleg Nesterov
  0 siblings, 2 replies; 54+ messages in thread
From: David Rientjes @ 2011-08-29 23:17 UTC (permalink / raw)
  To: Oleg Nesterov, Ying Han
  Cc: KOSAKI Motohiro, Andrew Morton, Linus Torvalds, roland, tj,
	dvlasenk, matt.fleming, linux-kernel, avagin, fhrbata

On Mon, 29 Aug 2011, Oleg Nesterov wrote:

> > IIRC, I did point out this issue, but nobody replied.
> > I think ->oom_disable_count is currently broken, but right now I have no
> > time to audit this stuff. So I'd suggest reverting this code if nobody fixes it.
> 
> I tend to agree. Of course we can fix oom_disable_count, but I don't
> really understand why we want it.
> 

I'd rather just remove it entirely; we'll have to ask its author.  Ying, 
do you see a reason to keep oom_disable_count around?

The only thing that I can see it doing is preventing a thread that shares 
an ->mm with an unkillable thread from being killed itself since it won't 
lead to future memory freeing.  It prevents the second tasklist iteration 
after a task has been chosen to check if another thread sharing the memory 
cannot be killed.

I'd rather just kill the thread anyway because there's a chance that the 
OOM_DISABLE thread is waiting on it and may free its memory as well, and 
there's no guarantee that, when you set a thread to be OOM_DISABLE, all 
threads sharing the same memory are disabled as well.

> And, personally, I dislike it because ->oom_disable_count is just another
> proof that ->oom_score_adj should be in ->mm, not per-process. IIRC,
> you already explained to me why we can't do this, but - sorry - I forgot.
> Maybe something with vfork... Could you explain this again?
> 

I actually really wanted oom_score_adj to be in the ->mm, it would 
simplify a lot of the code :)  The problem was the inheritance property: 
we expect a job scheduler that is OOM_DISABLE to be able to vfork, change 
the oom_score_adj of the child, and then exec so that it is not oom 
disabled before starting to allocate memory.  If this were in the mm, then 
setting the oom_score_adj of the child prior to exec would change the job 
scheduler's oom score as well.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch 1/2] oom: remove oom_disable_count
  2011-08-29 23:17                   ` David Rientjes
@ 2011-08-30  7:43                       ` David Rientjes
  2011-08-30 16:17                     ` mm->oom_disable_count is broken Oleg Nesterov
  1 sibling, 0 replies; 54+ messages in thread
From: David Rientjes @ 2011-08-30  7:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ying Han, Oleg Nesterov, KOSAKI Motohiro, linux-kernel, linux-mm

This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy.  The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed, since killing them
doesn't lead to future memory freeing.  The counter could be fixed to
represent all threads sharing the same mm, but it's better to remove
the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the
   victim is waiting on that thread to exit and will actually cause
   future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just
   because another thread sharing its mm is oom disabled.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/exec.c                |    4 ----
 fs/proc/base.c           |   13 -------------
 include/linux/mm_types.h |    3 ---
 kernel/exit.c            |    2 --
 kernel/fork.c            |   10 +---------
 mm/oom_kill.c            |   23 +++++------------------
 6 files changed, 6 insertions(+), 49 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -841,10 +841,6 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm = mm;
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
-	if (old_mm && tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-		atomic_dec(&old_mm->oom_disable_count);
-		atomic_inc(&tsk->mm->oom_disable_count);
-	}
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
 	if (old_mm) {
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1107,13 +1107,6 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		goto err_sighand;
 	}
 
-	if (oom_adjust != task->signal->oom_adj) {
-		if (oom_adjust == OOM_DISABLE)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_adj == OOM_DISABLE)
-			atomic_dec(&task->mm->oom_disable_count);
-	}
-
 	/*
 	 * Warn that /proc/pid/oom_adj is deprecated, see
 	 * Documentation/feature-removal-schedule.txt.
@@ -1215,12 +1208,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto err_sighand;
 	}
 
-	if (oom_score_adj != task->signal->oom_score_adj) {
-		if (oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&task->mm->oom_disable_count);
-	}
 	task->signal->oom_score_adj = oom_score_adj;
 	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = oom_score_adj;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -313,9 +313,6 @@ struct mm_struct {
 	unsigned int token_priority;
 	unsigned int last_interval;
 
-	/* How many tasks sharing this mm are OOM_DISABLE */
-	atomic_t oom_disable_count;
-
 	unsigned long flags; /* Must use atomic bitops to access the bits */
 
 	struct core_state *core_state; /* coredumping support */
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -681,8 +681,6 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_dec(&mm->oom_disable_count);
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -501,7 +501,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
-	atomic_set(&mm->oom_disable_count, 0);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
@@ -816,8 +815,6 @@ good_mm:
 	/* Initializing for Swap token stuff */
 	mm->token_priority = 0;
 	mm->last_interval = 0;
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_inc(&mm->oom_disable_count);
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
@@ -1391,13 +1388,8 @@ bad_fork_cleanup_io:
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
-	if (p->mm) {
-		task_lock(p);
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&p->mm->oom_disable_count);
-		task_unlock(p);
+	if (p->mm)
 		mmput(p->mm);
-	}
 bad_fork_cleanup_signal:
 	if (!(clone_flags & CLONE_THREAD))
 		free_signal_struct(p->signal);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,13 +53,7 @@ int test_set_oom_score_adj(int new_val)
 
 	spin_lock_irq(&sighand->siglock);
 	old_val = current->signal->oom_score_adj;
-	if (new_val != old_val) {
-		if (new_val == OOM_SCORE_ADJ_MIN)
-			atomic_inc(&current->mm->oom_disable_count);
-		else if (old_val == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&current->mm->oom_disable_count);
-		current->signal->oom_score_adj = new_val;
-	}
+	current->signal->oom_score_adj = new_val;
 	spin_unlock_irq(&sighand->siglock);
 
 	return old_val;
@@ -172,16 +166,6 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 		return 0;
 
 	/*
-	 * Shortcut check for a thread sharing p->mm that is OOM_SCORE_ADJ_MIN
-	 * so the entire heuristic doesn't need to be executed for something
-	 * that cannot be killed.
-	 */
-	if (atomic_read(&p->mm->oom_disable_count)) {
-		task_unlock(p);
-		return 0;
-	}
-
-	/*
 	 * The memory controller may have a limit of 0 bytes, so avoid a divide
 	 * by zero, if necessary.
 	 */
@@ -447,6 +431,9 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 	for_each_process(q)
 		if (q->mm == mm && !same_thread_group(q, p) &&
 		    !(q->flags & PF_KTHREAD)) {
+			if (q->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+				continue;
+
 			task_lock(q);	/* Protect ->comm from prctl() */
 			pr_err("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(q), q->comm);
@@ -723,7 +710,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
-	    current->mm && !atomic_read(&current->mm->oom_disable_count)) {
+	    current->mm) {
 		/*
 		 * oom_kill_process() needs tasklist_lock held.  If it returns
 		 * non-zero, current could not be killed so we must fallback to

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch 1/2] oom: remove oom_disable_count
@ 2011-08-30  7:43                       ` David Rientjes
  0 siblings, 0 replies; 54+ messages in thread
From: David Rientjes @ 2011-08-30  7:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ying Han, Oleg Nesterov, KOSAKI Motohiro, linux-kernel, linux-mm

This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy.  The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing.  The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the
   victim is waiting on that thread to exit and will actually cause
   future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just
   because another thread sharing its mm is oom disabled.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/exec.c                |    4 ----
 fs/proc/base.c           |   13 -------------
 include/linux/mm_types.h |    3 ---
 kernel/exit.c            |    2 --
 kernel/fork.c            |   10 +---------
 mm/oom_kill.c            |   23 +++++------------------
 6 files changed, 6 insertions(+), 49 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -841,10 +841,6 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm = mm;
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
-	if (old_mm && tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-		atomic_dec(&old_mm->oom_disable_count);
-		atomic_inc(&tsk->mm->oom_disable_count);
-	}
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
 	if (old_mm) {
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1107,13 +1107,6 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		goto err_sighand;
 	}
 
-	if (oom_adjust != task->signal->oom_adj) {
-		if (oom_adjust == OOM_DISABLE)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_adj == OOM_DISABLE)
-			atomic_dec(&task->mm->oom_disable_count);
-	}
-
 	/*
 	 * Warn that /proc/pid/oom_adj is deprecated, see
 	 * Documentation/feature-removal-schedule.txt.
@@ -1215,12 +1208,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto err_sighand;
 	}
 
-	if (oom_score_adj != task->signal->oom_score_adj) {
-		if (oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_inc(&task->mm->oom_disable_count);
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&task->mm->oom_disable_count);
-	}
 	task->signal->oom_score_adj = oom_score_adj;
 	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = oom_score_adj;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -313,9 +313,6 @@ struct mm_struct {
 	unsigned int token_priority;
 	unsigned int last_interval;
 
-	/* How many tasks sharing this mm are OOM_DISABLE */
-	atomic_t oom_disable_count;
-
 	unsigned long flags; /* Must use atomic bitops to access the bits */
 
 	struct core_state *core_state; /* coredumping support */
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -681,8 +681,6 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_dec(&mm->oom_disable_count);
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -501,7 +501,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
-	atomic_set(&mm->oom_disable_count, 0);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
@@ -816,8 +815,6 @@ good_mm:
 	/* Initializing for Swap token stuff */
 	mm->token_priority = 0;
 	mm->last_interval = 0;
-	if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-		atomic_inc(&mm->oom_disable_count);
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
@@ -1391,13 +1388,8 @@ bad_fork_cleanup_io:
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
-	if (p->mm) {
-		task_lock(p);
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&p->mm->oom_disable_count);
-		task_unlock(p);
+	if (p->mm)
 		mmput(p->mm);
-	}
 bad_fork_cleanup_signal:
 	if (!(clone_flags & CLONE_THREAD))
 		free_signal_struct(p->signal);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,13 +53,7 @@ int test_set_oom_score_adj(int new_val)
 
 	spin_lock_irq(&sighand->siglock);
 	old_val = current->signal->oom_score_adj;
-	if (new_val != old_val) {
-		if (new_val == OOM_SCORE_ADJ_MIN)
-			atomic_inc(&current->mm->oom_disable_count);
-		else if (old_val == OOM_SCORE_ADJ_MIN)
-			atomic_dec(&current->mm->oom_disable_count);
-		current->signal->oom_score_adj = new_val;
-	}
+	current->signal->oom_score_adj = new_val;
 	spin_unlock_irq(&sighand->siglock);
 
 	return old_val;
@@ -172,16 +166,6 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 		return 0;
 
 	/*
-	 * Shortcut check for a thread sharing p->mm that is OOM_SCORE_ADJ_MIN
-	 * so the entire heuristic doesn't need to be executed for something
-	 * that cannot be killed.
-	 */
-	if (atomic_read(&p->mm->oom_disable_count)) {
-		task_unlock(p);
-		return 0;
-	}
-
-	/*
 	 * The memory controller may have a limit of 0 bytes, so avoid a divide
 	 * by zero, if necessary.
 	 */
@@ -447,6 +431,9 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 	for_each_process(q)
 		if (q->mm == mm && !same_thread_group(q, p) &&
 		    !(q->flags & PF_KTHREAD)) {
+			if (q->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+				continue;
+
 			task_lock(q);	/* Protect ->comm from prctl() */
 			pr_err("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(q), q->comm);
@@ -723,7 +710,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
-	    current->mm && !atomic_read(&current->mm->oom_disable_count)) {
+	    current->mm) {
 		/*
 		 * oom_kill_process() needs tasklist_lock held.  If it returns
 		 * non-zero, current could not be killed so we must fallback to

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch 2/2] oom: fix race while temporarily setting current's oom_score_adj
  2011-08-30  7:43                       ` David Rientjes
@ 2011-08-30  7:43                         ` David Rientjes
  -1 siblings, 0 replies; 54+ messages in thread
From: David Rientjes @ 2011-08-30  7:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ying Han, Oleg Nesterov, KOSAKI Motohiro, linux-kernel, linux-mm

test_set_oom_score_adj() was introduced in 72788c385604 (oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj) to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.

Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated.  That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.

To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86.  It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    1 +
 mm/ksm.c            |    3 ++-
 mm/oom_kill.c       |   19 +++++++++++++++++++
 mm/swapfile.c       |    2 +-
 4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -40,6 +40,7 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+extern void compare_swap_oom_score_adj(int old_val, int new_val);
 extern int test_set_oom_score_adj(int new_val);
 
 extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1905,7 +1905,8 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
 
 			oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
 			err = unmerge_and_remove_all_rmap_items();
-			test_set_oom_score_adj(oom_score_adj);
+			compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX,
+								oom_score_adj);
 			if (err) {
 				ksm_run = KSM_RUN_STOP;
 				count = err;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,6 +38,25 @@ int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
+/*
+ * compare_swap_oom_score_adj() - compare and swap current's oom_score_adj
+ * @old_val: old oom_score_adj for compare
+ * @new_val: new oom_score_adj for swap
+ *
+ * Sets the oom_score_adj value for current to @new_val iff its present value is
+ * @old_val.  Usually used to reinstate a previous value to prevent racing with
+ * userspace tuning the value in the interim.
+ */
+void compare_swap_oom_score_adj(int old_val, int new_val)
+{
+	struct sighand_struct *sighand = current->sighand;
+
+	spin_lock_irq(&sighand->siglock);
+	if (current->signal->oom_score_adj == old_val)
+		current->signal->oom_score_adj = new_val;
+	spin_unlock_irq(&sighand->siglock);
+}
+
 /**
  * test_set_oom_score_adj() - set current's oom_score_adj and return old value
  * @new_val: new oom_score_adj value
diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1617,7 +1617,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
 	err = try_to_unuse(type);
-	test_set_oom_score_adj(oom_score_adj);
+	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
 
 	if (err) {
 		/*

^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [patch 1/2] oom: remove oom_disable_count
  2011-08-30  7:43                       ` David Rientjes
@ 2011-08-30 15:28                         ` Oleg Nesterov
  -1 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-30 15:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Ying Han, KOSAKI Motohiro, linux-kernel, linux-mm

On 08/30, David Rientjes wrote:
>
> This removes mm->oom_disable_count entirely since it's unnecessary and
> currently buggy.  The counter was intended to be per-process but it's
> currently decremented in the exit path for each thread that exits, causing
> it to underflow.
>
> The count was originally intended to prevent oom killing threads that
> share memory with threads that cannot be killed since it doesn't lead to
> future memory freeing.  The counter could be fixed to represent all
> threads sharing the same mm, but it's better to remove the count since:
>
>  - it is possible that the OOM_DISABLE thread sharing memory with the
>    victim is waiting on that thread to exit and will actually cause
>    future memory freeing, and
>
>  - there is no guarantee that a thread is disabled from oom killing just
>    because another thread sharing its mm is oom disabled.

Great, thanks.

Even _if_ (I hope not) we decide to re-introduce this counter later,
I think it will be much more simple to start from the very beginning
and make the correct patch.

> @@ -447,6 +431,9 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
>  	for_each_process(q)
>  		if (q->mm == mm && !same_thread_group(q, p) &&
>  		    !(q->flags & PF_KTHREAD)) {

(I guess this is on top of -mm patch)

> +			if (q->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +				continue;
> +

Afaics, this is the only change apart from "removes mm->oom_disable_count
entirely", looks reasonable to me.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [patch 2/2] oom: fix race while temporarily setting current's oom_score_adj
  2011-08-30  7:43                         ` David Rientjes
@ 2011-08-30 15:57                           ` Oleg Nesterov
  -1 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-30 15:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Ying Han, KOSAKI Motohiro, linux-kernel, linux-mm

On 08/30, David Rientjes wrote:
>
> Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
> then reinstate the previous value is racy since it's possible that
> userspace can set the value to something else itself before the old value
> is reinstated.  That results in userspace setting current's oom_score_adj
> to a different value and then the kernel immediately setting it back to
> its previous value without notification.

Sure,

> To fix this, a new compare_swap_oom_score_adj() function is introduced
> with the same semantics as the compare and swap CAS instruction, or
> CMPXCHG on x86.  It is used to reinstate the previous value of
> oom_score_adj if and only if the present value is the same as the old
> value.

But this can't fix the race completely ?

> +void compare_swap_oom_score_adj(int old_val, int new_val)
> +{
> +	struct sighand_struct *sighand = current->sighand;
> +
> +	spin_lock_irq(&sighand->siglock);
> +	if (current->signal->oom_score_adj == old_val)
> +		current->signal->oom_score_adj = new_val;
> +	spin_unlock_irq(&sighand->siglock);
> +}

So. This is used this way:

	old_val = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);

	do_something();

	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, old_val);

What if userspace sets oom_score_adj = OOM_SCORE_ADJ_MAX in between?
May be the callers should use OOM_SCORE_ADJ_MAX + 1 instead, this way
we can't confuse old_val with the value from the userspace...




But in fact I am writing this email because I have the question.
Do we really need 2 helpers, and do we really need to allow to set
the arbitrary value?

I mean, perhaps we can do something like

	void set_oom_victim(bool on)
	{
		if (on) {
			oom_score_adj += ADJ_MAX - ADJ_MIN + 1;
		} else if (oom_score_adj > ADJ_MAX) {
			oom_score_adj -= ADJ_MAX - ADJ_MIN + 1;
		}
	}

Not sure this really makes sense, just curious.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: mm->oom_disable_count is broken
  2011-08-29 23:17                   ` David Rientjes
  2011-08-30  7:43                       ` David Rientjes
@ 2011-08-30 16:17                     ` Oleg Nesterov
  1 sibling, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2011-08-30 16:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ying Han, KOSAKI Motohiro, Andrew Morton, Linus Torvalds, roland,
	tj, dvlasenk, matt.fleming, linux-kernel, avagin, fhrbata

On 08/29, David Rientjes wrote:
>
> On Mon, 29 Aug 2011, Oleg Nesterov wrote:
>
> > And. personally I dislike it because ->oom_disable_count is just another
> > proof that ->oom_score_adj should be in ->mm, not per-process. IIRC,
> > you already explained me why we can't do this, but - sorry - I forgot.
> > May be something with vfork... Could you explain this again?
>
> I actually really wanted oom_score_adj to be in the ->mm, it would
> simplify a lot of the code :)  The problem was the inheritance property:
> we expect a job scheduler that is OOM_DISABLE to be able to vfork, change
> the oom_score_adj of the child, and then exec so that it is not oom
> disabled before starting to allocate memory.

Ah, I see. Thanks.

And yes, now I recall this is what you already explained ;)

Damn.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch 1/2] oom: remove oom_disable_count
  2011-08-30 15:28                         ` Oleg Nesterov
@ 2011-08-30 22:06                           ` David Rientjes
  -1 siblings, 0 replies; 54+ messages in thread
From: David Rientjes @ 2011-08-30 22:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Ying Han, KOSAKI Motohiro, linux-kernel, linux-mm

On Tue, 30 Aug 2011, Oleg Nesterov wrote:

> > @@ -447,6 +431,9 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
> >  	for_each_process(q)
> >  		if (q->mm == mm && !same_thread_group(q, p) &&
> >  		    !(q->flags & PF_KTHREAD)) {
> 
> (I guess this is on top of -mm patch)
> 

Yes, it's based on 
oom-avoid-killing-kthreads-if-they-assume-the-oom-killed-threads-mm.patch 
which I thought would be pushed for the 3.1 rc series, we certainly don't 
want to SIGKILL kthreads :)

> > +			if (q->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +				continue;
> > +
> 
> Afaics, this is the only change apart from "removes mm->oom_disable_count
> entirely", looks reasonable to me.
> 

Yeah, it's necessary because this loop in oom_kill_task() kills all 
user threads in different thread groups unconditionally if they share the 
same mm, so we need to ensure that we aren't sending a SIGKILL to anything 
that is actually oom disabled.  Before, the check in oom_badness() would 
have prevented the task (`p' in this function) from being chosen in the 
first place.

^ permalink raw reply	[flat|nested] 54+ messages in thread


end of thread, other threads:[~2011-08-30 22:06 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-27 16:31 [PATCH 0/8] make vfork killable/restartable/traceable Oleg Nesterov
2011-07-27 16:32 ` [PATCH 1/8] vfork: introduce complete_vfork_done() Oleg Nesterov
2011-07-27 16:32 ` [PATCH 2/8] vfork: introduce clone_vfork_finish() Oleg Nesterov
2011-07-27 16:32 ` [PATCH 3/8] vfork: make it killable Oleg Nesterov
2011-07-29 13:02   ` Matt Fleming
2011-07-29 14:32     ` Oleg Nesterov
2011-07-29 15:32       ` Matt Fleming
2011-07-27 16:33 ` [PATCH 4/8] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
2011-07-29 13:02   ` Matt Fleming
2011-07-29 14:25     ` Oleg Nesterov
2011-07-29 15:26       ` Matt Fleming
2011-07-27 16:33 ` [PATCH 5/8] introduce find_get_task_by_vpid() Oleg Nesterov
2011-07-27 16:33 ` [PATCH 6/8] vfork: do not setup child->vfork_done beforehand Oleg Nesterov
2011-07-27 16:34 ` [PATCH 7/8] vfork: make it stoppable/traceable Oleg Nesterov
2011-07-27 16:34 ` [PATCH 8/8] vfork: do not block SIG_DFL/SIG_IGN signals is single-threaded Oleg Nesterov
2011-07-27 16:34 ` [PATCH 9/8] kill PF_STARTING Oleg Nesterov
2011-07-27 19:39 ` [PATCH 0/8] make vfork killable/restartable/traceable Linus Torvalds
2011-07-28 13:59   ` Oleg Nesterov
2011-07-28 14:58     ` Oleg Nesterov
2011-07-27 22:38 ` Pedro Alves
2011-07-29 19:23 ` Tejun Heo
2011-08-12 17:55   ` [PATCH v2 0/3] make vfork killable Oleg Nesterov
2011-08-12 17:56     ` [PATCH 1/3] vfork: introduce complete_vfork_done() Oleg Nesterov
2011-08-12 17:56     ` [PATCH 2/3] vfork: make it killable Oleg Nesterov
2011-08-19 20:33       ` Matt Fleming
2011-08-22 13:35         ` Oleg Nesterov
2011-08-12 17:56     ` [PATCH 3/3] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
2011-08-17  7:50       ` Tejun Heo
2011-08-17 15:11         ` Oleg Nesterov
2011-08-12 17:57     ` [PATCH 4/3] kill PF_STARTING Oleg Nesterov
2011-08-17  7:51       ` Tejun Heo
2011-08-13 16:18     ` [PATCH v2 0/3] make vfork killable Tejun Heo
2011-08-15 19:42       ` Oleg Nesterov
2011-08-16 19:42         ` Tejun Heo
2011-08-23 22:01       ` Matt Helsley
2011-08-23 22:12         ` Tejun Heo
     [not found] ` <20110727163610.GJ23793@redhat.com>
     [not found]   ` <20110727175624.GA3950@redhat.com>
     [not found]     ` <20110728154324.GA22864@redhat.com>
     [not found]       ` <alpine.DEB.2.00.1107281341060.16093@chino.kir.corp.google.com>
     [not found]         ` <20110729141431.GA3501@redhat.com>
     [not found]           ` <20110730143426.GA6061@redhat.com>
2011-07-30 15:22             ` mm->oom_disable_count is broken Oleg Nesterov
2011-08-01 11:52               ` KOSAKI Motohiro
2011-08-29 18:37                 ` Oleg Nesterov
2011-08-29 23:17                   ` David Rientjes
2011-08-30  7:43                     ` [patch 1/2] oom: remove oom_disable_count David Rientjes
2011-08-30  7:43                       ` David Rientjes
2011-08-30  7:43                       ` [patch 2/2] oom: fix race while temporarily setting current's oom_score_adj David Rientjes
2011-08-30  7:43                         ` David Rientjes
2011-08-30 15:57                         ` Oleg Nesterov
2011-08-30 15:57                           ` Oleg Nesterov
2011-08-30 15:28                       ` [patch 1/2] oom: remove oom_disable_count Oleg Nesterov
2011-08-30 15:28                         ` Oleg Nesterov
2011-08-30 22:06                         ` David Rientjes
2011-08-30 22:06                           ` David Rientjes
2011-08-30 16:17                     ` mm->oom_disable_count is broken Oleg Nesterov
2011-08-10 21:44 ` [PATCH 0/8] make vfork killable/restartable/traceable Pavel Machek
2011-08-11 16:09   ` Oleg Nesterov
2011-08-11 16:22     ` Tejun Heo
