linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] make request_module() killable
@ 2012-02-14 16:47 Oleg Nesterov
  2012-02-14 16:47 ` [PATCH 1/6] usermodehelper: introduce umh_complete(sub_info) Oleg Nesterov
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-14 16:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

As Tetsuo pointed out, request_module() is very much unfriendly wrt OOM.
It needs "a lot" of time/memory to finish while the caller is blocked in
TASK_UNINTERRUPTIBLE.

Changes:

	- improve the comments a bit

	- reorder the changes so that 3/6 which changes kABI can be
	  skipped, to simplify the backporting

	- tried to test, seems to work

This series is orthogonal to kmod-avoid-deadlock-by-recursive-kmod-call.patch
(which I still think should be replaced, I'll try to return to this later).

3/6 depends on usermodehelper-use-umh_wait_proc-consistently.patch


Many thanks to Tetsuo for the discussion.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/6] usermodehelper: introduce umh_complete(sub_info)
  2012-02-14 16:47 [PATCH 0/6] make request_module() killable Oleg Nesterov
@ 2012-02-14 16:47 ` Oleg Nesterov
  2012-02-14 16:48 ` [PATCH 2/6] usermodehelper: implement UMH_KILLABLE Oleg Nesterov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-14 16:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

Preparation. Add the new trivial helper, umh_complete(). Currently
it simply does complete(sub_info->complete).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/kmod.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index a0a8854..8ea2594 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -199,6 +199,11 @@ void call_usermodehelper_freeinfo(struct subprocess_info *info)
 }
 EXPORT_SYMBOL(call_usermodehelper_freeinfo);
 
+static void umh_complete(struct subprocess_info *sub_info)
+{
+	complete(sub_info->complete);
+}
+
 /* Keventd can't block, but this (a child) can. */
 static int wait_for_helper(void *data)
 {
@@ -235,7 +240,7 @@ static int wait_for_helper(void *data)
 			sub_info->retval = ret;
 	}
 
-	complete(sub_info->complete);
+	umh_complete(sub_info);
 	return 0;
 }
 
@@ -269,7 +274,7 @@ static void __call_usermodehelper(struct work_struct *work)
 	case UMH_WAIT_EXEC:
 		if (pid < 0)
 			sub_info->retval = pid;
-		complete(sub_info->complete);
+		umh_complete(sub_info);
 	}
 }
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/6] usermodehelper: implement UMH_KILLABLE
  2012-02-14 16:47 [PATCH 0/6] make request_module() killable Oleg Nesterov
  2012-02-14 16:47 ` [PATCH 1/6] usermodehelper: introduce umh_complete(sub_info) Oleg Nesterov
@ 2012-02-14 16:48 ` Oleg Nesterov
  2012-02-14 16:48 ` [PATCH 3/6] usermodehelper: kill umh_wait, renumber UMH_* constants Oleg Nesterov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-14 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

Implement UMH_KILLABLE, should be used along with UMH_WAIT_EXEC/PROC.
The caller must ensure that subprocess_info->path/etc can not go away
until call_usermodehelper_freeinfo().

call_usermodehelper_exec(UMH_KILLABLE) does wait_for_completion_killable.
If it fails, it uses xchg(&sub_info->complete, NULL) to serialize with
umh_complete() which does the same xhcg() to access sub_info->complete.

If call_usermodehelper_exec wins, it can safely return. umh_complete()
should get NULL and call call_usermodehelper_freeinfo().

Otherwise we know that umh_complete() was already called, in this case
call_usermodehelper_exec() falls back to wait_for_completion() which
should succeed "very soon".

Note: UMH_NO_WAIT == -1 but it obviously should not be used with
UMH_KILLABLE. We delay the neccessary cleanup to simplify the back
porting.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/kmod.h |    2 ++
 kernel/kmod.c        |   27 +++++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 722f477..1b59858 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -54,6 +54,8 @@ enum umh_wait {
 	UMH_WAIT_PROC = 1,	/* wait for the process to complete */
 };
 
+#define UMH_KILLABLE	4	/* wait for EXEC/PROC killable */
+
 struct subprocess_info {
 	struct work_struct work;
 	struct completion *complete;
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 8ea2594..f92f917 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -201,7 +201,15 @@ EXPORT_SYMBOL(call_usermodehelper_freeinfo);
 
 static void umh_complete(struct subprocess_info *sub_info)
 {
-	complete(sub_info->complete);
+	struct completion *comp = xchg(&sub_info->complete, NULL);
+	/*
+	 * See call_usermodehelper_exec(). If xchg() returns NULL
+	 * we own sub_info, the UMH_KILLABLE caller has gone away.
+	 */
+	if (comp)
+		complete(comp);
+	else
+		call_usermodehelper_freeinfo(sub_info);
 }
 
 /* Keventd can't block, but this (a child) can. */
@@ -252,6 +260,9 @@ static void __call_usermodehelper(struct work_struct *work)
 	enum umh_wait wait = sub_info->wait;
 	pid_t pid;
 
+	if (wait != UMH_NO_WAIT)
+		wait &= ~UMH_KILLABLE;
+
 	/* CLONE_VFORK: wait until the usermode helper has execve'd
 	 * successfully We need the data structures to stay around
 	 * until that is done.  */
@@ -461,9 +472,21 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info,
 	queue_work(khelper_wq, &sub_info->work);
 	if (wait == UMH_NO_WAIT)	/* task has freed sub_info */
 		goto unlock;
+
+	if (wait & UMH_KILLABLE) {
+		retval = wait_for_completion_killable(&done);
+		if (!retval)
+			goto wait_done;
+
+		/* umh_complete() will see NULL and free sub_info */
+		if (xchg(&sub_info->complete, NULL))
+			goto unlock;
+		/* fallthrough, umh_complete() was already called */
+	}
+
 	wait_for_completion(&done);
+wait_done:
 	retval = sub_info->retval;
-
 out:
 	call_usermodehelper_freeinfo(sub_info);
 unlock:
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/6] usermodehelper: kill umh_wait, renumber UMH_* constants
  2012-02-14 16:47 [PATCH 0/6] make request_module() killable Oleg Nesterov
  2012-02-14 16:47 ` [PATCH 1/6] usermodehelper: introduce umh_complete(sub_info) Oleg Nesterov
  2012-02-14 16:48 ` [PATCH 2/6] usermodehelper: implement UMH_KILLABLE Oleg Nesterov
@ 2012-02-14 16:48 ` Oleg Nesterov
  2012-02-15  1:09   ` Rusty Russell
  2012-02-14 16:48 ` [PATCH 4/6] usermodehelper: ____call_usermodehelper() doesn't need do_exit() Oleg Nesterov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-14 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

No functional changes. It is not sane to use UMH_KILLABLE with
enum umh_wait, but obviously we do not want another argument in
call_usermodehelper_* helpers. Kill this enum, use the plain int.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/kmod.h        |   18 +++++++-----------
 kernel/kmod.c               |    8 ++------
 security/keys/request_key.c |    2 +-
 3 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 1b59858..9efeae6 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -48,12 +48,9 @@ static inline int request_module_nowait(const char *name, ...) { return -ENOSYS;
 struct cred;
 struct file;
 
-enum umh_wait {
-	UMH_NO_WAIT = -1,	/* don't wait at all */
-	UMH_WAIT_EXEC = 0,	/* wait for the exec, but not the process */
-	UMH_WAIT_PROC = 1,	/* wait for the process to complete */
-};
-
+#define UMH_NO_WAIT	0	/* don't wait at all */
+#define UMH_WAIT_EXEC	1	/* wait for the exec, but not the process */
+#define UMH_WAIT_PROC	2	/* wait for the process to complete */
 #define UMH_KILLABLE	4	/* wait for EXEC/PROC killable */
 
 struct subprocess_info {
@@ -62,7 +59,7 @@ struct subprocess_info {
 	char *path;
 	char **argv;
 	char **envp;
-	enum umh_wait wait;
+	int wait;
 	int retval;
 	int (*init)(struct subprocess_info *info, struct cred *new);
 	void (*cleanup)(struct subprocess_info *info);
@@ -80,15 +77,14 @@ void call_usermodehelper_setfns(struct subprocess_info *info,
 		    void *data);
 
 /* Actually execute the sub-process */
-int call_usermodehelper_exec(struct subprocess_info *info, enum umh_wait wait);
+int call_usermodehelper_exec(struct subprocess_info *info, int wait);
 
 /* Free the subprocess_info. This is only needed if you're not going
    to call call_usermodehelper_exec */
 void call_usermodehelper_freeinfo(struct subprocess_info *info);
 
 static inline int
-call_usermodehelper_fns(char *path, char **argv, char **envp,
-			enum umh_wait wait,
+call_usermodehelper_fns(char *path, char **argv, char **envp, int wait,
 			int (*init)(struct subprocess_info *info, struct cred *new),
 			void (*cleanup)(struct subprocess_info *), void *data)
 {
@@ -106,7 +102,7 @@ call_usermodehelper_fns(char *path, char **argv, char **envp,
 }
 
 static inline int
-call_usermodehelper(char *path, char **argv, char **envp, enum umh_wait wait)
+call_usermodehelper(char *path, char **argv, char **envp, int wait)
 {
 	return call_usermodehelper_fns(path, argv, envp, wait,
 				       NULL, NULL, NULL);
diff --git a/kernel/kmod.c b/kernel/kmod.c
index f92f917..8341de9 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -257,12 +257,9 @@ static void __call_usermodehelper(struct work_struct *work)
 {
 	struct subprocess_info *sub_info =
 		container_of(work, struct subprocess_info, work);
-	enum umh_wait wait = sub_info->wait;
+	int wait = sub_info->wait & ~UMH_KILLABLE;
 	pid_t pid;
 
-	if (wait != UMH_NO_WAIT)
-		wait &= ~UMH_KILLABLE;
-
 	/* CLONE_VFORK: wait until the usermode helper has execve'd
 	 * successfully We need the data structures to stay around
 	 * until that is done.  */
@@ -451,8 +448,7 @@ EXPORT_SYMBOL(call_usermodehelper_setfns);
  * asynchronously if wait is not set, and runs as a child of keventd.
  * (ie. it runs with full root capabilities).
  */
-int call_usermodehelper_exec(struct subprocess_info *sub_info,
-			     enum umh_wait wait)
+int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	int retval = 0;
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 8246532..fc7c8cc 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -91,7 +91,7 @@ static void umh_keys_cleanup(struct subprocess_info *info)
  * Call a usermode helper with a specific session keyring.
  */
 static int call_usermodehelper_keys(char *path, char **argv, char **envp,
-			 struct key *session_keyring, enum umh_wait wait)
+			 		struct key *session_keyring, int wait)
 {
 	gfp_t gfp_mask = (wait == UMH_NO_WAIT) ? GFP_ATOMIC : GFP_KERNEL;
 	struct subprocess_info *info =
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/6] usermodehelper: ____call_usermodehelper() doesn't need do_exit()
  2012-02-14 16:47 [PATCH 0/6] make request_module() killable Oleg Nesterov
                   ` (2 preceding siblings ...)
  2012-02-14 16:48 ` [PATCH 3/6] usermodehelper: kill umh_wait, renumber UMH_* constants Oleg Nesterov
@ 2012-02-14 16:48 ` Oleg Nesterov
  2012-02-14 16:48 ` [PATCH 5/6] kmod: introduce call_modprobe() helper Oleg Nesterov
  2012-02-14 16:49 ` [PATCH 6/6] kmod: make __request_module() killable Oleg Nesterov
  5 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-14 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

Minor cleanup. ____call_usermodehelper() can simply return, no need
to call do_exit() explicitely.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/kmod.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 8341de9..685b246 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -188,7 +188,7 @@ static int ____call_usermodehelper(void *data)
 	/* Exec failed? */
 fail:
 	sub_info->retval = retval;
-	do_exit(0);
+	return 0;
 }
 
 void call_usermodehelper_freeinfo(struct subprocess_info *info)
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/6] kmod: introduce call_modprobe() helper
  2012-02-14 16:47 [PATCH 0/6] make request_module() killable Oleg Nesterov
                   ` (3 preceding siblings ...)
  2012-02-14 16:48 ` [PATCH 4/6] usermodehelper: ____call_usermodehelper() doesn't need do_exit() Oleg Nesterov
@ 2012-02-14 16:48 ` Oleg Nesterov
  2012-02-14 16:49 ` [PATCH 6/6] kmod: make __request_module() killable Oleg Nesterov
  5 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-14 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

No functional changes. Move the call_usermodehelper code from
__request_module() into the new simple helper, call_modprobe().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/kmod.c |   24 ++++++++++++++++--------
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 685b246..56a29e8 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -60,6 +60,21 @@ static DECLARE_RWSEM(umhelper_sem);
 */
 char modprobe_path[KMOD_PATH_LEN] = "/sbin/modprobe";
 
+static int call_modprobe(char *module_name, int wait)
+{
+	static char *envp[] = {
+		"HOME=/",
+		"TERM=linux",
+		"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
+		NULL
+	};
+
+	char *argv[] = { modprobe_path, "-q", "--", module_name, NULL };
+
+	return call_usermodehelper_fns(modprobe_path, argv, envp,
+					wait, NULL, NULL, NULL);
+}
+
 /**
  * __request_module - try to load a kernel module
  * @wait: wait (or not) for the operation to complete
@@ -81,11 +96,6 @@ int __request_module(bool wait, const char *fmt, ...)
 	char module_name[MODULE_NAME_LEN];
 	unsigned int max_modprobes;
 	int ret;
-	char *argv[] = { modprobe_path, "-q", "--", module_name, NULL };
-	static char *envp[] = { "HOME=/",
-				"TERM=linux",
-				"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
-				NULL };
 	static atomic_t kmod_concurrent = ATOMIC_INIT(0);
 #define MAX_KMOD_CONCURRENT 50	/* Completely arbitrary value - KAO */
 	static int kmod_loop_msg;
@@ -128,9 +138,7 @@ int __request_module(bool wait, const char *fmt, ...)
 
 	trace_module_request(module_name, wait, _RET_IP_);
 
-	ret = call_usermodehelper_fns(modprobe_path, argv, envp,
-			wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC,
-			NULL, NULL, NULL);
+	ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
 
 	atomic_dec(&kmod_concurrent);
 	return ret;
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 6/6] kmod: make __request_module() killable
  2012-02-14 16:47 [PATCH 0/6] make request_module() killable Oleg Nesterov
                   ` (4 preceding siblings ...)
  2012-02-14 16:48 ` [PATCH 5/6] kmod: introduce call_modprobe() helper Oleg Nesterov
@ 2012-02-14 16:49 ` Oleg Nesterov
  2012-02-15 20:30   ` Andrew Morton
  5 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-14 16:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

As Tetsuo Handa pointed out, request_module() can stress the
system while the oom-killed caller sleeps in TASK_UNINTERRUPTIBLE.

Make __request_module() killable. The only necessary change is
that call_modprobe() should kmalloc argv and module_name, they
can't live in the stack if we use UMH_KILLABLE. This memory is
freed via call_usermodehelper_freeinfo()->cleanup.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/kmod.c |   26 ++++++++++++++++++++++++--
 1 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 56a29e8..957a7aa 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -60,6 +60,12 @@ static DECLARE_RWSEM(umhelper_sem);
 */
 char modprobe_path[KMOD_PATH_LEN] = "/sbin/modprobe";
 
+static void free_modprobe_argv(struct subprocess_info *info)
+{
+	kfree(info->argv[3]); /* check call_modprobe() */
+	kfree(info->argv);
+}
+
 static int call_modprobe(char *module_name, int wait)
 {
 	static char *envp[] = {
@@ -69,10 +75,26 @@ static int call_modprobe(char *module_name, int wait)
 		NULL
 	};
 
-	char *argv[] = { modprobe_path, "-q", "--", module_name, NULL };
+	char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
+	if (!argv)
+		goto out;
+
+	module_name = kstrdup(module_name, GFP_KERNEL);
+	if (!module_name)
+		goto free_argv;
+
+	argv[0] = modprobe_path;
+	argv[1] = "-q";
+	argv[2] = "--";
+	argv[3] = module_name;	/* check free_modprobe_argv() */
+	argv[4] = NULL;
 
 	return call_usermodehelper_fns(modprobe_path, argv, envp,
-					wait, NULL, NULL, NULL);
+		wait | UMH_KILLABLE, NULL, free_modprobe_argv, NULL);
+free_argv:
+	kfree(argv);
+out:
+	return -ENOMEM;
 }
 
 /**
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/6] usermodehelper: kill umh_wait, renumber UMH_* constants
  2012-02-14 16:48 ` [PATCH 3/6] usermodehelper: kill umh_wait, renumber UMH_* constants Oleg Nesterov
@ 2012-02-15  1:09   ` Rusty Russell
  2012-02-15 18:12     ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2012-02-15  1:09 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes, tj,
	linux-kernel

On Tue, 14 Feb 2012 17:48:21 +0100, Oleg Nesterov <oleg@redhat.com> wrote:
> No functional changes. It is not sane to use UMH_KILLABLE with
> enum umh_wait, but obviously we do not want another argument in
> call_usermodehelper_* helpers. Kill this enum, use the plain int.

Seems like a step backwards.  Perhaps reorder the enum, but implying an
explicit range of values explicit by using an enum seems good.

But it's a minor gripe, feel free to ignore.

Entire series:
Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks for this!
Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/6] usermodehelper: kill umh_wait, renumber UMH_* constants
  2012-02-15  1:09   ` Rusty Russell
@ 2012-02-15 18:12     ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-15 18:12 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, apw, arjan, fhrbata, john.johansen,
	penguin-kernel, rientjes, tj, linux-kernel

On 02/15, Rusty Russell wrote:
>
> On Tue, 14 Feb 2012 17:48:21 +0100, Oleg Nesterov <oleg@redhat.com> wrote:
> > No functional changes. It is not sane to use UMH_KILLABLE with
> > enum umh_wait, but obviously we do not want another argument in
> > call_usermodehelper_* helpers. Kill this enum, use the plain int.
>
> Seems like a step backwards.  Perhaps reorder the enum, but implying an
> explicit range of values explicit by using an enum seems good.

The problem is, the things like "UMH_WAIT_EXEC | UMH_KILLABLE" do not
look like enum to me, even if they both are enums and the resulting
code is correct.

OTOH, something like

	enum umh_wait {
		UMH_NO_WAIT,
		UMH_WAIT_EXEC,
		UMH_WAIT_PROC,
		UMH_WAIT_EXEC_KILLABLE,
		UMH_WAIT_PROC_KILLABLE,
	};

doesn't look very nice/convenient too.

But. This is the cleanup, and thus it should not disturb the maintainer.
I do not mind to redo it either way, just tell me what you prefer.

This change doesn't create any dependency, it can be reverted/reworked
at any moment. It could be the last patch in series, probably I should
have sent it as 6/6.

Or,

> But it's a minor gripe, feel free to ignore.

works for me too ;)

> Entire series:
> Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] kmod: make __request_module() killable
  2012-02-14 16:49 ` [PATCH 6/6] kmod: make __request_module() killable Oleg Nesterov
@ 2012-02-15 20:30   ` Andrew Morton
  2012-02-16 15:04     ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2012-02-15 20:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On Tue, 14 Feb 2012 17:49:14 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> As Tetsuo Handa pointed out, request_module() can stress the
> system while the oom-killed caller sleeps in TASK_UNINTERRUPTIBLE.

Whine.

Solving this problem is the entire point of the entire patchset and you
told us almost nothing about it.  Please, provide a complete
description of the problem which is being solved, so we can understand
the value of the patchset?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] kmod: make __request_module() killable
  2012-02-15 20:30   ` Andrew Morton
@ 2012-02-16 15:04     ` Oleg Nesterov
  2012-02-16 17:26       ` [PATCH 0/4] make vfork() killable Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-16 15:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On 02/15, Andrew Morton wrote:
>
> On Tue, 14 Feb 2012 17:49:14 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > As Tetsuo Handa pointed out, request_module() can stress the
> > system while the oom-killed caller sleeps in TASK_UNINTERRUPTIBLE.
>
> Whine.
>
> Solving this problem is the entire point of the entire patchset and you
> told us almost nothing about it.  Please, provide a complete
> description of the problem which is being solved, so we can understand
> the value of the patchset?

I did ;) from the message I sent to security list:

	Tetsuo has the test-cases, but the problem (well, one of the problems) is
	simple.

	The task T uses "almost all" memory, then it does something which triggers
	request_module(). Say, it can simply call sys_socket(). This in turn needs
	more memory and leads to OOM. oom-killer correctly chooses T and kills it,
	but this can't help because it sleeps in TASK_UNINTERRUPTIBLE and after
	that oom-killer becomes "disabled" by the TIF_MEMDIE task T.

Credits to Tetsuo.

But in fact I think this change is "obviously good" anyway. Assuming
it is correct of course. request_module() is heavy, it can take the
unpredictable amount of time/resources to finish. It is not good we
can't interrupt the task which waits for completion.

Yes, this adds some complications and initially I wasn't agree with
Tetsuo, I thought this doesn't worth the trouble. But I hope that
this code is simple/clean enough.



Btw, there is another example of "unbounded" sleep in UNINTERRUPTIBLE,
vfork. I already have the patches, will send today.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 0/4] make vfork() killable
  2012-02-16 15:04     ` Oleg Nesterov
@ 2012-02-16 17:26       ` Oleg Nesterov
  2012-02-16 17:26         ` [PATCH 1/4] introduce complete_vfork_done() Oleg Nesterov
                           ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-16 17:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On 02/16, Oleg Nesterov wrote:
>
> Btw, there is another example of "unbounded" sleep in UNINTERRUPTIBLE,
> vfork. I already have the patches, will send today.

Make vfork() killable, please apply.

These patches are old (Aug 2011), they spent a lot of time in linux-next
without any problem. Linus participated in the discussion, iiuc he has
no objections.

The next step is make vfork() stoppable/traceable.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/4] introduce complete_vfork_done()
  2012-02-16 17:26       ` [PATCH 0/4] make vfork() killable Oleg Nesterov
@ 2012-02-16 17:26         ` Oleg Nesterov
  2012-02-17  0:35           ` Andrew Morton
  2012-02-16 17:27         ` [PATCH 2/4] vfork: make it killable Oleg Nesterov
                           ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-16 17:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

No functional changes.

Move the clear-and-complete-vfork_done code into the new trivial
helper, complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 fs/exec.c             |    8 ++------
 include/linux/sched.h |    1 +
 kernel/fork.c         |   17 ++++++++++-------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 92ce83a..dccdcec 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1915,7 +1915,6 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
-	struct completion *vfork_done;
 	int core_waiters = -EBUSY;
 
 	init_completion(&core_state->startup);
@@ -1934,11 +1933,8 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 	 * Make sure nobody is waiting for us to release the VM,
 	 * otherwise we can deadlock when we wait on each other
 	 */
-	vfork_done = tsk->vfork_done;
-	if (vfork_done) {
-		tsk->vfork_done = NULL;
-		complete(vfork_done);
-	}
+	if (tsk->vfork_done)
+		complete_vfork_done(tsk);
 
 	if (core_waiters)
 		wait_for_completion(&core_state->startup);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..1b25a37 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2291,6 +2291,7 @@ extern int do_execve(const char *,
 		     const char __user * const __user *,
 		     const char __user * const __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern void complete_vfork_done(struct task_struct *tsk);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..5a41446 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -667,6 +667,14 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	return mm;
 }
 
+void complete_vfork_done(struct task_struct *tsk)
+{
+	struct completion *vfork_done = tsk->vfork_done;
+
+	tsk->vfork_done = NULL;
+	complete(vfork_done);
+}
+
 /* Please note the differences between mmput and mm_release.
  * mmput is called whenever we stop holding onto a mm_struct,
  * error success whatever.
@@ -682,8 +690,6 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
  */
 void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 {
-	struct completion *vfork_done = tsk->vfork_done;
-
 	/* Get rid of any futexes when releasing the mm */
 #ifdef CONFIG_FUTEX
 	if (unlikely(tsk->robust_list)) {
@@ -703,11 +709,8 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
 
-	/* notify parent sleeping on vfork() */
-	if (vfork_done) {
-		tsk->vfork_done = NULL;
-		complete(vfork_done);
-	}
+	if (tsk->vfork_done)
+		complete_vfork_done(tsk);
 
 	/*
 	 * If we're exiting normally, clear a user-space tid field if
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/4] vfork: make it killable
  2012-02-16 17:26       ` [PATCH 0/4] make vfork() killable Oleg Nesterov
  2012-02-16 17:26         ` [PATCH 1/4] introduce complete_vfork_done() Oleg Nesterov
@ 2012-02-16 17:27         ` Oleg Nesterov
  2012-02-17  0:39           ` Andrew Morton
  2012-02-16 17:27         ` [PATCH 3/4] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
                           ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-16 17:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

Make vfork() killable.

Change do_fork(CLONE_VFORK) to do wait_for_completion_killable().
If it fails we do not return to the user-mode and never touch the
memory shared with our child.

However, in this case we should clear child->vfork_done before
return, we use task_lock() in do_fork()->wait_for_vfork_done()
and complete_vfork_done() to serialize with each other.

Note: now that we use task_lock() we don't really need completion,
we could turn task->vfork_done into "task_struct *wake_up_me" but
this needs some complications.

NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK, kernel threads run with all signals ignored including
SIGKILL/SIGSTOP.

However this is obviously the user-visible change. Not only a fatal
signal can kill the vforking parent, a sub-thread can do execve or
exit_group() and kill the thread sleeping in vfork().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched.h |    2 +-
 kernel/fork.c         |   40 ++++++++++++++++++++++++++++++++--------
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b25a37..b646771 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2372,7 +2372,7 @@ static inline int thread_group_empty(struct task_struct *p)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[].
+ * ->cgroup.subsys[]. And ->vfork_done.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/kernel/fork.c b/kernel/fork.c
index 5a41446..fc2c2f0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -669,10 +669,34 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 
 void complete_vfork_done(struct task_struct *tsk)
 {
-	struct completion *vfork_done = tsk->vfork_done;
+	struct completion *vfork;
 
-	tsk->vfork_done = NULL;
-	complete(vfork_done);
+	task_lock(tsk);
+	vfork = tsk->vfork_done;
+	if (likely(vfork)) {
+		tsk->vfork_done = NULL;
+		complete(vfork);
+	}
+	task_unlock(tsk);
+}
+
+static int wait_for_vfork_done(struct task_struct *child,
+				struct completion *vfork)
+{
+	int killed;
+
+	freezer_do_not_count();
+	killed = wait_for_completion_killable(vfork);
+	freezer_count();
+
+	if (killed) {
+		task_lock(child);
+		child->vfork_done = NULL;
+		task_unlock(child);
+	}
+
+	put_task_struct(child);
+	return killed;
 }
 
 /* Please note the differences between mmput and mm_release.
@@ -716,7 +740,8 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 	 * If we're exiting normally, clear a user-space tid field if
 	 * requested.  We leave this alone when dying by signal, to leave
 	 * the value intact in a core dump, and to save the unnecessary
-	 * trouble otherwise.  Userland only wants this done for a sys_exit.
+	 * trouble, say, a killed vfork parent shouldn't touch this mm.
+	 * Userland only wants this done for a sys_exit.
 	 */
 	if (tsk->clear_child_tid) {
 		if (!(tsk->flags & PF_SIGNALED) &&
@@ -1548,6 +1573,7 @@ long do_fork(unsigned long clone_flags,
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
 			init_completion(&vfork);
+			get_task_struct(p);
 		}
 
 		/*
@@ -1565,10 +1591,8 @@ long do_fork(unsigned long clone_flags,
 			ptrace_event(trace, nr);
 
 		if (clone_flags & CLONE_VFORK) {
-			freezer_do_not_count();
-			wait_for_completion(&vfork);
-			freezer_count();
-			ptrace_event(PTRACE_EVENT_VFORK_DONE, nr);
+			if (!wait_for_vfork_done(p, &vfork))
+				ptrace_event(PTRACE_EVENT_VFORK_DONE, nr);
 		}
 	} else {
 		nr = PTR_ERR(p);
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/4] coredump_wait: don't call complete_vfork_done()
  2012-02-16 17:26       ` [PATCH 0/4] make vfork() killable Oleg Nesterov
  2012-02-16 17:26         ` [PATCH 1/4] introduce complete_vfork_done() Oleg Nesterov
  2012-02-16 17:27         ` [PATCH 2/4] vfork: make it killable Oleg Nesterov
@ 2012-02-16 17:27         ` Oleg Nesterov
  2012-02-16 17:27         ` [PATCH 4/4] kill PF_STARTING Oleg Nesterov
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-16 17:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done(). zap_threads() should find and kill all tasks
with the same ->mm, this includes our parent if ->vfork_done is set.

mm_release() becomes the only caller, unexport complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 fs/exec.c             |   14 ++------------
 include/linux/sched.h |    1 -
 kernel/fork.c         |    2 +-
 3 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index dccdcec..153dee1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1926,19 +1926,9 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 		core_waiters = zap_threads(tsk, mm, core_state, exit_code);
 	up_write(&mm->mmap_sem);
 
-	if (unlikely(core_waiters < 0))
-		goto fail;
-
-	/*
-	 * Make sure nobody is waiting for us to release the VM,
-	 * otherwise we can deadlock when we wait on each other
-	 */
-	if (tsk->vfork_done)
-		complete_vfork_done(tsk);
-
-	if (core_waiters)
+	if (core_waiters > 0)
 		wait_for_completion(&core_state->startup);
-fail:
+
 	return core_waiters;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b646771..11fcafa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2291,7 +2291,6 @@ extern int do_execve(const char *,
 		     const char __user * const __user *,
 		     const char __user * const __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
-extern void complete_vfork_done(struct task_struct *tsk);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index fc2c2f0..57ebebb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -667,7 +667,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	return mm;
 }
 
-void complete_vfork_done(struct task_struct *tsk)
+static void complete_vfork_done(struct task_struct *tsk)
 {
 	struct completion *vfork;
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/4] kill PF_STARTING
  2012-02-16 17:26       ` [PATCH 0/4] make vfork() killable Oleg Nesterov
                           ` (2 preceding siblings ...)
  2012-02-16 17:27         ` [PATCH 3/4] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
@ 2012-02-16 17:27         ` Oleg Nesterov
  2012-02-17  0:26         ` [PATCH 0/4] make vfork() killable Andrew Morton
       [not found]         ` <20120216173233.GF30393@redhat.com>
  5 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-16 17:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

Previously it was (ab)used by utrace. Then it was wrongly used by
the scheduler code.

Currently it is not used, kill it before it finds the new erroneous
user.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched.h |    1 -
 kernel/fork.c         |    9 ---------
 2 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11fcafa..0657368 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1777,7 +1777,6 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 /*
  * Per process flags
  */
-#define PF_STARTING	0x00000002	/* being created */
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
 #define PF_VCPU		0x00000010	/* I'm a virtual CPU */
diff --git a/kernel/fork.c b/kernel/fork.c
index 57ebebb..3aa9153 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1043,7 +1043,6 @@ static void copy_flags(unsigned long clone_flags, struct task_struct *p)
 
 	new_flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
 	new_flags |= PF_FORKNOEXEC;
-	new_flags |= PF_STARTING;
 	p->flags = new_flags;
 }
 
@@ -1576,14 +1575,6 @@ long do_fork(unsigned long clone_flags,
 			get_task_struct(p);
 		}
 
-		/*
-		 * We set PF_STARTING at creation in case tracing wants to
-		 * use this to distinguish a fully live task from one that
-		 * hasn't finished SIGSTOP raising yet.  Now we clear it
-		 * and set the child going.
-		 */
-		p->flags &= ~PF_STARTING;
-
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/4] make vfork() killable
  2012-02-16 17:26       ` [PATCH 0/4] make vfork() killable Oleg Nesterov
                           ` (3 preceding siblings ...)
  2012-02-16 17:27         ` [PATCH 4/4] kill PF_STARTING Oleg Nesterov
@ 2012-02-17  0:26         ` Andrew Morton
  2012-02-17  2:45           ` Stephen Rothwell
       [not found]         ` <20120216173233.GF30393@redhat.com>
  5 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2012-02-17  0:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On Thu, 16 Feb 2012 18:26:26 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/16, Oleg Nesterov wrote:
> >
> > Btw, there is another example of "unbounded" sleep in UNINTERRUPTIBLE,
> > vfork. I already have the patches, will send today.
> 
> Make vfork() killable, please apply.
> 
> These patches are old (Aug 2011), they spent a lot of time in linux-next
> without any problem. Linus participated in the discussion, iiuc he has
> no objections.

hum.  Via which tree did they get into linux-next and why did they not
get any further?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] introduce complete_vfork_done()
  2012-02-16 17:26         ` [PATCH 1/4] introduce complete_vfork_done() Oleg Nesterov
@ 2012-02-17  0:35           ` Andrew Morton
  2012-02-17 14:37             ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2012-02-17  0:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On Thu, 16 Feb 2012 18:26:47 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> No functional changes.
> 
> Move the clear-and-complete-vfork_done code into the new trivial
> helper, complete_vfork_done().
> 
> ...
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1915,7 +1915,6 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
>  	struct mm_struct *mm = tsk->mm;
> -	struct completion *vfork_done;
>  	int core_waiters = -EBUSY;
>  
>  	init_completion(&core_state->startup);
> @@ -1934,11 +1933,8 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
>  	 * Make sure nobody is waiting for us to release the VM,
>  	 * otherwise we can deadlock when we wait on each other
>  	 */
> -	vfork_done = tsk->vfork_done;
> -	if (vfork_done) {
> -		tsk->vfork_done = NULL;
> -		complete(vfork_done);
> -	}
> +	if (tsk->vfork_done)
> +		complete_vfork_done(tsk);
>  
>  	if (core_waiters)
>  		wait_for_completion(&core_state->startup);
>
> ...
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -667,6 +667,14 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  	return mm;
>  }
>  
> +void complete_vfork_done(struct task_struct *tsk)
> +{
> +	struct completion *vfork_done = tsk->vfork_done;
> +
> +	tsk->vfork_done = NULL;
> +	complete(vfork_done);
> +}
> +
>  /* Please note the differences between mmput and mm_release.
>   * mmput is called whenever we stop holding onto a mm_struct,
>   * error success whatever.
> @@ -682,8 +690,6 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>   */
>  void mm_release(struct task_struct *tsk, struct mm_struct *mm)
>  {
> -	struct completion *vfork_done = tsk->vfork_done;
> -
>  	/* Get rid of any futexes when releasing the mm */
>  #ifdef CONFIG_FUTEX
>  	if (unlikely(tsk->robust_list)) {
> @@ -703,11 +709,8 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
>  	/* Get rid of any cached register state */
>  	deactivate_mm(tsk, mm);
>  
> -	/* notify parent sleeping on vfork() */
> -	if (vfork_done) {
> -		tsk->vfork_done = NULL;
> -		complete(vfork_done);
> -	}
> +	if (tsk->vfork_done)
> +		complete_vfork_done(tsk);

This all looks somewhat smelly.

- Why do we zero tsk->vfork_done in this manner?  It *looks* like
  it's done to prevent the kernel from running complete() twice against
  a single task in a race situation.  If this is the case then it's
  pretty lame, isn't it?  We'd need external locking to firm that up
  and I'm not seeing it.

- Moving the test for non-null tsk->vfork_done into
  complete_vfork_done() would simplify things a bit?

- The complete_vfork_done() interface isn't wonderful.  What prevents
  tsk from getting freed?  Presumably the caller must have pinned it in
  some fashion?  Or must hold some lock?  Or it's always run against
  `current', in which case it would be clearer to not pass the
  task_struct arg at all?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/4] vfork: make it killable
  2012-02-16 17:27         ` [PATCH 2/4] vfork: make it killable Oleg Nesterov
@ 2012-02-17  0:39           ` Andrew Morton
  2012-02-17 14:44             ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2012-02-17  0:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On Thu, 16 Feb 2012 18:27:06 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -669,10 +669,34 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  
>  void complete_vfork_done(struct task_struct *tsk)
>  {
> -	struct completion *vfork_done = tsk->vfork_done;
> +	struct completion *vfork;
>  
> -	tsk->vfork_done = NULL;
> -	complete(vfork_done);
> +	task_lock(tsk);
> +	vfork = tsk->vfork_done;
> +	if (likely(vfork)) {
> +		tsk->vfork_done = NULL;
> +		complete(vfork);
> +	}
> +	task_unlock(tsk);
> +}

OK, so now we don't need to test tsk->vfork_done in callers.  But
mm_release() still does this, and it does it outside locks.  Mistake,
or micro-optimisation?  If the latter, why is the lockless peek
race-free?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/4] make vfork() killable
  2012-02-17  0:26         ` [PATCH 0/4] make vfork() killable Andrew Morton
@ 2012-02-17  2:45           ` Stephen Rothwell
  2012-02-17 14:46             ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Stephen Rothwell @ 2012-02-17  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, apw, arjan, fhrbata, john.johansen,
	penguin-kernel, rientjes, rusty, tj, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 461 bytes --]

Hi Andrew,

On Thu, 16 Feb 2012 16:26:02 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> hum.  Via which tree did they get into linux-next and why did they not
> get any further?

They were in the ptrace tree from Aug 26 last year ... they were still
there on Oct 14 and then the ptrace tree was removed ("temporarily", at
Oleg's request) from linux-next on Oct 24.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] introduce complete_vfork_done()
  2012-02-17  0:35           ` Andrew Morton
@ 2012-02-17 14:37             ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-17 14:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On 02/16, Andrew Morton wrote:
>
> On Thu, 16 Feb 2012 18:26:47 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > ...
> > +void complete_vfork_done(struct task_struct *tsk)
> > +{
> > +	struct completion *vfork_done = tsk->vfork_done;
> > +
> > +	tsk->vfork_done = NULL;
> > +	complete(vfork_done);
> > +}
> > +
> >  /* Please note the differences between mmput and mm_release.
> >   * mmput is called whenever we stop holding onto a mm_struct,
> >   * error success whatever.
> > @@ -682,8 +690,6 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> >   */
> >  void mm_release(struct task_struct *tsk, struct mm_struct *mm)
> >  {
> > -	struct completion *vfork_done = tsk->vfork_done;
> > -
> >  	/* Get rid of any futexes when releasing the mm */
> >  #ifdef CONFIG_FUTEX
> >  	if (unlikely(tsk->robust_list)) {
> > @@ -703,11 +709,8 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
> >  	/* Get rid of any cached register state */
> >  	deactivate_mm(tsk, mm);
> >
> > -	/* notify parent sleeping on vfork() */
> > -	if (vfork_done) {
> > -		tsk->vfork_done = NULL;
> > -		complete(vfork_done);
> > -	}
> > +	if (tsk->vfork_done)
> > +		complete_vfork_done(tsk);
>
> This all looks somewhat smelly.

First of all, let me repeat that this patch changes nothing, justs
move this code into the new helper.


> - Why do we zero tsk->vfork_done in this manner?  It *looks* like
>   it's done to prevent the kernel from running complete() twice against
>   a single task

Yes,

> in a race situation.

No. More precisely, not before/after this patch.

"if (vfork_done) complete_vfork_done()" is called twice very often.
A vforked child does exec and notifies its parent. It should clear
->vfork_done, otherwise it will do complete_vfork_done() again on
exit when ->vfork_done points to nowhere.

The caller can never race with another user of ->vfork_done. It
is the parent sleeping in do_fork(CLONE_VFORK). (I am ignoring
the kernel threads created by kthread_create).

>   We'd need external locking to firm that up
>   and I'm not seeing it.

After the next patch, parent/child can race with each other, that
is why the next patch moves complete() under task_lock(). I'll write
another email in reply to 2/4.

> - Moving the test for non-null tsk->vfork_done into
>   complete_vfork_done() would simplify things a bit?

Yes, perhaps this makes sense. After 3/4 mm_release() becomes the
only caller and this microoptimization buys nothing, this helper
will be static.

I like the explicit test a bit more, just because it looks more
clear to me. But this is subjective, I can redo.

> - The complete_vfork_done() interface isn't wonderful.  What prevents
>   tsk from getting freed?  Presumably the caller must have pinned it in
>   some fashion?  Or must hold some lock?  Or it's always run against
>   `current',

Yes, it is always current,

> in which case it would be clearer to not pass the
>   task_struct arg at all?

Well, may be... But mm_release() already has the 'tsk' argument which
is always current. It would be strange to not use it.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/4] vfork: make it killable
  2012-02-17  0:39           ` Andrew Morton
@ 2012-02-17 14:44             ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-17 14:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: apw, arjan, fhrbata, john.johansen, penguin-kernel, rientjes,
	rusty, tj, linux-kernel

On 02/16, Andrew Morton wrote:
>
> On Thu, 16 Feb 2012 18:27:06 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -669,10 +669,34 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> >
> >  void complete_vfork_done(struct task_struct *tsk)
> >  {
> > -	struct completion *vfork_done = tsk->vfork_done;
> > +	struct completion *vfork;
> >
> > -	tsk->vfork_done = NULL;
> > -	complete(vfork_done);
> > +	task_lock(tsk);
> > +	vfork = tsk->vfork_done;
> > +	if (likely(vfork)) {
> > +		tsk->vfork_done = NULL;
> > +		complete(vfork);
> > +	}
> > +	task_unlock(tsk);
> > +}
>
> OK, so now we don't need to test tsk->vfork_done in callers.  But
> mm_release() still does this, and it does it outside locks.

Yes, complete_vfork_done() can be called unconditionally,

> Mistake,
> or micro-optimisation?

micro-optimisation to avoid the unnecessary task_lock().

> If the latter, why is the lockless peek
> race-free?

If ->vfork_done != NULL, the child can never miss it. The parent
sets this pointer before the first wakeup.

However. The killed parent can clear ->vfork_done, see "if (killed)"
int wait_for_vfork_done(). That is why complete_vfork_done() should
re-check under task_lock().

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/4] make vfork() killable
  2012-02-17  2:45           ` Stephen Rothwell
@ 2012-02-17 14:46             ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-17 14:46 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Andrew Morton, apw, arjan, fhrbata, john.johansen,
	penguin-kernel, rientjes, rusty, tj, linux-kernel

Hi Stephen,

On 02/17, Stephen Rothwell wrote:
>
> On Thu, 16 Feb 2012 16:26:02 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > hum.  Via which tree did they get into linux-next and why did they not
> > get any further?
>
> They were in the ptrace tree from Aug 26 last year ... they were still
> there on Oct 14 and then the ptrace tree was removed ("temporarily", at
> Oleg's request) from linux-next on Oct 24.

Yes, thanks Stephen.

And yes, it was "temporarily" because I hoped to restore my korg account
quickly. But it is not that easy ;)

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 0/1] hung_task: fix the broken rcu_lock_break() logic
       [not found]             ` <20120217150726.GD22440@redhat.com>
@ 2012-02-17 18:00               ` Oleg Nesterov
  2012-02-17 18:00                 ` [PATCH 1/1] " Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-17 18:00 UTC (permalink / raw)
  To: Andrew Morton, Tetsuo Handa
  Cc: apw, arjan, fhrbata, john.johansen, rientjes, rusty, tj,
	linux-kernel, Paul E. McKenney, Mandeep Singh Baines

On 02/17, Oleg Nesterov wrote:
>
> On 02/17, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > >     int main(void)
> > >     {
> > >             vfork();
> > >             pause();
> > >     }
> >^
> > Wow!
>
> and this reminds me... check_hung_uninterruptible_tasks() is very broken,
> it can crash the kernel itself. I'll resend my ancient fixes once again
> today, maybe this resend will be successful ;)

Please apply.

The patch was sent more than a year ago, I preserved the acks I got.

I was also going to add the PF_FREEZER_SKIP check in check_hung_task(),
but this was already done: f9fab10bbd768b0e5254e53a4a8477a94bfc4b96.
This means I lied, the program above no longer creates the problem
for check_hung_uninterruptible_tasks(). But that was not the point
for make-it-killable anyway.


I simply can't understand why do we have sysctl_hung_task_check_count,
may be someone can explain... OK, we need to do rcu_read_unlock()
from time to time even if need_resched() is not set, that is why
we have HUNG_TASK_BATCHING. Although perhaps it makes sense to do

	-	if (!--batch_count)
	+	if (!--batch_count || need_resched())
			rcu_lock_break();

but max_count? I guess I missed something obvious.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/1] hung_task: fix the broken rcu_lock_break() logic
  2012-02-17 18:00               ` [PATCH 0/1] hung_task: fix the broken rcu_lock_break() logic Oleg Nesterov
@ 2012-02-17 18:00                 ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2012-02-17 18:00 UTC (permalink / raw)
  To: Andrew Morton, Tetsuo Handa
  Cc: apw, arjan, fhrbata, john.johansen, rientjes, rusty, tj,
	linux-kernel, Paul E. McKenney, Mandeep Singh Baines

check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
"softlockup: check all tasks in hung_task" commit ce9dbe24 looks
absolutely wrong.

	- rcu_lock_break() does put_task_struct(). If the task has exited
	  it is not safe to even read its ->state, nothing protects this
	  task_struct.

	- The TASK_DEAD checks are wrong too. Contrary to the comment, we
	  can't use it to check if the task was unhashed. It can be unhashed
	  without TASK_DEAD, or it can be valid with TASK_DEAD.

	  For example, an autoreaping task can do release_task(current)
	  long before it sets TASK_DEAD in do_exit().

	  Or, a zombie task can have ->state == TASK_DEAD but release_task()
	  was not called, and in this case we must not break the loop.

Change this code to check pid_alive() instead, and do this before
we drop the reference to the task_struct.

Note: while_each_thread() under rcu_read_lock() is not really safe,
it can livelock. This will be fixed later, but fortunately in this
case the "max_count" logic saves us anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mandeep Singh Baines <msb@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/hung_task.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 2e48ec0..c21449f 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -119,15 +119,20 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
  * For preemptible RCU it is sufficient to call rcu_read_unlock in order
  * to exit the grace period. For classic RCU, a reschedule is required.
  */
-static void rcu_lock_break(struct task_struct *g, struct task_struct *t)
+static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
 {
+	bool can_cont;
+
 	get_task_struct(g);
 	get_task_struct(t);
 	rcu_read_unlock();
 	cond_resched();
 	rcu_read_lock();
+	can_cont = pid_alive(g) && pid_alive(t);
 	put_task_struct(t);
 	put_task_struct(g);
+
+	return can_cont;
 }
 
 /*
@@ -154,9 +159,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 			goto unlock;
 		if (!--batch_count) {
 			batch_count = HUNG_TASK_BATCHING;
-			rcu_lock_break(g, t);
-			/* Exit if t or g was unhashed during refresh. */
-			if (t->state == TASK_DEAD || g->state == TASK_DEAD)
+			if (!rcu_lock_break(g, t))
 				goto unlock;
 		}
 		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2012-02-17 18:08 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-14 16:47 [PATCH 0/6] make request_module() killable Oleg Nesterov
2012-02-14 16:47 ` [PATCH 1/6] usermodehelper: introduce umh_complete(sub_info) Oleg Nesterov
2012-02-14 16:48 ` [PATCH 2/6] usermodehelper: implement UMH_KILLABLE Oleg Nesterov
2012-02-14 16:48 ` [PATCH 3/6] usermodehelper: kill umh_wait, renumber UMH_* constants Oleg Nesterov
2012-02-15  1:09   ` Rusty Russell
2012-02-15 18:12     ` Oleg Nesterov
2012-02-14 16:48 ` [PATCH 4/6] usermodehelper: ____call_usermodehelper() doesn't need do_exit() Oleg Nesterov
2012-02-14 16:48 ` [PATCH 5/6] kmod: introduce call_modprobe() helper Oleg Nesterov
2012-02-14 16:49 ` [PATCH 6/6] kmod: make __request_module() killable Oleg Nesterov
2012-02-15 20:30   ` Andrew Morton
2012-02-16 15:04     ` Oleg Nesterov
2012-02-16 17:26       ` [PATCH 0/4] make vfork() killable Oleg Nesterov
2012-02-16 17:26         ` [PATCH 1/4] introduce complete_vfork_done() Oleg Nesterov
2012-02-17  0:35           ` Andrew Morton
2012-02-17 14:37             ` Oleg Nesterov
2012-02-16 17:27         ` [PATCH 2/4] vfork: make it killable Oleg Nesterov
2012-02-17  0:39           ` Andrew Morton
2012-02-17 14:44             ` Oleg Nesterov
2012-02-16 17:27         ` [PATCH 3/4] coredump_wait: don't call complete_vfork_done() Oleg Nesterov
2012-02-16 17:27         ` [PATCH 4/4] kill PF_STARTING Oleg Nesterov
2012-02-17  0:26         ` [PATCH 0/4] make vfork() killable Andrew Morton
2012-02-17  2:45           ` Stephen Rothwell
2012-02-17 14:46             ` Oleg Nesterov
     [not found]         ` <20120216173233.GF30393@redhat.com>
     [not found]           ` <201202172211.CGH81726.OStOJFLFHQVMFO@I-love.SAKURA.ne.jp>
     [not found]             ` <20120217150726.GD22440@redhat.com>
2012-02-17 18:00               ` [PATCH 0/1] hung_task: fix the broken rcu_lock_break() logic Oleg Nesterov
2012-02-17 18:00                 ` [PATCH 1/1] " Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).