All of lore.kernel.org
 help / color / mirror / Atom feed
* + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision.patch added to -mm tree
@ 2011-08-16 20:11 akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  2011-08-17 11:55   ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b @ 2011-08-16 20:11 UTC (permalink / raw)
  To: mm-commits-u79uwXL29TY76Z2rM5mHXA
  Cc: lennart-mdGvqq1h2p+GdvJs77BJ7Q, kay.sievers-tD+1rO4QERM,
	linux-man-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


The patch titled
     prctl: add PR_{SET,GET}_CHILD_REAPER to allow simple process supervision
has been added to the -mm tree.  Its filename is
     prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find
out what to do about this

The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/

------------------------------------------------------
Subject: prctl: add PR_{SET,GET}_CHILD_REAPER to allow simple process supervision
From: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>

Userspace service managers/supervisors need to track their started
services.  Many services daemonize by double-forking and get implicitely
re-parented to PID 1.  The process manager will no longer be able to
receive the SIGCHLD signals for them.

With this prctl, a service manager can mark itself as a sort of 'sub-init'
process, able to stay as the parent process for all processes created by
the started services.  All SIGCHLD signals will be delivered to the
service manager.

As a side effect, the relevant parent PID information does not get lost by
a double-fork, which results in a more elaborate process tree and 'ps'
output.

This is orthogonal to PID namespaces.  PID namespaces are isolated from
each other, while a service management process usually requires the
serices to live in the same namespace, to be able to talk to each other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on on-demand.  Both will need init-like
capabilities to be able to properly keep track of the services they start.

Signed-off-by: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
Signed-off-by: Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org>
Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Roland McGrath <roland-/Z5OmTQCD9xF6kxbq+BtvQ@public.gmane.org>
Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: <linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |    2 ++
 kernel/exit.c         |    9 ++++++++-
 kernel/fork.c         |    2 ++
 kernel/sys.c          |    7 +++++++
 5 files changed, 22 insertions(+), 1 deletion(-)

diff -puN include/linux/prctl.h~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision include/linux/prctl.h
--- a/include/linux/prctl.h~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision
+++ a/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_REAPER 35
+#define PR_GET_CHILD_REAPER 36
+
 #endif /* _LINUX_PRCTL_H */
diff -puN include/linux/sched.h~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision include/linux/sched.h
--- a/include/linux/sched.h~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision
+++ a/include/linux/sched.h
@@ -1296,6 +1296,8 @@ struct task_struct {
 				 * execve */
 	unsigned in_iowait:1;
 
+	/* Reparent child processes to this process instead of pid 1. */
+	unsigned child_reaper:1;
 
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
diff -puN kernel/exit.c~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision kernel/exit.c
--- a/kernel/exit.c~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision
+++ a/kernel/exit.c
@@ -701,7 +701,7 @@ static struct task_struct *find_new_reap
 	__acquires(&tasklist_lock)
 {
 	struct pid_namespace *pid_ns = task_active_pid_ns(father);
-	struct task_struct *thread;
+	struct task_struct *thread, *reaper;
 
 	thread = father;
 	while_each_thread(father, thread) {
@@ -712,6 +712,13 @@ static struct task_struct *find_new_reap
 		return thread;
 	}
 
+	/* find the first ancestor which is marked as child_reaper */
+	for (reaper = father->parent;
+	     reaper != &init_task && reaper != pid_ns->child_reaper;
+	     reaper = reaper->parent)
+		if (reaper->child_reaper)
+			return reaper;
+
 	if (unlikely(pid_ns->child_reaper == father)) {
 		write_unlock_irq(&tasklist_lock);
 		if (unlikely(pid_ns == &init_pid_ns))
diff -puN kernel/fork.c~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision kernel/fork.c
--- a/kernel/fork.c~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision
+++ a/kernel/fork.c
@@ -1328,6 +1328,8 @@ static struct task_struct *copy_process(
 		p->parent_exec_id = current->self_exec_id;
 	}
 
+	p->child_reaper = 0;
+
 	spin_lock(&current->sighand->siglock);
 
 	/*
diff -puN kernel/sys.c~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision kernel/sys.c
--- a/kernel/sys.c~prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision
+++ a/kernel/sys.c
@@ -1800,6 +1800,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_REAPER:
+			me->child_reaper = !!arg2;
+			error = 0;
+			break;
+		case PR_GET_CHILD_REAPER:
+			error = put_user(me->child_reaper, (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;
_

Patches currently in -mm which might be from lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org are

prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision.patch

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 11:55   ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 11:55 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, lennart, kay.sievers, linux-man, roland, torvalds

On 08/16, Andrew Morton wrote:
>
> From: Lennart Poettering <lennart@poettering.net>
>
> Userspace service managers/supervisors need to track their started
> services.  Many services daemonize by double-forking and get implicitely
> re-parented to PID 1.  The process manager will no longer be able to
> receive the SIGCHLD signals for them.
>
> With this prctl, a service manager can mark itself as a sort of 'sub-init'
> process, able to stay as the parent process for all processes created by
> the started services.  All SIGCHLD signals will be delivered to the
> service manager.

I try to never argue with the new features. But to be honest, this
doesn't look very good to me.

OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
a service X which forks another child C and exits. Then C exits and
notifies M.

But. How can M know that the service X should be restarted? It only
knows the pid. What if wait(WEXITED) succeeds because C in turn does
fork + exit? What M has 2 or more services?




Anyway, the implementation is certainly buggy.

> @@ -1296,6 +1296,8 @@ struct task_struct {
>  				 * execve */
>  	unsigned in_iowait:1;
>
> +	/* Reparent child processes to this process instead of pid 1. */
> +	unsigned child_reaper:1;

First of all - this is already very wrong imho. This should be
per-process, not per-thread.

> +	/* find the first ancestor which is marked as child_reaper */
> +	for (reaper = father->parent;
> +	     reaper != &init_task && reaper != pid_ns->child_reaper;
> +	     reaper = reaper->parent)

This loop can never reach init_task/child_reaper and crash the kernel.
For example, father->parent can point to init_task's sub-thread.

OTOH you shouldn't use init_task at all.

Also. You shouldn't do this if the sub-namespace init exits, this is
wrong.

> +		if (reaper->child_reaper)
> +			return reaper;

No, we can't blindly return this task, it can be dead/exiting. More
precisely, we must not do this if it has already passed
forget_original_parent(). That is why the code above checks PF_EXITING.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 11:55   ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 11:55 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, kay.sievers-tD+1rO4QERM,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/16, Andrew Morton wrote:
>
> From: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
>
> Userspace service managers/supervisors need to track their started
> services.  Many services daemonize by double-forking and get implicitely
> re-parented to PID 1.  The process manager will no longer be able to
> receive the SIGCHLD signals for them.
>
> With this prctl, a service manager can mark itself as a sort of 'sub-init'
> process, able to stay as the parent process for all processes created by
> the started services.  All SIGCHLD signals will be delivered to the
> service manager.

I try to never argue with the new features. But to be honest, this
doesn't look very good to me.

OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
a service X which forks another child C and exits. Then C exits and
notifies M.

But. How can M know that the service X should be restarted? It only
knows the pid. What if wait(WEXITED) succeeds because C in turn does
fork + exit? What M has 2 or more services?




Anyway, the implementation is certainly buggy.

> @@ -1296,6 +1296,8 @@ struct task_struct {
>  				 * execve */
>  	unsigned in_iowait:1;
>
> +	/* Reparent child processes to this process instead of pid 1. */
> +	unsigned child_reaper:1;

First of all - this is already very wrong imho. This should be
per-process, not per-thread.

> +	/* find the first ancestor which is marked as child_reaper */
> +	for (reaper = father->parent;
> +	     reaper != &init_task && reaper != pid_ns->child_reaper;
> +	     reaper = reaper->parent)

This loop can never reach init_task/child_reaper and crash the kernel.
For example, father->parent can point to init_task's sub-thread.

OTOH you shouldn't use init_task at all.

Also. You shouldn't do this if the sub-namespace init exits, this is
wrong.

> +		if (reaper->child_reaper)
> +			return reaper;

No, we can't blindly return this task, it can be dead/exiting. More
precisely, we must not do this if it has already passed
forget_original_parent(). That is why the code above checks PF_EXITING.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:05     ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 13:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, lennart, kay.sievers, linux-man, roland, torvalds

On 08/17, Oleg Nesterov wrote:
>
> On 08/16, Andrew Morton wrote:
> >
> > From: Lennart Poettering <lennart@poettering.net>
> >
> > Userspace service managers/supervisors need to track their started
> > services.  Many services daemonize by double-forking and get implicitely
> > re-parented to PID 1.  The process manager will no longer be able to
> > receive the SIGCHLD signals for them.
> >
> > With this prctl, a service manager can mark itself as a sort of 'sub-init'
> > process, able to stay as the parent process for all processes created by
> > the started services.  All SIGCHLD signals will be delivered to the
> > service manager.
>
> I try to never argue with the new features. But to be honest, this
> doesn't look very good to me.
>
> OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
> a service X which forks another child C and exits. Then C exits and
> notifies M.
>
> But. How can M know that the service X should be restarted? It only
> knows the pid. What if wait(WEXITED) succeeds because C in turn does
> fork + exit? What M has 2 or more services?

Also. I am almost sure I have already reviewed a very similar patch
a long ago. Ungortunately, I can't find the previous discussion, and
I can't recall why that patch was not accepted.

But, I seem to remember, that patch cleared ->child_reaper on exec,
I think this makes sense.

And I am not sure about security. No, I do not see any problems, just
I don't know. Say, should we check the creds during reparenting? I
dunno.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:05     ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 13:05 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, kay.sievers-tD+1rO4QERM,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/17, Oleg Nesterov wrote:
>
> On 08/16, Andrew Morton wrote:
> >
> > From: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
> >
> > Userspace service managers/supervisors need to track their started
> > services.  Many services daemonize by double-forking and get implicitely
> > re-parented to PID 1.  The process manager will no longer be able to
> > receive the SIGCHLD signals for them.
> >
> > With this prctl, a service manager can mark itself as a sort of 'sub-init'
> > process, able to stay as the parent process for all processes created by
> > the started services.  All SIGCHLD signals will be delivered to the
> > service manager.
>
> I try to never argue with the new features. But to be honest, this
> doesn't look very good to me.
>
> OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
> a service X which forks another child C and exits. Then C exits and
> notifies M.
>
> But. How can M know that the service X should be restarted? It only
> knows the pid. What if wait(WEXITED) succeeds because C in turn does
> fork + exit? What M has 2 or more services?

Also. I am almost sure I have already reviewed a very similar patch
a long ago. Ungortunately, I can't find the previous discussion, and
I can't recall why that patch was not accepted.

But, I seem to remember, that patch cleared ->child_reaper on exec,
I think this makes sense.

And I am not sure about security. No, I do not see any problems, just
I don't know. Say, should we check the creds during reparenting? I
dunno.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-17 11:55   ` Oleg Nesterov
  (?)
  (?)
@ 2011-08-17 13:13   ` Kay Sievers
  2011-08-17 13:45       ` Oleg Nesterov
  -1 siblings, 1 reply; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 13:13 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On Wed, Aug 17, 2011 at 13:55, Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/16, Andrew Morton wrote:
>> From: Lennart Poettering <lennart@poettering.net>
>>
>> Userspace service managers/supervisors need to track their started
>> services.  Many services daemonize by double-forking and get implicitely
>> re-parented to PID 1.  The process manager will no longer be able to
>> receive the SIGCHLD signals for them.
>>
>> With this prctl, a service manager can mark itself as a sort of 'sub-init'
>> process, able to stay as the parent process for all processes created by
>> the started services.  All SIGCHLD signals will be delivered to the
>> service manager.
>
> I try to never argue with the new features. But to be honest, this
> doesn't look very good to me.
>
> OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
> a service X which forks another child C and exits. Then C exits and
> notifies M.
>
> But. How can M know that the service X should be restarted? It only
> knows the pid.

Legacy services write pid files and we read them, so we know the pid
to watch for. Proper services never double-fork and reparent in a
modern init environment.

> What if wait(WEXITED) succeeds because C in turn does
> fork + exit?

Nothing is really doing this. There are a few hacks out there that do
that on service update instead of just restarting, and such services
explicitly need to tell their new pid, up or they just need to disable
supervision if they can't behave. :)

> What M has 2 or more services?

Nothing different from one service. They are all tracked the same way.

> Anyway, the implementation is certainly buggy.
>
>> @@ -1296,6 +1296,8 @@ struct task_struct {
>>                                * execve */
>>       unsigned in_iowait:1;
>>
>> +     /* Reparent child processes to this process instead of pid 1. */
>> +     unsigned child_reaper:1;
>
> First of all - this is already very wrong imho. This should be
> per-process, not per-thread.

What do you mean? That would go where instead?

>> +     /* find the first ancestor which is marked as child_reaper */
>> +     for (reaper = father->parent;
>> +          reaper != &init_task && reaper != pid_ns->child_reaper;
>> +          reaper = reaper->parent)
>
> This loop can never reach init_task/child_reaper and crash the kernel.

You mean: *if* this loop can never ...?

> For example, father->parent can point to init_task's sub-thread.
>
> OTOH you shouldn't use init_task at all.

What would we use instead?

> Also. You shouldn't do this if the sub-namespace init exits, this is
> wrong.

It we find a sub-init, before the namespace PID1, why wouldn't we return it?

>> +             if (reaper->child_reaper)
>> +                     return reaper;
>
> No, we can't blindly return this task, it can be dead/exiting. More
> precisely, we must not do this if it has already passed
> forget_original_parent(). That is why the code above checks PF_EXITING.

Ok.

Kay

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:21       ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 13:21 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On Wed, Aug 17, 2011 at 15:05, Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/17, Oleg Nesterov wrote:
>> On 08/16, Andrew Morton wrote:
>> > From: Lennart Poettering <lennart@poettering.net>
>> >
>> > Userspace service managers/supervisors need to track their started
>> > services.  Many services daemonize by double-forking and get implicitely
>> > re-parented to PID 1.  The process manager will no longer be able to
>> > receive the SIGCHLD signals for them.
>> >
>> > With this prctl, a service manager can mark itself as a sort of 'sub-init'
>> > process, able to stay as the parent process for all processes created by
>> > the started services.  All SIGCHLD signals will be delivered to the
>> > service manager.
>>
>> I try to never argue with the new features. But to be honest, this
>> doesn't look very good to me.
>>
>> OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
>> a service X which forks another child C and exits. Then C exits and
>> notifies M.
>>
>> But. How can M know that the service X should be restarted? It only
>> knows the pid. What if wait(WEXITED) succeeds because C in turn does
>> fork + exit? What M has 2 or more services?
>
> Also. I am almost sure I have already reviewed a very similar patch
> a long ago.

It's almost the same.

> Ungortunately, I can't find the previous discussion, and
> I can't recall why that patch was not accepted.

We've just been busy with other things. But now that systemd will not
only be PID 1 but also the process parent for every user logged into
the system, we need this functionality.

> But, I seem to remember, that patch cleared ->child_reaper on exec,

I don't think he original patch did.

> I think this makes sense.

Why would it? Systemd can serialize its state and properly re-exec
itself as many times as needed during its lifetime. Why would the
kernel take something away from a process, which it explicitly asked
for?

> And I am not sure about security. No, I do not see any problems, just
> I don't know. Say, should we check the creds during reparenting? I
> dunno.

Hmm, I don't see why that would be necessary. It's just one of our
parents that aks for our signals.

Kay

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:21       ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 13:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Aug 17, 2011 at 15:05, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 08/17, Oleg Nesterov wrote:
>> On 08/16, Andrew Morton wrote:
>> > From: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
>> >
>> > Userspace service managers/supervisors need to track their started
>> > services.  Many services daemonize by double-forking and get implicitely
>> > re-parented to PID 1.  The process manager will no longer be able to
>> > receive the SIGCHLD signals for them.
>> >
>> > With this prctl, a service manager can mark itself as a sort of 'sub-init'
>> > process, able to stay as the parent process for all processes created by
>> > the started services.  All SIGCHLD signals will be delivered to the
>> > service manager.
>>
>> I try to never argue with the new features. But to be honest, this
>> doesn't look very good to me.
>>
>> OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
>> a service X which forks another child C and exits. Then C exits and
>> notifies M.
>>
>> But. How can M know that the service X should be restarted? It only
>> knows the pid. What if wait(WEXITED) succeeds because C in turn does
>> fork + exit? What M has 2 or more services?
>
> Also. I am almost sure I have already reviewed a very similar patch
> a long ago.

It's almost the same.

> Ungortunately, I can't find the previous discussion, and
> I can't recall why that patch was not accepted.

We've just been busy with other things. But now that systemd will not
only be PID 1 but also the process parent for every user logged into
the system, we need this functionality.

> But, I seem to remember, that patch cleared ->child_reaper on exec,

I don't think he original patch did.

> I think this makes sense.

Why would it? Systemd can serialize its state and properly re-exec
itself as many times as needed during its lifetime. Why would the
kernel take something away from a process, which it explicitly asked
for?

> And I am not sure about security. No, I do not see any problems, just
> I don't know. Say, should we check the creds during reparenting? I
> dunno.

Hmm, I don't see why that would be necessary. It's just one of our
parents that aks for our signals.

Kay
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:37         ` Alan Cox
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Cox @ 2011-08-17 13:37 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, akpm, linux-kernel, lennart, linux-man, roland, torvalds

O> Why would it? Systemd can serialize its state and properly re-exec
> itself as many times as needed during its lifetime. Why would the
> kernel take something away from a process, which it explicitly asked
> for?

Because a re-exec is a change of context, in the same was as a re-exec
closes some file handles kills alarms, adjusts signals etc. Across an
suid boundary of course it gets even more important.

Why would the kernel allow a parent process, possibly with a different
security context, to muck with defined and guaranteed standards compliant
behaviour it may rely upon.

> Hmm, I don't see why that would be necessary. It's just one of our
> parents that aks for our signals.

I think it is fundamentally the wrong answer. The behaviour in question
is in every Unix since day one and apps rely upon it.

Now I can see why you want to know when processes exit and do it without
tampering with the process, but it seems to me that's simply a question
of us lacking a way to do this nicely, whether inotify/dnotify/etc
on /proc, some kind of 'also signal me' property or some kind of process
event interface.

Of those a signal based one seems the weakest because programmers and
signal often don't mix well because it is asychronous and also because it
wouldn't naturally allow multiple users (eg a process monitoring tool and
systemd to share)

For that matter your init process could farm them back out down a named
pipe or some other such interface and do the monitoring in userspace.

Alan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:37         ` Alan Cox
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Cox @ 2011-08-17 13:37 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

O> Why would it? Systemd can serialize its state and properly re-exec
> itself as many times as needed during its lifetime. Why would the
> kernel take something away from a process, which it explicitly asked
> for?

Because a re-exec is a change of context, in the same was as a re-exec
closes some file handles kills alarms, adjusts signals etc. Across an
suid boundary of course it gets even more important.

Why would the kernel allow a parent process, possibly with a different
security context, to muck with defined and guaranteed standards compliant
behaviour it may rely upon.

> Hmm, I don't see why that would be necessary. It's just one of our
> parents that aks for our signals.

I think it is fundamentally the wrong answer. The behaviour in question
is in every Unix since day one and apps rely upon it.

Now I can see why you want to know when processes exit and do it without
tampering with the process, but it seems to me that's simply a question
of us lacking a way to do this nicely, whether inotify/dnotify/etc
on /proc, some kind of 'also signal me' property or some kind of process
event interface.

Of those a signal based one seems the weakest because programmers and
signal often don't mix well because it is asychronous and also because it
wouldn't naturally allow multiple users (eg a process monitoring tool and
systemd to share)

For that matter your init process could farm them back out down a named
pipe or some other such interface and do the monitoring in userspace.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:45       ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 13:45 UTC (permalink / raw)
  To: Kay Sievers; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 13:55, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I try to never argue with the new features. But to be honest, this
> > doesn't look very good to me.
> >
> > OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
> > a service X which forks another child C and exits. Then C exits and
> > notifies M.
> >
> > But. How can M know that the service X should be restarted? It only
> > knows the pid.
>
> Legacy services write pid files and we read them, so we know the pid
> to watch for. Proper services never double-fork and reparent in a
> modern init environment.

OK. So, this patch can only help to handle the legacy services? And
the service should participate (write pid files for example). And,

>
> > What if wait(WEXITED) succeeds because C in turn does
> > fork + exit?
>
> Nothing is really doing this.

OK. But this means you propose this patch to solve the very specific
problems. IOW, imho this doesn't look very useful "in general" to me.

May be we need something else instead... And iiuc you don't really
need to change the reparenting, you only want the notification if
the process exits.



> >> @@ -1296,6 +1296,8 @@ struct task_struct {
> >>                                * execve */
> >>       unsigned in_iowait:1;
> >>
> >> +     /* Reparent child processes to this process instead of pid 1. */
> >> +     unsigned child_reaper:1;
> >
> > First of all - this is already very wrong imho. This should be
> > per-process, not per-thread.
>
> What do you mean? That would go where instead?

You should mark the whole process as sub-reaper, not a single thread
which does prctl(). The parent/child relationship is process-wide.

If nothing else. Suppose that application does pthread_create(), the
new thread does prctl(REAPER) and exits.

> >> +     /* find the first ancestor which is marked as child_reaper */
> >> +     for (reaper = father->parent;
> >> +          reaper != &init_task && reaper != pid_ns->child_reaper;
> >> +          reaper = reaper->parent)
> >
> > This loop can never reach init_task/child_reaper and crash the kernel.
>
> You mean: *if* this loop can never ...?

Yes.

> > For example, father->parent can point to init_task's sub-thread.
> >
> > OTOH you shouldn't use init_task at all.
>
> What would we use instead?

You should check ->child_reaper only. But see above, it can be multithreaded.

> > Also. You shouldn't do this if the sub-namespace init exits, this is
> > wrong.
>
> It we find a sub-init, before the namespace PID1, why wouldn't we return it?

Ah, I meant pid_ns->child_reaper, not task->child_reaper.

If pid_ns->child_reaper exits we should never try to "reparent" its
children, see zap_pid_ns_processes() in particular. IOW, this should
go into the "else" branch of "if (pid_ns->child_reaper == father)"

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 13:45       ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 13:45 UTC (permalink / raw)
  To: Kay Sievers
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 13:55, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > I try to never argue with the new features. But to be honest, this
> > doesn't look very good to me.
> >
> > OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
> > a service X which forks another child C and exits. Then C exits and
> > notifies M.
> >
> > But. How can M know that the service X should be restarted? It only
> > knows the pid.
>
> Legacy services write pid files and we read them, so we know the pid
> to watch for. Proper services never double-fork and reparent in a
> modern init environment.

OK. So, this patch can only help to handle the legacy services? And
the service should participate (write pid files for example). And,

>
> > What if wait(WEXITED) succeeds because C in turn does
> > fork + exit?
>
> Nothing is really doing this.

OK. But this means you propose this patch to solve the very specific
problems. IOW, imho this doesn't look very useful "in general" to me.

May be we need something else instead... And iiuc you don't really
need to change the reparenting, you only want the notification if
the process exits.



> >> @@ -1296,6 +1296,8 @@ struct task_struct {
> >>                                * execve */
> >>       unsigned in_iowait:1;
> >>
> >> +     /* Reparent child processes to this process instead of pid 1. */
> >> +     unsigned child_reaper:1;
> >
> > First of all - this is already very wrong imho. This should be
> > per-process, not per-thread.
>
> What do you mean? That would go where instead?

You should mark the whole process as sub-reaper, not a single thread
which does prctl(). The parent/child relationship is process-wide.

If nothing else. Suppose that application does pthread_create(), the
new thread does prctl(REAPER) and exits.

> >> +     /* find the first ancestor which is marked as child_reaper */
> >> +     for (reaper = father->parent;
> >> +          reaper != &init_task && reaper != pid_ns->child_reaper;
> >> +          reaper = reaper->parent)
> >
> > This loop can never reach init_task/child_reaper and crash the kernel.
>
> You mean: *if* this loop can never ...?

Yes.

> > For example, father->parent can point to init_task's sub-thread.
> >
> > OTOH you shouldn't use init_task at all.
>
> What would we use instead?

You should check ->child_reaper only. But see above, it can be multithreaded.

> > Also. You shouldn't do this if the sub-namespace init exits, this is
> > wrong.
>
> It we find a sub-init, before the namespace PID1, why wouldn't we return it?

Ah, I meant pid_ns->child_reaper, not task->child_reaper.

If pid_ns->child_reaper exits we should never try to "reparent" its
children, see zap_pid_ns_processes() in particular. IOW, this should
go into the "else" branch of "if (pid_ns->child_reaper == father)"

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 14:16         ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 14:16 UTC (permalink / raw)
  To: Kay Sievers; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 15:05, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > But, I seem to remember, that patch cleared ->child_reaper on exec,
>
> I don't think he original patch did.
>
> > I think this makes sense.
>
> Why would it? Systemd can serialize its state and properly re-exec
> itself as many times as needed during its lifetime. Why would the
> kernel take something away from a process, which it explicitly asked
> for?
>
> > And I am not sure about security. No, I do not see any problems, just
> > I don't know. Say, should we check the creds during reparenting? I
> > dunno.
>
> Hmm, I don't see why that would be necessary. It's just one of our
> parents that aks for our signals.

Oh, I do not know. I do not pretend I understand the security ;)

For example. I simply can't understand why do we have security_task_wait().
Why waitpid(my_natural_child) can fail for security reasons? But we have
selinux_task_wait().



So, once again. I am not arguing. I am only asking the questions.
I didn't mean I see any problem here.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 14:16         ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 14:16 UTC (permalink / raw)
  To: Kay Sievers
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 15:05, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > But, I seem to remember, that patch cleared ->child_reaper on exec,
>
> I don't think he original patch did.
>
> > I think this makes sense.
>
> Why would it? Systemd can serialize its state and properly re-exec
> itself as many times as needed during its lifetime. Why would the
> kernel take something away from a process, which it explicitly asked
> for?
>
> > And I am not sure about security. No, I do not see any problems, just
> > I don't know. Say, should we check the creds during reparenting? I
> > dunno.
>
> Hmm, I don't see why that would be necessary. It's just one of our
> parents that aks for our signals.

Oh, I do not know. I do not pretend I understand the security ;)

For example. I simply can't understand why do we have security_task_wait().
Why waitpid(my_natural_child) can fail for security reasons? But we have
selinux_task_wait().



So, once again. I am not arguing. I am only asking the questions.
I didn't mean I see any problem here.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 15:45         ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 15:45 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On Wed, Aug 17, 2011 at 15:45, Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/17, Kay Sievers wrote:
>> On Wed, Aug 17, 2011 at 13:55, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > I try to never argue with the new features. But to be honest, this
>> > doesn't look very good to me.
>> >
>> > OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
>> > a service X which forks another child C and exits. Then C exits and
>> > notifies M.
>> >
>> > But. How can M know that the service X should be restarted? It only
>> > knows the pid.
>>
>> Legacy services write pid files and we read them, so we know the pid
>> to watch for. Proper services never double-fork and reparent in a
>> modern init environment.
>
> OK. So, this patch can only help to handle the legacy services?

It helps them with services that need it. It is not recommended to
double-fork ever with a modern init system, but it's historic default
and common practice, and we are not going to change that any time
soon.

> And
> the service should participate (write pid files for example). And,

This is not meant as a security feature, if that's what your asking.
It will not prevent services from doing nasty things and escape the
process that started them. But it's still a feature that today only
PID 1 and which we need for more processes.

>> > What if wait(WEXITED) succeeds because C in turn does
>> > fork + exit?
>>
>> Nothing is really doing this.
>
> OK. But this means you propose this patch to solve the very specific
> problems.

No, it's for a very common problem. But again, it's not a security feature.

> IOW, imho this doesn't look very useful "in general" to me.

It is very useful if you have an init-like daemon.

> May be we need something else instead... And iiuc you don't really
> need to change the reparenting, you only want the notification if
> the process exits.

No, we want to be the parent of the process, and we want to be the one
who reaps all the child process, not only receive some out-of-band
notifications. The sub-init is the babysitter of all the things it has
started, and that should be reflected in the parent child relation.

>> >> @@ -1296,6 +1296,8 @@ struct task_struct {
>> >>                                * execve */
>> >>       unsigned in_iowait:1;
>> >>
>> >> +     /* Reparent child processes to this process instead of pid 1. */
>> >> +     unsigned child_reaper:1;
>> >
>> > First of all - this is already very wrong imho. This should be
>> > per-process, not per-thread.
>>
>> What do you mean? That would go where instead?
>
> You should mark the whole process as sub-reaper, not a single thread
> which does prctl(). The parent/child relationship is process-wide.

Ok.

> If nothing else. Suppose that application does pthread_create(), the
> new thread does prctl(REAPER) and exits.

I get the (weird) idea. :)

>> >> +     /* find the first ancestor which is marked as child_reaper */
>> >> +     for (reaper = father->parent;
>> >> +          reaper != &init_task && reaper != pid_ns->child_reaper;
>> >> +          reaper = reaper->parent)
>> >
>> > This loop can never reach init_task/child_reaper and crash the kernel.
>>
>> You mean: *if* this loop can never ...?
>
> Yes.
>
>> > For example, father->parent can point to init_task's sub-thread.
>> >
>> > OTOH you shouldn't use init_task at all.
>>
>> What would we use instead?
>
> You should check ->child_reaper only. But see above, it can be multithreaded.

The main PID 1 from the system has no ->child_reaper set as far as I
see, hence we check for init_task.

>> > Also. You shouldn't do this if the sub-namespace init exits, this is
>> > wrong.
>>
>> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
>
> Ah, I meant pid_ns->child_reaper, not task->child_reaper.
>
> If pid_ns->child_reaper exits we should never try to "reparent" its
> children, see zap_pid_ns_processes() in particular. IOW, this should
> go into the "else" branch of "if (pid_ns->child_reaper == father)"

I don't understand this. If we find a marked task->child_reaper
_before_ we find a pid_ns->child_reaper in the chain of parents, why
wouldn't we return it?

Kay

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 15:45         ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 15:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Aug 17, 2011 at 15:45, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 08/17, Kay Sievers wrote:
>> On Wed, Aug 17, 2011 at 13:55, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> >
>> > I try to never argue with the new features. But to be honest, this
>> > doesn't look very good to me.
>> >
>> > OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
>> > a service X which forks another child C and exits. Then C exits and
>> > notifies M.
>> >
>> > But. How can M know that the service X should be restarted? It only
>> > knows the pid.
>>
>> Legacy services write pid files and we read them, so we know the pid
>> to watch for. Proper services never double-fork and reparent in a
>> modern init environment.
>
> OK. So, this patch can only help to handle the legacy services?

It helps them with services that need it. It is not recommended to
double-fork ever with a modern init system, but it's historic default
and common practice, and we are not going to change that any time
soon.

> And
> the service should participate (write pid files for example). And,

This is not meant as a security feature, if that's what your asking.
It will not prevent services from doing nasty things and escape the
process that started them. But it's still a feature that today only
PID 1 and which we need for more processes.

>> > What if wait(WEXITED) succeeds because C in turn does
>> > fork + exit?
>>
>> Nothing is really doing this.
>
> OK. But this means you propose this patch to solve the very specific
> problems.

No, it's for a very common problem. But again, it's not a security feature.

> IOW, imho this doesn't look very useful "in general" to me.

It is very useful if you have an init-like daemon.

> May be we need something else instead... And iiuc you don't really
> need to change the reparenting, you only want the notification if
> the process exits.

No, we want to be the parent of the process, and we want to be the one
who reaps all the child process, not only receive some out-of-band
notifications. The sub-init is the babysitter of all the things it has
started, and that should be reflected in the parent child relation.

>> >> @@ -1296,6 +1296,8 @@ struct task_struct {
>> >>                                * execve */
>> >>       unsigned in_iowait:1;
>> >>
>> >> +     /* Reparent child processes to this process instead of pid 1. */
>> >> +     unsigned child_reaper:1;
>> >
>> > First of all - this is already very wrong imho. This should be
>> > per-process, not per-thread.
>>
>> What do you mean? That would go where instead?
>
> You should mark the whole process as sub-reaper, not a single thread
> which does prctl(). The parent/child relationship is process-wide.

Ok.

> If nothing else. Suppose that application does pthread_create(), the
> new thread does prctl(REAPER) and exits.

I get the (weird) idea. :)

>> >> +     /* find the first ancestor which is marked as child_reaper */
>> >> +     for (reaper = father->parent;
>> >> +          reaper != &init_task && reaper != pid_ns->child_reaper;
>> >> +          reaper = reaper->parent)
>> >
>> > This loop can never reach init_task/child_reaper and crash the kernel.
>>
>> You mean: *if* this loop can never ...?
>
> Yes.
>
>> > For example, father->parent can point to init_task's sub-thread.
>> >
>> > OTOH you shouldn't use init_task at all.
>>
>> What would we use instead?
>
> You should check ->child_reaper only. But see above, it can be multithreaded.

The main PID 1 from the system has no ->child_reaper set as far as I
see, hence we check for init_task.

>> > Also. You shouldn't do this if the sub-namespace init exits, this is
>> > wrong.
>>
>> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
>
> Ah, I meant pid_ns->child_reaper, not task->child_reaper.
>
> If pid_ns->child_reaper exits we should never try to "reparent" its
> children, see zap_pid_ns_processes() in particular. IOW, this should
> go into the "else" branch of "if (pid_ns->child_reaper == father)"

I don't understand this. If we find a marked task->child_reaper
_before_ we find a pid_ns->child_reaper in the chain of parents, why
wouldn't we return it?

Kay
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 15:53           ` Alan Cox
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Cox @ 2011-08-17 15:53 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, akpm, linux-kernel, lennart, linux-man, roland, torvalds

O> This is not meant as a security feature, if that's what your asking.
> It will not prevent services from doing nasty things and escape the
> process that started them. But it's still a feature that today only
> PID 1 and which we need for more processes.

I'm more worried about it beign a security flaw...

> > IOW, imho this doesn't look very useful "in general" to me.
> 
> It is very useful if you have an init-like daemon.

Which is a special case

> 
> > May be we need something else instead... And iiuc you don't really
> > need to change the reparenting, you only want the notification if
> > the process exits.
> 
> No, we want to be the parent of the process, and we want to be the one
> who reaps all the child process, not only receive some out-of-band
> notifications. The sub-init is the babysitter of all the things it has
> started, and that should be reflected in the parent child relation.

Why ?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 15:53           ` Alan Cox
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Cox @ 2011-08-17 15:53 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

O> This is not meant as a security feature, if that's what your asking.
> It will not prevent services from doing nasty things and escape the
> process that started them. But it's still a feature that today only
> PID 1 and which we need for more processes.

I'm more worried about it beign a security flaw...

> > IOW, imho this doesn't look very useful "in general" to me.
> 
> It is very useful if you have an init-like daemon.

Which is a special case

> 
> > May be we need something else instead... And iiuc you don't really
> > need to change the reparenting, you only want the notification if
> > the process exits.
> 
> No, we want to be the parent of the process, and we want to be the one
> who reaps all the child process, not only receive some out-of-band
> notifications. The sub-init is the babysitter of all the things it has
> started, and that should be reflected in the parent child relation.

Why ?

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 16:03         ` Denys Vlasenko
  0 siblings, 0 replies; 63+ messages in thread
From: Denys Vlasenko @ 2011-08-17 16:03 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, akpm, linux-kernel, lennart, linux-man, roland, torvalds

On Wed, Aug 17, 2011 at 3:21 PM, Kay Sievers <kay.sievers@vrfy.org> wrote:
> On Wed, Aug 17, 2011 at 15:05, Oleg Nesterov <oleg@redhat.com> wrote:
>> On 08/17, Oleg Nesterov wrote:
>>> On 08/16, Andrew Morton wrote:
>>> > From: Lennart Poettering <lennart@poettering.net>
>>> >
>>> > Userspace service managers/supervisors need to track their started
>>> > services.  Many services daemonize by double-forking and get implicitely
>>> > re-parented to PID 1.  The process manager will no longer be able to
>>> > receive the SIGCHLD signals for them.
>>> >
>>> > With this prctl, a service manager can mark itself as a sort of 'sub-init'
>>> > process, able to stay as the parent process for all processes created by
>>> > the started services.  All SIGCHLD signals will be delivered to the
>>> > service manager.
>>>
>>> I try to never argue with the new features. But to be honest, this
>>> doesn't look very good to me.
>>>
>>> OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
>>> a service X which forks another child C and exits. Then C exits and
>>> notifies M.
>>>
>>> But. How can M know that the service X should be restarted? It only
>>> knows the pid. What if wait(WEXITED) succeeds because C in turn does
>>> fork + exit? What M has 2 or more services?
>>
>> Also. I am almost sure I have already reviewed a very similar patch
>> a long ago.
>
> It's almost the same.
>
>> Ungortunately, I can't find the previous discussion, and
>> I can't recall why that patch was not accepted.
>
> We've just been busy with other things. But now that systemd will not
> only be PID 1 but also the process parent for every user logged into
> the system, we need this functionality.

Getty and login worked rather well for 40+ years.
You need to stop trying to supplant every traditional Unix utility
with systemd. This is becoming ridiculous.
You have is a hammer, everything looks like a nail to you.
Where will you stop?

-- 
vda

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 16:03         ` Denys Vlasenko
  0 siblings, 0 replies; 63+ messages in thread
From: Denys Vlasenko @ 2011-08-17 16:03 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Aug 17, 2011 at 3:21 PM, Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org> wrote:
> On Wed, Aug 17, 2011 at 15:05, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On 08/17, Oleg Nesterov wrote:
>>> On 08/16, Andrew Morton wrote:
>>> > From: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
>>> >
>>> > Userspace service managers/supervisors need to track their started
>>> > services.  Many services daemonize by double-forking and get implicitely
>>> > re-parented to PID 1.  The process manager will no longer be able to
>>> > receive the SIGCHLD signals for them.
>>> >
>>> > With this prctl, a service manager can mark itself as a sort of 'sub-init'
>>> > process, able to stay as the parent process for all processes created by
>>> > the started services.  All SIGCHLD signals will be delivered to the
>>> > service manager.
>>>
>>> I try to never argue with the new features. But to be honest, this
>>> doesn't look very good to me.
>>>
>>> OK, a service manager M does prctl(PR_SET_CHILD_REAPER), then it forks
>>> a service X which forks another child C and exits. Then C exits and
>>> notifies M.
>>>
>>> But. How can M know that the service X should be restarted? It only
>>> knows the pid. What if wait(WEXITED) succeeds because C in turn does
>>> fork + exit? What M has 2 or more services?
>>
>> Also. I am almost sure I have already reviewed a very similar patch
>> a long ago.
>
> It's almost the same.
>
>> Ungortunately, I can't find the previous discussion, and
>> I can't recall why that patch was not accepted.
>
> We've just been busy with other things. But now that systemd will not
> only be PID 1 but also the process parent for every user logged into
> the system, we need this functionality.

Getty and login worked rather well for 40+ years.
You need to stop trying to supplant every traditional Unix utility
with systemd. This is becoming ridiculous.
You have is a hammer, everything looks like a nail to you.
Where will you stop?

-- 
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 16:20           ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 16:20 UTC (permalink / raw)
  To: Kay Sievers; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 15:45, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > OK. So, this patch can only help to handle the legacy services?
>
> It helps them with services that need it. It is not recommended to
> double-fork ever with a modern init system, but it's historic default
> and common practice, and we are not going to change that any time
> soon.
>
> > And
> > the service should participate (write pid files for example). And,
>
> This is not meant as a security feature, if that's what your asking.

Not at all.

> It will not prevent services from doing nasty things and escape the
> process that started them. But it's still a feature that today only
> PID 1 and which we need for more processes.
>
> >> > What if wait(WEXITED) succeeds because C in turn does
> >> > fork + exit?
> >>
> >> Nothing is really doing this.
> >
> > OK. But this means you propose this patch to solve the very specific
> > problems.
>
> No, it's for a very common problem. But again, it's not a security feature.

Once again, I didn't meant security.

> > IOW, imho this doesn't look very useful "in general" to me.
>
> It is very useful if you have an init-like daemon.

Well, this is subjective, but personally I don't agree.

> > May be we need something else instead... And iiuc you don't really
> > need to change the reparenting, you only want the notification if
> > the process exits.
>
> No, we want to be the parent of the process,
> ...
> The sub-init is the babysitter of all the things it has
> started, and that should be reflected in the parent child relation.

OK. But could you explain why do we want this? This is not clear from
the changelog/discussion.

> > You should check ->child_reaper only. But see above, it can be multithreaded.
>
> The main PID 1 from the system has no ->child_reaper set as far as I
> see, hence we check for init_task.

No, you don't. Once again, if pid_ns->child_reaper exits, you should
not even try to find the sub-reaper in its parents chain.

see also below...

> >> > Also. You shouldn't do this if the sub-namespace init exits, this is
> >> > wrong.
> >>
> >> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
> >
> > Ah, I meant pid_ns->child_reaper, not task->child_reaper.
> >
> > If pid_ns->child_reaper exits we should never try to "reparent" its
> > children, see zap_pid_ns_processes() in particular. IOW, this should
> > go into the "else" branch of "if (pid_ns->child_reaper == father)"
>
> I don't understand this. If we find a marked task->child_reaper
> _before_ we find a pid_ns->child_reaper in the chain of parents,

This is fine.

OK. I guess I wasn't clear, and I do not know how to explaine better.
Please look at your code ;) Suppose that a sub-namespace init exits.
Not the global /sbin/init. Not the caller of prctl(REAPER).

In this case we should kill the children, not reparent them. Or panic
if it is the global init (see above).

See?

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 16:20           ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 16:20 UTC (permalink / raw)
  To: Kay Sievers
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 15:45, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > OK. So, this patch can only help to handle the legacy services?
>
> It helps them with services that need it. It is not recommended to
> double-fork ever with a modern init system, but it's historic default
> and common practice, and we are not going to change that any time
> soon.
>
> > And
> > the service should participate (write pid files for example). And,
>
> This is not meant as a security feature, if that's what your asking.

Not at all.

> It will not prevent services from doing nasty things and escape the
> process that started them. But it's still a feature that today only
> PID 1 and which we need for more processes.
>
> >> > What if wait(WEXITED) succeeds because C in turn does
> >> > fork + exit?
> >>
> >> Nothing is really doing this.
> >
> > OK. But this means you propose this patch to solve the very specific
> > problems.
>
> No, it's for a very common problem. But again, it's not a security feature.

Once again, I didn't meant security.

> > IOW, imho this doesn't look very useful "in general" to me.
>
> It is very useful if you have an init-like daemon.

Well, this is subjective, but personally I don't agree.

> > May be we need something else instead... And iiuc you don't really
> > need to change the reparenting, you only want the notification if
> > the process exits.
>
> No, we want to be the parent of the process,
> ...
> The sub-init is the babysitter of all the things it has
> started, and that should be reflected in the parent child relation.

OK. But could you explain why do we want this? This is not clear from
the changelog/discussion.

> > You should check ->child_reaper only. But see above, it can be multithreaded.
>
> The main PID 1 from the system has no ->child_reaper set as far as I
> see, hence we check for init_task.

No, you don't. Once again, if pid_ns->child_reaper exits, you should
not even try to find the sub-reaper in its parents chain.

see also below...

> >> > Also. You shouldn't do this if the sub-namespace init exits, this is
> >> > wrong.
> >>
> >> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
> >
> > Ah, I meant pid_ns->child_reaper, not task->child_reaper.
> >
> > If pid_ns->child_reaper exits we should never try to "reparent" its
> > children, see zap_pid_ns_processes() in particular. IOW, this should
> > go into the "else" branch of "if (pid_ns->child_reaper == father)"
>
> I don't understand this. If we find a marked task->child_reaper
> _before_ we find a pid_ns->child_reaper in the chain of parents,

This is fine.

OK. I guess I wasn't clear, and I do not know how to explaine better.
Please look at your code ;) Suppose that a sub-namespace init exits.
Not the global /sbin/init. Not the caller of prctl(REAPER).

In this case we should kill the children, not reparent them. Or panic
if it is the global init (see above).

See?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 16:47             ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 16:47 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On Wed, Aug 17, 2011 at 18:20, Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/17, Kay Sievers wrote:

>> No, we want to be the parent of the process,
>> ...
>> The sub-init is the babysitter of all the things it has
>> started, and that should be reflected in the parent child relation.
>
> OK. But could you explain why do we want this? This is not clear from
> the changelog/discussion.

As said, PID1 has the privilege of reaping all processes that
double-fork. Any sub-init wants to do the same for the stuff it
watches. The process that reaps has all the information about the
process as long as wants, it can look up stuff in /proc if needed or
has all the return values of wait(). Async notifications like
taskstats just can not provide what SIGCHLD, /proc and wait() offer.

>> > You should check ->child_reaper only. But see above, it can be multithreaded.
>>
>> The main PID 1 from the system has no ->child_reaper set as far as I
>> see, hence we check for init_task.
>
> No, you don't.

Don't? I don't check, or we don't have it set?

> Once again, if pid_ns->child_reaper exits, you should
> not even try to find the sub-reaper in its parents chain.

That would mean we can never run a sub-init in a pid namespace? Why not?

Or do you mean that we *are* already the pid_ns->child_reaper, not
that one *exists*?

>> >> > Also. You shouldn't do this if the sub-namespace init exits, this is
>> >> > wrong.
>> >>
>> >> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
>> >
>> > Ah, I meant pid_ns->child_reaper, not task->child_reaper.
>> >
>> > If pid_ns->child_reaper exits we should never try to "reparent" its
>> > children, see zap_pid_ns_processes() in particular. IOW, this should
>> > go into the "else" branch of "if (pid_ns->child_reaper == father)"
>>
>> I don't understand this. If we find a marked task->child_reaper
>> _before_ we find a pid_ns->child_reaper in the chain of parents,
>
> This is fine.
>
> OK. I guess I wasn't clear, and I do not know how to explaine better.
> Please look at your code ;) Suppose that a sub-namespace init exits.
> Not the global /sbin/init. Not the caller of prctl(REAPER).
>
> In this case we should kill the children, not reparent them. Or panic
> if it is the global init (see above).
>
> See?

Not sure. You mean that the lookup for a possible task->cild_reaper
should be _before_ the check for pid_ns->child_reaper == father which
is currently below?

Kay

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 16:47             ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 16:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Aug 17, 2011 at 18:20, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 08/17, Kay Sievers wrote:

>> No, we want to be the parent of the process,
>> ...
>> The sub-init is the babysitter of all the things it has
>> started, and that should be reflected in the parent child relation.
>
> OK. But could you explain why do we want this? This is not clear from
> the changelog/discussion.

As said, PID1 has the privilege of reaping all processes that
double-fork. Any sub-init wants to do the same for the stuff it
watches. The process that reaps has all the information about the
process as long as wants, it can look up stuff in /proc if needed or
has all the return values of wait(). Async notifications like
taskstats just can not provide what SIGCHLD, /proc and wait() offer.

>> > You should check ->child_reaper only. But see above, it can be multithreaded.
>>
>> The main PID 1 from the system has no ->child_reaper set as far as I
>> see, hence we check for init_task.
>
> No, you don't.

Don't? I don't check, or we don't have it set?

> Once again, if pid_ns->child_reaper exits, you should
> not even try to find the sub-reaper in its parents chain.

That would mean we can never run a sub-init in a pid namespace? Why not?

Or do you mean that we *are* already the pid_ns->child_reaper, not
that one *exists*?

>> >> > Also. You shouldn't do this if the sub-namespace init exits, this is
>> >> > wrong.
>> >>
>> >> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
>> >
>> > Ah, I meant pid_ns->child_reaper, not task->child_reaper.
>> >
>> > If pid_ns->child_reaper exits we should never try to "reparent" its
>> > children, see zap_pid_ns_processes() in particular. IOW, this should
>> > go into the "else" branch of "if (pid_ns->child_reaper == father)"
>>
>> I don't understand this. If we find a marked task->child_reaper
>> _before_ we find a pid_ns->child_reaper in the chain of parents,
>
> This is fine.
>
> OK. I guess I wasn't clear, and I do not know how to explaine better.
> Please look at your code ;) Suppose that a sub-namespace init exits.
> Not the global /sbin/init. Not the caller of prctl(REAPER).
>
> In this case we should kill the children, not reparent them. Or panic
> if it is the global init (see above).
>
> See?

Not sure. You mean that the lookup for a possible task->cild_reaper
should be _before_ the check for pid_ns->child_reaper == father which
is currently below?

Kay
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 18:57               ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 18:57 UTC (permalink / raw)
  To: Kay Sievers; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 18:20, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 08/17, Kay Sievers wrote:
>
> >> No, we want to be the parent of the process,
> >> ...
> >> The sub-init is the babysitter of all the things it has
> >> started, and that should be reflected in the parent child relation.
> >
> > OK. But could you explain why do we want this? This is not clear from
> > the changelog/discussion.
>
> As said, PID1 has the privilege of reaping all processes that
> double-fork. Any sub-init wants to do the same for the stuff it
> watches. The process that reaps has all the information about the
> process as long as wants, it can look up stuff in /proc if needed or
> has all the return values of wait().

OK.

> Async notifications like
> taskstats just can not provide what SIGCHLD, /proc and wait() offer.

Why not? Async notifications can't delay the reaping, yes. But I am
not sure /proc/ is that useful when the task exits. OK, I won't argue.




> >> > You should check ->child_reaper only. But see above, it can be multithreaded.
> >>
> >> The main PID 1 from the system has no ->child_reaper set as far as I
> >> see, hence we check for init_task.
> >
> > No, you don't.
>
> Don't? I don't check, or we don't have it set?

Argh, sorry. "No, this is not needed", this is what I tried to say.

> > Once again, if pid_ns->child_reaper exits, you should
> > not even try to find the sub-reaper in its parents chain.
>
> That would mean we can never run a sub-init in a pid namespace? Why not?
>
> Or do you mean that we *are* already the pid_ns->child_reaper, not
> that one *exists*?

I already got lost a bit, not sure I understand.
I meant we *are* (the caller) the exiting pid_ns->child_reaper thread.

> >> >> > Also. You shouldn't do this if the sub-namespace init exits, this is
> >> >> > wrong.
> >> >>
> >> >> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
> >> >
> >> > Ah, I meant pid_ns->child_reaper, not task->child_reaper.
> >> >
> >> > If pid_ns->child_reaper exits we should never try to "reparent" its
> >> > children, see zap_pid_ns_processes() in particular. IOW, this should
> >> > go into the "else" branch of "if (pid_ns->child_reaper == father)"
> >>
> >> I don't understand this. If we find a marked task->child_reaper
> >> _before_ we find a pid_ns->child_reaper in the chain of parents,
> >
> > This is fine.
> >
> > OK. I guess I wasn't clear, and I do not know how to explaine better.
> > Please look at your code ;) Suppose that a sub-namespace init exits.
> > Not the global /sbin/init. Not the caller of prctl(REAPER).
> >
> > In this case we should kill the children, not reparent them. Or panic
> > if it is the global init (see above).
> >
> > See?
>
> Not sure. You mean that the lookup for a possible task->cild_reaper
> should be _before_ the check for pid_ns->child_reaper == father which
> is currently below?

Hmm. I am even more confused.

So. I actually applied your patch. The code is

	for (reaper = father->parent;
	     reaper != &init_task && reaper != pid_ns->child_reaper;
	     reaper = reaper->parent)
		if (reaper->child_reaper)
			return reaper;

	if (unlikely(pid_ns->child_reaper == father)) {
		...

The lookup is already _before_ the check for pid_ns->child_reaper == father,
what do you mean?

And, ignoring the mt problems I mentioned, I mean we should do

	if (pid_ns->child_reaper == father) {

		panic_or_kill_this_namespace;	// the current code

	} else {
		for (reaper = father->parent;
		     reaper != pid_ns->child_reaper;
		     reaper = reaper->parent)
			if (reaper->child_reaper)
				return reaper;
	}

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 18:57               ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-17 18:57 UTC (permalink / raw)
  To: Kay Sievers
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/17, Kay Sievers wrote:
>
> On Wed, Aug 17, 2011 at 18:20, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On 08/17, Kay Sievers wrote:
>
> >> No, we want to be the parent of the process,
> >> ...
> >> The sub-init is the babysitter of all the things it has
> >> started, and that should be reflected in the parent child relation.
> >
> > OK. But could you explain why do we want this? This is not clear from
> > the changelog/discussion.
>
> As said, PID1 has the privilege of reaping all processes that
> double-fork. Any sub-init wants to do the same for the stuff it
> watches. The process that reaps has all the information about the
> process as long as wants, it can look up stuff in /proc if needed or
> has all the return values of wait().

OK.

> Async notifications like
> taskstats just can not provide what SIGCHLD, /proc and wait() offer.

Why not? Async notifications can't delay the reaping, yes. But I am
not sure /proc/ is that useful when the task exits. OK, I won't argue.




> >> > You should check ->child_reaper only. But see above, it can be multithreaded.
> >>
> >> The main PID 1 from the system has no ->child_reaper set as far as I
> >> see, hence we check for init_task.
> >
> > No, you don't.
>
> Don't? I don't check, or we don't have it set?

Argh, sorry. "No, this is not needed", this is what I tried to say.

> > Once again, if pid_ns->child_reaper exits, you should
> > not even try to find the sub-reaper in its parents chain.
>
> That would mean we can never run a sub-init in a pid namespace? Why not?
>
> Or do you mean that we *are* already the pid_ns->child_reaper, not
> that one *exists*?

I already got lost a bit, not sure I understand.
I meant we *are* (the caller) the exiting pid_ns->child_reaper thread.

> >> >> > Also. You shouldn't do this if the sub-namespace init exits, this is
> >> >> > wrong.
> >> >>
> >> >> It we find a sub-init, before the namespace PID1, why wouldn't we return it?
> >> >
> >> > Ah, I meant pid_ns->child_reaper, not task->child_reaper.
> >> >
> >> > If pid_ns->child_reaper exits we should never try to "reparent" its
> >> > children, see zap_pid_ns_processes() in particular. IOW, this should
> >> > go into the "else" branch of "if (pid_ns->child_reaper == father)"
> >>
> >> I don't understand this. If we find a marked task->child_reaper
> >> _before_ we find a pid_ns->child_reaper in the chain of parents,
> >
> > This is fine.
> >
> > OK. I guess I wasn't clear, and I do not know how to explaine better.
> > Please look at your code ;) Suppose that a sub-namespace init exits.
> > Not the global /sbin/init. Not the caller of prctl(REAPER).
> >
> > In this case we should kill the children, not reparent them. Or panic
> > if it is the global init (see above).
> >
> > See?
>
> Not sure. You mean that the lookup for a possible task->cild_reaper
> should be _before_ the check for pid_ns->child_reaper == father which
> is currently below?

Hmm. I am even more confused.

So. I actually applied your patch. The code is

	for (reaper = father->parent;
	     reaper != &init_task && reaper != pid_ns->child_reaper;
	     reaper = reaper->parent)
		if (reaper->child_reaper)
			return reaper;

	if (unlikely(pid_ns->child_reaper == father)) {
		...

The lookup is already _before_ the check for pid_ns->child_reaper == father,
what do you mean?

And, ignoring the mt problems I mentioned, I mean we should do

	if (pid_ns->child_reaper == father) {

		panic_or_kill_this_namespace;	// the current code

	} else {
		for (reaper = father->parent;
		     reaper != pid_ns->child_reaper;
		     reaper = reaper->parent)
			if (reaper->child_reaper)
				return reaper;
	}

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 20:56                 ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 20:56 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: akpm, linux-kernel, lennart, linux-man, roland, torvalds

On Wed, Aug 17, 2011 at 20:57, Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/17, Kay Sievers wrote:
>> Async notifications like
>> taskstats just can not provide what SIGCHLD, /proc and wait() offer.
>
> Why not? Async notifications can't delay the reaping, yes. But I am
> not sure /proc/ is that useful when the task exits. OK, I won't argue.

I guess parents usually prefer to pick up their children themselves
that ran away from home, instead of getting calls from somebody else
telling them that they already took care of everything. :)

The most prominent issue with async messages is the re-use of the same
keys/ids they carry, during the window the event happened and the
async message is received. If we reap PIDs ourselves, the PID is
pinned until we take care of it, it can not be another process be
created in the meantime with the same PID.

>> > Once again, if pid_ns->child_reaper exits, you should
>> > not even try to find the sub-reaper in its parents chain.
>>
>> That would mean we can never run a sub-init in a pid namespace? Why not?
>>
>> Or do you mean that we *are* already the pid_ns->child_reaper, not
>> that one *exists*?
>
> I already got lost a bit, not sure I understand.
> I meant we *are* (the caller) the exiting pid_ns->child_reaper thread.

Right, that's what I meant. And what can be avoided if we move the
parent walk below the current check.

>> > In this case we should kill the children, not reparent them. Or panic
>> > if it is the global init (see above).
>> >
>> > See?
>>
>> Not sure. You mean that the lookup for a possible task->cild_reaper
>> should be _before_ the check for pid_ns->child_reaper == father which
>> is currently below?
>
> Hmm. I am even more confused.

Yeah, sorry, it sounds confusing, but I think we arrived now for that
specific part. :)

> So. I actually applied your patch. The code is
>
>        for (reaper = father->parent;
>             reaper != &init_task && reaper != pid_ns->child_reaper;
>             reaper = reaper->parent)
>                if (reaper->child_reaper)
>                        return reaper;
>
>        if (unlikely(pid_ns->child_reaper == father)) {
>                ...
>
> The lookup is already _before_ the check for pid_ns->child_reaper == father,
> what do you mean?

Yeah, it is before, but should be after.

> And, ignoring the mt problems I mentioned, I mean we should do
>
>        if (pid_ns->child_reaper == father) {
>
>                panic_or_kill_this_namespace;   // the current code
>
>        } else {
>                for (reaper = father->parent;
>                     reaper != pid_ns->child_reaper;
>                     reaper = reaper->parent)
>                        if (reaper->child_reaper)
>                                return reaper;
>        }

Yeah, clear now, it looks better to keep it in that order.

Thanks,
Kay

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-17 20:56                 ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-17 20:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Aug 17, 2011 at 20:57, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 08/17, Kay Sievers wrote:
>> Async notifications like
>> taskstats just can not provide what SIGCHLD, /proc and wait() offer.
>
> Why not? Async notifications can't delay the reaping, yes. But I am
> not sure /proc/ is that useful when the task exits. OK, I won't argue.

I guess parents usually prefer to pick up their children themselves
that ran away from home, instead of getting calls from somebody else
telling them that they already took care of everything. :)

The most prominent issue with async messages is the re-use of the same
keys/ids they carry, during the window the event happened and the
async message is received. If we reap PIDs ourselves, the PID is
pinned until we take care of it, it can not be another process be
created in the meantime with the same PID.

>> > Once again, if pid_ns->child_reaper exits, you should
>> > not even try to find the sub-reaper in its parents chain.
>>
>> That would mean we can never run a sub-init in a pid namespace? Why not?
>>
>> Or do you mean that we *are* already the pid_ns->child_reaper, not
>> that one *exists*?
>
> I already got lost a bit, not sure I understand.
> I meant we *are* (the caller) the exiting pid_ns->child_reaper thread.

Right, that's what I meant. And what can be avoided if we move the
parent walk below the current check.

>> > In this case we should kill the children, not reparent them. Or panic
>> > if it is the global init (see above).
>> >
>> > See?
>>
>> Not sure. You mean that the lookup for a possible task->cild_reaper
>> should be _before_ the check for pid_ns->child_reaper == father which
>> is currently below?
>
> Hmm. I am even more confused.

Yeah, sorry, it sounds confusing, but I think we arrived now for that
specific part. :)

> So. I actually applied your patch. The code is
>
>        for (reaper = father->parent;
>             reaper != &init_task && reaper != pid_ns->child_reaper;
>             reaper = reaper->parent)
>                if (reaper->child_reaper)
>                        return reaper;
>
>        if (unlikely(pid_ns->child_reaper == father)) {
>                ...
>
> The lookup is already _before_ the check for pid_ns->child_reaper == father,
> what do you mean?

Yeah, it is before, but should be after.

> And, ignoring the mt problems I mentioned, I mean we should do
>
>        if (pid_ns->child_reaper == father) {
>
>                panic_or_kill_this_namespace;   // the current code
>
>        } else {
>                for (reaper = father->parent;
>                     reaper != pid_ns->child_reaper;
>                     reaper = reaper->parent)
>                        if (reaper->child_reaper)
>                                return reaper;
>        }

Yeah, clear now, it looks better to keep it in that order.

Thanks,
Kay
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 12:43         ` Lennart Poettering
  0 siblings, 0 replies; 63+ messages in thread
From: Lennart Poettering @ 2011-08-18 12:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kay Sievers, akpm, linux-kernel, linux-man, roland, torvalds

On Wed, 17.08.11 15:45, Oleg Nesterov (oleg@redhat.com) wrote:

> You should mark the whole process as sub-reaper, not a single thread
> which does prctl(). The parent/child relationship is process-wide.

Hmm, how would we implement this best? Would it be sufficient to follow
group_leader pointer to set/get the flag, and to follow real_parent
instead of parent when looking for the reaper task (i.e. the one with
the flag set)?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 12:43         ` Lennart Poettering
  0 siblings, 0 replies; 63+ messages in thread
From: Lennart Poettering @ 2011-08-18 12:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kay Sievers, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, 17.08.11 15:45, Oleg Nesterov (oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org) wrote:

> You should mark the whole process as sub-reaper, not a single thread
> which does prctl(). The parent/child relationship is process-wide.

Hmm, how would we implement this best? Would it be sufficient to follow
group_leader pointer to set/get the flag, and to follow real_parent
instead of parent when looking for the reaper task (i.e. the one with
the flag set)?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 14:25           ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-18 14:25 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Kay Sievers, akpm, linux-kernel, linux-man, roland, torvalds

On 08/18, Lennart Poettering wrote:
>
> On Wed, 17.08.11 15:45, Oleg Nesterov (oleg@redhat.com) wrote:
>
> > You should mark the whole process as sub-reaper, not a single thread
> > which does prctl(). The parent/child relationship is process-wide.
>
> Hmm, how would we implement this best? Would it be sufficient to follow
> group_leader pointer to set/get the flag,

You can mark task->group_leader. Or, probably better, task->signal.

INHO, the best option is SIGNAL_SUB_REAPER in signal->flags. But this
is not possible until we cleanup the usage of signal->flags.

> and to follow real_parent

OOPS. I simly can't understand how I managed to miss this. Of course,
in any case you should follow ->real_parent, not ->parent!

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 14:25           ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-18 14:25 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Kay Sievers, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/18, Lennart Poettering wrote:
>
> On Wed, 17.08.11 15:45, Oleg Nesterov (oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org) wrote:
>
> > You should mark the whole process as sub-reaper, not a single thread
> > which does prctl(). The parent/child relationship is process-wide.
>
> Hmm, how would we implement this best? Would it be sufficient to follow
> group_leader pointer to set/get the flag,

You can mark task->group_leader. Or, probably better, task->signal.

INHO, the best option is SIGNAL_SUB_REAPER in signal->flags. But this
is not possible until we cleanup the usage of signal->flags.

> and to follow real_parent

OOPS. I simly can't understand how I managed to miss this. Of course,
in any case you should follow ->real_parent, not ->parent!

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-18 14:25           ` Oleg Nesterov
  (?)
@ 2011-08-18 18:11           ` Kay Sievers
  2011-08-18 18:48               ` Oleg Nesterov
  2011-08-18 21:23               ` Linus Torvalds
  -1 siblings, 2 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-18 18:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On Thu, 2011-08-18 at 16:25 +0200, Oleg Nesterov wrote:
> On 08/18, Lennart Poettering wrote:
> >
> > On Wed, 17.08.11 15:45, Oleg Nesterov (oleg@redhat.com) wrote:
> >
> > > You should mark the whole process as sub-reaper, not a single thread
> > > which does prctl(). The parent/child relationship is process-wide.
> >
> > Hmm, how would we implement this best? Would it be sufficient to follow
> > group_leader pointer to set/get the flag,
> 
> You can mark task->group_leader. Or, probably better, task->signal.
> 
> INHO, the best option is SIGNAL_SUB_REAPER in signal->flags. But this
> is not possible until we cleanup the usage of signal->flags.
> 
> > and to follow real_parent
> 
> OOPS. I simly can't understand how I managed to miss this. Of course,
> in any case you should follow ->real_parent, not ->parent!

How about this? It:
- uses task->real_parent to walk up the chain of parents.

- does not use init_task but the the parent pointer to itself

- moves the flag into task->signal to have it process-wide
  and not per thread

- moves the parent walk after the check for
    pid_ns->child_reaper == father

- makes sure it does not return a PF_EXITING task 

- adds some explanation of SIGCHLD + wait() vs. async events
  like taskstats, to the changelog

- updates the comments for find_new_reaper() 

Thanks a lot,
Kay


From: Lennart Poettering <lennart@poettering.net>
Subject: prctl: add PR_{SET,GET}_CHILD_REAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitely
re-parented to PID 1. The process manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children
is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager
much preferred over any possible asynchronous notification about
specific PIDs, because the service manager has full access to the
child process data in /proc and the PID can not be re-used until
the wait(), the service-manager itself is in charge of, has happended.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and 'ps'
output.

This is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires
the serices to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on on-demand. Both will need init-like capabilities
to be able to properly keep track of the services they start.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |    8 ++++++++
 kernel/exit.c         |   24 +++++++++++++++++++-----
 kernel/sys.c          |    7 +++++++
 4 files changed, 37 insertions(+), 5 deletions(-)

--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_REAPER 35
+#define PR_GET_CHILD_REAPER 36
+
 #endif /* _LINUX_PRCTL_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,14 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/*
+	 * PR_SET_CHILD_REAPER flag which marks a process like a service
+	 * manager to re-parent orphan (double-forking) child processes
+	 * to this process instead of init, so the service manager is
+	 * able to receive SIGCHLD and is resonsible to do the wait().
+	 */
+	unsigned int		child_reaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,11 +689,12 @@ static void exit_mm(struct task_struct *
 }
 
 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a
+ *    member exists
+ * 2. give it to the first anchestor process which prctl'd itself
+ *    as a child_reaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -724,6 +725,19 @@ static struct task_struct *find_new_reap
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else {
+		/* find the first ancestor which is marked as child_reaper */
+		for (thread = father->real_parent;
+		     thread != thread->real_parent;
+		     thread = thread->real_parent) {
+			if (thread == pid_ns->child_reaper)
+				break;
+			if (!thread->signal->child_reaper)
+				continue;
+			if (thread->flags & PF_EXITING)
+				continue;
+			return thread;
+		}
 	}
 
 	return pid_ns->child_reaper;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1799,6 +1799,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_REAPER:
+			me->signal->child_reaper = !!arg2;
+			error = 0;
+			break;
+		case PR_GET_CHILD_REAPER:
+			error = put_user(me->signal->child_reaper, (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-18 18:11           ` Kay Sievers
@ 2011-08-18 18:48               ` Oleg Nesterov
  2011-08-18 21:23               ` Linus Torvalds
  1 sibling, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-18 18:48 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

Hello Kay,

I need to go away, I'll read this patch (and the whole email) tomorrow.

Just a quick note right now,

On 08/18, Kay Sievers wrote:
>
>  static struct task_struct *find_new_reaper(struct task_struct *father)
>  	__releases(&tasklist_lock)
> @@ -724,6 +725,19 @@ static struct task_struct *find_new_reap
>  		 * forget_original_parent() must move them somewhere.
>  		 */
>  		pid_ns->child_reaper = init_pid_ns.child_reaper;
> +	} else {
> +		/* find the first ancestor which is marked as child_reaper */
> +		for (thread = father->real_parent;
> +		     thread != thread->real_parent;
> +		     thread = thread->real_parent) {
> +			if (thread == pid_ns->child_reaper)
> +				break;
> +			if (!thread->signal->child_reaper)
> +				continue;
> +			if (thread->flags & PF_EXITING)
> +				continue;
> +			return thread;

No, this doesn't look right.

This code should do something like

		for (reaper = father->real_parent;
		     !same_thread_group(reaper, pid_ns->child_reaper);
		     reaper = reaper->real_parent) {
			if (!signal->child_reaper)
				continue;

			if (there is a !PF_EXITING thread)
				return thread;
		}

And I forgot to mention, could you please-please rename child_reaper?
Say, is_child_reaper or is_sub_reaper. Or whatever. I do not really
care about the naming. But I use grep very often, and personally I
dislike the task->child_reaper/signal->child_reaper confusion.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 18:48               ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-18 18:48 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hello Kay,

I need to go away, I'll read this patch (and the whole email) tomorrow.

Just a quick note right now,

On 08/18, Kay Sievers wrote:
>
>  static struct task_struct *find_new_reaper(struct task_struct *father)
>  	__releases(&tasklist_lock)
> @@ -724,6 +725,19 @@ static struct task_struct *find_new_reap
>  		 * forget_original_parent() must move them somewhere.
>  		 */
>  		pid_ns->child_reaper = init_pid_ns.child_reaper;
> +	} else {
> +		/* find the first ancestor which is marked as child_reaper */
> +		for (thread = father->real_parent;
> +		     thread != thread->real_parent;
> +		     thread = thread->real_parent) {
> +			if (thread == pid_ns->child_reaper)
> +				break;
> +			if (!thread->signal->child_reaper)
> +				continue;
> +			if (thread->flags & PF_EXITING)
> +				continue;
> +			return thread;

No, this doesn't look right.

This code should do something like

		for (reaper = father->real_parent;
		     !same_thread_group(reaper, pid_ns->child_reaper);
		     reaper = reaper->real_parent) {
			if (!signal->child_reaper)
				continue;

			if (there is a !PF_EXITING thread)
				return thread;
		}

And I forgot to mention, could you please-please rename child_reaper?
Say, is_child_reaper or is_sub_reaper. Or whatever. I do not really
care about the naming. But I use grep very often, and personally I
dislike the task->child_reaper/signal->child_reaper confusion.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-18 18:11           ` Kay Sievers
@ 2011-08-18 21:23               ` Linus Torvalds
  2011-08-18 21:23               ` Linus Torvalds
  1 sibling, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2011-08-18 21:23 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, Lennart Poettering, akpm, linux-kernel, linux-man, roland

On Thu, Aug 18, 2011 at 11:11 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>
> How about this? It:
> - uses task->real_parent to walk up the chain of parents.

If I read this right, it does that for all the normal cases too.
Disgusting. Slowing down the *usual* UNIX case for your new made-up
case that nobody actually uses is not acceptable.

So NAK. We're not slowing down normal code for some new feature that
realistically will never be used by any normal applications.

                   Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 21:23               ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2011-08-18 21:23 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, Lennart Poettering,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ

On Thu, Aug 18, 2011 at 11:11 AM, Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org> wrote:
>
> How about this? It:
> - uses task->real_parent to walk up the chain of parents.

If I read this right, it does that for all the normal cases too.
Disgusting. Slowing down the *usual* UNIX case for your new made-up
case that nobody actually uses is not acceptable.

So NAK. We're not slowing down normal code for some new feature that
realistically will never be used by any normal applications.

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 21:55                 ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-18 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Lennart Poettering, akpm, linux-kernel, linux-man, roland

On Thu, Aug 18, 2011 at 23:23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Thu, Aug 18, 2011 at 11:11 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>>
>> How about this? It:
>> - uses task->real_parent to walk up the chain of parents.
>
> If I read this right, it does that for all the normal cases too.
> Disgusting. Slowing down the *usual* UNIX case for your new made-up
> case that nobody actually uses is not acceptable.

Re-parenting is not a *usual* operation, usual exit()s have a parent and
do not trigger that code. And in most cases of double-forking it is a
one step, which is almost the same than it was before. Only if we skip
multiple parents it will get the check, which is cheap anyway.

> So NAK. We're not slowing down normal code for some new feature that
> realistically will never be used by any normal applications.

It will be used for all new service managers. UNIX is a pain if you want
to watch your children regarding double-forking, if you are not init.

And it will clean up the mess the desktop stuff is putting in 'ps afx'.
Here is the before/after output:

  253 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
  328 ?        S      0:00 /usr/sbin/modem-manager
  608 ?        Sl     0:00 /usr/libexec/colord
  658 ?        Sl     0:00 /usr/libexec/upowerd
  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
  917 ?        S      0:00  \_ udisks-daemon: not polling any devices


  294 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
  449 ?        S      0:00  \_ /usr/sbin/modem-manager
  635 ?        Sl     0:00  \_ /usr/libexec/colord
  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd

It's a pretty nice way for an admin too see what's going on on that box,
even with good old 'ps'.

It will also dramatically simplify the handling of started services from
a service manager which double-fork for only historic reasons and make
proper service supervision a nightmare if you are not PID 1.

Upstart ptraces all started services to accomplish something similar. I
doubt we want to go that road, but if you prefer ... :)

If you want, we can make the forked processes inherit a flag if the
'subreaper' should be looked up at all. Then we have almost zero
overhead if the feature isn't used. You think that's needed?

Thanks,
Kay



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 21:55                 ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-18 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Lennart Poettering,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ

On Thu, Aug 18, 2011 at 23:23, Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> On Thu, Aug 18, 2011 at 11:11 AM, Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org> wrote:
>>
>> How about this? It:
>> - uses task->real_parent to walk up the chain of parents.
>
> If I read this right, it does that for all the normal cases too.
> Disgusting. Slowing down the *usual* UNIX case for your new made-up
> case that nobody actually uses is not acceptable.

Re-parenting is not a *usual* operation, usual exit()s have a parent and
do not trigger that code. And in most cases of double-forking it is a
one step, which is almost the same than it was before. Only if we skip
multiple parents it will get the check, which is cheap anyway.

> So NAK. We're not slowing down normal code for some new feature that
> realistically will never be used by any normal applications.

It will be used for all new service managers. UNIX is a pain if you want
to watch your children regarding double-forking, if you are not init.

And it will clean up the mess the desktop stuff is putting in 'ps afx'.
Here is the before/after output:

  253 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
  328 ?        S      0:00 /usr/sbin/modem-manager
  608 ?        Sl     0:00 /usr/libexec/colord
  658 ?        Sl     0:00 /usr/libexec/upowerd
  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
  917 ?        S      0:00  \_ udisks-daemon: not polling any devices


  294 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
  449 ?        S      0:00  \_ /usr/sbin/modem-manager
  635 ?        Sl     0:00  \_ /usr/libexec/colord
  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd

It's a pretty nice way for an admin too see what's going on on that box,
even with good old 'ps'.

It will also dramatically simplify the handling of started services from
a service manager which double-fork for only historic reasons and make
proper service supervision a nightmare if you are not PID 1.

Upstart ptraces all started services to accomplish something similar. I
doubt we want to go that road, but if you prefer ... :)

If you want, we can make the forked processes inherit a flag if the
'subreaper' should be looked up at all. Then we have almost zero
overhead if the feature isn't used. You think that's needed?

Thanks,
Kay


--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-18 21:55                 ` Kay Sievers
@ 2011-08-18 22:22                   ` Linus Torvalds
  -1 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2011-08-18 22:22 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, Lennart Poettering, akpm, linux-kernel, linux-man, roland

On Thu, Aug 18, 2011 at 2:55 PM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>
>> So NAK. We're not slowing down normal code for some new feature that
>> realistically will never be used by any normal applications.
>
> It will be used for all new service managers. UNIX is a pain if you want
> to watch your children regarding double-forking, if you are not init.

I'll believe that "all new service managers" when I see it. Until
then, a new feature is just that - a new feature. Which nobody uses.

> And it will clean up the mess the desktop stuff is putting in 'ps afx'.
> Here is the before/after output:
>
>  253 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
>  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
>  328 ?        S      0:00 /usr/sbin/modem-manager
>  608 ?        Sl     0:00 /usr/libexec/colord
>  658 ?        Sl     0:00 /usr/libexec/upowerd
>  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
>  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
>  917 ?        S      0:00  \_ udisks-daemon: not polling any devices
>
>
>  294 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
>  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
>  449 ?        S      0:00  \_ /usr/sbin/modem-manager
>  635 ?        Sl     0:00  \_ /usr/libexec/colord
>  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
>  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
>  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
>  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd

Yeah, that looks like a nice feature.

> If you want, we can make the forked processes inherit a flag if the
> 'subreaper' should be looked up at all. Then we have almost zero
> overhead if the feature isn't used. You think that's needed?

Yes, I do. Because with any current system, that "almost zero
overhead" is just totally wasted effort entirely for zero gain. Which
just makes me go "Eww".

             Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-18 22:22                   ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2011-08-18 22:22 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Oleg Nesterov, Lennart Poettering,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ

On Thu, Aug 18, 2011 at 2:55 PM, Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org> wrote:
>
>> So NAK. We're not slowing down normal code for some new feature that
>> realistically will never be used by any normal applications.
>
> It will be used for all new service managers. UNIX is a pain if you want
> to watch your children regarding double-forking, if you are not init.

I'll believe that "all new service managers" when I see it. Until
then, a new feature is just that - a new feature. Which nobody uses.

> And it will clean up the mess the desktop stuff is putting in 'ps afx'.
> Here is the before/after output:
>
>  253 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
>  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
>  328 ?        S      0:00 /usr/sbin/modem-manager
>  608 ?        Sl     0:00 /usr/libexec/colord
>  658 ?        Sl     0:00 /usr/libexec/upowerd
>  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
>  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
>  917 ?        S      0:00  \_ udisks-daemon: not polling any devices
>
>
>  294 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation
>  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
>  449 ?        S      0:00  \_ /usr/sbin/modem-manager
>  635 ?        Sl     0:00  \_ /usr/libexec/colord
>  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
>  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
>  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
>  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd

Yeah, that looks like a nice feature.

> If you want, we can make the forked processes inherit a flag if the
> 'subreaper' should be looked up at all. Then we have almost zero
> overhead if the feature isn't used. You think that's needed?

Yes, I do. Because with any current system, that "almost zero
overhead" is just totally wasted effort entirely for zero gain. Which
just makes me go "Eww".

             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19  0:48                     ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-19  0:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Lennart Poettering, akpm, linux-kernel, linux-man, roland

On Thu, 2011-08-18 at 15:22 -0700, Linus Torvalds wrote:
> On Thu, Aug 18, 2011 at 2:55 PM, Kay Sievers <kay.sievers@vrfy.org> wrote:
> >
> > It will be used for all new service managers. UNIX is a pain if you want
> > to watch your children regarding double-forking, if you are not init.
> 
> I'll believe that "all new service managers" when I see it. Until
> then, a new feature is just that - a new feature. Which nobody uses.

That feature will be used in Fedora 17 if we get that properly working.
Promised!

So far we have a pretty good track record in immediately using the stuff
we add to the kernel. People working on the other side usually complain
that we always require such new kernels. :)

> > If you want, we can make the forked processes inherit a flag if the
> > 'subreaper' should be looked up at all. Then we have almost zero
> > overhead if the feature isn't used. You think that's needed?
> 
> Yes, I do. Because with any current system, that "almost zero
> overhead" is just totally wasted effort entirely for zero gain. Which
> just makes me go "Eww".

Yeah, understood. I'll change it with the next round, and we can see how
to go from there.

Thanks,
Kay


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19  0:48                     ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-19  0:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Lennart Poettering,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ

On Thu, 2011-08-18 at 15:22 -0700, Linus Torvalds wrote:
> On Thu, Aug 18, 2011 at 2:55 PM, Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org> wrote:
> >
> > It will be used for all new service managers. UNIX is a pain if you want
> > to watch your children regarding double-forking, if you are not init.
> 
> I'll believe that "all new service managers" when I see it. Until
> then, a new feature is just that - a new feature. Which nobody uses.

That feature will be used in Fedora 17 if we get that properly working.
Promised!

So far we have a pretty good track record in immediately using the stuff
we add to the kernel. People working on the other side usually complain
that we always require such new kernels. :)

> > If you want, we can make the forked processes inherit a flag if the
> > 'subreaper' should be looked up at all. Then we have almost zero
> > overhead if the feature isn't used. You think that's needed?
> 
> Yes, I do. Because with any current system, that "almost zero
> overhead" is just totally wasted effort entirely for zero gain. Which
> just makes me go "Eww".

Yeah, understood. I'll change it with the next round, and we can see how
to go from there.

Thanks,
Kay

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19  1:31                 ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-19  1:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On Thu, 2011-08-18 at 20:48 +0200, Oleg Nesterov wrote:
> On 08/18, Kay Sievers wrote:

> No, this doesn't look right.
> 
> This code should do something like
> 
> 		for (reaper = father->real_parent;
> 		     !same_thread_group(reaper, pid_ns->child_reaper);

Without that check, bootup immediately hangs. The problem is, I expect,
that we need to exit the loop for re-parenting kernel threads, and
pid_ns->child_reaper seems out of scope in these cases. That's why we
initially had the &init_task check there.

> 			if (there is a !PF_EXITING thread)
> 				return thread;

Added this.

> And I forgot to mention, could you please-please rename child_reaper?
> Say, is_child_reaper or is_sub_reaper. Or whatever. I do not really
> care about the naming. But I use grep very often, and personally I
> dislike the task->child_reaper/signal->child_reaper confusion.

Right, makes sense. Renamed everything to subreaper, which makes it
clear that we still fall back to 'init' in case things go wrong.


version 3:
- rename all child_reaper to child_subreaper to avoid confusion with
  the PID namespace child_reaper variable

- check all possible threads of the reaper process for valid one 

- optimization: let processes inherit a flag to indicate that there is
  a subreaper to lookup, in case they need to be re-parented.

version 2:
- uses task->real_parent to walk up the chain of parents.

- does not use init_task but the the parent pointer to itself

- moves the flag into task->signal to have it process-wide
  and not per thread

- moves the parent walk after the check for
  pid_ns->child_reaper == father

- makes sure it does not return a PF_EXITING task

- adds some explanation of SIGCHLD + wait() vs. async events
  like taskstats, to the changelog

- updates the comments for find_new_reaper()



From: Lennart Poettering <lennart@poettering.net>
Subject: prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitely
re-parented to PID 1. The process manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children
is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager
much preferred over any possible asynchronous notification about
specific PIDs, because the service manager has full access to the
child process data in /proc and the PID can not be re-used until
the wait(), the service-manager itself is in charge of, has happended.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and 'ps'
output.

This is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires
the serices to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on on-demand. Both will need init-like capabilities
to be able to properly keep track of the services they start.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |   12 ++++++++++++
 kernel/exit.c         |   28 +++++++++++++++++++++++-----
 kernel/fork.c         |    3 +++
 kernel/sys.c          |    9 +++++++++
 5 files changed, 50 insertions(+), 5 deletions(-)

--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_SUBREAPER 35
+#define PR_GET_CHILD_SUBREAPER 36
+
 #endif /* _LINUX_PRCTL_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,18 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/*
+	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
+	 * manager, to re-parent orphan (double-forking) child processes
+	 * to this process instead of 'init'. The service manager is
+	 * able to receive SIGCHLD signals and is able to investigate
+	 * the process until it calls wait(). All children of this
+	 * process will inherit a flag if they should look for a
+	 * child_subreaper process at exit.
+	 */
+	unsigned int		is_child_subreaper:1;
+	unsigned int		has_child_subreaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,11 +689,12 @@ static void exit_mm(struct task_struct *
 }
 
 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a
+ *    member exists
+ * 2. give it to the first anchestor process which prctl'd itself
+ *    as a child_subreaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -724,6 +725,23 @@ static struct task_struct *find_new_reap
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else if (father->signal->has_child_subreaper) {
+		struct task_struct *reaper;
+
+		/* find the first ancestor marked as child_subreaper */
+		for (reaper = father->real_parent;
+		     reaper != reaper->real_parent;
+		     reaper = reaper->real_parent) {
+			if (same_thread_group(reaper, pid_ns->child_reaper))
+				break;
+			if (!reaper->signal->is_child_subreaper)
+				continue;
+			thread = reaper;
+			do {
+				if (!(thread->flags & PF_EXITING))
+					return reaper;
+			} while_each_thread(reaper, thread);
+		}
 	}
 
 	return pid_ns->child_reaper;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
+	if (current->signal->has_child_subreaper)
+		sig->has_child_subreaper = true;
+
 	mutex_init(&sig->cred_guard_mutex);
 
 	return 0;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1799,6 +1799,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_SUBREAPER:
+			me->signal->is_child_subreaper = !!arg2;
+			me->signal->has_child_subreaper = true;
+			error = 0;
+			break;
+		case PR_GET_CHILD_SUBREAPER:
+			error = put_user(me->signal->is_child_subreaper,
+					 (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19  1:31                 ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-19  1:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, 2011-08-18 at 20:48 +0200, Oleg Nesterov wrote:
> On 08/18, Kay Sievers wrote:

> No, this doesn't look right.
> 
> This code should do something like
> 
> 		for (reaper = father->real_parent;
> 		     !same_thread_group(reaper, pid_ns->child_reaper);

Without that check, bootup immediately hangs. The problem is, I expect,
that we need to exit the loop for re-parenting kernel threads, and
pid_ns->child_reaper seems out of scope in these cases. That's why we
initially had the &init_task check there.

> 			if (there is a !PF_EXITING thread)
> 				return thread;

Added this.

> And I forgot to mention, could you please-please rename child_reaper?
> Say, is_child_reaper or is_sub_reaper. Or whatever. I do not really
> care about the naming. But I use grep very often, and personally I
> dislike the task->child_reaper/signal->child_reaper confusion.

Right, makes sense. Renamed everything to subreaper, which makes it
clear that we still fall back to 'init' in case things go wrong.


version 3:
- rename all child_reaper to child_subreaper to avoid confusion with
  the PID namespace child_reaper variable

- check all possible threads of the reaper process for valid one 

- optimization: let processes inherit a flag to indicate that there is
  a subreaper to lookup, in case they need to be re-parented.

version 2:
- uses task->real_parent to walk up the chain of parents.

- does not use init_task but the the parent pointer to itself

- moves the flag into task->signal to have it process-wide
  and not per thread

- moves the parent walk after the check for
  pid_ns->child_reaper == father

- makes sure it does not return a PF_EXITING task

- adds some explanation of SIGCHLD + wait() vs. async events
  like taskstats, to the changelog

- updates the comments for find_new_reaper()



From: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
Subject: prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitely
re-parented to PID 1. The process manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children
is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager
much preferred over any possible asynchronous notification about
specific PIDs, because the service manager has full access to the
child process data in /proc and the PID can not be re-used until
the wait(), the service-manager itself is in charge of, has happended.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and 'ps'
output.

This is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires
the serices to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on on-demand. Both will need init-like capabilities
to be able to properly keep track of the services they start.

Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
Signed-off-by: Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |   12 ++++++++++++
 kernel/exit.c         |   28 +++++++++++++++++++++++-----
 kernel/fork.c         |    3 +++
 kernel/sys.c          |    9 +++++++++
 5 files changed, 50 insertions(+), 5 deletions(-)

--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_SUBREAPER 35
+#define PR_GET_CHILD_SUBREAPER 36
+
 #endif /* _LINUX_PRCTL_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,18 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/*
+	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
+	 * manager, to re-parent orphan (double-forking) child processes
+	 * to this process instead of 'init'. The service manager is
+	 * able to receive SIGCHLD signals and is able to investigate
+	 * the process until it calls wait(). All children of this
+	 * process will inherit a flag if they should look for a
+	 * child_subreaper process at exit.
+	 */
+	unsigned int		is_child_subreaper:1;
+	unsigned int		has_child_subreaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,11 +689,12 @@ static void exit_mm(struct task_struct *
 }
 
 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a
+ *    member exists
+ * 2. give it to the first anchestor process which prctl'd itself
+ *    as a child_subreaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -724,6 +725,23 @@ static struct task_struct *find_new_reap
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else if (father->signal->has_child_subreaper) {
+		struct task_struct *reaper;
+
+		/* find the first ancestor marked as child_subreaper */
+		for (reaper = father->real_parent;
+		     reaper != reaper->real_parent;
+		     reaper = reaper->real_parent) {
+			if (same_thread_group(reaper, pid_ns->child_reaper))
+				break;
+			if (!reaper->signal->is_child_subreaper)
+				continue;
+			thread = reaper;
+			do {
+				if (!(thread->flags & PF_EXITING))
+					return reaper;
+			} while_each_thread(reaper, thread);
+		}
 	}
 
 	return pid_ns->child_reaper;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
+	if (current->signal->has_child_subreaper)
+		sig->has_child_subreaper = true;
+
 	mutex_init(&sig->cred_guard_mutex);
 
 	return 0;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1799,6 +1799,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_SUBREAPER:
+			me->signal->is_child_subreaper = !!arg2;
+			me->signal->has_child_subreaper = true;
+			error = 0;
+			break;
+		case PR_GET_CHILD_SUBREAPER:
+			error = put_user(me->signal->is_child_subreaper,
+					 (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;


--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-19  1:31                 ` Kay Sievers
@ 2011-08-19 12:25                   ` Oleg Nesterov
  -1 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-19 12:25 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On 08/19, Kay Sievers wrote:
>
> On Thu, 2011-08-18 at 20:48 +0200, Oleg Nesterov wrote:
> > On 08/18, Kay Sievers wrote:
>
> > No, this doesn't look right.
> >
> > This code should do something like
> >
> > 		for (reaper = father->real_parent;
> > 		     !same_thread_group(reaper, pid_ns->child_reaper);
>
> Without that check, bootup immediately hangs. The problem is, I expect,
> that we need to exit the loop for re-parenting kernel threads,

Argh. Indeed, I forgot about kthreads. See below.

> - optimization: let processes inherit a flag to indicate that there is
>   a subreaper to lookup, in case they need to be re-parented.

I'll write another email about this...

>  static struct task_struct *find_new_reaper(struct task_struct *father)
>  	__releases(&tasklist_lock)
> @@ -724,6 +725,23 @@ static struct task_struct *find_new_reap
>  		 * forget_original_parent() must move them somewhere.
>  		 */
>  		pid_ns->child_reaper = init_pid_ns.child_reaper;
> +	} else if (father->signal->has_child_subreaper) {
> +		struct task_struct *reaper;
> +
> +		/* find the first ancestor marked as child_subreaper */
> +		for (reaper = father->real_parent;
> +		     reaper != reaper->real_parent;

This looks mysterious. This relies on the fact that INIT_TASK(tsk)
sets .real_parent = tsk. "reaper != &init_task" looks much more clean.
And we can't use PF_KTHREAD because of usermodehelper.

But. Now that you check ->has_child_subreaper before the lookup,
this problem should go away? I mean, if ->has_child_subreaper == T
then some of our parents is the userspace task. Even if it was
spawned by kthread and then exited, we can't miss ->child_reaper
in the parents chain.

Or I missed something?


> +			if (!reaper->signal->is_child_subreaper)
> +				continue;
> +			thread = reaper;
> +			do {
> +				if (!(thread->flags & PF_EXITING))
> +					return reaper;
> +			} while_each_thread(reaper, thread);

Yes, this looks correct.

> +		case PR_SET_CHILD_SUBREAPER:
> +			me->signal->is_child_subreaper = !!arg2;
> +			me->signal->has_child_subreaper = true;

Hmm. This looks wrong... why do we set ->has_child_subreaper?

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19 12:25                   ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-19 12:25 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/19, Kay Sievers wrote:
>
> On Thu, 2011-08-18 at 20:48 +0200, Oleg Nesterov wrote:
> > On 08/18, Kay Sievers wrote:
>
> > No, this doesn't look right.
> >
> > This code should do something like
> >
> > 		for (reaper = father->real_parent;
> > 		     !same_thread_group(reaper, pid_ns->child_reaper);
>
> Without that check, bootup immediately hangs. The problem is, I expect,
> that we need to exit the loop for re-parenting kernel threads,

Argh. Indeed, I forgot about kthreads. See below.

> - optimization: let processes inherit a flag to indicate that there is
>   a subreaper to lookup, in case they need to be re-parented.

I'll write another email about this...

>  static struct task_struct *find_new_reaper(struct task_struct *father)
>  	__releases(&tasklist_lock)
> @@ -724,6 +725,23 @@ static struct task_struct *find_new_reap
>  		 * forget_original_parent() must move them somewhere.
>  		 */
>  		pid_ns->child_reaper = init_pid_ns.child_reaper;
> +	} else if (father->signal->has_child_subreaper) {
> +		struct task_struct *reaper;
> +
> +		/* find the first ancestor marked as child_subreaper */
> +		for (reaper = father->real_parent;
> +		     reaper != reaper->real_parent;

This looks mysterious. This relies on the fact that INIT_TASK(tsk)
sets .real_parent = tsk. "reaper != &init_task" looks much more clean.
And we can't use PF_KTHREAD because of usermodehelper.

But. Now that you check ->has_child_subreaper before the lookup,
this problem should go away? I mean, if ->has_child_subreaper == T
then some of our parents is the userspace task. Even if it was
spawned by kthread and then exited, we can't miss ->child_reaper
in the parents chain.

Or I missed something?


> +			if (!reaper->signal->is_child_subreaper)
> +				continue;
> +			thread = reaper;
> +			do {
> +				if (!(thread->flags & PF_EXITING))
> +					return reaper;
> +			} while_each_thread(reaper, thread);

Yes, this looks correct.

> +		case PR_SET_CHILD_SUBREAPER:
> +			me->signal->is_child_subreaper = !!arg2;
> +			me->signal->has_child_subreaper = true;

Hmm. This looks wrong... why do we set ->has_child_subreaper?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19 12:44                     ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-19 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On Fri, Aug 19, 2011 at 14:25, Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/19, Kay Sievers wrote:
>> +             /* find the first ancestor marked as child_subreaper */
>> +             for (reaper = father->real_parent;
>> +                  reaper != reaper->real_parent;
>
> This looks mysterious. This relies on the fact that INIT_TASK(tsk)
> sets .real_parent = tsk. "reaper != &init_task" looks much more clean.
>
> But. Now that you check ->has_child_subreaper before the lookup,
> this problem should go away? I mean, if ->has_child_subreaper == T
> then some of our parents is the userspace task. Even if it was
> spawned by kthread and then exited, we can't miss ->child_reaper
> in the parents chain.

That's right. We only ever look at 'flagged' processes now, and no
other than a userspace process can ever be tagged. We can just walk up
to  ->child_reaper.

> Or I missed something?

I think that's right. I'll test it.

>> +             case PR_SET_CHILD_SUBREAPER:
>> +                     me->signal->is_child_subreaper = !!arg2;
>> +                     me->signal->has_child_subreaper = true;
>
> Hmm. This looks wrong... why do we set ->has_child_subreaper?

That's the flag we pass down to our childs, hence we need to set it here.

Kay

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19 12:44                     ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-19 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Aug 19, 2011 at 14:25, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 08/19, Kay Sievers wrote:
>> +             /* find the first ancestor marked as child_subreaper */
>> +             for (reaper = father->real_parent;
>> +                  reaper != reaper->real_parent;
>
> This looks mysterious. This relies on the fact that INIT_TASK(tsk)
> sets .real_parent = tsk. "reaper != &init_task" looks much more clean.
>
> But. Now that you check ->has_child_subreaper before the lookup,
> this problem should go away? I mean, if ->has_child_subreaper == T
> then some of our parents is the userspace task. Even if it was
> spawned by kthread and then exited, we can't miss ->child_reaper
> in the parents chain.

That's right. We only ever look at 'flagged' processes now, and no
other than a userspace process can ever be tagged. We can just walk up
to  ->child_reaper.

> Or I missed something?

I think that's right. I'll test it.

>> +             case PR_SET_CHILD_SUBREAPER:
>> +                     me->signal->is_child_subreaper = !!arg2;
>> +                     me->signal->has_child_subreaper = true;
>
> Hmm. This looks wrong... why do we set ->has_child_subreaper?

That's the flag we pass down to our childs, hence we need to set it here.

Kay
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19 13:13                       ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-19 13:13 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On 08/19, Kay Sievers wrote:
>
> On Fri, Aug 19, 2011 at 14:25, Oleg Nesterov <oleg@redhat.com> wrote:
>
> >> +             case PR_SET_CHILD_SUBREAPER:
> >> +                     me->signal->is_child_subreaper = !!arg2;
> >> +                     me->signal->has_child_subreaper = true;
> >
> > Hmm. This looks wrong... why do we set ->has_child_subreaper?
>
> That's the flag we pass down to our childs, hence we need to set it here.

Aha, I see. I've misread copy_signal(), it copies ->has_child_subreaper,
_not_ ->is_child_subreaper (as I wrongly thought) from parent. And I was
going to blame this logic in the next email, I already started to write it ;)

But this has other (OK, minor) problems too, afaics. First of all, this
->has_child_subreaper = T is not right when the caller exits. We should
not look for ->is_child_subreaper parent, our children should to find us.

Right?

And. afaics this makes the semantics of prctl(REAPER) a bit unclear...
Suppose a task P does

	C1 = fork();

	prctl(REAPER);

	C2 = fork();

In this case it "owns" the children of C2, but not C1. This is fine, and
perhaps this is even better.

But what if P->parent did prctl(REAPER) too? Then P becomes the sub-reaper
for the tasks which were forked before prctl().

In short, in general the caller of prctl(REAPER) doesn't know how this
can affect the forks in the past.

Again, again, I am not arguing. Just I think we should discuss everything
if we are going to add the new feature.



Finally. I am not sure this is really better, but it seems we can
can ->has_child_subreape "more correct" with the same effect.

	- prctl(PR_SET_CHILD_SUBREAPER):

		me->is_child_subreaper = !!arg2;
		// ->has_child_subreaper is not set

	- copy_signal():

		me->has_child_subreaper =
			parent->has_child_subreaper ||
			parent->is_child_subreaper;

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19 13:13                       ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-19 13:13 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/19, Kay Sievers wrote:
>
> On Fri, Aug 19, 2011 at 14:25, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> >> +             case PR_SET_CHILD_SUBREAPER:
> >> +                     me->signal->is_child_subreaper = !!arg2;
> >> +                     me->signal->has_child_subreaper = true;
> >
> > Hmm. This looks wrong... why do we set ->has_child_subreaper?
>
> That's the flag we pass down to our childs, hence we need to set it here.

Aha, I see. I've misread copy_signal(), it copies ->has_child_subreaper,
_not_ ->is_child_subreaper (as I wrongly thought) from parent. And I was
going to blame this logic in the next email, I already started to write it ;)

But this has other (OK, minor) problems too, afaics. First of all, this
->has_child_subreaper = T is not right when the caller exits. We should
not look for ->is_child_subreaper parent, our children should to find us.

Right?

And. afaics this makes the semantics of prctl(REAPER) a bit unclear...
Suppose a task P does

	C1 = fork();

	prctl(REAPER);

	C2 = fork();

In this case it "owns" the children of C2, but not C1. This is fine, and
perhaps this is even better.

But what if P->parent did prctl(REAPER) too? Then P becomes the sub-reaper
for the tasks which were forked before prctl().

In short, in general the caller of prctl(REAPER) doesn't know how this
can affect the forks in the past.

Again, again, I am not arguing. Just I think we should discuss everything
if we are going to add the new feature.



Finally. I am not sure this is really better, but it seems we can
can ->has_child_subreape "more correct" with the same effect.

	- prctl(PR_SET_CHILD_SUBREAPER):

		me->is_child_subreaper = !!arg2;
		// ->has_child_subreaper is not set

	- copy_signal():

		me->has_child_subreaper =
			parent->has_child_subreaper ||
			parent->is_child_subreaper;

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-19 13:13                       ` Oleg Nesterov
  (?)
@ 2011-08-19 14:20                       ` Kay Sievers
  2011-08-19 14:58                           ` Oleg Nesterov
  -1 siblings, 1 reply; 63+ messages in thread
From: Kay Sievers @ 2011-08-19 14:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On Fri, 2011-08-19 at 15:13 +0200, Oleg Nesterov wrote:
> On 08/19, Kay Sievers wrote:

> But this has other (OK, minor) problems too, afaics. First of all, this
> ->has_child_subreaper = T is not right when the caller exits. We should
> not look for ->is_child_subreaper parent, our children should to find us.
> 
> Right?

The flags are still inherited. If the subreaper isn't there anymore, we
will look it up and not find it. I think that's ok for the very unusual
case that a service manager goes away and its children still run. We
will just fall back to the usual PID 1 logic.

> And. afaics this makes the semantics of prctl(REAPER) a bit unclear...
> Suppose a task P does
> 
> 	C1 = fork();
> 
> 	prctl(REAPER);
> 
> 	C2 = fork();
> 
> In this case it "owns" the children of C2, but not C1. This is fine, and
> perhaps this is even better.

Yeah, thought about that too. I think it is what we want. 

> But what if P->parent did prctl(REAPER) too? Then P becomes the sub-reaper
> for the tasks which were forked before prctl().
> 
> In short, in general the caller of prctl(REAPER) doesn't know how this
> can affect the forks in the past.
> 
> Again, again, I am not arguing. Just I think we should discuss everything
> if we are going to add the new feature.

I agree. I'll add a few lines of comment to explain this when we have a
version that looks fine. I'll need to update the man page for the prctl
anyway. It's probably the best place to explain the use of this flag.

> Finally. I am not sure this is really better, but it seems we can
> can ->has_child_subreape "more correct" with the same effect.
> 		me->has_child_subreaper =
> 			parent->has_child_subreaper ||
> 			parent->is_child_subreaper;

Nice idea! Looks so much better, yes.

Thanks a lot again,
Kay


version 4:
- as we check from within 'flagged' processes only, we can
  avoid the &init_task or point-to-ourselves check

- set parent->has_child_subreaper only in child processes
  so prctl(PR_SET_CHILD_SUBREAPER) toggle is cleaner
  regarding already existing and future childs

version 3:
- rename all child_reaper to child_subreaper to avoid confusion with
 the PID namespace child_reaper variable

- check all possible threads of the reaper process for valid one

- optimization: let processes inherit a flag to indicate that there is
  a subreaper to lookup, in case they need to be re-parented.

version 2:
- uses task->real_parent to walk up the chain of parents.

- does not use init_task but the the parent pointer to itself

- moves the flag into task->signal to have it process-wide
  and not per thread

- moves the parent walk after the check for
  pid_ns->child_reaper == father

- makes sure it does not return a PF_EXITING task

- adds some explanation of SIGCHLD + wait() vs. async events
  like taskstats, to the changelog

- updates the comments for find_new_reaper()


From: Lennart Poettering <lennart@poettering.net>
Subject: prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitely
re-parented to PID 1. The process manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children
is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager
much preferred over any possible asynchronous notification about
specific PIDs, because the service manager has full access to the
child process data in /proc and the PID can not be re-used until
the wait(), the service-manager itself is in charge of, has happended.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and 'ps'
output.

This is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires
the serices to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on on-demand. Both will need init-like capabilities
to be able to properly keep track of the services they start.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |   12 ++++++++++++
 kernel/exit.c         |   26 +++++++++++++++++++++-----
 kernel/fork.c         |    3 +++
 kernel/sys.c          |    8 ++++++++
 5 files changed, 47 insertions(+), 5 deletions(-)

--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_SUBREAPER 35
+#define PR_GET_CHILD_SUBREAPER 36
+
 #endif /* _LINUX_PRCTL_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,18 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/*
+	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
+	 * manager, to re-parent orphan (double-forking) child processes
+	 * to this process instead of 'init'. The service manager is
+	 * able to receive SIGCHLD signals and is able to investigate
+	 * the process until it calls wait(). All children of this
+	 * process will inherit a flag if they should look for a
+	 * child_subreaper process at exit.
+	 */
+	unsigned int		is_child_subreaper:1;
+	unsigned int		has_child_subreaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,11 +689,12 @@ static void exit_mm(struct task_struct *
 }
 
 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a
+ *    member exists
+ * 2. give it to the first anchestor process which prctl'd itself
+ *    as a child_subreaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -724,6 +725,21 @@ static struct task_struct *find_new_reap
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else if (father->signal->has_child_subreaper) {
+		struct task_struct *reaper;
+
+		/* find the first ancestor marked as child_subreaper */
+		for (reaper = father->real_parent;
+		     !same_thread_group(reaper, pid_ns->child_reaper);
+		     reaper = reaper->real_parent) {
+			if (!reaper->signal->is_child_subreaper)
+				continue;
+			thread = reaper;
+			do {
+				if (!(thread->flags & PF_EXITING))
+					return reaper;
+			} while_each_thread(reaper, thread);
+		}
 	}
 
 	return pid_ns->child_reaper;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
+	sig->has_child_subreaper = current->signal->has_child_subreaper ||
+				   current->signal->is_child_subreaper;
+
 	mutex_init(&sig->cred_guard_mutex);
 
 	return 0;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1799,6 +1799,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_SUBREAPER:
+			me->signal->is_child_subreaper = !!arg2;
+			error = 0;
+			break;
+		case PR_GET_CHILD_SUBREAPER:
+			error = put_user(me->signal->is_child_subreaper,
+					 (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-19 14:20                       ` Kay Sievers
@ 2011-08-19 14:58                           ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-19 14:58 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On 08/19, Kay Sievers wrote:
>
> On Fri, 2011-08-19 at 15:13 +0200, Oleg Nesterov wrote:
> > On 08/19, Kay Sievers wrote:
>
> > But this has other (OK, minor) problems too, afaics. First of all, this
> > ->has_child_subreaper = T is not right when the caller exits. We should
> > not look for ->is_child_subreaper parent, our children should to find us.
> >
> > Right?
>
> The flags are still inherited. If the subreaper isn't there anymore, we
> will look it up and not find it. I think that's ok for the very unusual
> case that a service manager goes away and its children still run. We
> will just fall back to the usual PID 1 logic.

Yes, yes, I didn't mean this is really buggy. Just a bit "strange".
Anyway, this is fixed in v4 you sent.

> version 4:

Looks correct... But I don't trust myself. Especially after I missed
the problem with klthreads ;)

I'll try to re-read this patch carefully tomorrow with a fresh head.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-19 14:58                           ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-19 14:58 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/19, Kay Sievers wrote:
>
> On Fri, 2011-08-19 at 15:13 +0200, Oleg Nesterov wrote:
> > On 08/19, Kay Sievers wrote:
>
> > But this has other (OK, minor) problems too, afaics. First of all, this
> > ->has_child_subreaper = T is not right when the caller exits. We should
> > not look for ->is_child_subreaper parent, our children should to find us.
> >
> > Right?
>
> The flags are still inherited. If the subreaper isn't there anymore, we
> will look it up and not find it. I think that's ok for the very unusual
> case that a service manager goes away and its children still run. We
> will just fall back to the usual PID 1 logic.

Yes, yes, I didn't mean this is really buggy. Just a bit "strange".
Anyway, this is fixed in v4 you sent.

> version 4:

Looks correct... But I don't trust myself. Especially after I missed
the problem with klthreads ;)

I'll try to re-read this patch carefully tomorrow with a fresh head.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-20 15:33                             ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-20 15:33 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On 08/19, Oleg Nesterov wrote:
>
> On 08/19, Kay Sievers wrote:
> >
> > version 4:
>
> Looks correct... But I don't trust myself. Especially after I missed
> the problem with klthreads ;)

And I missed another problem. So. We do need to check that
father != init_task even if we check has_child_subreaper.



Suppose that a kernel thread execs the usermode task T which
does prctl(REAPER). Suppose that its grandchild C exits, it
should be reparented to T.

If T is alive - everything fine, the lookup finds T with
->is_sub_reaper set.

If T has exited - everything fine again, C->parent was already
reparented to pid_ns->child_reaper (or another sub-reaper).

But! If T exits, there is a window between setting PF_EXITING
and forget_original_parent() which should re-parent C->parent.
If C exits in this window, it will see PF_EXITING and continue
the lookup, but it will never reach pid_ns->child_reaper.

(if we could check ->exit_state instead of PF_EXITING, everything
 would be fine).





And cough... there is another, not that subtle problem ;) That
task T can _clear_ ->is_child_subreaper after forking the child.
But since this obviously can't clear C->has_child_subreaper, we
can't trust it.




So. please add this check back. I insisted you should remove it,
but I was wrong. Otherwise looks correct.





Damn. And why do we check PF_EXITING but not exit_state? this is
because we have to drop tasklist for exit_task_namespaces(), see
762a24beed3f3ab93224bd447710e6c36fcf1968. However, there were a
lot of changes since then. Afaics we can change do_notify_parent()
to use task_active_pid_ns(tsk->parent) and then we can call
exit_task_namespaces() before exit_notify(). In this case we can
change exit_notify/forget_original_parent to reparent and set
exit_state under tasklist, this also saves unlock+lock.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-20 15:33                             ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-20 15:33 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/19, Oleg Nesterov wrote:
>
> On 08/19, Kay Sievers wrote:
> >
> > version 4:
>
> Looks correct... But I don't trust myself. Especially after I missed
> the problem with klthreads ;)

And I missed another problem. So. We do need to check that
father != init_task even if we check has_child_subreaper.



Suppose that a kernel thread execs the usermode task T which
does prctl(REAPER). Suppose that its grandchild C exits, it
should be reparented to T.

If T is alive - everything fine, the lookup finds T with
->is_sub_reaper set.

If T has exited - everything fine again, C->parent was already
reparented to pid_ns->child_reaper (or another sub-reaper).

But! If T exits, there is a window between setting PF_EXITING
and forget_original_parent() which should re-parent C->parent.
If C exits in this window, it will see PF_EXITING and continue
the lookup, but it will never reach pid_ns->child_reaper.

(if we could check ->exit_state instead of PF_EXITING, everything
 would be fine).





And cough... there is another, not that subtle problem ;) That
task T can _clear_ ->is_child_subreaper after forking the child.
But since this obviously can't clear C->has_child_subreaper, we
can't trust it.




So. please add this check back. I insisted you should remove it,
but I was wrong. Otherwise looks correct.





Damn. And why do we check PF_EXITING but not exit_state? this is
because we have to drop tasklist for exit_task_namespaces(), see
762a24beed3f3ab93224bd447710e6c36fcf1968. However, there were a
lot of changes since then. Afaics we can change do_notify_parent()
to use task_active_pid_ns(tsk->parent) and then we can call
exit_task_namespaces() before exit_notify(). In this case we can
change exit_notify/forget_original_parent to reparent and set
exit_state under tasklist, this also saves unlock+lock.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-20 15:33                             ` Oleg Nesterov
  (?)
@ 2011-08-21 18:33                             ` Kay Sievers
  2011-08-22 11:14                                 ` Oleg Nesterov
  -1 siblings, 1 reply; 63+ messages in thread
From: Kay Sievers @ 2011-08-21 18:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On Sat, 2011-08-20 at 17:33 +0200, Oleg Nesterov wrote:
> On 08/19, Oleg Nesterov wrote:

> And I missed another problem. So. We do need to check that
> father != init_task even if we check has_child_subreaper.

> Suppose that a kernel thread execs the usermode task T which
> does prctl(REAPER). Suppose that its grandchild C exits, it
> should be reparented to T.
> 
> If T is alive - everything fine, the lookup finds T with
> ->is_sub_reaper set.
> 
> If T has exited - everything fine again, C->parent was already
> reparented to pid_ns->child_reaper (or another sub-reaper).
> 
> But! If T exits, there is a window between setting PF_EXITING
> and forget_original_parent() which should re-parent C->parent.
> If C exits in this window, it will see PF_EXITING and continue
> the lookup, but it will never reach pid_ns->child_reaper.
> 
> (if we could check ->exit_state instead of PF_EXITING, everything
>  would be fine).
> 
> And cough... there is another, not that subtle problem ;) That
> task T can _clear_ ->is_child_subreaper after forking the child.
> But since this obviously can't clear C->has_child_subreaper, we
> can't trust it.
> 
> So. please add this check back. I insisted you should remove it,
> but I was wrong. Otherwise looks correct.

Next version. Thanks a lot for help.

Thanks,
Kay


version 5:
- add back check for &init_task

version 4:
- as we check from within 'flagged' processes only, we can
 avoid the &init_task or point-to-ourselves check

- set parent->has_child_subreaper only in child processes
 so prctl(PR_SET_CHILD_SUBREAPER) toggle is cleaner
 regarding already existing and future childs

version 3:
- rename all child_reaper to child_subreaper to avoid confusion with
 the PID namespace child_reaper variable

- check all possible threads of the reaper process for valid one

- optimization: let processes inherit a flag to indicate that there is
 a subreaper to lookup, in case they need to be re-parented.

version 2:
- uses task->real_parent to walk up the chain of parents.

- does not use init_task but the the parent pointer to itself

- moves the flag into task->signal to have it process-wide
  and not per thread

- moves the parent walk after the check for
  pid_ns->child_reaper == father

- makes sure it does not return a PF_EXITING task

- adds some explanation of SIGCHLD + wait() vs. async events
  like taskstats, to the changelog

- updates the comments for find_new_reaper()


From: Lennart Poettering <lennart@poettering.net>
Subject: prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitely
re-parented to PID 1. The process manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children
is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager
much preferred over any possible asynchronous notification about
specific PIDs, because the service manager has full access to the
child process data in /proc and the PID can not be re-used until
the wait(), the service-manager itself is in charge of, has happended.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and 'ps'
output.

This is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires
the serices to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on on-demand. Both will need init-like capabilities
to be able to properly keep track of the services they start.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |   12 ++++++++++++
 kernel/exit.c         |   28 +++++++++++++++++++++++-----
 kernel/fork.c         |    3 +++
 kernel/sys.c          |    8 ++++++++
 5 files changed, 49 insertions(+), 5 deletions(-)

--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_SUBREAPER 35
+#define PR_GET_CHILD_SUBREAPER 36
+
 #endif /* _LINUX_PRCTL_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,18 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/*
+	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
+	 * manager, to re-parent orphan (double-forking) child processes
+	 * to this process instead of 'init'. The service manager is
+	 * able to receive SIGCHLD signals and is able to investigate
+	 * the process until it calls wait(). All children of this
+	 * process will inherit a flag if they should look for a
+	 * child_subreaper process at exit.
+	 */
+	unsigned int		is_child_subreaper:1;
+	unsigned int		has_child_subreaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,11 +689,12 @@ static void exit_mm(struct task_struct *
 }
 
 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a
+ *    member exists
+ * 2. give it to the first anchestor process which prctl'd itself
+ *    as a child_subreaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -724,6 +725,23 @@ static struct task_struct *find_new_reap
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else if (father->signal->has_child_subreaper) {
+		struct task_struct *reaper;
+
+		/* find the first ancestor marked as child_subreaper */
+		for (reaper = father->real_parent;
+		     reaper != &init_task;
+		     reaper = reaper->real_parent) {
+			if (same_thread_group(reaper, pid_ns->child_reaper))
+				break;
+			if (!reaper->signal->is_child_subreaper)
+				continue;
+			thread = reaper;
+			do {
+				if (!(thread->flags & PF_EXITING))
+					return reaper;
+			} while_each_thread(reaper, thread);
+		}
 	}
 
 	return pid_ns->child_reaper;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
+	sig->has_child_subreaper = current->signal->has_child_subreaper ||
+				   current->signal->is_child_subreaper;
+
 	mutex_init(&sig->cred_guard_mutex);
 
 	return 0;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1799,6 +1799,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_SUBREAPER:
+			me->signal->is_child_subreaper = !!arg2;
+			error = 0;
+			break;
+		case PR_GET_CHILD_SUBREAPER:
+			error = put_user(me->signal->is_child_subreaper,
+					 (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
  2011-08-21 18:33                             ` Kay Sievers
@ 2011-08-22 11:14                                 ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-22 11:14 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On 08/21, Kay Sievers wrote:
>
> version 5:
> - add back check for &init_task

Reviewed-by: Oleg Nesterov <oleg@redhat.com>





Just one note, to avoid the "doesn't look correct" questions.

> @@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
>  	sig->oom_score_adj = current->signal->oom_score_adj;
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>
> +	sig->has_child_subreaper = current->signal->has_child_subreaper ||
> +				   current->signal->is_child_subreaper;

This assumes that current == parent. But this is not true with
CLONE_PARENT. If a task does prctl(REAPER) + clone(CLONE_PARENT)
->has_child_subreaper == T is not exactly right.

But I think this is fine and we do not care, a "wrong"
->has_child_subreaper is harmless and it can't be 100% correct in
any case.

I believe the patch is correct.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-22 11:14                                 ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2011-08-22 11:14 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On 08/21, Kay Sievers wrote:
>
> version 5:
> - add back check for &init_task

Reviewed-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>





Just one note, to avoid the "doesn't look correct" questions.

> @@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
>  	sig->oom_score_adj = current->signal->oom_score_adj;
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>
> +	sig->has_child_subreaper = current->signal->has_child_subreaper ||
> +				   current->signal->is_child_subreaper;

This assumes that current == parent. But this is not true with
CLONE_PARENT. If a task does prctl(REAPER) + clone(CLONE_PARENT)
->has_child_subreaper == T is not exactly right.

But I think this is fine and we do not care, a "wrong"
->has_child_subreaper is harmless and it can't be 100% correct in
any case.

I believe the patch is correct.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-22 23:48                                   ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-22 23:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm, linux-kernel, linux-man, roland, torvalds

On Mon, Aug 22, 2011 at 13:14, Oleg Nesterov <oleg@redhat.com> wrote:

> Reviewed-by: Oleg Nesterov <oleg@redhat.com>

Thanks Oleg for all the help.

Final version with updated changelog.

Linus' concern about the added overhead in the not-used case should be
addressed with the addition of a flag, which we inherit and skip all the
parent search, if none of our ancestors marked itself as a
CHILD_SUBREAPER.

Andrew, mind picking this up again? 

Thanks,
Kay


From: Lennart Poettering <lennart@poettering.net>
Subject: prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitely
re-parented to PID 1. The service manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children
is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager
much preferred over any possible asynchronous notification about
specific PIDs, because the service manager has full access to the
child process data in /proc and the PID can not be re-used until
the wait(), the service-manager itself is in charge of, has happended.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and 'ps'
output:

before:
  # ps afx
  253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
  328 ?        S      0:00 /usr/sbin/modem-manager
  608 ?        Sl     0:00 /usr/libexec/colord
  658 ?        Sl     0:00 /usr/libexec/upowerd
  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
  917 ?        S      0:00  \_ udisks-daemon: not polling any devices

after:
  # ps afx
  294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
  449 ?        S      0:00  \_ /usr/sbin/modem-manager
  635 ?        Sl     0:00  \_ /usr/libexec/colord
  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd

This prctl is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires
the services to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on-demand. Both need init-like capabilities
to be able to properly keep track of the services they start.

Many thanks to Oleg for several rounds of review and insights.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |   12 ++++++++++++
 kernel/exit.c         |   28 +++++++++++++++++++++++-----
 kernel/fork.c         |    3 +++
 kernel/sys.c          |    8 ++++++++
 5 files changed, 49 insertions(+), 5 deletions(-)

--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_SUBREAPER 35
+#define PR_GET_CHILD_SUBREAPER 36
+
 #endif /* _LINUX_PRCTL_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,18 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/*
+	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
+	 * manager, to re-parent orphan (double-forking) child processes
+	 * to this process instead of 'init'. The service manager is
+	 * able to receive SIGCHLD signals and is able to investigate
+	 * the process until it calls wait(). All children of this
+	 * process will inherit a flag if they should look for a
+	 * child_subreaper process at exit.
+	 */
+	unsigned int		is_child_subreaper:1;
+	unsigned int		has_child_subreaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,11 +689,12 @@ static void exit_mm(struct task_struct *
 }
 
 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a
+ *    member exists
+ * 2. give it to the first anchestor process which prctl'd itself
+ *    as a child_subreaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -724,6 +725,23 @@ static struct task_struct *find_new_reap
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else if (father->signal->has_child_subreaper) {
+		struct task_struct *reaper;
+
+		/* find the first ancestor marked as child_subreaper */
+		for (reaper = father->real_parent;
+		     reaper != &init_task;
+		     reaper = reaper->real_parent) {
+			if (same_thread_group(reaper, pid_ns->child_reaper))
+				break;
+			if (!reaper->signal->is_child_subreaper)
+				continue;
+			thread = reaper;
+			do {
+				if (!(thread->flags & PF_EXITING))
+					return reaper;
+			} while_each_thread(reaper, thread);
+		}
 	}
 
 	return pid_ns->child_reaper;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
+	sig->has_child_subreaper = current->signal->has_child_subreaper ||
+				   current->signal->is_child_subreaper;
+
 	mutex_init(&sig->cred_guard_mutex);
 
 	return 0;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1799,6 +1799,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_SUBREAPER:
+			me->signal->is_child_subreaper = !!arg2;
+			error = 0;
+			break;
+		case PR_GET_CHILD_SUBREAPER:
+			error = put_user(me->signal->is_child_subreaper,
+					 (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-22 23:48                                   ` Kay Sievers
  0 siblings, 0 replies; 63+ messages in thread
From: Kay Sievers @ 2011-08-22 23:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lennart Poettering, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Aug 22, 2011 at 13:14, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> Reviewed-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Thanks Oleg for all the help.

Final version with updated changelog.

Linus' concern about the added overhead in the not-used case should be
addressed with the addition of a flag, which we inherit and skip all the
parent search, if none of our ancestors marked itself as a
CHILD_SUBREAPER.

Andrew, mind picking this up again? 

Thanks,
Kay


From: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
Subject: prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started
services. Many services daemonize by double-forking and get implicitely
re-parented to PID 1. The service manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait(). All information about the children
is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services. All SIGCHLD signals will be delivered
to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager
much preferred over any possible asynchronous notification about
specific PIDs, because the service manager has full access to the
child process data in /proc and the PID can not be re-used until
the wait(), the service-manager itself is in charge of, has happended.

As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and 'ps'
output:

before:
  # ps afx
  253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
  328 ?        S      0:00 /usr/sbin/modem-manager
  608 ?        Sl     0:00 /usr/libexec/colord
  658 ?        Sl     0:00 /usr/libexec/upowerd
  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
  917 ?        S      0:00  \_ udisks-daemon: not polling any devices

after:
  # ps afx
  294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
  449 ?        S      0:00  \_ /usr/sbin/modem-manager
  635 ?        Sl     0:00  \_ /usr/libexec/colord
  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd

This prctl is orthogonal to PID namespaces. PID namespaces are isolated
from each other, while a service management process usually requires
the services to live in the same namespace, to be able to talk to each
other.

Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on-demand. Both need init-like capabilities
to be able to properly keep track of the services they start.

Many thanks to Oleg for several rounds of review and insights.

Reviewed-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>
Signed-off-by: Kay Sievers <kay.sievers-tD+1rO4QERM@public.gmane.org>
---

 include/linux/prctl.h |    3 +++
 include/linux/sched.h |   12 ++++++++++++
 kernel/exit.c         |   28 +++++++++++++++++++++++-----
 kernel/fork.c         |    3 +++
 kernel/sys.c          |    8 ++++++++
 5 files changed, 49 insertions(+), 5 deletions(-)

--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_SET_CHILD_SUBREAPER 35
+#define PR_GET_CHILD_SUBREAPER 36
+
 #endif /* _LINUX_PRCTL_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,18 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/*
+	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
+	 * manager, to re-parent orphan (double-forking) child processes
+	 * to this process instead of 'init'. The service manager is
+	 * able to receive SIGCHLD signals and is able to investigate
+	 * the process until it calls wait(). All children of this
+	 * process will inherit a flag if they should look for a
+	 * child_subreaper process at exit.
+	 */
+	unsigned int		is_child_subreaper:1;
+	unsigned int		has_child_subreaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
 
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,11 +689,12 @@ static void exit_mm(struct task_struct *
 }
 
 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a
+ *    member exists
+ * 2. give it to the first anchestor process which prctl'd itself
+ *    as a child_subreaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -724,6 +725,23 @@ static struct task_struct *find_new_reap
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else if (father->signal->has_child_subreaper) {
+		struct task_struct *reaper;
+
+		/* find the first ancestor marked as child_subreaper */
+		for (reaper = father->real_parent;
+		     reaper != &init_task;
+		     reaper = reaper->real_parent) {
+			if (same_thread_group(reaper, pid_ns->child_reaper))
+				break;
+			if (!reaper->signal->is_child_subreaper)
+				continue;
+			thread = reaper;
+			do {
+				if (!(thread->flags & PF_EXITING))
+					return reaper;
+			} while_each_thread(reaper, thread);
+		}
 	}
 
 	return pid_ns->child_reaper;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static int copy_signal(unsigned long clo
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
+	sig->has_child_subreaper = current->signal->has_child_subreaper ||
+				   current->signal->is_child_subreaper;
+
 	mutex_init(&sig->cred_guard_mutex);
 
 	return 0;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1799,6 +1799,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_CHILD_SUBREAPER:
+			me->signal->is_child_subreaper = !!arg2;
+			error = 0;
+			break;
+		case PR_GET_CHILD_SUBREAPER:
+			error = put_user(me->signal->is_child_subreaper,
+					 (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;



--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-23  0:30           ` Colin Walters
  0 siblings, 0 replies; 63+ messages in thread
From: Colin Walters @ 2011-08-23  0:30 UTC (permalink / raw)
  To: Alan Cox
  Cc: Kay Sievers, Oleg Nesterov, akpm, linux-kernel, lennart,
	linux-man, roland, torvalds

On Wed, Aug 17, 2011 at 9:37 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> Now I can see why you want to know when processes exit and do it without
> tampering with the process, but it seems to me that's simply a question
> of us lacking a way to do this nicely, whether inotify/dnotify/etc
> on /proc, some kind of 'also signal me' property or some kind of process
> event interface.

Yes please!  The fact that in Unix one can only get notifications of a
process exiting when one is the parent is just lame, and SIGCHLD (like
all signals) sucks.  Lennart was tossing around a "childfd" namespace
proposal...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch added to -mm tree
@ 2011-08-23  0:30           ` Colin Walters
  0 siblings, 0 replies; 63+ messages in thread
From: Colin Walters @ 2011-08-23  0:30 UTC (permalink / raw)
  To: Alan Cox
  Cc: Kay Sievers, Oleg Nesterov,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	lennart-mdGvqq1h2p+GdvJs77BJ7Q, linux-man-u79uwXL29TY76Z2rM5mHXA,
	roland-/Z5OmTQCD9xF6kxbq+BtvQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Aug 17, 2011 at 9:37 AM, Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> wrote:
>
> Now I can see why you want to know when processes exit and do it without
> tampering with the process, but it seems to me that's simply a question
> of us lacking a way to do this nicely, whether inotify/dnotify/etc
> on /proc, some kind of 'also signal me' property or some kind of process
> event interface.

Yes please!  The fact that in Unix one can only get notifications of a
process exiting when one is the parent is just lame, and SIGCHLD (like
all signals) sucks.  Lennart was tossing around a "childfd" namespace
proposal...
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2011-08-23  0:30 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-16 20:11 + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision.patch added to -mm tree akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
2011-08-17 11:55 ` + prctl-add-pr_setget_child_reaper-to-allow-simple-process-supervision .patch " Oleg Nesterov
2011-08-17 11:55   ` Oleg Nesterov
2011-08-17 13:05   ` Oleg Nesterov
2011-08-17 13:05     ` Oleg Nesterov
2011-08-17 13:21     ` Kay Sievers
2011-08-17 13:21       ` Kay Sievers
2011-08-17 13:37       ` Alan Cox
2011-08-17 13:37         ` Alan Cox
2011-08-23  0:30         ` Colin Walters
2011-08-23  0:30           ` Colin Walters
2011-08-17 14:16       ` Oleg Nesterov
2011-08-17 14:16         ` Oleg Nesterov
2011-08-17 16:03       ` Denys Vlasenko
2011-08-17 16:03         ` Denys Vlasenko
2011-08-17 13:13   ` Kay Sievers
2011-08-17 13:45     ` Oleg Nesterov
2011-08-17 13:45       ` Oleg Nesterov
2011-08-17 15:45       ` Kay Sievers
2011-08-17 15:45         ` Kay Sievers
2011-08-17 15:53         ` Alan Cox
2011-08-17 15:53           ` Alan Cox
2011-08-17 16:20         ` Oleg Nesterov
2011-08-17 16:20           ` Oleg Nesterov
2011-08-17 16:47           ` Kay Sievers
2011-08-17 16:47             ` Kay Sievers
2011-08-17 18:57             ` Oleg Nesterov
2011-08-17 18:57               ` Oleg Nesterov
2011-08-17 20:56               ` Kay Sievers
2011-08-17 20:56                 ` Kay Sievers
2011-08-18 12:43       ` Lennart Poettering
2011-08-18 12:43         ` Lennart Poettering
2011-08-18 14:25         ` Oleg Nesterov
2011-08-18 14:25           ` Oleg Nesterov
2011-08-18 18:11           ` Kay Sievers
2011-08-18 18:48             ` Oleg Nesterov
2011-08-18 18:48               ` Oleg Nesterov
2011-08-19  1:31               ` Kay Sievers
2011-08-19  1:31                 ` Kay Sievers
2011-08-19 12:25                 ` Oleg Nesterov
2011-08-19 12:25                   ` Oleg Nesterov
2011-08-19 12:44                   ` Kay Sievers
2011-08-19 12:44                     ` Kay Sievers
2011-08-19 13:13                     ` Oleg Nesterov
2011-08-19 13:13                       ` Oleg Nesterov
2011-08-19 14:20                       ` Kay Sievers
2011-08-19 14:58                         ` Oleg Nesterov
2011-08-19 14:58                           ` Oleg Nesterov
2011-08-20 15:33                           ` Oleg Nesterov
2011-08-20 15:33                             ` Oleg Nesterov
2011-08-21 18:33                             ` Kay Sievers
2011-08-22 11:14                               ` Oleg Nesterov
2011-08-22 11:14                                 ` Oleg Nesterov
2011-08-22 23:48                                 ` Kay Sievers
2011-08-22 23:48                                   ` Kay Sievers
2011-08-18 21:23             ` Linus Torvalds
2011-08-18 21:23               ` Linus Torvalds
2011-08-18 21:55               ` Kay Sievers
2011-08-18 21:55                 ` Kay Sievers
2011-08-18 22:22                 ` Linus Torvalds
2011-08-18 22:22                   ` Linus Torvalds
2011-08-19  0:48                   ` Kay Sievers
2011-08-19  0:48                     ` Kay Sievers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.