linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] prctl: propagate has_child_subreaper flag to every descendant
@ 2017-01-19 16:43 Pavel Tikhomirov
  2017-01-20 18:14 ` Oleg Nesterov
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Pavel Tikhomirov @ 2017-01-19 16:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrew Morton, Cyrill Gorcunov, John Stultz,
	Thomas Gleixner, Oleg Nesterov, Nicolas Pitre, Pavel Tikhomirov,
	Michal Hocko, Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.

To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.

In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.

Using csr_descendant list to collect descendants and do tree walk
without recursion.

Optimize:

a) When descendant already has has_child_subreaper flag all his subtree
has it too already.

b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.

So we can skip their(a,b) subtree from walk.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 include/linux/sched.h |  2 ++
 kernel/sys.c          | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d19052..9cb44c4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,6 +1715,8 @@ struct task_struct {
 	struct signal_struct *signal;
 	struct sighand_struct *sighand;
 
+	struct list_head csr_descendant;
+
 	sigset_t blocked, real_blocked;
 	sigset_t saved_sigmask;	/* restored if set_restore_sigmask() was used */
 	struct sigpending pending;
diff --git a/kernel/sys.c b/kernel/sys.c
index 842914e..05b6d7d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2063,6 +2063,55 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 }
 #endif
 
+static DEFINE_SPINLOCK(descendants_lock);
+
+static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2)
+{
+	LIST_HEAD(descendants);
+
+	reaper->signal->is_child_subreaper = arg2;
+	if (!arg2)
+		return;
+
+	spin_lock(&descendants_lock);
+	read_lock(&tasklist_lock);
+
+	list_add(&reaper->csr_descendant, &descendants);
+
+	while (!list_empty(&descendants)) {
+		struct task_struct *tsk;
+		struct task_struct *p;
+
+		tsk = list_first_entry(&descendants, struct task_struct,
+				csr_descendant);
+
+		list_for_each_entry(p, &tsk->children, sibling) {
+			/*
+			 * If tsk has has_child_subreaper - all its decendants
+			 * already have these flag too and new decendants will
+			 * inherit it on fork, so nothing to be done here.
+			 */
+			if (p->signal->has_child_subreaper)
+				continue;
+
+			/*
+			 * If we've found child_reaper - skip descendants in
+			 * it's subtree as they will never get out pidns
+			 */
+			if (is_child_reaper(task_pid(p)))
+				continue;
+
+			p->signal->has_child_subreaper = 1;
+			list_add(&p->csr_descendant, &descendants);
+		}
+
+		list_del_init(&tsk->csr_descendant);
+	}
+
+	read_unlock(&tasklist_lock);
+	spin_unlock(&descendants_lock);
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2213,7 +2262,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = prctl_get_tid_address(me, (int __user **)arg2);
 		break;
 	case PR_SET_CHILD_SUBREAPER:
-		me->signal->is_child_subreaper = !!arg2;
+		prctl_set_child_subreaper(me, !!arg2);
 		break;
 	case PR_GET_CHILD_SUBREAPER:
 		error = put_user(me->signal->is_child_subreaper,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
  2017-01-19 16:43 [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
@ 2017-01-20 18:14 ` Oleg Nesterov
  2017-01-22 10:00   ` Pavel Tikhomirov
  2017-01-22 10:11   ` Pavel Tikhomirov
  2017-01-23 11:57 ` [PATCH] introduce the walk_process_tree() helper Oleg Nesterov
  2017-01-23 16:44 ` setns() && PR_SET_CHILD_SUBREAPER Oleg Nesterov
  2 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-20 18:14 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

On 01/19, Pavel Tikhomirov wrote:
>
> Having these two
> differently behaving groups can lead to confusion. Also it is
> a problem for CRIU, as when we restore process tree we need to
> somehow determine which descendants belong to which group and
> much harder - to put them exactly to these group.

Hmm. could you explain how this change helps CRIU? I mean, why
restorer can't do prctl(CHILD_SUBREAPER) before the first fork?

Anyway, afaics the patch is sub-optimal and not correct...

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1715,6 +1715,8 @@ struct task_struct {
>  	struct signal_struct *signal;
>  	struct sighand_struct *sighand;
>
> +	struct list_head csr_descendant;
> +

You don't need this new member and descendants_lock. task_struct has
the ->real_parent pointer so you can work the tree without recursion.

> +static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2)
> +{
> +	LIST_HEAD(descendants);
> +
> +	reaper->signal->is_child_subreaper = arg2;
> +	if (!arg2)
> +		return;
> +
> +	spin_lock(&descendants_lock);
> +	read_lock(&tasklist_lock);
> +
> +	list_add(&reaper->csr_descendant, &descendants);
> +
> +	while (!list_empty(&descendants)) {
> +		struct task_struct *tsk;
> +		struct task_struct *p;
> +
> +		tsk = list_first_entry(&descendants, struct task_struct,
> +				csr_descendant);
> +
> +		list_for_each_entry(p, &tsk->children, sibling) {

This is not enough. Every thread has its own ->children list, you need
to walk the sub-threads as well.

> +			 * If we've found child_reaper - skip descendants in
> +			 * it's subtree as they will never get out pidns
> +			 */
> +			if (is_child_reaper(task_pid(p)))
> +				continue;

Again, a child reaper can be multi-threaded, this check can be false
negative.

Probably is_child_reaper() should be renamed somehow and a new helper
makes sense... something like

	bool task_is_child_reaper(struct task_struct *p)
	{
		return same_thread_group(p, task_active_pid_ns(p)->child_reaper);
	}

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
  2017-01-20 18:14 ` Oleg Nesterov
@ 2017-01-22 10:00   ` Pavel Tikhomirov
  2017-01-22 10:11   ` Pavel Tikhomirov
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Tikhomirov @ 2017-01-22 10:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko



On 01/20/2017 09:14 PM, Oleg Nesterov wrote:
> On 01/19, Pavel Tikhomirov wrote:
>>
>> Having these two
>> differently behaving groups can lead to confusion. Also it is
>> a problem for CRIU, as when we restore process tree we need to
>> somehow determine which descendants belong to which group and
>> much harder - to put them exactly to these group.
>
> Hmm. could you explain how this change helps CRIU? I mean, why
> restorer can't do prctl(CHILD_SUBREAPER) before the first fork?

Imagine we have these tree in pidns:

1: has_child_subreaper == 0 && is_child_subreaper == 0
|-2: has_child_subreaper == 0 && is_child_subreaper == 1
| |-3: has_child_subreaper == 0 && is_child_subreaper == 0
| | |-5: has_child_subreaper == 0 && is_child_subreaper == 0
| |-4: has_child_subreaper == 1 && is_child_subreaper == 0
| | |-6: has_child_subreaper == 1 && is_child_subreaper == 0

before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1.
after c/r: (where restorer had is_child_subreaper == 1, everybody in the 
tree will have has_child_subreaper == 1) Everybody will reparent to 2.

>
> Anyway, afaics the patch is sub-optimal and not correct...
>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1715,6 +1715,8 @@ struct task_struct {
>>  	struct signal_struct *signal;
>>  	struct sighand_struct *sighand;
>>
>> +	struct list_head csr_descendant;
>> +
>
> You don't need this new member and descendants_lock. task_struct has
> the ->real_parent pointer so you can work the tree without recursion.

Sorry I don't get how I can walk down the tree of all descendants with 
help of ->real_parent pointer, can you please point on some example or 
explain a bit more? (I see task_is_descendant() in 
security/yama/yama_lsm.c but we will need to check it for every process, 
not only descendants, the latter can be a lot faster.)

>
>> +static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2)
>> +{
>> +	LIST_HEAD(descendants);
>> +
>> +	reaper->signal->is_child_subreaper = arg2;
>> +	if (!arg2)
>> +		return;
>> +
>> +	spin_lock(&descendants_lock);
>> +	read_lock(&tasklist_lock);
>> +
>> +	list_add(&reaper->csr_descendant, &descendants);
>> +
>> +	while (!list_empty(&descendants)) {
>> +		struct task_struct *tsk;
>> +		struct task_struct *p;
>> +
>> +		tsk = list_first_entry(&descendants, struct task_struct,
>> +				csr_descendant);
>> +
>> +		list_for_each_entry(p, &tsk->children, sibling) {
>
> This is not enough. Every thread has its own ->children list, you need
> to walk the sub-threads as well.

Will do.

>
>> +			 * If we've found child_reaper - skip descendants in
>> +			 * it's subtree as they will never get out pidns
>> +			 */
>> +			if (is_child_reaper(task_pid(p)))
>> +				continue;
>
> Again, a child reaper can be multi-threaded, this check can be false
> negative.
>
> Probably is_child_reaper() should be renamed somehow and a new helper
> makes sense... something like

Will do.

>
> 	bool task_is_child_reaper(struct task_struct *p)
> 	{
> 		return same_thread_group(p, task_active_pid_ns(p)->child_reaper);
> 	}
>
> Oleg.
>

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
  2017-01-20 18:14 ` Oleg Nesterov
  2017-01-22 10:00   ` Pavel Tikhomirov
@ 2017-01-22 10:11   ` Pavel Tikhomirov
  2017-01-23 11:55     ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Pavel Tikhomirov @ 2017-01-22 10:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

Sorry I had some problem with mail-agent, resend to be on the safe side.

On 01/20/2017 09:14 PM, Oleg Nesterov wrote:
> On 01/19, Pavel Tikhomirov wrote:
>>
>> Having these two
>> differently behaving groups can lead to confusion. Also it is
>> a problem for CRIU, as when we restore process tree we need to
>> somehow determine which descendants belong to which group and
>> much harder - to put them exactly to these group.
>
> Hmm. could you explain how this change helps CRIU? I mean, why
> restorer can't do prctl(CHILD_SUBREAPER) before the first fork?

Imagine we have these tree in pidns:

1: has_child_subreaper == 0 && is_child_subreaper == 0
|-2: has_child_subreaper == 0 && is_child_subreaper == 1
| |-3: has_child_subreaper == 0 && is_child_subreaper == 0
| | |-5: has_child_subreaper == 0 && is_child_subreaper == 0
| |-4: has_child_subreaper == 1 && is_child_subreaper == 0
| | |-6: has_child_subreaper == 1 && is_child_subreaper == 0

before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1.
after c/r: (where restorer had is_child_subreaper == 1, everybody in the 
tree will have has_child_subreaper == 1) Everybody will reparent to 2.

>
> Anyway, afaics the patch is sub-optimal and not correct...
>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1715,6 +1715,8 @@ struct task_struct {
>>  	struct signal_struct *signal;
>>  	struct sighand_struct *sighand;
>>
>> +	struct list_head csr_descendant;
>> +
>
> You don't need this new member and descendants_lock. task_struct has
> the ->real_parent pointer so you can work the tree without recursion.

Sorry I don't get how I can walk down the tree of all descendants with 
help of ->real_parent pointer, can you please point on some example or 
explain a bit more? (I see task_is_descendant() in 
security/yama/yama_lsm.c but we will need to check it for every process, 
not only descendants, the latter can be a lot faster.)

>
>> +static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2)
>> +{
>> +	LIST_HEAD(descendants);
>> +
>> +	reaper->signal->is_child_subreaper = arg2;
>> +	if (!arg2)
>> +		return;
>> +
>> +	spin_lock(&descendants_lock);
>> +	read_lock(&tasklist_lock);
>> +
>> +	list_add(&reaper->csr_descendant, &descendants);
>> +
>> +	while (!list_empty(&descendants)) {
>> +		struct task_struct *tsk;
>> +		struct task_struct *p;
>> +
>> +		tsk = list_first_entry(&descendants, struct task_struct,
>> +				csr_descendant);
>> +
>> +		list_for_each_entry(p, &tsk->children, sibling) {
>
> This is not enough. Every thread has its own ->children list, you need
> to walk the sub-threads as well.

Will do.

>
>> +			 * If we've found child_reaper - skip descendants in
>> +			 * it's subtree as they will never get out pidns
>> +			 */
>> +			if (is_child_reaper(task_pid(p)))
>> +				continue;
>
> Again, a child reaper can be multi-threaded, this check can be false
> negative.
>
> Probably is_child_reaper() should be renamed somehow and a new helper
> makes sense... something like

Will do.

>
> 	bool task_is_child_reaper(struct task_struct *p)
> 	{
> 		return same_thread_group(p, task_active_pid_ns(p)->child_reaper);
> 	}
>
> Oleg.
>

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
  2017-01-22 10:11   ` Pavel Tikhomirov
@ 2017-01-23 11:55     ` Oleg Nesterov
  2017-01-23 12:52       ` task_is_descendant() cleanup Oleg Nesterov
  2017-01-23 14:30       ` [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
  0 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-23 11:55 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

On 01/22, Pavel Tikhomirov wrote:
>
> >
> >Hmm. could you explain how this change helps CRIU? I mean, why
> >restorer can't do prctl(CHILD_SUBREAPER) before the first fork?
>
> Imagine we have these tree in pidns:
>
> 1: has_child_subreaper == 0 && is_child_subreaper == 0
> |-2: has_child_subreaper == 0 && is_child_subreaper == 1
> | |-3: has_child_subreaper == 0 && is_child_subreaper == 0
> | | |-5: has_child_subreaper == 0 && is_child_subreaper == 0
> | |-4: has_child_subreaper == 1 && is_child_subreaper == 0
> | | |-6: has_child_subreaper == 1 && is_child_subreaper == 0
>
> before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1.
> after c/r: (where restorer had is_child_subreaper == 1, everybody in the
> tree will have has_child_subreaper == 1) Everybody will reparent to 2.

This is clear, but this can only happen if 2 forks 3 and after that
sets is_child_subreaper, right?

And if someone actually does this then your patch can break this
application, no?

IOW. Currently CRIU can't restore the process tree with the same
has_child_subreaper bits if some process forks before
prctl(PR_SET_CHILD_SUBREAPER). It restores the tree as if prctl()
was called before the 1st fork.

So you change the semantics of PR_SET_CHILD_SUBREAPER and now CRIU
is fine simply because you remove this feature: the sub-reaper can
no longer pre-fork the children which should reparent to the previous
reaper.

I won't really argure, but I am not sure this is good idea... At least
I think this should be clearly documented.

> >You don't need this new member and descendants_lock. task_struct has
> >the ->real_parent pointer so you can work the tree without recursion.
>
> Sorry I don't get how I can walk down the tree of all descendants with help
> of ->real_parent pointer, can you please point on some example or explain a
> bit more? (I see task_is_descendant() in security/yama/yama_lsm.c but we
> will need to check it for every process, not only descendants, the latter
> can be a lot faster.)

I'll send a patch, probably a generic helper makes sense.

Btw task_is_descendant() looks wrong at first glance.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] introduce the walk_process_tree() helper
  2017-01-19 16:43 [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
  2017-01-20 18:14 ` Oleg Nesterov
@ 2017-01-23 11:57 ` Oleg Nesterov
  2017-01-23 12:07   ` Oleg Nesterov
  2017-01-24 15:01   ` Pavel Tikhomirov
  2017-01-23 16:44 ` setns() && PR_SET_CHILD_SUBREAPER Oleg Nesterov
  2 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-23 11:57 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

Add the new helper to walk the process tree, the next patch adds a user.
Note that it visits the group leaders only, proc_visitor can do
for_each_thread itself or we can trivially extend walk_process_tree() to
do this.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched.h |  3 +++
 kernel/fork.c         | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9c009d..681f748 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3015,6 +3015,9 @@ extern bool current_is_single_threaded(void);
 #define for_each_process_thread(p, t)	\
 	for_each_process(p) for_each_thread(p, t)
 
+typedef int (*proc_visitor)(struct task_struct *p, void *data);
+void walk_process_tree(struct task_struct *top, proc_visitor, void *);
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
 	return tsk->signal->nr_threads;
diff --git a/kernel/fork.c b/kernel/fork.c
index 663c6a7..2522bae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2042,6 +2042,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 }
 #endif
 
+void walk_process_tree(struct task_struct *top, proc_visitor visitor, void *data)
+{
+	struct task_struct *leader, *parent, *child;
+	int res;
+
+	read_lock(&tasklist_lock);
+	leader = top = top->group_leader;
+down:
+	for_each_thread(leader, parent) {
+		list_for_each_entry(child, &parent->children, sibling) {
+			res = visitor(child, data);
+			if (res) {
+				if (res < 0)
+					goto out;
+				leader = child;
+				goto down;
+			}
+up:
+			;
+		}
+	}
+
+	if (leader != top) {
+		child = leader;
+		parent = child->real_parent;
+		leader = parent->group_leader;
+		goto up;
+	}
+out:
+	read_unlock(&tasklist_lock);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] introduce the walk_process_tree() helper
  2017-01-23 11:57 ` [PATCH] introduce the walk_process_tree() helper Oleg Nesterov
@ 2017-01-23 12:07   ` Oleg Nesterov
  2017-01-24 15:01   ` Pavel Tikhomirov
  1 sibling, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-23 12:07 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

On 01/23, Oleg Nesterov wrote:
>
> Add the new helper to walk the process tree, the next patch adds a user.
> Note that it visits the group leaders only, proc_visitor can do
> for_each_thread itself or we can trivially extend walk_process_tree() to
> do this.

Please consider this patch as a preparation for your change, or feel free
to fold this code into your patch.

Now you can simply do

	int xxx(struct task_struct *p, void *data)
	{
		if (p->signal->has_child_subreaper ||
		    is_child_reaper(task_pid(p))
			return 0;
		p->signal->has_child_subreaper = 1;
		return 1;
	}

	walk_process_tree(xxx, NULL);

walk_process_tree() handles the sub-threads correctly, but "p" is always the
leader so is_child_reaper() is fine, but probably we need a new helper anyway.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* task_is_descendant() cleanup
  2017-01-23 11:55     ` Oleg Nesterov
@ 2017-01-23 12:52       ` Oleg Nesterov
  2017-01-25 21:59         ` Kees Cook
  2017-01-23 14:30       ` [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
  1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-23 12:52 UTC (permalink / raw)
  To: Pavel Tikhomirov, Kees Cook
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

On 01/23, Oleg Nesterov wrote:
>
> Btw task_is_descendant() looks wrong at first glance.

No, I missed the 2nd ->group_leader dereference. Still this function looks
overcomplicated and the usage of thread_group_leader/group_leader just add
the unnecessary confusion. It can be simplified a little bit:

	static int task_is_descendant(struct task_struct *parent,
				      struct task_struct *child)
	{
		int rc = 0;
		struct task_struct *walker;

		if (!parent || !child)
			return 0;

		rcu_read_lock();
		for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
			if (same_thread_group(parent, walker)) {
				rc = 1;
				break;
			}
		rcu_read_unlock();

		return rc;
	}

Kees, I can send a patch if you think this very minor cleanup makes any sense.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
  2017-01-23 11:55     ` Oleg Nesterov
  2017-01-23 12:52       ` task_is_descendant() cleanup Oleg Nesterov
@ 2017-01-23 14:30       ` Pavel Tikhomirov
  2017-01-23 16:06         ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Pavel Tikhomirov @ 2017-01-23 14:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko, Lennart Poettering

Add to cc Lennart Poettering <lennart@poettering.net>

On 01/23/2017 02:55 PM, Oleg Nesterov wrote:
> On 01/22, Pavel Tikhomirov wrote:
>>
>>>
>>> Hmm. could you explain how this change helps CRIU? I mean, why
>>> restorer can't do prctl(CHILD_SUBREAPER) before the first fork?
>>
>> Imagine we have these tree in pidns:
>>
>> 1: has_child_subreaper == 0 && is_child_subreaper == 0
>> |-2: has_child_subreaper == 0 && is_child_subreaper == 1
>> | |-3: has_child_subreaper == 0 && is_child_subreaper == 0
>> | | |-5: has_child_subreaper == 0 && is_child_subreaper == 0
>> | |-4: has_child_subreaper == 1 && is_child_subreaper == 0
>> | | |-6: has_child_subreaper == 1 && is_child_subreaper == 0
>>
>> before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1.
>> after c/r: (where restorer had is_child_subreaper == 1, everybody in the
>> tree will have has_child_subreaper == 1) Everybody will reparent to 2.
>
> This is clear, but this can only happen if 2 forks 3 and after that
> sets is_child_subreaper, right?
>
> And if someone actually does this then your patch can break this
> application, no?
>
> IOW. Currently CRIU can't restore the process tree with the same
> has_child_subreaper bits if some process forks before
> prctl(PR_SET_CHILD_SUBREAPER). It restores the tree as if prctl()
> was called before the 1st fork.
>
> So you change the semantics of PR_SET_CHILD_SUBREAPER and now CRIU
> is fine simply because you remove this feature: the sub-reaper can
> no longer pre-fork the children which should reparent to the previous
> reaper.
>
> I won't really argure, but I am not sure this is good idea...

If one task uses these feature now it must be very carefull: if some our 
ancestor have enabled is_child_subreaper somewhere up the tree, forked 
our tree and after that disabled is_child_subreaper, so we already have 
has-flag and all children will inherit has-flag irrelevant to what is 
our order of fork/prctl-ing to become subreaper.

And only one way to check if the task has no has_child_subreaper flag is 
to create some childs kill them and see to where they will reparent, but 
I doubt that someone is doing these now.

> At least I think this should be clearly documented.

Yes, I surely need to add some documentation here. Thanks for mentioning 
that! - Will do.

>
>>> You don't need this new member and descendants_lock. task_struct has
>>> the ->real_parent pointer so you can work the tree without recursion.
>>
>> Sorry I don't get how I can walk down the tree of all descendants with help
>> of ->real_parent pointer, can you please point on some example or explain a
>> bit more? (I see task_is_descendant() in security/yama/yama_lsm.c but we
>> will need to check it for every process, not only descendants, the latter
>> can be a lot faster.)
>
> I'll send a patch, probably a generic helper makes sense.
>
> Btw task_is_descendant() looks wrong at first glance.
>
> Oleg.
>

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
  2017-01-23 14:30       ` [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
@ 2017-01-23 16:06         ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-23 16:06 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko, Lennart Poettering

On 01/23, Pavel Tikhomirov wrote:
>
> >IOW. Currently CRIU can't restore the process tree with the same
> >has_child_subreaper bits if some process forks before
> >prctl(PR_SET_CHILD_SUBREAPER). It restores the tree as if prctl()
> >was called before the 1st fork.
> >
> >So you change the semantics of PR_SET_CHILD_SUBREAPER and now CRIU
> >is fine simply because you remove this feature: the sub-reaper can
> >no longer pre-fork the children which should reparent to the previous
> >reaper.
> >
> >I won't really argure, but I am not sure this is good idea...
>
> If one task uses these feature now it must be very carefull: if some our
> ancestor have enabled is_child_subreaper somewhere up the tree, forked our
> tree and after that disabled is_child_subreaper, so we already have has-flag
> and all children will inherit has-flag irrelevant to what is our order of
> fork/prctl-ing to become subreaper.

Agreed.

So let me reword my initial question, why did you make this patch? Did you
actually hit a case when a child of is_child_subreaper process doesn't have
has_child_subreaper bit set?

If yes, then perhaps that application has a reason to do this and your patch
can break it? If no, then you can probably forget this until you have a CRIU
bug report ;)

But let me repeat, I won't really argue. And I even agree that this change
makes the semantics of PR_SET_CHILD_SUBREAPER more clear, just I am always
nervous when we add the subtle user-visible changes like this, and I greatly
misundestood the changelog as if CRIU needs to do prctl(SET_CHILD_SUBREAPER)
after it has already restored the process tree and this can't work even in
the common case.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* setns() && PR_SET_CHILD_SUBREAPER
  2017-01-19 16:43 [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
  2017-01-20 18:14 ` Oleg Nesterov
  2017-01-23 11:57 ` [PATCH] introduce the walk_process_tree() helper Oleg Nesterov
@ 2017-01-23 16:44 ` Oleg Nesterov
  2017-01-23 18:21   ` Eric W. Biederman
  2 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-23 16:44 UTC (permalink / raw)
  To: Pavel Tikhomirov, Eric W. Biederman, Lennart Poettering, Kay Sievers
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

And this discussion reminds me again that I do not understand how setns()
and PR_SET_CHILD_SUBREAPER should play together... Add cc's.


Suppose we have a process P in the root namespace and another namespace X.

P does setns() and enters the X namespace.
P forks a child C.

C forks a grandchild G.
C exits.

The question is, where should we reparent the grandchild G? In the normal
case it will be reparented to X->child_reaper and this looks correct.

But lets suppose that P runs with the ->has_child_subreaper bit set. In
this case it will be reparented to P's sub-reaper or a global init, and
given that P can't control its ->has_child_subreaper flag this does not
look right to me.

I can make a simple patch but perhaps I missed something or we actually
want this (imo strange) behaviour?

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: setns() && PR_SET_CHILD_SUBREAPER
  2017-01-23 16:44 ` setns() && PR_SET_CHILD_SUBREAPER Oleg Nesterov
@ 2017-01-23 18:21   ` Eric W. Biederman
  2017-01-24 14:07     ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2017-01-23 18:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Tikhomirov, Lennart Poettering, Kay Sievers, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Cyrill Gorcunov, John Stultz,
	Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

Oleg Nesterov <oleg@redhat.com> writes:

> And this discussion reminds me again that I do not understand how setns()
> and PR_SET_CHILD_SUBREAPER should play together... Add cc's.

I agree that they are currently playing together incorrectly.

> Suppose we have a process P in the root namespace and another namespace X.
>
> P does setns() and enters the X namespace.
> P forks a child C.
>
> C forks a grandchild G.
> C exits.
>
> The question is, where should we reparent the grandchild G? In the normal
> case it will be reparented to X->child_reaper and this looks correct.
>
> But lets suppose that P runs with the ->has_child_subreaper bit set. In
> this case it will be reparented to P's sub-reaper or a global init, and
> given that P can't control its ->has_child_subreaper flag this does not
> look right to me.
>
> I can make a simple patch but perhaps I missed something or we actually
> want this (imo strange) behaviour?

We definitely do not want a child to be repareted out of a pid namespace
when the pid namespace has a perfectly fine child_reaper.

The special case for the init_task in find_new_reaper appears to be the
instance of this problem that was considered in the code.

Given the semantics described and asked for of PR_SET_CHILD_SUBREAPER I
believe has_child_subreaper needs to be strictly considered an
implementation detail and any way that userspace can observe it a bug in
the code.

Semantically what we want to do is walk up the parents in the process
tree.  If a parent has is_child_subreaper we stop at it.  If the
transition from one parent to the next we are switching pid namespaces
we want the reaper from the pid namespace.

As I recall has_child_subreaper was just supposed to be an optimization
so the common case would not have to walk up the process tree when
finding it's parent.

If we retain any optimizations such as has_child_subreaper please
consider the case where a process with is_child_subreaper set exits,
and what happens to it's children.

Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: setns() && PR_SET_CHILD_SUBREAPER
  2017-01-23 18:21   ` Eric W. Biederman
@ 2017-01-24 14:07     ` Oleg Nesterov
  2017-01-24 15:24       ` Eric W. Biederman
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-24 14:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Tikhomirov, Lennart Poettering, Kay Sievers, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Cyrill Gorcunov, John Stultz,
	Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

On 01/24, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Suppose we have a process P in the root namespace and another namespace X.
> >
> > P does setns() and enters the X namespace.
> > P forks a child C.
> >
> > C forks a grandchild G.
> > C exits.
> >
> > The question is, where should we reparent the grandchild G? In the normal
> > case it will be reparented to X->child_reaper and this looks correct.
> >
> > But lets suppose that P runs with the ->has_child_subreaper bit set. In
> > this case it will be reparented to P's sub-reaper or a global init, and
> > given that P can't control its ->has_child_subreaper flag this does not
> > look right to me.
> >
> > I can make a simple patch but perhaps I missed something or we actually
> > want this (imo strange) behaviour?
>
> We definitely do not want a child to be repareted out of a pid namespace
> when the pid namespace has a perfectly fine child_reaper.
>
> The special case for the init_task in find_new_reaper appears to be the
> instance of this problem that was considered in the code.

Actually we should blame the same_thread_group(reaper, child_reaper) check,
it should had ensured we could not cross the namespaces, but it is not
enough. Because this logic predates setns().

> Semantically what we want to do is walk up the parents in the process
> tree.  If a parent has is_child_subreaper we stop at it.  If the
> transition from one parent to the next we are switching pid namespaces
> we want the reaper from the pid namespace.

Yes, this is what I have in mind, see the patch below. I need to re-check
it and update the comment to explain why we can't simply check child_reaper
as we currently do.

This way we can start the search from father->real_parent, but the comment
above the "reaper == &init_task" is no longer correct, we always need this
check although perhaps is_idle_task(reaper) would be better.

> As I recall has_child_subreaper was just supposed to be an optimization
> so the common case would not have to walk up the process tree when
> finding it's parent.

Yep.

> If we retain any optimizations such as has_child_subreaper please
> consider the case where a process with is_child_subreaper set exits,
> and what happens to it's children.

Yes, in this case it should not have any effect. Well, there is another
corner case, perhaps we should turn

		if (!reaper->signal->is_child_subreaper)
			continue;

into
		if (!reaper->signal->is_child_subreaper) {
			if (!reaper->signal->has_child_subreaper)
				break;
			continue;
		}

this looks a bit more correct if the exited "is_child_subreaper" process
was forked, and after that its parent called prctl(SET_CHILD_SUBREAPER).
But I think we do not care and Pavel is going to eliminate the case when
a child of is_child_subreaper task can run without has_child_subreaper
flag set.

So what do you think about the patch below?

Oleg.

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -569,15 +569,15 @@ static struct task_struct *find_new_reaper(struct task_struct *father,
 		return thread;
 
 	if (father->signal->has_child_subreaper) {
+		unsigned int level = task_pid(father)->level;
 		/*
 		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
-		 * We start from father to ensure we can not look into another
-		 * namespace, this is safe because all its threads are dead.
+		 * We check pid->level, this is slightly more efficient than
+		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
 		 */
-		for (reaper = father;
-		     !same_thread_group(reaper, child_reaper);
+		for (reaper = father->real_parent;
+		     task_pid(reaper)->level == level;
 		     reaper = reaper->real_parent) {
-			/* call_usermodehelper() descendants need this check */
 			if (reaper == &init_task)
 				break;
 			if (!reaper->signal->is_child_subreaper)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] introduce the walk_process_tree() helper
  2017-01-23 11:57 ` [PATCH] introduce the walk_process_tree() helper Oleg Nesterov
  2017-01-23 12:07   ` Oleg Nesterov
@ 2017-01-24 15:01   ` Pavel Tikhomirov
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Tikhomirov @ 2017-01-24 15:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Cyrill Gorcunov,
	John Stultz, Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko


Will include it in patch-set with documentation fix. Thanks Oleg!

Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

On 01/23/2017 02:57 PM, Oleg Nesterov wrote:
> Add the new helper to walk the process tree, the next patch adds a user.
> Note that it visits the group leaders only, proc_visitor can do
> for_each_thread itself or we can trivially extend walk_process_tree() to
> do this.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/sched.h |  3 +++
>  kernel/fork.c         | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e9c009d..681f748 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3015,6 +3015,9 @@ extern bool current_is_single_threaded(void);
>  #define for_each_process_thread(p, t)	\
>  	for_each_process(p) for_each_thread(p, t)
>
> +typedef int (*proc_visitor)(struct task_struct *p, void *data);
> +void walk_process_tree(struct task_struct *top, proc_visitor, void *);
> +
>  static inline int get_nr_threads(struct task_struct *tsk)
>  {
>  	return tsk->signal->nr_threads;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 663c6a7..2522bae 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2042,6 +2042,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
>  }
>  #endif
>
> +void walk_process_tree(struct task_struct *top, proc_visitor visitor, void *data)
> +{
> +	struct task_struct *leader, *parent, *child;
> +	int res;
> +
> +	read_lock(&tasklist_lock);
> +	leader = top = top->group_leader;
> +down:
> +	for_each_thread(leader, parent) {
> +		list_for_each_entry(child, &parent->children, sibling) {
> +			res = visitor(child, data);
> +			if (res) {
> +				if (res < 0)
> +					goto out;
> +				leader = child;
> +				goto down;
> +			}
> +up:
> +			;
> +		}
> +	}
> +
> +	if (leader != top) {
> +		child = leader;
> +		parent = child->real_parent;
> +		leader = parent->group_leader;
> +		goto up;
> +	}
> +out:
> +	read_unlock(&tasklist_lock);
> +}
> +
>  #ifndef ARCH_MIN_MMSTRUCT_ALIGN
>  #define ARCH_MIN_MMSTRUCT_ALIGN 0
>  #endif
>

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: setns() && PR_SET_CHILD_SUBREAPER
  2017-01-24 14:07     ` Oleg Nesterov
@ 2017-01-24 15:24       ` Eric W. Biederman
  2017-01-30 18:16         ` Oleg Nesterov
  2017-01-30 18:17         ` [PATCH] exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction Oleg Nesterov
  0 siblings, 2 replies; 19+ messages in thread
From: Eric W. Biederman @ 2017-01-24 15:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Tikhomirov, Lennart Poettering, Kay Sievers, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Cyrill Gorcunov, John Stultz,
	Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

Oleg Nesterov <oleg@redhat.com> writes:

> On 01/24, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > Suppose we have a process P in the root namespace and another namespace X.
>> >
>> > P does setns() and enters the X namespace.
>> > P forks a child C.
>> >
>> > C forks a grandchild G.
>> > C exits.
>> >
>> > The question is, where should we reparent the grandchild G? In the normal
>> > case it will be reparented to X->child_reaper and this looks correct.
>> >
>> > But lets suppose that P runs with the ->has_child_subreaper bit set. In
>> > this case it will be reparented to P's sub-reaper or a global init, and
>> > given that P can't control its ->has_child_subreaper flag this does not
>> > look right to me.
>> >
>> > I can make a simple patch but perhaps I missed something or we actually
>> > want this (imo strange) behaviour?
>>
>> We definitely do not want a child to be repareted out of a pid namespace
>> when the pid namespace has a perfectly fine child_reaper.
>>
>> The special case for the init_task in find_new_reaper appears to be the
>> instance of this problem that was considered in the code.
>
> Actually we should blame the same_thread_group(reaper, child_reaper) check,
> it should had ensured we could not cross the namespaces, but it is not
> enough. Because this logic predates setns().
>
>> Semantically what we want to do is walk up the parents in the process
>> tree.  If a parent has is_child_subreaper we stop at it.  If the
>> transition from one parent to the next we are switching pid namespaces
>> we want the reaper from the pid namespace.
>
> Yes, this is what I have in mind, see the patch below. I need to re-check
> it and update the comment to explain why we can't simply check child_reaper
> as we currently do.
>
> This way we can start the search from father->real_parent, but the comment
> above the "reaper == &init_task" is no longer correct, we always need this
> check although perhaps is_idle_task(reaper) would be better.
>
>> As I recall has_child_subreaper was just supposed to be an optimization
>> so the common case would not have to walk up the process tree when
>> finding it's parent.
>
> Yep.
>
>> If we retain any optimizations such as has_child_subreaper please
>> consider the case where a process with is_child_subreaper set exits,
>> and what happens to it's children.
>
> Yes, in this case it should not have any effect. Well, there is another
> corner case, perhaps we should turn
>
> 		if (!reaper->signal->is_child_subreaper)
> 			continue;
>
> into
> 		if (!reaper->signal->is_child_subreaper) {
> 			if (!reaper->signal->has_child_subreaper)
> 				break;
> 			continue;
> 		}
>
> this looks a bit more correct if the exited "is_child_subreaper" process
> was forked, and after that its parent called prctl(SET_CHILD_SUBREAPER).
> But I think we do not care and Pavel is going to eliminate the case when
> a child of is_child_subreaper task can run without has_child_subreaper
> flag set.

As long as we update the flag when reparenting so that it is
accurate and we clear it when creating a child in a new pid namespace.

> So what do you think about the patch below?

That does look like the correct logic.

Whose tree do we want to run merge these fixes through?  I can pick them
up if that would be convinient.

Eric


> Oleg.
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -569,15 +569,15 @@ static struct task_struct *find_new_reaper(struct task_struct *father,
>  		return thread;
>  
>  	if (father->signal->has_child_subreaper) {
> +		unsigned int level = task_pid(father)->level;
>  		/*
>  		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
> -		 * We start from father to ensure we can not look into another
> -		 * namespace, this is safe because all its threads are dead.
> +		 * We check pid->level, this is slightly more efficient than
> +		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
>  		 */
> -		for (reaper = father;
> -		     !same_thread_group(reaper, child_reaper);
> +		for (reaper = father->real_parent;
> +		     task_pid(reaper)->level == level;
>  		     reaper = reaper->real_parent) {
> -			/* call_usermodehelper() descendants need this check */
>  			if (reaper == &init_task)
>  				break;
>  			if (!reaper->signal->is_child_subreaper)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: task_is_descendant() cleanup
  2017-01-23 12:52       ` task_is_descendant() cleanup Oleg Nesterov
@ 2017-01-25 21:59         ` Kees Cook
  2017-01-30 13:49           ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Kees Cook @ 2017-01-25 21:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Tikhomirov, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Cyrill Gorcunov, John Stultz, Thomas Gleixner, Nicolas Pitre,
	Michal Hocko, Stanislav Kinsburskiy, Mateusz Guzik, LKML,
	Pavel Emelyanov, Konstantin Khorenko

On Mon, Jan 23, 2017 at 4:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/23, Oleg Nesterov wrote:
>>
>> Btw task_is_descendant() looks wrong at first glance.
>
> No, I missed the 2nd ->group_leader dereference. Still this function looks
> overcomplicated and the usage of thread_group_leader/group_leader just add
> the unnecessary confusion. It can be simplified a little bit:
>
>         static int task_is_descendant(struct task_struct *parent,
>                                       struct task_struct *child)
>         {
>                 int rc = 0;
>                 struct task_struct *walker;
>
>                 if (!parent || !child)
>                         return 0;
>
>                 rcu_read_lock();
>                 for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
>                         if (same_thread_group(parent, walker)) {
>                                 rc = 1;
>                                 break;
>                         }
>                 rcu_read_unlock();
>
>                 return rc;
>         }
>
> Kees, I can send a patch if you think this very minor cleanup makes any sense.

Err, isn't checking same_thread_group() at every level more expensive
than what I currently have?

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: task_is_descendant() cleanup
  2017-01-25 21:59         ` Kees Cook
@ 2017-01-30 13:49           ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-30 13:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: Pavel Tikhomirov, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Cyrill Gorcunov, John Stultz, Thomas Gleixner, Nicolas Pitre,
	Michal Hocko, Stanislav Kinsburskiy, Mateusz Guzik, LKML,
	Pavel Emelyanov, Konstantin Khorenko

On 01/25, Kees Cook wrote:
>
> On Mon, Jan 23, 2017 at 4:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 01/23, Oleg Nesterov wrote:
> >>
> >> Btw task_is_descendant() looks wrong at first glance.
> >
> > No, I missed the 2nd ->group_leader dereference. Still this function looks
> > overcomplicated and the usage of thread_group_leader/group_leader just add
> > the unnecessary confusion. It can be simplified a little bit:
> >
> >         static int task_is_descendant(struct task_struct *parent,
> >                                       struct task_struct *child)
> >         {
> >                 int rc = 0;
> >                 struct task_struct *walker;
> >
> >                 if (!parent || !child)
> >                         return 0;
> >
> >                 rcu_read_lock();
> >                 for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
> >                         if (same_thread_group(parent, walker)) {
> >                                 rc = 1;
> >                                 break;
> >                         }
> >                 rcu_read_unlock();
> >
> >                 return rc;
> >         }
> >
> > Kees, I can send a patch if you think this very minor cleanup makes any sense.
>
> Err, isn't checking same_thread_group() at every level more expensive
> than what I currently have?

Well, same_thread_group(p1,p2) is just

	 p1->signal == p2->signal

yes this is a bit more expensive than

	walker == parent

we currently have, yes. But this eliminates

	if (!thread_group_leader(walker))
		walker = rcu_dereference(walker->group_leader);

we currently do at every level. And note that "parent" can exec and change its
->group_leader at any time, we probably do not care but this looks confusing.


But please forget, this is really minor.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: setns() && PR_SET_CHILD_SUBREAPER
  2017-01-24 15:24       ` Eric W. Biederman
@ 2017-01-30 18:16         ` Oleg Nesterov
  2017-01-30 18:17         ` [PATCH] exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction Oleg Nesterov
  1 sibling, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-30 18:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Tikhomirov, Lennart Poettering, Kay Sievers, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Cyrill Gorcunov, John Stultz,
	Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

Hi Eric,

sorry for delay, I was sick.

On 01/25, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > So what do you think about the patch below?
>
> That does look like the correct logic.

Great,

> Whose tree do we want to run merge these fixes through?  I can pick them
> up if that would be convinient.

Thanks! this would be nice, I am sending the patch.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  2017-01-24 15:24       ` Eric W. Biederman
  2017-01-30 18:16         ` Oleg Nesterov
@ 2017-01-30 18:17         ` Oleg Nesterov
  1 sibling, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2017-01-30 18:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Tikhomirov, Lennart Poettering, Kay Sievers, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Cyrill Gorcunov, John Stultz,
	Thomas Gleixner, Nicolas Pitre, Michal Hocko,
	Stanislav Kinsburskiy, Mateusz Guzik, linux-kernel,
	Pavel Emelyanov, Konstantin Khorenko

find_new_reaper() checks same_thread_group(reaper, child_reaper) to
prevent the cross-namespace reparenting but this is not enough if the
exiting parent was injected by setns() + fork().

Suppose we have a process P in the root namespace and some namespace X.
P does setns() to enter the X namespace, and forks the child C.
C forks a grandchild G and exits.

The grandchild G should be re-parented to X->child_reaper, but in this
case the ->real_parent chain does not lead to ->child_reaper, so it will
be wrongly reparanted to P's sub-reaper or a global init.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 76e263e..e8b0776 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -569,15 +569,18 @@ static struct task_struct *find_new_reaper(struct task_struct *father,
 		return thread;
 
 	if (father->signal->has_child_subreaper) {
+		unsigned int ns_level = task_pid(father)->level;
 		/*
 		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
-		 * We start from father to ensure we can not look into another
-		 * namespace, this is safe because all its threads are dead.
+		 * We can't check reaper != child_reaper to ensure we do not
+		 * cross the namespaces, the exiting parent could be injected
+		 * by setns() + fork().
+		 * We check pid->level, this is slightly more efficient than
+		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
 		 */
-		for (reaper = father;
-		     !same_thread_group(reaper, child_reaper);
+		for (reaper = father->real_parent;
+		     task_pid(reaper)->level == ns_level;
 		     reaper = reaper->real_parent) {
-			/* call_usermodehelper() descendants need this check */
 			if (reaper == &init_task)
 				break;
 			if (!reaper->signal->is_child_subreaper)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-01-30 18:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-19 16:43 [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
2017-01-20 18:14 ` Oleg Nesterov
2017-01-22 10:00   ` Pavel Tikhomirov
2017-01-22 10:11   ` Pavel Tikhomirov
2017-01-23 11:55     ` Oleg Nesterov
2017-01-23 12:52       ` task_is_descendant() cleanup Oleg Nesterov
2017-01-25 21:59         ` Kees Cook
2017-01-30 13:49           ` Oleg Nesterov
2017-01-23 14:30       ` [PATCH] prctl: propagate has_child_subreaper flag to every descendant Pavel Tikhomirov
2017-01-23 16:06         ` Oleg Nesterov
2017-01-23 11:57 ` [PATCH] introduce the walk_process_tree() helper Oleg Nesterov
2017-01-23 12:07   ` Oleg Nesterov
2017-01-24 15:01   ` Pavel Tikhomirov
2017-01-23 16:44 ` setns() && PR_SET_CHILD_SUBREAPER Oleg Nesterov
2017-01-23 18:21   ` Eric W. Biederman
2017-01-24 14:07     ` Oleg Nesterov
2017-01-24 15:24       ` Eric W. Biederman
2017-01-30 18:16         ` Oleg Nesterov
2017-01-30 18:17         ` [PATCH] exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).