linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] proc: deuglify task_state()
@ 2014-11-07 20:14 Oleg Nesterov
  2014-11-07 20:14 ` [PATCH 1/4] proc: task_state: read cred->group_info outside of task_lock() Oleg Nesterov
                   ` (8 more replies)
  0 siblings, 9 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-07 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Alexey Dobriyan, Eric W. Biederman,
	Sterling Alexander, linux-kernel

Hello.

I tried to optimize the usage of tasklist_lock in exit_notify() paths
but found the bug which should be fixed first: the EXIT_DEAD setting
in reparent_leader() can break the task_ppid_nr_ns()-like code and
ptrace_parent().

I am still thinking about the "right" fix, but whatever we do we need
to cleanup these users, probably before the fix.

One particular user is task_state(), but it also asks for other off-
topic cleanups which I believe make sense anyway, lets do this first.

Oleg.

 fs/proc/array.c |   47 ++++++++++++++++++++++-------------------------
 1 files changed, 22 insertions(+), 25 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/4] proc: task_state: read cred->group_info outside of task_lock()
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
@ 2014-11-07 20:14 ` Oleg Nesterov
  2014-11-07 20:14 ` [PATCH 2/4] proc: task_state: deuglify the max_fds calculation Oleg Nesterov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-07 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Alexey Dobriyan, Eric W. Biederman,
	Sterling Alexander, linux-kernel

task_state() reads cred->group_info under task_lock() because a long
ago it was task_struct->group_info and it was actually protected by
task->alloc_lock. Today this task_unlock() after rcu_read_unlock()
just adds the confusion, move task_unlock() up.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/array.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index cd3653e..b5810c2 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -201,11 +201,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		"FDSize:\t%d\n"
 		"Groups:\t",
 		fdt ? fdt->max_fds : 0);
+	task_unlock(p);
 	rcu_read_unlock();
 
 	group_info = cred->group_info;
-	task_unlock(p);
-
 	for (g = 0; g < group_info->ngroups; g++)
 		seq_printf(m, "%d ",
 			   from_kgid_munged(user_ns, GROUP_AT(group_info, g)));
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/4] proc: task_state: deuglify the max_fds calculation
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
  2014-11-07 20:14 ` [PATCH 1/4] proc: task_state: read cred->group_info outside of task_lock() Oleg Nesterov
@ 2014-11-07 20:14 ` Oleg Nesterov
  2014-11-07 20:14 ` [PATCH 3/4] proc: task_state: move the main seq_printf() outside of rcu_read_lock() Oleg Nesterov
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-07 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Alexey Dobriyan, Eric W. Biederman,
	Sterling Alexander, linux-kernel

1. The usage of fdt looks very ugly, it can't be NULL if ->files is
   not NULL. We can use "unsigned int max_fds" instead.

2. This also allows to move seq_printf(max_fds) outside of task_lock()
   and join it with the previous seq_printf(). See also the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/array.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index b5810c2..7c8d9ae 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -157,9 +157,9 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	struct user_namespace *user_ns = seq_user_ns(m);
 	struct group_info *group_info;
 	int g;
-	struct fdtable *fdt = NULL;
 	const struct cred *cred;
 	pid_t ppid, tpid;
+	unsigned int max_fds = 0;
 
 	rcu_read_lock();
 	ppid = pid_alive(p) ?
@@ -171,6 +171,12 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 			tpid = task_pid_nr_ns(tracer, ns);
 	}
 	cred = get_task_cred(p);
+
+	task_lock(p);
+	if (p->files)
+		max_fds = files_fdtable(p->files)->max_fds;
+	task_unlock(p);
+
 	seq_printf(m,
 		"State:\t%s\n"
 		"Tgid:\t%d\n"
@@ -179,7 +185,8 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		"PPid:\t%d\n"
 		"TracerPid:\t%d\n"
 		"Uid:\t%d\t%d\t%d\t%d\n"
-		"Gid:\t%d\t%d\t%d\t%d\n",
+		"Gid:\t%d\t%d\t%d\t%d\n"
+		"FDSize:\t%d\nGroups:\t",
 		get_task_state(p),
 		task_tgid_nr_ns(p, ns),
 		task_numa_group_id(p),
@@ -192,16 +199,8 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		from_kgid_munged(user_ns, cred->gid),
 		from_kgid_munged(user_ns, cred->egid),
 		from_kgid_munged(user_ns, cred->sgid),
-		from_kgid_munged(user_ns, cred->fsgid));
-
-	task_lock(p);
-	if (p->files)
-		fdt = files_fdtable(p->files);
-	seq_printf(m,
-		"FDSize:\t%d\n"
-		"Groups:\t",
-		fdt ? fdt->max_fds : 0);
-	task_unlock(p);
+		from_kgid_munged(user_ns, cred->fsgid),
+		max_fds);
 	rcu_read_unlock();
 
 	group_info = cred->group_info;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/4] proc: task_state: move the main seq_printf() outside of rcu_read_lock()
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
  2014-11-07 20:14 ` [PATCH 1/4] proc: task_state: read cred->group_info outside of task_lock() Oleg Nesterov
  2014-11-07 20:14 ` [PATCH 2/4] proc: task_state: deuglify the max_fds calculation Oleg Nesterov
@ 2014-11-07 20:14 ` Oleg Nesterov
  2014-11-13 18:04   ` Paul E. McKenney
  2014-11-07 20:14 ` [PATCH 4/4] proc: task_state: ptrace_parent() doesn't need pid_alive() check Oleg Nesterov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-07 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Alexey Dobriyan, Eric W. Biederman,
	Sterling Alexander, linux-kernel

task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
tgid/ngid and drop rcu lock.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/array.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 7c8d9ae..800e30f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -158,7 +158,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	struct group_info *group_info;
 	int g;
 	const struct cred *cred;
-	pid_t ppid, tpid;
+	pid_t ppid, tpid, tgid, ngid;
 	unsigned int max_fds = 0;
 
 	rcu_read_lock();
@@ -170,12 +170,16 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		if (tracer)
 			tpid = task_pid_nr_ns(tracer, ns);
 	}
+
+	tgid = task_tgid_nr_ns(p, ns);
+	ngid = task_numa_group_id(p);
 	cred = get_task_cred(p);
 
 	task_lock(p);
 	if (p->files)
 		max_fds = files_fdtable(p->files)->max_fds;
 	task_unlock(p);
+	rcu_read_unlock();
 
 	seq_printf(m,
 		"State:\t%s\n"
@@ -188,10 +192,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		"Gid:\t%d\t%d\t%d\t%d\n"
 		"FDSize:\t%d\nGroups:\t",
 		get_task_state(p),
-		task_tgid_nr_ns(p, ns),
-		task_numa_group_id(p),
-		pid_nr_ns(pid, ns),
-		ppid, tpid,
+		tgid, ngid, pid_nr_ns(pid, ns), ppid, tpid,
 		from_kuid_munged(user_ns, cred->uid),
 		from_kuid_munged(user_ns, cred->euid),
 		from_kuid_munged(user_ns, cred->suid),
@@ -201,7 +202,6 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		from_kgid_munged(user_ns, cred->sgid),
 		from_kgid_munged(user_ns, cred->fsgid),
 		max_fds);
-	rcu_read_unlock();
 
 	group_info = cred->group_info;
 	for (g = 0; g < group_info->ngroups; g++)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/4] proc: task_state: ptrace_parent() doesn't need pid_alive() check
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
                   ` (2 preceding siblings ...)
  2014-11-07 20:14 ` [PATCH 3/4] proc: task_state: move the main seq_printf() outside of rcu_read_lock() Oleg Nesterov
@ 2014-11-07 20:14 ` Oleg Nesterov
  2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-07 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Alexey Dobriyan, Eric W. Biederman,
	Sterling Alexander, linux-kernel

p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check. Other callers already use it
directly without additional checks.

Note: with or without this patch ptrace_parent() can return the pointer
to the freed task, this will be explained/fixed later.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/array.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 800e30f..bd117d0 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -157,19 +157,18 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	struct user_namespace *user_ns = seq_user_ns(m);
 	struct group_info *group_info;
 	int g;
+	struct task_struct *tracer;
 	const struct cred *cred;
-	pid_t ppid, tpid, tgid, ngid;
+	pid_t ppid, tpid = 0, tgid, ngid;
 	unsigned int max_fds = 0;
 
 	rcu_read_lock();
 	ppid = pid_alive(p) ?
 		task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
-	tpid = 0;
-	if (pid_alive(p)) {
-		struct task_struct *tracer = ptrace_parent(p);
-		if (tracer)
-			tpid = task_pid_nr_ns(tracer, ns);
-	}
+
+	tracer = ptrace_parent(p);
+	if (tracer)
+		tpid = task_pid_nr_ns(tracer, ns);
 
 	tgid = task_tgid_nr_ns(p, ns);
 	ngid = task_numa_group_id(p);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
                   ` (3 preceding siblings ...)
  2014-11-07 20:14 ` [PATCH 4/4] proc: task_state: ptrace_parent() doesn't need pid_alive() check Oleg Nesterov
@ 2014-11-10 21:59 ` Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 1/5] sched_show_task: fix unsafe usage of ->real_parent Oleg Nesterov
                     ` (4 more replies)
  2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
                   ` (3 subsequent siblings)
  8 siblings, 5 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-10 21:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Peter Zijlstra, Roland McGrath,
	Sterling Alexander, linux-kernel

Hello.

On 11/07, Oleg Nesterov wrote:
>
> I tried to optimize the usage of tasklist_lock in exit_notify() paths
> but found the bug which should be fixed first: the EXIT_DEAD setting
> in reparent_leader() can break the task_ppid_nr_ns()-like code and
> ptrace_parent().
>
> I am still thinking about the "right" fix, but whatever we do we need
> to cleanup these users, probably before the fix.

Yes, we need to cleanup these users but lets fix the bug first. See the
changelog in 2/5.

Plus cleanups + initial (micro)optimizations, more to come.

Oleg.

 include/linux/ptrace.h |    2 +-
 kernel/exit.c          |   51 +++++++++++++++++++----------------------------
 kernel/ptrace.c        |   23 ++------------------
 kernel/sched/core.c    |    4 ++-
 4 files changed, 28 insertions(+), 52 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/5] sched_show_task: fix unsafe usage of ->real_parent
  2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
@ 2014-11-10 22:00   ` Oleg Nesterov
  2014-11-11 10:39     ` Peter Zijlstra
  2014-11-10 22:00   ` [PATCH 2/5] exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks Oleg Nesterov
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-10 22:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Peter Zijlstra, Roland McGrath,
	Sterling Alexander, linux-kernel

rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pis_alive() like
other users do.

Note: we need some helpers to cleanup the code like this. And it seems
that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/core.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..1ec75c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4572,8 +4572,10 @@ void sched_show_task(struct task_struct *p)
 #ifdef CONFIG_DEBUG_STACK_USAGE
 	free = stack_not_used(p);
 #endif
+	ppid = 0;
 	rcu_read_lock();
-	ppid = task_pid_nr(rcu_dereference(p->real_parent));
+	if (pid_alive(p))
+		ppid = task_pid_nr(rcu_dereference(p->real_parent));
 	rcu_read_unlock();
 	printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free,
 		task_pid_nr(p), ppid,
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/5] exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks
  2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 1/5] sched_show_task: fix unsafe usage of ->real_parent Oleg Nesterov
@ 2014-11-10 22:00   ` Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 3/5] exit: reparent: cleanup the changing of ->parent Oleg Nesterov
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-10 22:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Peter Zijlstra, Roland McGrath,
	Sterling Alexander, linux-kernel

reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD
task into dead_children list we are going to release. This obviously
removes the dead task from its real_parent->children list and this is
even good; the parent can do nothing with the EXIT_DEAD reparented
zombie, it only makes do_wait() slower.

But, this also means that it can not be reparented once again, so if
its new parent dies too nobody will update ->parent/real_parent, they
can point to the freed memory even before release_task() we are going
to call, this breaks the code which relies on pid_alive() to access
->real_parent/parent.

Fortunately this is mostly theoretical, this can only happen if init
or PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
sub-thread exits right after we drop tasklist_lock.

Change this code to use ->ptrace_entry instead, we know that the child
is not traced so nobody can ever use this member. This also allows to
unify this logic with exit_ptrace(), see the next changes.

Note: we really need to change release_task() to nullify real_parent/
parent/group_leader pointers, but we need to change the current users
first somehow. And it would be better to reap this zombie immediately
but release_task_locked() we need is complicated by proc_flush_task().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019..4a9b4c0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -548,7 +548,7 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
 	    p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) {
 		if (do_notify_parent(p, p->exit_signal)) {
 			p->exit_state = EXIT_DEAD;
-			list_move_tail(&p->sibling, dead);
+			list_add(&p->ptrace_entry, dead);
 		}
 	}
 
@@ -587,8 +587,8 @@ static void forget_original_parent(struct task_struct *father)
 
 	BUG_ON(!list_empty(&father->children));
 
-	list_for_each_entry_safe(p, n, &dead_children, sibling) {
-		list_del_init(&p->sibling);
+	list_for_each_entry_safe(p, n, &dead_children, ptrace_entry) {
+		list_del_init(&p->ptrace_entry);
 		release_task(p);
 	}
 }
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/5] exit: reparent: cleanup the changing of ->parent
  2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 1/5] sched_show_task: fix unsafe usage of ->real_parent Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 2/5] exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks Oleg Nesterov
@ 2014-11-10 22:00   ` Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 4/5] exit: reparent: cleanup the usage of reparent_leader() Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 5/5] exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() Oleg Nesterov
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-10 22:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Peter Zijlstra, Roland McGrath,
	Sterling Alexander, linux-kernel

1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
   We need to change t->parent if and only if t is not traced.

2. If we actually want this BUG_ON() to ensure that parent/ptrace
   match each other, then we should also take ptrace_reparented()
   case into account too.

3. Change this code to use for_each_thread() instead of deprecated
   while_each_thread().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   13 +++++--------
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 4a9b4c0..939e9fc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -557,7 +557,7 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
 
 static void forget_original_parent(struct task_struct *father)
 {
-	struct task_struct *p, *n, *reaper;
+	struct task_struct *p, *t, *n, *reaper;
 	LIST_HEAD(dead_children);
 
 	write_lock_irq(&tasklist_lock);
@@ -569,18 +569,15 @@ static void forget_original_parent(struct task_struct *father)
 	reaper = find_new_reaper(father);
 
 	list_for_each_entry_safe(p, n, &father->children, sibling) {
-		struct task_struct *t = p;
-
-		do {
+		for_each_thread(p, t) {
 			t->real_parent = reaper;
-			if (t->parent == father) {
-				BUG_ON(t->ptrace);
+			BUG_ON(!t->ptrace != (t->parent == father));
+			if (likely(!t->ptrace))
 				t->parent = t->real_parent;
-			}
 			if (t->pdeath_signal)
 				group_send_sig_info(t->pdeath_signal,
 						    SEND_SIG_NOINFO, t);
-		} while_each_thread(p, t);
+		}
 		reparent_leader(father, p, &dead_children);
 	}
 	write_unlock_irq(&tasklist_lock);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/5] exit: reparent: cleanup the usage of reparent_leader()
  2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
                     ` (2 preceding siblings ...)
  2014-11-10 22:00   ` [PATCH 3/5] exit: reparent: cleanup the changing of ->parent Oleg Nesterov
@ 2014-11-10 22:00   ` Oleg Nesterov
  2014-11-10 22:00   ` [PATCH 5/5] exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() Oleg Nesterov
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-10 22:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Peter Zijlstra, Roland McGrath,
	Sterling Alexander, linux-kernel

1. Now that reparent_leader() doesn't abuse ->sibling we can shift
   list_move_tail() from reparent_leader() to forget_original_parent()
   and turn it into a single list_splice_tail_init(). This also makes
   BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.

2. This also allows to shift the same_thread_group() check, it looks
   a bit more clear in the caller.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   22 +++++++++-------------
 1 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 939e9fc..e3e7379 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -529,15 +529,7 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 static void reparent_leader(struct task_struct *father, struct task_struct *p,
 				struct list_head *dead)
 {
-	list_move_tail(&p->sibling, &p->real_parent->children);
-
-	if (p->exit_state == EXIT_DEAD)
-		return;
-	/*
-	 * If this is a threaded reparent there is no need to
-	 * notify anyone anything has happened.
-	 */
-	if (same_thread_group(p->real_parent, father))
+	if (unlikely(p->exit_state == EXIT_DEAD))
 		return;
 
 	/* We don't want people slaying init. */
@@ -568,7 +560,7 @@ static void forget_original_parent(struct task_struct *father)
 	exit_ptrace(father);
 	reaper = find_new_reaper(father);
 
-	list_for_each_entry_safe(p, n, &father->children, sibling) {
+	list_for_each_entry(p, &father->children, sibling) {
 		for_each_thread(p, t) {
 			t->real_parent = reaper;
 			BUG_ON(!t->ptrace != (t->parent == father));
@@ -578,12 +570,16 @@ static void forget_original_parent(struct task_struct *father)
 				group_send_sig_info(t->pdeath_signal,
 						    SEND_SIG_NOINFO, t);
 		}
-		reparent_leader(father, p, &dead_children);
+		/*
+		 * If this is a threaded reparent there is no need to
+		 * notify anyone anything has happened.
+		 */
+		if (!same_thread_group(reaper, father))
+			reparent_leader(father, p, &dead_children);
 	}
+	list_splice_tail_init(&father->children, &reaper->children);
 	write_unlock_irq(&tasklist_lock);
 
-	BUG_ON(!list_empty(&father->children));
-
 	list_for_each_entry_safe(p, n, &dead_children, ptrace_entry) {
 		list_del_init(&p->ptrace_entry);
 		release_task(p);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 5/5] exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()
  2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
                     ` (3 preceding siblings ...)
  2014-11-10 22:00   ` [PATCH 4/5] exit: reparent: cleanup the usage of reparent_leader() Oleg Nesterov
@ 2014-11-10 22:00   ` Oleg Nesterov
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-10 22:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Peter Zijlstra, Roland McGrath,
	Sterling Alexander, linux-kernel

Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD
tasks, we can simply pass "dead_children" list to exit_ptrace() and
remove another release_task() loop. Plus this way we do not need to
drop and reacquire tasklist_lock.

Also shift the list_empty(ptraced) check, if we want this optimization
it makes sense to eliminate the function call altogether.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/ptrace.h |    2 +-
 kernel/exit.c          |   10 ++++------
 kernel/ptrace.c        |   23 +++--------------------
 3 files changed, 8 insertions(+), 27 deletions(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index cc79eff..987a73a 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -52,7 +52,7 @@ extern void ptrace_notify(int exit_code);
 extern void __ptrace_link(struct task_struct *child,
 			  struct task_struct *new_parent);
 extern void __ptrace_unlink(struct task_struct *child);
-extern void exit_ptrace(struct task_struct *tracer);
+extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
 #define PTRACE_MODE_READ	0x01
 #define PTRACE_MODE_ATTACH	0x02
 #define PTRACE_MODE_NOAUDIT	0x04
diff --git a/kernel/exit.c b/kernel/exit.c
index e3e7379..ee5399b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -553,13 +553,11 @@ static void forget_original_parent(struct task_struct *father)
 	LIST_HEAD(dead_children);
 
 	write_lock_irq(&tasklist_lock);
-	/*
-	 * Note that exit_ptrace() and find_new_reaper() might
-	 * drop tasklist_lock and reacquire it.
-	 */
-	exit_ptrace(father);
-	reaper = find_new_reaper(father);
+	if (unlikely(!list_empty(&father->ptraced)))
+		exit_ptrace(father, &dead_children);
 
+	/* Can drop and reacquire tasklist_lock */
+	reaper = find_new_reaper(father);
 	list_for_each_entry(p, &father->children, sibling) {
 		for_each_thread(p, t) {
 			t->real_parent = reaper;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 54e7522..1eb9d90 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -485,36 +485,19 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
 
 /*
  * Detach all tasks we were using ptrace on. Called with tasklist held
- * for writing, and returns with it held too. But note it can release
- * and reacquire the lock.
+ * for writing.
  */
-void exit_ptrace(struct task_struct *tracer)
-	__releases(&tasklist_lock)
-	__acquires(&tasklist_lock)
+void exit_ptrace(struct task_struct *tracer, struct list_head *dead)
 {
 	struct task_struct *p, *n;
-	LIST_HEAD(ptrace_dead);
-
-	if (likely(list_empty(&tracer->ptraced)))
-		return;
 
 	list_for_each_entry_safe(p, n, &tracer->ptraced, ptrace_entry) {
 		if (unlikely(p->ptrace & PT_EXITKILL))
 			send_sig_info(SIGKILL, SEND_SIG_FORCED, p);
 
 		if (__ptrace_detach(tracer, p))
-			list_add(&p->ptrace_entry, &ptrace_dead);
-	}
-
-	write_unlock_irq(&tasklist_lock);
-	BUG_ON(!list_empty(&tracer->ptraced));
-
-	list_for_each_entry_safe(p, n, &ptrace_dead, ptrace_entry) {
-		list_del_init(&p->ptrace_entry);
-		release_task(p);
+			list_add(&p->ptrace_entry, dead);
 	}
-
-	write_lock_irq(&tasklist_lock);
 }
 
 int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] sched_show_task: fix unsafe usage of ->real_parent
  2014-11-10 22:00   ` [PATCH 1/5] sched_show_task: fix unsafe usage of ->real_parent Oleg Nesterov
@ 2014-11-11 10:39     ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2014-11-11 10:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Eric W. Biederman, Roland McGrath,
	Sterling Alexander, linux-kernel

On Mon, Nov 10, 2014 at 11:00:26PM +0100, Oleg Nesterov wrote:
> rcu_read_lock() can not protect p->real_parent if release_task(p) was
> already called, change sched_show_task() to check pis_alive() like
> other users do.
> 
> Note: we need some helpers to cleanup the code like this. And it seems
> that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  kernel/sched/core.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 240157c..1ec75c9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4572,8 +4572,10 @@ void sched_show_task(struct task_struct *p)
>  #ifdef CONFIG_DEBUG_STACK_USAGE
>  	free = stack_not_used(p);
>  #endif
> +	ppid = 0;
>  	rcu_read_lock();
> -	ppid = task_pid_nr(rcu_dereference(p->real_parent));
> +	if (pid_alive(p))
> +		ppid = task_pid_nr(rcu_dereference(p->real_parent));
>  	rcu_read_unlock();
>  	printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free,
>  		task_pid_nr(p), ppid,
> -- 
> 1.5.5.1
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/4] proc: task_state: move the main seq_printf() outside of rcu_read_lock()
  2014-11-07 20:14 ` [PATCH 3/4] proc: task_state: move the main seq_printf() outside of rcu_read_lock() Oleg Nesterov
@ 2014-11-13 18:04   ` Paul E. McKenney
  0 siblings, 0 replies; 55+ messages in thread
From: Paul E. McKenney @ 2014-11-13 18:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Alexey Dobriyan, Eric W. Biederman,
	Sterling Alexander, linux-kernel

On Fri, Nov 07, 2014 at 09:14:51PM +0100, Oleg Nesterov wrote:
> task_state() does seq_printf() under rcu_read_lock(), but this is only
> needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
> tgid/ngid and drop rcu lock.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  fs/proc/array.c |   12 ++++++------
>  1 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 7c8d9ae..800e30f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -158,7 +158,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  	struct group_info *group_info;
>  	int g;
>  	const struct cred *cred;
> -	pid_t ppid, tpid;
> +	pid_t ppid, tpid, tgid, ngid;
>  	unsigned int max_fds = 0;
> 
>  	rcu_read_lock();
> @@ -170,12 +170,16 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  		if (tracer)
>  			tpid = task_pid_nr_ns(tracer, ns);
>  	}
> +
> +	tgid = task_tgid_nr_ns(p, ns);
> +	ngid = task_numa_group_id(p);
>  	cred = get_task_cred(p);
> 
>  	task_lock(p);
>  	if (p->files)
>  		max_fds = files_fdtable(p->files)->max_fds;
>  	task_unlock(p);
> +	rcu_read_unlock();
> 
>  	seq_printf(m,
>  		"State:\t%s\n"
> @@ -188,10 +192,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  		"Gid:\t%d\t%d\t%d\t%d\n"
>  		"FDSize:\t%d\nGroups:\t",
>  		get_task_state(p),
> -		task_tgid_nr_ns(p, ns),
> -		task_numa_group_id(p),
> -		pid_nr_ns(pid, ns),
> -		ppid, tpid,
> +		tgid, ngid, pid_nr_ns(pid, ns), ppid, tpid,
>  		from_kuid_munged(user_ns, cred->uid),
>  		from_kuid_munged(user_ns, cred->euid),
>  		from_kuid_munged(user_ns, cred->suid),
> @@ -201,7 +202,6 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  		from_kgid_munged(user_ns, cred->sgid),
>  		from_kgid_munged(user_ns, cred->fsgid),
>  		max_fds);
> -	rcu_read_unlock();
> 
>  	group_info = cred->group_info;
>  	for (g = 0; g < group_info->ngroups; g++)
> -- 
> 1.5.5.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 0/5] exit: more cleanups/optimizations
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
                   ` (4 preceding siblings ...)
  2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
@ 2014-11-14  1:37 ` Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 1/5] exit: wait: cleanup the ptrace_reparented() checks Oleg Nesterov
                     ` (4 more replies)
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
                   ` (2 subsequent siblings)
  8 siblings, 5 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-14  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Rik van Riel,
	Sterling Alexander, linux-kernel

On 11/10, Oleg Nesterov wrote:
>
> Plus cleanups + initial (micro)optimizations, more to come.

Yes, exit_notify() needs more changes but I was distracted, will
do later.

Until then some more "exit:" changes which also makes sense imo.

Oleg.


 fs/proc/base.c |    3 ++
 kernel/exit.c  |   58 ++++++++++++++++++++++++-------------------------------
 2 files changed, 28 insertions(+), 33 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/5] exit: wait: cleanup the ptrace_reparented() checks
  2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
@ 2014-11-14  1:38   ` Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 2/5] exit: wait: don't use zombie->real_parent Oleg Nesterov
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-14  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Rik van Riel,
	Sterling Alexander, linux-kernel

Now that EXIT_DEAD is the terminal state we can kill "int traced"
variable and check "state == EXIT_DEAD" instead to cleanup the code.
In particular, this way it is clear that the check obviously doesn't
need tasklist_lock.

Also fix the type of "unsigned long state", "long" was always wrong
although this doesn't matter because cmpxchg/xchg uses typeof(*ptr).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   14 ++++++--------
 1 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index ee5399b..0511f1d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -973,8 +973,7 @@ static int wait_noreap_copyout(struct wait_opts *wo, struct task_struct *p,
  */
 static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 {
-	unsigned long state;
-	int retval, status, traced;
+	int state, retval, status;
 	pid_t pid = task_pid_vnr(p);
 	uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p));
 	struct siginfo __user *infop;
@@ -997,19 +996,18 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		}
 		return wait_noreap_copyout(wo, p, pid, uid, why, status);
 	}
-
-	traced = ptrace_reparented(p);
 	/*
 	 * Move the task's state to DEAD/TRACE, only one thread can do this.
 	 */
-	state = traced && thread_group_leader(p) ? EXIT_TRACE : EXIT_DEAD;
+	state = ptrace_reparented(p) && thread_group_leader(p) ?
+		EXIT_TRACE : EXIT_DEAD;
 	if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
 		return 0;
+
 	/*
-	 * It can be ptraced but not reparented, check
-	 * thread_group_leader() to filter out sub-threads.
+	 * Check thread_group_leader() to exclude the traced sub-threads.
 	 */
-	if (likely(!traced) && thread_group_leader(p)) {
+	if (state == EXIT_DEAD && thread_group_leader(p)) {
 		struct signal_struct *psig;
 		struct signal_struct *sig;
 		unsigned long maxrss;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/5] exit: wait: don't use zombie->real_parent
  2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 1/5] exit: wait: cleanup the ptrace_reparented() checks Oleg Nesterov
@ 2014-11-14  1:38   ` Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 3/5] exit: wait: drop tasklist_lock before psig->c* accounting Oleg Nesterov
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-14  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Rik van Riel,
	Sterling Alexander, linux-kernel

1. wait_task_zombie() uses p->real_parent to get psig/siglock. This is
   correct but needs tasklist_lock, ->real_parent can exit.

   We can use "current" instead. This is our natural child, its parent
   must be our sub-thread.

2. Read psig/sig outside of ->siglock, ->signal is no longer protected
   by this lock.

3. Fix the outdated comments about tasklist_lock. We can not race with
   __exit_signal(), the whole thread group is dead, nobody but us can
   call it.

   Also clarify the usage of ->stats_lock and ->siglock.

Note: thread_group_cputime_adjusted() is sub-optimal in this case, we
probably want to export cputime_adjust() to avoid thread_group_cputime().
The comment says "all threads" but there are no other threads.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 0511f1d..40d53de 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1008,8 +1008,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 	 * Check thread_group_leader() to exclude the traced sub-threads.
 	 */
 	if (state == EXIT_DEAD && thread_group_leader(p)) {
-		struct signal_struct *psig;
-		struct signal_struct *sig;
+		struct signal_struct *sig = p->signal;
+		struct signal_struct *psig = current->signal;
 		unsigned long maxrss;
 		cputime_t tgutime, tgstime;
 
@@ -1021,21 +1021,20 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		 * accumulate in the parent's signal_struct c* fields.
 		 *
 		 * We don't bother to take a lock here to protect these
-		 * p->signal fields, because they are only touched by
-		 * __exit_signal, which runs with tasklist_lock
-		 * write-locked anyway, and so is excluded here.  We do
-		 * need to protect the access to parent->signal fields,
-		 * as other threads in the parent group can be right
-		 * here reaping other children at the same time.
+		 * p->signal fields because the whole thread group is dead
+		 * and nobody can change them.
+		 *
+		 * psig->stats_lock also protects us from our sub-theads
+		 * which can reap other children at the same time. Until
+		 * we change k_getrusage()-like users to rely on this lock
+		 * we have to take ->siglock as well.
 		 *
 		 * We use thread_group_cputime_adjusted() to get times for
 		 * the thread group, which consolidates times for all threads
 		 * in the group including the group leader.
 		 */
 		thread_group_cputime_adjusted(p, &tgutime, &tgstime);
-		spin_lock_irq(&p->real_parent->sighand->siglock);
-		psig = p->real_parent->signal;
-		sig = p->signal;
+		spin_lock_irq(&current->sighand->siglock);
 		write_seqlock(&psig->stats_lock);
 		psig->cutime += tgutime + sig->cutime;
 		psig->cstime += tgstime + sig->cstime;
@@ -1060,7 +1059,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
 		write_sequnlock(&psig->stats_lock);
-		spin_unlock_irq(&p->real_parent->sighand->siglock);
+		spin_unlock_irq(&current->sighand->siglock);
 	}
 
 	/*
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/5] exit: wait: drop tasklist_lock before psig->c* accounting
  2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 1/5] exit: wait: cleanup the ptrace_reparented() checks Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 2/5] exit: wait: don't use zombie->real_parent Oleg Nesterov
@ 2014-11-14  1:38   ` Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 4/5] exit: release_task: fix the comment about group leader accounting Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 5/5] exit: proc: don't try to flush /proc/tgid/task/tgid Oleg Nesterov
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-14  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Rik van Riel,
	Sterling Alexander, linux-kernel

wait_task_zombie() no longer needs tasklist_lock to accumulate the
psig->c* counters, we can drop it right after cmpxchg(exit_state).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 40d53de..e0914eb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1003,6 +1003,10 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		EXIT_TRACE : EXIT_DEAD;
 	if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
 		return 0;
+	/*
+	 * We own this thread, nobody else can reap it.
+	 */
+	read_unlock(&tasklist_lock);
 
 	/*
 	 * Check thread_group_leader() to exclude the traced sub-threads.
@@ -1062,12 +1066,6 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		spin_unlock_irq(&current->sighand->siglock);
 	}
 
-	/*
-	 * Now we are sure this task is interesting, and no other
-	 * thread can reap it because we its state == DEAD/TRACE.
-	 */
-	read_unlock(&tasklist_lock);
-
 	retval = wo->wo_rusage
 		? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0;
 	status = (p->signal->flags & SIGNAL_GROUP_EXIT)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/5] exit: release_task: fix the comment about group leader accounting
  2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
                     ` (2 preceding siblings ...)
  2014-11-14  1:38   ` [PATCH 3/5] exit: wait: drop tasklist_lock before psig->c* accounting Oleg Nesterov
@ 2014-11-14  1:38   ` Oleg Nesterov
  2014-11-14  1:38   ` [PATCH 5/5] exit: proc: don't try to flush /proc/tgid/task/tgid Oleg Nesterov
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-14  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Rik van Riel,
	Sterling Alexander, linux-kernel

Contrary to what the comment in __exit_signal() says we do account the
group leader. Fix this and explain why.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   11 ++++-------
 1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index e0914eb..f773863 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -118,13 +118,10 @@ static void __exit_signal(struct task_struct *tsk)
 	}
 
 	/*
-	 * Accumulate here the counters for all threads but the group leader
-	 * as they die, so they can be added into the process-wide totals
-	 * when those are taken.  The group leader stays around as a zombie as
-	 * long as there are other threads.  When it gets reaped, the exit.c
-	 * code will add its counts into these totals.  We won't ever get here
-	 * for the group leader, since it will have been the last reference on
-	 * the signal_struct.
+	 * Accumulate here the counters for all threads as they die. We could
+	 * skip the group leader because it is the last user of signal_struct,
+	 * but we want to avoid the race with thread_group_cputime() which can
+	 * see the empty ->thread_head list.
 	 */
 	task_cputime(tsk, &utime, &stime);
 	write_seqlock(&sig->stats_lock);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 5/5] exit: proc: don't try to flush /proc/tgid/task/tgid
  2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
                     ` (3 preceding siblings ...)
  2014-11-14  1:38   ` [PATCH 4/5] exit: release_task: fix the comment about group leader accounting Oleg Nesterov
@ 2014-11-14  1:38   ` Oleg Nesterov
  4 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-14  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Rik van Riel,
	Sterling Alexander, linux-kernel

proc_flush_task_mnt() always tries to flush task/pid, but this is
pointless if we reap the leader. d_invalidate() is recursive, and
if nothing else the next d_hash_and_lookup(tgid) should fail anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 772efa4..e7b04a3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2618,6 +2618,9 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 		dput(dentry);
 	}
 
+	if (pid == tgid)
+		return;
+
 	name.name = buf;
 	name.len = snprintf(buf, sizeof(buf), "%d", tgid);
 	leader = d_hash_and_lookup(mnt->mnt_root, &name);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 0/6] exit: find_new_reaper() fixes/cleanups
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
                   ` (5 preceding siblings ...)
  2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
@ 2014-11-18 21:29 ` Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 1/6] exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting Oleg Nesterov
                     ` (5 more replies)
  2014-11-20 18:34 ` [PATCH 0/3] exit: avoid O(n ** 2) thread-list scan on group-exit if possible Oleg Nesterov
  2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
  8 siblings, 6 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-18 21:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Kay Sievers, Lennart Poettering,
	Sterling Alexander, linux-kernel

On 11/14, Oleg Nesterov wrote:
>
> Yes, exit_notify() needs more changes but I was distracted, will
> do later.

but I found other bugs in this area. Fix them first. This is also
preparation for the next changes in exit_notify() paths.

Oleg.

 kernel/exit.c |   98 +++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 58 insertions(+), 40 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/6] exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
@ 2014-11-18 21:30   ` Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 2/6] exit: reparent: fix the cross-namespace " Oleg Nesterov
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-18 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Kay Sievers, Lennart Poettering,
	Sterling Alexander, linux-kernel

The ->has_child_subreaper code in find_new_reaper() finds alive
"thread" but returns another "reaper" thread which can be dead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f773863..31da440 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -512,7 +512,7 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 			thread = reaper;
 			do {
 				if (!(thread->flags & PF_EXITING))
-					return reaper;
+					return thread;
 			} while_each_thread(reaper, thread);
 		}
 	}
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/6] exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 1/6] exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting Oleg Nesterov
@ 2014-11-18 21:30   ` Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 3/6] exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() Oleg Nesterov
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-18 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Kay Sievers, Lennart Poettering,
	Sterling Alexander, linux-kernel

find_new_reaper() assumes that "has_child_subreaper" logic is safe as
long as we are not the exiting ->child_reaper and this is doubly wrong:

1. In fact it is safe if "pid_ns->child_reaper == father"; there must
   be no children after zap_pid_ns_processes() returns, so it doesn't
   matter what we return in this case and even pid_ns->child_reaper is
   wrong otherwise: we can't reparent to ->child_reaper == current.

   This is not a bug, but this is confusing.

2. It is not safe if we are not pid_ns->child_reaper but from the same
   thread group. We drop tasklist_lock before zap_pid_ns_processes(),
   so another thread can lock it and choose the new reaper from the
   upper namespace if has_child_subreaper == T, and this is obviously
   wrong.

   This is not that bad, zap_pid_ns_processes() won't return until the
   the new reaper reaps all zombies, but this should be fixed anyway.

We could change for_each_thread() loop to use ->exit_state instead of
PF_EXITING which we had to use until 8aac62706ada, or we could change
copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
but lets change this code so that it is clear we can't look outside of
our namespace, otherwise same_thread_group(reaper, child_reaper) check
will look wrong and confusing anyway.

We can simply start from "father" and fix the problem. We can't wrongly
return a thread from the same thread group if ->is_child_subreaper == T,
we know that all threads have PF_EXITING set.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 31da440..9ade2f5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -492,7 +492,9 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 
 		zap_pid_ns_processes(pid_ns);
 		write_lock_irq(&tasklist_lock);
-	} else if (father->signal->has_child_subreaper) {
+	}
+
+	if (father->signal->has_child_subreaper) {
 		struct task_struct *reaper;
 
 		/*
@@ -502,7 +504,7 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 		 * PID namespace. However we still need the check above, see
 		 * http://marc.info/?l=linux-kernel&m=131385460420380
 		 */
-		for (reaper = father->real_parent;
+		for (reaper = father;
 		     reaper != &init_task;
 		     reaper = reaper->real_parent) {
 			if (same_thread_group(reaper, pid_ns->child_reaper))
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/6] exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 1/6] exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 2/6] exit: reparent: fix the cross-namespace " Oleg Nesterov
@ 2014-11-18 21:30   ` Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 4/6] exit: reparent: document the ->has_child_subreaper checks Oleg Nesterov
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-18 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Kay Sievers, Lennart Poettering,
	Sterling Alexander, linux-kernel

Change find_new_reaper() to use for_each_thread() instead of deprecated
while_each_thread(). We do not bother to check "thread != father" in the
1st loop, we can rely on PF_EXITING check.

Note: this means the minor behavioural change: for_each_thread() starts
from the group leader. But this should be fine, nobody should make any
assumption about do_wait(__WNOTHREAD) when it comes to reparented tasks.
And this can avoid the pointless reparenting to a short-living thread
While zombie leaders are not that common.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 9ade2f5..ac7ba9b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -473,8 +473,7 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 	struct pid_namespace *pid_ns = task_active_pid_ns(father);
 	struct task_struct *thread;
 
-	thread = father;
-	while_each_thread(father, thread) {
+	for_each_thread(father, thread) {
 		if (thread->flags & PF_EXITING)
 			continue;
 		if (unlikely(pid_ns->child_reaper == father))
@@ -511,11 +510,10 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 				break;
 			if (!reaper->signal->is_child_subreaper)
 				continue;
-			thread = reaper;
-			do {
+			for_each_thread(reaper, thread) {
 				if (!(thread->flags & PF_EXITING))
 					return thread;
-			} while_each_thread(reaper, thread);
+			}
 		}
 	}
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/6] exit: reparent: document the ->has_child_subreaper checks
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
                     ` (2 preceding siblings ...)
  2014-11-18 21:30   ` [PATCH 3/6] exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() Oleg Nesterov
@ 2014-11-18 21:30   ` Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 5/6] exit: reparent: introduce find_child_reaper() Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 6/6] exit: reparent: introduce find_alive_thread() Oleg Nesterov
  5 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-18 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Kay Sievers, Lennart Poettering,
	Sterling Alexander, linux-kernel

Swap the "init_task" and same_thread_group() checks. This way it
is more simple to document these checks and we can remove the link
to the previous discussion on lkml.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   14 ++++++--------
 1 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index ac7ba9b..266e4f4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -495,18 +495,16 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 
 	if (father->signal->has_child_subreaper) {
 		struct task_struct *reaper;
-
 		/*
-		 * Find the first ancestor marked as child_subreaper.
-		 * Note that the code below checks same_thread_group(reaper,
-		 * pid_ns->child_reaper).  This is what we need to DTRT in a
-		 * PID namespace. However we still need the check above, see
-		 * http://marc.info/?l=linux-kernel&m=131385460420380
+		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
+		 * We start from father to ensure we can not look into another
+		 * namespace, this is safe because all its threads are dead.
 		 */
 		for (reaper = father;
-		     reaper != &init_task;
+		     !same_thread_group(reaper, pid_ns->child_reaper);
 		     reaper = reaper->real_parent) {
-			if (same_thread_group(reaper, pid_ns->child_reaper))
+			/* call_usermodehelper() descendants need this check */
+			if (reaper == &init_task)
 				break;
 			if (!reaper->signal->is_child_subreaper)
 				continue;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 5/6] exit: reparent: introduce find_child_reaper()
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
                     ` (3 preceding siblings ...)
  2014-11-18 21:30   ` [PATCH 4/6] exit: reparent: document the ->has_child_subreaper checks Oleg Nesterov
@ 2014-11-18 21:30   ` Oleg Nesterov
  2014-11-18 21:30   ` [PATCH 6/6] exit: reparent: introduce find_alive_thread() Oleg Nesterov
  5 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-18 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Kay Sievers, Lennart Poettering,
	Sterling Alexander, linux-kernel

find_new_reaper() does 2 completely different things. Not only it
finds a reaper, it also updates pid_ns->child_reaper or kills the
whole namespace if the caller is ->child_reaper.

Now that has_child_subreaper logic doesn't depend on child_reaper
check we can move that pid_ns code into a separate helper. IMHO
this makes the code more clean, and this allows the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   56 +++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 266e4f4..00c6139 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -459,6 +459,34 @@ static void exit_mm(struct task_struct *tsk)
 	clear_thread_flag(TIF_MEMDIE);
 }
 
+static struct task_struct *find_child_reaper(struct task_struct *father)
+	__releases(&tasklist_lock)
+	__acquires(&tasklist_lock)
+{
+	struct pid_namespace *pid_ns = task_active_pid_ns(father);
+	struct task_struct *reaper = pid_ns->child_reaper;
+
+	if (likely(reaper != father))
+		return reaper;
+
+	for_each_thread(father, reaper) {
+		if (reaper->flags & PF_EXITING)
+			continue;
+		pid_ns->child_reaper = reaper;
+		return reaper;
+	}
+
+	write_unlock_irq(&tasklist_lock);
+	if (unlikely(pid_ns == &init_pid_ns)) {
+		panic("Attempted to kill init! exitcode=0x%08x\n",
+			father->signal->group_exit_code ?: father->exit_code);
+	}
+	zap_pid_ns_processes(pid_ns);
+	write_lock_irq(&tasklist_lock);
+
+	return father;
+}
+
 /*
  * When we die, we re-parent all our children, and try to:
  * 1. give them to another thread in our thread group, if such a member exists
@@ -466,33 +494,17 @@ static void exit_mm(struct task_struct *tsk)
  *    child_subreaper for its children (like a service manager)
  * 3. give it to the init process (PID 1) in our pid namespace
  */
-static struct task_struct *find_new_reaper(struct task_struct *father)
-	__releases(&tasklist_lock)
-	__acquires(&tasklist_lock)
+static struct task_struct *find_new_reaper(struct task_struct *father,
+					   struct task_struct *child_reaper)
 {
-	struct pid_namespace *pid_ns = task_active_pid_ns(father);
 	struct task_struct *thread;
 
 	for_each_thread(father, thread) {
 		if (thread->flags & PF_EXITING)
 			continue;
-		if (unlikely(pid_ns->child_reaper == father))
-			pid_ns->child_reaper = thread;
 		return thread;
 	}
 
-	if (unlikely(pid_ns->child_reaper == father)) {
-		write_unlock_irq(&tasklist_lock);
-		if (unlikely(pid_ns == &init_pid_ns)) {
-			panic("Attempted to kill init! exitcode=0x%08x\n",
-				father->signal->group_exit_code ?:
-					father->exit_code);
-		}
-
-		zap_pid_ns_processes(pid_ns);
-		write_lock_irq(&tasklist_lock);
-	}
-
 	if (father->signal->has_child_subreaper) {
 		struct task_struct *reaper;
 		/*
@@ -501,7 +513,7 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 		 * namespace, this is safe because all its threads are dead.
 		 */
 		for (reaper = father;
-		     !same_thread_group(reaper, pid_ns->child_reaper);
+		     !same_thread_group(reaper, child_reaper);
 		     reaper = reaper->real_parent) {
 			/* call_usermodehelper() descendants need this check */
 			if (reaper == &init_task)
@@ -515,7 +527,7 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 		}
 	}
 
-	return pid_ns->child_reaper;
+	return child_reaper;
 }
 
 /*
@@ -552,7 +564,9 @@ static void forget_original_parent(struct task_struct *father)
 		exit_ptrace(father, &dead_children);
 
 	/* Can drop and reacquire tasklist_lock */
-	reaper = find_new_reaper(father);
+	reaper = find_child_reaper(father);
+
+	reaper = find_new_reaper(father, reaper);
 	list_for_each_entry(p, &father->children, sibling) {
 		for_each_thread(p, t) {
 			t->real_parent = reaper;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 6/6] exit: reparent: introduce find_alive_thread()
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
                     ` (4 preceding siblings ...)
  2014-11-18 21:30   ` [PATCH 5/6] exit: reparent: introduce find_child_reaper() Oleg Nesterov
@ 2014-11-18 21:30   ` Oleg Nesterov
  5 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-18 21:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Kay Sievers, Lennart Poettering,
	Sterling Alexander, linux-kernel

Add the new simple helper to factor out the for_each_thread() code
in find_child_reaper() and find_new_reaper(). It can also simplify
the potential PF_EXITING -> exit_state change, plus perhaps we can
change this code to take SIGNAL_GROUP_EXIT into account.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   32 +++++++++++++++++++-------------
 1 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 00c6139..4e3475d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -459,6 +459,17 @@ static void exit_mm(struct task_struct *tsk)
 	clear_thread_flag(TIF_MEMDIE);
 }
 
+static struct task_struct *find_alive_thread(struct task_struct *p)
+{
+	struct task_struct *t;
+
+	for_each_thread(p, t) {
+		if (!(t->flags & PF_EXITING))
+			return t;
+	}
+	return NULL;
+}
+
 static struct task_struct *find_child_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
 	__acquires(&tasklist_lock)
@@ -469,9 +480,8 @@ static struct task_struct *find_child_reaper(struct task_struct *father)
 	if (likely(reaper != father))
 		return reaper;
 
-	for_each_thread(father, reaper) {
-		if (reaper->flags & PF_EXITING)
-			continue;
+	reaper = find_alive_thread(father);
+	if (reaper) {
 		pid_ns->child_reaper = reaper;
 		return reaper;
 	}
@@ -497,16 +507,13 @@ static struct task_struct *find_child_reaper(struct task_struct *father)
 static struct task_struct *find_new_reaper(struct task_struct *father,
 					   struct task_struct *child_reaper)
 {
-	struct task_struct *thread;
+	struct task_struct *thread, *reaper;
 
-	for_each_thread(father, thread) {
-		if (thread->flags & PF_EXITING)
-			continue;
+	thread = find_alive_thread(father);
+	if (thread)
 		return thread;
-	}
 
 	if (father->signal->has_child_subreaper) {
-		struct task_struct *reaper;
 		/*
 		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
 		 * We start from father to ensure we can not look into another
@@ -520,10 +527,9 @@ static struct task_struct *find_new_reaper(struct task_struct *father,
 				break;
 			if (!reaper->signal->is_child_subreaper)
 				continue;
-			for_each_thread(reaper, thread) {
-				if (!(thread->flags & PF_EXITING))
-					return thread;
-			}
+			thread = find_alive_thread(reaper);
+			if (thread)
+				return thread;
 		}
 	}
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 0/3] exit: avoid O(n ** 2) thread-list scan on group-exit if possible
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
                   ` (6 preceding siblings ...)
  2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
@ 2014-11-20 18:34 ` Oleg Nesterov
  2014-11-20 18:34   ` [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children Oleg Nesterov
                     ` (2 more replies)
  2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
  8 siblings, 3 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-20 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Alexey Dobriyan, Eric W. Biederman,
	Sterling Alexander, linux-kernel

On 11/18, Oleg Nesterov wrote:
>
> but I found other bugs in this area. Fix them first. This is also
> preparation for the next changes in exit_notify() paths.

Finally the optimization I had in mind from the very beginning.

Probably this is not the last series... in particular it seems that we
have some problems with sys_setns() in this area, but I need to recheck.
Plus more cleanups/microoptimizations.

Oleg.

 kernel/exit.c |   50 +++++++++++++++++++++++++-------------------------
 1 files changed, 25 insertions(+), 25 deletions(-)



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children
  2014-11-20 18:34 ` [PATCH 0/3] exit: avoid O(n ** 2) thread-list scan on group-exit if possible Oleg Nesterov
@ 2014-11-20 18:34   ` Oleg Nesterov
  2014-11-20 22:37     ` Andrew Morton
  2014-11-20 18:34   ` [PATCH -mm 2/3] exit: reparent: call forget_original_parent() under tasklist_lock Oleg Nesterov
  2014-11-20 18:34   ` [PATCH -mm 3/3] exit: exit_notify: re-use "dead" list to autoreap current Oleg Nesterov
  2 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-20 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Sterling Alexander, linux-kernel

Now that pid_ns logic was isolated we can change forget_original_parent()
to return right after find_child_reaper() when father->children is empty,
there is nothing to reparent in this case.

In particular this avoids find_alive_thread() and this can help if the
whole process exits and it has a lot of PF_EXITING threads at the start
of the thread list, this can easily lead to O(nr_threads ** 2) iterations.

Trivial test case (tested under KVM, 2 CPUs):

	static void *tfunc(void *arg)
	{
		pause();
		return NULL;
	}

	static int child(unsigned int nt)
	{
		pthread_t pt;

		while (nt--)
			assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);

		pthread_kill(pt, SIGTRAP);
		pause();
		return 0;
	}

	int main(int argc, const char *argv[])
	{
		int stat;
		unsigned int nf = atoi(argv[1]);
		unsigned int nt = atoi(argv[2]);

		while (nf--) {
			if (!fork())
				return child(nt);

			wait(&stat);
			assert(stat == SIGTRAP);
		}

		return 0;
	}

$ time ./test 16 16536 shows:

		real		user		sys
	-	5m37.628s	0m4.437s	8m5.560s
	+	0m50.032s	0m7.130s	1m4.927s

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 4e3475d..81b62f8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -571,6 +571,8 @@ static void forget_original_parent(struct task_struct *father)
 
 	/* Can drop and reacquire tasklist_lock */
 	reaper = find_child_reaper(father);
+	if (list_empty(&father->children))
+		goto unlock;
 
 	reaper = find_new_reaper(father, reaper);
 	list_for_each_entry(p, &father->children, sibling) {
@@ -591,6 +593,7 @@ static void forget_original_parent(struct task_struct *father)
 			reparent_leader(father, p, &dead_children);
 	}
 	list_splice_tail_init(&father->children, &reaper->children);
+ unlock:
 	write_unlock_irq(&tasklist_lock);
 
 	list_for_each_entry_safe(p, n, &dead_children, ptrace_entry) {
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH -mm 2/3] exit: reparent: call forget_original_parent() under tasklist_lock
  2014-11-20 18:34 ` [PATCH 0/3] exit: avoid O(n ** 2) thread-list scan on group-exit if possible Oleg Nesterov
  2014-11-20 18:34   ` [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children Oleg Nesterov
@ 2014-11-20 18:34   ` Oleg Nesterov
  2014-11-20 18:34   ` [PATCH -mm 3/3] exit: exit_notify: re-use "dead" list to autoreap current Oleg Nesterov
  2 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-20 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Sterling Alexander, linux-kernel

Shift "release dead children" loop from forget_original_parent() to
its caller, exit_notify(). It is safe to reap them even if our parent
reaps us right after we drop tasklist_lock, those children no longer
have any connection to the exiting task.

And this allows us to avoid write_lock_irq(tasklist_lock) right after
it was released by forget_original_parent(), we can simply call it
with tasklist_lock held.

While at it, move the comment about forget_original_parent() up to
this function.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   47 +++++++++++++++++++++++------------------------
 1 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 81b62f8..ccf27cb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -560,19 +560,26 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
 	kill_orphaned_pgrp(p, father);
 }
 
-static void forget_original_parent(struct task_struct *father)
+/*
+ * This does two things:
+ *
+ * A.  Make init inherit all the child processes
+ * B.  Check to see if any process groups have become orphaned
+ *	as a result of our exiting, and if they have any stopped
+ *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
+ */
+static void forget_original_parent(struct task_struct *father,
+					struct list_head *dead)
 {
-	struct task_struct *p, *t, *n, *reaper;
-	LIST_HEAD(dead_children);
+	struct task_struct *p, *t, *reaper;
 
-	write_lock_irq(&tasklist_lock);
 	if (unlikely(!list_empty(&father->ptraced)))
-		exit_ptrace(father, &dead_children);
+		exit_ptrace(father, dead);
 
 	/* Can drop and reacquire tasklist_lock */
 	reaper = find_child_reaper(father);
 	if (list_empty(&father->children))
-		goto unlock;
+		return;
 
 	reaper = find_new_reaper(father, reaper);
 	list_for_each_entry(p, &father->children, sibling) {
@@ -590,16 +597,9 @@ static void forget_original_parent(struct task_struct *father)
 		 * notify anyone anything has happened.
 		 */
 		if (!same_thread_group(reaper, father))
-			reparent_leader(father, p, &dead_children);
+			reparent_leader(father, p, dead);
 	}
 	list_splice_tail_init(&father->children, &reaper->children);
- unlock:
-	write_unlock_irq(&tasklist_lock);
-
-	list_for_each_entry_safe(p, n, &dead_children, ptrace_entry) {
-		list_del_init(&p->ptrace_entry);
-		release_task(p);
-	}
 }
 
 /*
@@ -609,18 +609,12 @@ static void forget_original_parent(struct task_struct *father)
 static void exit_notify(struct task_struct *tsk, int group_dead)
 {
 	bool autoreap;
-
-	/*
-	 * This does two things:
-	 *
-	 * A.  Make init inherit all the child processes
-	 * B.  Check to see if any process groups have become orphaned
-	 *	as a result of our exiting, and if they have any stopped
-	 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
-	 */
-	forget_original_parent(tsk);
+	struct task_struct *p, *n;
+	LIST_HEAD(dead);
 
 	write_lock_irq(&tasklist_lock);
+	forget_original_parent(tsk, &dead);
+
 	if (group_dead)
 		kill_orphaned_pgrp(tsk->group_leader, NULL);
 
@@ -644,6 +638,11 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		wake_up_process(tsk->signal->group_exit_task);
 	write_unlock_irq(&tasklist_lock);
 
+	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
+		list_del_init(&p->ptrace_entry);
+		release_task(p);
+	}
+
 	/* If the process is dead, release it - nobody will wait for it */
 	if (autoreap)
 		release_task(tsk);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH -mm 3/3] exit: exit_notify: re-use "dead" list to autoreap current
  2014-11-20 18:34 ` [PATCH 0/3] exit: avoid O(n ** 2) thread-list scan on group-exit if possible Oleg Nesterov
  2014-11-20 18:34   ` [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children Oleg Nesterov
  2014-11-20 18:34   ` [PATCH -mm 2/3] exit: reparent: call forget_original_parent() under tasklist_lock Oleg Nesterov
@ 2014-11-20 18:34   ` Oleg Nesterov
  2 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-20 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Sterling Alexander, linux-kernel

After the previous change we can add just the exiting EXIT_DEAD task
to the "dead" list and remove another release_task(tsk).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index ccf27cb..43394f7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -632,6 +632,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	}
 
 	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
+	if (tsk->exit_state == EXIT_DEAD)
+		list_add(&tsk->ptrace_entry, &dead);
 
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
@@ -642,10 +644,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		list_del_init(&p->ptrace_entry);
 		release_task(p);
 	}
-
-	/* If the process is dead, release it - nobody will wait for it */
-	if (autoreap)
-		release_task(tsk);
 }
 
 #ifdef CONFIG_DEBUG_STACK_USAGE
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children
  2014-11-20 18:34   ` [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children Oleg Nesterov
@ 2014-11-20 22:37     ` Andrew Morton
  2014-11-21 20:01       ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2014-11-20 22:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Aaron Tomlin, Eric W. Biederman, Sterling Alexander, linux-kernel

On Thu, 20 Nov 2014 19:34:23 +0100 Oleg Nesterov <oleg@redhat.com> wrote:

> $ time ./test 16 16536 shows:
> 
> 		real		user		sys
> 	-	5m37.628s	0m4.437s	8m5.560s
> 	+	0m50.032s	0m7.130s	1m4.927s

Is that the best you can do?

(I assume the increase in user time was a glitch?)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children
  2014-11-20 22:37     ` Andrew Morton
@ 2014-11-21 20:01       ` Oleg Nesterov
  0 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-21 20:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Sterling Alexander, linux-kernel

On 11/20, Andrew Morton wrote:
>
> On Thu, 20 Nov 2014 19:34:23 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > $ time ./test 16 16536 shows:
> >
> > 		real		user		sys
> > 	-	5m37.628s	0m4.437s	8m5.560s
> > 	+	0m50.032s	0m7.130s	1m4.927s
>
> Is that the best you can do?

Unfortunately these changes do not even try to solve the main problem,
tasklist_lock doesn't scale simply because it is global. These changes
make sense (I hope) anyway, even if/when we redesign the locking. But
so far I do not have a good plan.

> (I assume the increase in user time was a glitch?)

To be honest, I didn't even notice this change. I repeated the testing
before/after this patch and (to my surprize) the "user" numbers are more
or less stable, and /usr/bin/time reports the increase.

1. First of all: this is impossible ;)

   Note that this test-case uses SIGTRAP to trigger the coredumping.
   This means that exit_notify() can only be called when all threads
   are already in kernel mode, the coredumping thread sleeps until
   they all are parked in exit_mm(). Until then this patch has no
   effect.

2. With this patch applied, I added mdelay(2) into forget_original_parent(),
   right after find_child_reaper(). And yes, this changes the numbers too:

   		real		user		sys
   		10m1.225s	0m5.443s	17m25.797s

   note that "user time" goes down.

3. So I think that this just reminds that utime/stime accounting isn't
   precise. sum_exec_runtime is accurate and thus we can more or less
   trust utime + stime, but utime/stime is random. Plus scale_stime()
   doesn't look very accurate too.

4. In this particular case the accounting is even more impresize, this
   test-case spends a lot of time in kernel mode with irqs disabled and
   this "freezes" task->stime.

5. That said, I still can't really understand why "user" grows. If I
   understand the calculations in cputime_adjust() correctly (probably
   I don't), it should not.

In short, I am a bit confused but I still don't think that this increase
is real.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 0/2] exit/pid_ns: comments + simple fix
  2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
                   ` (7 preceding siblings ...)
  2014-11-20 18:34 ` [PATCH 0/3] exit: avoid O(n ** 2) thread-list scan on group-exit if possible Oleg Nesterov
@ 2014-11-24 20:06 ` Oleg Nesterov
  2014-11-24 20:06   ` [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
                     ` (4 more replies)
  8 siblings, 5 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-24 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Eric, Pavel, could you review 1/2 ? (documentation only). It is based on the
code inspection, I didn't bother to verify that my understanding matches the
reality ;)

On 11/20, Oleg Nesterov wrote:
>
>
> Probably this is not the last series... in particular it seems that we
> have some problems with sys_setns() in this area, but I need to recheck.

So far only the documentation fix. I'll write another email (hopefully with the
patch), afaics at least setns() doesn't play well with PR_SET_CHILD_SUBREAPER.

Contrary to what I thought zap_pid_ns_processes() looks fine, but it seems only
by accident. Unless I am totally confused, wait for "nr_hashed == init_pids"
could be removed after 0a01f2cc390e10633a "pidns: Make the pidns proc mount/
umount logic obvious". However, now that setns() + fork() can inject a task
into a child namespace, we need this code again for another reason.

I _think_ we can actually remove it and simplify free_pid() as well, but lets
discuss this later and fix the wrong/confusing documentation first.

2/2 looks "obviously correct", but I'll appreciate your review anyway.

Oleg.

 kernel/pid.c           |    7 +++----
 kernel/pid_namespace.c |   23 +++++++++++++++++++----
 2 files changed, 22 insertions(+), 8 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()
  2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
@ 2014-11-24 20:06   ` Oleg Nesterov
  2014-11-24 20:14     ` Oleg Nesterov
  2014-11-24 22:07     ` Eric W. Biederman
  2014-11-24 20:06   ` [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-24 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

The comments in zap_pid_ns_processes() are simply wrong, we need
to explain how this code actually works.

1. "Ignore SIGCHLD" looks like optimization but it is not, we also
   need this for correctness.

2. The comment above sys_wait4() could be more clear.

3. The comment about TASK_DEAD children is outdated. Contrary to
   what it says we do not need to make sure they all go away.

   At the same time, we do need to wait for nr_hashed == init_pids,
   but the reason is quite different and not obvious.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/pid_namespace.c |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8e..1519b02 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	/* Don't allow any more processes into the pid namespace */
 	disable_pid_allocation(pid_ns);
 
-	/* Ignore SIGCHLD causing any terminated children to autoreap */
+	/*
+	 * Ignore SIGCHLD causing any terminated children to autoreap.
+	 * This speeds up the namespace shutdown, plus see the comment
+	 * below.
+	 */
 	spin_lock_irq(&me->sighand->siglock);
 	me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
 	spin_unlock_irq(&me->sighand->siglock);
@@ -223,15 +227,26 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	}
 	read_unlock(&tasklist_lock);
 
-	/* Firstly reap the EXIT_ZOMBIE children we may have. */
+	/* Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD */
 	do {
 		clear_thread_flag(TIF_SIGPENDING);
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
 	/*
-	 * sys_wait4() above can't reap the TASK_DEAD children.
-	 * Make sure they all go away, see free_pid().
+	 * sys_wait4() above can't reap the TASK_DEAD children but we do not
+	 * really care, we could reparent them to the global init. We could
+	 * exit and reap ->child_reaper even if it is not the last thread in
+	 * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
+	 * pid_ns can not go away until proc_kill_sb() drops the reference.
+	 *
+	 * But this namespace can also have other tasks injected by setns().
+	 * Again, we do not really need to wait until they are all reaped,
+	 * but they can be reparented to us and thus we need to ensure that
+	 * pid->child_reaper is valid until they all go away.
+	 *
+	 * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
+	 * if reparented.
 	 */
 	for (;;) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
  2014-11-24 20:06   ` [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
@ 2014-11-24 20:06   ` Oleg Nesterov
  2014-11-24 21:46     ` Eric W. Biederman
  2014-11-24 21:27   ` [PATCH 0/2] exit/pid_ns: comments + simple fix Eric W. Biederman
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-24 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns()
if it fails because disable_pid_allocation() was called by the exiting
child_reaper. We can move get_pid_ns() down to successful return.

While at it, simplify/cleanup the "goto out" mess, no need to confuse
the success/error return paths.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/pid.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 9b9a266..dfc2f3b 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -320,7 +320,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 			goto out_free;
 	}
 
-	get_pid_ns(ns);
 	atomic_set(&pid->count, 1);
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
@@ -336,7 +335,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	}
 	spin_unlock_irq(&pidmap_lock);
 
-out:
+	get_pid_ns(ns);
 	return pid;
 
 out_unlock:
@@ -346,8 +345,8 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
-	goto out;
+out:
+	return NULL;
 }
 
 void disable_pid_allocation(struct pid_namespace *ns)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()
  2014-11-24 20:06   ` [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
@ 2014-11-24 20:14     ` Oleg Nesterov
  2014-11-24 22:07     ` Eric W. Biederman
  1 sibling, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-24 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aaron Tomlin, Eric W. Biederman, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 11/24, Oleg Nesterov wrote:
>
> +	 * sys_wait4() above can't reap the TASK_DEAD children but we do not
> +	 * really care, we could reparent them to the global init. We could
> +	 * exit and reap ->child_reaper even if it is not the last thread in
> +	 * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
> +	 * pid_ns can not go away until proc_kill_sb() drops the reference.
> +	 *
> +	 * But this namespace can also have other tasks injected by setns().
                                                                    ^^^^^
I meant setns() + fork(), but I hope this is clear.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] exit/pid_ns: comments + simple fix
  2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
  2014-11-24 20:06   ` [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
  2014-11-24 20:06   ` [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
@ 2014-11-24 21:27   ` Eric W. Biederman
  2014-11-24 21:38     ` Oleg Nesterov
  2014-11-24 21:48   ` Eric W. Biederman
  2014-11-26 23:54   ` [PATCH v2 " Oleg Nesterov
  4 siblings, 1 reply; 55+ messages in thread
From: Eric W. Biederman @ 2014-11-24 21:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> Eric, Pavel, could you review 1/2 ? (documentation only). It is based on the
> code inspection, I didn't bother to verify that my understanding matches the
> reality ;)

Oleg you are being subtle, introducing some gratuitous changes and
slightly off base which makes this all very hard to review.

Eric


> On 11/20, Oleg Nesterov wrote:
>>
>>
>> Probably this is not the last series... in particular it seems that we
>> have some problems with sys_setns() in this area, but I need to recheck.
>
> So far only the documentation fix. I'll write another email (hopefully with the
> patch), afaics at least setns() doesn't play well with PR_SET_CHILD_SUBREAPER.
>
> Contrary to what I thought zap_pid_ns_processes() looks fine, but it seems only
> by accident. Unless I am totally confused, wait for "nr_hashed == init_pids"
> could be removed after 0a01f2cc390e10633a "pidns: Make the pidns proc mount/
> umount logic obvious". However, now that setns() + fork() can inject a task
> into a child namespace, we need this code again for another reason.
>
> I _think_ we can actually remove it and simplify free_pid() as well, but lets
> discuss this later and fix the wrong/confusing documentation first.
>
> 2/2 looks "obviously correct", but I'll appreciate your review anyway.
>
> Oleg.
>
>  kernel/pid.c           |    7 +++----
>  kernel/pid_namespace.c |   23 +++++++++++++++++++----
>  2 files changed, 22 insertions(+), 8 deletions(-)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] exit/pid_ns: comments + simple fix
  2014-11-24 21:27   ` [PATCH 0/2] exit/pid_ns: comments + simple fix Eric W. Biederman
@ 2014-11-24 21:38     ` Oleg Nesterov
  0 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-24 21:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 11/24, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Eric, Pavel, could you review 1/2 ? (documentation only). It is based on the
> > code inspection, I didn't bother to verify that my understanding matches the
> > reality ;)
>
> Oleg you are being subtle, introducing some gratuitous changes and
> slightly off base which makes this all very hard to review.

Eric, this doesn't depend on other changes I sent ;)

> Eric
> 
> 
> > On 11/20, Oleg Nesterov wrote:
> >>
> >>
> >> Probably this is not the last series... in particular it seems that we
> >> have some problems with sys_setns() in this area, but I need to recheck.
> >
> > So far only the documentation fix. I'll write another email (hopefully with the
> > patch), afaics at least setns() doesn't play well with PR_SET_CHILD_SUBREAPER.
> >
> > Contrary to what I thought zap_pid_ns_processes() looks fine, but it seems only
> > by accident. Unless I am totally confused, wait for "nr_hashed == init_pids"
> > could be removed after 0a01f2cc390e10633a "pidns: Make the pidns proc mount/
> > umount logic obvious". However, now that setns() + fork() can inject a task
> > into a child namespace, we need this code again for another reason.
> >
> > I _think_ we can actually remove it and simplify free_pid() as well, but lets
> > discuss this later and fix the wrong/confusing documentation first.
> >
> > 2/2 looks "obviously correct", but I'll appreciate your review anyway.
> >
> > Oleg.
> >
> >  kernel/pid.c           |    7 +++----
> >  kernel/pid_namespace.c |   23 +++++++++++++++++++----
> >  2 files changed, 22 insertions(+), 8 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-24 20:06   ` [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
@ 2014-11-24 21:46     ` Eric W. Biederman
  2014-11-25 17:07       ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Eric W. Biederman @ 2014-11-24 21:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns()
> if it fails because disable_pid_allocation() was called by the exiting
> child_reaper. We can move get_pid_ns() down to successful return.
>
> While at it, simplify/cleanup the "goto out" mess, no need to confuse
> the success/error return paths.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/pid.c |    7 +++----
>  1 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 9b9a266..dfc2f3b 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -320,7 +320,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  			goto out_free;
>  	}
>  
> -	get_pid_ns(ns);
>  	atomic_set(&pid->count, 1);
>  	for (type = 0; type < PIDTYPE_MAX; ++type)
>  		INIT_HLIST_HEAD(&pid->tasks[type]);
> @@ -336,7 +335,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	}
>  	spin_unlock_irq(&pidmap_lock);
>  
> -out:
> +	get_pid_ns(ns);

Moving the label and changing the goto out logic is gratuitous confusing
and I think it probably even generates worse code.

Furthermore multiple exits make adding debugging code more difficult.

Moving get_pid_ns down does close a leak in the error handling path.

However at the moment my I can't figure out if it is safe to move
get_pid_ns elow hlist_add_head_rcu.  Because once we are on the rcu list
the pid is findable, and being publicly visible with a bad refcount could cause
problems.

The increment of ns->nr_hashed after the adding to the rcu list is only safe
because we always access ns->nr_hashed with the pidmap_lock held.

Eric


>  	return pid;
>  
>  out_unlock:
> @@ -346,8 +345,8 @@ out_free:
>  		free_pidmap(pid->numbers + i);
>  
>  	kmem_cache_free(ns->pid_cachep, pid);
> -	pid = NULL;
> -	goto out;
> +out:
> +	return NULL;
>  }
>  
>  void disable_pid_allocation(struct pid_namespace *ns)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] exit/pid_ns: comments + simple fix
  2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
                     ` (2 preceding siblings ...)
  2014-11-24 21:27   ` [PATCH 0/2] exit/pid_ns: comments + simple fix Eric W. Biederman
@ 2014-11-24 21:48   ` Eric W. Biederman
  2014-11-25 16:57     ` Oleg Nesterov
  2014-11-26 23:54   ` [PATCH v2 " Oleg Nesterov
  4 siblings, 1 reply; 55+ messages in thread
From: Eric W. Biederman @ 2014-11-24 21:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> Eric, Pavel, could you review 1/2 ? (documentation only). It is based on the
> code inspection, I didn't bother to verify that my understanding matches the
> reality ;)
>
> On 11/20, Oleg Nesterov wrote:
>>
>>
>> Probably this is not the last series... in particular it seems that we
>> have some problems with sys_setns() in this area, but I need to recheck.
>
> So far only the documentation fix. I'll write another email (hopefully with the
> patch), afaics at least setns() doesn't play well with PR_SET_CHILD_SUBREAPER.
>
> Contrary to what I thought zap_pid_ns_processes() looks fine, but it seems only
> by accident. Unless I am totally confused, wait for "nr_hashed == init_pids"
> could be removed after 0a01f2cc390e10633a "pidns: Make the pidns proc mount/
> umount logic obvious". However, now that setns() + fork() can inject a task
> into a child namespace, we need this code again for another reason.
>
> I _think_ we can actually remove it and simplify free_pid() as well, but lets
> discuss this later and fix the wrong/confusing documentation first.

At the very least there is the issue of rusage being wrong if we allow
the init process to be reaped before all of it's children are reaped.

There is also a huge level of weird non-intuitive behavior that would
require some substantial benefits to justify an optimization of letting
a child exist longer than init.

Eric


> 2/2 looks "obviously correct", but I'll appreciate your review anyway.
>
> Oleg.
>
>  kernel/pid.c           |    7 +++----
>  kernel/pid_namespace.c |   23 +++++++++++++++++++----
>  2 files changed, 22 insertions(+), 8 deletions(-)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()
  2014-11-24 20:06   ` [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
  2014-11-24 20:14     ` Oleg Nesterov
@ 2014-11-24 22:07     ` Eric W. Biederman
  2014-11-25 16:57       ` Oleg Nesterov
  1 sibling, 1 reply; 55+ messages in thread
From: Eric W. Biederman @ 2014-11-24 22:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> The comments in zap_pid_ns_processes() are simply wrong, we need
> to explain how this code actually works.
>
> 1. "Ignore SIGCHLD" looks like optimization but it is not, we also
>    need this for correctness.
>
> 2. The comment above sys_wait4() could be more clear.
>
> 3. The comment about TASK_DEAD children is outdated. Contrary to
>    what it says we do not need to make sure they all go away.

We absolutely do need to wait until they all go away.

- rusage will be wrong if we don't wait.
- We won't wait for an injected process in a pid namespace,
  or a processes debugged with gdb to be reaped before the pid
  init process exits if we don't wait.

>    At the same time, we do need to wait for nr_hashed == init_pids,
>    but the reason is quite different and not obvious.

The user visible semantics go from weird to completely insane if we
relax the rule that the init process is the last process in a pid
namespace.

I don't see anything approaching a good reason for changing the user
space semantics.

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/pid_namespace.c |   23 +++++++++++++++++++----
>  1 files changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index db95d8e..1519b02 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	/* Don't allow any more processes into the pid namespace */
>  	disable_pid_allocation(pid_ns);

        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above line disables injections of new processes in this pid
namespace.

>  
> -	/* Ignore SIGCHLD causing any terminated children to autoreap */
> +	/*
> +	 * Ignore SIGCHLD causing any terminated children to autoreap.
> +	 * This speeds up the namespace shutdown, plus see the comment
> +	 * below.
> +	 */

	This speeds up the namespace shutdown and ensures that after we
	have waited for all existing zombies there will be no more
	zombies to wait for.

>  	spin_lock_irq(&me->sighand->siglock);
>  	me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
>  	spin_unlock_irq(&me->sighand->siglock);
> @@ -223,15 +227,26 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	}
>  	read_unlock(&tasklist_lock);
>  
> -	/* Firstly reap the EXIT_ZOMBIE children we may have. */
> +	/* Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD */
>  	do {
>  		clear_thread_flag(TIF_SIGPENDING);
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>  
>  	/*
> -	 * sys_wait4() above can't reap the TASK_DEAD children.
> -	 * Make sure they all go away, see free_pid().
> +	 * sys_wait4() above can't reap the TASK_DEAD children but we do not
> +	 * really care, we could reparent them to the global init. 

We do care.


> We could
> +	 * exit and reap ->child_reaper even if it is not the last thread in
> +	 * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
> +	 * pid_ns can not go away until proc_kill_sb() drops the reference.


> +	 * But this namespace can also have other tasks injected by setns().
> +	 * Again, we do not really need to wait until they are all reaped,

We do, and setns does not matter here.  Injection was stopped way up above.


> +	 * but they can be reparented to us and thus we need to ensure that
> +	 * pid->child_reaper is valid until they all go away.


> +	 * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
> +	 * if reparented.

Your new comment is about 90% wrong.

>  	 */
>  	for (;;) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] exit/pid_ns: comments + simple fix
  2014-11-24 21:48   ` Eric W. Biederman
@ 2014-11-25 16:57     ` Oleg Nesterov
  0 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-25 16:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 11/24, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Eric, Pavel, could you review 1/2 ? (documentation only). It is based on the
> > code inspection, I didn't bother to verify that my understanding matches the
> > reality ;)
> >
> > On 11/20, Oleg Nesterov wrote:
> >>
> >>
> >> Probably this is not the last series... in particular it seems that we
> >> have some problems with sys_setns() in this area, but I need to recheck.
> >
> > So far only the documentation fix. I'll write another email (hopefully with the
> > patch), afaics at least setns() doesn't play well with PR_SET_CHILD_SUBREAPER.
> >
> > Contrary to what I thought zap_pid_ns_processes() looks fine, but it seems only
> > by accident. Unless I am totally confused, wait for "nr_hashed == init_pids"
> > could be removed after 0a01f2cc390e10633a "pidns: Make the pidns proc mount/
> > umount logic obvious". However, now that setns() + fork() can inject a task
> > into a child namespace, we need this code again for another reason.
> >
> > I _think_ we can actually remove it and simplify free_pid() as well, but lets
> > discuss this later and fix the wrong/confusing documentation first.
>
> At the very least there is the issue of rusage being wrong if we allow
> the init process to be reaped before all of it's children are reaped.

Do you mean cstime/cutime/c* accounting?

Firstly it is not clear what makes child_reaper special in _this_ sense, but
this doesn't matter at all.

The auotoreaping/EXIT_DEAD children are not accounted, only wait_task_zombie()
accumulates these counters. (just in case, accounting in __exit_signal() is
another thing).

> There is also a huge level of weird non-intuitive behavior that would
> require some substantial benefits to justify an optimization of letting
> a child exist longer than init.

Sure. That is why I said "lets discuss this later". This patch doesn't try
to change the rules. It only tries to document the current code.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()
  2014-11-24 22:07     ` Eric W. Biederman
@ 2014-11-25 16:57       ` Oleg Nesterov
  2014-11-25 17:17         ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-25 16:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 11/24, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > The comments in zap_pid_ns_processes() are simply wrong, we need
> > to explain how this code actually works.
> >
> > 1. "Ignore SIGCHLD" looks like optimization but it is not, we also
> >    need this for correctness.
> >
> > 2. The comment above sys_wait4() could be more clear.
> >
> > 3. The comment about TASK_DEAD children is outdated. Contrary to
> >    what it says we do not need to make sure they all go away.
>
> We absolutely do need to wait until they all go away.

I can easily miss something, and that is why I asked you to review this
change. But if I missed something, perhaps we should update the comments?

This "wait until TASK_DEAD children" loop was added to ensure that
child_reaper can't be reaped before other tasks in this pid namespace.
6347e90091041e "pidns: guarantee that the pidns init will be the last pidns
process reaped".

But this was needed because proc_flush_task(pid == 1) called
kern_unmount(proc_mnt). After 0a01f2cc390e10633 "pidns: Make the pidns
proc mount/umount logic obvious" we can rely on schedule_work() in
free_pid(nr_hashed == 0).

So in any case I think that the current comment is outdated.


> - rusage will be wrong if we don't wait.

I already answered in 0/2, let me quote myself:

	Do you mean cstime/cutime/c* accounting?

	Firstly it is not clear what makes child_reaper special in _this_ sense, but
	this doesn't matter at all.

	The auotoreaping/EXIT_DEAD children are not accounted, only wait_task_zombie()
	accumulates these counters. (just in case, accounting in __exit_signal() is
	another thing).


> - We won't wait for an injected process in a pid namespace,
>   or a processes debugged with gdb to be reaped before the pid
>   init process exits if we don't wait.

Yes, and I do not see why this is bad, but this is off-topic.

Again, lets discuss this in another thread. This patch doesn't try to
document the desired semantics, it only tries to explain why zap_pid_ns
_has_ to wait until EXIT_DEAD (and in fact EXIT_ZOMBIE) tasks go away.


> The user visible semantics go from weird to completely insane if we
> relax the rule that the init process is the last process in a pid
> namespace.
>
> I don't see anything approaching a good reason for changing the user
> space semantics.

See above.

> > @@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> >  	/* Don't allow any more processes into the pid namespace */
> >  	disable_pid_allocation(pid_ns);
>
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The above line disables injections of new processes in this pid
> namespace.

Yes, sure. See below.

> > -	/* Ignore SIGCHLD causing any terminated children to autoreap */
> > +	/*
> > +	 * Ignore SIGCHLD causing any terminated children to autoreap.
> > +	 * This speeds up the namespace shutdown, plus see the comment
> > +	 * below.
> > +	 */
>
> 	This speeds up the namespace shutdown and ensures that after we
> 	have waited for all existing zombies there will be no more
> 	zombies to wait for.

Afaics, no, see below.

> > -	 * sys_wait4() above can't reap the TASK_DEAD children.
> > -	 * Make sure they all go away, see free_pid().
> > +	 * sys_wait4() above can't reap the TASK_DEAD children but we do not
> > +	 * really care, we could reparent them to the global init.
>
> We do care.

See above. I only meant that nothing bad can happen from the kernel
perspective.

> > +	 * But this namespace can also have other tasks injected by setns().
> > +	 * Again, we do not really need to wait until they are all reaped,
>
> We do, and setns does not matter here.  Injection was stopped way up above.

I think that setns() does matter.

Yes injection was stopped. But a task T can be already injected before
disable_pid_allocation() was called.

Now it is killed (or it could even exit before). To simplify, suppose it
is already EXIT_ZOMBIE, although this doesn't really matter.

The sys_wait4() loop can't see it, it is not our child.

Now suppose that its parent doesn't do wait() but exits. This means that
the exiting parent will try to reparent T to pid_ns->child_reaper.

child_reaper already sleeps in "wait for nr_hashed == init_pids" loop, it
can do nothing with T.

So we rely on autoreaping (forced by ignored SIGCHLD), and this is the
(technical) reason why child_reaper can't exit: pid_ns->child_reaper should
be valid.

No?

> > +	 * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
> > +	 * if reparented.
>
> Your new comment is about 90% wrong.

See above.

Eric. I'd really ask you to take another look. But in any case: thanks for
looking at this.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-24 21:46     ` Eric W. Biederman
@ 2014-11-25 17:07       ` Oleg Nesterov
  2014-11-25 17:50         ` Eric W. Biederman
  0 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-25 17:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 11/24, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -320,7 +320,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  			goto out_free;
> >  	}
> >
> > -	get_pid_ns(ns);
> >  	atomic_set(&pid->count, 1);
> >  	for (type = 0; type < PIDTYPE_MAX; ++type)
> >  		INIT_HLIST_HEAD(&pid->tasks[type]);
> > @@ -336,7 +335,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  	}
> >  	spin_unlock_irq(&pidmap_lock);
> >
> > -out:
> > +	get_pid_ns(ns);
>
> Moving the label and changing the goto out logic is gratuitous confusing
> and I think it probably even generates worse code.
>
> Furthermore multiple exits make adding debugging code more difficult.

Oh, I strongly disagree but I am not going to argue ;) cleanups are
always subjective, and I do believe in "maintainer is always right"
mantra. I can make v2 without this change.

> Moving get_pid_ns down does close a leak in the error handling path.

OK, good.

> However at the moment my I can't figure out if it is safe to move
> get_pid_ns elow hlist_add_head_rcu.  Because once we are on the rcu list
> the pid is findable, and being publicly visible with a bad refcount could cause
> problems.

The caller has a reference, this ns can't go away. Obviously, otherwise
get_pid_ns(ns) is not safe.

We need this get_pid_ns() to balance put_pid()->put_pid_ns() which obviously
won't be called until we return this pid, otherwise everything is wrong.

So I think this should be safe?

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes()
  2014-11-25 16:57       ` Oleg Nesterov
@ 2014-11-25 17:17         ` Oleg Nesterov
  0 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-25 17:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

forgot to mention...

On 11/25, Oleg Nesterov wrote:
>
> On 11/24, Eric W. Biederman wrote:
> >
> > - We won't wait for an injected process in a pid namespace,
> >   or a processes debugged with gdb to be reaped before the pid
> >   init process exits if we don't wait.
>
> Yes, and I do not see why this is bad, but this is off-topic.
>
> Again, lets discuss this in another thread. This patch doesn't try to
> document the desired semantics, it only tries to explain why zap_pid_ns
> _has_ to wait until EXIT_DEAD (and in fact EXIT_ZOMBIE) tasks go away.

Just in case, as for "processes debugged with gdb" I obviously agree,
but this has nothing to do with EXIT_DEAD threads. zap_pid_ns_processes()
will sleep in sys_wait4() until debugger detaches the tracee and reaps
it due to ignored SIGCHLD.

But again, this is off-topic now.

> Eric. I'd really ask you to take another look. But in any case: thanks for
> looking at this.

Yes, please.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-25 17:07       ` Oleg Nesterov
@ 2014-11-25 17:50         ` Eric W. Biederman
  2014-11-25 18:15           ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Eric W. Biederman @ 2014-11-25 17:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> On 11/24, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > --- a/kernel/pid.c
>> > +++ b/kernel/pid.c
>> > @@ -320,7 +320,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>> >  			goto out_free;
>> >  	}
>> >
>> > -	get_pid_ns(ns);
>> >  	atomic_set(&pid->count, 1);
>> >  	for (type = 0; type < PIDTYPE_MAX; ++type)
>> >  		INIT_HLIST_HEAD(&pid->tasks[type]);
>> > @@ -336,7 +335,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>> >  	}
>> >  	spin_unlock_irq(&pidmap_lock);
>> >
>> > -out:
>> > +	get_pid_ns(ns);
>>
>> Moving the label and changing the goto out logic is gratuitous confusing
>> and I think it probably even generates worse code.
>>
>> Furthermore multiple exits make adding debugging code more difficult.
>
> Oh, I strongly disagree but I am not going to argue ;) cleanups are
> always subjective, and I do believe in "maintainer is always right"
> mantra. I can make v2 without this change.

Fair enough.  My primary complaint was that you were changing the logic
and fixing a bug at the same time.  That added noise and made analysis
of what was really going on much more difficult.

>> Moving get_pid_ns down does close a leak in the error handling path.
>
> OK, good.
>
>> However at the moment my I can't figure out if it is safe to move
>> get_pid_ns elow hlist_add_head_rcu.  Because once we are on the rcu list
>> the pid is findable, and being publicly visible with a bad refcount could cause
>> problems.
>
> The caller has a reference, this ns can't go away. Obviously, otherwise
> get_pid_ns(ns) is not safe.
>
> We need this get_pid_ns() to balance put_pid()->put_pid_ns() which obviously
> won't be called until we return this pid, otherwise everything is wrong.
>
> So I think this should be safe?

My concern is exposing a half initialized struct pid to the world via an
rcu data structure.  In particular could one of the rcu users get into
trouble because we haven't called get_pid_ns yet?  That is unclear to me.

That is one of those weird nasty races I would rather not have to
consider and moving the get_pid_ns after hlist_add requires that we
think about it.

To fix the error handling and avoid thinking about the races we have two
choices:
- In the error path that is currently called out_unlock we can drop the
  extra references.
- Immediately after we perform the test that on error jumps to out_unlock
  we call get_pid_ns.

My preference would be the first, as it is a trivially correct one line
change.

Aka I think this is the obviously correct trivial fix.

 out_unlock:
 	spin_unlock_irq(&pidmap_lock);
+	put_pid_ns(ns);
 out_free:
 	while (++i <= ns->level)
 		free_pidmap(pid->numbers + i);
 


Eric

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-25 17:50         ` Eric W. Biederman
@ 2014-11-25 18:15           ` Oleg Nesterov
  2014-11-25 18:43             ` Eric W. Biederman
  0 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-25 18:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 11/25, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 11/24, Eric W. Biederman wrote:
> >>
> >> However at the moment my I can't figure out if it is safe to move
> >> get_pid_ns elow hlist_add_head_rcu.  Because once we are on the rcu list
> >> the pid is findable, and being publicly visible with a bad refcount could cause
> >> problems.
> >
> > The caller has a reference, this ns can't go away. Obviously, otherwise
> > get_pid_ns(ns) is not safe.
> >
> > We need this get_pid_ns() to balance put_pid()->put_pid_ns() which obviously
> > won't be called until we return this pid, otherwise everything is wrong.
> >
> > So I think this should be safe?
>
> My concern is exposing a half initialized struct pid to the world via an
> rcu data structure.  In particular could one of the rcu users get into
> trouble because we haven't called get_pid_ns yet?  That is unclear to me.

They can't. This pid was fully initialized, in particular
pid->numbers[pid->level].ns == ns has a reference.

Just it is not ready for put_pid() which will be called by the "owner" of
this pid, the caller or the new child. So in this sense it doesn't matter
when we call get_pid_ns(), just we need to do this before return.

> That is one of those weird nasty races I would rather not have to
> consider and moving the get_pid_ns after hlist_add requires that we
> think about it.
>
> To fix the error handling and avoid thinking about the races we have two
> choices:
> - In the error path that is currently called out_unlock we can drop the
>   extra references.
> - Immediately after we perform the test that on error jumps to out_unlock
>   we call get_pid_ns.
>
> My preference would be the first, as it is a trivially correct one line
> change.
>
> Aka I think this is the obviously correct trivial fix.
>
>  out_unlock:
>  	spin_unlock_irq(&pidmap_lock);
> +	put_pid_ns(ns);

Sure, initially I was going to do this. But this is sub-optimal imo, I mainly
mean less clear (imho).

But again, I won't argue. I'll send V2 once we finish the discussion about 2/2.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-25 18:15           ` Oleg Nesterov
@ 2014-11-25 18:43             ` Eric W. Biederman
  2014-11-25 18:59               ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Eric W. Biederman @ 2014-11-25 18:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> On 11/25, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > On 11/24, Eric W. Biederman wrote:
>> >>
>> >> However at the moment my I can't figure out if it is safe to move
>> >> get_pid_ns elow hlist_add_head_rcu.  Because once we are on the rcu list
>> >> the pid is findable, and being publicly visible with a bad refcount could cause
>> >> problems.
>> >
>> > The caller has a reference, this ns can't go away. Obviously, otherwise
>> > get_pid_ns(ns) is not safe.
>> >
>> > We need this get_pid_ns() to balance put_pid()->put_pid_ns() which obviously
>> > won't be called until we return this pid, otherwise everything is wrong.
>> >
>> > So I think this should be safe?
>>
>> My concern is exposing a half initialized struct pid to the world via an
>> rcu data structure.  In particular could one of the rcu users get into
>> trouble because we haven't called get_pid_ns yet?  That is unclear to me.
>
> They can't. This pid was fully initialized, in particular
> pid->numbers[pid->level].ns == ns has a reference.
>
> Just it is not ready for put_pid() which will be called by the "owner" of
> this pid, the caller or the new child. So in this sense it doesn't matter
> when we call get_pid_ns(), just we need to do this before return.

Or by someone calling find_get_pid() ... put_pid().

Now the reference count should not hit zero in that case but I hate to
think of that case separately.

>> That is one of those weird nasty races I would rather not have to
>> consider and moving the get_pid_ns after hlist_add requires that we
>> think about it.
>>
>> To fix the error handling and avoid thinking about the races we have two
>> choices:
>> - In the error path that is currently called out_unlock we can drop the
>>   extra references.
>> - Immediately after we perform the test that on error jumps to out_unlock
>>   we call get_pid_ns.
>>
>> My preference would be the first, as it is a trivially correct one line
>> change.
>>
>> Aka I think this is the obviously correct trivial fix.
>>
>>  out_unlock:
>>  	spin_unlock_irq(&pidmap_lock);
>> +	put_pid_ns(ns);
>
> Sure, initially I was going to do this. But this is sub-optimal imo, I mainly
> mean less clear (imho).
>
> But again, I won't argue. I'll send V2 once we finish the discussion
> about 2/2.

At this point, and especially since we need to Cc stable and get this
fix backported to who knows how many kernel releases having something
that is trivial to validate is correct is important.

If you prefer to call get_pid_ns() immedately after:

	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
		goto out_unlock;

That would be fine with me as well.

Anything else to too clever for my brain to verify is correct today.

Eric


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-25 18:43             ` Eric W. Biederman
@ 2014-11-25 18:59               ` Oleg Nesterov
  0 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-25 18:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 11/25, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 11/25, Eric W. Biederman wrote:
> >>
> >> My concern is exposing a half initialized struct pid to the world via an
> >> rcu data structure.  In particular could one of the rcu users get into
> >> trouble because we haven't called get_pid_ns yet?  That is unclear to me.
> >
> > They can't. This pid was fully initialized, in particular
> > pid->numbers[pid->level].ns == ns has a reference.
> >
> > Just it is not ready for put_pid() which will be called by the "owner" of
> > this pid, the caller or the new child. So in this sense it doesn't matter
> > when we call get_pid_ns(), just we need to do this before return.
>
> Or by someone calling find_get_pid() ... put_pid().
>
> Now the reference count should not hit zero in that case

Yes. this get_pid_ns() connects to atomic_set(&pid->count, 1) set by us.

> but I hate to
> think of that case separately.

OK.

> At this point, and especially since we need to Cc stable and get this
> fix backported to who knows how many kernel releases having something
> that is trivial to validate is correct is important.

OK. I'll send V2 with -stable cc'ed.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v2 0/2] exit/pid_ns: comments + simple fix
  2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
                     ` (3 preceding siblings ...)
  2014-11-24 21:48   ` Eric W. Biederman
@ 2014-11-26 23:54   ` Oleg Nesterov
  2014-11-26 23:54     ` [PATCH v2 1/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
  2014-11-26 23:54     ` [PATCH v2 2/2] exit: pidns: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
  4 siblings, 2 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-26 23:54 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman
  Cc: Aaron Tomlin, Pavel Emelyanov, Serge Hallyn, Sterling Alexander,
	linux-kernel

To remind, this doesn't depend on other changes in -mm.

It is not clear to me if I addressed Eric's objections or not, but let
me send V2 (I reordered these patches).

I updated the changelog/comments in 2/2, hopefully now it is more clear
that this patch only tries to explain how the current code works, and why
do we need to wait for nr_hashed==init_pids from the kernel perspective.
Otherwise it is basically the same patch, except I added another note
about sys_wait4().

And, Eric, let me repeat. In any case we do not need for EXIT_DEAD tasks.
They are stealth. In fact this state should die, it only complicates
do_wait/reparenting/etc but we can't avoid it right now (mostly because
of the locking in wait paths).

IOW. If we could somehow detect that this pid_ns has only EXIT_DEAD tasks
we could return and user-space won't notice this change. No, it is not
that I think we should do this, but this should be documented.

Please review.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v2 1/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-26 23:54   ` [PATCH v2 " Oleg Nesterov
@ 2014-11-26 23:54     ` Oleg Nesterov
  2014-11-27 15:44       ` Eric W. Biederman
  2014-11-26 23:54     ` [PATCH v2 2/2] exit: pidns: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
  1 sibling, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-26 23:54 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman
  Cc: Aaron Tomlin, Pavel Emelyanov, Serge Hallyn, Sterling Alexander,
	linux-kernel

alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns()
if it fails because disable_pid_allocation() was called by the exiting
child_reaper.

We could simply move get_pid_ns() down to successful return, but this
fix tries to be as trivial as possible.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
---
 kernel/pid.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 9b9a266..82430c8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -341,6 +341,8 @@ out:
 
 out_unlock:
 	spin_unlock_irq(&pidmap_lock);
+	put_pid_ns(ns);
+
 out_free:
 	while (++i <= ns->level)
 		free_pidmap(pid->numbers + i);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 2/2] exit: pidns: fix/update the comments in zap_pid_ns_processes()
  2014-11-26 23:54   ` [PATCH v2 " Oleg Nesterov
  2014-11-26 23:54     ` [PATCH v2 1/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
@ 2014-11-26 23:54     ` Oleg Nesterov
  2014-12-01 22:39       ` Andrew Morton
  1 sibling, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2014-11-26 23:54 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman
  Cc: Aaron Tomlin, Pavel Emelyanov, Serge Hallyn, Sterling Alexander,
	linux-kernel

The comments in zap_pid_ns_processes() are not clear, we need to
explain how this code actually works.

1. "Ignore SIGCHLD" looks like optimization but it is not, we also
   need this for correctness.

2. The comment above sys_wait4() could tell more.

   EXIT_ZOMBIE child is only possible if it has exited before we
   ignored SIGCHLD. Or if it is traced from the parent namespace,
   but in this case it will be reaped by debugger after detach,
   sys_wait4() acts as a synchronization point.

3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is
   outdated. Contrary to what it says we do not need to make sure
   they all go away after 0a01f2cc390e "pidns: Make the pidns proc
   mount/umount logic obvious".

   At the same time, we do need to wait for nr_hashed==init_pids,
   but the reasons are quite different and not obvious: setns().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/pid_namespace.c |   28 ++++++++++++++++++++++++----
 1 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8e..bc6d6a8 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	/* Don't allow any more processes into the pid namespace */
 	disable_pid_allocation(pid_ns);
 
-	/* Ignore SIGCHLD causing any terminated children to autoreap */
+	/*
+	 * Ignore SIGCHLD causing any terminated children to autoreap.
+	 * This speeds up the namespace shutdown, plus see the comment
+	 * below.
+	 */
 	spin_lock_irq(&me->sighand->siglock);
 	me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
 	spin_unlock_irq(&me->sighand->siglock);
@@ -223,15 +227,31 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	}
 	read_unlock(&tasklist_lock);
 
-	/* Firstly reap the EXIT_ZOMBIE children we may have. */
+	/*
+	 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
+	 * sys_wait4() will also block until our children traced from the
+	 * parent namespace are detached and become EXIT_DEAD.
+	 */
 	do {
 		clear_thread_flag(TIF_SIGPENDING);
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
 	/*
-	 * sys_wait4() above can't reap the TASK_DEAD children.
-	 * Make sure they all go away, see free_pid().
+	 * sys_wait4() above can't reap the EXIT_DEAD children but we do not
+	 * really care, we could reparent them to the global init. We could
+	 * exit and reap ->child_reaper even if it is not the last thread in
+	 * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
+	 * pid_ns can not go away until proc_kill_sb() drops the reference.
+	 *
+	 * But this ns can also have other tasks injected by setns()+fork().
+	 * Again, ignoring the user visible semantics we do not really need
+	 * to wait until they are all reaped, but they can be reparented to
+	 * us and thus we need to ensure that pid->child_reaper stays valid
+	 * until they all go away. See free_pid()->wake_up_process().
+	 *
+	 * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
+	 * if reparented.
 	 */
 	for (;;) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 1/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  2014-11-26 23:54     ` [PATCH v2 1/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
@ 2014-11-27 15:44       ` Eric W. Biederman
  0 siblings, 0 replies; 55+ messages in thread
From: Eric W. Biederman @ 2014-11-27 15:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns()
> if it fails because disable_pid_allocation() was called by the exiting
> child_reaper.
>
> We could simply move get_pid_ns() down to successful return, but this
> fix tries to be as trivial as possible.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Cc: stable@vger.kernel.org
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/pid.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 9b9a266..82430c8 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -341,6 +341,8 @@ out:
>  
>  out_unlock:
>  	spin_unlock_irq(&pidmap_lock);
> +	put_pid_ns(ns);
> +
>  out_free:
>  	while (++i <= ns->level)
>  		free_pidmap(pid->numbers + i);

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 2/2] exit: pidns: fix/update the comments in zap_pid_ns_processes()
  2014-11-26 23:54     ` [PATCH v2 2/2] exit: pidns: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
@ 2014-12-01 22:39       ` Andrew Morton
  2014-12-01 23:24         ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2014-12-01 22:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On Thu, 27 Nov 2014 00:54:37 +0100 Oleg Nesterov <oleg@redhat.com> wrote:

> The comments in zap_pid_ns_processes() are not clear, we need to
> explain how this code actually works.

Could we please get some documentation for PIDNS_HASH_ADDING?  What it
does, what is the protocol for handling it, etc?  I was trying to
review your [1/2] but this little mystery has me somewhat stumped.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 2/2] exit: pidns: fix/update the comments in zap_pid_ns_processes()
  2014-12-01 22:39       ` Andrew Morton
@ 2014-12-01 23:24         ` Oleg Nesterov
  0 siblings, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2014-12-01 23:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, Aaron Tomlin, Pavel Emelyanov, Serge Hallyn,
	Sterling Alexander, linux-kernel

On 12/01, Andrew Morton wrote:
>
> On Thu, 27 Nov 2014 00:54:37 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > The comments in zap_pid_ns_processes() are not clear, we need to
> > explain how this code actually works.
>
> Could we please get some documentation for PIDNS_HASH_ADDING?  What it
> does, what is the protocol for handling it, etc?  I was trying to
> review your [1/2] but this little mystery has me somewhat stumped.

OK, I'll send another simple doc patch.

But in short it is simple. "nr_hashed & PIDNS_HASH_ADDING" just means
that this pid_ns is alive, and alloc_pid() can succeed. We need this to
ensure that a new child can't be injected after zap_pid_ns_processes()
has already killed all tasks in its pid_ns.

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2014-12-01 23:25 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-07 20:14 [PATCH 0/4] proc: deuglify task_state() Oleg Nesterov
2014-11-07 20:14 ` [PATCH 1/4] proc: task_state: read cred->group_info outside of task_lock() Oleg Nesterov
2014-11-07 20:14 ` [PATCH 2/4] proc: task_state: deuglify the max_fds calculation Oleg Nesterov
2014-11-07 20:14 ` [PATCH 3/4] proc: task_state: move the main seq_printf() outside of rcu_read_lock() Oleg Nesterov
2014-11-13 18:04   ` Paul E. McKenney
2014-11-07 20:14 ` [PATCH 4/4] proc: task_state: ptrace_parent() doesn't need pid_alive() check Oleg Nesterov
2014-11-10 21:59 ` [PATCH 0/5] exit: reparent zombie fix + cleanups/optimizations Oleg Nesterov
2014-11-10 22:00   ` [PATCH 1/5] sched_show_task: fix unsafe usage of ->real_parent Oleg Nesterov
2014-11-11 10:39     ` Peter Zijlstra
2014-11-10 22:00   ` [PATCH 2/5] exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks Oleg Nesterov
2014-11-10 22:00   ` [PATCH 3/5] exit: reparent: cleanup the changing of ->parent Oleg Nesterov
2014-11-10 22:00   ` [PATCH 4/5] exit: reparent: cleanup the usage of reparent_leader() Oleg Nesterov
2014-11-10 22:00   ` [PATCH 5/5] exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() Oleg Nesterov
2014-11-14  1:37 ` [PATCH 0/5] exit: more cleanups/optimizations Oleg Nesterov
2014-11-14  1:38   ` [PATCH 1/5] exit: wait: cleanup the ptrace_reparented() checks Oleg Nesterov
2014-11-14  1:38   ` [PATCH 2/5] exit: wait: don't use zombie->real_parent Oleg Nesterov
2014-11-14  1:38   ` [PATCH 3/5] exit: wait: drop tasklist_lock before psig->c* accounting Oleg Nesterov
2014-11-14  1:38   ` [PATCH 4/5] exit: release_task: fix the comment about group leader accounting Oleg Nesterov
2014-11-14  1:38   ` [PATCH 5/5] exit: proc: don't try to flush /proc/tgid/task/tgid Oleg Nesterov
2014-11-18 21:29 ` [PATCH 0/6] exit: find_new_reaper() fixes/cleanups Oleg Nesterov
2014-11-18 21:30   ` [PATCH 1/6] exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting Oleg Nesterov
2014-11-18 21:30   ` [PATCH 2/6] exit: reparent: fix the cross-namespace " Oleg Nesterov
2014-11-18 21:30   ` [PATCH 3/6] exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() Oleg Nesterov
2014-11-18 21:30   ` [PATCH 4/6] exit: reparent: document the ->has_child_subreaper checks Oleg Nesterov
2014-11-18 21:30   ` [PATCH 5/6] exit: reparent: introduce find_child_reaper() Oleg Nesterov
2014-11-18 21:30   ` [PATCH 6/6] exit: reparent: introduce find_alive_thread() Oleg Nesterov
2014-11-20 18:34 ` [PATCH 0/3] exit: avoid O(n ** 2) thread-list scan on group-exit if possible Oleg Nesterov
2014-11-20 18:34   ` [PATCH -mm 1/3] exit: reparent: avoid find_new_reaper() if no children Oleg Nesterov
2014-11-20 22:37     ` Andrew Morton
2014-11-21 20:01       ` Oleg Nesterov
2014-11-20 18:34   ` [PATCH -mm 2/3] exit: reparent: call forget_original_parent() under tasklist_lock Oleg Nesterov
2014-11-20 18:34   ` [PATCH -mm 3/3] exit: exit_notify: re-use "dead" list to autoreap current Oleg Nesterov
2014-11-24 20:06 ` [PATCH 0/2] exit/pid_ns: comments + simple fix Oleg Nesterov
2014-11-24 20:06   ` [PATCH 1/2] exit: reparent: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
2014-11-24 20:14     ` Oleg Nesterov
2014-11-24 22:07     ` Eric W. Biederman
2014-11-25 16:57       ` Oleg Nesterov
2014-11-25 17:17         ` Oleg Nesterov
2014-11-24 20:06   ` [PATCH 2/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
2014-11-24 21:46     ` Eric W. Biederman
2014-11-25 17:07       ` Oleg Nesterov
2014-11-25 17:50         ` Eric W. Biederman
2014-11-25 18:15           ` Oleg Nesterov
2014-11-25 18:43             ` Eric W. Biederman
2014-11-25 18:59               ` Oleg Nesterov
2014-11-24 21:27   ` [PATCH 0/2] exit/pid_ns: comments + simple fix Eric W. Biederman
2014-11-24 21:38     ` Oleg Nesterov
2014-11-24 21:48   ` Eric W. Biederman
2014-11-25 16:57     ` Oleg Nesterov
2014-11-26 23:54   ` [PATCH v2 " Oleg Nesterov
2014-11-26 23:54     ` [PATCH v2 1/2] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Oleg Nesterov
2014-11-27 15:44       ` Eric W. Biederman
2014-11-26 23:54     ` [PATCH v2 2/2] exit: pidns: fix/update the comments in zap_pid_ns_processes() Oleg Nesterov
2014-12-01 22:39       ` Andrew Morton
2014-12-01 23:24         ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).