linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
@ 2012-01-16 15:32 Cyrill Gorcunov
  2012-01-16 16:11 ` Oleg Nesterov
  2012-01-17 21:38 ` KOSAKI Motohiro
  0 siblings, 2 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-16 15:32 UTC (permalink / raw)
  To: LKML
  Cc: Oleg Nesterov, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

When we do checkpoint of a task we need to know the list of children
the task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).

So instead of walking over all pids in the system (creating one big process
tree in memory, just to figure out which children a task has) -- we add
explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

v2:
 - Kame suggested to use a separated /proc/<pid>/children entry
   instead of poking /proc/<pid>/status
 - Andew suggested to use rcu facility instead of locking
   tasklist_lock
 - Tejun pointed that non-seekable seq file might not be
   enough for tasks with large number of children

v3:
 - To be on a safe side use %lu format for pid_t printing

v4:
 - New line get printed when sequence ends not at seq->stop,
   a nit pointed by Tejun
 - Documentation update
 - tasklist_lock is back, Oleg pointed that ->children list
   is actually not rcu-safe

v5:
 - Oleg suggested to make /proc/<pid>/task/<tid>/children
   instead of global /proc/<pid>/children, which eliminates
   hardness related to threads and children migration, and
   allows patch to be a way simplier.

v6:
 - Drop ptrace_may_access tests, pids are can be found anyway
   so nothing to protect here.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 Documentation/filesystems/proc.txt |   18 ++++
 fs/proc/array.c                    |  146 +++++++++++++++++++++++++++++++++++++
 fs/proc/base.c                     |    1 
 fs/proc/internal.h                 |    1 
 4 files changed, 166 insertions(+)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.7   /proc/<pid>/task/<tid>/children - Information about task children
 
   4	Configuring procfs
   4.1	Mount options
@@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
 comm value.
 
 
+3.7	/proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results, which means if there a new child
+created while we read the "children" file -- it might or might not be
+counted. So one need to either stop or freeze processes if precise
+results are needed.
+
+
 ------------------------------------------------------------------------------
 Configuring procfs
 ------------------------------------------------------------------------------
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,149 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+struct proc_pid_children_iter {
+	struct pid_namespace *pid_ns;
+	struct pid *pid_start;
+};
+
+static struct pid *
+get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
+{
+	struct task_struct *start, *task;
+	struct pid *pid = NULL;
+
+	read_lock(&tasklist_lock);
+
+	start = pid_task(iter->pid_start, PIDTYPE_PID);
+	if (!start)
+		goto out;
+
+	/*
+	 * Lets try to continue searching first, this gives
+	 * us significant speedup on children-rich processes.
+	 */
+	if (pid_prev) {
+		task = pid_task(pid_prev, PIDTYPE_PID);
+		if (task && task->real_parent == start &&
+		    !(list_empty(&task->sibling))) {
+			/*
+			 * We might miss some freshly created children
+			 * here, but it was never promised to be
+			 * accurate.
+			 */
+			if (list_is_last(&task->sibling, &start->children))
+				goto out;
+			task = list_first_entry(&task->sibling,
+						struct task_struct, sibling);
+			pid = get_pid(task_pid(task));
+			goto out;
+		}
+	}
+
+	/* Slow search case */
+	list_for_each_entry(task, &start->children, sibling) {
+		if (pos-- == 0) {
+			pid = get_pid(task_pid(task));
+			break;
+		}
+	}
+
+out:
+	read_unlock(&tasklist_lock);
+	return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	unsigned long pid = (unsigned long)pid_nr_ns(v, iter->pid_ns);
+
+	return seq_printf(seq, " %lu", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	struct pid *pid = NULL;
+
+	pid = get_children_pid(iter, v, *pos + 1);
+	if (!pid)
+		seq_printf(seq, "\n");
+	put_pid(v);
+
+	++*pos;
+	return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+	put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+	.start	= children_seq_start,
+	.next	= children_seq_next,
+	.stop	= children_seq_stop,
+	.show	= children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+	struct proc_pid_children_iter *iter = NULL;
+	struct task_struct *task = NULL;
+	int ret = 0;
+
+	task = get_proc_task(inode);
+	if (!task) {
+		ret = -ENOENT;
+		goto err;
+	}
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = seq_open(file, &children_seq_ops);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = iter;
+
+		iter->pid_start = get_pid(task_pid(task));
+		iter->pid_ns = inode->i_sb->s_fs_info;
+	}
+
+err:
+	if (task)
+		put_task_struct(task);
+	if (ret)
+		kfree(iter);
+
+	return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	struct proc_pid_children_iter *iter = m->private;
+
+	put_pid(iter->pid_start);
+	kfree(iter);
+
+	seq_release(inode, file);
+	return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+	.open    = children_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3454,6 +3454,7 @@ static const struct pid_entry tid_base_s
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_maps_operations),
+	REG("children",  S_IRUGO, proc_tid_children_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
 #endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
 				struct pid *pid, struct task_struct *task);
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
+extern const struct file_operations proc_tid_children_operations;
 extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-16 15:32 [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6 Cyrill Gorcunov
@ 2012-01-16 16:11 ` Oleg Nesterov
  2012-01-16 16:20   ` Cyrill Gorcunov
  2012-01-17 21:38 ` KOSAKI Motohiro
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-16 16:11 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On 01/16, Cyrill Gorcunov wrote:
>
> +get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
> +{
> +	struct task_struct *start, *task;
> +	struct pid *pid = NULL;
> +
> +	read_lock(&tasklist_lock);
> +
> +	start = pid_task(iter->pid_start, PIDTYPE_PID);
> +	if (!start)
> +		goto out;
> +
> +	/*
> +	 * Lets try to continue searching first, this gives
> +	 * us significant speedup on children-rich processes.
> +	 */
> +	if (pid_prev) {
> +		task = pid_task(pid_prev, PIDTYPE_PID);
> +		if (task && task->real_parent == start &&
> +		    !(list_empty(&task->sibling))) {

Damn. No, this is wrong.

Damn! Yes, it was we who told you to check list_empty(sibling) ;)

But this is not enough. exit_ptrace() can do list_move() without
changing ->real_parent.

I'll try to think. At first glance we can rely on EXIT_DEAD, but
I'd like to avoid this, I think EXIT_DEAD should die.

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-16 16:11 ` Oleg Nesterov
@ 2012-01-16 16:20   ` Cyrill Gorcunov
  2012-01-17 17:40     ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-16 16:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On Mon, Jan 16, 2012 at 05:11:14PM +0100, Oleg Nesterov wrote:
> On 01/16, Cyrill Gorcunov wrote:
> >
> > +get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
> > +{
> > +	struct task_struct *start, *task;
> > +	struct pid *pid = NULL;
> > +
> > +	read_lock(&tasklist_lock);
> > +
> > +	start = pid_task(iter->pid_start, PIDTYPE_PID);
> > +	if (!start)
> > +		goto out;
> > +
> > +	/*
> > +	 * Lets try to continue searching first, this gives
> > +	 * us significant speedup on children-rich processes.
> > +	 */
> > +	if (pid_prev) {
> > +		task = pid_task(pid_prev, PIDTYPE_PID);
> > +		if (task && task->real_parent == start &&
> > +		    !(list_empty(&task->sibling))) {
> 
> Damn. No, this is wrong.
> 
> Damn! Yes, it was we who told you to check list_empty(sibling) ;)
> 
> But this is not enough. exit_ptrace() can do list_move() without
> changing ->real_parent.
> 
> I'll try to think. At first glance we can rely on EXIT_DEAD, but
> I'd like to avoid this, I think EXIT_DEAD should die.
> 

Ouch! Thanks for catching this Oleg. I'll try to come with something
to show as well.

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-16 16:20   ` Cyrill Gorcunov
@ 2012-01-17 17:40     ` Oleg Nesterov
  2012-01-17 17:57       ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-17 17:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On 01/16, Cyrill Gorcunov wrote:
>
> On Mon, Jan 16, 2012 at 05:11:14PM +0100, Oleg Nesterov wrote:
> > On 01/16, Cyrill Gorcunov wrote:
> > >
> > > +get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
> > > +{
> > > +	struct task_struct *start, *task;
> > > +	struct pid *pid = NULL;
> > > +
> > > +	read_lock(&tasklist_lock);
> > > +
> > > +	start = pid_task(iter->pid_start, PIDTYPE_PID);
> > > +	if (!start)
> > > +		goto out;
> > > +
> > > +	/*
> > > +	 * Lets try to continue searching first, this gives
> > > +	 * us significant speedup on children-rich processes.
> > > +	 */
> > > +	if (pid_prev) {
> > > +		task = pid_task(pid_prev, PIDTYPE_PID);
> > > +		if (task && task->real_parent == start &&
> > > +		    !(list_empty(&task->sibling))) {
> >
> > Damn. No, this is wrong.
> >
> > Damn! Yes, it was we who told you to check list_empty(sibling) ;)
> >
> > But this is not enough. exit_ptrace() can do list_move() without
> > changing ->real_parent.
> >
> > I'll try to think. At first glance we can rely on EXIT_DEAD, but
> > I'd like to avoid this, I think EXIT_DEAD should die.
>
> Ouch! Thanks for catching this Oleg. I'll try to come with something
> to show as well.

Do you see another approach? I don't, so I'd suggest to check
"task->exit_state != EXIT_DEAD" instead of !list_empty().

Just in case, we can also check "start->exit_state == 0" instead
of "task->real_parent == start" with the same effect, up to you.

It would be nice to add the comment explaining these checks...

And I forgot to mention, the comment below

	> +			/*
	> +			 * We might miss some freshly created children
	> +			 * here, but it was never promised to be
	> +			 * accurate.
	> +			 */
	> +			if (list_is_last(&task->sibling, &start->children))
	> +				goto out;

looks misleading. Contrary to the slow path, we can't miss the
freshly forked child here, copy_process() does list_add_tail().

But the slow path obviously can skip much more than needed and
miss children (freshly forked or not), probably it would be better
to move the comment down and remove the "freshly created" part.

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-17 17:40     ` Oleg Nesterov
@ 2012-01-17 17:57       ` Cyrill Gorcunov
  2012-01-17 18:14         ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-17 17:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On Tue, Jan 17, 2012 at 06:40:49PM +0100, Oleg Nesterov wrote:
...
> > >
> > > But this is not enough. exit_ptrace() can do list_move() without
> > > changing ->real_parent.
> > >
> > > I'll try to think. At first glance we can rely on EXIT_DEAD, but
> > > I'd like to avoid this, I think EXIT_DEAD should die.
> >
> > Ouch! Thanks for catching this Oleg. I'll try to come with something
> > to show as well.
> 
> Do you see another approach? I don't, so I'd suggest to check
> "task->exit_state != EXIT_DEAD" instead of !list_empty().
> 

Well, I thought what if I can find another way without EXIT_DEAD
but seems there is no luck.

> Just in case, we can also check "start->exit_state == 0" instead
> of "task->real_parent == start" with the same effect, up to you.
> 

real_parent == start somehow more informative for me so if you allow
(and noone against) I would leave the current form.

> It would be nice to add the comment explaining these checks...
> 

Yeah, I'll add ones. Sure.

> And I forgot to mention, the comment below
> 
> 	> +			/*
> 	> +			 * We might miss some freshly created children
> 	> +			 * here, but it was never promised to be
> 	> +			 * accurate.
> 	> +			 */
> 	> +			if (list_is_last(&task->sibling, &start->children))
> 	> +				goto out;
> 
> looks misleading. Contrary to the slow path, we can't miss the
> freshly forked child here, copy_process() does list_add_tail().
> 

Ah, crap, indeed!

> But the slow path obviously can skip much more than needed and
> miss children (freshly forked or not), probably it would be better
> to move the comment down and remove the "freshly created" part.
> 
> What do you think?
> 
> 

Yeah, thanks a lot, Oleg. I'll update it an post for review (I hope
to finish it tonight ;)

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-17 17:57       ` Cyrill Gorcunov
@ 2012-01-17 18:14         ` Oleg Nesterov
  2012-01-17 18:30           ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-17 18:14 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On 01/17, Cyrill Gorcunov wrote:
>
> On Tue, Jan 17, 2012 at 06:40:49PM +0100, Oleg Nesterov wrote:
> ...
> > > >
> > > > But this is not enough. exit_ptrace() can do list_move() without
> > > > changing ->real_parent.
> > > >
> > > > I'll try to think. At first glance we can rely on EXIT_DEAD, but
> > > > I'd like to avoid this, I think EXIT_DEAD should die.
> > >
> > > Ouch! Thanks for catching this Oleg. I'll try to come with something
> > > to show as well.
> >
> > Do you see another approach? I don't, so I'd suggest to check
> > "task->exit_state != EXIT_DEAD" instead of !list_empty().
> >
>
> Well, I thought what if I can find another way without EXIT_DEAD
> but seems there is no luck.

Oooooh. Cyrill, it seems I managed to confuse you. And myself.

exit_ptrace() doesn't use ->sibling, it uses ->ptrace_entry!.

Sorry!

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-17 18:14         ` Oleg Nesterov
@ 2012-01-17 18:30           ` Cyrill Gorcunov
  0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-17 18:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Pavel Emelyanov, Serge Hallyn,
	KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin, Vasiliy Kulikov

On Tue, Jan 17, 2012 at 07:14:36PM +0100, Oleg Nesterov wrote:
...
> > > Do you see another approach? I don't, so I'd suggest to check
> > > "task->exit_state != EXIT_DEAD" instead of !list_empty().
> > >
> >
> > Well, I thought what if I can find another way without EXIT_DEAD
> > but seems there is no luck.
> 
> Oooooh. Cyrill, it seems I managed to confuse you. And myself.
> 
> exit_ptrace() doesn't use ->sibling, it uses ->ptrace_entry!.
> 
> Sorry!
> 

Heh ;) No worries, I seem to not understand this code enough anyway,
so I'll re-check it again and post an updated version of the patch which
should be reviewed with great caution.

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-16 15:32 [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6 Cyrill Gorcunov
  2012-01-16 16:11 ` Oleg Nesterov
@ 2012-01-17 21:38 ` KOSAKI Motohiro
  2012-01-18  9:43   ` Cyrill Gorcunov
  2012-01-18 13:58   ` Oleg Nesterov
  1 sibling, 2 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2012-01-17 21:38 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Oleg Nesterov, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

(1/16/12 10:32 AM), Cyrill Gorcunov wrote:
> When we do checkpoint of a task we need to know the list of children
> the task, has but there is no easy and fast way to generate reverse
> parent->children chain from arbitrary<pid>  (while a parent pid is
> provided in "PPid" field of /proc/<pid>/status).
>
> So instead of walking over all pids in the system (creating one big process
> tree in memory, just to figure out which children a task has) -- we add
> explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> this kind of information but it is not yet exported.

I doubt this is good idea. It move some complexity to userland, but not reduce.
Again, if we add this interface, it should help pstree like process traversal
tools. Bare task hierarchy shouldn't be exposed userland. I believe users need
sub process, not sub threads.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-17 21:38 ` KOSAKI Motohiro
@ 2012-01-18  9:43   ` Cyrill Gorcunov
  2012-01-18 13:58   ` Oleg Nesterov
  1 sibling, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-18  9:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Oleg Nesterov, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Tue, Jan 17, 2012 at 04:38:27PM -0500, KOSAKI Motohiro wrote:
> (1/16/12 10:32 AM), Cyrill Gorcunov wrote:
> >When we do checkpoint of a task we need to know the list of children
> >the task, has but there is no easy and fast way to generate reverse
> >parent->children chain from arbitrary<pid>  (while a parent pid is
> >provided in "PPid" field of /proc/<pid>/status).
> >
> >So instead of walking over all pids in the system (creating one big process
> >tree in memory, just to figure out which children a task has) -- we add
> >explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> >this kind of information but it is not yet exported.
> 
> I doubt this is good idea. It move some complexity to userland, but not reduce.
> Again, if we add this interface, it should help pstree like process traversal
> tools. Bare task hierarchy shouldn't be exposed userland. I believe users need
> sub process, not sub threads.
>

Which exactly complexity it moves to user-space? You have some task, and now
you can find all children easily, what the complexity you're talking about?
pstree is building the whole process tree going through all entries in
/proc/<pid>, reading PPid field and then forming the topology. I would like
to be able to find children faster. So I readdir a /proc/<pid>/task/ and
the walk over every /proc/<pid>/task/<tid>/children. This helps alot.

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-17 21:38 ` KOSAKI Motohiro
  2012-01-18  9:43   ` Cyrill Gorcunov
@ 2012-01-18 13:58   ` Oleg Nesterov
  2012-01-18 14:21     ` Cyrill Gorcunov
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-18 13:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cyrill Gorcunov, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On 01/17, KOSAKI Motohiro wrote:
>
> (1/16/12 10:32 AM), Cyrill Gorcunov wrote:
>> When we do checkpoint of a task we need to know the list of children
>> the task, has but there is no easy and fast way to generate reverse
>> parent->children chain from arbitrary<pid>  (while a parent pid is
>> provided in "PPid" field of /proc/<pid>/status).
>>
>> So instead of walking over all pids in the system (creating one big process
>> tree in memory, just to figure out which children a task has) -- we add
>> explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
>> this kind of information but it is not yet exported.
>
> I doubt this is good idea. It move some complexity to userland, but not reduce.
> Again, if we add this interface, it should help pstree like process traversal
> tools. Bare task hierarchy shouldn't be exposed userland. I believe users need
> sub process, not sub threads.

IOW, you mean that the reading from 'children' should list all children
of the whole thread group. Yes, we discussed this before a bit.

In this case this file should live in /proc/pid/, not in /proc/pid/task/tid.
But this doesn't allow to restore the hierarchy correctly, you need to know
the parent thread. So I think /proc/pid/tasks/tid/children is still needed.
And I think it does help pstree-like apps, just they need to look into task/.

As for /proc/pid/children, may be we can add it later. But this is not that
simple. The problem is the threaded reparenting, we do not want to list the
same child twice.

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-18 13:58   ` Oleg Nesterov
@ 2012-01-18 14:21     ` Cyrill Gorcunov
  2012-01-18 14:36       ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-18 14:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Wed, Jan 18, 2012 at 02:58:09PM +0100, Oleg Nesterov wrote:
> On 01/17, KOSAKI Motohiro wrote:
> >
> > (1/16/12 10:32 AM), Cyrill Gorcunov wrote:
> >> When we do checkpoint of a task we need to know the list of children
> >> the task, has but there is no easy and fast way to generate reverse
> >> parent->children chain from arbitrary<pid>  (while a parent pid is
> >> provided in "PPid" field of /proc/<pid>/status).
> >>
> >> So instead of walking over all pids in the system (creating one big process
> >> tree in memory, just to figure out which children a task has) -- we add
> >> explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> >> this kind of information but it is not yet exported.
> >
> > I doubt this is good idea. It move some complexity to userland, but not reduce.
> > Again, if we add this interface, it should help pstree like process traversal
> > tools. Bare task hierarchy shouldn't be exposed userland. I believe users need
> > sub process, not sub threads.
> 
> IOW, you mean that the reading from 'children' should list all children
> of the whole thread group. Yes, we discussed this before a bit.
> 
> In this case this file should live in /proc/pid/, not in /proc/pid/task/tid.
> But this doesn't allow to restore the hierarchy correctly, you need to know
> the parent thread. So I think /proc/pid/tasks/tid/children is still needed.
> And I think it does help pstree-like apps, just they need to look into task/.
> 
> As for /proc/pid/children, may be we can add it later. But this is not that
> simple. The problem is the threaded reparenting, we do not want to list the
> same child twice.
> 

Yes, /proc/pid/children was initial target but then I fall back to hardness
and though implementing more simplier solution is better approach.

So Oleg, I think you meant something like below? Comment is moved down an
list_empty over siblings remans, right?

Btw, should it be under CONFIG_CHECKPOINT_RESTORE or we could live with it
in general?

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6

When we do checkpoint of a task we need to know the list of children
the task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).

So instead of walking over all pids in the system (creating one big process
tree in memory, just to figure out which children a task has) -- we add
explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

v2:
 - Kame suggested to use a separated /proc/<pid>/children entry
   instead of poking /proc/<pid>/status
 - Andew suggested to use rcu facility instead of locking
   tasklist_lock
 - Tejun pointed that non-seekable seq file might not be
   enough for tasks with large number of children

v3:
 - To be on a safe side use %lu format for pid_t printing

v4:
 - New line get printed when sequence ends not at seq->stop,
   a nit pointed by Tejun
 - Documentation update
 - tasklist_lock is back, Oleg pointed that ->children list
   is actually not rcu-safe

v5:
 - Oleg suggested to make /proc/<pid>/task/<tid>/children
   instead of global /proc/<pid>/children, which eliminates
   hardness related to threads and children migration, and
   allows patch to be a way simplier.

v6:
 - Drop ptrace_may_access tests, pids are can be found anyway
   so nothing to protect here.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 Documentation/filesystems/proc.txt |   18 ++++
 fs/proc/array.c                    |  147 +++++++++++++++++++++++++++++++++++++
 fs/proc/base.c                     |    1 
 fs/proc/internal.h                 |    1 
 4 files changed, 167 insertions(+)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.7   /proc/<pid>/task/<tid>/children - Information about task children
 
   4	Configuring procfs
   4.1	Mount options
@@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
 comm value.
 
 
+3.7	/proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results, which means if there a new child
+created while we read the "children" file -- it might or might not be
+counted. So one need to either stop or freeze processes if precise
+results are needed.
+
+
 ------------------------------------------------------------------------------
 Configuring procfs
 ------------------------------------------------------------------------------
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,150 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+struct proc_pid_children_iter {
+	struct pid_namespace *pid_ns;
+	struct pid *pid_start;
+};
+
+static struct pid *
+get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
+{
+	struct task_struct *start, *task;
+	struct pid *pid = NULL;
+
+	read_lock(&tasklist_lock);
+
+	start = pid_task(iter->pid_start, PIDTYPE_PID);
+	if (!start)
+		goto out;
+
+	/*
+	 * Lets try to continue searching first, this gives
+	 * us significant speedup on children-rich processes.
+	 */
+	if (pid_prev) {
+		task = pid_task(pid_prev, PIDTYPE_PID);
+		if (task && task->real_parent == start &&
+		    !(list_empty(&task->sibling))) {
+			if (list_is_last(&task->sibling, &start->children))
+				goto out;
+			task = list_first_entry(&task->sibling,
+						struct task_struct, sibling);
+			pid = get_pid(task_pid(task));
+			goto out;
+		}
+	}
+
+	/*
+	 * Slow search case
+	 *
+	 * We might miss some freshly created children
+	 * here, but it was never promised to be
+	 * accurate.
+	 */
+	list_for_each_entry(task, &start->children, sibling) {
+		if (pos-- == 0) {
+			pid = get_pid(task_pid(task));
+			break;
+		}
+	}
+
+out:
+	read_unlock(&tasklist_lock);
+	return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	unsigned long pid = (unsigned long)pid_nr_ns(v, iter->pid_ns);
+
+	return seq_printf(seq, " %lu", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	struct pid *pid = NULL;
+
+	pid = get_children_pid(iter, v, *pos + 1);
+	if (!pid)
+		seq_printf(seq, "\n");
+	put_pid(v);
+
+	++*pos;
+	return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+	put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+	.start	= children_seq_start,
+	.next	= children_seq_next,
+	.stop	= children_seq_stop,
+	.show	= children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+	struct proc_pid_children_iter *iter = NULL;
+	struct task_struct *task = NULL;
+	int ret = 0;
+
+	task = get_proc_task(inode);
+	if (!task) {
+		ret = -ENOENT;
+		goto err;
+	}
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = seq_open(file, &children_seq_ops);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = iter;
+
+		iter->pid_start = get_pid(task_pid(task));
+		iter->pid_ns = inode->i_sb->s_fs_info;
+	}
+
+err:
+	if (task)
+		put_task_struct(task);
+	if (ret)
+		kfree(iter);
+
+	return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	struct proc_pid_children_iter *iter = m->private;
+
+	put_pid(iter->pid_start);
+	kfree(iter);
+
+	seq_release(inode, file);
+	return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+	.open    = children_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3454,6 +3454,7 @@ static const struct pid_entry tid_base_s
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_maps_operations),
+	REG("children",  S_IRUGO, proc_tid_children_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
 #endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
 				struct pid *pid, struct task_struct *task);
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
+extern const struct file_operations proc_tid_children_operations;
 extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-18 14:21     ` Cyrill Gorcunov
@ 2012-01-18 14:36       ` Oleg Nesterov
  2012-01-18 18:23         ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-18 14:36 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On 01/18, Cyrill Gorcunov wrote:
>
> So Oleg, I think you meant something like below? Comment is moved down an
> list_empty over siblings remans, right?

Yes, except the comment still looks misleading to me.

Otherwise looks correct, but I'll try to re-check once again with
the fresh head. Although I think you should remove me from CC: after
I found the nonexistent bug ;)

> +get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
> +{
> ...
> +	/*
> +	 * Slow search case
> +	 *
> +	 * We might miss some freshly created children
> +	 * here, but it was never promised to be
> +	 * accurate.
> +	 */
> +	list_for_each_entry(task, &start->children, sibling) {
> +		if (pos-- == 0) {
> +			pid = get_pid(task_pid(task));
> +			break;
> +		}
> +	}

This is minor, but "freshly created" looks very confusing to me.
What does it mean? We hold tasklist, we can't race with fork().

Yes we can miss a child, but this has nothing to do with "freshly".
Just suppose that the parent sleeps, but N children exit after we
printed their tids. Now the slow paths skips N extra children, we
miss N tasks.

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-18 14:36       ` Oleg Nesterov
@ 2012-01-18 18:23         ` Cyrill Gorcunov
  2012-01-18 19:07           ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-18 18:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Wed, Jan 18, 2012 at 03:36:31PM +0100, Oleg Nesterov wrote:
> On 01/18, Cyrill Gorcunov wrote:
> >
> > So Oleg, I think you meant something like below? Comment is moved down an
> > list_empty over siblings remans, right?
> 
> Yes, except the comment still looks misleading to me.
> 
> Otherwise looks correct, but I'll try to re-check once again with
> the fresh head. Although I think you should remove me from CC: after
> I found the nonexistent bug ;)
> 

There is no way back from CC ;)

> This is minor, but "freshly created" looks very confusing to me.
> What does it mean? We hold tasklist, we can't race with fork().
> 

Hmm. Sure we keep a lock here, but changes might happen between reads
If only I'm not missing something again.

Look which scenario I've had in mind. We have a task A and children B,C,D,E.
... Here my scenario ended and I realised that you're right. I'll update
the comment.

> Yes we can miss a child, but this has nothing to do with "freshly".
> Just suppose that the parent sleeps, but N children exit after we
> printed their tids. Now the slow paths skips N extra children, we
> miss N tasks.
> 
> Oleg.
> 

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-18 18:23         ` Cyrill Gorcunov
@ 2012-01-18 19:07           ` Cyrill Gorcunov
  2012-01-19 14:10             ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-18 19:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Wed, Jan 18, 2012 at 10:23:44PM +0400, Cyrill Gorcunov wrote:
> On Wed, Jan 18, 2012 at 03:36:31PM +0100, Oleg Nesterov wrote:
> > On 01/18, Cyrill Gorcunov wrote:
> > >
> > > So Oleg, I think you meant something like below? Comment is moved down an
> > > list_empty over siblings remans, right?
> > 
> > Yes, except the comment still looks misleading to me.
> > 
> > Otherwise looks correct, but I'll try to re-check once again with
> > the fresh head. Although I think you should remove me from CC: after
> > I found the nonexistent bug ;)
> > 
> 
> There is no way back from CC ;)
> 
> > This is minor, but "freshly created" looks very confusing to me.
> > What does it mean? We hold tasklist, we can't race with fork().
> > 
> 
> Hmm. Sure we keep a lock here, but changes might happen between reads
> If only I'm not missing something again.
> 
> Look which scenario I've had in mind. We have a task A and children B,C,D,E.
> ... Here my scenario ended and I realised that you're right. I'll update
> the comment.
> 
> > Yes we can miss a child, but this has nothing to do with "freshly".
> > Just suppose that the parent sleeps, but N children exit after we
> > printed their tids. Now the slow paths skips N extra children, we
> > miss N tasks.
> > 
> > Oleg.
> > 
> 
> 	Cyrill

I suppose it might be something like below. I've updated comment and
quoted your comment there just I wont forget this next time I'll be
reading the source. Thanks!

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6

When we do checkpoint of a task we need to know the list of children
the task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).

So instead of walking over all pids in the system (creating one big process
tree in memory, just to figure out which children a task has) -- we add
explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

v2:
 - Kame suggested to use a separated /proc/<pid>/children entry
   instead of poking /proc/<pid>/status
 - Andew suggested to use rcu facility instead of locking
   tasklist_lock
 - Tejun pointed that non-seekable seq file might not be
   enough for tasks with large number of children

v3:
 - To be on a safe side use %lu format for pid_t printing

v4:
 - New line get printed when sequence ends not at seq->stop,
   a nit pointed by Tejun
 - Documentation update
 - tasklist_lock is back, Oleg pointed that ->children list
   is actually not rcu-safe

v5:
 - Oleg suggested to make /proc/<pid>/task/<tid>/children
   instead of global /proc/<pid>/children, which eliminates
   hardness related to threads and children migration, and
   allows patch to be a way simplier.

v6:
 - Drop ptrace_may_access tests, pids are can be found anyway
   so nothing to protect here.
 - Update comments and docs, pointed by Oleg.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 Documentation/filesystems/proc.txt |   18 ++++
 fs/proc/array.c                    |  155 +++++++++++++++++++++++++++++++++++++
 fs/proc/base.c                     |    1 
 fs/proc/internal.h                 |    1 
 4 files changed, 175 insertions(+)

Index: linux-2.6.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.git/Documentation/filesystems/proc.txt
@@ -40,6 +40,7 @@ Table of Contents
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.7   /proc/<pid>/task/<tid>/children - Information about task children
 
   4	Configuring procfs
   4.1	Mount options
@@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
 comm value.
 
 
+3.7	/proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results and some children might be
+skipped, especially if they've exited right after we printed their
+pids, so one need to either stop or freeze processes being inspected
+if precise results are needed.
+
+
 ------------------------------------------------------------------------------
 Configuring procfs
 ------------------------------------------------------------------------------
Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,158 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+struct proc_pid_children_iter {
+	struct pid_namespace *pid_ns;
+	struct pid *pid_start;
+};
+
+static struct pid *
+get_children_pid(struct proc_pid_children_iter *iter, struct pid *pid_prev, loff_t pos)
+{
+	struct task_struct *start, *task;
+	struct pid *pid = NULL;
+
+	read_lock(&tasklist_lock);
+
+	start = pid_task(iter->pid_start, PIDTYPE_PID);
+	if (!start)
+		goto out;
+
+	/*
+	 * Lets try to continue searching first, this gives
+	 * us significant speedup on children-rich processes.
+	 */
+	if (pid_prev) {
+		task = pid_task(pid_prev, PIDTYPE_PID);
+		if (task && task->real_parent == start &&
+		    !(list_empty(&task->sibling))) {
+			if (list_is_last(&task->sibling, &start->children))
+				goto out;
+			task = list_first_entry(&task->sibling,
+						struct task_struct, sibling);
+			pid = get_pid(task_pid(task));
+			goto out;
+		}
+	}
+
+	/*
+	 * Slow search case.
+	 *
+	 * We might miss some children here if children
+	 * are exited while we were not holding the lock,
+	 * but it was never promised to be accurate that
+	 * much.
+	 *
+	 * "Just suppose that the parent sleeps, but N children
+	 *  exit after we printed their tids. Now the slow paths
+	 *  skips N extra children, we miss N tasks." (c)
+	 *
+	 * So one need to stop or freeze the leader and all
+	 * its children to get a precise result.
+	 */
+	list_for_each_entry(task, &start->children, sibling) {
+		if (pos-- == 0) {
+			pid = get_pid(task_pid(task));
+			break;
+		}
+	}
+
+out:
+	read_unlock(&tasklist_lock);
+	return pid;
+}
+
+static int children_seq_show(struct seq_file *seq, void *v)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	unsigned long pid = (unsigned long)pid_nr_ns(v, iter->pid_ns);
+
+	return seq_printf(seq, " %lu", pid);
+}
+
+static void *children_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return get_children_pid(seq->private, NULL, *pos);
+}
+
+static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct proc_pid_children_iter *iter = seq->private;
+	struct pid *pid = NULL;
+
+	pid = get_children_pid(iter, v, *pos + 1);
+	if (!pid)
+		seq_printf(seq, "\n");
+	put_pid(v);
+
+	++*pos;
+	return pid;
+}
+
+static void children_seq_stop(struct seq_file *seq, void *v)
+{
+	put_pid(v);
+}
+
+static const struct seq_operations children_seq_ops = {
+	.start	= children_seq_start,
+	.next	= children_seq_next,
+	.stop	= children_seq_stop,
+	.show	= children_seq_show,
+};
+
+static int children_seq_open(struct inode *inode, struct file *file)
+{
+	struct proc_pid_children_iter *iter = NULL;
+	struct task_struct *task = NULL;
+	int ret = 0;
+
+	task = get_proc_task(inode);
+	if (!task) {
+		ret = -ENOENT;
+		goto err;
+	}
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = seq_open(file, &children_seq_ops);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = iter;
+
+		iter->pid_start = get_pid(task_pid(task));
+		iter->pid_ns = inode->i_sb->s_fs_info;
+	}
+
+err:
+	if (task)
+		put_task_struct(task);
+	if (ret)
+		kfree(iter);
+
+	return ret;
+}
+
+int children_seq_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	struct proc_pid_children_iter *iter = m->private;
+
+	put_pid(iter->pid_start);
+	kfree(iter);
+
+	seq_release(inode, file);
+	return 0;
+}
+
+const struct file_operations proc_tid_children_operations = {
+	.open    = children_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = children_seq_release,
+};
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3454,6 +3454,7 @@ static const struct pid_entry tid_base_s
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_maps_operations),
+	REG("children",  S_IRUGO, proc_tid_children_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
 #endif
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
 				struct pid *pid, struct task_struct *task);
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
+extern const struct file_operations proc_tid_children_operations;
 extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-18 19:07           ` Cyrill Gorcunov
@ 2012-01-19 14:10             ` Oleg Nesterov
  2012-01-19 14:47               ` Cyrill Gorcunov
  2012-01-19 15:55               ` Oleg Nesterov
  0 siblings, 2 replies; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-19 14:10 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On 01/18, Cyrill Gorcunov wrote:
>
> I suppose it might be something like below. I've updated comment and
> quoted your comment there just I wont forget this next time I'll be
> reading the source. Thanks!

I believe the patrch is correct.

But... Cyrill, I am wondering how much will you hate me if I make
yet another attempt to delay this patch.

> +static int children_seq_open(struct inode *inode, struct file *file)
> +{
> +	struct proc_pid_children_iter *iter = NULL;
> +	struct task_struct *task = NULL;
> +	int ret = 0;
> +
> +	task = get_proc_task(inode);
> +	if (!task) {
> +		ret = -ENOENT;
> +		goto err;
> +	}

For what??

> +	if (!ret) {
> +		struct seq_file *m = file->private_data;
> +		m->private = iter;
> +
> +		iter->pid_start = get_pid(task_pid(task));

This is what we need, right? So can't we remove this "task_struct *task"
and simply do

		iter->pid_start = get_ppid(proc_pid(inode));

?

And while this is absolutely cosmetic probably ->parent_pid is
a bit better name, but this is up to you.

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-19 14:10             ` Oleg Nesterov
@ 2012-01-19 14:47               ` Cyrill Gorcunov
  2012-01-19 15:55               ` Oleg Nesterov
  1 sibling, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-19 14:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Thu, Jan 19, 2012 at 03:10:51PM +0100, Oleg Nesterov wrote:
> On 01/18, Cyrill Gorcunov wrote:
> >
> > I suppose it might be something like below. I've updated comment and
> > quoted your comment there just I wont forget this next time I'll be
> > reading the source. Thanks!
> 
> I believe the patrch is correct.
> 
> But... Cyrill, I am wondering how much will you hate me if I make
> yet another attempt to delay this patch.
> 

I'm ready to do any iterations until patch become correct. It's better
to spend some time now than later.

> > +static int children_seq_open(struct inode *inode, struct file *file)
> > +{
> > +	struct proc_pid_children_iter *iter = NULL;
> > +	struct task_struct *task = NULL;
> > +	int ret = 0;
> > +
> > +	task = get_proc_task(inode);
> > +	if (!task) {
> > +		ret = -ENOENT;
> > +		goto err;
> > +	}
> 
> For what??
> 
> > +	if (!ret) {
> > +		struct seq_file *m = file->private_data;
> > +		m->private = iter;
> > +
> > +		iter->pid_start = get_pid(task_pid(task));
> 
> This is what we need, right? So can't we remove this "task_struct *task"
> and simply do
> 
> 		iter->pid_start = get_ppid(proc_pid(inode));
> 

Yeah, this one will be cleaner. Thanks!

> ?
> 
> And while this is absolutely cosmetic probably ->parent_pid is
> a bit better name, but this is up to you.
> 

Yeah, I'll change the name, I liked this one more.

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-19 14:10             ` Oleg Nesterov
  2012-01-19 14:47               ` Cyrill Gorcunov
@ 2012-01-19 15:55               ` Oleg Nesterov
  2012-01-19 16:27                 ` Cyrill Gorcunov
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-19 15:55 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On 01/19, Oleg Nesterov wrote:
>
> On 01/18, Cyrill Gorcunov wrote:
> >
> > I suppose it might be something like below. I've updated comment and
> > quoted your comment there just I wont forget this next time I'll be
> > reading the source. Thanks!
>
> I believe the patrch is correct.
>
> But... Cyrill, I am wondering how much will you hate me if I make
> yet another attempt to delay this patch.

Cough... and another attempt...

> > +static int children_seq_open(struct inode *inode, struct file *file)
> > +{
> > +	struct proc_pid_children_iter *iter = NULL;
> > +	struct task_struct *task = NULL;
> > +	int ret = 0;
> > +
> > +	task = get_proc_task(inode);
> > +	if (!task) {
> > +		ret = -ENOENT;
> > +		goto err;
> > +	}
>
> For what??
>
> > +	if (!ret) {
> > +		struct seq_file *m = file->private_data;
> > +		m->private = iter;
> > +
> > +		iter->pid_start = get_pid(task_pid(task));
>
> This is what we need, right? So can't we remove this "task_struct *task"
> and simply do
>
> 		iter->pid_start = get_ppid(proc_pid(inode));
>
> ?
>
> And while this is absolutely cosmetic probably ->parent_pid is
> a bit better name, but this is up to you.

Thinking more... I am not sure, but do we really need
proc_pid_children_iter at all??

It is very possibly I missed something, but we can get both
parent_pid and pid_ns from inode, right? so can't we just remember
inode in seq->private?

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-19 15:55               ` Oleg Nesterov
@ 2012-01-19 16:27                 ` Cyrill Gorcunov
  2012-01-19 16:44                   ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-19 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Thu, Jan 19, 2012 at 04:55:29PM +0100, Oleg Nesterov wrote:
> 
> Thinking more... I am not sure, but do we really need
> proc_pid_children_iter at all??
> 
> It is very possibly I missed something, but we can get both
> parent_pid and pid_ns from inode, right? so can't we just remember
> inode in seq->private?
> 

Good point. Letme check if we will need to call for ihold then...
(/me scratchig the head)

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-19 16:27                 ` Cyrill Gorcunov
@ 2012-01-19 16:44                   ` Oleg Nesterov
  2012-01-19 17:07                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2012-01-19 16:44 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On 01/19, Cyrill Gorcunov wrote:
>
> On Thu, Jan 19, 2012 at 04:55:29PM +0100, Oleg Nesterov wrote:
> >
> > Thinking more... I am not sure, but do we really need
> > proc_pid_children_iter at all??
> >
> > It is very possibly I missed something, but we can get both
> > parent_pid and pid_ns from inode, right? so can't we just remember
> > inode in seq->private?
>
> Good point. Letme check if we will need to call for ihold then...
> (/me scratchig the head)

Why? file/inode can't go away, at least until this fd is closed.

And just in case, get_pid(proc_pid(inode)) is not needed, even in v6/v7.
I didn't realize this.

Cyrill, I won't argue if you prefer to make this in a separate patch
(of course, assuming you are agree) on top of v7, to me it looks
"good enough".

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6
  2012-01-19 16:44                   ` Oleg Nesterov
@ 2012-01-19 17:07                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2012-01-19 17:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, LKML, Andrew Morton, Pavel Emelyanov,
	Serge Hallyn, KAMEZAWA Hiroyuki, Tejun Heo, Andrew Vagin,
	Vasiliy Kulikov

On Thu, Jan 19, 2012 at 05:44:13PM +0100, Oleg Nesterov wrote:
> On 01/19, Cyrill Gorcunov wrote:
> >
> > On Thu, Jan 19, 2012 at 04:55:29PM +0100, Oleg Nesterov wrote:
> > >
> > > Thinking more... I am not sure, but do we really need
> > > proc_pid_children_iter at all??
> > >
> > > It is very possibly I missed something, but we can get both
> > > parent_pid and pid_ns from inode, right? so can't we just remember
> > > inode in seq->private?
> >
> > Good point. Letme check if we will need to call for ihold then...
> > (/me scratchig the head)
> 
> Why? file/inode can't go away, at least until this fd is closed.
> 
> And just in case, get_pid(proc_pid(inode)) is not needed, even in v6/v7.
> I didn't realize this.
> 

Grr, indeed.

> Cyrill, I won't argue if you prefer to make this in a separate patch
> (of course, assuming you are agree) on top of v7, to me it looks
> "good enough".
> 

I think better to make completely updated version then, than pushing
changes on top. it'll be better to put anything in one patch I think.

Thanks!

	Cyrill

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-01-19 17:08 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-16 15:32 [RFC] fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v6 Cyrill Gorcunov
2012-01-16 16:11 ` Oleg Nesterov
2012-01-16 16:20   ` Cyrill Gorcunov
2012-01-17 17:40     ` Oleg Nesterov
2012-01-17 17:57       ` Cyrill Gorcunov
2012-01-17 18:14         ` Oleg Nesterov
2012-01-17 18:30           ` Cyrill Gorcunov
2012-01-17 21:38 ` KOSAKI Motohiro
2012-01-18  9:43   ` Cyrill Gorcunov
2012-01-18 13:58   ` Oleg Nesterov
2012-01-18 14:21     ` Cyrill Gorcunov
2012-01-18 14:36       ` Oleg Nesterov
2012-01-18 18:23         ` Cyrill Gorcunov
2012-01-18 19:07           ` Cyrill Gorcunov
2012-01-19 14:10             ` Oleg Nesterov
2012-01-19 14:47               ` Cyrill Gorcunov
2012-01-19 15:55               ` Oleg Nesterov
2012-01-19 16:27                 ` Cyrill Gorcunov
2012-01-19 16:44                   ` Oleg Nesterov
2012-01-19 17:07                     ` Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).