* [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
@ 2016-08-02  0:25 Jann Horn
       [not found] ` <b97bbf47-1180-0d32-ba08-1482020cc883@gmail.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2016-08-02  0:25 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w; +Cc: linux-man-u79uwXL29TY76Z2rM5mHXA

Document the /proc/[pid]/task/[tid]/children interface from CRIU, and more
importantly, document why it's usually not a good interface.
---
 man5/proc.5 | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/man5/proc.5 b/man5/proc.5
index 0970c72..ddb14cc 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -2325,14 +2325,33 @@ the corresponding files under
 .I task/[tid]
 may have different values (e.g., various fields in each of the
 .I task/[tid]/status
-files may be different for each thread).
-
+files may be different for each thread),
+.\" in particular: "children" :/
+or they might not exist in
+.I /proc/[pid]
+at all.
 .\" The following was still true as at kernel 2.6.13
 In a multithreaded process, the contents of the
 .I /proc/[pid]/task
 directory are not available if the main thread has already terminated
 (typically by calling
 .BR pthread_exit (3)).
+
+.TP
+.IR /proc/[pid]/task/[tid]/children " (since Linux 3.5)"
+.\" commit 818411616baf46ceba0cff6f05af3a9b294734f7
+A space-separated list of child tasks of this task.
+Each child task is represented by its TID.
+
+.\" see comments in get_children_pid() in fs/proc/array.c
+This does not work properly if children of the target task exit while
+the file is being read!
+Exiting children may cause non-exiting children to be omitted from
+the list.
+This makes this interface even more unreliable than classic PID-based
+approaches if the inspected task and its children aren't frozen, and
+most code should probably not use this interface.
+
 .TP
 .IR /proc/[pid]/timers " (since Linux 3.10)"
 .\" commit 5ed67f05f66c41e39880a6d61358438a25f9fee5
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]   ` <b97bbf47-1180-0d32-ba08-1482020cc883-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-08-03 22:52     ` Jann Horn
       [not found]       ` <20160803225254.GA14948-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2016-08-03 22:52 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man-u79uwXL29TY76Z2rM5mHXA, Iago López Galeiras,
	Cyrill Gorcunov


On Thu, Aug 04, 2016 at 08:46:03AM +1000, Michael Kerrisk (man-pages) wrote:
> [Adding a few people to CC, who may also be interested]
> 
> Hi Jann,
> 
> On 08/02/2016 10:25 AM, Jann Horn wrote:
> >Document the /proc/[pid]/task/[tid]/children interface from CRIU, and more
> >importantly, document why it's usually not a good interface.
> >---
> > man5/proc.5 | 23 +++++++++++++++++++++--
> > 1 file changed, 21 insertions(+), 2 deletions(-)
> >
> >diff --git a/man5/proc.5 b/man5/proc.5
> >index 0970c72..ddb14cc 100644
> >--- a/man5/proc.5
> >+++ b/man5/proc.5
> >@@ -2325,14 +2325,33 @@ the corresponding files under
> > .I task/[tid]
> > may have different values (e.g., various fields in each of the
> > .I task/[tid]/status
> >-files may be different for each thread).
> >-
> >+files may be different for each thread),
> >+.\" in particular: "children" :/
> >+or they might not exist in
> >+.I /proc/[pid]
> >+at all.
> > .\" The following was still true as at kernel 2.6.13
> > In a multithreaded process, the contents of the
> > .I /proc/[pid]/task
> > directory are not available if the main thread has already terminated
> > (typically by calling
> > .BR pthread_exit (3)).
> >+
> >+.TP
> >+.IR /proc/[pid]/task/[tid]/children " (since Linux 3.5)"
> >+.\" commit 818411616baf46ceba0cff6f05af3a9b294734f7
> >+A space-separated list of child tasks of this task.
> >+Each child task is represented by its TID.
> >+
> >+.\" see comments in get_children_pid() in fs/proc/array.c
> >+This does not work properly if children of the target task exit while
> >+the file is being read!
> >+Exiting children may cause non-exiting children to be omitted from
> >+the list.
> >+This makes this interface even more unreliable than classic PID-based
> >+approaches if the inspected task and its children aren't frozen, and
> >+most code should probably not use this interface.
> >+
> > .TP
> > .IR /proc/[pid]/timers " (since Linux 3.10)"
> > .\" commit 5ed67f05f66c41e39880a6d61358438a25f9fee5
> 
> Thanks for this! I tweaked your text somewhat, and added some
> details about kernel configuration options, so that now the text
> reads:
> 
>        /proc/[pid]/task/[tid]/children (since Linux 3.5)
>               A  space-separated  list  of child tasks of this task.
>               Each child task is represented by its TID.
> 
>               This option is intended for  use  by  the  checkpoint-
>               restore (CRIU) system, and reliably provides a list of
>               children only  if  all  of  the  child  processes  are
>               stopped or frozen.  It does not work properly if chil‐
>               dren of the target task exit while the file  is  being
>               read!  Exiting children may cause non-exiting children
>               to be omitted from the list.  This makes  this  inter‐
>               face  even  more  unreliable  than  classic  PID-based
>               approaches if the  inspected  task  and  its  children
>               aren't  frozen,  and most code should probably not use
>               this interface.
> 
>               Until Linux 4.2, the presence of this file was governed
>               by the CONFIG_CHECKPOINT_RESTORE kernel configuration
>               option.  Since Linux 4.2, it is governed by the
>               CONFIG_PROC_CHILDREN option.
> 
> Look okay?

Looks good to me.


* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]       ` <20160803225254.GA14948-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
@ 2016-08-14  8:40         ` Cyrill Gorcunov
       [not found]           ` <20160814084026.GA1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Cyrill Gorcunov @ 2016-08-14  8:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Iago López Galeiras

On Thu, Aug 04, 2016 at 12:52:54AM +0200, Jann Horn wrote:
...
> > 
> > Thanks for this! I tweaked your text somewhat, and added some
> > details about kernel configuration options, so that now the text
> > reads:
> > 
> >        /proc/[pid]/task/[tid]/children (since Linux 3.5)
> >               A  space-separated  list  of child tasks of this task.
> >               Each child task is represented by its TID.
> > 
> >               This option is intended for  use  by  the  checkpoint-
> >               restore (CRIU) system, and reliably provides a list of
> >               children only  if  all  of  the  child  processes  are
> >               stopped or frozen.  It does not work properly if chil‐
> >               dren of the target task exit while the file  is  being
> >               read!  Exiting children may cause non-exiting children
> >               to be omitted from the list.  This makes  this  inter‐
> >               face  even  more  unreliable  than  classic  PID-based
> >               approaches if the  inspected  task  and  its  children
> >               aren't  frozen,  and most code should probably not use
> >               this interface.

Hi! First of all, sorry for the delay. Guys, this is not really true. The same
applies to plain "ls /proc". You can fetch a pid from procfs and then the
process may die right after you've finished reading. So this interface
works "properly" all the time, but anyone who needs precise results should
stop/freeze the processes first. On the contrary, I think it's worth switching
to the children interface in user-space programs because it is incredibly fast.

> > 
> >               Until Linux 4.2, the presence of this file was governed
> >               by the CONFIG_CHECKPOINT_RESTORE kernel configuration
> >               option.  Since Linux 4.2, it is governed by the
> >               CONFIG_PROC_CHILDREN option.
> > 
> > Look okay?
> 
> Looks good to me.



	Cyrill

* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]           ` <20160814084026.GA1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
@ 2016-08-14 10:48             ` Jann Horn
       [not found]               ` <20160814104856.GA12246-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2016-08-14 10:48 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Iago López Galeiras


On Sun, Aug 14, 2016 at 11:40:26AM +0300, Cyrill Gorcunov wrote:
> On Thu, Aug 04, 2016 at 12:52:54AM +0200, Jann Horn wrote:
> ...
> > > 
> > > Thanks for this! I tweaked your text somewhat, and added some
> > > details about kernel configuration options, so that now the text
> > > reads:
> > > 
> > >        /proc/[pid]/task/[tid]/children (since Linux 3.5)
> > >               A  space-separated  list  of child tasks of this task.
> > >               Each child task is represented by its TID.
> > > 
> > >               This option is intended for  use  by  the  checkpoint-
> > >               restore (CRIU) system, and reliably provides a list of
> > >               children only  if  all  of  the  child  processes  are
> > >               stopped or frozen.  It does not work properly if chil‐
> > >               dren of the target task exit while the file  is  being
> > >               read!  Exiting children may cause non-exiting children
> > >               to be omitted from the list.  This makes  this  inter‐
> > >               face  even  more  unreliable  than  classic  PID-based
> > >               approaches if the  inspected  task  and  its  children
> > >               aren't  frozen,  and most code should probably not use
> > >               this interface.
> 
> Hi! First of all, sorry for delay. Guys, this is not really true. The same
> applies to plain "ls /proc".

It does not. /proc is wobbly in a running system, /proc/$pid/children is
completely unreliable.


> You can fetch pid from the procfs and then
> process get dead just right after you've finished reading. So this interface
> works "properly" all the time, but if one needs precise results it should
> stop/freeze processes first. In contrary I think it worth switching into
> children interface in user-space programs because it incredibly fast.

In procfs, when you want to enumerate all tasks that are currently running,
you can do the following:

 - Read /proc with readdir() or so, but discard all information except for
   the PIDs.
 - For each PID:
  - chdir() into /proc/$pid
  - stat '.' and read files inside '.'

This will yield information about all tasks that were running at the start
of the operation and are still running. AFAIK, the internal consistency of
per-task data has the following guarantee: All data that was collected as
per-task data really belongs to the same task; PID reuse has no effect on
that (because the /proc/$pid inode will not be reassociated with a new
task that reuses the PID). Of course, different pieces of data that were
collected at different points in time can still be somewhat inconsistent -
especially if an execve() call happens in the meantime.
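
As a rough illustration of the enumeration steps above, here is a Python sketch. It is an example under stated assumptions, not code from this thread: the name `snapshot_tasks` and the `proc_root` and `files` parameters are made up for illustration, and it holds one directory file descriptor per task instead of calling chdir().

```python
import os


def snapshot_tasks(proc_root="/proc", files=("stat",)):
    """Collect per-task data for every numeric entry under proc_root.

    Each task directory is opened once and its files are then read
    relative to that descriptor, so all data recorded for one entry
    comes from the same directory object even if the PID number is
    reused in the meantime.
    """
    # Step 1: read the directory, keeping nothing but the PIDs.
    pids = [int(name) for name in os.listdir(proc_root) if name.isdigit()]
    result = {}
    for pid in pids:
        try:
            dir_fd = os.open(os.path.join(proc_root, str(pid)),
                             os.O_RDONLY | os.O_DIRECTORY)
        except OSError:
            continue  # task exited before its directory could be opened
        try:
            data = {}
            for name in files:
                # Equivalent of openat(dir_fd, name, O_RDONLY).
                fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
                with os.fdopen(fd, "rb") as f:
                    data[name] = f.read()
            result[pid] = data
        except OSError:
            pass  # task exited while being read; drop the partial data
        finally:
            os.close(dir_fd)
    return result
```

Passing a different `proc_root` is only there so the sketch can be tried against a fake directory tree; on a real system the default would be used.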

Looking up the procfs inodes corresponding to the parents or children of
a process is a bit more complicated, but still doable. To look up the
parent inode for a /proc/$pid inode:

 - Grab the ppid number from the "stat" entry in the process inode.
 - Take a reference (a file descriptor) to the inode at /proc/$ppid.
 - Re-read the "stat" entry in the process inode and check whether the
   ppid changed. If not, you're done. If yes, retry.

This works because, while the parent of a task can change multiple
times, each such change changes the PPID to a value it never had before.
This is true because all subreapers of a process have to be ancestors of
it, and the ancestors of a process have to already exist when it spawns,
so they can't spawn after the death of the process, so they can't reuse
the PID of the process. So with this trick, you can determine the parent
of a process in a stable way.
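
The parent lookup with its re-check could be sketched in Python roughly as follows. The names `read_ppid` and `open_parent`, the `proc_root` parameter and the `max_tries` bound are assumptions made for this example; the caller is assumed to already hold a directory descriptor for the task's /proc/$pid entry.

```python
import os


def read_ppid(task_fd):
    """Return the PPID field of "stat", read relative to a task dir fd."""
    fd = os.open("stat", os.O_RDONLY, dir_fd=task_fd)
    with os.fdopen(fd, "rb") as f:
        raw = f.read()
    # comm (field 2) may contain spaces or ')', so split after the last ')'.
    fields = raw[raw.rindex(b")") + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] the PPID


def open_parent(task_fd, proc_root="/proc", max_tries=16):
    """Return (ppid, parent_fd) for the task behind task_fd.

    Raises OSError if the task itself goes away, RuntimeError if no
    stable answer was found within max_tries attempts.
    """
    for _ in range(max_tries):
        ppid = read_ppid(task_fd)
        try:
            parent_fd = os.open(os.path.join(proc_root, str(ppid)),
                                os.O_RDONLY | os.O_DIRECTORY)
        except FileNotFoundError:
            continue  # parent vanished between the two steps; retry
        try:
            unchanged = read_ppid(task_fd) == ppid
        except OSError:
            os.close(parent_fd)
            raise
        if unchanged:
            return ppid, parent_fd  # the descriptor refers to the real parent
        os.close(parent_fd)  # task was reparented in between; retry
    raise RuntimeError("parent kept changing")
```

The returned descriptor is the "reference" from the second step; the caller is responsible for closing it.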

This approach can then be reused to find the children of a process with
inode fd $ppid_fd:

 - Read the PID from "stat" under $ppid_fd; call it $wanted_ppid.
 - Create an empty result set $result that can hold file descriptors.
 - For each numeric entry in /proc/:
  - chdir() into /proc/$pid.
  - Read "stat"; if the PPID isn't $wanted_ppid, go to next iteration.
  - Add openat(".") to $result.
 - If "stat" under $ppid_fd is still readable (as opposed to returning
   -ESRCH on openat()), return $result.
 - Return an empty result set or an error or so; the parent's PID has
   been deallocated.

I think these should work for obtaining a sufficiently consistent view
of the process structure of a running system.
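
A Python sketch of the child scan, again only as an illustration: `stat_fields`, `find_children` and `proc_root` are hypothetical names, directory descriptors stand in for chdir(), and the final liveness check is done by re-reading "stat" through the parent's descriptor.

```python
import os


def stat_fields(task_fd):
    """Return (pid, ppid) parsed from "stat" relative to a task dir fd."""
    fd = os.open("stat", os.O_RDONLY, dir_fd=task_fd)
    with os.fdopen(fd, "rb") as f:
        raw = f.read()
    pid = int(raw[:raw.index(b" ")])
    # comm may contain spaces or ')', so split after the last ')'.
    rest = raw[raw.rindex(b")") + 2:].split()
    return pid, int(rest[1])


def find_children(parent_fd, proc_root="/proc"):
    """Return {child_pid: dir_fd} for children of the task behind parent_fd.

    Returns an empty dict if the parent is gone by the end of the scan.
    The caller owns the returned descriptors and must close them.
    """
    wanted_ppid, _ = stat_fields(parent_fd)
    result = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            fd = os.open(os.path.join(proc_root, name),
                         os.O_RDONLY | os.O_DIRECTORY)
        except OSError:
            continue  # entry disappeared before it could be opened
        try:
            pid, ppid = stat_fields(fd)
        except OSError:
            os.close(fd)
            continue  # task exited while being read
        if ppid == wanted_ppid:
            result[pid] = fd  # keep the descriptor as the stable reference
        else:
            os.close(fd)
    try:
        stat_fields(parent_fd)  # still readable: the parent's PID was not freed
    except OSError:
        for fd in result.values():
            os.close(fd)
        return {}
    return result
```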

But yeah, safely using this interface isn't easy, and more
inode-centered APIs for interaction with processes would be nice to
have. (E.g. an entry in /proc/$pid that points to the parent inode,
maybe a directory containing entries that point to the child inodes,
and process directory entries offering functionality equivalent to
syscalls like kill(), sched_setscheduler() and prlimit().)


* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]               ` <20160814104856.GA12246-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
@ 2016-08-14 20:14                 ` Cyrill Gorcunov
       [not found]                   ` <20160814201441.GC1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Cyrill Gorcunov @ 2016-08-14 20:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Iago López Galeiras,
	Andrey Vagin

On Sun, Aug 14, 2016 at 12:48:56PM +0200, Jann Horn wrote:
> > 
> > Hi! First of all, sorry for delay. Guys, this is not really true. The same
> > applies to plain "ls /proc".
> 
> It does not. /proc is wobbly in a running system, /proc/$pid/children is
> completely unreliable.

Nope -- look into how pids are instantiated: once pids are read, new ones
may appear which you won't notice without a re-read. You may still miss freshly
created pids. In turn, children doesn't guarantee that the pid you've fetched
is still valid, and for validation's sake we've been using ptrace + a test that
the children's parent pid is the same after the read. So no, I wouldn't call
it _completely_ unreliable. Rather, it may give misses on tasks which are
using fork/execve intensively, but that's an acceptable trade-off for the sake
of speed (and speed was the primary reason we added this
interface).
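
For illustration only, the "read children, then re-check the parent pid" part of that validation might look like the Python sketch below. It leaves out the ptrace step entirely, reads only the main thread's children file, and the names `ppid_of`, `children_checked` and `proc_root` are assumptions made for this example rather than anything CRIU provides.

```python
import os


def ppid_of(pid, proc_root="/proc"):
    """Return the PPID field from <proc_root>/<pid>/stat."""
    with open(os.path.join(proc_root, str(pid), "stat"), "rb") as f:
        raw = f.read()
    # Split after the last ')' because comm may contain spaces or ')'.
    return int(raw[raw.rindex(b")") + 2:].split()[1])


def children_checked(pid, proc_root="/proc"):
    """Read task/<pid>/children of the main thread and keep verified entries.

    Entries whose parent is no longer pid (exited, PID reused) are dropped.
    Children that were omitted from the list cannot be recovered here.
    """
    path = os.path.join(proc_root, str(pid), "task", str(pid), "children")
    with open(path) as f:
        listed = [int(tok) for tok in f.read().split()]
    verified = []
    for child in listed:
        try:
            if ppid_of(child, proc_root) == pid:
                verified.append(child)
        except OSError:
            pass  # child exited after the list was read
    return verified
```

Note that this kind of re-check can only discard stale entries; it does nothing about entries missing from the list in the first place.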

> 
> > You can fetch pid from the procfs and then
> > process get dead just right after you've finished reading. So this interface
> > works "properly" all the time, but if one needs precise results it should
> > stop/freeze processes first. In contrary I think it worth switching into
> > children interface in user-space programs because it incredibly fast.
> 
> In procfs, when you want to enumerate all tasks that are currently running,
> you can do the following:
> 
>  - Read /proc with readdir() or so, but discard all information except for
>    the PIDs.
>  - For each PID:
>   - chdir() into /proc/$pid
>   - stat '.' and read files inside '.'
> 
> This will yield information about all tasks that were running at the start
> of the operation and are still running. AFAIK, the internal consistency of

No, they may start exiting while you examine them, but the task structure
and linked data won't disappear until the reference is decremented.

> per-task data has the following guarantee: All data that was collected as
> per-task data really belongs to the same task; PID reuse has no effect on
> that (because the /proc/$pid inode will not be reassociated with a new
> task that reuses the PID). Of course, different pieces of data that were
> collected at different points in time can still be somewhat inconsistent -
> especially if an execve() call happens in the meantime.
> 
> Looking up the procfs inodes corresponding to the parents or children of
> a process is a bit more complicated, but still doable. To look up the
> parent inode for a /proc/$pid inode:
> 
>  - Grab the ppid number from the "stat" entry in the process inode.
>  - Take a reference (a file descriptor) to the inode at /proc/$ppid.
>  - re-read the "stat" entry in the process inode and check whether the
>    ppid changed. if not, you're done. if yes, retry.
> 
> This works because, while the parent of a task can change multiple
> times, each such change changes the PPID to a value it never had before.
> This is true because all subreapers of a process have to be ancestors of
> it, and the ancestors of a process have to already exist when it spawns,
> so they can't spawn after the death of the process, so they can't reuse
> the PID of the process. So with this trick, you can determine the parent
> of a process in a stable way.
> 
> This approach can then be reused to find the children of a process with
> inode fd $ppid_fd:
> 
>  - Read the PID from "stat" under $ppid_fd.
>  - Create an empty result set $result that can hold file descriptors.
>  - For each numeric entry in /proc/:
>   - chdir() into /proc/$pid.
>   - Read "stat"; if the PPID isn't $wanted_ppid, go to next iteration.
>   - Add openat(".") to $result.
>  - If "stat" under $ppid_fd is still readable (as opposed to returning
>    -ESRCH on openat()), return $result.
>  - Return an empty result set or an error or so; the parent's PID has
>    been deallocated.
> 
> I think these should work for obtaining a sufficiently consistent view
> of the process structure of a running system.
> 
> But yeah, safely using this interface isn't easy, and more
> inode-centered APIs for interaction with processes would be nice to
> have. (E.g. an entry in /proc/$pid that points to the parent inode,
> maybe a directory containing entries that point to the child inodes,
> and process directory entries offering functionality equivalent to
> syscalls like kill(), sched_setscheduler() and prlimit().)

Well, all this really wastes a huge amount of time, that's why we needed
$children. In general, a more preferable way might be the task-diag interface
which Andrew implemented (I'm not sure what exactly the state of the
series is at the moment, whether it has been merged or not: https://lkml.org/lkml/2016/4/11/932)

	Cyrill

* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]                   ` <20160814201441.GC1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
@ 2016-08-14 20:46                     ` Jann Horn
       [not found]                       ` <20160814204635.GA2803-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2016-08-14 20:46 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Iago López Galeiras,
	Andrey Vagin


On Sun, Aug 14, 2016 at 11:14:41PM +0300, Cyrill Gorcunov wrote:
> On Sun, Aug 14, 2016 at 12:48:56PM +0200, Jann Horn wrote:
> > > 
> > > Hi! First of all, sorry for delay. Guys, this is not really true. The same
> > > applies to plain "ls /proc".
> > 
> > It does not. /proc is wobbly in a running system, /proc/$pid/children is
> > completely unreliable.
> 
> Nope -- look into how pids are instantinated: once pids are read new ones
> may appear which you won't notice without re-read. You still may miss freshly
> created pids.

That's pretty much inherent when you're inspecting a moving system - by the
time you've collected your information, it might be stale. So what?

> In turn children doesn't guarantee that the pid you've fetched
> is still valid, and for validation sake we've been using ptrace + test of
> children's parent pid being the same after the read. So no, I wouldn't call
> it _completely_ unreliable. It rather may give misses on tasks which are
> using fork/execve intensively, but it's acceptable trade off in a sake
> of speed (and the speed was the primary target why we've added this
> interface).

It's an "acceptable trade off" when such an interface drops information about
a relationship that existed before the caller starts inspecting the process
relationships and continues to exist while the inspection runs?
Interfaces that usually work but sometimes, randomly, silently drop
information just suck, at least if you're trying to write software that
actually works.


> > > You can fetch pid from the procfs and then
> > > process get dead just right after you've finished reading. So this interface
> > > works "properly" all the time, but if one needs precise results it should
> > > stop/freeze processes first. In contrary I think it worth switching into
> > > children interface in user-space programs because it incredibly fast.
> > 
> > In procfs, when you want to enumerate all tasks that are currently running,
> > you can do the following:
> > 
> >  - Read /proc with readdir() or so, but discard all information except for
> >    the PIDs.
> >  - For each PID:
> >   - chdir() into /proc/$pid
> >   - stat '.' and read files inside '.'
> > 
> > This will yield information about all tasks that were running at the start
> > of the operation and are still running. AFAIK, the internal consistency of
> 
> No, they may start exiting while you examinate them, but task structure
> and linked data won't disapper until reference is decremented.

... so?


> > per-task data has the following guarantee: All data that was collected as
> > per-task data really belongs to the same task; PID reuse has no effect on
> > that (because the /proc/$pid inode will not be reassociated with a new
> > task that reuses the PID). Of course, different pieces of data that were
> > collected at different points in time can still be somewhat inconsistent -
> > especially if an execve() call happens in the meantime.
> > 
> > Looking up the procfs inodes corresponding to the parents or children of
> > a process is a bit more complicated, but still doable. To look up the
> > parent inode for a /proc/$pid inode:
> > 
> >  - Grab the ppid number from the "stat" entry in the process inode.
> >  - Take a reference (a file descriptor) to the inode at /proc/$ppid.
> >  - re-read the "stat" entry in the process inode and check whether the
> >    ppid changed. if not, you're done. if yes, retry.
> > 
> > This works because, while the parent of a task can change multiple
> > times, each such change changes the PPID to a value it never had before.
> > This is true because all subreapers of a process have to be ancestors of
> > it, and the ancestors of a process have to already exist when it spawns,
> > so they can't spawn after the death of the process, so they can't reuse
> > the PID of the process. So with this trick, you can determine the parent
> > of a process in a stable way.
> > 
> > This approach can then be reused to find the children of a process with
> > inode fd $ppid_fd:
> > 
> >  - Read the PID from "stat" under $ppid_fd.
> >  - Create an empty result set $result that can hold file descriptors.
> >  - For each numeric entry in /proc/:
> >   - chdir() into /proc/$pid.
> >   - Read "stat"; if the PPID isn't $wanted_ppid, go to next iteration.
> >   - Add openat(".") to $result.
> >  - If "stat" under $ppid_fd is still readable (as opposed to returning
> >    -ESRCH on openat()), return $result.
> >  - Return an empty result set or an error or so; the parent's PID has
> >    been deallocated.
> > 
> > I think these should work for obtaining a sufficiently consistent view
> > of the process structure of a running system.
> > 
> > But yeah, safely using this interface isn't easy, and more
> > inode-centered APIs for interaction with processes would be nice to
> > have. (E.g. an entry in /proc/$pid that points to the parent inode,
> > maybe a directory containing entries that point to the child inodes,
> > and process directory entries offering functionality equivalent to
> > syscalls like kill(), sched_setscheduler() and prlimit().)
> 
> Well, all this really waste a huge amount of time, that's why we needed
> $children. In general more preferred way might be task-diag interface
> which Andrew implemented (I'm not sure which exactly state of the
> series at the moment, have it been merged or not https://lkml.org/lkml/2016/4/11/932)

Yuck. Everything is PID-based? That's ugly.


* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]                       ` <20160814204635.GA2803-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
@ 2016-08-14 22:13                         ` Cyrill Gorcunov
       [not found]                           ` <20160814221359.GD1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Cyrill Gorcunov @ 2016-08-14 22:13 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Iago López Galeiras,
	Andrey Vagin

On Sun, Aug 14, 2016 at 10:46:35PM +0200, Jann Horn wrote:
> On Sun, Aug 14, 2016 at 11:14:41PM +0300, Cyrill Gorcunov wrote:
> > On Sun, Aug 14, 2016 at 12:48:56PM +0200, Jann Horn wrote:
> > > > 
> > > > Hi! First of all, sorry for delay. Guys, this is not really true. The same
> > > > applies to plain "ls /proc".
> > > 
> > > It does not. /proc is wobbly in a running system, /proc/$pid/children is
> > > completely unreliable.
> > 
> > Nope -- look into how pids are instantinated: once pids are read new ones
> > may appear which you won't notice without re-read. You still may miss freshly
> > created pids.
> 
> That's pretty much inherent when you're inspecting a moving system - by the
> time you've collected your information, it might be stale. So what?

What "what"? I told you that the information you fetch from a running system
is only valid while the kernel does its work; once you jump back to userspace
the information might not be valid anymore. And the @children does the
same: in native procfs read you may miss freshly created processes,
in @children read you may miss exited processes. It's the same nature.

> > In turn children doesn't guarantee that the pid you've fetched
> > is still valid, and for validation sake we've been using ptrace + test of
> > children's parent pid being the same after the read. So no, I wouldn't call
> > it _completely_ unreliable. It rather may give misses on tasks which are
> > using fork/execve intensively, but it's acceptable trade off in a sake
> > of speed (and the speed was the primary target why we've added this
> > interface).
> 
> It's an "acceptable trade off" when such an interface drops information about
> a relationship that existed before the caller starts inspecting the process
> relationships and continues to exist while the inspection runs?

Sigh, even the task state may change right after it is written into the
seq output. Jann, I really don't follow what you're trying to say.
The output procfs produces changes all the time, it's a running system;
if you need a solid state -- stop/freeze all processes first and
the output will be valid.

> Interfaces that ususally work but sometimes, randomly, silently drop
> information just suck, at least if you're trying to write software that
> actually works. 

@children interface works as expected and as intended: a cheap way
to get children list in one pass.

> > > > You can fetch pid from the procfs and then
> > > > process get dead just right after you've finished reading. So this interface
> > > > works "properly" all the time, but if one needs precise results it should
> > > > stop/freeze processes first. In contrary I think it worth switching into
> > > > children interface in user-space programs because it incredibly fast.
> > > 
> > > In procfs, when you want to enumerate all tasks that are currently running,
> > > you can do the following:
> > > 
> > >  - Read /proc with readdir() or so, but discard all information except for
> > >    the PIDs.
> > >  - For each PID:
> > >   - chdir() into /proc/$pid
> > >   - stat '.' and read files inside '.'
> > > 
> > > This will yield information about all tasks that were running at the start
> > > of the operation and are still running. AFAIK, the internal consistency of
> > 
> > No, they may start exiting while you examinate them, but task structure
> > and linked data won't disapper until reference is decremented.
> 
> ... so?

See above: to get a solid/valid state one has to stop/freeze the tasks one is
inspecting, that's all.

...
> > > I think these should work for obtaining a sufficiently consistent view
> > > of the process structure of a running system.
> > > 
> > > But yeah, safely using this interface isn't easy, and more
> > > inode-centered APIs for interaction with processes would be nice to
> > > have. (E.g. an entry in /proc/$pid that points to the parent inode,
> > > maybe a directory containing entries that point to the child inodes,
> > > and process directory entries offering functionality equivalent to
> > > syscalls like kill(), sched_setscheduler() and prlimit().)
> > 
> > Well, all this really waste a huge amount of time, that's why we needed
> > $children. In general more preferred way might be task-diag interface
> > which Andrew implemented (I'm not sure which exactly state of the
> > series at the moment, have it been merged or not https://lkml.org/lkml/2016/4/11/932)
> 
> Yuck. Everything is PID-based? That's ugly.

It just so happens that processes are pid-based things.

* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]                           ` <20160814221359.GD1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
@ 2016-08-14 22:45                             ` Jann Horn
       [not found]                               ` <20160814224546.GA32168-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2016-08-14 22:45 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Iago López Galeiras,
	Andrey Vagin

On Mon, Aug 15, 2016 at 01:13:59AM +0300, Cyrill Gorcunov wrote:
> On Sun, Aug 14, 2016 at 10:46:35PM +0200, Jann Horn wrote:
> > On Sun, Aug 14, 2016 at 11:14:41PM +0300, Cyrill Gorcunov wrote:
> > > On Sun, Aug 14, 2016 at 12:48:56PM +0200, Jann Horn wrote:
> > > > > 
> > > > > Hi! First of all, sorry for delay. Guys, this is not really true. The same
> > > > > applies to plain "ls /proc".
> > > > 
> > > > It does not. /proc is wobbly in a running system, /proc/$pid/children is
> > > > completely unreliable.
> > > 
> > > Nope -- look into how pids are instantinated: once pids are read new ones
> > > may appear which you won't notice without re-read. You still may miss freshly
> > > created pids.
> > 
> > That's pretty much inherent when you're inspecting a moving system - by the
> > time you've collected your information, it might be stale. So what?
> 
> What "what"? I told you that the information you fetch from running system
> only valid when kernel does its work, once you jump back to userspace
> the information might not be valid anymore. And the @children does the
> same: in native procfs read you may miss freshly created processes,
> in @children read you may miss exited processes. It's the same nature.

"in @children read you may miss exited processes"? If that was the extent of it,
everything would be fine. But that's not what's happening. Read the comment in
get_children_pid():

	 * We might miss some children here if children
	 * are exited while we were not holding the lock,
	 * but it was never promised to be accurate that
	 * much.
	 *
	 * "Just suppose that the parent sleeps, but N children
	 *  exit after we printed their tids. Now the slow paths
	 *  skips N extra children, we miss N tasks." (c)

"skips N extra children". *NOT* necessarily the same N children that exited, but
also children that were running before you started reading the "children" file
and are still running afterwards.

That's the big difference between the interfaces: Normal procfs reads might miss
new information or return outdated information, which is mostly inherent when you're
inspecting a running system, but the "children" interface can also drop
information about completely stable task relationships.


> > > > I think these should work for obtaining a sufficiently consistent view
> > > > of the process structure of a running system.
> > > > 
> > > > But yeah, safely using this interface isn't easy, and more
> > > > inode-centered APIs for interaction with processes would be nice to
> > > > have. (E.g. an entry in /proc/$pid that points to the parent inode,
> > > > maybe a directory containing entries that point to the child inodes,
> > > > and process directory entries offering functionality equivalent to
> > > > syscalls like kill(), sched_setscheduler() and prlimit().)
> > > 
> > > Well, all this really waste a huge amount of time, that's why we needed
> > > $children. In general more preferred way might be task-diag interface
> > > which Andrew implemented (I'm not sure which exactly state of the
> > > series at the moment, have it been merged or not https://lkml.org/lkml/2016/4/11/932)
> > 
> > Yuck. Everything is PID-based? That's ugly.
> 
> That happened the process are pid based things.

PID-based interfaces suck unless you're the ptracer or reaper of all the
tasks you're inspecting, and an interface based on less reusable handles
(like procfs directory file descriptors or unique 64-bit identifiers or
whatever) would be much safer.

Yes, I know that all those traditional APIs use PIDs, but that doesn't
change that those interfaces suck. When you kill -9 a daemon that doesn't
quit when asked to quit, for example, there's a chance that the daemon
actually does quit and its PID is reallocated to some vital system
service just before you call kill() - and then your system breaks in
some unpleasant way.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]                               ` <20160814224546.GA32168-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
@ 2016-08-15  8:50                                 ` Cyrill Gorcunov
       [not found]                                   ` <20160815085004.GE1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Cyrill Gorcunov @ 2016-08-15  8:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin

On Mon, Aug 15, 2016 at 12:45:46AM +0200, Jann Horn wrote:
> > > 
> > > That's pretty much inherent when you're inspecting a moving system - by the
> > > time you've collected your information, it might be stale. So what?
> > 
> > What "what"? I told you that the information you fetch from running system
> > only valid when kernel does its work, once you jump back to userspace
> > the information might not be valid anymore. And the @children does the
> > same: in native procfs read you may miss freshly created processes,
> > in @children read you may miss exited processes. It's the same nature.
> 
> "in @children read you may miss exited processes"? If that was the extent of it,
> everything would be fine. But that's not what's happening. Read the comment in
> get_children_pid():
> 
> 	 * We might miss some children here if children
> 	 * are exited while we were not holding the lock,
> 	 * but it was never promised to be accurate that
> 	 * much.
> 	 *
> 	 * "Just suppose that the parent sleeps, but N children
> 	 *  exit after we printed their tids. Now the slow paths
> 	 *  skips N extra children, we miss N tasks." (c)
> 
> "skips N extra children". *NOT* necessarily the same N children that exited, but
> also children that were running before you started reading the "children" file
> and are still running afterwards.

I happen to know how this code works; I wrote it. And it's the same
as reading plain pids: you might miss freshly created pids completely until
the re-read. The rule of thumb is to re-validate the results, always, or
stop the processes first.

> That's the big difference between the interfaces: Normal procfs reads might not
> return new or outdated information, which is mostly inherent when you're

Really? It returns outdated information all the time: for example, read the
@maps output; once the userspace buffer is filled, this data is no longer valid.
If you need precise results, stop the process first.

> inspecting a running system, but the "children" interface can also drop
> information about completely stable task relationships.

> > > > > I think these should work for obtaining a sufficiently consistent view
> > > > > of the process structure of a running system.
> > > > > 
> > > > > But yeah, safely using this interface isn't easy, and more
> > > > > inode-centered APIs for interaction with processes would be nice to
> > > > > have. (E.g. an entry in /proc/$pid that points to the parent inode,
> > > > > maybe a directory containing entries that point to the child inodes,
> > > > > and process directory entries offering functionality equivalent to
> > > > > syscalls like kill(), sched_setscheduler() and prlimit().)
> > > > 
> > > > Well, all this really waste a huge amount of time, that's why we needed
> > > > $children. In general more preferred way might be task-diag interface
> > > > which Andrew implemented (I'm not sure which exactly state of the
> > > > series at the moment, have it been merged or not https://lkml.org/lkml/2016/4/11/932)
> > > 
> > > Yuck. Everything is PID-based? That's ugly.
> > 
> > That happened the process are pid based things.
> 
> PID-based interfaces suck unless you're the ptracer or reaper of all the
> tasks you're inspecting, and an interface based on less reusable handles
> (like procfs directory file descriptors or unique 64-bit identifiers or
> whatever) would be much safer.
> 
> Yes, I know that all those traditional APIs use PIDs, but that doesn't
> change that those interfaces suck. When you kill -9 a daemon that doesn't
> quit when asked to quit, for example, there's a chance that the daemon
> actually does quit and its PID is reallocated to some vital system
> service just before you call kill() - and then your system breaks in
> some unpleasant way.

/me shrugs

Some unique UUIDs instead of pids might be better (and in distributed
environments they are the only option), but there is no need to make
things more complex than they already are. For the kill -9 example, indeed,
once you've typed the command the process might already be dead and its pid
reused by someone else. Still, you can simply write your own utility
that uses ptrace to kill exactly the process you need, but usually
we go the easy way and simply zap hanging tasks with "kill". That's fine.
Everyone knows that there is a risk of zapping someone else instead of
the target.

	Cyrill

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]                                   ` <20160815085004.GE1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
@ 2016-08-15 11:53                                     ` Jann Horn
       [not found]                                       ` <20160815115333.GA11115-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2016-08-15 11:53 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin

On Mon, Aug 15, 2016 at 11:50:04AM +0300, Cyrill Gorcunov wrote:
> On Mon, Aug 15, 2016 at 12:45:46AM +0200, Jann Horn wrote:
> > > > 
> > > > That's pretty much inherent when you're inspecting a moving system - by the
> > > > time you've collected your information, it might be stale. So what?
> > > 
> > > What "what"? I told you that the information you fetch from running system
> > > only valid when kernel does its work, once you jump back to userspace
> > > the information might not be valid anymore. And the @children does the
> > > same: in native procfs read you may miss freshly created processes,
> > > in @children read you may miss exited processes. It's the same nature.
> > 
> > "in @children read you may miss exited processes"? If that was the extent of it,
> > everything would be fine. But that's not what's happening. Read the comment in
> > get_children_pid():
> > 
> > 	 * We might miss some children here if children
> > 	 * are exited while we were not holding the lock,
> > 	 * but it was never promised to be accurate that
> > 	 * much.
> > 	 *
> > 	 * "Just suppose that the parent sleeps, but N children
> > 	 *  exit after we printed their tids. Now the slow paths
> > 	 *  skips N extra children, we miss N tasks." (c)
> > 
> > "skips N extra children". *NOT* necessarily the same N children that exited, but
> > also children that were running before you started reading the "children" file
> > and are still running afterwards.
> 
> I happen to know how this code work, i've been writting it. And it's the same
> as reading plain pids: you might miss freshly created pids completely until
> the re-read. The rule of thumb is to re-validate the results, always, or
> stop the processes first.

Ah, I think maybe I understand what you're saying. If you want a list of PIDs that
includes all stable children of a given running process, you have to read the
"children" file in a loop until two reads return the same list, using the fact
that the children are ordered by the time they became children of the target
process and therefore the read following a read that triggered the slowpath
always returns something different unless all children following the position
that triggered the slowpath are replaced? Or something like that?

(If you just read "children" without the loop-until-stable rule, as far as I can
tell, no amount of revalidation will prevent you from missing dropped children.)
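
A minimal sketch of that loop-until-stable rule, assuming a Linux kernel that exposes the "children" file. The function names, `max_attempts` and `delay` are hypothetical, and two identical reads only make a dropped child unlikely; as noted above, they are not a strict guarantee.

```python
import os
import time


def read_children_once(pid, tid, proc_root="/proc"):
    """Read /proc/<pid>/task/<tid>/children once and return the TIDs."""
    path = os.path.join(proc_root, str(pid), "task", str(tid), "children")
    with open(path) as f:
        return [int(t) for t in f.read().split()]


def read_children_stable(read_once, max_attempts=10, delay=0.0):
    """Call read_once() until two consecutive reads return the same list.

    read_once is any zero-argument callable returning a list of TIDs.
    Raises RuntimeError if the list keeps changing for max_attempts reads.
    """
    previous = read_once()
    for _ in range(max_attempts):
        if delay:
            time.sleep(delay)
        current = read_once()
        if current == previous:
            return current
        previous = current
    raise RuntimeError("children list did not stabilise")


if __name__ == "__main__":
    me = os.getpid()
    try:
        print(read_children_stable(lambda: read_children_once(me, me)))
    except OSError as exc:
        # The file is absent on kernels built without this interface.
        print("children file not available:", exc)
```

Passing the reader as a callable keeps the retry logic independent of procfs, so it can be exercised with canned data.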


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] proc.5: document /proc/[pid]/task/[tid]/children
       [not found]                                       ` <20160815115333.GA11115-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
@ 2016-08-15 12:34                                         ` Cyrill Gorcunov
  0 siblings, 0 replies; 11+ messages in thread
From: Cyrill Gorcunov @ 2016-08-15 12:34 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk (man-pages),
	linux-man-u79uwXL29TY76Z2rM5mHXA, Andrey Vagin

On Mon, Aug 15, 2016 at 01:53:33PM +0200, Jann Horn wrote:
> > 
> > I happen to know how this code work, i've been writting it. And it's the same
> > as reading plain pids: you might miss freshly created pids completely until
> > the re-read. The rule of thumb is to re-validate the results, always, or
> > stop the processes first.
> 
> Ah, I think maybe I understad what you're saying. If you want a list of PIDs that
> includes all stable children of a given running process, you have to read the
> "children" file in a loop until two reads return the same list, using the fact
> that the children are ordered by the time they became children of the target
> process and therefore the read following a read that triggered the slowpath
> always returns something different unless all children following the position
> that triggered the slowpath are replaced? Or something like that?

Exactly. In criu (if the freezer cgroup is not used) we check whether the children
we've read are still valid when we start operating on the PIDs. In short
it's like this: seize the task and fetch its children (no new children may appear
since the task is seized), then iterate over each child and check that its parent
pid has not changed (simultaneously each child gets seized). A run-time application
that does not seize processes should compare the outputs of repeated reads until
they are the same, just like you said.
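
The parent check can be sketched without ptrace, purely as an illustration. This is not criu's code: the helper names are hypothetical, nothing is seized, and so a PID can still be reused between the check and any later use. The sketch relies on the `PPid:` field of `/proc/<pid>/status`, which reports the parent process ID.

```python
def parse_ppid(status_text):
    """Return the PPid value from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("PPid:"):
            return int(line.split()[1])
    return None


def read_status_from_proc(tid, proc_root="/proc"):
    """Read /proc/<tid>/status; raises OSError if the task is gone."""
    with open(f"{proc_root}/{tid}/status") as f:
        return f.read()


def filter_valid_children(parent_pid, child_tids, read_status):
    """Keep only the TIDs whose status still names parent_pid as parent.

    read_status is a callable taking a TID and returning the status text;
    tasks that vanished in the meantime (OSError) are dropped.
    """
    valid = []
    for tid in child_tids:
        try:
            status = read_status(tid)
        except OSError:
            continue  # task exited since the children file was read
        if parse_ppid(status) == parent_pid:
            valid.append(tid)
    return valid
```

A caller would typically pass `read_status_from_proc` as `read_status` and feed in the list obtained from the "children" file.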

> (If you just read "children" without the loop-until-stable rule, as far as I can
> tell, no amount of revalidation will prevent you from missing dropped children.)

Yeah. For tools like top/htop the reader should make a few reads if it needs
more or less precise results. In turn, if only a rough picture is needed, a plain
single read is enough. Look, the read of @children is extremely fast and
may be combined with the traditional walk over procfs. IIRC someone has been looking
into using this feature in a top-like utility. But I don't remember the
details of whether they succeeded.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-08-15 12:34 UTC | newest]

Thread overview: 11+ messages
2016-08-02  0:25 [PATCH] proc.5: document /proc/[pid]/task/[tid]/children Jann Horn
     [not found] ` <b97bbf47-1180-0d32-ba08-1482020cc883@gmail.com>
     [not found]   ` <b97bbf47-1180-0d32-ba08-1482020cc883-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-08-03 22:52     ` Jann Horn
     [not found]       ` <20160803225254.GA14948-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
2016-08-14  8:40         ` Cyrill Gorcunov
     [not found]           ` <20160814084026.GA1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
2016-08-14 10:48             ` Jann Horn
     [not found]               ` <20160814104856.GA12246-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
2016-08-14 20:14                 ` Cyrill Gorcunov
     [not found]                   ` <20160814201441.GC1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
2016-08-14 20:46                     ` Jann Horn
     [not found]                       ` <20160814204635.GA2803-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
2016-08-14 22:13                         ` Cyrill Gorcunov
     [not found]                           ` <20160814221359.GD1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
2016-08-14 22:45                             ` Jann Horn
     [not found]                               ` <20160814224546.GA32168-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
2016-08-15  8:50                                 ` Cyrill Gorcunov
     [not found]                                   ` <20160815085004.GE1857-ZmlpmtaulQd+urZeOPWqwQ@public.gmane.org>
2016-08-15 11:53                                     ` Jann Horn
     [not found]                                       ` <20160815115333.GA11115-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
2016-08-15 12:34                                         ` Cyrill Gorcunov
