All of lore.kernel.org
 help / color / mirror / Atom feed
* Testing lxc 0.6.5 in Fedora 13
@ 2010-03-21 19:50 Grzegorz Nosek
       [not found] ` <20100321195044.GA23757-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
  2010-03-23 21:28 ` Matt Helsley
  0 siblings, 2 replies; 39+ messages in thread
From: Grzegorz Nosek @ 2010-03-21 19:50 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg

Hi all,

I'd like to report a few problems/gotchas I ran into while testing LXC
in Fedora 13 (current Rawhide).

1. PTY allocation fails (PEBKAC/documentation issue)

While starting up a guest (CentOS 5.4) I get a series of messages:

lxc-start 1268746923.067 WARN     lxc_conf - failed to mount '/dev/pts/2'->'./rootfs.centos/dev/tty1'
lxc-start 1268746923.067 WARN     lxc_conf - failed to mount '/dev/pts/3'->'./rootfs.centos/dev/tty2'
lxc-start 1268746923.067 WARN     lxc_conf - failed to mount '/dev/pts/4'->'./rootfs.centos/dev/tty3'
lxc-start 1268746923.067 WARN     lxc_conf - failed to mount '/dev/pts/5'->'./rootfs.centos/dev/tty4'
lxc-start 1268746923.067 INFO     lxc_conf - 4 tty(s) has been setup

The mingetty set up to start at /dev/tty1 is not attached to any known
TTY and is not accessible via lxc-console:

CONTAINER    PID TTY          TIME CMD
centos      1186 ?        00:00:00 init
centos      1368 ?        00:00:00 syslogd
centos      1371 ?        00:00:00 klogd
centos      1386 ?        00:00:00 sshd
centos      1396 ?        00:00:00 mingetty

The mount fails because there are no ttyX files inside the container's
file system. This may qualify as a PEBKAC but it might be nice to
document somewhere (touching empty plain files with appropriate names
makes the container boot).

2. Weird strace behaviour across pidns boundary

When strace'ing (with -ff) lxc-start, I get a proper strace for the
directly spawned process and the container init. However, any processes
spawned by the container's init are not straced properly (I get two
empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
inside the container). The container also seems to malfunction under
strace (looks like exec() failing as lxc-ps shows two "init" processes).

This is quite painful as it prevents strace'ing processes in containers
even after startup. Here's a snippet of strace'ing a bash (pid 179
inside, pid 2959 outside) trying to run 'ls'. The shell hangs until I
kill the strace process.

pipe([3, 4])                            = 0
clone(Process 197 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7859708) = 197
Process 2999 attached (waiting for parent)
[pid  2959] setpgid(197, 197)           = 0
[pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid  2959] close(3)                    = 0
[pid  2959] close(4)                    = 0
[pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
[pid  2959] ioctl(255, TIOCSPGRP, [197]) = 0
[pid  2959] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
[pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid  2959] waitpid(-1, Process 2959 suspended
^C <unfinished ...>
Process 2959 detached
Process 197 detached
Process 2999 detached

'strace ls' ran completely inside the container works as expected.

3. Missing /dev/console

There's no "obviously good" default (apart maybe from lxc-start's
stdout/err) but I guess the messages from init could end up somewhere
useful (right now it tries /dev/console, /dev/tty0 and /dev/null, in
order). Maybe an extra PTY (just start counting from 0) would be enough?
Or even a symlink (tty0 -> tty1). But then I think that lxc-start should
attach to the PTY as early as possible to capture boot messages.


Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found] ` <20100321195044.GA23757-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
@ 2010-03-23 21:28   ` Matt Helsley
  0 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-23 21:28 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Roland McGrath,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:

<snip>

> 2. Weird strace behaviour across pidns boundary
> 
> When strace'ing (with -ff) lxc-start, I get a proper strace for the
> directly spawned process and the container init. However, any processes
> spawned by the container's init are not straced properly (I get two
> empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
> inside the container). The container also seems to malfunction under
> strace (looks like exec() failing as lxc-ps shows two "init" processes).
> 
> This is quite painful as it prevents strace'ing processes in containers
> even after startup. Here's a snippet of strace'ing a bash (pid 179
> inside, pid 2959 outside) trying to run 'ls'. The shell hangs until I
> kill the strace process.
> 
> pipe([3, 4])                            = 0
> clone(Process 197 attached
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7859708) = 197
> Process 2999 attached (waiting for parent)
> [pid  2959] setpgid(197, 197)           = 0
> [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> [pid  2959] close(3)                    = 0
> [pid  2959] close(4)                    = 0
> [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
> [pid  2959] ioctl(255, TIOCSPGRP, [197]) = 0
> [pid  2959] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
> [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> [pid  2959] waitpid(-1, Process 2959 suspended
> ^C <unfinished ...>
> Process 2959 detached
> Process 197 detached
> Process 2999 detached
> 
> 'strace ls' ran completely inside the container works as expected.

I'm suprised strace of ls works across pid namespaces. I've been looking
at strace and it seemed to me that one kernel change and a bunch of strace
changes are needed to make strace'ing in child pid namespaces work. Eric
Biederman's setns() patches also might help.

Can you get a little farther with the kernel fix below?

    Fix incorrect pid namespace used by ptrace during fork/vfork/clone
    
    pid namespaces are not used properly by ptrace in do_fork(). When tracing
    parent != real_parent because parent is the tracing task. Yet the pid in
    the real_parent's namespace is being used in do_fork():
    
    	nr = task_pid_vnr(p); /* uses real_parent's pid namespace */
    	if (clone_flags & CLONE_PARENT_SETTID)
    		put_user(nr, parent_tidptr); /* "real_parent_tidptr" */
    	...
    	tracehook_report_clone_complete(trace, regs,
    					clone_flags, nr, p); /* ptrace broken */
    
    	if (clone_flags & CLONE_VFORK) {
    		freezer_do_not_count();
    		wait_for_completion(&vfork);
    		freezer_count();
    		tracehook_report_vfork_done(p, nr); /* ptrace broken */
    
    In this case re-using the value in nr is wrong.
    
    This bug can be seen by attaching to an already-running task
    in a descendent namespace with strace -f. When the traced task forks
    strace won't attach to the new task properly because it sees the
    incorrect pid. For example, if root is running on two VTs and
    root@VTN# indicates switching to VT N:
    
    root@VT1# ns_exec -cp /bin/bash
    root@VT1# echo $$
    1
    root@VT2# strace -f -e fork,vfork,clone -p <pid of bash>
    Process 14518 attached - interrupt to quit
    root@VT1# /bin/bash
    <stops -- new bash shell does not respond to input>
    root@VT2#
    clone(Process 15 attached ... ) = 15
    Process 15044 attached (waiting for parent)
    Process 14518 suspended
    <no more output>
    <hit ctrl-c>
    root@VT1# echo $$
    15
    
    strace sees the pid of the new process to attach to as 15 when it should
    really be attaching to pid 15044. Interestingly enough, it does also
    attach to 15044 later but since the initial attach failed it does not
    properly resume the traced task.
    (I assume wait() helped here -- it reported 15044 and hence strace is aware
    that 15044 exists -- I haven't read the strace code to confirm this.)
    
    Miscellaneous Notes re: ptrace and pid namespaces (Documentation/* fodder?):
    
    Note that if the tracer detaches and a tracer from a different ancestor
    pid namespace attaches we'll have the wrong pid number again. The only
    way to fix that is to have ptrace hold a reference to a struct pid
    so long as it may be needed for PTRACE_GETEVENTMSG.
    
    The only way it's possible to ptrace a task outside the tracer's pid
    namespace is if the already-tracing task enters a new descendent pid
    namespace:
    
      tracer	     tracer does		 .
         \		=> clone(CLONE_NEWPID) =>	/ \
         tracee				  tracer   tracee
    
    In this case the pids returned by PTRACE_GETEVENTMSG will be 0.
    Since attaching to tasks that aren't in descendent namespaces is
    not possible, this is a very unlikely problem to encounter.
    
    Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
    Cc: Roland McGrath <roland-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> (MAINTAINERS: ptrace)
    Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> (MAINTAINERS: ptrace)
    Cc: <utrace folks>
    Cc: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> (pid ns)
    Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org (pid ns)
    Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

diff --git a/kernel/fork.c b/kernel/fork.c
index 3a65513..7946ea6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,6 +1404,7 @@ long do_fork(unsigned long clone_flags,
 	 */
 	if (!IS_ERR(p)) {
 		struct completion vfork;
+		int ptrace_pid_vnr;
 
 		trace_sched_process_fork(current, p);
 
@@ -1439,14 +1440,21 @@ long do_fork(unsigned long clone_flags,
 			wake_up_new_task(p, clone_flags);
 		}
 
+		ptrace_pid_vnr = nr;
+		if (unlikely(p->parent != p->real_parent)) {
+			rcu_read_lock();
+			ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
+			rcu_read_unlock();
+		}
 		tracehook_report_clone_complete(trace, regs,
-						clone_flags, nr, p);
+						clone_flags,
+						ptrace_pid_vnr, p);
 
 		if (clone_flags & CLONE_VFORK) {
 			freezer_do_not_count();
 			wait_for_completion(&vfork);
 			freezer_count();
-			tracehook_report_vfork_done(p, nr);
+			tracehook_report_vfork_done(p, ptrace_pid_vnr);
 		}
 	} else {
 		nr = PTR_ERR(p);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-21 19:50 Testing lxc 0.6.5 in Fedora 13 Grzegorz Nosek
       [not found] ` <20100321195044.GA23757-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
@ 2010-03-23 21:28 ` Matt Helsley
  2010-03-24  9:25   ` Greg Kurz
       [not found]   ` <20100323212834.GH20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-23 21:28 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: Roland McGrath, Oleg Nesterov, Sukadev Bhattiprolu, containers,
	linux-kernel

On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:

<snip>

> 2. Weird strace behaviour across pidns boundary
> 
> When strace'ing (with -ff) lxc-start, I get a proper strace for the
> directly spawned process and the container init. However, any processes
> spawned by the container's init are not straced properly (I get two
> empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
> inside the container). The container also seems to malfunction under
> strace (looks like exec() failing as lxc-ps shows two "init" processes).
> 
> This is quite painful as it prevents strace'ing processes in containers
> even after startup. Here's a snippet of strace'ing a bash (pid 179
> inside, pid 2959 outside) trying to run 'ls'. The shell hangs until I
> kill the strace process.
> 
> pipe([3, 4])                            = 0
> clone(Process 197 attached
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7859708) = 197
> Process 2999 attached (waiting for parent)
> [pid  2959] setpgid(197, 197)           = 0
> [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> [pid  2959] close(3)                    = 0
> [pid  2959] close(4)                    = 0
> [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
> [pid  2959] ioctl(255, TIOCSPGRP, [197]) = 0
> [pid  2959] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
> [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> [pid  2959] waitpid(-1, Process 2959 suspended
> ^C <unfinished ...>
> Process 2959 detached
> Process 197 detached
> Process 2999 detached
> 
> 'strace ls' ran completely inside the container works as expected.

I'm suprised strace of ls works across pid namespaces. I've been looking
at strace and it seemed to me that one kernel change and a bunch of strace
changes are needed to make strace'ing in child pid namespaces work. Eric
Biederman's setns() patches also might help.

Can you get a little farther with the kernel fix below?

    Fix incorrect pid namespace used by ptrace during fork/vfork/clone
    
    pid namespaces are not used properly by ptrace in do_fork(). When tracing
    parent != real_parent because parent is the tracing task. Yet the pid in
    the real_parent's namespace is being used in do_fork():
    
    	nr = task_pid_vnr(p); /* uses real_parent's pid namespace */
    	if (clone_flags & CLONE_PARENT_SETTID)
    		put_user(nr, parent_tidptr); /* "real_parent_tidptr" */
    	...
    	tracehook_report_clone_complete(trace, regs,
    					clone_flags, nr, p); /* ptrace broken */
    
    	if (clone_flags & CLONE_VFORK) {
    		freezer_do_not_count();
    		wait_for_completion(&vfork);
    		freezer_count();
    		tracehook_report_vfork_done(p, nr); /* ptrace broken */
    
    In this case re-using the value in nr is wrong.
    
    This bug can be seen by attaching to an already-running task
    in a descendent namespace with strace -f. When the traced task forks
    strace won't attach to the new task properly because it sees the
    incorrect pid. For example, if root is running on two VTs and
    root@VTN# indicates switching to VT N:
    
    root@VT1# ns_exec -cp /bin/bash
    root@VT1# echo $$
    1
    root@VT2# strace -f -e fork,vfork,clone -p <pid of bash>
    Process 14518 attached - interrupt to quit
    root@VT1# /bin/bash
    <stops -- new bash shell does not respond to input>
    root@VT2#
    clone(Process 15 attached ... ) = 15
    Process 15044 attached (waiting for parent)
    Process 14518 suspended
    <no more output>
    <hit ctrl-c>
    root@VT1# echo $$
    15
    
    strace sees the pid of the new process to attach to as 15 when it should
    really be attaching to pid 15044. Interestingly enough, it does also
    attach to 15044 later but since the initial attach failed it does not
    properly resume the traced task.
    (I assume wait() helped here -- it reported 15044 and hence strace is aware
    that 15044 exists -- I haven't read the strace code to confirm this.)
    
    Miscellaneous Notes re: ptrace and pid namespaces (Documentation/* fodder?):
    
    Note that if the tracer detaches and a tracer from a different ancestor
    pid namespace attaches we'll have the wrong pid number again. The only
    way to fix that is to have ptrace hold a reference to a struct pid
    so long as it may be needed for PTRACE_GETEVENTMSG.
    
    The only way it's possible to ptrace a task outside the tracer's pid
    namespace is if the already-tracing task enters a new descendent pid
    namespace:
    
      tracer	     tracer does		 .
         \		=> clone(CLONE_NEWPID) =>	/ \
         tracee				  tracer   tracee
    
    In this case the pids returned by PTRACE_GETEVENTMSG will be 0.
    Since attaching to tasks that aren't in descendent namespaces is
    not possible, this is a very unlikely problem to encounter.
    
    Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
    Cc: Roland McGrath <roland@redhat.com> (MAINTAINERS: ptrace)
    Cc: Oleg Nesterov <oleg@redhat.com> (MAINTAINERS: ptrace)
    Cc: <utrace folks>
    Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> (pid ns)
    Cc: containers@lists.linux-foundation.org (pid ns)
    Cc: linux-kernel@vger.kernel.org

diff --git a/kernel/fork.c b/kernel/fork.c
index 3a65513..7946ea6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,6 +1404,7 @@ long do_fork(unsigned long clone_flags,
 	 */
 	if (!IS_ERR(p)) {
 		struct completion vfork;
+		int ptrace_pid_vnr;
 
 		trace_sched_process_fork(current, p);
 
@@ -1439,14 +1440,21 @@ long do_fork(unsigned long clone_flags,
 			wake_up_new_task(p, clone_flags);
 		}
 
+		ptrace_pid_vnr = nr;
+		if (unlikely(p->parent != p->real_parent)) {
+			rcu_read_lock();
+			ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
+			rcu_read_unlock();
+		}
 		tracehook_report_clone_complete(trace, regs,
-						clone_flags, nr, p);
+						clone_flags,
+						ptrace_pid_vnr, p);
 
 		if (clone_flags & CLONE_VFORK) {
 			freezer_do_not_count();
 			wait_for_completion(&vfork);
 			freezer_count();
-			tracehook_report_vfork_done(p, nr);
+			tracehook_report_vfork_done(p, ptrace_pid_vnr);
 		}
 	} else {
 		nr = PTR_ERR(p);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]   ` <20100323212834.GH20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-24  9:25     ` Greg Kurz
  2010-03-25 21:33       ` Grzegorz Nosek
  1 sibling, 0 replies; 39+ messages in thread
From: Greg Kurz @ 2010-03-24  9:25 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Roland McGrath,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, 2010-03-23 at 14:28 -0700, Matt Helsley wrote:
> On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> 
> <snip>
> 
> > 2. Weird strace behaviour across pidns boundary
> > 
> > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > directly spawned process and the container init. However, any processes
> > spawned by the container's init are not straced properly (I get two
> > empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
> > inside the container). The container also seems to malfunction under
> > strace (looks like exec() failing as lxc-ps shows two "init" processes).
> > 

<snip>

> > 
> > 'strace ls' ran completely inside the container works as expected.
> 
> I'm suprised strace of ls works across pid namespaces. I've been looking

Matt,

Grzegorz is telling you that 'strace -ff lxc-start ls' is broken but
'lxc-start strace -ff ls' works as expected => strace of ls doesn't work
across pid namespaces. No surprise.

Cheers.

-- 
Gregory Kurz                                     gkurz-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-23 21:28 ` Matt Helsley
@ 2010-03-24  9:25   ` Greg Kurz
       [not found]   ` <20100323212834.GH20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Greg Kurz @ 2010-03-24  9:25 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Grzegorz Nosek, containers, Oleg Nesterov, Roland McGrath, linux-kernel

On Tue, 2010-03-23 at 14:28 -0700, Matt Helsley wrote:
> On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> 
> <snip>
> 
> > 2. Weird strace behaviour across pidns boundary
> > 
> > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > directly spawned process and the container init. However, any processes
> > spawned by the container's init are not straced properly (I get two
> > empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
> > inside the container). The container also seems to malfunction under
> > strace (looks like exec() failing as lxc-ps shows two "init" processes).
> > 

<snip>

> > 
> > 'strace ls' ran completely inside the container works as expected.
> 
> I'm suprised strace of ls works across pid namespaces. I've been looking

Matt,

Grzegorz is telling you that 'strace -ff lxc-start ls' is broken but
'lxc-start strace -ff ls' works as expected => strace of ls doesn't work
across pid namespaces. No surprise.

Cheers.

-- 
Gregory Kurz                                     gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-23 21:28 ` Matt Helsley
@ 2010-03-25 21:33       ` Grzegorz Nosek
       [not found]   ` <20100323212834.GH20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Grzegorz Nosek @ 2010-03-25 21:33 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Roland McGrath,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On wto, mar 23, 2010 at 02:28:34 -0700, Matt Helsley wrote:
> On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> 
> <snip>
> 
> > 2. Weird strace behaviour across pidns boundary
> > 
> > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > directly spawned process and the container init. However, any processes
> > spawned by the container's init are not straced properly (I get two
> > empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
> > inside the container). The container also seems to malfunction under
> > strace (looks like exec() failing as lxc-ps shows two "init" processes).
> > 
> > This is quite painful as it prevents strace'ing processes in containers
> > even after startup. Here's a snippet of strace'ing a bash (pid 179
> > inside, pid 2959 outside) trying to run 'ls'. The shell hangs until I
> > kill the strace process.
> > 
> > pipe([3, 4])                            = 0
> > clone(Process 197 attached
> > child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7859708) = 197
> > Process 2999 attached (waiting for parent)
> > [pid  2959] setpgid(197, 197)           = 0
> > [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> > [pid  2959] close(3)                    = 0
> > [pid  2959] close(4)                    = 0
> > [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
> > [pid  2959] ioctl(255, TIOCSPGRP, [197]) = 0
> > [pid  2959] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
> > [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> > [pid  2959] waitpid(-1, Process 2959 suspended
> > ^C <unfinished ...>
> > Process 2959 detached
> > Process 197 detached
> > Process 2999 detached
> > 
> > 'strace ls' ran completely inside the container works as expected.
> 
> I'm suprised strace of ls works across pid namespaces. I've been looking
> at strace and it seemed to me that one kernel change and a bunch of strace
> changes are needed to make strace'ing in child pid namespaces work. Eric
> Biederman's setns() patches also might help.

Thanks for the patch and the detailed explanation.

> Can you get a little farther with the kernel fix below?

No, not really. Attaching from outside to a shell running in a container
and running a command yields:

| rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
| pipe([3, 4])                            = 0
| clone(Process 2581 attached (waiting for parent)
| Process 190 attached

Without the patch the order of reported pids is reversed (and at least
with the patched kernel the outside pid is consistently reported first)

| child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7713708) = 190
| [pid  2549] setpgid(190, 190)           = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
| [pid  2549] close(3)                    = 0
| [pid  2549] close(4)                    = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
| [pid  2549] ioctl(255, TIOCSPGRP, [190]) = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
| [pid  2549] waitpid(-1, Process 2549 suspended

(the shell hangs here)

^C <unfinished ...>
| Process 2549 detached
| Process 2581 detached
| Process 190 detached

(the command executes here normally).

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
@ 2010-03-25 21:33       ` Grzegorz Nosek
  0 siblings, 0 replies; 39+ messages in thread
From: Grzegorz Nosek @ 2010-03-25 21:33 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Roland McGrath, Oleg Nesterov, Sukadev Bhattiprolu, containers,
	linux-kernel

On wto, mar 23, 2010 at 02:28:34 -0700, Matt Helsley wrote:
> On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> 
> <snip>
> 
> > 2. Weird strace behaviour across pidns boundary
> > 
> > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > directly spawned process and the container init. However, any processes
> > spawned by the container's init are not straced properly (I get two
> > empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
> > inside the container). The container also seems to malfunction under
> > strace (looks like exec() failing as lxc-ps shows two "init" processes).
> > 
> > This is quite painful as it prevents strace'ing processes in containers
> > even after startup. Here's a snippet of strace'ing a bash (pid 179
> > inside, pid 2959 outside) trying to run 'ls'. The shell hangs until I
> > kill the strace process.
> > 
> > pipe([3, 4])                            = 0
> > clone(Process 197 attached
> > child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7859708) = 197
> > Process 2999 attached (waiting for parent)
> > [pid  2959] setpgid(197, 197)           = 0
> > [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> > [pid  2959] close(3)                    = 0
> > [pid  2959] close(4)                    = 0
> > [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
> > [pid  2959] ioctl(255, TIOCSPGRP, [197]) = 0
> > [pid  2959] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
> > [pid  2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [pid  2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> > [pid  2959] waitpid(-1, Process 2959 suspended
> > ^C <unfinished ...>
> > Process 2959 detached
> > Process 197 detached
> > Process 2999 detached
> > 
> > 'strace ls' ran completely inside the container works as expected.
> 
> I'm suprised strace of ls works across pid namespaces. I've been looking
> at strace and it seemed to me that one kernel change and a bunch of strace
> changes are needed to make strace'ing in child pid namespaces work. Eric
> Biederman's setns() patches also might help.

Thanks for the patch and the detailed explanation.

> Can you get a little farther with the kernel fix below?

No, not really. Attaching from outside to a shell running in a container
and running a command yields:

| rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
| pipe([3, 4])                            = 0
| clone(Process 2581 attached (waiting for parent)
| Process 190 attached

Without the patch the order of reported pids is reversed (and at least
with the patched kernel the outside pid is consistently reported first)

| child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7713708) = 190
| [pid  2549] setpgid(190, 190)           = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
| [pid  2549] close(3)                    = 0
| [pid  2549] close(4)                    = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
| [pid  2549] ioctl(255, TIOCSPGRP, [190]) = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
| [pid  2549] waitpid(-1, Process 2549 suspended

(the shell hangs here)

^C <unfinished ...>
| Process 2549 detached
| Process 2581 detached
| Process 190 detached

(the command executes here normally).

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]       ` <20100325213356.GB20541-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
@ 2010-03-26 11:11         ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 11:11 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Roland McGrath, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 03/25, Grzegorz Nosek wrote:
>
> On wto, mar 23, 2010 at 02:28:34 -0700, Matt Helsley wrote:
> > On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> >
> > <snip>
> >
> > > 2. Weird strace behaviour across pidns boundary
> > >
> > > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > > directly spawned process and the container init. However, any processes
> > > spawned by the container's init are not straced properly

Yes, this is broken. More precisely, this wasn't even supposed to work.

Even stracing of the sub-init itself (or global init btw) has problems,
the straced init is not protected from unwanted signals.

> > I'm suprised strace of ls works across pid namespaces. I've been looking
> > at strace and it seemed to me that one kernel change and a bunch of strace
> > changes are needed to make strace'ing in child pid namespaces work.

Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
as it seen inside the init's namespace. This is easy to fix, but I doubt this
can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
after syscall.

> Eric
> > Biederman's setns() patches also might help.
>
> Thanks for the patch and the detailed explanation.

which patch?

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-25 21:33       ` Grzegorz Nosek
  (?)
@ 2010-03-26 11:11       ` Oleg Nesterov
  2010-03-26 11:32         ` Grzegorz Nosek
                           ` (2 more replies)
  -1 siblings, 3 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 11:11 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: Matt Helsley, Roland McGrath, Sukadev Bhattiprolu, containers,
	linux-kernel

On 03/25, Grzegorz Nosek wrote:
>
> On wto, mar 23, 2010 at 02:28:34 -0700, Matt Helsley wrote:
> > On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> >
> > <snip>
> >
> > > 2. Weird strace behaviour across pidns boundary
> > >
> > > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > > directly spawned process and the container init. However, any processes
> > > spawned by the container's init are not straced properly

Yes, this is broken. More precisely, this wasn't even supposed to work.

Even stracing of the sub-init itself (or global init btw) has problems,
the straced init is not protected from unwanted signals.

> > I'm suprised strace of ls works across pid namespaces. I've been looking
> > at strace and it seemed to me that one kernel change and a bunch of strace
> > changes are needed to make strace'ing in child pid namespaces work.

Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
as it seen inside the init's namespace. This is easy to fix, but I doubt this
can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
after syscall.

> Eric
> > Biederman's setns() patches also might help.
>
> Thanks for the patch and the detailed explanation.

which patch?

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]         ` <20100326111131.GA8604-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-03-26 11:32           ` Grzegorz Nosek
  2010-03-26 11:53           ` Matt Helsley
  1 sibling, 0 replies; 39+ messages in thread
From: Grzegorz Nosek @ 2010-03-26 11:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Roland McGrath, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> Yes, this is broken. More precisely, this wasn't even supposed to work.
> 
> Even stracing of the sub-init itself (or global init btw) has problems,
> the straced init is not protected from unwanted signals.

Is this impossible/very hard to do cleanly? I understand that container's
init becomes vulnerable to signals sent from root-owned processes in the
container. If so, the impact of this issue should be quite limited, no?

Strace'ing processes across pidns boundary would be really useful in day
to day administrative work but if it's unfixable, I guess at least
preventing strace from attaching to processes in a descendant pidns
would be required in order to prevent container malfunction.

> Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> as it seen inside the init's namespace. This is easy to fix, but I doubt this
> can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> after syscall.
> 
> which patch?

The patch below posted by Matt. AIUI, it fixes the
tracehook_report_clone_complete() part, which results in an observable
change in strace's behaviour (not that it makes strace work, though).
Anyway, are there any remaining issues on the kernel side or does strace
have to be taught about pid namespaces?

Best regards,
 Grzegorz Nosek

diff --git a/kernel/fork.c b/kernel/fork.c
index 3a65513..7946ea6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,6 +1404,7 @@ long do_fork(unsigned long clone_flags,
         */
        if (!IS_ERR(p)) {
                struct completion vfork;
+               int ptrace_pid_vnr;

                trace_sched_process_fork(current, p);

@@ -1439,14 +1440,21 @@ long do_fork(unsigned long clone_flags,
                        wake_up_new_task(p, clone_flags);
                }

+               ptrace_pid_vnr = nr;
+               if (unlikely(p->parent != p->real_parent)) {
+                       rcu_read_lock();
+                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
+                       rcu_read_unlock();
+               }
                tracehook_report_clone_complete(trace, regs,
-                                               clone_flags, nr, p);
+                                               clone_flags,
+                                               ptrace_pid_vnr, p);

                if (clone_flags & CLONE_VFORK) {
                        freezer_do_not_count();
                        wait_for_completion(&vfork);
                        freezer_count();
-                       tracehook_report_vfork_done(p, nr);
+                       tracehook_report_vfork_done(p, ptrace_pid_vnr);
                }
        } else {
                nr = PTR_ERR(p);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 11:11       ` Oleg Nesterov
@ 2010-03-26 11:32         ` Grzegorz Nosek
       [not found]           ` <20100326113201.GB17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
  2010-03-26 12:00           ` Oleg Nesterov
  2010-03-26 11:53         ` Matt Helsley
       [not found]         ` <20100326111131.GA8604-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 2 replies; 39+ messages in thread
From: Grzegorz Nosek @ 2010-03-26 11:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Matt Helsley, Roland McGrath, Sukadev Bhattiprolu, containers,
	linux-kernel

On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> Yes, this is broken. More precisely, this wasn't even supposed to work.
> 
> Even stracing of the sub-init itself (or global init btw) has problems,
> the straced init is not protected from unwanted signals.

Is this impossible/very hard to do cleanly? I understand that container's
init becomes vulnerable to signals sent from root-owned processes in the
container. If so, the impact of this issue should be quite limited, no?

Strace'ing processes across pidns boundary would be really useful in day
to day administrative work but if it's unfixable, I guess at least
preventing strace from attaching to processes in a descendant pidns
would be required in order to prevent container malfunction.

> Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> as it seen inside the init's namespace. This is easy to fix, but I doubt this
> can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> after syscall.
> 
> which patch?

The patch below posted by Matt. AIUI, it fixes the
tracehook_report_clone_complete() part, which results in an observable
change in strace's behaviour (not that it makes strace work, though).
Anyway, are there any remaining issues on the kernel side or does strace
have to be taught about pid namespaces?

Best regards,
 Grzegorz Nosek

diff --git a/kernel/fork.c b/kernel/fork.c
index 3a65513..7946ea6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,6 +1404,7 @@ long do_fork(unsigned long clone_flags,
         */
        if (!IS_ERR(p)) {
                struct completion vfork;
+               int ptrace_pid_vnr;

                trace_sched_process_fork(current, p);

@@ -1439,14 +1440,21 @@ long do_fork(unsigned long clone_flags,
                        wake_up_new_task(p, clone_flags);
                }

+               ptrace_pid_vnr = nr;
+               if (unlikely(p->parent != p->real_parent)) {
+                       rcu_read_lock();
+                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
+                       rcu_read_unlock();
+               }
                tracehook_report_clone_complete(trace, regs,
-                                               clone_flags, nr, p);
+                                               clone_flags,
+                                               ptrace_pid_vnr, p);

                if (clone_flags & CLONE_VFORK) {
                        freezer_do_not_count();
                        wait_for_completion(&vfork);
                        freezer_count();
-                       tracehook_report_vfork_done(p, nr);
+                       tracehook_report_vfork_done(p, ptrace_pid_vnr);
                }
        } else {
                nr = PTR_ERR(p);


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]         ` <20100326111131.GA8604-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2010-03-26 11:32           ` Grzegorz Nosek
@ 2010-03-26 11:53           ` Matt Helsley
  1 sibling, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-26 11:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roland McGrath

On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> On 03/25, Grzegorz Nosek wrote:
> >
> > On wto, mar 23, 2010 at 02:28:34 -0700, Matt Helsley wrote:
> > > On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> > >
> > > <snip>
> > >
> > > > 2. Weird strace behaviour across pidns boundary
> > > >
> > > > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > > > directly spawned process and the container init. However, any processes
> > > > spawned by the container's init are not straced properly
> 
> Yes, this is broken. More precisely, this wasn't even supposed to work.
> 
> Even stracing of the sub-init itself (or global init btw) has problems,
> the straced init is not protected from unwanted signals.
> 
> > > I'm suprised strace of ls works across pid namespaces. I've been looking
> > > at strace and it seemed to me that one kernel change and a bunch of strace
> > > changes are needed to make strace'ing in child pid namespaces work.
> 
> Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> as it seen inside the init's namespace. This is easy to fix, but I doubt this
> can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> after syscall.

Yup. strace would need to be modified to use that. I tried that and it still
won't work -- I seem to recall it didn't work because strace uses pid values
obtained from the wait syscall too. To make it work we'd need to be able to
translate those pids in userspace. That's do-able from userspace if you trace
all forks descending from the pidns init task. But it's not do-able for
simple attaches. That's why I was thinking Eric's setns() might be able to
help if strace used it to enter the tracee's pid namespace whenever we need to.

gdb often doesn't use the same methods but has similar problems with pid
namespaces.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 11:11       ` Oleg Nesterov
  2010-03-26 11:32         ` Grzegorz Nosek
@ 2010-03-26 11:53         ` Matt Helsley
  2010-03-26 12:45           ` Grzegorz Nosek
       [not found]           ` <20100326115357.GA3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
       [not found]         ` <20100326111131.GA8604-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 2 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-26 11:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Grzegorz Nosek, Matt Helsley, Roland McGrath,
	Sukadev Bhattiprolu, containers, linux-kernel

On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> On 03/25, Grzegorz Nosek wrote:
> >
> > On wto, mar 23, 2010 at 02:28:34 -0700, Matt Helsley wrote:
> > > On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
> > >
> > > <snip>
> > >
> > > > 2. Weird strace behaviour across pidns boundary
> > > >
> > > > When strace'ing (with -ff) lxc-start, I get a proper strace for the
> > > > directly spawned process and the container init. However, any processes
> > > > spawned by the container's init are not straced properly
> 
> Yes, this is broken. More precisely, this wasn't even supposed to work.
> 
> Even stracing of the sub-init itself (or global init btw) has problems,
> the straced init is not protected from unwanted signals.
> 
> > > I'm suprised strace of ls works across pid namespaces. I've been looking
> > > at strace and it seemed to me that one kernel change and a bunch of strace
> > > changes are needed to make strace'ing in child pid namespaces work.
> 
> Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> as it seen inside the init's namespace. This is easy to fix, but I doubt this
> can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> after syscall.

Yup. strace would need to be modified to use that. I tried that and it still
won't work -- I seem to recall it didn't work because strace uses pid values
obtained from the wait syscall too. To make it work we'd need to be able to
translate those pids in userspace. That's do-able from userspace if you trace
all forks descending from the pidns init task. But it's not do-able for
simple attaches. That's why I was thinking Eric's setns() might be able to
help if strace used it to enter the tracee's pid namespace whenever we need to.

gdb often doesn't use the same methods but has similar problems with pid
namespaces.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]           ` <20100326113201.GB17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
@ 2010-03-26 12:00             ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 12:00 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Roland McGrath, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 03/26, Grzegorz Nosek wrote:
>
> On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> > Yes, this is broken. More precisely, this wasn't even supposed to work.
> >
> > Even stracing of the sub-init itself (or global init btw) has problems,
> > the straced init is not protected from unwanted signals.
>
> Is this impossible/very hard to do cleanly? I understand that container's
> init becomes vulnerable to signals sent from root-owned processes in the
> container. If so, the impact of this issue should be quite limited, no?

Yes, probably we can ignore this.

> > Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> > as it seen inside the init's namespace. This is easy to fix, but I doubt this
> > can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> > after syscall.
> >
> > which patch?
>
> The patch below posted by Matt. AIUI, it fixes the
> tracehook_report_clone_complete() part, which results in an observable
> change in strace's behaviour (not that it makes strace work, though).

I guess it doesn't work because we need to fix strace, see "strace doesn't
use PTRACE_GETEVENTMSG" above.

> Anyway, are there any remaining issues on the kernel side or does strace
> have to be taught about pid namespaces?

At first glance, I don't see other problems, except sometimes the reported
pid is wrong (like in do_fork). 

> +               ptrace_pid_vnr = nr;
> +               if (unlikely(p->parent != p->real_parent)) {
> +                       rcu_read_lock();
> +                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);

Yes, this is what I meant.

But we should not do this in do_fork().


But once again. This change fixes the value in "tracee->ptrace_message == newpid",
but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all.

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 11:32         ` Grzegorz Nosek
       [not found]           ` <20100326113201.GB17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
@ 2010-03-26 12:00           ` Oleg Nesterov
       [not found]             ` <20100326120028.GA11311-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2010-03-26 12:46             ` Matt Helsley
  1 sibling, 2 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 12:00 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: Matt Helsley, Roland McGrath, Sukadev Bhattiprolu, containers,
	linux-kernel

On 03/26, Grzegorz Nosek wrote:
>
> On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> > Yes, this is broken. More precisely, this wasn't even supposed to work.
> >
> > Even stracing of the sub-init itself (or global init btw) has problems,
> > the straced init is not protected from unwanted signals.
>
> Is this impossible/very hard to do cleanly? I understand that container's
> init becomes vulnerable to signals sent from root-owned processes in the
> container. If so, the impact of this issue should be quite limited, no?

Yes, probably we can ignore this.

> > Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> > as it seen inside the init's namespace. This is easy to fix, but I doubt this
> > can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> > after syscall.
> >
> > which patch?
>
> The patch below posted by Matt. AIUI, it fixes the
> tracehook_report_clone_complete() part, which results in an observable
> change in strace's behaviour (not that it makes strace work, though).

I guess it doesn't work because we need to fix strace, see "strace doesn't
use PTRACE_GETEVENTMSG" above.

> Anyway, are there any remaining issues on the kernel side or does strace
> have to be taught about pid namespaces?

At first glance, I don't see other problems, except sometimes the reported
pid is wrong (like in do_fork). 

> +               ptrace_pid_vnr = nr;
> +               if (unlikely(p->parent != p->real_parent)) {
> +                       rcu_read_lock();
> +                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);

Yes, this is what I meant.

But we should not do this in do_fork().


But once again. This change fixes the value in "tracee->ptrace_message == newpid",
but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all.

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]           ` <20100326115357.GA3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-26 12:45             ` Grzegorz Nosek
  0 siblings, 0 replies; 39+ messages in thread
From: Grzegorz Nosek @ 2010-03-26 12:45 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Roland McGrath,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Mar 26, 2010 at 04:53:57AM -0700, Matt Helsley wrote:
> Yup. strace would need to be modified to use that. I tried that and it still
> won't work -- I seem to recall it didn't work because strace uses pid values
> obtained from the wait syscall too. To make it work we'd need to be able to
> translate those pids in userspace. That's do-able from userspace if you trace
> all forks descending from the pidns init task. But it's not do-able for
> simple attaches. That's why I was thinking Eric's setns() might be able to
> help if strace used it to enter the tracee's pid namespace whenever we need to.
> 
> gdb often doesn't use the same methods but has similar problems with pid
> namespaces.

Hmm, is there a good reason why strace does not use the data explicitly
provided by the kernel but instead second-guesses it from syscall return
values? I don't know anything about ptrace, really, but I'd expect the
kernel to provide the tracer with out-of-band information otherwise
taken from clone/waitpid/other syscalls?

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 11:53         ` Matt Helsley
@ 2010-03-26 12:45           ` Grzegorz Nosek
       [not found]             ` <20100326124522.GD17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
                               ` (2 more replies)
       [not found]           ` <20100326115357.GA3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 3 replies; 39+ messages in thread
From: Grzegorz Nosek @ 2010-03-26 12:45 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Oleg Nesterov, Roland McGrath, Sukadev Bhattiprolu, containers,
	linux-kernel

On Fri, Mar 26, 2010 at 04:53:57AM -0700, Matt Helsley wrote:
> Yup. strace would need to be modified to use that. I tried that and it still
> won't work -- I seem to recall it didn't work because strace uses pid values
> obtained from the wait syscall too. To make it work we'd need to be able to
> translate those pids in userspace. That's do-able from userspace if you trace
> all forks descending from the pidns init task. But it's not do-able for
> simple attaches. That's why I was thinking Eric's setns() might be able to
> help if strace used it to enter the tracee's pid namespace whenever we need to.
> 
> gdb often doesn't use the same methods but has similar problems with pid
> namespaces.

Hmm, is there a good reason why strace does not use the data explicitly
provided by the kernel but instead second-guesses it from syscall return
values? I don't know anything about ptrace, really, but I'd expect the
kernel to provide the tracer with out-of-band information otherwise
taken from clone/waitpid/other syscalls?

Best regards,
 Grzegorz Nosek

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]             ` <20100326120028.GA11311-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-03-26 12:46               ` Matt Helsley
  0 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-26 12:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roland McGrath

On Fri, Mar 26, 2010 at 01:00:28PM +0100, Oleg Nesterov wrote:
> On 03/26, Grzegorz Nosek wrote:
> >
> > On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> > > Yes, this is broken. More precisely, this wasn't even supposed to work.
> > >
> > > Even stracing of the sub-init itself (or global init btw) has problems,
> > > the straced init is not protected from unwanted signals.
> >
> > Is this impossible/very hard to do cleanly? I understand that container's
> > init becomes vulnerable to signals sent from root-owned processes in the
> > container. If so, the impact of this issue should be quite limited, no?
> 
> Yes, probably we can ignore this.
> 
> > > Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> > > as it seen inside the init's namespace. This is easy to fix, but I doubt this
> > > can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> > > after syscall.
> > >
> > > which patch?
> >
> > The patch below posted by Matt. AIUI, it fixes the
> > tracehook_report_clone_complete() part, which results in an observable
> > change in strace's behaviour (not that it makes strace work, though).
> 
> I guess it doesn't work because we need to fix strace, see "strace doesn't
> use PTRACE_GETEVENTMSG" above.
> 
> > Anyway, are there any remaining issues on the kernel side or does strace
> > have to be taught about pid namespaces?
> 
> At first glance, I don't see other problems, except sometimes the reported
> pid is wrong (like in do_fork). 
> 
> > +               ptrace_pid_vnr = nr;
> > +               if (unlikely(p->parent != p->real_parent)) {
> > +                       rcu_read_lock();
> > +                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
> 
> Yes, this is what I meant.
> 
> But we should not do this in do_fork().

I'm puzzled. If not here, where should we do this? Or are you saying
ptrace should take a reference to the pid, store it, and get the vnr during
PTRACE_GETEVENTMSG? (drop the reference at detach or when a new pid reference
comes in..)

> But once again. This change fixes the value in "tracee->ptrace_message == newpid",
> but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all.

You are correct. However strace and gdb aren't necessarily the only users
of ptrace so wouldn't it still be good to fix this?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 12:00           ` Oleg Nesterov
       [not found]             ` <20100326120028.GA11311-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-03-26 12:46             ` Matt Helsley
  2010-03-26 13:34               ` Oleg Nesterov
       [not found]               ` <20100326124619.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-26 12:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Grzegorz Nosek, Matt Helsley, Roland McGrath,
	Sukadev Bhattiprolu, containers, linux-kernel

On Fri, Mar 26, 2010 at 01:00:28PM +0100, Oleg Nesterov wrote:
> On 03/26, Grzegorz Nosek wrote:
> >
> > On Fri, Mar 26, 2010 at 12:11:31PM +0100, Oleg Nesterov wrote:
> > > Yes, this is broken. More precisely, this wasn't even supposed to work.
> > >
> > > Even stracing of the sub-init itself (or global init btw) has problems,
> > > the straced init is not protected from unwanted signals.
> >
> > Is this impossible/very hard to do cleanly? I understand that container's
> > init becomes vulnerable to signals sent from root-owned processes in the
> > container. If so, the impact of this issue should be quite limited, no?
> 
> Yes, probably we can ignore this.
> 
> > > Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
> > > as it seen inside the init's namespace. This is easy to fix, but I doubt this
> > > can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax
> > > after syscall.
> > >
> > > which patch?
> >
> > The patch below posted by Matt. AIUI, it fixes the
> > tracehook_report_clone_complete() part, which results in an observable
> > change in strace's behaviour (not that it makes strace work, though).
> 
> I guess it doesn't work because we need to fix strace, see "strace doesn't
> use PTRACE_GETEVENTMSG" above.
> 
> > Anyway, are there any remaining issues on the kernel side or does strace
> > have to be taught about pid namespaces?
> 
> At first glance, I don't see other problems, except sometimes the reported
> pid is wrong (like in do_fork). 
> 
> > +               ptrace_pid_vnr = nr;
> > +               if (unlikely(p->parent != p->real_parent)) {
> > +                       rcu_read_lock();
> > +                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
> 
> Yes, this is what I meant.
> 
> But we should not do this in do_fork().

I'm puzzled. If not here, where should we do this? Or are you saying
ptrace should take a reference to the pid, store it, and get the vnr during
PTRACE_GETEVENTMSG? (drop the reference at detach or when a new pid reference
comes in..)

> But once again. This change fixes the value in "tracee->ptrace_message == newpid",
> but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all.

You are correct. However strace and gdb aren't necessarily the only users
of ptrace so wouldn't it still be good to fix this?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]             ` <20100326124522.GD17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
@ 2010-03-26 12:54               ` Matt Helsley
  2010-03-26 13:47               ` Oleg Nesterov
  1 sibling, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-26 12:54 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oleg Nesterov,
	Roland McGrath

On Fri, Mar 26, 2010 at 01:45:22PM +0100, Grzegorz Nosek wrote:
> On Fri, Mar 26, 2010 at 04:53:57AM -0700, Matt Helsley wrote:
> > Yup. strace would need to be modified to use that. I tried that and it still
> > won't work -- I seem to recall it didn't work because strace uses pid values
> > obtained from the wait syscall too. To make it work we'd need to be able to
> > translate those pids in userspace. That's do-able from userspace if you trace
> > all forks descending from the pidns init task. But it's not do-able for
> > simple attaches. That's why I was thinking Eric's setns() might be able to
> > help if strace used it to enter the tracee's pid namespace whenever we need to.
> > 
> > gdb often doesn't use the same methods but has similar problems with pid
> > namespaces.
> 
> Hmm, is there a good reason why strace does not use the data explicitly
> provided by the kernel but instead second-guesses it from syscall return
> values? I don't know anything about ptrace, really, but I'd expect the

strace isn't linux-only. Checking the syscall return values may be seen
as being more portable. At least that's my guess. That said there are
plenty of #ifdefs in strace and patching it to use GETEVENTMSG is quite
a small patch.

However, as I said, that still doesn't "fix" strace so that it can
be used to trace tasks in child pid namespaces. Especially when the
traced tasks are more than one namepace deeper. :(

> kernel to provide the tracer with out-of-band information otherwise
> taken from clone/waitpid/other syscalls?

I don't think the kernel provides special out-of-band methods for fetching
pids related to traced tasks except during fork and clone. Not wait*(). The
rest of ptrace tends to be focused on reading/writing registers/memory and
managing signal delivery.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 12:45           ` Grzegorz Nosek
       [not found]             ` <20100326124522.GD17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
@ 2010-03-26 12:54             ` Matt Helsley
       [not found]               ` <20100326125459.GD3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-03-26 13:56               ` Oleg Nesterov
  2010-03-26 13:47             ` Oleg Nesterov
  2 siblings, 2 replies; 39+ messages in thread
From: Matt Helsley @ 2010-03-26 12:54 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: Matt Helsley, Oleg Nesterov, Roland McGrath, Sukadev Bhattiprolu,
	containers, linux-kernel

On Fri, Mar 26, 2010 at 01:45:22PM +0100, Grzegorz Nosek wrote:
> On Fri, Mar 26, 2010 at 04:53:57AM -0700, Matt Helsley wrote:
> > Yup. strace would need to be modified to use that. I tried that and it still
> > won't work -- I seem to recall it didn't work because strace uses pid values
> > obtained from the wait syscall too. To make it work we'd need to be able to
> > translate those pids in userspace. That's do-able from userspace if you trace
> > all forks descending from the pidns init task. But it's not do-able for
> > simple attaches. That's why I was thinking Eric's setns() might be able to
> > help if strace used it to enter the tracee's pid namespace whenever we need to.
> > 
> > gdb often doesn't use the same methods but has similar problems with pid
> > namespaces.
> 
> Hmm, is there a good reason why strace does not use the data explicitly
> provided by the kernel but instead second-guesses it from syscall return
> values? I don't know anything about ptrace, really, but I'd expect the

strace isn't linux-only. Checking the syscall return values may be seen
as being more portable. At least that's my guess. That said there are
plenty of #ifdefs in strace and patching it to use GETEVENTMSG is quite
a small patch.

However, as I said, that still doesn't "fix" strace so that it can
be used to trace tasks in child pid namespaces. Especially when the
traced tasks are more than one namepace deeper. :(

> kernel to provide the tracer with out-of-band information otherwise
> taken from clone/waitpid/other syscalls?

I don't think the kernel provides special out-of-band methods for fetching
pids related to traced tasks except during fork and clone. Not wait*(). The
rest of ptrace tends to be focused on reading/writing registers/memory and
managing signal delivery.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]               ` <20100326124619.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-26 13:34                 ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 13:34 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Roland McGrath, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 03/26, Matt Helsley wrote:
>
> On Fri, Mar 26, 2010 at 01:00:28PM +0100, Oleg Nesterov wrote:
> > 
> > > +               ptrace_pid_vnr = nr;
> > > +               if (unlikely(p->parent != p->real_parent)) {
> > > +                       rcu_read_lock();
> > > +                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
> > 
> > Yes, this is what I meant.
> > 
> > But we should not do this in do_fork().
> 
> I'm puzzled. If not here, where should we do this? Or are you saying
> ptrace should take a reference to the pid,

Ah, no, sorry.

I meant tracehook_report_clone_complete should do this under "if (trace)".
And we need a helper to get the right pid, it could be used
by do_notify_parent() too, and (probably) we need more changes like this.

> > But once again. This change fixes the value in "tracee->ptrace_message == newpid",
> > but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all.
>
> You are correct. However strace and gdb aren't necessarily the only users
> of ptrace so wouldn't it still be good to fix this?

Yes, agreed.

Oh. The only problem is utrace patches in -mm. I mean the possible textual
conflicts...

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 12:46             ` Matt Helsley
@ 2010-03-26 13:34               ` Oleg Nesterov
       [not found]               ` <20100326124619.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 13:34 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Grzegorz Nosek, Roland McGrath, Sukadev Bhattiprolu, containers,
	linux-kernel

On 03/26, Matt Helsley wrote:
>
> On Fri, Mar 26, 2010 at 01:00:28PM +0100, Oleg Nesterov wrote:
> > 
> > > +               ptrace_pid_vnr = nr;
> > > +               if (unlikely(p->parent != p->real_parent)) {
> > > +                       rcu_read_lock();
> > > +                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
> > 
> > Yes, this is what I meant.
> > 
> > But we should not do this in do_fork().
> 
> I'm puzzled. If not here, where should we do this? Or are you saying
> ptrace should take a reference to the pid,

Ah, no, sorry.

I meant tracehook_report_clone_complete should do this under "if (trace)".
And we need a helper to get the right pid, it could be used
by do_notify_parent() too, and (probably) we need more changes like this.

> > But once again. This change fixes the value in "tracee->ptrace_message == newpid",
> > but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all.
>
> You are correct. However strace and gdb aren't necessarily the only users
> of ptrace so wouldn't it still be good to fix this?

Yes, agreed.

Oh. The only problem is utrace patches in -mm. I mean the possible textual
conflicts...

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]             ` <20100326124522.GD17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
  2010-03-26 12:54               ` Matt Helsley
@ 2010-03-26 13:47               ` Oleg Nesterov
  1 sibling, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 13:47 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Roland McGrath, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 03/26, Grzegorz Nosek wrote:
>
> Hmm, is there a good reason why strace does not use the data explicitly
> provided by the kernel but instead second-guesses it from syscall return
> values? I don't know anything about ptrace, really,

I know almost nothing too. But _iirc_, this all is even worse, strace
doesn't necessary gets the notification from do_fork(), at least
another quick grep shows it doesn't use PTRACE_O_TRACEFORK.

This is another reason why the change in do_fork() can't really help
right now.

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 12:45           ` Grzegorz Nosek
       [not found]             ` <20100326124522.GD17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
  2010-03-26 12:54             ` Matt Helsley
@ 2010-03-26 13:47             ` Oleg Nesterov
       [not found]               ` <20100326134709.GB15790-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2010-04-06  3:44               ` Roland McGrath
  2 siblings, 2 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 13:47 UTC (permalink / raw)
  To: Grzegorz Nosek
  Cc: Matt Helsley, Roland McGrath, Sukadev Bhattiprolu, containers,
	linux-kernel

On 03/26, Grzegorz Nosek wrote:
>
> Hmm, is there a good reason why strace does not use the data explicitly
> provided by the kernel but instead second-guesses it from syscall return
> values? I don't know anything about ptrace, really,

I know almost nothing too. But _iirc_, this all is even worse, strace
doesn't necessary gets the notification from do_fork(), at least
another quick grep shows it doesn't use PTRACE_O_TRACEFORK.

This is another reason why the change in do_fork() can't really help
right now.

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]               ` <20100326125459.GD3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-26 13:56                 ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 13:56 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Roland McGrath, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 03/26, Matt Helsley wrote:
>
> That said there are
> plenty of #ifdefs in strace and patching it to use GETEVENTMSG is quite
> a small patch.

Not really, unless I missed something. See my another email, strace
doesn't even use the notification from do_fork(), not only it doesn't
read ->ptrace_message.

> However, as I said, that still doesn't "fix" strace so that it can
> be used to trace tasks in child pid namespaces.

Yes. Looks like, the necessary change in kernel is simple (and btw,
it is well known fact the reported pid is not ns-friendly). What
should be really fixed is strace/etc.

> Especially when the
> traced tasks are more than one namepace deeper. :(

Hmm... I guess, you mean setns() idea? Otherwise, I _think_ that the
deeper namepaces are fine wrt pid_nr.

> I don't think the kernel provides special out-of-band methods for fetching
> pids related to traced tasks except during fork and clone. Not wait*().

Could you clarify? I think wait*() is already namespace-friendly?

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 12:54             ` Matt Helsley
       [not found]               ` <20100326125459.GD3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-26 13:56               ` Oleg Nesterov
  1 sibling, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-03-26 13:56 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Grzegorz Nosek, Roland McGrath, Sukadev Bhattiprolu, containers,
	linux-kernel

On 03/26, Matt Helsley wrote:
>
> That said there are
> plenty of #ifdefs in strace and patching it to use GETEVENTMSG is quite
> a small patch.

Not really, unless I missed something. See my another email, strace
doesn't even use the notification from do_fork(), not only it doesn't
read ->ptrace_message.

> However, as I said, that still doesn't "fix" strace so that it can
> be used to trace tasks in child pid namespaces.

Yes. Looks like, the necessary change in kernel is simple (and btw,
it is well known fact the reported pid is not ns-friendly). What
should be really fixed is strace/etc.

> Especially when the
> traced tasks are more than one namepace deeper. :(

Hmm... I guess, you mean setns() idea? Otherwise, I _think_ that the
deeper namepaces are fine wrt pid_nr.

> I don't think the kernel provides special out-of-band methods for fetching
> pids related to traced tasks except during fork and clone. Not wait*().

Could you clarify? I think wait*() is already namespace-friendly?

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]               ` <20100326134709.GB15790-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-04-06  3:44                 ` Roland McGrath
  0 siblings, 0 replies; 39+ messages in thread
From: Roland McGrath @ 2010-04-06  3:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

(I've been away for a couple of weeks.)
I concur with the things Oleg's said in this thread.

As to what's "correct" for the kernel in theory, it would certainly make
sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
reporting any PID as such.  The wait and SIGCHLD code already does this, so
that would be consistent.  Off hand I don't see anything other than
tracehook_report_clone{,_complete}() that sees the wrong value now.

Fixing that requires a bit of hair.  The simple and clean approach is to
just have the tracehook calls (i.e. ptrace layer) extract the PID from the
task_struct using the desired pid_ns.  The trouble there is that the
tracehook_report_clone_complete() call is made when that task_struct is no
longer guaranteed to be valid.  The contrary approach of extracting the
appropriate value for the tracer earlier breaks the clean layering because
the fork.c code really should not know at all that ->parent->nsproxy is the
place to look for what values to pass to tracehook calls.  I guess the
simple and clean fix is to get_task_struct() before wake_up_new_task()
and put_task_struct() after tracehook_report_clone_complete().  That does
add some gratuitous atomic incr/decr overhead, though.

None of this has much of anything to do with strace, of course.  As I've
said, I don't see anything other than the PTRACE_GETEVENTMSG value for
PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
Oleg said, strace doesn't use that at all.  (This is not the place to
discuss the details of strace further.)


Thanks,
Roland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-03-26 13:47             ` Oleg Nesterov
       [not found]               ` <20100326134709.GB15790-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-04-06  3:44               ` Roland McGrath
  2010-04-06 13:53                 ` Matt Helsley
       [not found]                 ` <20100406034443.8B40ED477-nL1rrgvulkc2UH6IwYuUx0EOCMrvLtNR@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Roland McGrath @ 2010-04-06  3:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Grzegorz Nosek, Matt Helsley, Sukadev Bhattiprolu, containers,
	linux-kernel

(I've been away for a couple of weeks.)
I concur with the things Oleg's said in this thread.

As to what's "correct" for the kernel in theory, it would certainly make
sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
reporting any PID as such.  The wait and SIGCHLD code already does this, so
that would be consistent.  Off hand I don't see anything other than
tracehook_report_clone{,_complete}() that sees the wrong value now.

Fixing that requires a bit of hair.  The simple and clean approach is to
just have the tracehook calls (i.e. ptrace layer) extract the PID from the
task_struct using the desired pid_ns.  The trouble there is that the
tracehook_report_clone_complete() call is made when that task_struct is no
longer guaranteed to be valid.  The contrary approach of extracting the
appropriate value for the tracer earlier breaks the clean layering because
the fork.c code really should not know at all that ->parent->nsproxy is the
place to look for what values to pass to tracehook calls.  I guess the
simple and clean fix is to get_task_struct() before wake_up_new_task()
and put_task_struct() after tracehook_report_clone_complete().  That does
add some gratuitous atomic incr/decr overhead, though.

None of this has much of anything to do with strace, of course.  As I've
said, I don't see anything other than the PTRACE_GETEVENTMSG value for
PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
Oleg said, strace doesn't use that at all.  (This is not the place to
discuss the details of strace further.)


Thanks,
Roland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]                 ` <20100406034443.8B40ED477-nL1rrgvulkc2UH6IwYuUx0EOCMrvLtNR@public.gmane.org>
@ 2010-04-06 13:53                   ` Matt Helsley
  0 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-04-06 13:53 UTC (permalink / raw)
  To: Roland McGrath
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oleg Nesterov,
	Biederman Eric Biederman


On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
> (I've been away for a couple of weeks.)
> I concur with the things Oleg's said in this thread.
> 
> As to what's "correct" for the kernel in theory, it would certainly make
> sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
> reporting any PID as such.  The wait and SIGCHLD code already does this, so
> that would be consistent.  Off hand I don't see anything other than
> tracehook_report_clone{,_complete}() that sees the wrong value now.

Yup.

> Fixing that requires a bit of hair.  The simple and clean approach is to
> just have the tracehook calls (i.e. ptrace layer) extract the PID from the
> task_struct using the desired pid_ns.  The trouble there is that the

It's also possible to take an extra reference to the struct pid and pass
that to the tracehook. That and the pid_ns of the tracer receiving the pid
is all we'll ever need inside the tracehook layer. The only advantage, I
think, is we wouldn't pin the task struct while holding the pid reference.

> tracehook_report_clone_complete() call is made when that task_struct is no
> longer guaranteed to be valid.  The contrary approach of extracting the
> appropriate value for the tracer earlier breaks the clean layering because
> the fork.c code really should not know at all that ->parent->nsproxy is the
> place to look for what values to pass to tracehook calls.  I guess the
> simple and clean fix is to get_task_struct() before wake_up_new_task()
> and put_task_struct() after tracehook_report_clone_complete().  That does
> add some gratuitous atomic incr/decr overhead, though.

Also true.

Of course my suggestion of holding the pid reference won't avoid adding
atomic ops -- just changes which refcount they work on.

> 
> None of this has much of anything to do with strace, of course.  As I've
> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
> Oleg said, strace doesn't use that at all.  (This is not the place to
> discuss the details of strace further.)

Also, looking at proposed changes (utrace and Eric Biederman's setns())
storing a pid nr rather than a reference to a task struct or struct pid
probably won't be correct.

In the case of Eric Biederman's setns(), if capable of changing pid namespace,
we could have:

	Traced				Tracer
	fork()
					... (an arbitrary amount of time passes)
					setns()
					ptrace(GETEVENTMSG)

At which point returning a static pid number held in the message field
produces the wrong pid. Also, if utrace allows multiple tracers and they each
exist in a different namespace then storing a pid nr isn't going to work.

So my hunch is, in the long run, we'll need to hold a reference there and
drop it when the last tracer detaches or the next event would have
overwritten the "message". The amount of time it would need to be held
suggests to me that we should refer to a struct pid and not a task struct
if possible.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-04-06  3:44               ` Roland McGrath
@ 2010-04-06 13:53                 ` Matt Helsley
  2010-04-06 14:36                   ` Oleg Nesterov
                                     ` (2 more replies)
       [not found]                 ` <20100406034443.8B40ED477-nL1rrgvulkc2UH6IwYuUx0EOCMrvLtNR@public.gmane.org>
  1 sibling, 3 replies; 39+ messages in thread
From: Matt Helsley @ 2010-04-06 13:53 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Oleg Nesterov, Grzegorz Nosek, Matt Helsley, Sukadev Bhattiprolu,
	containers, linux-kernel, Biederman Eric Biederman


On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
> (I've been away for a couple of weeks.)
> I concur with the things Oleg's said in this thread.
> 
> As to what's "correct" for the kernel in theory, it would certainly make
> sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
> reporting any PID as such.  The wait and SIGCHLD code already does this, so
> that would be consistent.  Off hand I don't see anything other than
> tracehook_report_clone{,_complete}() that sees the wrong value now.

Yup.

> Fixing that requires a bit of hair.  The simple and clean approach is to
> just have the tracehook calls (i.e. ptrace layer) extract the PID from the
> task_struct using the desired pid_ns.  The trouble there is that the

It's also possible to take an extra reference to the struct pid and pass
that to the tracehook. That and the pid_ns of the tracer receiving the pid
is all we'll ever need inside the tracehook layer. The only advantage, I
think, is we wouldn't pin the task struct while holding the pid reference.

> tracehook_report_clone_complete() call is made when that task_struct is no
> longer guaranteed to be valid.  The contrary approach of extracting the
> appropriate value for the tracer earlier breaks the clean layering because
> the fork.c code really should not know at all that ->parent->nsproxy is the
> place to look for what values to pass to tracehook calls.  I guess the
> simple and clean fix is to get_task_struct() before wake_up_new_task()
> and put_task_struct() after tracehook_report_clone_complete().  That does
> add some gratuitous atomic incr/decr overhead, though.

Also true.

Of course my suggestion of holding the pid reference won't avoid adding
atomic ops -- just changes which refcount they work on.

> 
> None of this has much of anything to do with strace, of course.  As I've
> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
> Oleg said, strace doesn't use that at all.  (This is not the place to
> discuss the details of strace further.)

Also, looking at proposed changes (utrace and Eric Biederman's setns())
storing a pid nr rather than a reference to a task struct or struct pid
probably won't be correct.

In the case of Eric Biederman's setns(), if capable of changing pid namespace,
we could have:

	Traced				Tracer
	fork()
					... (an arbitrary amount of time passes)
					setns()
					ptrace(GETEVENTMSG)

At which point returning a static pid number held in the message field
produces the wrong pid. Also, if utrace allows multiple tracers and they each
exist in a different namespace then storing a pid nr isn't going to work.

So my hunch is, in the long run, we'll need to hold a reference there and
drop it when the last tracer detaches or the next event would have
overwritten the "message". The amount of time it would need to be held
suggests to me that we should refer to a struct pid and not a task struct
if possible.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]                   ` <20100406135345.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-04-06 14:36                     ` Oleg Nesterov
  2010-04-06 15:13                     ` Eric W. Biederman
  1 sibling, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-04-06 14:36 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Biederman Eric Biederman,
	Roland McGrath

On 04/06, Matt Helsley wrote:
>
> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>
> > tracehook_report_clone_complete() call is made when that task_struct is no
> > longer guaranteed to be valid.

Hmm. I missed this.

> Also, if utrace allows multiple tracers and they each
> exist in a different namespace then storing a pid nr isn't going to work.

Yes, but utrace is simple. ptrace_report_clone() does

	ctx->eventmsg = child->pid;

we should fix this line and that is all, afaics. Every tracer has a
separate "struct ptrace_context *ctx".

> So my hunch is, in the long run, we'll need to hold a reference there and
> drop it when the last tracer detaches

Without utrace only one tracer is possible.

So, I think we should either change do_fork() to get the right tracee_pid_nr,
or add get/put into do_fork() and change tracehooks as Roland suggested.

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-04-06 13:53                 ` Matt Helsley
@ 2010-04-06 14:36                   ` Oleg Nesterov
  2010-04-06 15:17                     ` Eric W. Biederman
       [not found]                     ` <20100406143635.GA12315-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2010-04-06 15:13                   ` Eric W. Biederman
       [not found]                   ` <20100406135345.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2 siblings, 2 replies; 39+ messages in thread
From: Oleg Nesterov @ 2010-04-06 14:36 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Roland McGrath, Grzegorz Nosek, Sukadev Bhattiprolu, containers,
	linux-kernel, Biederman Eric Biederman

On 04/06, Matt Helsley wrote:
>
> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>
> > tracehook_report_clone_complete() call is made when that task_struct is no
> > longer guaranteed to be valid.

Hmm. I missed this.

> Also, if utrace allows multiple tracers and they each
> exist in a different namespace then storing a pid nr isn't going to work.

Yes, but utrace is simple. ptrace_report_clone() does

	ctx->eventmsg = child->pid;

we should fix this line and that is all, afaics. Every tracer has a
separate "struct ptrace_context *ctx".

> So my hunch is, in the long run, we'll need to hold a reference there and
> drop it when the last tracer detaches

Without utrace only one tracer is possible.

So, I think we should either change do_fork() to get the right tracee_pid_nr,
or add get/put into do_fork() and change tracehooks as Roland suggested.

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]                   ` <20100406135345.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-04-06 14:36                     ` Oleg Nesterov
@ 2010-04-06 15:13                     ` Eric W. Biederman
  1 sibling, 0 replies; 39+ messages in thread
From: Eric W. Biederman @ 2010-04-06 15:13 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Roland McGrath

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>> (I've been away for a couple of weeks.)
>> I concur with the things Oleg's said in this thread.
>> 
>> As to what's "correct" for the kernel in theory, it would certainly make
>> sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
>> reporting any PID as such.  The wait and SIGCHLD code already does this, so
>> that would be consistent.  Off hand I don't see anything other than
>> tracehook_report_clone{,_complete}() that sees the wrong value now.
>
> Yup.
>
>> Fixing that requires a bit of hair.  The simple and clean approach is to
>> just have the tracehook calls (i.e. ptrace layer) extract the PID from the
>> task_struct using the desired pid_ns.  The trouble there is that the
>
> It's also possible to take an extra reference to the struct pid and pass
> that to the tracehook. That and the pid_ns of the tracer receiving the pid
> is all we'll ever need inside the tracehook layer. The only advantage, I
> think, is we wouldn't pin the task struct while holding the pid reference.
>
>> tracehook_report_clone_complete() call is made when that task_struct is no
>> longer guaranteed to be valid.  The contrary approach of extracting the
>> appropriate value for the tracer earlier breaks the clean layering because
>> the fork.c code really should not know at all that ->parent->nsproxy is the
>> place to look for what values to pass to tracehook calls.  I guess the
>> simple and clean fix is to get_task_struct() before wake_up_new_task()
>> and put_task_struct() after tracehook_report_clone_complete().  That does
>> add some gratuitous atomic incr/decr overhead, though.
>
> Also true.
>
> Of course my suggestion of holding the pid reference won't avoid adding
> atomic ops -- just changes which refcount they work on.
>
>> 
>> None of this has much of anything to do with strace, of course.  As I've
>> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
>> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
>> Oleg said, strace doesn't use that at all.  (This is not the place to
>> discuss the details of strace further.)
>
> Also, looking at proposed changes (utrace and Eric Biederman's setns())
> storing a pid nr rather than a reference to a task struct or struct pid
> probably won't be correct.

My setns work has demonstrated that even for entering a namespace we
never ever need to change the struct pid of a task.  setns has no other
bearing on this problem then to say there is no foreseeable reason to
change the rules.

> In the case of Eric Biederman's setns(), if capable of changing pid namespace,
> we could have:
>
> 	Traced				Tracer
> 	fork()
> 					... (an arbitrary amount of time passes)
> 					setns()
> 					ptrace(GETEVENTMSG)

Forget that.  The pid namespace was architected so that we can ptrace a process
in another pid namespace.

> At which point returning a static pid number held in the message field
> produces the wrong pid.

No.  A processes always sees pids from the context of it's original pid
namespace.  All setns does is affect which pid namespace children will
be native in.


Eric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-04-06 13:53                 ` Matt Helsley
  2010-04-06 14:36                   ` Oleg Nesterov
@ 2010-04-06 15:13                   ` Eric W. Biederman
       [not found]                     ` <m1vdc48vhy.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-04-06 15:29                     ` Matt Helsley
       [not found]                   ` <20100406135345.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2 siblings, 2 replies; 39+ messages in thread
From: Eric W. Biederman @ 2010-04-06 15:13 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Roland McGrath, Oleg Nesterov, Grzegorz Nosek,
	Sukadev Bhattiprolu, containers, linux-kernel

Matt Helsley <matthltc@us.ibm.com> writes:

> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>> (I've been away for a couple of weeks.)
>> I concur with the things Oleg's said in this thread.
>> 
>> As to what's "correct" for the kernel in theory, it would certainly make
>> sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
>> reporting any PID as such.  The wait and SIGCHLD code already does this, so
>> that would be consistent.  Off hand I don't see anything other than
>> tracehook_report_clone{,_complete}() that sees the wrong value now.
>
> Yup.
>
>> Fixing that requires a bit of hair.  The simple and clean approach is to
>> just have the tracehook calls (i.e. ptrace layer) extract the PID from the
>> task_struct using the desired pid_ns.  The trouble there is that the
>
> It's also possible to take an extra reference to the struct pid and pass
> that to the tracehook. That and the pid_ns of the tracer receiving the pid
> is all we'll ever need inside the tracehook layer. The only advantage, I
> think, is we wouldn't pin the task struct while holding the pid reference.
>
>> tracehook_report_clone_complete() call is made when that task_struct is no
>> longer guaranteed to be valid.  The contrary approach of extracting the
>> appropriate value for the tracer earlier breaks the clean layering because
>> the fork.c code really should not know at all that ->parent->nsproxy is the
>> place to look for what values to pass to tracehook calls.  I guess the
>> simple and clean fix is to get_task_struct() before wake_up_new_task()
>> and put_task_struct() after tracehook_report_clone_complete().  That does
>> add some gratuitous atomic incr/decr overhead, though.
>
> Also true.
>
> Of course my suggestion of holding the pid reference won't avoid adding
> atomic ops -- just changes which refcount they work on.
>
>> 
>> None of this has much of anything to do with strace, of course.  As I've
>> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
>> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
>> Oleg said, strace doesn't use that at all.  (This is not the place to
>> discuss the details of strace further.)
>
> Also, looking at proposed changes (utrace and Eric Biederman's setns())
> storing a pid nr rather than a reference to a task struct or struct pid
> probably won't be correct.

My setns work has demonstrated that even for entering a namespace we
never ever need to change the struct pid of a task.  setns has no other
bearing on this problem then to say there is no foreseeable reason to
change the rules.

> In the case of Eric Biederman's setns(), if capable of changing pid namespace,
> we could have:
>
> 	Traced				Tracer
> 	fork()
> 					... (an arbitrary amount of time passes)
> 					setns()
> 					ptrace(GETEVENTMSG)

Forget that.  The pid namespace was architected so that we can ptrace a process
in another pid namespace.

> At which point returning a static pid number held in the message field
> produces the wrong pid.

No.  A processes always sees pids from the context of it's original pid
namespace.  All setns does is affect which pid namespace children will
be native in.


Eric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]                     ` <20100406143635.GA12315-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-04-06 15:17                       ` Eric W. Biederman
  0 siblings, 0 replies; 39+ messages in thread
From: Eric W. Biederman @ 2010-04-06 15:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roland McGrath

Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On 04/06, Matt Helsley wrote:
>>
>> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>>
>> > tracehook_report_clone_complete() call is made when that task_struct is no
>> > longer guaranteed to be valid.
>
> Hmm. I missed this.
>
>> Also, if utrace allows multiple tracers and they each
>> exist in a different namespace then storing a pid nr isn't going to work.
>
> Yes, but utrace is simple. ptrace_report_clone() does
>
> 	ctx->eventmsg = child->pid;
>
> we should fix this line and that is all, afaics. Every tracer has a
> separate "struct ptrace_context *ctx".
>
>> So my hunch is, in the long run, we'll need to hold a reference there and
>> drop it when the last tracer detaches
>
> Without utrace only one tracer is possible.
>
> So, I think we should either change do_fork() to get the right tracee_pid_nr,
> or add get/put into do_fork() and change tracehooks as Roland suggested.

For a unicast path where the is no danger of the destination process changing
I don't see why we can't compute the userspace pid_nr.  It only get's tricky
for things like broadcast signals (pgrp, session) when we don't who the
final recipient process will be.

I think that only gets truly bad in the case of unix domain sockets.

Eric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-04-06 14:36                   ` Oleg Nesterov
@ 2010-04-06 15:17                     ` Eric W. Biederman
       [not found]                     ` <20100406143635.GA12315-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Eric W. Biederman @ 2010-04-06 15:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Matt Helsley, Roland McGrath, Grzegorz Nosek,
	Sukadev Bhattiprolu, containers, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/06, Matt Helsley wrote:
>>
>> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>>
>> > tracehook_report_clone_complete() call is made when that task_struct is no
>> > longer guaranteed to be valid.
>
> Hmm. I missed this.
>
>> Also, if utrace allows multiple tracers and they each
>> exist in a different namespace then storing a pid nr isn't going to work.
>
> Yes, but utrace is simple. ptrace_report_clone() does
>
> 	ctx->eventmsg = child->pid;
>
> we should fix this line and that is all, afaics. Every tracer has a
> separate "struct ptrace_context *ctx".
>
>> So my hunch is, in the long run, we'll need to hold a reference there and
>> drop it when the last tracer detaches
>
> Without utrace only one tracer is possible.
>
> So, I think we should either change do_fork() to get the right tracee_pid_nr,
> or add get/put into do_fork() and change tracehooks as Roland suggested.

For a unicast path where the is no danger of the destination process changing
I don't see why we can't compute the userspace pid_nr.  It only get's tricky
for things like broadcast signals (pgrp, session) when we don't who the
final recipient process will be.

I think that only gets truly bad in the case of unix domain sockets.

Eric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
       [not found]                     ` <m1vdc48vhy.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-04-06 15:29                       ` Matt Helsley
  0 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-04-06 15:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Roland McGrath

On Tue, Apr 06, 2010 at 08:13:13AM -0700, Eric W. Biederman wrote:
> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:

<snip>

> >> None of this has much of anything to do with strace, of course.  As I've
> >> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
> >> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
> >> Oleg said, strace doesn't use that at all.  (This is not the place to
> >> discuss the details of strace further.)
> >
> > Also, looking at proposed changes (utrace and Eric Biederman's setns())
> > storing a pid nr rather than a reference to a task struct or struct pid
> > probably won't be correct.
> 
> My setns work has demonstrated that even for entering a namespace we
> never ever need to change the struct pid of a task.  setns has no other
> bearing on this problem then to say there is no foreseeable reason to
> change the rules.
> 
> > In the case of Eric Biederman's setns(), if capable of changing pid namespace,
> > we could have:
> >
> > 	Traced				Tracer
> > 	fork()
> > 					... (an arbitrary amount of time passes)
> > 					setns()
> > 					ptrace(GETEVENTMSG)
> 
> Forget that.  The pid namespace was architected so that we can ptrace a process
> in another pid namespace.
> 
> > At which point returning a static pid number held in the message field
> > produces the wrong pid.
> 
> No.  A processes always sees pids from the context of it's original pid
> namespace.  All setns does is affect which pid namespace children will
> be native in.

OK, good. So we can resolve the tasks/struct pids within the tracehook
and be done with it. Thanks Eric!

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Testing lxc 0.6.5 in Fedora 13
  2010-04-06 15:13                   ` Eric W. Biederman
       [not found]                     ` <m1vdc48vhy.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-04-06 15:29                     ` Matt Helsley
  1 sibling, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-04-06 15:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Roland McGrath, Oleg Nesterov, Grzegorz Nosek,
	Sukadev Bhattiprolu, containers, linux-kernel

On Tue, Apr 06, 2010 at 08:13:13AM -0700, Eric W. Biederman wrote:
> Matt Helsley <matthltc@us.ibm.com> writes:
> 
> > On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:

<snip>

> >> None of this has much of anything to do with strace, of course.  As I've
> >> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
> >> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
> >> Oleg said, strace doesn't use that at all.  (This is not the place to
> >> discuss the details of strace further.)
> >
> > Also, looking at proposed changes (utrace and Eric Biederman's setns())
> > storing a pid nr rather than a reference to a task struct or struct pid
> > probably won't be correct.
> 
> My setns work has demonstrated that even for entering a namespace we
> never ever need to change the struct pid of a task.  setns has no other
> bearing on this problem then to say there is no foreseeable reason to
> change the rules.
> 
> > In the case of Eric Biederman's setns(), if capable of changing pid namespace,
> > we could have:
> >
> > 	Traced				Tracer
> > 	fork()
> > 					... (an arbitrary amount of time passes)
> > 					setns()
> > 					ptrace(GETEVENTMSG)
> 
> Forget that.  The pid namespace was architected so that we can ptrace a process
> in another pid namespace.
> 
> > At which point returning a static pid number held in the message field
> > produces the wrong pid.
> 
> No.  A processes always sees pids from the context of it's original pid
> namespace.  All setns does is affect which pid namespace children will
> be native in.

OK, good. So we can resolve the tasks/struct pids within the tracehook
and be done with it. Thanks Eric!

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2010-04-06 15:30 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-21 19:50 Testing lxc 0.6.5 in Fedora 13 Grzegorz Nosek
     [not found] ` <20100321195044.GA23757-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
2010-03-23 21:28   ` Matt Helsley
2010-03-23 21:28 ` Matt Helsley
2010-03-24  9:25   ` Greg Kurz
     [not found]   ` <20100323212834.GH20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-24  9:25     ` Greg Kurz
2010-03-25 21:33     ` Grzegorz Nosek
2010-03-25 21:33       ` Grzegorz Nosek
2010-03-26 11:11       ` Oleg Nesterov
2010-03-26 11:32         ` Grzegorz Nosek
     [not found]           ` <20100326113201.GB17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
2010-03-26 12:00             ` Oleg Nesterov
2010-03-26 12:00           ` Oleg Nesterov
     [not found]             ` <20100326120028.GA11311-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-03-26 12:46               ` Matt Helsley
2010-03-26 12:46             ` Matt Helsley
2010-03-26 13:34               ` Oleg Nesterov
     [not found]               ` <20100326124619.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-26 13:34                 ` Oleg Nesterov
2010-03-26 11:53         ` Matt Helsley
2010-03-26 12:45           ` Grzegorz Nosek
     [not found]             ` <20100326124522.GD17113-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
2010-03-26 12:54               ` Matt Helsley
2010-03-26 13:47               ` Oleg Nesterov
2010-03-26 12:54             ` Matt Helsley
     [not found]               ` <20100326125459.GD3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-26 13:56                 ` Oleg Nesterov
2010-03-26 13:56               ` Oleg Nesterov
2010-03-26 13:47             ` Oleg Nesterov
     [not found]               ` <20100326134709.GB15790-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-04-06  3:44                 ` Roland McGrath
2010-04-06  3:44               ` Roland McGrath
2010-04-06 13:53                 ` Matt Helsley
2010-04-06 14:36                   ` Oleg Nesterov
2010-04-06 15:17                     ` Eric W. Biederman
     [not found]                     ` <20100406143635.GA12315-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-04-06 15:17                       ` Eric W. Biederman
2010-04-06 15:13                   ` Eric W. Biederman
     [not found]                     ` <m1vdc48vhy.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-04-06 15:29                       ` Matt Helsley
2010-04-06 15:29                     ` Matt Helsley
     [not found]                   ` <20100406135345.GC3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-04-06 14:36                     ` Oleg Nesterov
2010-04-06 15:13                     ` Eric W. Biederman
     [not found]                 ` <20100406034443.8B40ED477-nL1rrgvulkc2UH6IwYuUx0EOCMrvLtNR@public.gmane.org>
2010-04-06 13:53                   ` Matt Helsley
     [not found]           ` <20100326115357.GA3345-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-26 12:45             ` Grzegorz Nosek
     [not found]         ` <20100326111131.GA8604-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-03-26 11:32           ` Grzegorz Nosek
2010-03-26 11:53           ` Matt Helsley
     [not found]       ` <20100325213356.GB20541-yp6mvK3Bdd2rDJvtcaxF/A@public.gmane.org>
2010-03-26 11:11         ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.