* Ptrace documentation, draft #3
@ 2011-05-20 19:23 Denys Vlasenko
  2011-05-25 14:32 ` Tejun Heo
  2011-05-30 13:35 ` Ptrace documentation, draft #3 Oleg Nesterov
  0 siblings, 2 replies; 17+ messages in thread
From: Denys Vlasenko @ 2011-05-20 19:23 UTC (permalink / raw)
  To: Tejun Heo, jan.kratochvil, oleg; +Cc: linux-kernel, torvalds, akpm, indan

Ptrace discussions repeatedly display a higher than average amount
of misunderstanding and confusion. New ptrace users and even people
who already worked with it are repeatedly confused by details
which are not documented anywhere and knowledge about which exists
mostly in the brains of strace/gdb/other_such_tools developers.

This document is meant as a brain dump of this knowledge.
It assumes that the reader has a basic understanding of what ptrace is.

Since draft no. 2, I added/changed some info:

* more GETSIGINFO information
* extended section about execve
* five fewer "???" remain (16 -> 11)

======================================================================
======================================================================
======================================================================
		Ptrace

Ptrace API (ab)uses standard Unix parent/child signaling over waitpid.
An unfortunate effect of it is that resulting API is complex and has
subtle quirks. This document aims to describe these quirks.

It is split into two parts. The first part focuses exclusively on
userspace-visible API and behavior. The second part describes kernel
internals of ptrace.



		1. Userspace API.

(Note to editors: in this section, do not use kernel concepts and terms
which are not observable through userspace API and user-visible
behavior. Use section 2 for that.)

Debugged processes (tracees) first need to be attached to the debugging
process (tracer). Attachment and subsequent commands are per-thread: in
multi-threaded process, every thread can be individually attached to a
(potentially different) tracer, or left not attached and thus not
debugged. Therefore, "tracee" always means "(one) thread", never "a
(possibly multi-threaded) process". Ptrace commands are always sent to
a specific tracee using ptrace(PTRACE_foo, pid, ...), where pid is a
TID of the corresponding Linux thread.

After attachment, each tracee is in one of two states: running or
stopped.

There are many kinds of states when tracee is stopped, and in ptrace
discussions they are often conflated. Therefore, it is important to use
precise terms.

In this document, any stopped state in which tracee is ready to accept
ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can
be further subdivided into signal-delivery-stop, group-stop,
syscall-stop and so on. They are described in detail later.


	1.x Death under ptrace.

When a (possibly multi-threaded) process receives a killing signal (a
signal set to SIG_DFL and whose default action is to kill the process),
all threads exit. Tracees report their death to the tracer(s). This is
not a ptrace-stop (because tracer can't query tracee status such as
register contents, cannot restart tracee etc) but the notification
about this event is delivered through waitpid API similarly to
ptrace-stop.

Note that killing signal will first cause signal-delivery-stop (on one
tracee only), and only after it is injected by tracer (or after it was
dispatched to a thread which isn't traced), death from signal will
happen on ALL tracees within multi-threaded process.

SIGKILL operates similarly, with exceptions. No signal-delivery-stop is
generated for SIGKILL and therefore tracer can't suppress it. SIGKILL
kills even within syscalls (syscall-exit-stop is not generated prior to
death by SIGKILL). The net effect is that SIGKILL always kills the
process (all its threads), even if some threads of the process are
ptraced.

Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This
operation is deprecated; use kill/tgkill(SIGKILL) instead.

^^^ Oleg prefers to deprecate it instead of describing (and needing to
support) PTRACE_KILL's quirks.

When tracee executes exit syscall, it reports its death to its tracer.
Other threads are not affected.

When any thread executes exit_group syscall, every tracee in its thread
group reports its death to its tracer.

If PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen
before actual death. This applies to both normal exits and signal
deaths (except SIGKILL).

KNOWN BUG: PTRACE_EVENT_EXIT should happen for every tracee in thread
group on exit_group or signal death, but currently (~2.6.38) this is
buggy: some of these stops may be missed.

Tracer cannot assume that ptrace-stopped tracee exists. There are many
scenarios when tracee may die while stopped (such as SIGKILL). There
are cases where tracee disappears without reporting death (such as
execve in multi-threaded process). Therefore, tracer must always be
prepared to handle ESRCH error on any ptrace operation. Unfortunately,
the same error is returned if tracee exists but is not ptrace-stopped
(for commands which require stopped tracee). Tracer needs to keep track
of stopped/running state, and interpret ESRCH as "tracee died
unexpectedly" only if it knows that tracee has been observed to enter
ptrace-stop.
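
The ESRCH rule above can be sketched as follows. This is only an
illustration: struct tracee, its in_ptrace_stop flag and the enum names
are hypothetical bookkeeping kept by the tracer, not part of any API.
The tracer is assumed to set the flag when waitpid reports a stop and
to clear it when it issues a restarting command.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-tracee bookkeeping kept by the tracer. */
struct tracee {
	int tid;
	bool in_ptrace_stop;	/* true between an observed stop and a restart */
};

enum ptrace_err_meaning {
	ERR_OTHER,		/* errno was not ESRCH */
	ERR_TRACEE_GONE,	/* tracee died or vanished while stopped */
	ERR_NOT_STOPPED		/* tracee may exist, but is not in ptrace-stop */
};

/* Interpret errno from a failed ptrace op using the tracked state. */
enum ptrace_err_meaning interpret_ptrace_errno(const struct tracee *t, int err)
{
	if (err != ESRCH)
		return ERR_OTHER;
	return t->in_ptrace_stop ? ERR_TRACEE_GONE : ERR_NOT_STOPPED;
}
```

A tracer would call this right after a failed ptrace(), passing the
saved errno value.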

There is no guarantee that waitpid(WNOHANG) will reliably report
tracee's death status if ptrace operation returned ESRCH.
waitpid(WNOHANG) may return 0 instead. IOW: tracee may be "not yet
fully dead" but already refusing ptrace ops.

Tracer cannot assume that tracee ALWAYS ends its life by reporting
WIFEXITED(status) or WIFSIGNALED(status). One notable case is execve in
multi-threaded process, which is described later.


	1.x Stopped states.

When running tracee enters ptrace-stop, it notifies its tracer using
waitpid API. Tracer should use waitpid family of syscalls to wait for
tracee to stop. Most of this document assumes that tracer waits with:
	pid = waitpid(pid_or_minus_1, &status, __WALL);
Ptrace-stopped tracees are reported as returns with pid > 0 and
WIFSTOPPED(status) == true.

??? any pitfalls with WNOHANG (I remember that there are bugs in this
    area)? effects of WSTOPPED, WEXITED, WCONTINUED bits? Are they ok?
    waitid usage? WNOWAIT?


	1.x.x Signal-delivery-stop

When (possibly multi-threaded) process receives any signal except
SIGKILL, kernel selects a thread which handles the signal (if signal is
generated with tgkill, thread selection is done by user). If selected
thread is traced, it enters signal-delivery-stop. By this point, signal
is not yet delivered to the process, and can be suppressed by tracer.
If tracer doesn't suppress the signal, it passes signal to tracee in
the next ptrace request. This is called "signal injection" and will be
described later. Note that if signal is blocked, signal-delivery-stop
doesn't happen until signal is unblocked, with the usual exception that
SIGSTOP can't be blocked.

Signal-delivery-stop is observed by tracer as waitpid returning with
WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If
WSTOPSIG(status) == SIGTRAP, this may be a different kind of
ptrace-stop - see "Syscall-stops" and "execve" sections below for
details. If WSTOPSIG(status) == stopping signal, this may be a
group-stop - see below.


	1.x.x Signal injection and suppression.

After signal-delivery-stop is observed by tracer, tracer should restart
tracee with
	ptrace(PTRACE_rest, pid, 0, sig)
call, where PTRACE_rest is one of the restarting ptrace ops. If sig is
0, then signal is not delivered. Otherwise, signal sig is delivered.
This operation is called "signal injection", to distinguish it from
signal delivery which causes signal-delivery-stop.

Note that sig value may be different from WSTOPSIG(status) value -
tracer can cause a different signal to be injected.

Note that suppressed signal still causes syscalls to return
prematurely. Restartable syscalls will be restarted (tracer will
observe tracee to execute restart_syscall(2) syscall if tracer uses
PTRACE_SYSCALL), non-restartable syscalls (for example, nanosleep) may
return with -EINTR even though no observable signal is injected to the
tracee.

Note that restarting ptrace commands issued in ptrace-stops other than
signal-delivery-stop are not guaranteed to inject a signal, even if sig
is nonzero. No error is reported, nonzero sig may simply be ignored.
Ptrace users should not try to "create new signal" this way: use
tgkill(2) instead.

This is a cause of confusion among ptrace users. One typical scenario
is that tracer observes group-stop, mistakes it for
signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0,
stopsig) with the intention of injecting stopsig, but stopsig gets
ignored and tracee continues to run.

SIGCONT signal has a side effect of waking up (all threads of)
group-stopped process. This side effect happens before
signal-delivery-stop. Tracer can't suppress this side-effect (it can
only suppress signal injection, which only causes SIGCONT handler to
not be executed in the tracee, if such handler is installed). In fact,
waking up from group-stop may be followed by signal-delivery-stop for
signal(s) *other than* SIGCONT, if they were pending when SIGCONT was
delivered. IOW: SIGCONT may not be the first signal observed by the
tracee after it was sent.

Stopping signals cause (all threads of) process to enter group-stop.
This side effect happens after signal injection, and therefore can be
suppressed by tracer.

PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which
corresponds to delivered signal. PTRACE_SETSIGINFO may be used to
modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t,
si_signo field and sig parameter in restarting command must match.


	1.x.x Group-stop

When a (possibly multi-threaded) process receives a stopping signal,
all threads stop. If some threads are traced, they enter a group-stop.
Note that stopping signal will first cause signal-delivery-stop (on one
tracee only), and only after it is injected by tracer (or after it was
dispatched to a thread which isn't traced), group-stop will be
initiated on ALL tracees within multi-threaded process. As usual, every
tracee reports its group-stop to corresponding tracer.

Group-stop is observed by tracer as waitpid returning with
WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result
is returned by some other classes of ptrace-stops, therefore the
recommended practice is to perform
	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP,
SIGTTIN or SIGTTOU - only these four signals are stopping signals. If
tracer sees something else, it can't be group-stop. Otherwise, tracer
needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails, then it is
definitely a group-stop.

As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it
restarts or kills it, tracee will not run, and will not send
notifications (except SIGKILL death) to tracer, even if tracer enters
into another waitpid call.

Currently, it causes a problem with transparent handling of stopping
signals: if tracer restarts tracee after group-stop, SIGSTOP is
effectively ignored: tracee doesn't remain stopped, it runs. If tracer
doesn't restart tracee before entering into next waitpid, future
SIGCONT will not be reported to the tracer, which would make SIGCONT
have no effect.


	1.x.x PTRACE_EVENT stops

If tracer sets PTRACE_O_TRACEfoo options, tracee will enter ptrace-stops
called PTRACE_EVENT stops.

PTRACE_EVENT stops are observed by tracer as waitpid returning with
WIFSTOPPED(status) == true, WSTOPSIG(status) == SIGTRAP. Additional bit
is set in a higher byte of status word: value ((status >> 8) & 0xffff)
will be (SIGTRAP | PTRACE_EVENT_foo << 8). The following events exist:

PTRACE_EVENT_VFORK - stop before return from vfork/clone+CLONE_VFORK.
When tracee is continued after this, it will wait for child to
exit/exec before continuing its execution (IOW: usual behavior on
vfork).

PTRACE_EVENT_FORK - stop before return from fork/clone+SIGCHLD

PTRACE_EVENT_CLONE - stop before return from clone

PTRACE_EVENT_VFORK_DONE - stop before return from
vfork/clone+CLONE_VFORK, but after vfork child unblocked this tracee by
exiting or exec'ing.

For all four stops described above: stop occurs in parent, not in newly
created thread. PTRACE_GETEVENTMSG can be used to retrieve new thread's
tid.

PTRACE_EVENT_EXEC - stop before return from exec.

PTRACE_EVENT_EXIT - stop before exit. PTRACE_GETEVENTMSG returns exit
status. Registers can be examined (unlike when "real" exit happens).
The tracee is still alive, it needs to be PTRACE_CONTed to finish exit.

PTRACE_GETSIGINFO on PTRACE_EVENT stops returns si_signo = SIGTRAP,
si_code = (event << 8) | SIGTRAP.
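
Given the status layout described in this section, the event number
can be extracted as sketched below. The function name is made up for
this example; the PTRACE_EVENT_foo constants come from <sys/ptrace.h>.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Returns the PTRACE_EVENT_foo number if status describes a
 * PTRACE_EVENT stop, or 0 otherwise.
 */
int ptrace_event_of(int status)
{
	if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
		return 0;
	/* (status >> 8) == (SIGTRAP | event << 8), so event is bits 16+ */
	return (status >> 16) & 0xffff;
}
```

A plain SIGTRAP signal-delivery-stop has no event bits set, so it
yields 0 here.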


	1.x.x Syscall-stops

If tracee was restarted by PTRACE_SYSCALL, tracee enters
syscall-enter-stop just prior to entering any syscall. If tracer
restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when
syscall is finished, or if it is interrupted by a signal. (That is,
signal-delivery-stop never happens between syscall-enter-stop and
syscall-exit-stop, it happens *after* syscall-exit-stop).

Other possibilities are that tracee may stop in a PTRACE_EVENT stop,
exit (if it entered exit or exit_group syscall), be killed by SIGKILL,
or die silently (if execve syscall happened in another thread).

Syscall-enter-stop and syscall-exit-stop are observed by tracer as
waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) ==
SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then
WSTOPSIG(status) == (SIGTRAP | 0x80).

There is no portable way to distinguish them from signal-delivery-stop
with SIGTRAP. Some architectures allow distinguishing them by examining
registers. For example, on x86 rax = -ENOSYS in syscall-enter-stop.
Since SIGTRAP (like any other signal) always happens *after*
syscall-exit-stop, and at this point rax almost never contains -ENOSYS,
SIGTRAP looks like "syscall-stop which is not syscall-enter-stop", IOW:
it looks like a "stray syscall-exit-stop" and can be detected this way.
But such detection is fragile and is best avoided. Using
PTRACE_O_TRACESYSGOOD option is a recommended method.
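
With PTRACE_O_TRACESYSGOOD set, the test for a syscall-stop reduces to
a comparison on the status word. A sketch (the function name is
hypothetical, and it assumes the option was set on this tracee):

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/* True if status is a syscall-enter-stop or syscall-exit-stop. */
bool is_syscall_stop(int status)
{
	return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```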

??? can be distinguished by PTRACE_GETSIGINFO, si_code <= 0 if sent by
usual suspects like [t]kill, sigqueue; or = SI_KERNEL (0x80) if sent by
kernel, whereas syscall-stops have si_code = SIGTRAP or (SIGTRAP |
0x80). Right? Should this be documented?

Syscall-enter-stop and syscall-exit-stop are indistinguishable from
each other by tracer. Tracer needs to keep track of the sequence of
ptrace-stops in order to not misinterpret syscall-enter-stop as
syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is
always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's
death - no other kinds of ptrace-stop can occur in between.

If after syscall-enter-stop tracer uses restarting command other than
PTRACE_SYSCALL, syscall-exit-stop is not generated.
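
Since the two stops look identical, the tracer has to keep one bit of
state per tracee. A minimal sketch of such tracking, with hypothetical
names; it assumes the tracee was not already inside a syscall when
tracking started:

```c
#include <stdbool.h>

/* Hypothetical per-tracee state: are we between enter and exit? */
struct syscall_tracker {
	bool in_syscall;
};

enum syscall_phase { SYSCALL_ENTER, SYSCALL_EXIT };

/* Call on every observed syscall-stop; tells which one it was. */
enum syscall_phase on_syscall_stop(struct syscall_tracker *t)
{
	t->in_syscall = !t->in_syscall;
	return t->in_syscall ? SYSCALL_ENTER : SYSCALL_EXIT;
}

/*
 * Call when restarting the tracee. If a command other than
 * PTRACE_SYSCALL is used, no syscall-exit-stop will follow,
 * so the next syscall-stop must be treated as an enter-stop.
 */
void on_restart(struct syscall_tracker *t, bool with_ptrace_syscall)
{
	if (!with_ptrace_syscall)
		t->in_syscall = false;
}
```

The state also has to be dropped when the tracee dies; that part is
omitted here.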

PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code
= SIGTRAP or (SIGTRAP | 0x80).


	1.x.x SINGLESTEP, SYSEMU, SYSEMU_SINGLESTEP

??? document PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP


	1.x Informational and restarting ptrace commands.

Most ptrace commands (all except ATTACH, TRACEME, KILL) require tracee
to be in ptrace-stop, otherwise they fail with ESRCH.

When tracee is in ptrace-stop, tracer can read and write data to tracee
using informational commands. They leave tracee in ptrace-stopped state:

longv = ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
	ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
	ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
	ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
	ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
	ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
	ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

Note that some errors are not reported. For example, setting siginfo
may have no effect in some ptrace-stops, yet the call may succeed
(return 0 and don't set errno).

ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags) affects one tracee.
Current flags are replaced. Flags are inherited by new tracees created
and "auto-attached" via active PTRACE_O_TRACE[V]FORK or
PTRACE_O_TRACECLONE options.

Another group of commands makes ptrace-stopped tracee run. They have
the form:
	ptrace(PTRACE_cmd, pid, 0, sig);
where cmd is CONT, DETACH, SYSCALL, SINGLESTEP, SYSEMU, or
SYSEMU_SINGLESTEP. If tracee is in signal-delivery-stop, sig is the
signal to be injected. Otherwise, sig may be ignored.


	1.x Attaching and detaching

A thread can be attached to tracer using ptrace(PTRACE_ATTACH, pid, 0,
0) call. This also sends SIGSTOP to this thread. If tracer wants this
SIGSTOP to have no effect, it needs to suppress it. Note that if other
signals are concurrently sent to this thread during attach, tracer may
see tracee enter signal-delivery-stop with other signal(s) first! The
usual practice is to reinject these signals until SIGSTOP is seen, then
suppress SIGSTOP injection. The design bug here is that attach and
concurrent SIGSTOP are racing and SIGSTOP may be lost.
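
The "reinject until SIGSTOP is seen" practice can be sketched as a
small filter that picks the sig argument for the next restarting
command. The function name and the attach_done flag are hypothetical.
Note that it cannot tell the attach SIGSTOP from a concurrent one sent
by somebody else, which is exactly the race described above.

```c
#include <signal.h>
#include <stdbool.h>

/*
 * stopsig: signal reported by a signal-delivery-stop after
 * PTRACE_ATTACH. Returns the sig to pass to the restarting command.
 * *attach_done becomes true once the first SIGSTOP was suppressed.
 */
int attach_filter_signal(int stopsig, bool *attach_done)
{
	if (!*attach_done && stopsig == SIGSTOP) {
		*attach_done = true;
		return 0;	/* suppress the SIGSTOP sent by attach */
	}
	return stopsig;		/* reinject everything else unchanged */
}
```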

??? Describe how to attach to a thread which is already group-stopped.

Since attaching sends SIGSTOP and tracer usually suppresses it, this
may cause stray EINTR return from the currently executing syscall in
the tracee, as described in "signal injection and suppression" section.

ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a
tracee. It continues to run (doesn't enter ptrace-stop). A common
practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and
allow parent (which is our tracer now) to observe our
signal-delivery-stop.

If PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options are in effect,
then children created by (vfork or clone(CLONE_VFORK)), (fork or
clone(SIGCHLD)) and (other kinds of clone) respectively are
automatically attached to the same tracer which traced their parent.
SIGSTOP is delivered to them, causing them to enter
signal-delivery-stop after they exit syscall which created them.

Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig).
PTRACE_DETACH is a restarting operation, therefore it requires tracee
to be in ptrace-stop. If tracee is in signal-delivery-stop, signal can
be injected. Otherwise, sig parameter may be silently ignored.

If tracee is running when tracer wants to detach it, the usual solution
is to send SIGSTOP (using tgkill, to make sure it goes to the correct
thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP
and then detach it (suppressing SIGSTOP injection). Design bug is that
this can race with concurrent SIGSTOPs. Another complication is that
tracee may enter other ptrace-stops and needs to be restarted and
waited for again, until SIGSTOP is seen. Yet another complication is to
be sure that tracee is not already group-stopped, because no signal
delivery happens while it is - not even SIGSTOP.

??? is above accurate?

??? Describe how to detach from a group-stopped tracee so that it
    doesn't run, but continues to wait for SIGCONT.

If tracer dies, all tracees are automatically detached and restarted,
unless they were in group-stop. Handling of restart from group-stop is
currently buggy, but "as planned" behavior is to leave tracee stopped
and waiting for SIGCONT. If tracee is restarted from
signal-delivery-stop, pending signal is injected.


	1.x execve under ptrace.

During execve, kernel destroys all other threads in the process, and
resets execve'ing thread tid to tgid (process id). This looks very
confusing to tracers:

All other threads "disappear" - that is, they terminate their execution
without returning any waitpid notifications to anyone, even if they are
currently traced.

The execve-ing tracee changes its pid while it is in execve syscall.
(Remember, under ptrace 'pid' returned from waitpid, or fed into ptrace
calls, is tracee's tid). That is, pid is reset to process id, which
coincides with thread group leader tid.

If thread group leader has reported its death by this time, for tracer
this looks like dead thread leader "reappears from nowhere". If thread
group leader was still alive, for tracer this may look as if thread
group leader returns from a different syscall than it entered, or even
"returned from syscall even though it was not in any syscall". If
thread group leader was not traced (or was traced by a different
tracer), during execve it will appear as if it has become a tracee of
the tracer of execve'ing tracee. All these effects are the artifacts of
pid change.

PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this
case. It enables PTRACE_EVENT_EXEC stop which occurs before execve
syscall return.

Pid change happens before PTRACE_EVENT_EXEC stop, not after.

When tracer receives PTRACE_EVENT_EXEC stop notification, it is
guaranteed that except this tracee, no other threads from the process
are alive. Moreover, it is guaranteed that tracer will not receive any
"buffered" death reports from any of them, even if some threads were
racing with execve'ing tracee, for example were entering exit syscall.

On receiving this notification, tracer should clean up all its internal
data structures about all threads of this process, and retain only one
data structure, one which describes single still running tracee, with
pid = tgid = process id.

??? How does tracer know which of its many tracees _are_ threads of
that particular process? (It may trace more than one process; it may
not even keep track of its tracees' thread group relations at all...)

??? what happens if two threads execve at the same time? Clearly, only
one of them succeeds, but *which* one? Think "strace -f" or
multi-threaded process here:

  ** we get death notification: leader died: **
 PID0 exit(0)                            = ?
  ** we get syscall-entry-stop in thread 1: **
 PID1 execve("/bin/foo", "foo" <unfinished ...>
  ** we get syscall-entry-stop in thread 2: **
 PID2 execve("/bin/bar", "bar" <unfinished ...>
  ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
  ** we get syscall-exit-stop for PID0: **
 PID0 <... execve resumed> )             = 0

??? Question: WHICH execve succeeded? Can tracer figure it out?

If PTRACE_O_TRACEEXEC option is NOT in effect for the execve'ing
tracee, kernel delivers an extra SIGTRAP to tracee after execve syscall
returns. This is an ordinary signal (similar to one which can be
generated by "kill -TRAP"), not a special kind of ptrace-stop.
GETSIGINFO on it has si_code = 0 (SI_USER). It can be blocked by signal
mask, and thus can happen (much) later.

Usually, tracer (for example, strace) would not want to show this extra
post-execve SIGTRAP signal to the user, and would suppress its delivery
to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal).
However, determining *which* SIGTRAP to suppress is not easy. Setting
PTRACE_O_TRACEEXEC option and thus suppressing this extra SIGTRAP is
the recommended approach.


	1.x Real parent

Ptrace API (ab)uses standard Unix parent/child signaling over waitpid.
This used to cause real parent of the process to stop receiving several
kinds of waitpid notifications when child process is traced by some
other process.

Many of these bugs have been fixed, but as of 2.6.38 several still
exist.

As of 2.6.38, the following is believed to work correctly:

- exit/death by signal is reported first to tracer, then, when tracer
consumes waitpid result, to real parent (to real parent only when the
whole multi-threaded process exits). If they are the same process, the
report is sent only once.

- ??? add more docs

Following bugs still exist:

- group-stop notifications are sent to tracer, but not to real parent.

- If thread group leader is traced and exits, do_wait(WEXITED) doesn't
work for its tracer (until all threads exit).

??? add more known bugs here



	2. Linux kernel implementation

TODO


* Re: Ptrace documentation, draft #3
  2011-05-20 19:23 Ptrace documentation, draft #3 Denys Vlasenko
@ 2011-05-25 14:32 ` Tejun Heo
  2011-05-30  3:08   ` Denys Vlasenko
  2011-05-30  3:28   ` execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3) Denys Vlasenko
  2011-05-30 13:35 ` Ptrace documentation, draft #3 Oleg Nesterov
  1 sibling, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2011-05-25 14:32 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: jan.kratochvil, oleg, linux-kernel, torvalds, akpm, indan

Hello, Denys.

On Fri, May 20, 2011 at 09:23:07PM +0200, Denys Vlasenko wrote:
> When running tracee enters ptrace-stop, it notifies its tracer using
> waitpid API. Tracer should use waitpid family of syscalls to wait for
> tracee to stop. Most of this document assumes that tracer waits with:
> 	pid = waitpid(pid_or_minus_1, &status, __WALL);

It might not be the best idea to listen for WCONTINUED from ptracer.
Unlike stop (or trapped) state, the continued state is per-process and
consuming it would confuse other parents (including the real parent)
of the process.  Plus, continued exit state doesn't carry much
interesting information for ptracer anyway (it can't be used for group
stop state tracking).

> Ptrace-stopped tracees are reported as returns with pid > 0 and
> WIFSTOPPED(status) == true.
> 
> ??? any pitfalls with WNOHANG (I remember that there are bugs in this
>     area)? effects of WSTOPPED, WEXITED, WCONTINUED bits? Are they ok?
>     waitid usage? WNOWAIT?

Yes, there are some race conditions around WNOHANG waits.  If ptracer
is waiting only for stopped state, it shouldn't be visible, I think,
but there are race conditions where transitions between different
states race with WNOHANG wait and wait(2) fails unexpectedly.  Should
be fixed eventually but it has been broken for a very long time.

> 	1.x.x Signal-delivery-stop
> 
> When (possibly multi-threaded) process receives any signal except
> SIGKILL, kernel selects a thread which handles the signal (if signal is
> generated with tgkill, thread selection is done by user). If selected
> thread is traced, it enters signal-delivery-stop. By this point, signal
> is not yet delivered to the process, and can be suppressed by tracer.
> If tracer doesn't suppress the signal, it passes signal to tracee in
> the next ptrace request. This is called "signal injection" and will be
> described later.

I think it would be better to discern between actual signal delivery
and injection.  I'll write more later.

> Note that if signal is blocked, signal-delivery-stop doesn't happen
> until signal is unblocked, with the usual exception that SIGSTOP
> can't be blocked.
>
> Signal-delivery-stop is observed by tracer as waitpid returning with
> WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If
> WSTOPSIG(status) == SIGTRAP, this may be a different kind of
> ptrace-stop - see "Syscall-stops" and "execve" sections below for
> details. If WSTOPSIG(status) == stopping signal, this may be a
> group-stop - see below.

It might be better to first outline different ptrace-stops and how to
discern them?

> 	1.x.x Signal injection and suppression.
> 
> After signal-delivery-stop is observed by tracer, tracer should restart
> tracee with
> 	ptrace(PTRACE_rest, pid, 0, sig)
> call, where PTRACE_rest is one of the restarting ptrace ops. If sig is
> 0, then signal is not delivered. Otherwise, signal sig is delivered.
> This operation is called "signal injection", to distinguish it from
> signal delivery which causes signal-delivery-stop.

Hmmm... I'm unsure whether injection is the appropriate word here
especially because we also have pure signal injections in other ptrace
requests where the kernel really just injects (sends) the requested
signal, which will traverse the signal delivery path later.

This is part of signal delivery path.  Kernel is consulting what to do
about the signal with the ptracer.  The signal is not being injected
by ptracer although it can be squashed or modified.

> Note that sig value may be different from WSTOPSIG(status) value -
> tracer can cause a different signal to be injected.
>
> Note that suppressed signal still causes syscalls to return
> prematurely. Restartable syscalls will be restarted (tracer will
> observe tracee to execute restart_syscall(2) syscall if tracer uses
> PTRACE_SYSCALL), non-restartable syscalls (for example, nanosleep) may
> return with -EINTR even though no observable signal is injected to the
> tracee.

AFAICS, this can also happen when there's no ptracer.
signal_pending() can trigger -EINTR return and signal delivery can
race with other threads and by the time the woken up thread reaches
signal delivery path, there could be no pending signal left and -EINTR
will happen without the thread actually delivering anything.

> Note that restarting ptrace commands issued in ptrace-stops other than
> signal-delivery-stop are not guaranteed to inject a signal, even if sig
> is nonzero. No error is reported, nonzero sig may simply be ignored.
> Ptrace users should not try to "create new signal" this way: use
> tgkill(2) instead.
>
> This is a cause of confusion among ptrace users. One typical scenario
> is that tracer observes group-stop, mistakes it for
> signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0,
> stopsig) with the intention of injecting stopsig, but stopsig gets
> ignored and tracee continues to run.

Yes, so, IMHO it's important to discern these two.  One is delivery,
the other is injection.  Dunno why but injections aren't even
consistent.  It's available for some traps, not for others.  Also, the
injected signal is fundamentally different in that it'll later go
through signal delivery path to be actually delivered.

I think it would be best to discourage the use of injections and only
deal with signals when ptrace reports a signal to deliver.

> SIGCONT signal has a side effect of waking up (all threads of)
> group-stopped process. This side effect happens before
> signal-delivery-stop.

More precisely, it happens at the time SIGCONT is sent.

> Tracer can't suppress this side-effect (it can
> only suppress signal injection, which only causes SIGCONT handler to
> not be executed in the tracee, if such handler is installed). In fact,
> waking up from group-stop may be followed by signal-delivery-stop for
> signal(s) *other than* SIGCONT, if they were pending when SIGCONT was
> delivered. IOW: SIGCONT may be not the first signal observed by the
> tracee after it was sent.

Please also note that from 2.6.40, the waking up won't happen if the
tracee is ptraced.  Before 2.6.40, if ptracer didn't issue any further
ptrace request after group stop, tracee was woken up by SIGCONT.  It
was racy and buggy and both strace and gdb issued further ptrace
requests right away so wasn't being used.

> Stopping signals cause (all threads of) process to enter group-stop.
> This side effect happens after signal injection, and therefore can be
> suppressed by tracer.

Maybe it would be clearer to state that group stop is initiated by the
delivery of a stop signal and ended by sending of SIGCONT?  I think
clearly distinguishing different stages of signal handling would be
nice.  It's visible to ptracer anyway.  ie. sending -> dequeueing (and
consulting ptracer via signal delivery ptrace-stop) -> delivery
(sigaction taken).

> PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which
> corresponds to delivered signal. PTRACE_SETSIGINFO may be used to
> modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t,
> si_signo field and sig parameter in restarting command must match.

Yeap and if it doesn't match, kernel generates a standard user signal
one but probably best to state that the outcome is undefined.

> 	1.x.x Group-stop
> 
> When a (possibly multi-threaded) process receives a stopping signal,
> all threads stop. If some threads are traced, they enter a group-stop.
> Note that stopping signal will first cause signal-delivery-stop (on one
> tracee only), and only after it is injected by tracer (or after it was
> dispatched to a thread which isn't traced), group-stop will be
> initiated on ALL tracees within multi-threaded process. As usual, every
> tracee reports its group-stop to corresponding tracer.

Again, if we discern different stages of signal handling, I think the
above can be much more clearly explained.  Group stop is initiated when a
stop signal is delivered.  Also, note that without the distinction
between "delivery" and "injection", the above paragraph is inaccurate.
After an actual signal injection, group stop won't be initiated until
it is actually delivered by some thread in the group.

> Group-stop is observed by tracer as waitpid returning with
> WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result
> is returned by some other classes of ptrace-stops, therefore the
> recommended practice is to perform
> 	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
> call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP,
> SIGTTIN or SIGTTOU - only these four signals are stopping signals. If
> tracer sees something else, it can't be group-stop. Otherwise, tracer
> needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails, then it is
> definitely a group-stop.

It might also be worth watching the error code.  -EINVAL failure
firmly indicates group stop but it may also fail with -ESRCH as you
pointed out before.

> As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it
> restarts or kills it, tracee will not run, and will not send
> notifications (except SIGKILL death) to tracer, even if tracer enters
> into another waitpid call.

This isn't strictly true.  There's a race window there and tracee
could be woken up behind ptracer's back if SIGCONT is sent before the
first ptrace request after group stop.  This race window should be
gone from 2.6.40.

> Currently, it causes a problem with transparent handling of stopping
> signals: if tracer restarts tracee after group-stop, SIGSTOP is
> effectively ignored: tracee doesn't remain stopped, it runs. If tracer
> doesn't restart tracee before entering into next waitpid, future
> SIGCONT will not be reported to the tracer. Which would make SIGCONT to
> have no effect.
...
> 	1.x.x Syscall-stops
> 
> If tracee was restarted by PTRACE_SYSCALL, tracee enters
> syscall-enter-stop just prior to entering any syscall. If tracer
> restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when
> syscall is finished, or if it is interrupted by a signal. (That is,
> signal-delivery-stop never happens between syscall-enter-stop and
> syscall-exit-stop, it happens *after* syscall-exit-stop).
> 
> Other possibilities are that tracee may stop in a PTRACE_EVENT stop,
> exit (if it entered exit or exit_group syscall), be killed by SIGKILL,
> or die silently (if execve syscall happened in another thread).
> 
> Syscall-enter-stop and syscall-exit-stop are observed by tracer as
> waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) ==
> SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then
> WSTOPSIG(status) == (SIGTRAP | 0x80).

This is because it is handled as a real signal delivery.  Kernel
actually queues the signal rather than taking a trap there.  Later, the
signal delivery path kicks in and what userland sees is the actual
delivery of that kernel-generated signal.  Being an actual signal, it
interferes with user-generated SIGTRAPs, its siginfo can be lost under
memory pressure, and so on.
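
For reference, the status values quoted above can be decoded roughly
like this.  It is only a sketch: the enum and function names are
invented here, and it covers just the syscall-stop cases from the
quoted paragraph, not PTRACE_EVENT stops.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical labels for what a waitpid status may mean. */
enum wait_kind {
    WK_NOT_STOPPED,        /* exited, killed, or continued */
    WK_SYSCALL_STOP,       /* SIGTRAP | 0x80: needs PTRACE_O_TRACESYSGOOD */
    WK_AMBIGUOUS_SIGTRAP,  /* plain SIGTRAP: syscall-stop or a real signal */
    WK_OTHER_STOP          /* any other stop signal */
};

enum wait_kind decode_status(int status)
{
    if (!WIFSTOPPED(status))
        return WK_NOT_STOPPED;
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return WK_SYSCALL_STOP;
    if (WSTOPSIG(status) == SIGTRAP)
        return WK_AMBIGUOUS_SIGTRAP;
    return WK_OTHER_STOP;
}
```

Without PTRACE_O_TRACESYSGOOD every syscall-stop lands in the
ambiguous bucket, which is why the option is the recommended method.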

> There is no portable way to distinguish them from signal-delivery-stop
> with SIGTRAP. Some architectures allow to distinguish them by examining
> registers. For example, on x86 rax = -ENOSYS in syscall-enter-stop.
> Since SIGTRAP (like any other signal) always happens *after*
> syscall-exit-stop, and at this point rax almost never contains -ENOSYS,
> SIGTRAP looks like "syscall-stop which is not syscall-enter-stop", IOW:
> it looks like a "stray syscall-exit-stop" and can be detected this way.
> But such detection is fragile and is best avoided. Using
> PTRACE_O_TRACESYSGOOD option is a recommended method.
> 
> ??? can be distinguished by PTRACE_GETSIGINFO, si_code <= 0 if sent by
> usual suspects like [t]kill, sigqueue; or = SI_KERNEL (0x80) if sent by
> kernel, whereas syscall-stops have si_code = SIGTRAP or (SIGTRAP |
> 0x80). Right? Should this be documented?

Yes, no user sent signal can have si_code > 0.
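
If this gets documented, a check along these lines is what a tracer
would end up writing.  Treat it as a sketch under the assumptions in
the quoted "???" paragraph; the enum and function names are
hypothetical, and si_code is whatever PTRACE_GETSIGINFO returned for a
stop with si_signo == SIGTRAP.

```c
#include <signal.h>

#ifndef SI_KERNEL
#define SI_KERNEL 0x80   /* fallback, value as quoted above */
#endif

/* Hypothetical labels for where a SIGTRAP stop came from. */
enum sigtrap_origin {
    ORIGIN_USER,          /* [t]kill, sigqueue and friends: si_code <= 0 */
    ORIGIN_SYSCALL_STOP,  /* si_code == SIGTRAP or (SIGTRAP | 0x80) */
    ORIGIN_KERNEL,        /* si_code == SI_KERNEL */
    ORIGIN_OTHER          /* some other positive si_code */
};

enum sigtrap_origin sigtrap_origin_from_code(int si_code)
{
    if (si_code <= 0)
        return ORIGIN_USER;
    if (si_code == SIGTRAP || si_code == (SIGTRAP | 0x80))
        return ORIGIN_SYSCALL_STOP;
    if (si_code == SI_KERNEL)
        return ORIGIN_KERNEL;
    return ORIGIN_OTHER;
}
```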

> Syscall-enter-stop and syscall-exit-stop are indistinguishable from
> each other by tracer. Tracer needs to keep track of the sequence of
> ptrace-stops in order to not misinterpret syscall-enter-stop as
> syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is
> always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's
> death - no other kinds of ptrace-stop can occur in between.
> 
> If after syscall-enter-stop tracer uses restarting command other than
> PTRACE_SYSCALL, syscall-exit-stop is not generated.
> 
> PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code
> = SIGTRAP or (SIGTRAP | 0x80).

This needs more discussion but I think it would be better to unify all
trapping mechanisms into ptrace traps with unique PTRACE_EVENT_* codes.
This way, they wouldn't interact with user signals or be affected by
memory pressure, and most notifications could be handled the same way by
the ptracer.

> 	1.x Informational and restarting ptrace commands.
> 
> Most ptrace commands (all except ATTACH, TRACEME, KILL) require tracee
> to be in ptrace-stop, otherwise they fail with ESRCH.
> 
> When tracee is in ptrace-stop, tracer can read and write data to tracee
> using informational commands. They leave tracee in ptrace-stopped state:
> 
> longv = ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
> 	ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
> 	ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
> 	ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
> 	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
> 	ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
> 	ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
> 	ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
> 
> Note that some errors are not reported. For example, setting siginfo
> may have no effect in some ptrace-stops, yet the call may succeed
> (return 0 and don't set errno).

Yeah, it should be used pretty much only during signal delivery stop.

> 	1.x Attaching and detaching
> 
> A thread can be attached to tracer using ptrace(PTRACE_ATTACH, pid, 0,
> 0) call. This also sends SIGSTOP to this thread. If tracer wants this
> SIGSTOP to have no effect, it needs to suppress it. Note that if other
> signals are concurrently sent to this thread during attach, tracer may
> see tracee enter signal-delivery-stop with other signal(s) first! The
> usual practice is to reinject these signals until SIGSTOP is seen, then
> suppress SIGSTOP injection. The design bug here is that attach and
> concurrent SIGSTOP are racing and SIGSTOP may be lost.

Heh, yeah, it's broken.

> ??? Describe how to attach to a thread which is already group-stopped.

No idea.  Sorry.

> Since attaching sends SIGSTOP and tracer usually suppresses it, this
> may cause stray EINTR return from the currently executing syscall in
> the tracee, as described in "signal injection and suppression" section.

As I wrote before, I think this can happen regardless of ptrace.

> ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a
> tracee. It continues to run (doesn't enter ptrace-stop). A common
> practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and allow
> parent (which is our tracer now) to observe our signal-delivery-stop.
> 
> If PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options are in effect,
> then children created by (vfork or clone(CLONE_VFORK)), (fork or
> clone(SIGCHLD)) and (other kinds of clone) respectively are
> automatically attached to the same tracer which traced their parent.
> SIGSTOP is delivered to them, causing them to enter
> signal-delivery-stop after they exit syscall which created them.
> 
> Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig).
> PTRACE_DETACH is a restarting operation, therefore it requires tracee
> to be in ptrace-stop. If tracee is in signal-delivery-stop, signal can
> be injected. Otherwise, sig parameter may be silently ignored.
>
> If tracee is running when tracer wants to detach it, the usual solution
> is to send SIGSTOP (using tgkill, to make sure it goes to the correct
> thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP
> and then detach it (suppressing SIGSTOP injection). Design bug is that
> this can race with concurrent SIGSTOPs. Another complication is that
> tracee may enter other ptrace-stops and needs to be restarted and
> waited for again, until SIGSTOP is seen. Yet another complication is to
> be sure that tracee is not already group-stopped, because no signal
> delivery happens while it is - not even SIGSTOP.
> 
> ??? is above accurate?

Mostly, I think.  The only thing is that a stopped tracee doesn't
deliver signals regardless of where it's stopped.  It doesn't matter
whether it's group stop or ptrace stop.

> ??? Describe how to detach from a group-stopped tracee so that it
>     doesn't run, but continues to wait for SIGCONT.

Currently, this department is so thoroughly broken, I don't think
there's a way to do it in generic manner.  We can suit the solution
sequence to one scenario but it will break for others.

> If tracer dies, all tracees are automatically detached and restarted,
> unless they were in group-stop. Handling of restart from group-stop is
> currently buggy, but "as planned" behavior is to leave tracee stopped
> and waiting for SIGCONT. If tracee is restarted from
> signal-delivery-stop, pending signal is injected.

Yeap, the plan is to decouple group stop and tracee execution.

> 	1.x execve under ptrace.
> 
...
>   ** we get death notification: leader died: **
>  PID0 exit(0)                            = ?
>   ** we get syscall-entry-stop in thread 1: **
>  PID1 execve("/bin/foo", "foo" <unfinished ...>
>   ** we get syscall-entry-stop in thread 2: **
>  PID2 execve("/bin/bar", "bar" <unfinished ...>
>   ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
>   ** we get syscall-exit-stop for PID0: **
>  PID0 <... execve resumed> )             = 0
> 
> ??? Question: WHICH execve succeeded? Can tracer figure it out?

Hmmm... I don't know.  Maybe we can set ptrace message to the original
tid?

> 	1.x Real parent
> 
> Ptrace API (ab)uses standard Unix parent/child signaling over waitpid.
> This used to cause real parent of the process to stop receiving several
> kinds of waitpid notifications when child process is traced by some
> other process.
> 
> Many of these bugs have been fixed, but as of 2.6.38 several still
> exist.

Yeap, it should behave sanely from 2.6.40.

Wheee... that's a long scary document.  Thanks a lot.

-- 
tejun


* Re: Ptrace documentation, draft #3
  2011-05-25 14:32 ` Tejun Heo
@ 2011-05-30  3:08   ` Denys Vlasenko
  2011-05-30  3:28   ` execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3) Denys Vlasenko
  1 sibling, 0 replies; 17+ messages in thread
From: Denys Vlasenko @ 2011-05-30  3:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jan.kratochvil, oleg, linux-kernel, torvalds, akpm, indan

On Wednesday 25 May 2011 16:32, Tejun Heo wrote:
> On Fri, May 20, 2011 at 09:23:07PM +0200, Denys Vlasenko wrote:
> > When running tracee enters ptrace-stop, it notifies its tracer using
> > waitpid API. Tracer should use waitpid family of syscalls to wait for
> > tracee to stop. Most of this document assumes that tracer waits with:
> > 	pid = waitpid(pid_or_minus_1, &status, __WALL);
> 
> It might not be the best idea to listen for WCONTINUED from ptracer.
> Unlike stop (or trapped) state, the continued state is per-process and
> consuming it would confuse other parents (including the real parent)
> of the process.  Plus, continued exit state doesn't carry much
> interesting information for ptracer anyway (it can't be used for group
> stop state tracking).

Added this info to the next doc revision.


> > Ptrace-stopped tracees are reported as returns with pid > 0 and
> > WIFSTOPPED(status) == true.
> > 
> > ??? any pitfalls with WNOHANG (I remember that there are bugs in this
> >     area)? effects of WSTOPPED, WEXITED, WCONTINUED bits? Are they ok?
> >     waitid usage? WNOWAIT?
> 
> Yes, there are some race conditions around WNOHANG waits.  If ptracer
> is waiting only for stopped state, it shouldn't be visible, I think,
> but there are race conditions where transitions between different
> states race with WNOHANG wait and wait(2) fails unexpectedly.  Should
> be fixed eventually but it has been broken for a very long time.

Added this info to the next doc revision.


> > 	1.x.x Signal-delivery-stop
> > 
> > When (possibly multi-threaded) process receives any signal except
> > SIGKILL, kernel selects a thread which handles the signal (if signal is
> > generated with tgkill, thread selection is done by user). If selected
> > thread is traced, it enters signal-delivery-stop. By this point, signal
> > is not yet delivered to the process, and can be suppressed by tracer.
> > If tracer doesn't suppress the signal, it passes signal to tracee in
> > the next ptrace request. This is called "signal injection" and will be
> > described later.
> 
> I think it would be better to discern between actual signal delivery
> and injection.  I'll write more later.

I think it's just a matter of agreeing on a terminology.
In this doc, I call this "signal delivery (under ptrace)":

waitpid: WIFSTOPPED == 1, WSTOPSIG == sig

and call this subsequent operation "signal injection":

ptrace(PTRACE_cont, pid, 0, sig);

I am not particularly attached to these exact terms.
Maybe yours will sound better. What would you call these things?
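
To make the two steps concrete, here is a small sketch of how a tracer
might go from the "delivery" report to the "injection" request.  The
helper names are invented for illustration, and it assumes the stop has
already been identified as a signal-delivery-stop (not a syscall-stop
or group-stop).

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Pick the sig argument for the restarting request.
 * suppress != 0 means the tracer wants to swallow the signal. */
int sig_for_restart(int status, int suppress)
{
    if (!WIFSTOPPED(status) || suppress)
        return 0;
    return WSTOPSIG(status);
}

/* "Injection": restart the tracee, passing the chosen signal.
 * PTRACE_CONT stands in for any restarting request. */
long restart_tracee(pid_t pid, int status, int suppress)
{
    int sig = sig_for_restart(status, suppress);

    return ptrace(PTRACE_CONT, pid, NULL, (void *)(long)sig);
}
```

With suppress == 0 the observed signal is passed through unchanged;
the tracer can of course pass a different signal instead, as the doc
notes.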

 
> > Note that if signal is blocked, signal-delivery-stop doesn't happen
> > until signal is unblocked, with the usual exception that SIGSTOP
> > can't be blocked.
> >
> > Signal-delivery-stop is observed by tracer as waitpid returning with
> > WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If
> > WSTOPSIG(status) == SIGTRAP, this may be a different kind of
> > ptrace-stop - see "Syscall-stops" and "execve" sections below for
> > details. If WSTOPSIG(status) == stopping signal, this may be a
> > group-stop - see below.
> 
> It might be better to first outline different ptrace-stops and how to
> discern them?

Yes.

 
> > 	1.x.x Signal injection and suppression.
> > 
> > After signal-delivery-stop is observed by tracer, tracer should restart
> > tracee with
> > 	ptrace(PTRACE_rest, pid, 0, sig)
> > call, where PTRACE_rest is one of the restarting ptrace ops. If sig is
> > 0, then signal is not delivered. Otherwise, signal sig is delivered.
> > This operation is called "signal injection", to distinguish it from
> > signal delivery which causes signal-delivery-stop.
> 
> Hmmm... I'm unsure whether injection is the appropriate word here
> especially because we also have pure signal injections in other ptrace
> requests where the kernel really just injects (sends) the requested
> signal, which will traverse the signal delivery path later.

I don't know any (documented) way to do something like this.
Please elaborate.


> This is part of signal delivery path.  Kernel is consulting what to do
> about the signal with the ptracer.  The signal is not being injected
> by ptracer although it can be squashed or modified.

You don't like the word "inject" because it implies *creation*
of a new signal? Propose different term please.


> > Note that sig value may be different from WSTOPSIG(status) value -
> > tracer can cause a different signal to be injected.
> >
> > Note that suppressed signal still causes syscalls to return
> > prematurely. Restartable syscalls will be restarted (tracer will
> > observe tracee to execute restart_syscall(2) syscall if tracer uses
> > PTRACE_SYSCALL), non-restartable syscalls (for example, nanosleep) may
> > return with -EINTR even though no observable signal is injected to the
> > tracee.
> 
> AFAICS, this can also happen when there's no ptracer.
> signal_pending() can trigger -EINTR return and signal delivery can
> race with other threads and by the time the woken up thread reaches
> signal delivery path, there could be no pending signal left and -EINTR
> will happen without the thread actually delivering anything.

It can't happen in single-threaded process. Whereas under ptrace,
it can. Therefore this is still an observable effect and we can't
handwave it away.


> > Note that restarting ptrace commands issued in ptrace-stops other than
> > signal-delivery-stop are not guaranteed to inject a signal, even if sig
> > is nonzero. No error is reported, nonzero sig may simply be ignored.
> > Ptrace users should not try to "create new signal" this way: use
> > tgkill(2) instead.
> >
> > This is a cause of confusion among ptrace users. One typical scenario
> > is that tracer observes group-stop, mistakes it for
> > signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0,
> > stopsig) with the intention of injecting stopsig, but stopsig gets
> > ignored and tracee continues to run.
> 
> Yes, so, IMHO it's important to discern these two.  One is delivery,
> the other is injection. 

And I _do_ discern them. See above.


> Dunno why but injections aren't even 
> consistent.  It's available for some traps, not for others.  Also, the
> injected signal is fundamentally different 

Fundamentally different from what?

> in that it'll later go 
> through signal delivery path to be actually delivered.
> 
> I think it would be best to discourage the use of injections and only
> deal with signals when ptrace reports a signal to deliver.

Yes, Oleg also says that for now we need to declare the behavior of
ptrace(PTRACE_cont, pid, 0, sig) undefined when it is not done after a
signal-delivery-stop.


> > SIGCONT signal has a side effect of waking up (all threads of)
> > group-stopped process. This side effect happens before
> > signal-delivery-stop.
> 
> More precisely, it happens at the time SIGCONT is sent.

From userspace POV, this is the same thing.


> > Tracer can't suppress this side-effect (it can
> > only suppress signal injection, which only causes SIGCONT handler to
> > not be executed in the tracee, if such handler is installed). In fact,
> > waking up from group-stop may be followed by signal-delivery-stop for
> > signal(s) *other than* SIGCONT, if they were pending when SIGCONT was
> > delivered. IOW: SIGCONT may be not the first signal observed by the
> > tracee after it was sent.
> 
> Please also note that from 2.6.40, the waking up won't happen if the
> tracee is ptraced.  Before 2.6.40, if ptracer didn't issue any further
> ptrace request after group stop, tracee was woken up by SIGCONT.  It
> was racy and buggy, and both strace and gdb issued further ptrace
> requests right away, so it wasn't being used.

Oleg and I think that we should not document this pre-2.6.40 behavior.
We should just say that currently, not PTRACE_cont'ing a group-stopped
tracee is a bad idea, and PTRACE_cont'ing the tracee will wake it up
(make it run).


> > Stopping signals cause (all threads of) process to enter group-stop.
> > This side effect happens after signal injection, and therefore can be
> > suppressed by tracer.
> 
> Maybe it would be clearer to state that group stop is initiated by the
> delivery of a stop signal and ended by sending of SIGCONT?

I simply documented current buggy state: that group-stop is reported,
but is not retained: PTRACE_cont makes tracee run. (Hmm. what happens
in multi-threaded processes?...)


> I think 
> clearly distinguishing different stages of signal handling would be
> nice.  It's visible to ptracer anyway.  ie. sending -> dequeueing (and
> consulting ptracer via signal delivery ptrace-stop) -> delivery
> (sigaction taken).

Sending: is unobservable (it is done by someone else),
dequeuing: I call it "delivery"
delivery: I call it "injection"


> > PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which
> > corresponds to delivered signal. PTRACE_SETSIGINFO may be used to
> > modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t,
> > si_signo field and sig parameter in restarting command must match.
> 
> Yeap and if it doesn't match, kernel generates a standard user signal
> one but probably best to state that the outcome is undefined.

Added this to the next doc revision.


> > 	1.x.x Group-stop
> > 
> > When a (possibly multi-threaded) process receives a stopping signal,
> > all threads stop. If some threads are traced, they enter a group-stop.
> > Note that stopping signal will first cause signal-delivery-stop (on one
> > tracee only), and only after it is injected by tracer (or after it was
> > dispatched to a thread which isn't traced), group-stop will be
> > initiated on ALL tracees within multi-threaded process. As usual, every
> > tracee reports its group-stop to corresponding tracer.
> 
> Again, if we discern different stages of signal handling, I think the
> above can be much more clearly explained.  Group stop is initiated when a
> stop signal is delivered.  Also, note that without the distinction
> between "delivery" and "injection", the above paragraph is inaccurate.
> After an actual signal injection, group stop won't be initiated until
> it is actually delivered by some thread in the group.

What would you call the stop which I call "signal-delivery-stop"?
What would you call the ptrace(PTRACE_cont, pid, 0, sig) op?

 
> > Group-stop is observed by tracer as waitpid returning with
> > WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result
> > is returned by some other classes of ptrace-stops, therefore the
> > recommended practice is to perform
> > 	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
> > call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP,
> > SIGTTIN or SIGTTOU - only these four signals are stopping signals. If
> > tracer sees something else, it can't be group-stop. Otherwise, tracer
> > needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails, then it is
> > definitely a group-stop.
> 
> It might also be worth watching the error code.  -EINVAL failure
> firmly indicates group stop but it may also fail with -ESRCH as you
> pointed out before.

Added this to the next doc revision.

 
> > As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it
> > restarts or kills it, tracee will not run, and will not send
> > notifications (except SIGKILL death) to tracer, even if tracer enters
> > into another waitpid call.
> 
> This isn't strictly true.  There's a race window there and tracee
> could be woken up behind ptracer's back if SIGCONT is sent before the
> first ptrace request after group stop.  This race window should be
> gone from 2.6.40.

Yes.


> > 	1.x.x Syscall-stops
> > 
> > If tracee was restarted by PTRACE_SYSCALL, tracee enters
> > syscall-enter-stop just prior to entering any syscall. If tracer
> > restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when
> > syscall is finished, or if it is interrupted by a signal. (That is,
> > signal-delivery-stop never happens between syscall-enter-stop and
> > syscall-exit-stop, it happens *after* syscall-exit-stop).
> > 
> > Other possibilities are that tracee may stop in a PTRACE_EVENT stop,
> > exit (if it entered exit or exit_group syscall), be killed by SIGKILL,
> > or die silently (if execve syscall happened in another thread).
> > 
> > Syscall-enter-stop and syscall-exit-stop are observed by tracer as
> > waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) ==
> > SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then
> > WSTOPSIG(status) == (SIGTRAP | 0x80).
> 
> This is because it is handled as a real signal delivery.  Kernel
> actually queues the signal rather than taking a trap there.  Later, signal
> delivery path kicks in and what userland sees is the actual delivery
> of that kernel generated signal and being an actual signal it
> interferes with user generated SIGTRAPs, siginfo can be lost under
> memory pressure and so on.

Does it have userspace-observable effects? Such as: will blocking SIGTRAP
block it too?


> > Syscall-enter-stop and syscall-exit-stop are indistinguishable from
> > each other by tracer. Tracer needs to keep track of the sequence of
> > ptrace-stops in order to not misinterpret syscall-enter-stop as
> > syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is
> > always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's
> > death - no other kinds of ptrace-stop can occur in between.
> > 
> > If after syscall-enter-stop tracer uses restarting command other than
> > PTRACE_SYSCALL, syscall-exit-stop is not generated.
> > 
> > PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code
> > = SIGTRAP or (SIGTRAP | 0x80).
> 
> This needs more discussion but I think it would be better to unify all
> trapping mechanism into ptrace traps with unique PTRACE_EVENT_* codes.
> This way, it wouldn't interact with user signals or affected by memory
> pressure and most notifications can be handled the same way by the
> ptracer.

Probably a good idea, but not a goal of this doc. The doc is meant to describe
current situation.


> > Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig).
> > PTRACE_DETACH is a restarting operation, therefore it requires tracee
> > to be in ptrace-stop. If tracee is in signal-delivery-stop, signal can
> > be injected. Otherwise, sig parameter may be silently ignored.
> >
> > If tracee is running when tracer wants to detach it, the usual solution
> > is to send SIGSTOP (using tgkill, to make sure it goes to the correct
> > thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP
> > and then detach it (suppressing SIGSTOP injection). Design bug is that
> > this can race with concurrent SIGSTOPs. Another complication is that
> > tracee may enter other ptrace-stops and needs to be restarted and
> > waited for again, until SIGSTOP is seen. Yet another complication is to
> > be sure that tracee is not already group-stopped, because no signal
> > delivery happens while it is - not even SIGSTOP.
> > 
> > ??? is above accurate?
> 
> Mostly, I think.  The only thing is that a stopped tracee doesn't
> deliver signals regardless of where it's stopped.  It doesn't matter
> whether it's group stop or ptrace stop.

In this document, I presume that group-stop is a form of ptrace-stop
(for ptraced threads). [Remember: I describe what userspace sees,
not kernel's internal machinery].

So, s/tracee is not already group-stopped/tracee is not already ptrace-stopped/


> Currently, this department is so thoroughly broken, I don't think
> there's a way to do it in generic manner.  We can suit the solution
> sequence to one scenario but it will break for others.

IIRC gdb performs some scary magic which mostly works.


Expect updated doc soon.

-- 
vda


* execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-25 14:32 ` Tejun Heo
  2011-05-30  3:08   ` Denys Vlasenko
@ 2011-05-30  3:28   ` Denys Vlasenko
  2011-05-30  8:49     ` Tejun Heo
  2011-05-30 13:49     ` Oleg Nesterov
  1 sibling, 2 replies; 17+ messages in thread
From: Denys Vlasenko @ 2011-05-30  3:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jan.kratochvil, oleg, linux-kernel, torvalds, akpm, indan

On Wednesday 25 May 2011 16:32, Tejun Heo wrote:
> > 	1.x execve under ptrace.
> > 
> ...
> >   ** we get death notification: leader died: **
> >  PID0 exit(0)                            = ?
> >   ** we get syscall-entry-stop in thread 1: **
> >  PID1 execve("/bin/foo", "foo" <unfinished ...>
> >   ** we get syscall-entry-stop in thread 2: **
> >  PID2 execve("/bin/bar", "bar" <unfinished ...>
> >   ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
> >   ** we get syscall-exit-stop for PID0: **
> >  PID0 <... execve resumed> )             = 0
> > 
> > ??? Question: WHICH execve succeeded? Can tracer figure it out?
> 
> Hmmm... I don't know.  Maybe we can set ptrace message to the original
> tid?

The problem with execve is bigger than merely reporting this pid.

Consider how strace tracks its tracees. Currently, it remembers
their pids - sometimes by remembering clone's return values!
This is hopelessly broken wrt pid namespaces.

So I looked at removing all pid tracking from strace, because
it uses pids only for some (extremely fragile) workarounds
for old kernel bugs, such as: it suspends waitpid's in tracees
until there is a child it can wait for; it detaches from
a tracee if it gets signaled with a fatal signal or calls exit;
and similar madness.

There are many bugs in strace in this area, because it cannot
properly emulate a lot of things (such as signal interrupting
waitpid, waitpid(-PGID), etc).

Therefore I plan to delete this madness.

The idea is that strace can simply create a new tracee's data
structure when it sees a new, never before seen pid popping up
from waitpid - this means that [v]fork/clone created a child,
and now it is traced too. It does not need to know beforehand
about its pid. It does not need to know who is whose parent
or sibling.
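
Roughly, the bookkeeping boils down to a get-or-create lookup keyed by
the pid that waitpid returned.  The sketch below is not strace's actual
code: "struct tcb" is only borrowed as a name, and the table type, the
in_syscall field and the function names are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>

/* Per-tracee state; fields here are illustrative only. */
struct tcb {
    pid_t pid;
    int in_syscall;     /* toggled on syscall-enter/exit stops */
    struct tcb *next;
};

struct tcb_table {
    struct tcb *head;
    size_t count;
};

struct tcb *tcb_lookup(struct tcb_table *t, pid_t pid)
{
    struct tcb *p;

    for (p = t->head; p; p = p->next)
        if (p->pid == pid)
            return p;
    return NULL;
}

/* Called with every pid returned by waitpid: an unknown pid
 * simply means a new child is now being traced. */
struct tcb *tcb_get_or_create(struct tcb_table *t, pid_t pid)
{
    struct tcb *p = tcb_lookup(t, pid);

    if (p)
        return p;
    p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    p->pid = pid;
    p->next = t->head;
    t->head = p;
    t->count++;
    return p;
}

/* Called when WIFEXITED/WIFSIGNALED is reported for pid. */
void tcb_drop(struct tcb_table *t, pid_t pid)
{
    struct tcb **pp = &t->head;

    while (*pp) {
        if ((*pp)->pid == pid) {
            struct tcb *dead = *pp;
            *pp = dead->next;
            free(dead);
            t->count--;
            return;
        }
        pp = &(*pp)->next;
    }
}
```

Note that tcb_drop is only ever called when a death notification
arrives for that pid, so an entry whose tracee goes away silently is
never removed.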

This works (I have a patch against a somewhat older strace),
but now in light of this "interesting" execve-under-ptrace
behavior it appears to have a flaw: all threads except the
execve'ing one disappear without any notification to strace,
therefore strace doesn't know which tracee data ("struct tcb"
in strace-speak) need to be dropped!

I am not sure current strace handles this correctly either.
I will be very surprised if it does.

I think the API needs fixing. Tracees must never disappear like that
on execve (or in any other case). They must always deliver a
WIFEXITED or WIFSIGNALED notification, allowing tracer to know
that they are gone. We probably also need to document how these
"I died on execve" notifications are ordered wrt PTRACE_EVENT_EXEC
stop in execve-ing thread.

Ideas?


-- 
vda




* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30  3:28   ` execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3) Denys Vlasenko
@ 2011-05-30  8:49     ` Tejun Heo
  2011-05-30 11:40       ` Denys Vlasenko
  2011-05-30 13:56       ` Oleg Nesterov
  2011-05-30 13:49     ` Oleg Nesterov
  1 sibling, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2011-05-30  8:49 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: jan.kratochvil, oleg, linux-kernel, torvalds, akpm, indan

Hello, Denys.

On Mon, May 30, 2011 at 05:28:17AM +0200, Denys Vlasenko wrote:
> On Wednesday 25 May 2011 16:32, Tejun Heo wrote:
> > > 	1.x execve under ptrace.
> > > 
> > ...
> > >   ** we get death notification: leader died: **
> > >  PID0 exit(0)                            = ?
> > >   ** we get syscall-entry-stop in thread 1: **
> > >  PID1 execve("/bin/foo", "foo" <unfinished ...>
> > >   ** we get syscall-entry-stop in thread 2: **
> > >  PID2 execve("/bin/bar", "bar" <unfinished ...>
> > >   ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
> > >   ** we get syscall-exit-stop for PID0: **
> > >  PID0 <... execve resumed> )             = 0
> > > 
> > > ??? Question: WHICH execve succeeded? Can tracer figure it out?
> > 
> > Hmmm... I don't know.  Maybe we can set ptrace message to the original
> > tid?
> 
> The problem with execve is bigger than merely reporting this pid.
>
> Consider how strace tracks its tracees. Currently, it remembers
> their pids - sometimes by remembering clone's return values!
> This is hopelessly broken wrt pid namespaces.

I'm not too familiar with pid namespaces but don't all threads of the
same process belong to the same namespace?  I don't think strace would
need to track pids all the time.  It just needs to store pids of
in-flight exec's and match it on exec completion.  I'm probably
missing something but why wouldn't that work?

> This works (I have a patch against a somewhat older strace),
> but now in light of this "interesting" execve-under-ptrace
> behavior it appears to have a flaw: all threads except the
> execve'ing one disappear without any notification to strace,
> therefore strace doesn't know which tracee data ("struct tcb"
> in strace-speak) need to be dropped!
> 
> I am not sure current strace handles this correctly either.
> I will be very surprised if it does.
> 
> I think the API needs fixing. Tracee must never disappear like that
> on execve (or in any other case). They must always deliver a
> WIFEXITED or WIFSIGNALED notification, allowing tracer to know
> that they are gone. We probably also need to document how are these
> "I died on execve" notifications are ordered wrt PTRACE_EVENT_EXEC
> stop in execve-ing thread.

A problem is that by the time de-threading is in progress, it's
already too deep and there's no way back and the exec'ing thread has
to wait for completion in uninterruptible sleeps - ie. it expects
de-threading to finish in finite amount of time and to achieve that it
basically sends SIGKILL to all other threads.  If we introduce a trap
in de-threading itself, we can easily end up with an unkillable
task.

> Ideas?

But, if necessary, I can think of two other ways,

1. Don't allow more than one thread in the same group enter exec(2)
   path at all.  It's not like parallel execution of exec(2) buys us
   anything anyway.  One thing to be careful about is that binfmt code
   may recurse.

2. Add another trap point right before de-threading commences.  It can
   still back out if de-threading hasn't started yet.  We'll still
   need to add explicit synchronization there but the window would be
   much smaller.

Thanks.

-- 
tejun


* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30  8:49     ` Tejun Heo
@ 2011-05-30 11:40       ` Denys Vlasenko
  2011-05-30 14:27         ` Denys Vlasenko
  2011-05-30 13:56       ` Oleg Nesterov
  1 sibling, 1 reply; 17+ messages in thread
From: Denys Vlasenko @ 2011-05-30 11:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jan.kratochvil, oleg, linux-kernel, torvalds, akpm, indan

On Mon, May 30, 2011 at 10:49 AM, Tejun Heo <tj@kernel.org> wrote:
> On Mon, May 30, 2011 at 05:28:17AM +0200, Denys Vlasenko wrote:
>> On Wednesday 25 May 2011 16:32, Tejun Heo wrote:
>> > >   1.x execve under ptrace.
>> > >
>> > ...
>> > >   ** we get death notification: leader died: **
>> > >  PID0 exit(0)                            = ?
>> > >   ** we get syscall-entry-stop in thread 1: **
>> > >  PID1 execve("/bin/foo", "foo" <unfinished ...>
>> > >   ** we get syscall-entry-stop in thread 2: **
>> > >  PID2 execve("/bin/bar", "bar" <unfinished ...>
>> > >   ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
>> > >   ** we get syscall-exit-stop for PID0: **
>> > >  PID0 <... execve resumed> )             = 0
>> > >
>> > > ??? Question: WHICH execve succeeded? Can tracer figure it out?
>> >
>> > Hmmm... I don't know.  Maybe we can set ptrace message to the original
>> > tid?
>>
>> The problem with execve is bigger than merely reporting this pid.
>>
>> Consider how strace tracks its tracees. Currently, it remembers
>> their pids - sometimes by remembering clone's return values!
>> This is hopelessly broken wrt pid namespaces.
>
> I'm not too familiar with pid namespaces but don't all threads of the
> same process belong to the same namespace?  I don't think strace would
> need to track pids all the time.  It just needs to store pids of
> in-flight exec's and match it on exec completion.  I'm probably
> missing something but why wouldn't that work?

I think I was not clear (or elaborate) enough. I am not worrying
about "two execve's in two threads at once" scenario. I am worried about
the following scenario:

* strace is run as "strace -f PROG ARGS" - that is, "trace children too" mode.
* PROG forks a few times. Now strace traces several processes.
* Now some of those processes create threads. Now, strace traces
several processes, some (or even all) of them are multi-threaded.
* From strace's POV, it just knows a bunch of pids it traces. It doesn't
maintain information about who is whose parent *or sibling*.
* One of threads in one of the processes execves.
* Because of execve, _some_ tracees die (not _all_ straced pids, only
those which comprise the thread group of the execve'ing thread), and
the execve'ing thread changes its pid on syscall exit and continues
executing as the thread leader of the new, (so far) single-threaded
process.
* PROBLEM: how does strace know which of its tracees are dead now?

IOW: consider the following program (pseudo-C):

/* we are pid0 now: thread leader. Single-threaded so far... */
/* create an ordinary child (not a thread) */
child = fork();
if (child==0) { usleep(1000); exit(0); }
/* create two threads */
pid1 = clone();
pid2 = clone();
/* we have three threads now */
if (we are not pid2) sleep(1); else execve("/proc/self/exe");
/* pid0 and pid1 died, pid2 execve'ed and became the "new" pid0 */
/* go back to the beginning */

Now imagine that you run it under "strace -f".
If strace did not bother to delete, on execve, the malloced
struct tcb's which correspond to the threads that died,
it would leak memory on each execve.
And because of the fork, it cannot delete ALL struct tcb's
on execve - the child is not killed by execve, it must be
still tracked!


>> This works (I have a patch against a somewhat older strace),
>> but now in light of this "interesting" execve-under-ptrace
>> behavior it appears to have a flaw: all threads except the
>> execve'ing one disappear without any notification to strace,
>> therefore strace doesn't know which tracee data ("struct tcb"
>> in strace-speak) need to be dropped!
>>
>> I am not sure current strace handles this correctly either.
>> I will be very surprised if it does.
>>
>> I think the API needs fixing. Tracee must never disappear like that
>> on execve (or in any other case). They must always deliver a
>> WIFEXITED or WIFSIGNALED notification, allowing tracer to know
>> that they are gone. We probably also need to document how are these
>> "I died on execve" notifications are ordered wrt PTRACE_EVENT_EXEC
>> stop in execve-ing thread.
>
> A problem is that by the time de-threading is in progress, it's
> already too deep and there's no way back and the exec'ing thread has
> to wait for completion in uninterruptible sleeps - ie. it expects
> de-threading to finish in finite amount of time and to achieve that it
> basically sends SIGKILL to all other threads.

Which is fine. Can we make the death from this "internal SIGKILL"
visible to the tracer of killed tracees?


>  If we introduce a trap
> in de-threading itself, we can easily end up with an unkillable
> task.

I don't see the need to ensure that de-threading deaths are visible to tracer
before execve returns. They can be queued and seen by tracer later.


-- 
vda


* Re: Ptrace documentation, draft #3
  2011-05-20 19:23 Ptrace documentation, draft #3 Denys Vlasenko
  2011-05-25 14:32 ` Tejun Heo
@ 2011-05-30 13:35 ` Oleg Nesterov
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-05-30 13:35 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, jan.kratochvil, linux-kernel, torvalds, akpm, indan

On 05/20, Denys Vlasenko wrote:
>
> ??? How tracer knows which of its many tracees _are_ threads of that
> particular process? (It may trace more than one process; it may even
> don't keep track of its tracees' thread group relations at all...)

I think the tracer should track the tgid relations if it wants to know
this. Although we could add a simple PTRACE_ request which provides some
info including tgid.
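
A minimal sketch of one way a tracer could learn a tracee's tgid without
a new ptrace request: read the Tgid field of /proc/<tid>/status. The
helper names (parse_tgid, tgid_of) are made up for illustration, and the
sketch assumes /proc is mounted and shows ids as seen from the tracer's
own pid namespace:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Extract the Tgid field from the text of a /proc/<tid>/status file.
 * Only matches at the start of a line. Returns -1 if the field is missing. */
pid_t parse_tgid(const char *status_text)
{
	const char *p = status_text;
	int tgid;

	while (p && *p) {
		/* The space in the format also matches the tab after "Tgid:" */
		if (sscanf(p, "Tgid: %d", &tgid) == 1)
			return tgid;
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/* Read /proc/<tid>/status and return the thread group id, or -1 on error. */
pid_t tgid_of(pid_t tid)
{
	char path[64], buf[4096];
	size_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/status", (int)tid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_tgid(buf);
}
```

Parsing is kept separate from file access so it can be exercised on a
canned buffer. A tracer would call tgid_of(tid) once, when a new tid
first shows up in waitpid, and cache the result.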

> ??? what happens if two threads execve at the same time? Clearly, only
> one of them succeeds, but *which* one? Think "strace -f" or
> multi-threaded process here:
>
>   ** we get death notification: leader died: **
>  PID0 exit(0)                            = ?
>   ** we get syscall-entry-stop in thread 1: **
>  PID1 execve("/bin/foo", "foo" <unfinished ...>
>   ** we get syscall-entry-stop in thread 2: **
>  PID2 execve("/bin/bar", "bar" <unfinished ...>
>   ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
>   ** we get syscall-exit-stop for PID0: **
>  PID0 <... execve resumed> )             = 0
>
> ??? Question: WHICH execve succeeded? Can tracer figure it out?

Afaics, in general the tracer can't figure it out. Well, in this
particular case the tracer can inspect the arguments when the winner
(now it is PID0) reports the syscall-exit.

Also. All threads but the winner can report PTRACE_EVENT_EXIT. But once
again, we have problems with PTRACE_EVENT_EXIT/fatal_signal_pending().

Oleg.



* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30  3:28   ` execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3) Denys Vlasenko
  2011-05-30  8:49     ` Tejun Heo
@ 2011-05-30 13:49     ` Oleg Nesterov
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-05-30 13:49 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, jan.kratochvil, linux-kernel, torvalds, akpm, indan

On 05/30, Denys Vlasenko wrote:
>
> Consider how strace tracks its tracees. Currently, it remembers
> their pids - sometimes by remembering clone's return values!
> This is hopelessly broken wrt pid namespaces.

Yes. Unless the tracer lives in the same namespace it can't use
RAX as the pid. This return value only makes sense inside the
tracee's namespace.

There is another problem, tracehook_report_clone_complete()
sets PTRACE_GETEVENTMSG == global_pid. IOW, this value can't
be used unless the tracer runs in the root namespace.

> So I looked at removing all pid tracking from strace,

I am not sure... but you certainly know better what strace
can/should do.

> The idea is that strace can simply create a new tracee's data
> structure when it sees a new, never before seen pid popping up
> from waitpid

This can probably work for strace. Note that this means strace
can't detach all tracees gracefully, it simply doesn't know them
all. But probably strace doesn't need this.

> This works (I have a patch against a somewhat older strace),
> but now in light of this "interesting" execve-under-ptrace
> behavior it appears to have a flaw: all threads except the
> execve'ing one disappear without any notification to strace,
> therefore strace doesn't know which tracee data ("struct tcb"
> in strace-speak) need to be dropped!

I think there is no choice currently, strace should remember tgid.
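
As a rough illustration of what remembering the tgid buys: on an exec
stop the tracer can drop exactly the entries of that thread group and
keep everything else (such as the forked child in Denys's example). The
struct below is a simplified, hypothetical stand-in for strace's
struct tcb, not strace's real code:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical, simplified per-tracee record. */
struct tcb {
	pid_t tid;
	pid_t tgid;
	struct tcb *next;
};

/* Add a tracee at the front of the list; NULL on allocation failure. */
struct tcb *tcb_add(struct tcb **head, pid_t tid, pid_t tgid)
{
	struct tcb *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->tid = tid;
	t->tgid = tgid;
	t->next = *head;
	*head = t;
	return t;
}

/* Find the record for a tid, or NULL. */
struct tcb *tcb_find(struct tcb *head, pid_t tid)
{
	for (; head; head = head->next)
		if (head->tid == tid)
			return head;
	return NULL;
}

/* Number of tracked tracees. */
int tcb_count(const struct tcb *head)
{
	int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}

/*
 * Called when PTRACE_EVENT_EXEC is reported for `tgid`: the only thread
 * left in that group is the one now running with tid == tgid. Drop all
 * other records of the group; keep (or create) the one for tid == tgid.
 * Returns the number of records dropped.
 */
int tcb_exec_dethread(struct tcb **head, pid_t tgid)
{
	struct tcb **pp = head;
	int dropped = 0, have_leader = 0;

	while (*pp) {
		struct tcb *t = *pp;

		if (t->tgid == tgid && t->tid != tgid) {
			*pp = t->next;
			free(t);
			dropped++;
		} else {
			if (t->tid == tgid)
				have_leader = 1;
			pp = &t->next;
		}
	}
	if (!have_leader)
		tcb_add(head, tgid, tgid);
	return dropped;
}
```

Records of other thread groups are never touched, which is the property
the "delete all tcb's on execve" approach lacks.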

> I think the API needs fixing.

ptrace() should not be pid/thread based ;) But this is offtopic now.

Oleg.



* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30  8:49     ` Tejun Heo
  2011-05-30 11:40       ` Denys Vlasenko
@ 2011-05-30 13:56       ` Oleg Nesterov
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-05-30 13:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, jan.kratochvil, linux-kernel, torvalds, akpm, indan

On 05/30, Tejun Heo wrote:
>
> A problem is that by the time de-threading is in progress, it's
> already too deep and there's no way back and the exec'ing thread has
> to wait for completion in uninterruptible sleeps - ie. it expects
> de-threading to finish in finite amount of time and to achieve that it
> basically sends SIGKILL to all other threads.  If we introduce a trap
> in de-threading itself, we can easily end up with an unkillable
> task.

"unkillable" is not the problem, afaics. But the new trap is problematic,
we do not want the TASK_TRACED task holding the mutexes taken by the
callers of de_thread.

> 1. Don't allow more than one thread in the same group enter exec(2)
>    path at all.

This is already done, see do_execve()->prepare_bprm_creds().
cred_guard_mutex serializes exec. Btw, probably this allows us to do
more cleanups/simplifications in do_execve() paths.

> 2. Add another trap point right before de-threading commences.

See above.

Oleg.



* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30 11:40       ` Denys Vlasenko
@ 2011-05-30 14:27         ` Denys Vlasenko
  2011-05-30 16:42           ` Oleg Nesterov
  2011-05-30 18:11           ` Denys Vlasenko
  0 siblings, 2 replies; 17+ messages in thread
From: Denys Vlasenko @ 2011-05-30 14:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jan.kratochvil, oleg, linux-kernel, torvalds, akpm, indan

On Mon, May 30, 2011 at 1:40 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
>>> I think the API needs fixing. Tracee must never disappear like that
>>> on execve (or in any other case). They must always deliver a
>>> WIFEXITED or WIFSIGNALED notification, allowing tracer to know
>>> that they are gone. We probably also need to document how are these
>>> "I died on execve" notifications are ordered wrt PTRACE_EVENT_EXEC
>>> stop in execve-ing thread.
>>
>> A problem is that by the time de-threading is in progress, it's
>> already too deep and there's no way back and the exec'ing thread has
>> to wait for completion in uninterruptible sleeps - ie. it expects
>> de-threading to finish in finite amount of time and to achieve that it
>> basically sends SIGKILL to all other threads.
>
> Which is fine. Can we make the death from this "internal SIGKILL"
> visible to the tracer of killed tracees?

Ok, let's take a deeper look at API needs. What we need to report, and when?

We have three kinds of threads at execve:
1. execve'ing thread,
2. leader, two cases: (2a) leader is still alive, (2b) leader has exited by now.
3. other threads.

(3) is the most simple: API should report death of these threads.
There is no need to ensure these death notifications are reported
before execve syscall exit is reported. They can be consumed
by tracer later.

(1) execve'ing thread is obviously alive. current kernel already
reports its execve success. The only thing we need to add is
a way to retrieve its former pid, so that tracer can drop
former pid's data, and also to cater for the "two execve's" case.
PTRACE_EVENT_EXEC seems to be a good place to do it.
Say, using GETEVENTMSG?

(2) is the most problematic. If leader is still alive, should
we report its death? This makes sense since if we do,
and if we ensure its death is always reported before
PTRACE_EVENT_EXEC, then the rule is pretty simple:
at PTRACE_EVENT_EXEC, leader is always reported dead.

However, I don't see why we _must_ do it this way.
The life of tracer is not that much worse if at
PTRACE_EVENT_EXEC leader which is still alive
is simply "supplanted" by the execve'ed process.

We definitely must ensure, though, that if leader races with
execve'ing thread and enters exit(2), its death is never reported
*after* PTRACE_EVENT_EXEC - that'd confuse the tracer for sure!
Process which has exited but is still alive?! Not good!
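
For case (1), a sketch of how a tracer might use this if the proposal
were adopted. It assumes that PTRACE_GETEVENTMSG at a PTRACE_EVENT_EXEC
stop yields the execve'ing thread's former tid, which is the behavior
being proposed here and not something verified on the kernels discussed
in this thread. The function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* PTRACE_EVENT_EXEC is 4 in <linux/ptrace.h>; a local (hypothetical) name
 * sidesteps the sys/ptrace.h vs linux/ptrace.h header clash. */
#define EV_EXEC 4

/* True if a waitpid status describes a PTRACE_EVENT_EXEC stop. */
int is_exec_stop(int status)
{
	return WIFSTOPPED(status) && ((unsigned)status >> 16) == EV_EXEC;
}

/*
 * ASSUMPTION: at PTRACE_EVENT_EXEC, PTRACE_GETEVENTMSG reports the former
 * tid of the execve'ing thread. Returns that tid, or -1 if ptrace failed.
 */
pid_t exec_former_tid(pid_t tracee)
{
	unsigned long msg = 0;

	if (ptrace(PTRACE_GETEVENTMSG, tracee, NULL, &msg) == -1)
		return -1;
	return (pid_t)msg;
}
```

A tracer would call exec_former_tid(pid) right after waitpid returns a
status for which is_exec_stop() is true, and drop its bookkeeping for
the returned tid if it differs from pid.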

-- 
vda


* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30 14:27         ` Denys Vlasenko
@ 2011-05-30 16:42           ` Oleg Nesterov
  2011-05-30 23:43             ` Denys Vlasenko
  2011-05-30 18:11           ` Denys Vlasenko
  1 sibling, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2011-05-30 16:42 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, jan.kratochvil, linux-kernel, torvalds, akpm, indan

On 05/30, Denys Vlasenko wrote:
>
> On Mon, May 30, 2011 at 1:40 PM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
> >
> > Which is fine. Can we make the death from this "internal SIGKILL"
> > visible to the tracer of killed tracees?
>
> Ok, let's take a deeper look at API needs. What we need to report, and when?

OK. but I'm afraid I am a bit confused ;)

> We have three kinds of threads at execve:
> 1. execve'ing thread,
> 2. leader, two cases: (2a) leader is still alive, (2b) leader has exited by now.
> 3. other threads.
>
> (3) is the most simple: API should report death of these threads.
> There is no need to ensure these death notifications are reported
> before execve syscall exit is reported.

I guess you mean PTRACE_EVENT_EXIT? Probably yes,

> They can be consumed
> by tracer later.

by wait(WEXITED), OK.

> (1) execve'ing thread is obviously alive. current kernel already
> reports its execve success. The only thing we need to add is
> a way to retrieve its former pid, so that tracer can drop
> former pid's data, and also to cater for the "two execve's" case.

This is only needed if strace doesn't track the tracee's tgids, right?

> PTRACE_EVENT_EXEC seems to be a good place to do it.
> Say, using GETEVENTMSG?

Yes, Tejun suggested the same. Ignoring the pid_ns issues, this is trivial.
If the tracer runs in the parent namespace it is not: we can't simply
record the old tid. Let's ignore the problems with namespaces for now...

> (2) is the most problematic. If leader is still alive, should
> we report its death? This makes sense since if we do,
> and if we ensure its death is always reported before
> PTRACE_EVENT_EXEC,

Note that we simply can't report this after PTRACE_EVENT_EXEC because
its tid was already re-used by the new group leader.

And it is not trivial to report this before. Even if we forget about
the technical problems, please recall that wait() can't work in this
case. Forget about de_thread/exec, suppose that the group leader simply
exits before other threads. Yes, we are going to change this somehow.

But I am not sure it really makes sense to report the death of the old
leader. Why? We know for sure it is already dead at PTRACE_EVENT_EXEC
time, but at the same time it is better to pretend that it is not dead,
it is the execve'ing thread who should be considered dead in some sense.

IOW. Two threads, L is the leader with tid == tgid == 100, and T with
tid = 101. T does execve(). After that we have the process with the
same tgid and its new leader has tid == 100 as well. If we forget about
the actual implementation, it is T who silently disappears, not L.

OTOH, there is a problem: we should trace them both. Otherwise, if we
only trace L, even GETEVENTMSG can't help. And this means we can only
rely on PTRACE_EVENT_EXIT currently. Which needs fixes ;) We could add
another trap, but why would it be better?

In short: I do not think we can make what you want (assuming I understand
your suggestion correctly). Consider the simple example: we are tracing
the single thread and it is the group leader, another (untraced) thread
execs. I do not think we should change de_thread() so that the execing
thread should sleep waiting for waitpid(traced_leader_pid, WEXITED)
from the tracer before it reuses its pid. And in any case, even if we
do this, we should solve another problem with the dead group leader
first.

> We definitely must ensure, though, that if leader races with
> execve'ing thread and enters exit(2), its death is never reported
> *after* PTRACE_EVENT_EXEC

Yes... but this is not possible?

Oleg.



* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30 14:27         ` Denys Vlasenko
  2011-05-30 16:42           ` Oleg Nesterov
@ 2011-05-30 18:11           ` Denys Vlasenko
  1 sibling, 0 replies; 17+ messages in thread
From: Denys Vlasenko @ 2011-05-30 18:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jan.kratochvil, oleg, linux-kernel, torvalds, akpm, indan

[-- Attachment #1: Type: text/plain, Size: 3170 bytes --]

> Ok, let's take a deeper look at API needs. What we need to report, and when?
>
> We have three kinds of threads at execve:
> 1. execve'ing thread,
> 2. leader, two cases: (2a) leader is still alive, (2b) leader has exited by now.
> 3. other threads.
>
> (3) is the most simple: API should report death of these threads.
> There is no need to ensure these death notifications are reported
> before execve syscall exit is reported. They can be consumed
> by tracer later.
>
> (1) execve'ing thread is obviously alive. current kernel already
> reports its execve success. The only thing we need to add is
> a way to retrieve its former pid, so that tracer can drop
> former pid's data, and also to cater for the "two execve's" case.
> PTRACE_EVENT_EXEC seems to be a good place to do it.
> Say, using GETEVENTMSG?
>
> (2) is the most problematic. If leader is still alive, should
> we report its death? This makes sense since if we do,
> and if we ensure its death is always reported before
> PTRACE_EVENT_EXEC, then the rule is pretty simple:
> at PTRACE_EVENT_EXEC, leader is always reported dead.
>
> However, I don't see why we _must_ do it this way.
> The life of tracer is not that much worse if at
> PTRACE_EVENT_EXEC leader which is still alive
> is simply "supplanted" by the execve'ed process.
>
> We definitely must ensure, though, that if leader races with
> execve'ing thread and enters exit(2), its death is never reported
> *after* PTRACE_EVENT_EXEC - that'd confuse the tracer for sure!
> Process which has exited but is still alive?! Not good!


FWIW, here is the current behavior (2.6.38.6-27.fc15.i686.PAE).

Test program creates two threads and execve's from last thread.
PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXIT | PTRACE_O_TRACEEXEC
is requested by tracer.

I compiled the attached program with gcc -Wall threaded-execve.c,
ran it, and saw this:

6797: thread leader
6797: status:0003057f WIFSTOPPED sig:5 (TRAP) event:CLONE
6798: status:0000137f WIFSTOPPED sig:19 (STOP) event:(null)
6797: status:0003057f WIFSTOPPED sig:5 (TRAP) event:CLONE
6799: status:0000137f WIFSTOPPED sig:19 (STOP) event:(null)
6798: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
6797: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
6798: status:00000000 WIFEXITED exitcode:0
6797: status:0004057f WIFSTOPPED sig:5 (TRAP) event:EXEC
6797: status:0003057f WIFSTOPPED sig:5 (TRAP) event:CLONE
6800: status:0000137f WIFSTOPPED sig:19 (STOP) event:(null)
6797: status:0003057f WIFSTOPPED sig:5 (TRAP) event:CLONE
6801: status:0000137f WIFSTOPPED sig:19 (STOP) event:(null)
6800: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
6797: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
6800: status:00000000 WIFEXITED exitcode:0
6797: status:0004057f WIFSTOPPED sig:5 (TRAP) event:EXEC
...
...
...

In short, it doesn't look too bad:  we do get EXIT events for both
destroyed threads, and even get WIFEXITED for the non-leader.
(IOW: maybe PTRACE_O_TRACEEXIT is not even needed!)
EXEC event is reported last (also good!)
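
For readers decoding the status words in the log by hand: the low byte
0x7f marks a stop, the next byte is the stop signal, and bits 16 and up
carry the PTRACE_EVENT_* number. A small sketch (the EV_ names and both
helper functions are illustrative; the event numbers are copied from
<linux/ptrace.h>):

```c
#include <assert.h>
#include <stdio.h>
#include <signal.h>
#include <sys/wait.h>

/* PTRACE_EVENT_* numbers, repeated under local names so that the sketch
 * needs no ptrace headers. */
enum { EV_NONE = 0, EV_CLONE = 3, EV_EXEC = 4, EV_EXIT = 6 };

/* Event number stored in bits 16 and up of a ptrace-stop status. */
int status_event(int status)
{
	return (unsigned)status >> 16;
}

/* Print a short description of one waitpid status. */
void describe_status(int pid, int status)
{
	if (WIFSTOPPED(status))
		printf("%d: stopped, sig %d, event %d\n",
			pid, WSTOPSIG(status), status_event(status));
	else if (WIFEXITED(status))
		printf("%d: exited, code %d\n", pid, WEXITSTATUS(status));
	else if (WIFSIGNALED(status))
		printf("%d: killed by sig %d\n", pid, WTERMSIG(status));
	else
		printf("%d: unknown status %08x\n", pid, (unsigned)status);
}
```

With this, 0003057f reads as a SIGTRAP stop with event 3 (CLONE),
0004057f as event 4 (EXEC), and 0006057f as event 6 (EXIT).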

Oleg, does it look like it works as intended, or am I just lucky?

I guess I need to test larger number of threads, and throw in some races...

-- 
vda

[-- Attachment #2: threaded-execve.c --]
[-- Type: text/x-csrc, Size: 7182 bytes --]

/* ...DESCRIPTION...

   This software is provided 'as-is', without any express or implied
   warranty.  In no event will the authors be held liable for any damages
   arising from the use of this software.

   Permission is granted to anyone to use this software for any purpose,
   including commercial applications, and to alter it and redistribute it
   freely.  */

#define _GNU_SOURCE 1
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <dirent.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/syscall.h>
/* #include <pthread.h> */
/* Dance around ptrace.h + user.h incompatibility */
#ifdef __ia64__
# define ia64_fpreg ia64_fpreg_DISABLE
# define pt_all_user_regs pt_all_user_regs_DISABLE
#endif
#include <sys/ptrace.h>
#include <linux/ptrace.h>
#ifdef __ia64__
# undef ia64_fpreg
# undef pt_all_user_regs
#endif
#include <sys/user.h>
#if defined __i386__ || defined __x86_64__
# include <sys/debugreg.h>
#endif
/* Define clone2 for all arches */
#ifdef __ia64__
extern int __clone2(int (*fn) (void *), void *child_stack_base,
                     size_t stack_size, int flags, void *arg, ...);
#define clone2 __clone2
#else
#define clone2(func, stack_base, size, flags, arg...) \
        clone(func, (stack_base) + (size), flags, arg)
#endif


static int verbose;

#define VERBOSE(...) do { \
	if (verbose) { \
		printf(__VA_ARGS__); fflush(stdout); \
	} \
} while (0)

static pid_t child;
/*static pid_t grandchild;*/

static void
sigkill(pid_t *pp)
{
	pid_t pid = *pp;
	*pp = 0;
	if (pid > 0)
		kill(pid, SIGKILL);
}

static void
cleanup(void)
{
	/*sigkill(&grandchild);*/
	sigkill(&child);
	while (waitpid(-1, NULL, __WALL) > 0)
		continue;
}

static void
handler_fail(int signo)
{
	VERBOSE("alarm timed out\n");
	sigset_t set;
	signal(SIGABRT, SIG_DFL);
	signal(SIGALRM, SIG_DFL);
	/* SIGALRM may be blocked in sighandler, need to unblock */
	sigfillset(&set);
	sigprocmask(SIG_UNBLOCK, &set, NULL);
	/* Due to kernel bugs, waitpid may block. Need to have a timeout */
	alarm(1);
	cleanup();
	assert(0);
}

static const char* sig_name(unsigned sig)
{
	static const char *const sigs[] = {
		[SIGSTOP] = "STOP", [SIGTRAP] = "TRAP", [SIGKILL] = "KILL",
		[SIGTERM] = "TERM", [SIGINT ] = "INT ", [0      ] = "0   ",
		[SIGTRAP|0x80] = "TRAP|80",
	};
	static const unsigned num_sigs = sizeof(sigs) / sizeof(sigs[0]);
	if (sig < num_sigs)
		return sigs[sig];
	return "SIG????";
}

static const char* event_name(int status)
{
	static const char *const events[] = {
		[PTRACE_EVENT_FORK      ] = "FORK",
		[PTRACE_EVENT_VFORK     ] = "VFORK",
		[PTRACE_EVENT_CLONE     ] = "CLONE",
		[PTRACE_EVENT_EXEC      ] = "EXEC",
		[PTRACE_EVENT_VFORK_DONE] = "VFORK_DONE",
		[PTRACE_EVENT_EXIT      ] = "EXIT",
	};
	static const unsigned num_events = sizeof(events) / sizeof(events[0]);
	status = (unsigned)status >> 16;
	if (status < num_events)
		return events[status];
	return "EV???";
}

/****************** Standard scaffolding ends here ****************/

/*
 * Extended commentary of the entire test.
 *
 * What kernels / patches exhibit it? When it was fixed?
 * Is it CPU vendor/model dependent? SMP dependent?
 * Is it deterministic?
 * How easy/hard is it to reproduce?
 * (always? a dozen loops? a second? minute? etc)
 */

/* If the test is not deterministic:
 * Amount of seconds needed to almost 100% catch it */
//#define DEFAULT_TESTTIME 5
/* or (if reproducible in a few loops only) */
//#define DEFAULT_LOOPS 100

static int
thread1(void *unused)
{
	for(;;) pause();
	return 0;
}

static int
thread2(void *unused)
{
	execl("/proc/self/exe", "exe", NULL);
	for(;;) pause();
	return 0;
}

static int
thread_leader(void *unused)
{
	/* malloc gives sufficiently aligned buffer.
	 * long buf[] does not! (on ia64).
	 */
	/* As seen in pthread_create(): */
	clone2(thread1, malloc(16 * 1024), 16 * 1024, 0
		| CLONE_VM
		| CLONE_FS
		| CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM
//		| CLONE_PTRACE
		| 0        /* no signal to send on death */
		, NULL);
	usleep(50*1000);
	clone2(thread2, malloc(16 * 1024), 16 * 1024, 0
		| CLONE_VM
		| CLONE_FS
		| CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM
//		| CLONE_PTRACE
		| 0        /* no signal to send on death */
		, NULL);
	for(;;) pause();
	return 0;
}

/* If nothing strange happens, just returns.
 * Notable events (which are not bugs) print some sort of marker
 * if verbose is on, but still continue and return normally.
 * Known bugs also print a message if verbose, but they exit(1).
 * New bugs are likely to trip asserts or cause hang/kernel crash :)
 */
static void
reproduce(void)
{
	int status;
	pid_t pid;

	VERBOSE(".");
	alarm(1);

	/* Typical scenario starts like this.  */
	child = fork();
	assert(child != -1);
	if (child == 0) {
		/* child */
		errno = 0;
		ptrace(PTRACE_TRACEME, 0, (void *) 0, (void *) 0);
		assert_perror(errno);
		raise(SIGSTOP);
		assert_perror(errno);

		printf("%d: thread leader\n", getpid());
		thread_leader(NULL);
	}

	/* We are parent tracer */
	assert(child > 0);
	errno = 0;

	/* Child has stopped itself, checking */
	pid = waitpid(child, &status, 0);
	assert(pid == child);
	assert(WIFSTOPPED (status));
	assert(WSTOPSIG (status) == SIGSTOP);

	ptrace(PTRACE_SETOPTIONS, child, NULL,
		PTRACE_O_TRACESYSGOOD
		| PTRACE_O_TRACECLONE
		| PTRACE_O_TRACEEXIT
		| PTRACE_O_TRACEEXEC);
	assert_perror(errno);

	ptrace(PTRACE_CONT, child, NULL, (void *) 0);
	assert_perror(errno);

	/* Let's just look on the resulting sequence of events */
	for (;;) {
		pid_t pid = waitpid(-1, &status, __WALL);
		if (pid <= 0) {
			printf("waitpid returned %d\n", pid);
			return;
		}
		if (WIFSTOPPED(status)) {
			printf("%d: status:%08x WIFSTOPPED sig:%d (%s) event:%s\n",
				pid, status,
				WSTOPSIG(status), sig_name(WSTOPSIG(status)),
				event_name(status)
			);
			ptrace(PTRACE_CONT, pid, NULL, (void *)0);
			assert_perror(errno);
		}
		else if (WIFEXITED(status))
			printf("%d: status:%08x WIFEXITED exitcode:%d\n",
				pid, status, WEXITSTATUS(status));
		else if (WIFSIGNALED(status))
			printf("%d: status:%08x WIFSIGNALED sig:%d (%s)\n",
				pid, status, WTERMSIG(status), sig_name(WTERMSIG(status)));
		else
			printf("%d: status:%08x - ???\n",
				pid, status);
	}

	cleanup();
}

int
main(int argc, char **argv)
{
	setbuf(stdout, NULL);

	if (strcmp(argv[0], "exe") == 0)
		thread_leader(NULL);

#if defined DEFAULT_TESTTIME || defined DEFAULT_LOOPS
	int i;
	char *env_testtime = getenv("TESTTIME");  /* misnomer */
	int testtime = (env_testtime ? atoi(env_testtime) : 1);
#endif

	atexit(cleanup);
	signal(SIGINT, handler_fail);
	signal(SIGABRT, handler_fail);
	signal(SIGALRM, handler_fail);
	verbose = (argc - 1);

#if defined DEFAULT_TESTTIME
	testtime *= DEFAULT_TESTTIME;
	for(i = 0; i < testtime; i++) {
		time_t t = time(NULL);
		while (t == time(NULL))
			reproduce();
	}
	VERBOSE("\n");
#elif defined DEFAULT_LOOPS
	testtime *= DEFAULT_LOOPS;
	for(i = 0; i < testtime; i++)
		reproduce();
	VERBOSE("\n");
#else
	reproduce();
#endif

	return 0;
}


* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30 16:42           ` Oleg Nesterov
@ 2011-05-30 23:43             ` Denys Vlasenko
  2011-05-31 13:51               ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Denys Vlasenko @ 2011-05-30 23:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, jan.kratochvil, linux-kernel, torvalds, akpm, indan

On Monday 30 May 2011 18:42, Oleg Nesterov wrote:
> On 05/30, Denys Vlasenko wrote:
> >
> > On Mon, May 30, 2011 at 1:40 PM, Denys Vlasenko
> > <vda.linux@googlemail.com> wrote:
> > >
> > > Which is fine. Can we make the death from this "internal SIGKILL"
> > > visible to the tracer of killed tracees?
> >
> > Ok, let's take a deeper look at API needs. What we need to report, and when?
> 
> OK. but I'm afraid I am a bit confused ;)

I am trying to write up the ptrace API (in this particular thread, wrt execve).

Basically, I try to sync up your / Jan's / Tejun's knowledge about the following:

* how current kernels are supposed to work, both:
  - what we promise, and
  - what we DON'T promise
    (such as "don't expect ptrace ops to always succeed,
    you may get ESRCH any time", or "wait(WNOHANG) may return spurious 0"...)
* what actually does work (modulo unknown bugs),
* what is known to be "slightly" broken, but likely to be fixed,
  and finally,
* what is broken so hopelessly that some API changes/additions will be needed.


While working on this document, and thanks to your request to run an actual test
with multi-threaded execve, we just discovered that our idea of how the API
works now doesn't match reality: other threads do not die silently.
They do emit death notifications. Only execve'ing thread itself
"disappears".

Let's decide how we want ptrace API to work in this area.
The behavior I observed with the test program:

6797: thread 0 (leader): sleeps in pause()
6798: thread 1: sleeps in pause()
6799: thread 2: execve("/proc/self/exe")

Tracer sees the following:

6798: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
6797: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
6798: status:00000000 WIFEXITED exitcode:0
6797: status:0004057f WIFSTOPPED sig:5 (TRAP) event:EXEC

(I tested it with 10 threads and the pattern seems to be the same)

Every thread including leader, but excluding execve'ing one,
reports EVENT_EXIT.

Then every thread, excluding leader and excluding execve'ing one,
reports WIFEXITED.
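
The status words above can be picked apart with the standard wait macros
plus the ptrace convention that the event number sits in the bits above
bit 16. A minimal sketch, assuming a Linux tracer. The helper names
(ptrace_event_of, is_plain_exit, decode_examples) and the EV_* constants
are made up for illustration; the values mirror PTRACE_EVENT_EXEC (4) and
PTRACE_EVENT_EXIT (6).

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Local stand-ins for PTRACE_EVENT_EXEC / PTRACE_EVENT_EXIT. */
enum { EV_EXEC = 4, EV_EXIT = 6 };

/*
 * Return the ptrace event number carried by a waitpid() status,
 * or 0 if the status is not a SIGTRAP stop with an event attached.
 */
static int
ptrace_event_of(int status)
{
	if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
		return 0;
	return (int)((unsigned int)status >> 16);
}

/* True for the "status:00000000 WIFEXITED exitcode:0" kind of report. */
static int
is_plain_exit(int status)
{
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* The status words from the trace above, decoded. */
static void
decode_examples(void)
{
	assert(ptrace_event_of(0x0006057f) == EV_EXIT);
	assert(ptrace_event_of(0x0004057f) == EV_EXEC);
	assert(ptrace_event_of(0x00000000) == 0);
	assert(is_plain_exit(0x00000000));
}
```

Note that this only classifies a single status word; it says nothing about
which reports are guaranteed to arrive, which is the open question here.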

(question to you, Oleg:)
??? do we guarantee that EVENT_EXIT happens? Do we guarantee
that WIFEXITED happens? (If not, do you think we can fix it,
or we are better to not include such a guarantee in the API?)
Do we guarantee the order between them?

Note: WIFEXITED of thread 1 can happen before EVENT_EXIT of thread 0.
IOW: there is no ordering *between* threads for these ptrace-stops.
(I saw reordering with more threads)

Then we get EVENT_EXEC with pid of the leader.
execve'ing thread's pid is no longer usable by tracer after this.

??? do we guarantee that this happens after all EVENT_EXITs and WIFEXITEDs?


> > (1) execve'ing thread is obviously alive. current kernel already
> > reports its execve success. The only thing we need to add is
> > a way to retrieve its former pid, so that tracer can drop
> > former pid's data, and also to cater for the "two execve's" case.
> 
> This is only needed if strace doesn't track the tracee's tgids, right?
> 
> > PTRACE_EVENT_EXEC seems to be a good place to do it.
> > Say, using GETEVENTMSG?
> 
> Yes, Tejun suggested the same. Ignoring the pid_ns issues, this is trivial.
> If the tracer runs in the parent namespace it is not, we can't simply
> record the old tid. Lets ignore the problems with namespaces for now...

Yes, it would make the tracer's life much easier if we told it
the pid of the tracee which exec'ed, and that this pid
is therefore gone.


> OTOH, there is a problem: we should trace them both. Otherwise, if we
> only trace L, even GETEVENTMSG can't help.

In practice, people do this more rarely than tracing every thread.
But anyway, I have an idea...


> And this means we can only 
> rely on PTRACE_EVENT_EXIT currently. Which needs fixes ;)

What is broken?


> In short: I do not think we can make what you want (assuming I understand
> your suggestion correctly). Consider the simple example: we are tracing
> the single thread and it is the group leader, another (untraced) thread
> execs.

I do not know what would be the right behavior in this case.
It depends whether we consider "tracedness" to be attached to a pid
or to a thread of execution.

I think the better (more general) question is "what if both threads
are traced by _different_ tracers?".

Possible answers:


If we think "tracedness" is attached to pid:

tracer 0 (traces leader) sees:
status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
status:0004057f WIFSTOPPED sig:5 (TRAP) event:EXEC
<continues tracing>

tracer 1 (traces execve'ing thread) sees:
<nothing, tracee is gone>

What is bad about it:
* tracer 1 has no idea whatsoever that its tracee is gone.


If we think "tracedness" is attached to thread (task struct):

tracer 0 (traces leader) sees:
status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
<tracee is gone>

tracer 1 (traces execve'ing thread) sees:
status:0004057f WIFSTOPPED sig:5 (TRAP) event:EXEC, and pid has changed!

What is bad about it:
* tracer 0 expects yet another notification, "status:00000000 WIFEXITED exitcode:0"
  or similar, but it will never come.
* tracer 1 can be rather confused by getting EVENT_EXEC from a tracee it knows
  nothing about (since the pid has changed!). If it has more than one tracee,
  it can't guess which one did that. (Yes, it can resort to ugly racy hacks...)


I think the second case is "less broken". What API changes can make it better
for userspace?

First, returning old pid via GETEVENTMSG helps with second
badness - tracer 1 can fetch it, and understand which of his tracees
changed pid just now.
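
A minimal sketch of the tracer-side bookkeeping this would enable, assuming
the kernel reports the execve'ing thread's former pid at EVENT_EXEC as
proposed above. The table layout and the names (struct tracee, find_tracee,
add_tracee, rename_on_exec) are hypothetical; only PTRACE_GETEVENTMSG is an
existing request, and it is shown in a comment rather than called.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_TRACEES 64

struct tracee {
	pid_t pid;	/* 0 means "slot is free" */
	int   flags;	/* placeholder for per-thread tracer state */
};

static struct tracee tracees[MAX_TRACEES];

static struct tracee *
find_tracee(pid_t pid)
{
	size_t i;

	assert(pid > 0);
	for (i = 0; i < MAX_TRACEES; i++)
		if (tracees[i].pid == pid)
			return &tracees[i];
	return NULL;
}

static struct tracee *
add_tracee(pid_t pid)
{
	size_t i;

	assert(pid > 0);
	for (i = 0; i < MAX_TRACEES; i++) {
		if (tracees[i].pid == 0) {
			tracees[i].pid = pid;
			tracees[i].flags = 0;
			return &tracees[i];
		}
	}
	return NULL;	/* table full */
}

/*
 * Handle EVENT_EXEC reported for new_pid. old_pid is the pid the
 * execve'ing thread had before the exec; under the proposal the tracer
 * would fetch it with something like
 *	unsigned long msg;
 *	ptrace(PTRACE_GETEVENTMSG, new_pid, 0, &msg);
 * Returns the entry that now describes new_pid.
 */
static struct tracee *
rename_on_exec(pid_t new_pid, pid_t old_pid)
{
	struct tracee *execer = find_tracee(old_pid);
	struct tracee *stale;

	if (old_pid == new_pid)
		return execer;		/* leader exec'ed, pid unchanged */
	stale = find_tracee(new_pid);
	if (stale)
		stale->pid = 0;		/* old leader: drop its data */
	if (!execer)
		return add_tracee(new_pid);	/* old pid unknown: start fresh */
	execer->pid = new_pid;		/* keep state, follow the pid change */
	return execer;
}
```

With the pids from the test run, rename_on_exec(6797, 6799) would free the
old leader's slot and move thread 2's state over to pid 6797.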

And second, if we'd return "status:00000000 WIFEXITED exitcode:0" thing
on execve _for leader too_, then tracer 0 will be happy (it will see consistent
sequence of events).
If it's hard to do, then alternatively, we can add this information
to EVENT_EXIT somehow. Normally, GETEVENTMSG returns exit status.
Can we hijack a bit there to say "don't expect WIFEXITED on me"?


Final touch may be to make "I exited because some other thread exec'ed"
notification different from "I exited because of _exit(0)".
It would let strace say what _actually_ happened, which is a good thing.
Silly ideas department proposes returning WIFSIGNALED, WTERMSIG = 0 ;)

-- 
vda


* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-30 23:43             ` Denys Vlasenko
@ 2011-05-31 13:51               ` Oleg Nesterov
  2011-06-02 10:57                 ` Pedro Alves
  2011-06-02 15:12                 ` Denys Vlasenko
  0 siblings, 2 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-05-31 13:51 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, jan.kratochvil, linux-kernel, torvalds, akpm, indan

On 05/31, Denys Vlasenko wrote:
>
> Let's decide how we want ptrace API to work in this area.
> The behavior I observed with the test program:
>
> 6797: thread 0 (leader): sleeps in pause()
> 6798: thread 1: sleeps in pause()
> 6799: thread 2: execve("/proc/self/exe")
>
> Tracer sees the following:
>
> 6798: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
> 6797: status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
> 6798: status:00000000 WIFEXITED exitcode:0
> 6797: status:0004057f WIFSTOPPED sig:5 (TRAP) event:EXEC
>
> (I tested it with 10 threads and the pattern seems to be the same)
>
> Every thread including leader, but excluding execve'ing one,
> reports EVENT_EXIT.

Yes, this is expected. But let me repeat, there are problems.

The main problem is: it is not clear whether we really want EVENT_EXIT
in this case. I think we do, Roland thought we do not. OTOH I never
really understood the purpose of EVENT_EXIT, but this doesn't matter.

If we decide we do want this notification (in this case), then we
need fixes. EVENT_EXIT is not reliable. Say, the thread can exit
before it dequeues SIGKILL and in this case it doesn't stop.
Also. If we guarantee EVENT_EXIT in this case, then probably the
implicit SIGKILL should not wakeup the TASK_TRACED tracee (except
the new PTRACE_LISTEN case).

In short: currently I do not know what should be documented. I do
not know the original intent, I can only see what the code actually
does. In any case, I strongly believe the code should be changed,
but first we should decide what we want. But not right now,
please ;) There are other connected problems...

> Then every thread, excluding leader and excluding execve'ing one,
> reports WIFEXITED.

Yes.

> ??? do we guarantee that EVENT_EXIT happens?

See above.

> Do we guarantee
> that WIFEXITED happens?

Yes. (excluding leader)

> Note: WIFEXITED of thread 1 can happen before EVENT_EXIT of thread 0.
> IOW: there is no ordering *between* threads for these ptrace-stops.

Sure, why not?

Btw... you are talking about the ordering as it is seen by the tracer, it
can differ from the "real" ordering. But this doesn't matter.

> Then we get EVENT_EXEC with pid of the leader.
> execve'ing thread's pid is no longer usable by tracer after this.
>
> ??? do we guarantee that this happens after all EVENT_EXITs and WIFEXITEDs?

Yes. At this time all other threads do not exist.
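
Since every other thread is gone by the time EVENT_EXEC is reported, a
tracer that records each tracee's tgid could discard the sibling records
at that point. A minimal sketch under that assumption; struct thread_rec,
drop_siblings_on_exec and the demo table (including pid 7000) are
hypothetical and only illustrate the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

struct thread_rec {
	pid_t tid;	/* 0 means "slot is free" */
	pid_t tgid;	/* thread group this tid belongs to */
};

/*
 * Called when EVENT_EXEC is reported for exec_pid, which is the pid of
 * the thread group leader and therefore equals the tgid.  Every other
 * record of that thread group is freed; one record stays under the
 * leader's pid (its per-thread state should be reset by the caller).
 * Returns the number of records dropped.
 */
static size_t
drop_siblings_on_exec(struct thread_rec *tab, size_t n, pid_t exec_pid)
{
	size_t i, dropped = 0;

	assert(exec_pid > 0);
	for (i = 0; i < n; i++) {
		if (tab[i].tid == 0 || tab[i].tgid != exec_pid)
			continue;	/* free slot or unrelated process */
		if (tab[i].tid == exec_pid)
			continue;	/* now describes the exec'ed process */
		tab[i].tid = 0;
		dropped++;
	}
	return dropped;
}

/* The three threads from the test run, plus one unrelated process. */
static struct thread_rec demo[] = {
	{ 6797, 6797 },
	{ 6798, 6797 },
	{ 6799, 6797 },
	{ 7000, 7000 },
};
```

Calling drop_siblings_on_exec(demo, 4, 6797) frees the records for 6798 and
6799 and leaves 6797 and 7000 in place.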

> > In short: I do not think we can make what you want (assuming I understand
> > your suggestion correctly). Consider the simple example: we are tracing
> > the single thread and it is the group leader, another (untraced) thread
> > execs.
>
> I do not know what would be the right behavior in this case.
> It depends whether we consider "tracedness" to be attached to a pid
> or to a thread of execution.
>
> I think the better (more general) question is "what if both threads
> are traced by _different_ tracers?".

I don't really understand why do you think this is more general...
Nevermind.

> If we think "tracedness" is attached to pid:

No it is not. The tracing is per-thread (task_struct). But the API is
per-pid. The tracee changes its pid. That is all ;)

> First, returning old pid via GETEVENTMSG helps with second
> badness - tracer 1 can fetch it, and understand which of his tracees
> changed pid just now.

OK, we already discussed this. This looks reasonable.

> And second, if we'd return "status:00000000 WIFEXITED exitcode:0" thing
> on execve _for leader too_, then tracer 0 will be happy (it will see consistent
> sequence of events).

Once again, we can only do this before the execing thread changes its
pid. This means that this thread should look at the leader, and if it
is traced it should wait until the tracer does do_wait(). I do not think
this is good.

And, once again, even if we do this, we need to change the current
behaviour with do_wait(ptraced_exited_leader_thread), see another
discussion.

> If it's hard to do, then alternatively, we can add this information
> to EVENT_EXIT somehow. Normally, GETEVENTMSG returns exit status.
> Can we hijack a bit there to say "don't expect WIFEXITED on me"?

Given that this pid will be reused, what does this "me" actually mean?

> Final touch may be to make "I exited because some other thread exec'ed"
> notification different from "I exited because of _exit(0)".
> It would make strace to say what _actually_ happened, which is a good thing.
> Silly ideas department proposes returning WIFSIGNALED, WTERMSIG = 0 ;)

Perhaps, I dunno. Personally I'd prefer to add the new PTRACE_ request
which provides some info about the tracee and its thread group. It can
report tgid, it can report exec-in-progress or group-exit-in-progress.
I don't really know. But in any case I don't think we can change the
current exit_code/etc.

Btw, "I exited because of _exit(0)" is not exactly right. You can use
GETEVENTMSG, it should report SIGKILL.
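
A minimal sketch of how a tracer might tell the two cases apart, assuming
the value fetched with PTRACE_GETEVENTMSG at EVENT_EXIT uses the same
encoding as a waitpid() status. The enum and the function names
(classify_exit_msg, classify_examples) are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

enum exit_cause {
	CAUSE_EXIT_CALL,	/* the thread called _exit()/exit_group() */
	CAUSE_SIGKILL,		/* killed by SIGKILL, e.g. the implicit one */
	CAUSE_OTHER_SIGNAL,	/* killed by some other signal */
	CAUSE_UNKNOWN
};

/* Classify the event message read at an EVENT_EXIT stop. */
static enum exit_cause
classify_exit_msg(unsigned long msg)
{
	int status = (int)msg;

	if (WIFEXITED(status))
		return CAUSE_EXIT_CALL;
	if (WIFSIGNALED(status))
		return WTERMSIG(status) == SIGKILL ?
			CAUSE_SIGKILL : CAUSE_OTHER_SIGNAL;
	return CAUSE_UNKNOWN;
}

/* A few sample messages, assuming the wait-status encoding. */
static void
classify_examples(void)
{
	assert(classify_exit_msg(SIGKILL) == CAUSE_SIGKILL);
	assert(classify_exit_msg(0) == CAUSE_EXIT_CALL);
	assert(classify_exit_msg(3 << 8) == CAUSE_EXIT_CALL);
	assert(classify_exit_msg(SIGSEGV) == CAUSE_OTHER_SIGNAL);
}
```

This still cannot distinguish the exec-induced SIGKILL from an ordinary
kill(pid, SIGKILL), which is part of why a dedicated request is suggested
above.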

Oleg.



* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-31 13:51               ` Oleg Nesterov
@ 2011-06-02 10:57                 ` Pedro Alves
  2011-06-02 14:59                   ` Denys Vlasenko
  2011-06-02 15:12                 ` Denys Vlasenko
  1 sibling, 1 reply; 17+ messages in thread
From: Pedro Alves @ 2011-06-02 10:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Tejun Heo, jan.kratochvil, linux-kernel,
	torvalds, akpm, indan

On Tuesday 31 May 2011 14:51:16, Oleg Nesterov wrote:

> The main problem is: it is not clear do we really want EVENT_EXIT
> in this case. I think we do, Roland thought we do not. OTOH I never
> really understood the purpose of EVENT_EXIT, but this doesn't matter.
> 
> If we decide we do want this notification (in this case), then we
> need fixes. EVENT_EXIT is not reliable. Say, the thread can exit
> before it dequeues SIGKILL and in this case it doesn't stop.
> Also. If we guarantee EVENT_EXIT in this case, then probably the
> implicit SIGKILL should not wakeup the TASK_TRACED tracee (except
> the new PTRACE_LISTEN case).
> 
> In short: currently I do not know what should be documented. I do
> not know the original intent, I can only see what the code actually
> does. 

Daniel Jacobowitz said when he submitted it:

<http://lkml.indiana.edu/hypermail/linux/kernel/0302.0/1051.html>

"PTRACE_EVENT_EXIT, which triggers in do_exit().  This is useful to quickly
 find out where a program is making an exit syscall from, etc. - it
 triggers before the mm is released, so we can still get backtraces et
 cetera."

That said, GDB was never made to use it:

  /* Do not enable PTRACE_O_TRACEEXIT until GDB is more prepared to support
     read-only process state.  */

-- 
Pedro Alves


* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-06-02 10:57                 ` Pedro Alves
@ 2011-06-02 14:59                   ` Denys Vlasenko
  0 siblings, 0 replies; 17+ messages in thread
From: Denys Vlasenko @ 2011-06-02 14:59 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Oleg Nesterov, Tejun Heo, jan.kratochvil, linux-kernel, torvalds,
	akpm, indan

On Thu, Jun 2, 2011 at 12:57 PM, Pedro Alves <pedro@codesourcery.com> wrote:
> On Tuesday 31 May 2011 14:51:16, Oleg Nesterov wrote:
>
>> The main problem is: it is not clear do we really want EVENT_EXIT
>> in this case. I think we do, Roland thought we do not. OTOH I never
>> really understood the purpose of EVENT_EXIT, but this doesn't matter.
>>
>> If we decide we do want this notification (in this case), then we
>> need fixes. EVENT_EXIT is not reliable. Say, the thread can exit
>> before it dequeues SIGKILL and in this case it doesn't stop.
>> Also. If we guarantee EVENT_EXIT in this case, then probably the
>> implicit SIGKILL should not wakeup the TASK_TRACED tracee (except
>> the new PTRACE_LISTEN case).
>>
>> In short: currently I do not know what should be documented. I do
>> not know the original intent, I can only see what the code actually
>> does.
>
> Daniel Jacobowitz said when he submitted it:
>
> <http://lkml.indiana.edu/hypermail/linux/kernel/0302.0/1051.html>
>
> "PTRACE_EVENT_EXIT, which triggers in do_exit().  This is useful to quickly
>  find out where a program is making an exit syscall from, etc. - it
>  triggers before the mm is released, so we can still get backtraces et
>  cetera."

We have circa 340 syscalls. What makes exit so special that it has to have
a separate ptrace stop specially for it? People may legitimately
want to know where write() syscall happens, should we add
PTRACE_EVENT_WRITE? Rinse, repeat...

-- 
vda


* Re: execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3)
  2011-05-31 13:51               ` Oleg Nesterov
  2011-06-02 10:57                 ` Pedro Alves
@ 2011-06-02 15:12                 ` Denys Vlasenko
  1 sibling, 0 replies; 17+ messages in thread
From: Denys Vlasenko @ 2011-06-02 15:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, jan.kratochvil, linux-kernel, torvalds, akpm, indan

On Tue, May 31, 2011 at 3:51 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> I think the better (more general) question is "what if both threads
>> are traced by _different_ tracers?".
>
> I don't really understand why do you think this is more general...

Because it reveals more problems, and thus allows us to think about
a solution for all of them. Here it is again:


On Tue, May 31, 2011 at 1:43 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> If we think "tracedness" is attached to thread (task struct):
>
> tracer 0 (traces leader) sees:
> status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
> <tracee is gone>
>
> tracer 1 (traces execve'ing thread) sees:
> status:0004057f WIFSTOPPED sig:5 (TRAP) event:EXEC, and pid has changed!
>
> What is bad about it:
> * tracer 0 expects yet another notification, "status:00000000 WIFEXITED exitcode:0"
>  or similar, but it will never come.
> * tracer 1 can be rather confused by getting EVENT_EXEC from a tracee it knows
>  nothing about (since the pid has changed!). If it has more than one tracee,
>  it can't guess which one did that. (Yes, it can resort to ugly racy hacks...)
>
> I think the second case is "less broken". What API changes can make it better
> for userspace?
>
> First, returning old pid via GETEVENTMSG helps with second
> badness - tracer 1 can fetch it, and understand which of his tracees
> changed pid just now.
>
> And second, if we'd return "status:00000000 WIFEXITED exitcode:0" thing
> on execve _for leader too_, then tracer 0 will be happy (it will see consistent
> sequence of events).
> If it's hard to do, then alternatively, we can add this information
> to EVENT_EXIT somehow. Normally, GETEVENTMSG returns exit status.
> Can we hijack a bit there to say "don't expect WIFEXITED on me"?


>> And second, if we'd return "status:00000000 WIFEXITED exitcode:0" thing
>> on execve _for leader too_, then tracer 0 will be happy (it will see consistent
>> sequence of events).
>
> Once again, we can only do this before the execing thread changes its
> pid. This means that this thread should look at the leader, and if it
> is traced it should wait until the tracer does do_wait(). I do not think
> this is good.

I understand, but so far I don't see any better solution. Current behavior
is simply not acceptable. Here it is again:

> tracer 0 (traces leader) sees:
> status:0006057f WIFSTOPPED sig:5 (TRAP) event:EXIT
> <tracee is gone>

It's a total "WTF?" situation. As far as tracer is concerned, tracee just
vanished into thin air: no WIFEXITED seen, and since this tracer doesn't
see execve because it doesn't trace execve'ing thread, it has no way
to understand what the hell happened. Tracer will sit in waitpid forever.
If it had not requested EVENT_EXIT to be shown, it wouldn't even get the
EVENT_EXIT shown above which tells it that tracee is _probably_ gone.


> And, once again, even if we do this, we need to change the current
> behaviour with do_wait(ptraced_exited_leader_thread), see another
> discussion.

I wrote a test program and the behavior is worse than I thought.
I have a case where exited leader causes waitpid to hang and not report
ptrace events from other tracees, without any execve!
I'll send the program in another thread.
-- 
vda


Thread overview: 17+ messages
2011-05-20 19:23 Ptrace documentation, draft #3 Denys Vlasenko
2011-05-25 14:32 ` Tejun Heo
2011-05-30  3:08   ` Denys Vlasenko
2011-05-30  3:28   ` execve-under-ptrace API bug (was Re: Ptrace documentation, draft #3) Denys Vlasenko
2011-05-30  8:49     ` Tejun Heo
2011-05-30 11:40       ` Denys Vlasenko
2011-05-30 14:27         ` Denys Vlasenko
2011-05-30 16:42           ` Oleg Nesterov
2011-05-30 23:43             ` Denys Vlasenko
2011-05-31 13:51               ` Oleg Nesterov
2011-06-02 10:57                 ` Pedro Alves
2011-06-02 14:59                   ` Denys Vlasenko
2011-06-02 15:12                 ` Denys Vlasenko
2011-05-30 18:11           ` Denys Vlasenko
2011-05-30 13:56       ` Oleg Nesterov
2011-05-30 13:49     ` Oleg Nesterov
2011-05-30 13:35 ` Ptrace documentation, draft #3 Oleg Nesterov
