[PATCH] man ptrace: add extended description of various ptrace quirks

* [PATCH] man ptrace: add extended description of various ptrace quirks
@ 2011-07-21 11:09 Denys Vlasenko
  2011-07-21 16:51 ` Oleg Nesterov
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Denys Vlasenko @ 2011-07-21 11:09 UTC (permalink / raw)
  To: mtk.manpages, Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

Hi Michael,

Please apply the attached patch, which updates the ptrace manpage.
(I'm not sending it inline, google web mail might mangle it. Sorry).

Changes include:

s/parent/tracer/g, s/child/tracee/g - ptrace interface now
is sufficiently cleaned up to not treat tracing process as parent.

Corrected several outright false statements:
- pid 1 can be traced
- tracer is not shown as parent in ps output
- PTRACE_ATTACH is not "the same behavior as if tracee had done
  a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.
- SIGSTOP _can_ be injected.

Removed mentions of SunOS and Solaris as irrelevant.

Added a few more known bugs.

Added a large block of text in DESCRIPTION which doesn't focus
on mechanical description of each flag and operation, but rather
tries to describe a bigger picture. The targeted audience is
a person who is reasonably knowledgeable in Unix but has not
spent years working with ptrace, and thus may be unaware of its quirks.
This text went through several iterations of review by Oleg Nesterov
and Tejun Heo.
This block of text intentionally uses as little markup as possible,
otherwise future modifications to it will be very hard to make.

-- 
vda

[-- Attachment #2: d196032aff8a2a828e3bbdbbb35f9fe7ed280028.diff --]
[-- Type: text/x-patch, Size: 43251 bytes --]

commit d196032aff8a2a828e3bbdbbb35f9fe7ed280028
Author: Denys Vlasenko <dvlasenk@redhat.com>
Date:   Thu Jul 21 12:55:49 2011 +0200

    ptrace: add extended description of various ptrace quirks
    
    Changes include:
    
    s/parent/tracer/g, s/child/tracee/g - ptrace interface now
    is sufficiently cleaned up to not treat tracing process as parent.
    
    Corrected several outright false statements:
    - pid 1 can be traced
    - tracer is not shown as parent in ps output
    - PTRACE_ATTACH is not "the same behavior as if tracee had done
      a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.
    - SIGSTOP _can_ be injected.
    
    Removed mentions of SunOS and Solaris as irrelevant.
    
    Added a few more known bugs.
    
    Added a large block of text in DESCRIPTION which doesn't focus
    on mechanical description of each flag and operation, but rather
    tries to describe a bigger picture. The targeted audience is
    a person who is reasonably knowledgeable in Unix but has not
    spent years working with ptrace, and thus may be unaware of its quirks.
    This text went through several iterations of review by Oleg Nesterov
    and Tejun Heo.
    This block of text intentionally uses as little markup as possible,
    otherwise future modifications to it will be very hard to make.
    
    Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>

diff --git a/man2/ptrace.2 b/man2/ptrace.2
index 9cd5899..8875873 100644
--- a/man2/ptrace.2
+++ b/man2/ptrace.2
@@ -53,45 +53,51 @@ ptrace \- process trace
 .SH DESCRIPTION
 The
 .BR ptrace ()
-system call provides a means by which a parent process may observe
-and control the execution of another process,
-and examine and change its core image and registers.
+system call provides a means by which a process (tracer) may observe
+and control the execution of other processes (tracees),
+and examine and change their core image and registers.
 It is primarily used to implement breakpoint debugging and system
 call tracing.
 .LP
-The parent can initiate a trace by calling
+Tracees first need to be attached to the tracer.
+Attachment and subsequent commands are per-thread: in a
+multi-threaded process, every thread can be individually attached to a
+(potentially different) tracer, or left not attached and thus not
+debugged. Therefore, "tracee" always means "(one) thread", never "a
+(possibly multi-threaded) process". Ptrace commands are always sent to
+a specific tracee using ptrace(PTRACE_foo, pid, ...), where pid is the
+thread ID of the corresponding Linux thread.
+.LP
+The process can initiate a trace by calling
 .BR fork (2)
 and having the resulting child do a
 .BR PTRACE_TRACEME ,
 followed (typically) by an
-.BR exec (3).
-Alternatively, the parent may commence trace of an existing process using
+.BR execve (2).
+Alternatively, the process may commence trace of an existing process using
 .BR PTRACE_ATTACH .
 .LP
-While being traced, the child will stop each time a signal is delivered,
+While being traced, the tracee will stop each time a signal is delivered,
 even if the signal is being ignored.
 (The exception is
 .BR SIGKILL ,
 which has its usual effect.)
-The parent will be notified at its next
+The tracer will be notified at its next
 .BR wait (2)
-and may inspect and modify the child process while it is stopped.
-The parent then causes the child to continue,
+and may inspect and modify the tracee while it is stopped.
+The tracer then causes the tracee to continue,
 optionally ignoring the delivered signal
 (or even delivering a different signal instead).
 .LP
-When the parent is finished tracing, it can terminate the child with
-.B PTRACE_KILL
-or cause it to continue executing in a normal, untraced mode
-via
+When the tracer is finished tracing, it can cause tracee to continue
+executing in a normal, untraced mode via
 .BR PTRACE_DETACH .
 .LP
 The value of \fIrequest\fP determines the action to be performed:
 .TP
 .B PTRACE_TRACEME
 Indicates that this process is to be traced by its parent.
-Any signal
-(except
+Any signal (except
 .BR SIGKILL )
 delivered to this process will cause it to stop and its
 parent to be notified via
@@ -107,19 +113,18 @@ A process probably shouldn't make this request if its parent
 isn't expecting to trace it.
 (\fIpid\fP, \fIaddr\fP, and \fIdata\fP are ignored.)
 .LP
-The above request is used only by the child process;
-the rest are used only by the parent.
-In the following requests, \fIpid\fP specifies the child process
+The above request is used only by the tracee;
+the rest are used only by the tracer.
+In the following requests, \fIpid\fP specifies the tracee
 to be acted on.
 For requests other than
 .BR PTRACE_KILL ,
-the child process must
-be stopped.
+the tracee must be stopped.
 .TP
 .BR PTRACE_PEEKTEXT ", " PTRACE_PEEKDATA
 Reads a word at the location
 .I addr
-in the child's memory, returning the word as the result of the
+in the tracee's memory, returning the word as the result of the
 .BR ptrace ()
 call.
 Linux does not have separate text and data address spaces, so the two
@@ -131,7 +136,7 @@ requests are currently equivalent.
 .\" and that is the name that seems common on other systems.
 Reads a word at offset
 .I addr
-in the child's USER area,
+in the tracee's USER area,
 which holds the registers and other information about the process
 (see \fI<sys/user.h>\fP).
 The word is returned as the result of the
@@ -147,7 +152,7 @@ Copies the word
 .I data
 to location
 .I addr
-in the child's memory.
+in the tracee's memory.
 As above, the two requests are currently equivalent.
 .TP
 .B PTRACE_POKEUSER
@@ -157,14 +162,14 @@ Copies the word
 .I data
 to offset
 .I addr
-in the child's USER area.
+in the tracee's USER area.
 As above, the offset must typically be word-aligned.
 In order to maintain the integrity of the kernel,
 some modifications to the USER area are disallowed.
 .TP
 .BR PTRACE_GETREGS ", " PTRACE_GETFPREGS
-Copies the child's general purpose or floating-point registers,
-respectively, to location \fIdata\fP in the parent.
+Copies the tracee's general purpose or floating-point registers,
+respectively, to location \fIdata\fP in the tracer.
 See \fI<sys/user.h>\fP for information on
 the format of this data.
 (\fIaddr\fP is ignored.)
@@ -173,12 +178,12 @@ the format of this data.
 Retrieve information about the signal that caused the stop.
 Copies a \fIsiginfo_t\fP structure (see
 .BR sigaction (2))
-from the child to location \fIdata\fP in the parent.
+from the tracee to location \fIdata\fP in the tracer.
 (\fIaddr\fP is ignored.)
 .TP
 .BR PTRACE_SETREGS ", " PTRACE_SETFPREGS
-Copies the child's general purpose or floating-point registers,
-respectively, from location \fIdata\fP in the parent.
+Copies the tracee's general purpose or floating-point registers,
+respectively, from location \fIdata\fP in the tracer.
 As for
 .BR PTRACE_POKEUSER ,
 some general
@@ -188,9 +193,9 @@ purpose register modifications may be disallowed.
 .BR PTRACE_SETSIGINFO " (since Linux 2.3.99-pre6)"
 Set signal information.
 Copies a \fIsiginfo_t\fP structure from location \fIdata\fP in the
-parent to the child.
+tracer to the tracee.
 This will only affect signals that would normally be delivered to
-the child and were caught by the tracer.
+the tracee and were caught by the tracer.
 It may be difficult to tell
 these normal signals from synthetic signals generated by
 .BR ptrace ()
@@ -198,7 +203,7 @@ itself.
 (\fIaddr\fP is ignored.)
 .TP
 .BR PTRACE_SETOPTIONS " (since Linux 2.4.6; see BUGS for caveats)"
-Sets ptrace options from \fIdata\fP in the parent.
+Sets ptrace options from \fIdata\fP.
 (\fIaddr\fP is ignored.)
 \fIdata\fP is interpreted
 as a bit mask of options, which are specified by the following flags:
@@ -213,7 +218,7 @@ between normal traps and those caused by a syscall.
 may not work on all architectures.)
 .TP
 .BR PTRACE_O_TRACEFORK " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR fork (2)
 call with \fISIGTRAP | PTRACE_EVENT_FORK\ <<\ 8\fP and automatically
 start tracing the newly forked process,
@@ -223,7 +228,7 @@ The PID for the new process can be retrieved with
 .BR PTRACE_GETEVENTMSG .
 .TP
 .BR PTRACE_O_TRACEVFORK " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR vfork (2)
 call with \fISIGTRAP | PTRACE_EVENT_VFORK\ <<\ 8\fP and automatically start
 tracing the newly vforked process, which will start with a
@@ -232,7 +237,7 @@ The PID for the new process can be retrieved with
 .BR PTRACE_GETEVENTMSG .
 .TP
 .BR PTRACE_O_TRACECLONE " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR clone (2)
 call with \fISIGTRAP | PTRACE_EVENT_CLONE\ <<\ 8\fP and automatically start
 tracing the newly cloned process, which will start with a
@@ -242,7 +247,7 @@ The PID for the new process can be retrieved with
 This option may not catch
 .BR clone (2)
 calls in all cases.
-If the child calls
+If the tracee calls
 .BR clone (2)
 with the
 .B CLONE_VFORK
@@ -251,7 +256,7 @@ flag,
 will be delivered instead
 if
 .B PTRACE_O_TRACEVFORK
-is set; otherwise if the child calls
+is set; otherwise if the tracee calls
 .BR clone (2)
 with the exit signal set to
 .BR SIGCHLD ,
@@ -262,18 +267,18 @@ if
 is set.
 .TP
 .BR PTRACE_O_TRACEEXEC " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR execve (2)
 call with \fISIGTRAP | PTRACE_EVENT_EXEC\ <<\ 8\fP.
 .TP
 .BR PTRACE_O_TRACEVFORKDONE " (since Linux 2.5.60)"
-Stop the child at the completion of the next
+Stop the tracee at the completion of the next
 .BR vfork (2)
 call with \fISIGTRAP | PTRACE_EVENT_VFORK_DONE\ <<\ 8\fP.
 .TP
 .BR PTRACE_O_TRACEEXIT " (since Linux 2.5.60)"
-Stop the child at exit with \fISIGTRAP | PTRACE_EVENT_EXIT\ <<\ 8\fP.
-The child's exit status can be retrieved with
+Stop the tracee at exit with \fISIGTRAP | PTRACE_EVENT_EXIT\ <<\ 8\fP.
+The tracee's exit status can be retrieved with
 .BR PTRACE_GETEVENTMSG .
 This stop will be done early during process exit when registers
 are still available, allowing the tracer to see where the exit occurred,
@@ -287,10 +292,10 @@ happening at this point.
 Retrieve a message (as an
 .IR "unsigned long" )
 about the ptrace event
-that just happened, placing it in the location \fIdata\fP in the parent.
+that just happened, placing it in the location \fIdata\fP in the tracer.
 For
 .B PTRACE_EVENT_EXIT
-this is the child's exit status.
+this is the tracee's exit status.
 For
 .BR PTRACE_EVENT_FORK ,
 .B PTRACE_EVENT_VFORK
@@ -304,23 +309,21 @@ for
 (\fIaddr\fP is ignored.)
 .TP
 .B PTRACE_CONT
-Restarts the stopped child process.
-If \fIdata\fP is nonzero and not
-.BR SIGSTOP ,
-it is interpreted as a signal to be delivered to the child;
+Restarts the stopped tracee process.
+If \fIdata\fP is nonzero, it is interpreted as a signal to be delivered to the tracee;
 otherwise, no signal is delivered.
-Thus, for example, the parent can control
-whether a signal sent to the child is delivered or not.
+Thus, for example, the tracer can control
+whether a signal sent to the tracee is delivered or not.
 (\fIaddr\fP is ignored.)
 .TP
 .BR PTRACE_SYSCALL ", " PTRACE_SINGLESTEP
-Restarts the stopped child as for
+Restarts the stopped tracee as for
 .BR PTRACE_CONT ,
 but arranges for
-the child to be stopped at the next entry to or exit from a system call,
+the tracee to be stopped at the next entry to or exit from a system call,
 or after execution of a single instruction, respectively.
-(The child will also, as usual, be stopped upon receipt of a signal.)
-From the parent's perspective, the child will appear to have been
+(The tracee will also, as usual, be stopped upon receipt of a signal.)
+From the tracer's perspective, the tracee will appear to have been
 stopped by receipt of a
 .BR SIGTRAP .
 So, for
@@ -347,7 +350,7 @@ For
 do the same
 but also singlestep if not a syscall.
 This call is used by programs like
-User Mode Linux that want to emulate all the child's system calls.
+User Mode Linux that want to emulate all the tracee's system calls.
 The
 .I data
 argument is treated as for
@@ -356,44 +359,523 @@ argument is treated as for
 not supported on all architectures.)
 .TP
 .B PTRACE_KILL
-Sends the child a
+Sends the tracee a
 .B SIGKILL
 to terminate it.
 (\fIaddr\fP and \fIdata\fP are ignored.)
+This operation is deprecated; use kill(SIGKILL) or tgkill(SIGKILL) instead.
 .TP
 .B PTRACE_ATTACH
 Attaches to the process specified in
 .IR pid ,
-making it a traced "child" of the calling process;
-the behavior of the child is as if it had done a
-.BR PTRACE_TRACEME .
-The calling process actually becomes the parent of the child
-process for most purposes (e.g., it will receive
-notification of child events and appears in
-.BR ps (1)
-output as the child's parent), but a
-.BR getppid (2)
-by the child will still return the PID of the original parent.
-The child is sent a
+making it a tracee of the calling process.
+.\" Not true:
+.\" ; the behavior of the tracee is as if it had done a
+.\" .BR PTRACE_TRACEME .
+.\" The calling process actually becomes the parent of the tracee
+.\" process for most purposes (e.g., it will receive
+.\" notification of tracee events and appears in
+.\" .BR ps (1)
+.\" output as the tracee's parent), but a
+.\" .BR getppid (2)
+.\" by the tracee will still return the PID of the original parent.
+The tracee is sent a
 .BR SIGSTOP ,
 but will not necessarily have stopped
 by the completion of this call; use
 .BR wait (2)
-to wait for the child to stop.
+to wait for the tracee to stop. See the "Attaching and detaching" subsection
+for additional information.
 (\fIaddr\fP and \fIdata\fP are ignored.)
 .TP
 .B PTRACE_DETACH
-Restarts the stopped child as for
+Restarts the stopped tracee as for
 .BR PTRACE_CONT ,
-but first detaches
-from the process, undoing the reparenting effect of
-.BR PTRACE_ATTACH ,
-and the effects of
-.BR PTRACE_TRACEME .
-Although perhaps not intended, under Linux a traced child can be
+but first detaches from it.
+Under Linux a tracee can be
 detached in this way regardless of which method was used to initiate
 tracing.
 (\fIaddr\fP is ignored.)
+.\"
+.\" In the text below, we decided to avoid prettifying the text with markup:
+.\" it would make the source nearly impossible to edit, and we _do_ intend
+.\" to edit it often, in order to keep it updated:
+.\" ptrace API is full of quirks, no need to compound this situation by
+.\" making it excruciatingly painful to document them!
+.\"
+.SS Death under ptrace
+When a (possibly multi-threaded) process receives a killing signal (a
+signal set to SIG_DFL and whose default action is to kill the process),
+all threads exit. Tracees report their death to their tracer(s). The
+notification about this event is delivered through waitpid API.
+.LP
+Note that killing signal will first cause signal-delivery-stop (on one
+tracee only), and only after it is injected by tracer (or after it was
+dispatched to a thread which isn't traced), death from signal will
+happen on ALL tracees within multi-threaded process.
+.LP
+SIGKILL operates similarly, with exceptions. No signal-delivery-stop is
+generated for SIGKILL and therefore tracer can't suppress it. SIGKILL
+kills even within syscalls (syscall-exit-stop is not generated prior to
+death by SIGKILL). The net effect is that SIGKILL always kills the
+process (all its threads), even if some threads of the process are
+ptraced.
+.LP
+Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This
+operation is deprecated; use kill(SIGKILL) or tgkill(SIGKILL) instead.
+The problem with this operation is that it requires tracee to be in
+signal-delivery-stop, otherwise it may not work (may complete
+successfully but won't kill the tracee), whereas tgkill(SIGKILL)
+has no such limitation.
+.LP
+[Note: deprecation suggested by Oleg Nesterov. He prefers to deprecate
+it instead of describing (and needing to support) PTRACE_KILL's quirks.]
+.LP
+When tracee executes exit syscall, it reports its death to its tracer.
+Other threads are not affected.
+.LP
+When any thread executes exit_group syscall, every tracee in its thread
+group reports its death to its tracer.
+.LP
+If PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen
+before actual death. This applies to exits on exit syscall, exit_group
+syscall, signal deaths (except SIGKILL), and when threads are torn down
+on execve in multi-threaded process.
+.LP
+Tracer cannot assume that ptrace-stopped tracee exists. There are many
+scenarios when tracee may die while stopped (such as SIGKILL).
+Therefore, tracer must always be prepared to handle ESRCH error on any
+ptrace operation. Unfortunately, the same error is returned if tracee
+exists but is not ptrace-stopped (for commands which require stopped
+tracee), or if it is not traced by process which issued ptrace call.
+Tracer needs to keep track of stopped/running state, and interpret
+ESRCH as "tracee died unexpectedly" only if it knows that tracee has
+been observed to enter ptrace-stop. Note that there is no guarantee
+that waitpid(WNOHANG) will reliably report tracee's death status if
+ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead.
+IOW: tracee may be "not yet fully dead" but already refusing ptrace ops.
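To illustrate the bookkeeping described in the paragraph above, here is
a minimal C sketch. The struct, enum and function names are
hypothetical; the only rule taken from the text is that ESRCH is read
as "tracee died" only when the tracee is known to have entered
ptrace-stop.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical per-tracee bookkeeping kept by the tracer. */
struct tracee_state {
    pid_t tid;            /* thread ID of the tracee */
    bool  in_ptrace_stop; /* true between a reported stop and the next restart */
};

enum esrch_meaning {
    ESRCH_NOT_APPLICABLE, /* errno was something other than ESRCH */
    ESRCH_TRACEE_DIED,    /* tracee was known to be stopped, so it died */
    ESRCH_AMBIGUOUS       /* tracee was running or is not ours: tracer-side problem */
};

/* Interpret the errno left behind by a failed ptrace() call. */
enum esrch_meaning interpret_ptrace_errno(const struct tracee_state *t, int err)
{
    if (err != ESRCH)
        return ESRCH_NOT_APPLICABLE;
    return t->in_ptrace_stop ? ESRCH_TRACEE_DIED : ESRCH_AMBIGUOUS;
}
```

A tracer following this scheme would set in_ptrace_stop when waitpid
reports a stop and clear it just before every restarting ptrace call.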
+.LP
+Tracer cannot assume that tracee ALWAYS ends its life by reporting
+WIFEXITED(status) or WIFSIGNALED(status).
+.LP
+.\" or can it? Do we include such a promise into ptrace API?
+.SS Stopped states
+A tracee can be in two states: running or stopped.
+.LP
+There are many kinds of states when tracee is stopped, and in ptrace
+discussions they are often conflated. Therefore, it is important to use
+precise terms.
+.LP
+In this document, any stopped state in which tracee is ready to accept
+ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can
+be further subdivided into signal-delivery-stop, group-stop,
+syscall-stop and so on. They are described in detail later.
+.LP
+When running tracee enters ptrace-stop, it notifies its tracer using
+waitpid API. Tracer should use waitpid family of syscalls to wait for
+tracee to stop. Most of this document assumes that tracer waits with:
+.LP
+	pid = waitpid(pid_or_minus_1, &status, __WALL);
+.LP
+Ptrace-stopped tracees are reported as returns with pid > 0 and
+WIFSTOPPED(status) == true.
+.LP
+.\" Do we require __WALL usage, or will just using 0 be ok? Are the
+.\" rules different if user wants to use waitid? Will waitid require
+.\" WEXITED?
+.LP
+__WALL value does not include WSTOPPED and WEXITED bits, but implies
+their functionality.
+.LP
+Setting of WCONTINUED bit in waitpid flags is not recommended: the
+continued state is per-process and consuming it can confuse real parent
+of the tracee.
+.LP
+Use of WNOHANG bit in waitpid flags may cause waitpid to return 0 ("no
+wait results available yet") even if tracer knows there should be a
+notification. Example: kill(tracee, SIGKILL); waitpid(tracee, &status,
+__WALL | WNOHANG);
+.\" waitid usage? WNOWAIT?
+.\" describe how wait notifications queue (or not queue)
+.LP
+The following kinds of ptrace-stops exist: signal-delivery-stops,
+group-stop, PTRACE_EVENT stops, syscall-stops [, SINGLESTEP, SYSEMU,
+SYSEMU_SINGLESTEP]. They all are reported as waitpid result with
+WIFSTOPPED(status) == true. They may be differentiated by checking
+(status >> 8) value, and if looking at (status >> 8) value doesn't
+resolve ambiguity, by querying PTRACE_GETSIGINFO. (Note:
+WSTOPSIG(status) macro returns ((status >> 8) & 0xff) value).
+.SS Signal-delivery-stop
+When (possibly multi-threaded) process receives any signal except
+SIGKILL, kernel selects a thread which handles the signal (if signal is
+generated with t[g]kill, thread selection is done by user). If selected
+thread is traced, it enters signal-delivery-stop. By this point, signal
+is not yet delivered to the process, and can be suppressed by tracer.
+If tracer doesn't suppress the signal, it passes signal to tracee in
+the next ptrace request. This second step of signal delivery is called
+"signal injection" in this document. Note that if signal is blocked,
+signal-delivery-stop doesn't happen until signal is unblocked, with the
+usual exception that SIGSTOP can't be blocked.
+.LP
+Signal-delivery-stop is observed by tracer as waitpid returning with
+WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If
+WSTOPSIG(status) == SIGTRAP, this may be a different kind of
+ptrace-stop - see "Syscall-stops" and "execve" sections below for
+details. If WSTOPSIG(status) == stopping signal, this may be a
+group-stop - see below.
+.SS Signal injection and suppression
+After signal-delivery-stop is observed by tracer, tracer should restart
+tracee with
+.LP
+	ptrace(PTRACE_rest, pid, 0, sig)
+.LP
+call, where PTRACE_rest is one of the restarting ptrace ops. If sig is
+0, then signal is not delivered. Otherwise, signal sig is delivered.
+This operation is called "signal injection" in this document, to
+distinguish it from signal-delivery-stop.
+.LP
+Note that sig value may be different from WSTOPSIG(status) value -
+tracer can cause a different signal to be injected.
+.LP
+Note that suppressed signal still causes syscalls to return
+prematurely. Restartable syscalls will be restarted (tracer will
+observe tracee to execute restart_syscall(2) syscall if tracer uses
+PTRACE_SYSCALL), non-restartable syscalls (for example, nanosleep) may
+return with -EINTR even though no observable signal is injected to the
+tracee.
+.LP
+Note that restarting ptrace commands issued in ptrace-stops other than
+signal-delivery-stop are not guaranteed to inject a signal, even if sig
+is nonzero. No error is reported, nonzero sig may simply be ignored.
+Ptrace users should not try to "create new signal" this way: use
+tgkill(2) instead.
+.LP
+This is a cause of confusion among ptrace users. One typical scenario
+is that tracer observes group-stop, mistakes it for
+signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0,
+stopsig) with the intention of injecting stopsig, but stopsig gets
+ignored and tracee continues to run.
+.LP
+SIGCONT signal has a side effect of waking up (all threads of)
+group-stopped process. This side effect happens before
+signal-delivery-stop. Tracer can't suppress this side-effect (it can
+only suppress signal injection, which only causes SIGCONT handler to
+not be executed in the tracee, if such handler is installed). In fact,
+waking up from group-stop may be followed by signal-delivery-stop for
+signal(s) *other than* SIGCONT, if they were pending when SIGCONT was
+delivered. IOW: SIGCONT may not be the first signal observed by the
+tracee after it was sent.
+.LP
+Stopping signals cause (all threads of) process to enter group-stop.
+This side effect happens after signal injection, and therefore can be
+suppressed by tracer.
+.LP
+PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which
+corresponds to delivered signal. PTRACE_SETSIGINFO may be used to
+modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t,
+si_signo field and sig parameter in restarting command must match,
+otherwise the result is undefined.
+.SS Group-stop
+When a (possibly multi-threaded) process receives a stopping signal,
+all threads stop. If some threads are traced, they enter a group-stop.
+Note that stopping signal will first cause signal-delivery-stop (on one
+tracee only), and only after it is injected by tracer (or after it was
+dispatched to a thread which isn't traced), group-stop will be
+initiated on ALL tracees within multi-threaded process. As usual, every
+tracee reports its group-stop separately to corresponding tracer.
+.LP
+Group-stop is observed by tracer as waitpid returning with
+WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result
+is returned by some other classes of ptrace-stops, therefore the
+recommended practice is to perform
+.LP
+	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
+.LP
+call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP,
+SIGTTIN or SIGTTOU - only these four signals are stopping signals. If
+tracer sees something else, it can't be group-stop. Otherwise, tracer
+needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with
+EINVAL, then it is definitely a group-stop. (Other failure codes are
+possible, such as ESRCH "no such process" if SIGKILL killed the tracee).
+.LP
+As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it
+restarts or kills it, tracee will not run, and will not send
+notifications (except SIGKILL death) to tracer, even if tracer enters
+into another waitpid call.
+.LP
+Currently, it causes a problem with transparent handling of stopping
+signals: if tracer restarts tracee after group-stop, SIGSTOP is
+effectively ignored: tracee doesn't remain stopped, it runs. If tracer
+doesn't restart tracee before entering into next waitpid, future
+SIGCONT will not be reported to the tracer, which would make SIGCONT
+have no effect.
+.SS PTRACE_EVENT stops
+If tracer sets PTRACE_O_TRACEfoo options, tracee will enter ptrace-stops
+called PTRACE_EVENT stops.
+.LP
+PTRACE_EVENT stops are observed by tracer as waitpid returning with
+WIFSTOPPED(status) == true, WSTOPSIG(status) == SIGTRAP. Additional bit
+is set in a higher byte of status word: value (status >> 8)
+will be (SIGTRAP | PTRACE_EVENT_foo << 8). The following events exist:
+.LP
+PTRACE_EVENT_VFORK - stop before return from vfork or clone+CLONE_VFORK.
+When tracee is continued after this stop, it will wait for child to
+exit/exec before continuing its execution (IOW: usual behavior on
+vfork).
+.LP
+PTRACE_EVENT_FORK - stop before return from fork or clone+SIGCHLD
+.LP
+PTRACE_EVENT_CLONE - stop before return from clone
+.LP
+PTRACE_EVENT_VFORK_DONE - stop before return from
+vfork or clone+CLONE_VFORK, but after vforked child unblocked this
+tracee by exiting or exec'ing.
+.LP
+For all four stops described above: stop occurs in parent, not in newly
+created thread. PTRACE_GETEVENTMSG can be used to retrieve new thread's
+tid.
+.LP
+PTRACE_EVENT_EXEC - stop before return from execve.
+.LP
+PTRACE_EVENT_EXIT - stop before exit (including death from exit_group),
+signal death, or exit caused by execve in multi-threaded process.
+PTRACE_GETEVENTMSG returns exit status. Registers can be examined
+(unlike when "real" exit happens). The tracee is still alive, it needs
+to be PTRACE_CONTed or PTRACE_DETACHed to finish exit.
+.LP
+PTRACE_GETSIGINFO on PTRACE_EVENT stops returns si_signo = SIGTRAP,
+si_code = (event << 8) | SIGTRAP.
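As a small illustration of the status layout described in this section,
the event code can be pulled out of the wait status in C like this. The
function name is hypothetical; the bit positions follow from
(status >> 8) == (SIGTRAP | PTRACE_EVENT_foo << 8).

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* code carried in a wait status,
 * or 0 if the status does not describe a PTRACE_EVENT stop.
 */
int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    /* Bits 8..15 hold SIGTRAP; the event code sits in the byte above them. */
    return (status >> 16) & 0xffff;
}
```

A plain SIGTRAP stop yields 0 here, so the caller still has to tell
signal-delivery-stop and syscall-stop apart by other means.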
+.SS Syscall-stops
+If tracee was restarted by PTRACE_SYSCALL, tracee enters
+syscall-enter-stop just prior to entering any syscall. If tracer
+restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when
+syscall is finished, or if it is interrupted by a signal. (That is,
+signal-delivery-stop never happens between syscall-enter-stop and
+syscall-exit-stop, it happens *after* syscall-exit-stop).
+.LP
+Other possibilities are that tracee may stop in a PTRACE_EVENT stop,
+exit (if it entered exit or exit_group syscall), be killed by SIGKILL,
+or die silently (if it is a thread group leader, execve syscall happened
+in another thread, and that thread is not traced by the same tracer -
+this situation is discussed later).
+.LP
+Syscall-enter-stop and syscall-exit-stop are observed by tracer as
+waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) ==
+SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then
+WSTOPSIG(status) == (SIGTRAP | 0x80).
+.LP
+Syscall-stops can be distinguished from signal-delivery-stop with
+SIGTRAP by querying PTRACE_GETSIGINFO: si_code <= 0 if SIGTRAP was sent by usual
+suspects like [tg]kill/sigqueue/etc; or = SI_KERNEL (0x80) if sent by
+kernel, whereas syscall-stops have si_code = SIGTRAP or (SIGTRAP |
+0x80). However, syscall-stops happen very often (twice per syscall),
+and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat
+expensive.
+.LP
+On some architectures, the two can be distinguished by examining registers.
+For example, on x86 rax = -ENOSYS in syscall-enter-stop. Since SIGTRAP
+(like any other signal) always happens *after* syscall-exit-stop, and
+at this point rax almost never contains -ENOSYS, SIGTRAP looks like
+"syscall-stop which is not syscall-enter-stop", IOW: it looks like a
+"stray syscall-exit-stop" and can be detected this way. But such
+detection is fragile and is best avoided.
+.LP
+Using PTRACE_O_TRACESYSGOOD option is a recommended method, since it is
+reliable and does not incur performance penalty.
+.LP
+Syscall-enter-stop and syscall-exit-stop are indistinguishable from
+each other by tracer. Tracer needs to keep track of the sequence of
+ptrace-stops in order to not misinterpret syscall-enter-stop as
+syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is
+always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's
+death - no other kinds of ptrace-stop can occur in between.
+.LP
+If after syscall-enter-stop tracer uses restarting command other than
+PTRACE_SYSCALL, syscall-exit-stop is not generated.
+.LP
+PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code
+= SIGTRAP or (SIGTRAP | 0x80).
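Since syscall-enter-stop and syscall-exit-stop look identical, the
tracer has to count them. A minimal C sketch of such tracking follows.
It assumes PTRACE_O_TRACESYSGOOD is set and that the tracer always
restarts the tracee with PTRACE_SYSCALL; all names are hypothetical.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/* Hypothetical per-tracee state: true between enter-stop and exit-stop. */
struct syscall_tracker {
    bool in_syscall;
};

enum syscall_stop_kind { NOT_SYSCALL_STOP, SYSCALL_ENTER, SYSCALL_EXIT };

/*
 * Classify a wait status and update the tracker.
 * With PTRACE_O_TRACESYSGOOD, syscall-stops report SIGTRAP | 0x80.
 * Any other stop (signal-delivery-stop, PTRACE_EVENT stop) leaves the
 * tracker untouched.
 */
enum syscall_stop_kind note_stop(struct syscall_tracker *t, int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != (SIGTRAP | 0x80))
        return NOT_SYSCALL_STOP;
    t->in_syscall = !t->in_syscall;
    return t->in_syscall ? SYSCALL_ENTER : SYSCALL_EXIT;
}
```

This simple toggle does not cover a tracer that restarts with a command
other than PTRACE_SYSCALL after syscall-enter-stop; in that case no
syscall-exit-stop is generated and the tracker would need to be reset
by hand.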
+.SS SINGLESTEP, SYSEMU, SYSEMU_SINGLESTEP stops
+(TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
+PTRACE_SYSEMU_SINGLESTEP)
+.SS Informational and restarting ptrace commands
+Most ptrace commands (all except ATTACH, TRACEME, KILL) require tracee
+to be in a ptrace-stop, otherwise they fail with ESRCH.
+.LP
+When tracee is in ptrace-stop, tracer can read and write data to tracee
+using informational commands. They leave tracee in ptrace-stopped state:
+.LP
+.nf
+longv = ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
+	ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
+	ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
+	ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
+	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
+	ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
+	ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
+	ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
+.fi
+.LP
+Note that some errors are not reported. For example, setting siginfo
+may have no effect in some ptrace-stops, yet the call may succeed
+(return 0 and not set errno); querying GETEVENTMSG may succeed
+and return some random value if current ptrace-stop is not documented
+as returning meaningful event message.
+.LP
+ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags) affects one tracee.
+Current flags are replaced. Flags are inherited by new tracees created
+and "auto-attached" via active PTRACE_O_TRACE[V]FORK or
+PTRACE_O_TRACECLONE options.
+.LP
+Another group of commands makes ptrace-stopped tracee run. They have
+the form:
+.LP
+	ptrace(PTRACE_cmd, pid, 0, sig);
+.LP
+where cmd is CONT, DETACH, SYSCALL, SINGLESTEP, SYSEMU, or
+SYSEMU_SINGLESTEP. If tracee is in signal-delivery-stop, sig is the
+signal to be injected. Otherwise, sig may be ignored (recommended
+practice is to always pass 0 in these cases).
+.SS Attaching and detaching
+A thread can be attached to tracer using ptrace(PTRACE_ATTACH, pid, 0,
+0) call. This also sends SIGSTOP to this thread. If tracer wants this
+SIGSTOP to have no effect, it needs to suppress it. Note that if other
+signals are concurrently sent to this thread during attach, tracer may
+see tracee enter signal-delivery-stop with other signal(s) first! The
+usual practice is to reinject these signals until SIGSTOP is seen, then
+suppress SIGSTOP injection. The design bug here is that attach races
+with a concurrent SIGSTOP, and the concurrent SIGSTOP may be lost.
+.\" Describe how to attach to a thread which is already group-stopped.
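+.LP
+The usual practice can be sketched as below. This is a simplified
+illustration, not a complete solution: it does not address the race
+with a concurrent SIGSTOP, it assumes the thread is not already
+group-stopped, and the helper name attach_and_wait is hypothetical.
+.LP
+.nf
+#include <signal.h>
+#include <stddef.h>
+#include <sys/ptrace.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+/* Return 0 with tracee stopped on the attach SIGSTOP, -1 on failure. */
+static int attach_and_wait(pid_t tid)
+{
+    int status;
+
+    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) < 0)
+        return -1;
+    for (;;) {
+        if (waitpid(tid, &status, __WALL) < 0)
+            return -1;
+        if (!WIFSTOPPED(status))
+            return -1;  /* tracee exited or was killed */
+        if (WSTOPSIG(status) == SIGSTOP)
+            return 0;   /* caller suppresses it: restart with sig 0 */
+        /* Another signal won the race: reinject it and wait again. */
+        if (ptrace(PTRACE_CONT, tid, NULL,
+                   (void *)(long)WSTOPSIG(status)) < 0)
+            return -1;
+    }
+}
+.fi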
+.LP
+Since attaching sends SIGSTOP and tracer usually suppresses it, this
+may cause stray EINTR return from the currently executing syscall in
+the tracee, as described in "signal injection and suppression" section.
+.LP
+ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a
+tracee. It continues to run (doesn't enter ptrace-stop). A common
+practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and
+allow parent (which is our tracer now) to observe our
+signal-delivery-stop.
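+.LP
+A minimal sketch of this common practice, assuming the tracer is the
+real parent and wants tracee stopped before it execs (the helper name
+spawn_tracee is made up for this example):
+.LP
+.nf
+#include <signal.h>
+#include <stddef.h>
+#include <sys/ptrace.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+/* Return pid of a tracee stopped by its own SIGSTOP, or -1. */
+static pid_t spawn_tracee(char *const argv[])
+{
+    int status;
+    pid_t pid = fork();
+
+    if (pid < 0)
+        return -1;
+    if (pid == 0) {
+        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* keeps running */
+        raise(SIGSTOP);         /* let tracer observe us */
+        execvp(argv[0], argv);
+        _exit(127);             /* exec failed */
+    }
+    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
+        return -1;
+    return pid;
+}
+.fi
+.LP
+At this point tracer would typically set options with
+PTRACE_SETOPTIONS and restart tracee with sig = 0, suppressing the
+SIGSTOP.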
+.LP
+If PTRACE_O_TRACEVFORK, PTRACE_O_TRACEFORK or PTRACE_O_TRACECLONE
+options are in effect, then children created by (vfork or
+clone(CLONE_VFORK)), (fork or clone(SIGCHLD)) and (other kinds of clone)
+respectively are automatically attached to the same tracer which traced
+their parent.
+SIGSTOP is delivered to them, causing them to enter
+signal-delivery-stop after they exit syscall which created them.
+.LP
+Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig).
+PTRACE_DETACH is a restarting operation, therefore it requires tracee
+to be in ptrace-stop. If tracee is in signal-delivery-stop, signal can
+be injected. Otherwise, sig parameter may be silently ignored.
+.LP
+If tracee is running when tracer wants to detach it, the usual solution
+is to send SIGSTOP (using tgkill, to make sure it goes to the correct
+thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP
+and then detach it (suppressing SIGSTOP injection). Design bug is that
+this can race with concurrent SIGSTOPs. Another complication is that
+tracee may enter other ptrace-stops and needs to be restarted and
+waited for again, until SIGSTOP is seen. Yet another complication is
+that tracer must be sure tracee is not already ptrace-stopped, because
+no signal delivery happens while it is - not even SIGSTOP.
+.\" Describe how to detach from a group-stopped tracee so that it
+.\" doesn't run, but continues to wait for SIGCONT.
+.LP
+If tracer dies, all tracees are automatically detached and restarted,
+unless they were in group-stop. Handling of restart from group-stop is
+currently buggy, but "as planned" behavior is to leave tracee stopped
+and waiting for SIGCONT. If tracee is restarted from
+signal-delivery-stop, pending signal is injected.
+.SS execve under ptrace
+During execve, kernel destroys all other threads in the process, and
+resets execve'ing thread tid to tgid (process id). This looks very
+confusing to tracers:
+.LP
+All other threads stop in PTRACE_EVENT_EXIT stop, if requested by active
+ptrace option. Then all other threads except thread group leader report
+death as if they exited via exit syscall with exit code 0. Then
+PTRACE_EVENT_EXEC stop happens, if requested by active ptrace option.
+.\" (on which tracee - leader? execve-ing one?)
+.LP
+The execve-ing tracee changes its pid while it is in execve syscall.
+(Remember, under ptrace 'pid' returned from waitpid, or fed into ptrace
+calls, is tracee's tid). That is, pid is reset to process id, which
+coincides with thread group leader tid.
+.LP
+If thread group leader has reported its death by this time, for tracer
+this looks like dead thread leader "reappears from nowhere". If thread
+group leader was still alive, for tracer this may look as if thread
+group leader returns from a different syscall than it entered, or even
+"returned from syscall even though it was not in any syscall". If
+thread group leader was not traced (or was traced by a different
+tracer), during execve it will appear as if it has become a tracee of
+the tracer of execve-ing tracee. All these effects are the artifacts of
+pid change.
+.LP
+PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this
+case. It enables PTRACE_EVENT_EXEC stop which occurs before execve
+syscall returns.
+.LP
+Pid change happens before PTRACE_EVENT_EXEC stop, not after.
+.LP
+When tracer receives PTRACE_EVENT_EXEC stop notification, it is
+guaranteed that except this tracee and thread group leader, no other
+threads from the process are alive.
+.LP
+On receiving this notification, tracer should clean up all its internal
+data structures about all threads of this process, and retain only one
+data structure, one which describes single still running tracee, with
+pid = tgid = process id.
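+.LP
+Recognizing this notification can be sketched as below, relying on the
+usual PTRACE_EVENT encoding of the wait status. The helper name
+is_exec_stop is made up for this example, and the cleanup of per-thread
+records is left to the tracer since it depends on its own data
+structures.
+.LP
+.nf
+#include <signal.h>
+#include <sys/ptrace.h>
+#include <sys/wait.h>
+
+/* Nonzero if waitpid status reports a PTRACE_EVENT_EXEC stop. */
+static int is_exec_stop(int status)
+{
+    return WIFSTOPPED(status) &&
+           (status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8));
+}
+.fi
+.LP
+The pid returned by the same waitpid call is already the new one
+(tgid), so it is the only identifier the tracer should keep.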
+.LP
+Currently, there is no way to retrieve former pid of execve-ing tracee.
+If tracer doesn't keep track of its tracees' thread group relations, it
+may be unable to know which tracee execve-ed and therefore no longer
+exists under old pid due to pid change.
+.LP
+Example: two threads execve at the same time:
+.LP
+.nf
+*** we get syscall-enter-stop in thread 1: **
+PID1 execve("/bin/foo", "foo" <unfinished ...>
+*** we issue PTRACE_SYSCALL for thread 1 **
+*** we get syscall-enter-stop in thread 2: **
+PID2 execve("/bin/bar", "bar" <unfinished ...>
+*** we issue PTRACE_SYSCALL for thread 2 **
+*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
+*** we get syscall-exit-stop for PID0: **
+PID0 <... execve resumed> )             = 0
+.fi
+.LP
+In this situation there is no way to know which execve succeeded.
+.LP
+If PTRACE_O_TRACEEXEC option is NOT in effect for the execve-ing
+tracee, kernel delivers an extra SIGTRAP to tracee after execve syscall
+returns. This is an ordinary signal (similar to one which can be
+generated by "kill -TRAP"), not a special kind of ptrace-stop.
+GETSIGINFO on it has si_code = 0 (SI_USER). It can be blocked by signal
+mask, and thus can happen (much) later.
+.LP
+Usually, tracer (for example, strace) would not want to show this extra
+post-execve SIGTRAP signal to the user, and would suppress its delivery
+to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal).
+However, determining *which* SIGTRAP to suppress is not easy. Setting
+PTRACE_O_TRACEEXEC option and thus suppressing this extra SIGTRAP is
+the recommended approach.
+.SS Real parent
+Ptrace API (ab)uses standard Unix parent/child signaling over waitpid.
+This used to cause real parent of the process to stop receiving several
+kinds of waitpid notifications when child process is traced by some
+other process.
+.LP
+Many of these bugs have been fixed, but as of 2.6.38 several still
+exist - see BUGS section below.
+.LP
+As of 2.6.38, the following is believed to work correctly:
+.LP
+* exit/death by signal is reported first to tracer, then, when tracer
+consumes waitpid result, to real parent (to real parent only when the
+whole multi-threaded process exits). If they are the same process, the
+report is sent only once.
 .SH "RETURN VALUE"
 On success,
 .B PTRACE_PEEK*
@@ -415,7 +897,7 @@ register.
 .TP
 .B EFAULT
 There was an attempt to read from or write to an invalid area in
-the parent's or child's memory,
+the tracer's or tracee's memory,
 probably because the area wasn't mapped or accessible.
 Unfortunately, under Linux, different variations of this fault
 will return
@@ -429,14 +911,14 @@ An attempt was made to set an invalid option.
 .TP
 .B EIO
 \fIrequest\fP is invalid, or an attempt was made to read from or
-write to an invalid area in the parent's or child's memory,
+write to an invalid area in the tracer's or tracee's memory,
 or there was a word-alignment violation,
 or an invalid signal was specified during a restart request.
 .TP
 .B EPERM
 The specified process cannot be traced.
 This could be because the
-parent has insufficient privileges (the required capability is
+tracer has insufficient privileges (the required capability is
 .BR CAP_SYS_PTRACE );
 unprivileged processes cannot trace processes that they
 cannot send signals to or those running
@@ -461,10 +943,11 @@ This means that unneeded trailing arguments may be omitted,
 though doing so makes use of undocumented
 .BR gcc (1)
 behavior.
-.LP
-.BR init (8),
-the process with PID 1, may not be traced.
-.LP
+.\" Not true anymore:
+.\" .LP
+.\" .BR init (8),
+.\" the process with PID 1, may not be traced.
+.\" .LP
 The layout of the contents of memory and the USER area are quite OS- and
 architecture-specific.
 The offset supplied, and the data returned,
@@ -474,30 +957,31 @@ might not entirely match with the definition of
 .LP
 The size of a "word" is determined by the OS variant
 (e.g., for 32-bit Linux it is 32 bits, etc.).
-.LP
-Tracing causes a few subtle differences in the semantics of
-traced processes.
-For example, if a process is attached to with
-.BR PTRACE_ATTACH ,
-its original parent can no longer receive notification via
-.BR wait (2)
-when it stops, and there is no way for the new parent to
-effectively simulate this notification.
-.LP
-When the parent receives an event with
-.B PTRACE_EVENT_*
-set,
-the child is not in the normal signal delivery path.
-This means the parent cannot do
-.BR ptrace (PTRACE_CONT)
-with a signal or
-.BR ptrace (PTRACE_KILL).
-.BR kill (2)
-with a
-.B SIGKILL
-signal can be used instead to kill the child process
-after receiving one of these messages.
-.LP
+.\" Covered in more detail above:
+.\" .LP
+.\" Tracing causes a few subtle differences in the semantics of
+.\" traced processes.
+.\" For example, if a process is attached to with
+.\" .BR PTRACE_ATTACH ,
+.\" its original parent can no longer receive notification via
+.\" .BR wait (2)
+.\" when it stops, and there is no way for the new parent to
+.\" effectively simulate this notification.
+.\" .LP
+.\" When the parent receives an event with
+.\" .B PTRACE_EVENT_*
+.\" set,
+.\" the tracee is not in the normal signal delivery path.
+.\" This means the parent cannot do
+.\" .BR ptrace (PTRACE_CONT)
+.\" with a signal or
+.\" .BR ptrace (PTRACE_KILL).
+.\" .BR kill (2)
+.\" with a
+.\" .B SIGKILL
+.\" signal can be used instead to kill the tracee
+.\" after receiving one of these messages.
+.\" .LP
 This page documents the way the
 .BR ptrace ()
 call works currently in Linux.
@@ -505,14 +989,6 @@ Its behavior differs noticeably on other flavors of UNIX.
 In any case, use of
 .BR ptrace ()
 is highly OS- and architecture-specific.
-.LP
-The SunOS man page describes
-.BR ptrace ()
-as "unique and arcane", which it is.
-The proc-based debugging interface
-present in Solaris 2 implements a superset of
-.BR ptrace ()
-functionality in a more powerful and uniform way.
 .SH BUGS
 On hosts with 2.6 kernel headers,
 .B PTRACE_SETOPTIONS
@@ -525,6 +1001,25 @@ This can be worked around by redefining
 to
 .BR PTRACE_OLDSETOPTIONS ,
 if that is defined.
+.LP
+Group-stop notifications are sent to tracer, but not to real parent.
+Last confirmed on 2.6.38.6.
+.LP
+If thread group leader is traced and exits by calling exit syscall,
+PTRACE_EVENT_EXIT stop will happen for it (if requested), but
+subsequent WIFEXITED notification will not be delivered until all other
+threads exit. As explained above, if one of other threads execve's,
+thread group leader death will *never* be reported. If execve-ed thread
+is not traced by this tracer, tracer will never know that execve
+happened.
+One possible workaround is to detach thread group leader instead of
+restarting it in this case. Last confirmed on 2.6.38.6.
+.\" ^^^ need to test/verify this scenario
+.LP
+SIGKILL signal may still cause PTRACE_EVENT_EXIT stop before actual
+signal death. This may be changed in the future - SIGKILL is meant to
+always immediately kill tasks even under ptrace. Last confirmed on
+2.6.38.6.
 .SH "SEE ALSO"
 .BR gdb (1),
 .BR strace (1),
