* [PATCH] man ptrace: add extended description of various ptrace quirks
@ 2011-07-21 11:09 Denys Vlasenko
From: Denys Vlasenko @ 2011-07-21 11:09 UTC
  To: mtk.manpages, Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

Hi Michael,

Please apply attached patch which updates ptrace manpage.
(I'm not sending it inline, google web mail might mangle it. Sorry).

Changes include:

s/parent/tracer/g, s/child/tracee/g - ptrace interface now
is sufficiently cleaned up to not treat tracing process as parent.

Deleted several outright false statements:
- pid 1 can be traced
- tracer is not shown as parent in ps output
- PTRACE_ATTACH is not "the same behavior as if tracee had done
  a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.
- SIGSTOP _can_ be injected.
- Removed mentions of SunOS and Solaris as irrelevant.
- Added a few more known bugs.

Added a large block of text in DESCRIPTION which doesn't focus
on mechanical description of each flag and operation, but rather
tries to describe a bigger picture. The targeted audience is
a person who is reasonably knowledgeable in Unix but has not
spent years working with ptrace, and thus may be unaware of its quirks.
This text went through several iterations of review by Oleg Nesterov
and Tejun Heo.
This block of text intentionally uses as little markup as possible,
otherwise future modifications to it will be very hard to make.

-- 
vda

[-- Attachment #2: d196032aff8a2a828e3bbdbbb35f9fe7ed280028.diff --]
[-- Type: text/x-patch, Size: 43251 bytes --]

commit d196032aff8a2a828e3bbdbbb35f9fe7ed280028
Author: Denys Vlasenko <dvlasenk@redhat.com>
Date:   Thu Jul 21 12:55:49 2011 +0200

    ptrace: add extended description of various ptrace quirks
    
    Changes include:
    
    s/parent/tracer/g, s/child/tracee/g - ptrace interface now
    is sufficiently cleaned up to not treat tracing process as parent.
    
    Deleted several outright false statements:
    - pid 1 can be traced
    - tracer is not shown as parent in ps output
    - PTRACE_ATTACH is not "the same behavior as if tracee had done
      a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.
    - SIGSTOP _can_ be injected.
    - Removed mentions of SunOS and Solaris as irrelevant.
    - Added a few more known bugs.
    
    Added a large block of text in DESCRIPTION which doesn't focus
    on mechanical description of each flag and operation, but rather
    tries to describe a bigger picture. The targeted audience is
    a person who is reasonably knowledgeable in Unix but has not
    spent years working with ptrace, and thus may be unaware of its quirks.
    This text went through several iterations of review by Oleg Nesterov
    and Tejun Heo.
    This block of text intentionally uses as little markup as possible,
    otherwise future modifications to it will be very hard to make.
    
    Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>

diff --git a/man2/ptrace.2 b/man2/ptrace.2
index 9cd5899..8875873 100644
--- a/man2/ptrace.2
+++ b/man2/ptrace.2
@@ -53,45 +53,51 @@ ptrace \- process trace
 .SH DESCRIPTION
 The
 .BR ptrace ()
-system call provides a means by which a parent process may observe
-and control the execution of another process,
-and examine and change its core image and registers.
+system call provides a means by which a process (tracer) may observe
+and control the execution of other processes (tracees),
+and examine and change their core image and registers.
 It is primarily used to implement breakpoint debugging and system
 call tracing.
 .LP
-The parent can initiate a trace by calling
+Tracees first need to be attached to the tracer.
+Attachment and subsequent commands are per-thread: in a
+multi-threaded process, every thread can be individually attached to a
+(potentially different) tracer, or left not attached and thus not
+debugged. Therefore, "tracee" always means "(one) thread", never "a
+(possibly multi-threaded) process". Ptrace commands are always sent to
+a specific tracee using ptrace(PTRACE_foo, pid, ...), where pid is the
+thread ID of the corresponding Linux thread.
+.LP
+The process can initiate a trace by calling
 .BR fork (2)
 and having the resulting child do a
 .BR PTRACE_TRACEME ,
 followed (typically) by an
-.BR exec (3).
-Alternatively, the parent may commence trace of an existing process using
+.BR execve (2).
+Alternatively, the process may commence trace of an existing process using
 .BR PTRACE_ATTACH .
 .LP
-While being traced, the child will stop each time a signal is delivered,
+While being traced, the tracee will stop each time a signal is delivered,
 even if the signal is being ignored.
 (The exception is
 .BR SIGKILL ,
 which has its usual effect.)
-The parent will be notified at its next
+The tracer will be notified at its next
 .BR wait (2)
-and may inspect and modify the child process while it is stopped.
-The parent then causes the child to continue,
+and may inspect and modify the tracee while it is stopped.
+The tracer then causes the tracee to continue,
 optionally ignoring the delivered signal
 (or even delivering a different signal instead).
 .LP
-When the parent is finished tracing, it can terminate the child with
-.B PTRACE_KILL
-or cause it to continue executing in a normal, untraced mode
-via
+When the tracer is finished tracing, it can cause the tracee to continue
+executing in a normal, untraced mode via
 .BR PTRACE_DETACH .
 .LP
 The value of \fIrequest\fP determines the action to be performed:
 .TP
 .B PTRACE_TRACEME
 Indicates that this process is to be traced by its parent.
-Any signal
-(except
+Any signal (except
 .BR SIGKILL )
 delivered to this process will cause it to stop and its
 parent to be notified via
@@ -107,19 +113,18 @@ A process probably shouldn't make this request if its parent
 isn't expecting to trace it.
 (\fIpid\fP, \fIaddr\fP, and \fIdata\fP are ignored.)
 .LP
-The above request is used only by the child process;
-the rest are used only by the parent.
-In the following requests, \fIpid\fP specifies the child process
+The above request is used only by the tracee;
+the rest are used only by the tracer.
+In the following requests, \fIpid\fP specifies the tracee
 to be acted on.
 For requests other than
 .BR PTRACE_KILL ,
-the child process must
-be stopped.
+the tracee must be stopped.
 .TP
 .BR PTRACE_PEEKTEXT ", " PTRACE_PEEKDATA
 Reads a word at the location
 .I addr
-in the child's memory, returning the word as the result of the
+in the tracee's memory, returning the word as the result of the
 .BR ptrace ()
 call.
 Linux does not have separate text and data address spaces, so the two
@@ -131,7 +136,7 @@ requests are currently equivalent.
 .\" and that is the name that seems common on other systems.
 Reads a word at offset
 .I addr
-in the child's USER area,
+in the tracee's USER area,
 which holds the registers and other information about the process
 (see \fI<sys/user.h>\fP).
 The word is returned as the result of the
@@ -147,7 +152,7 @@ Copies the word
 .I data
 to location
 .I addr
-in the child's memory.
+in the tracee's memory.
 As above, the two requests are currently equivalent.
 .TP
 .B PTRACE_POKEUSER
@@ -157,14 +162,14 @@ Copies the word
 .I data
 to offset
 .I addr
-in the child's USER area.
+in the tracee's USER area.
 As above, the offset must typically be word-aligned.
 In order to maintain the integrity of the kernel,
 some modifications to the USER area are disallowed.
 .TP
 .BR PTRACE_GETREGS ", " PTRACE_GETFPREGS
-Copies the child's general purpose or floating-point registers,
-respectively, to location \fIdata\fP in the parent.
+Copies the tracee's general purpose or floating-point registers,
+respectively, to location \fIdata\fP in the tracer.
 See \fI<sys/user.h>\fP for information on
 the format of this data.
 (\fIaddr\fP is ignored.)
@@ -173,12 +178,12 @@ the format of this data.
 Retrieve information about the signal that caused the stop.
 Copies a \fIsiginfo_t\fP structure (see
 .BR sigaction (2))
-from the child to location \fIdata\fP in the parent.
+from the tracee to location \fIdata\fP in the tracer.
 (\fIaddr\fP is ignored.)
 .TP
 .BR PTRACE_SETREGS ", " PTRACE_SETFPREGS
-Copies the child's general purpose or floating-point registers,
-respectively, from location \fIdata\fP in the parent.
+Copies the tracee's general purpose or floating-point registers,
+respectively, from location \fIdata\fP in the tracer.
 As for
 .BR PTRACE_POKEUSER ,
 some general
@@ -188,9 +193,9 @@ purpose register modifications may be disallowed.
 .BR PTRACE_SETSIGINFO " (since Linux 2.3.99-pre6)"
 Set signal information.
 Copies a \fIsiginfo_t\fP structure from location \fIdata\fP in the
-parent to the child.
+tracer to the tracee.
 This will only affect signals that would normally be delivered to
-the child and were caught by the tracer.
+the tracee and were caught by the tracer.
 It may be difficult to tell
 these normal signals from synthetic signals generated by
 .BR ptrace ()
@@ -198,7 +203,7 @@ itself.
 (\fIaddr\fP is ignored.)
 .TP
 .BR PTRACE_SETOPTIONS " (since Linux 2.4.6; see BUGS for caveats)"
-Sets ptrace options from \fIdata\fP in the parent.
+Sets ptrace options from \fIdata\fP.
 (\fIaddr\fP is ignored.)
 \fIdata\fP is interpreted
 as a bit mask of options, which are specified by the following flags:
@@ -213,7 +218,7 @@ between normal traps and those caused by a syscall.
 may not work on all architectures.)
 .TP
 .BR PTRACE_O_TRACEFORK " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR fork (2)
 call with \fISIGTRAP | PTRACE_EVENT_FORK\ <<\ 8\fP and automatically
 start tracing the newly forked process,
@@ -223,7 +228,7 @@ The PID for the new process can be retrieved with
 .BR PTRACE_GETEVENTMSG .
 .TP
 .BR PTRACE_O_TRACEVFORK " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR vfork (2)
 call with \fISIGTRAP | PTRACE_EVENT_VFORK\ <<\ 8\fP and automatically start
 tracing the newly vforked process, which will start with a
@@ -232,7 +237,7 @@ The PID for the new process can be retrieved with
 .BR PTRACE_GETEVENTMSG .
 .TP
 .BR PTRACE_O_TRACECLONE " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR clone (2)
 call with \fISIGTRAP | PTRACE_EVENT_CLONE\ <<\ 8\fP and automatically start
 tracing the newly cloned process, which will start with a
@@ -242,7 +247,7 @@ The PID for the new process can be retrieved with
 This option may not catch
 .BR clone (2)
 calls in all cases.
-If the child calls
+If the tracee calls
 .BR clone (2)
 with the
 .B CLONE_VFORK
@@ -251,7 +256,7 @@ flag,
 will be delivered instead
 if
 .B PTRACE_O_TRACEVFORK
-is set; otherwise if the child calls
+is set; otherwise if the tracee calls
 .BR clone (2)
 with the exit signal set to
 .BR SIGCHLD ,
@@ -262,18 +267,18 @@ if
 is set.
 .TP
 .BR PTRACE_O_TRACEEXEC " (since Linux 2.5.46)"
-Stop the child at the next
+Stop the tracee at the next
 .BR execve (2)
 call with \fISIGTRAP | PTRACE_EVENT_EXEC\ <<\ 8\fP.
 .TP
 .BR PTRACE_O_TRACEVFORKDONE " (since Linux 2.5.60)"
-Stop the child at the completion of the next
+Stop the tracee at the completion of the next
 .BR vfork (2)
 call with \fISIGTRAP | PTRACE_EVENT_VFORK_DONE\ <<\ 8\fP.
 .TP
 .BR PTRACE_O_TRACEEXIT " (since Linux 2.5.60)"
-Stop the child at exit with \fISIGTRAP | PTRACE_EVENT_EXIT\ <<\ 8\fP.
-The child's exit status can be retrieved with
+Stop the tracee at exit with \fISIGTRAP | PTRACE_EVENT_EXIT\ <<\ 8\fP.
+The tracee's exit status can be retrieved with
 .BR PTRACE_GETEVENTMSG .
 This stop will be done early during process exit when registers
 are still available, allowing the tracer to see where the exit occurred,
@@ -287,10 +292,10 @@ happening at this point.
 Retrieve a message (as an
 .IR "unsigned long" )
 about the ptrace event
-that just happened, placing it in the location \fIdata\fP in the parent.
+that just happened, placing it in the location \fIdata\fP in the tracer.
 For
 .B PTRACE_EVENT_EXIT
-this is the child's exit status.
+this is the tracee's exit status.
 For
 .BR PTRACE_EVENT_FORK ,
 .B PTRACE_EVENT_VFORK
@@ -304,23 +309,21 @@ for
 (\fIaddr\fP is ignored.)
 .TP
 .B PTRACE_CONT
-Restarts the stopped child process.
-If \fIdata\fP is nonzero and not
-.BR SIGSTOP ,
-it is interpreted as a signal to be delivered to the child;
+Restarts the stopped tracee process.
+If \fIdata\fP is nonzero, it is interpreted as a signal to be delivered to the tracee;
 otherwise, no signal is delivered.
-Thus, for example, the parent can control
-whether a signal sent to the child is delivered or not.
+Thus, for example, the tracer can control
+whether a signal sent to the tracee is delivered or not.
 (\fIaddr\fP is ignored.)
 .TP
 .BR PTRACE_SYSCALL ", " PTRACE_SINGLESTEP
-Restarts the stopped child as for
+Restarts the stopped tracee as for
 .BR PTRACE_CONT ,
 but arranges for
-the child to be stopped at the next entry to or exit from a system call,
+the tracee to be stopped at the next entry to or exit from a system call,
 or after execution of a single instruction, respectively.
-(The child will also, as usual, be stopped upon receipt of a signal.)
-From the parent's perspective, the child will appear to have been
+(The tracee will also, as usual, be stopped upon receipt of a signal.)
+From the tracer's perspective, the tracee will appear to have been
 stopped by receipt of a
 .BR SIGTRAP .
 So, for
@@ -347,7 +350,7 @@ For
 do the same
 but also singlestep if not a syscall.
 This call is used by programs like
-User Mode Linux that want to emulate all the child's system calls.
+User Mode Linux that want to emulate all the tracee's system calls.
 The
 .I data
 argument is treated as for
@@ -356,44 +359,523 @@ argument is treated as for
 not supported on all architectures.)
 .TP
 .B PTRACE_KILL
-Sends the child a
+Sends the tracee a
 .B SIGKILL
 to terminate it.
 (\fIaddr\fP and \fIdata\fP are ignored.)
+This operation is deprecated; use kill(SIGKILL) or tgkill(SIGKILL) instead.
 .TP
 .B PTRACE_ATTACH
 Attaches to the process specified in
 .IR pid ,
-making it a traced "child" of the calling process;
-the behavior of the child is as if it had done a
-.BR PTRACE_TRACEME .
-The calling process actually becomes the parent of the child
-process for most purposes (e.g., it will receive
-notification of child events and appears in
-.BR ps (1)
-output as the child's parent), but a
-.BR getppid (2)
-by the child will still return the PID of the original parent.
-The child is sent a
+making it a tracee of the calling process.
+.\" Not true:
+.\" ; the behavior of the tracee is as if it had done a
+.\" .BR PTRACE_TRACEME .
+.\" The calling process actually becomes the parent of the tracee
+.\" process for most purposes (e.g., it will receive
+.\" notification of tracee events and appears in
+.\" .BR ps (1)
+.\" output as the tracee's parent), but a
+.\" .BR getppid (2)
+.\" by the tracee will still return the PID of the original parent.
+The tracee is sent a
 .BR SIGSTOP ,
 but will not necessarily have stopped
 by the completion of this call; use
 .BR wait (2)
-to wait for the child to stop.
+to wait for the tracee to stop. See the "Attaching and detaching" subsection
+for additional information.
 (\fIaddr\fP and \fIdata\fP are ignored.)
 .TP
 .B PTRACE_DETACH
-Restarts the stopped child as for
+Restarts the stopped tracee as for
 .BR PTRACE_CONT ,
-but first detaches
-from the process, undoing the reparenting effect of
-.BR PTRACE_ATTACH ,
-and the effects of
-.BR PTRACE_TRACEME .
-Although perhaps not intended, under Linux a traced child can be
+but first detaches from it.
+Under Linux a tracee can be
 detached in this way regardless of which method was used to initiate
 tracing.
 (\fIaddr\fP is ignored.)
+.\"
+.\" In the text below, we decided to avoid prettifying the text with markup:
+.\" it would make the source nearly impossible to edit, and we _do_ intend
+.\" to edit it often, in order to keep it updated:
+.\" ptrace API is full of quirks, no need to compound this situation by
+.\" making it excruciatingly painful to document them!
+.\"
+.SS Death under ptrace
+When a (possibly multi-threaded) process receives a killing signal (a
+signal set to SIG_DFL and whose default action is to kill the process),
+all threads exit. Tracees report their death to their tracer(s). The
+notification about this event is delivered through waitpid API.
+.LP
+Note that killing signal will first cause signal-delivery-stop (on one
+tracee only), and only after it is injected by tracer (or after it was
+dispatched to a thread which isn't traced), death from signal will
+happen on ALL tracees within multi-threaded process.
+.LP
+SIGKILL operates similarly, with exceptions. No signal-delivery-stop is
+generated for SIGKILL and therefore tracer can't suppress it. SIGKILL
+kills even within syscalls (syscall-exit-stop is not generated prior to
+death by SIGKILL). The net effect is that SIGKILL always kills the
+process (all its threads), even if some threads of the process are
+ptraced.
+.LP
+Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This
+operation is deprecated; use kill(SIGKILL) or tgkill(SIGKILL) instead.
+The problem with this operation is that it requires tracee to be in
+signal-delivery-stop, otherwise it may not work (may complete
+successfully but won't kill the tracee), whereas tgkill(SIGKILL)
+has no such limitation.
+.LP
+[Note: deprecation suggested by Oleg Nesterov. He prefers to deprecate
+it instead of describing (and needing to support) PTRACE_KILL's quirks.]
+.LP
+When tracee executes exit syscall, it reports its death to its tracer.
+Other threads are not affected.
+.LP
+When any thread executes exit_group syscall, every tracee in its thread
+group reports its death to its tracer.
+.LP
+If PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen
+before actual death. This applies to exits on exit syscall, exit_group
+syscall, signal deaths (except SIGKILL), and when threads are torn down
+on execve in multi-threaded process.
+.LP
+Tracer cannot assume that ptrace-stopped tracee exists. There are many
+scenarios when tracee may die while stopped (such as SIGKILL).
+Therefore, tracer must always be prepared to handle ESRCH error on any
+ptrace operation. Unfortunately, the same error is returned if tracee
+exists but is not ptrace-stopped (for commands which require stopped
+tracee), or if it is not traced by process which issued ptrace call.
+Tracer needs to keep track of stopped/running state, and interpret
+ESRCH as "tracee died unexpectedly" only if it knows that tracee has
+been observed to enter ptrace-stop. Note that there is no guarantee
+that waitpid(WNOHANG) will reliably report tracee's death status if
+ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead.
+IOW: tracee may be "not yet fully dead" but already refusing ptrace ops.
+.LP
+Tracer cannot assume that tracee ALWAYS ends its life by reporting
+WIFEXITED(status) or WIFSIGNALED(status).
+.LP
+.\" or can it? Do we include such a promise into ptrace API?
+.SS Stopped states
+A tracee can be in two states: running or stopped.
+.LP
+There are many kinds of states when tracee is stopped, and in ptrace
+discussions they are often conflated. Therefore, it is important to use
+precise terms.
+.LP
+In this document, any stopped state in which tracee is ready to accept
+ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can
+be further subdivided into signal-delivery-stop, group-stop,
+syscall-stop and so on. They are described in detail later.
+.LP
+When running tracee enters ptrace-stop, it notifies its tracer using
+waitpid API. Tracer should use waitpid family of syscalls to wait for
+tracee to stop. Most of this document assumes that tracer waits with:
+.LP
+	pid = waitpid(pid_or_minus_1, &status, __WALL);
+.LP
+Ptrace-stopped tracees are reported as returns with pid > 0 and
+WIFSTOPPED(status) == true.
+.LP
+.\" Do we require __WALL usage, or will just using 0 be ok? Are the
+.\" rules different if user wants to use waitid? Will waitid require
+.\" WEXITED?
+.LP
+__WALL value does not include WSTOPPED and WEXITED bits, but implies
+their functionality.
+.LP
+Setting of WCONTINUED bit in waitpid flags is not recommended: the
+continued state is per-process and consuming it can confuse real parent
+of the tracee.
+.LP
+Use of WNOHANG bit in waitpid flags may cause waitpid to return 0 ("no
+wait results available yet") even if tracer knows there should be a
+notification. Example: kill(tracee, SIGKILL); waitpid(tracee, &status,
+__WALL | WNOHANG);
+.\" waitid usage? WNOWAIT?
+.\" describe how wait notifications queue (or not queue)
+.LP
+The following kinds of ptrace-stops exist: signal-delivery-stops,
+group-stop, PTRACE_EVENT stops, syscall-stops [, SINGLESTEP, SYSEMU,
+SYSEMU_SINGLESTEP]. They all are reported as waitpid result with
+WIFSTOPPED(status) == true. They may be differentiated by checking
+(status >> 8) value, and if looking at (status >> 8) value doesn't
+resolve ambiguity, by querying PTRACE_GETSIGINFO. (Note:
+WSTOPSIG(status) macro returns ((status >> 8) & 0xff) value).
+.SS Signal-delivery-stop
+When (possibly multi-threaded) process receives any signal except
+SIGKILL, kernel selects a thread which handles the signal (if signal is
+generated with t[g]kill, thread selection is done by user). If selected
+thread is traced, it enters signal-delivery-stop. By this point, signal
+is not yet delivered to the process, and can be suppressed by tracer.
+If tracer doesn't suppress the signal, it passes signal to tracee in
+the next ptrace request. This second step of signal delivery is called
+"signal injection" in this document. Note that if signal is blocked,
+signal-delivery-stop doesn't happen until signal is unblocked, with the
+usual exception that SIGSTOP can't be blocked.
+.LP
+Signal-delivery-stop is observed by tracer as waitpid returning with
+WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If
+WSTOPSIG(status) == SIGTRAP, this may be a different kind of
+ptrace-stop - see "Syscall-stops" and "execve" sections below for
+details. If WSTOPSIG(status) == stopping signal, this may be a
+group-stop - see below.
+.SS Signal injection and suppression
+After signal-delivery-stop is observed by tracer, tracer should restart
+tracee with
+.LP
+	ptrace(PTRACE_rest, pid, 0, sig)
+.LP
+call, where PTRACE_rest is one of the restarting ptrace ops. If sig is
+0, then signal is not delivered. Otherwise, signal sig is delivered.
+This operation is called "signal injection" in this document, to
+distinguish it from signal-delivery-stop.
+.LP
+Note that sig value may be different from WSTOPSIG(status) value -
+tracer can cause a different signal to be injected.
+.LP
+Note that suppressed signal still causes syscalls to return
+prematurely. Restartable syscalls will be restarted (tracer will
+observe tracee to execute restart_syscall(2) syscall if tracer uses
+PTRACE_SYSCALL), non-restartable syscalls (for example, nanosleep) may
+return with -EINTR even though no observable signal is injected to the
+tracee.
+.LP
+Note that restarting ptrace commands issued in ptrace-stops other than
+signal-delivery-stop are not guaranteed to inject a signal, even if sig
+is nonzero. No error is reported, nonzero sig may simply be ignored.
+Ptrace users should not try to "create new signal" this way: use
+tgkill(2) instead.
+.LP
+This is a cause of confusion among ptrace users. One typical scenario
+is that tracer observes group-stop, mistakes it for
+signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0,
+stopsig) with the intention of injecting stopsig, but stopsig gets
+ignored and tracee continues to run.
+.LP
+SIGCONT signal has a side effect of waking up (all threads of)
+group-stopped process. This side effect happens before
+signal-delivery-stop. Tracer can't suppress this side-effect (it can
+only suppress signal injection, which only causes SIGCONT handler to
+not be executed in the tracee, if such handler is installed). In fact,
+waking up from group-stop may be followed by signal-delivery-stop for
+signal(s) *other than* SIGCONT, if they were pending when SIGCONT was
+delivered. IOW: SIGCONT may not be the first signal observed by the
+tracee after it was sent.
+.LP
+Stopping signals cause (all threads of) process to enter group-stop.
+This side effect happens after signal injection, and therefore can be
+suppressed by tracer.
+.LP
+PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which
+corresponds to delivered signal. PTRACE_SETSIGINFO may be used to
+modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t,
+si_signo field and sig parameter in restarting command must match,
+otherwise the result is undefined.
+.SS Group-stop
+When a (possibly multi-threaded) process receives a stopping signal,
+all threads stop. If some threads are traced, they enter a group-stop.
+Note that stopping signal will first cause signal-delivery-stop (on one
+tracee only), and only after it is injected by tracer (or after it was
+dispatched to a thread which isn't traced), group-stop will be
+initiated on ALL tracees within multi-threaded process. As usual, every
+tracee reports its group-stop separately to corresponding tracer.
+.LP
+Group-stop is observed by tracer as waitpid returning with
+WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result
+is returned by some other classes of ptrace-stops, therefore the
+recommended practice is to perform
+.LP
+	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
+.LP
+call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP,
+SIGTTIN or SIGTTOU - only these four signals are stopping signals. If
+tracer sees something else, it can't be group-stop. Otherwise, tracer
+needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with
+EINVAL, then it is definitely a group-stop. (Other failure codes are
+possible, such as ESRCH "no such process" if SIGKILL killed the tracee).
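The group-stop check described above can be sketched as a pair of small C helpers. This is only an illustrative sketch, not part of the patch: the function names and the idea of passing in the PTRACE_GETSIGINFO return value and errno are assumptions made here so that the decision logic can be shown without a live tracee.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Only these four signals are stopping signals, so only they can
 * accompany a group-stop. */
bool is_stopping_signal(int sig)
{
    return sig == SIGSTOP || sig == SIGTSTP ||
           sig == SIGTTIN || sig == SIGTTOU;
}

/*
 * Hypothetical helper: decide whether a reported stop is a group-stop.
 * stopsig is WSTOPSIG(status). getsiginfo_ret and getsiginfo_errno are
 * the return value and errno of
 *     ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
 * and matter only when stopsig is a stopping signal; for any other
 * signal the ptrace call can be skipped entirely.
 */
bool is_group_stop(int stopsig, long getsiginfo_ret, int getsiginfo_errno)
{
    if (!is_stopping_signal(stopsig))
        return false;   /* cannot be a group-stop */
    /* EINVAL means group-stop; other errors (e.g. ESRCH) do not. */
    return getsiginfo_ret == -1 && getsiginfo_errno == EINVAL;
}
```

A tracer would typically call is_stopping_signal() first and issue PTRACE_GETSIGINFO only when it returns true, which avoids the extra ptrace call for most stops.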
+.LP
+As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it
+restarts or kills it, tracee will not run, and will not send
+notifications (except SIGKILL death) to tracer, even if tracer enters
+into another waitpid call.
+.LP
+Currently, it causes a problem with transparent handling of stopping
+signals: if tracer restarts tracee after group-stop, SIGSTOP is
+effectively ignored: tracee doesn't remain stopped, it runs. If tracer
+doesn't restart tracee before entering into next waitpid, future
+SIGCONT will not be reported to the tracer, which would make SIGCONT
+have no effect.
+.SS PTRACE_EVENT stops
+If tracer sets PTRACE_O_TRACEfoo options, tracee will enter ptrace-stops
+called PTRACE_EVENT stops.
+.LP
+PTRACE_EVENT stops are observed by tracer as waitpid returning with
+WIFSTOPPED(status) == true, WSTOPSIG(status) == SIGTRAP. Additional bit
+is set in a higher byte of status word: value (status >> 8)
+will be (SIGTRAP | PTRACE_EVENT_foo << 8). The following events exist:
+.LP
+PTRACE_EVENT_VFORK - stop before return from vfork or clone+CLONE_VFORK.
+When tracee is continued after this stop, it will wait for child to
+exit/exec before continuing its execution (IOW: usual behavior on
+vfork).
+.LP
+PTRACE_EVENT_FORK - stop before return from fork or clone+SIGCHLD
+.LP
+PTRACE_EVENT_CLONE - stop before return from clone
+.LP
+PTRACE_EVENT_VFORK_DONE - stop before return from
+vfork or clone+CLONE_VFORK, but after vforked child unblocked this
+tracee by exiting or exec'ing.
+.LP
+For all four stops described above: stop occurs in parent, not in newly
+created thread. PTRACE_GETEVENTMSG can be used to retrieve new thread's
+tid.
+.LP
+PTRACE_EVENT_EXEC - stop before return from execve.
+.LP
+PTRACE_EVENT_EXIT - stop before exit (including death from exit_group),
+signal death, or exit caused by execve in multi-threaded process.
+PTRACE_GETEVENTMSG returns exit status. Registers can be examined
+(unlike when "real" exit happens). The tracee is still alive, it needs
+to be PTRACE_CONTed or PTRACE_DETACHed to finish exit.
+.LP
+PTRACE_GETSIGINFO on PTRACE_EVENT stops returns si_signo = SIGTRAP,
+si_code = (event << 8) | SIGTRAP.
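Since a PTRACE_EVENT stop is reported with (status >> 8) equal to (SIGTRAP | PTRACE_EVENT_foo << 8), the event code can be recovered from the waitpid status by shifting. The helper below is a hedged sketch; its name is hypothetical and it is not part of the patch.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: return the PTRACE_EVENT_foo code carried by a
 * waitpid status, or 0 if the status does not describe a PTRACE_EVENT
 * stop. For such a stop, (status >> 8) == (SIGTRAP | (event << 8)),
 * so the event code occupies bits 16 and up of the status word.
 */
int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status))
        return 0;
    if (WSTOPSIG(status) != SIGTRAP)
        return 0;   /* includes syscall-stops reported as SIGTRAP | 0x80 */
    return (int)((unsigned int)status >> 16);
}
```

A return value of 0 means the stop is some other SIGTRAP stop (or not a stop at all); a nonzero value can be compared against PTRACE_EVENT_FORK, PTRACE_EVENT_EXEC and so on from <sys/ptrace.h>.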
+.SS Syscall-stops
+If tracee was restarted by PTRACE_SYSCALL, tracee enters
+syscall-enter-stop just prior to entering any syscall. If tracer
+restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when
+syscall is finished, or if it is interrupted by a signal. (That is,
+signal-delivery-stop never happens between syscall-enter-stop and
+syscall-exit-stop, it happens *after* syscall-exit-stop).
+.LP
+Other possibilities are that tracee may stop in a PTRACE_EVENT stop,
+exit (if it entered exit or exit_group syscall), be killed by SIGKILL,
+or die silently (if it is a thread group leader, execve syscall happened
+in another thread, and that thread is not traced by the same tracer -
+this situation is discussed later).
+.LP
+Syscall-enter-stop and syscall-exit-stop are observed by tracer as
+waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) ==
+SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then
+WSTOPSIG(status) == (SIGTRAP | 0x80).
+.LP
+Syscall-stops can be distinguished from signal-delivery-stop with
+SIGTRAP by querying PTRACE_GETSIGINFO: si_code <= 0 if SIGTRAP was sent by usual
+suspects like [tg]kill/sigqueue/etc; or = SI_KERNEL (0x80) if sent by
+kernel, whereas syscall-stops have si_code = SIGTRAP or (SIGTRAP |
+0x80). However, syscall-stops happen very often (twice per syscall),
+and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat
+expensive.
+.LP
+On some architectures, they can be distinguished by examining registers.
+For example, on x86 rax = -ENOSYS in syscall-enter-stop. Since SIGTRAP
+(like any other signal) always happens *after* syscall-exit-stop, and
+at this point rax almost never contains -ENOSYS, SIGTRAP looks like
+"syscall-stop which is not syscall-enter-stop", IOW: it looks like a
+"stray syscall-exit-stop" and can be detected this way. But such
+detection is fragile and is best avoided.
+.LP
+Using PTRACE_O_TRACESYSGOOD option is a recommended method, since it is
+reliable and does not incur performance penalty.
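With PTRACE_O_TRACESYSGOOD set, recognizing a syscall-stop reduces to a comparison on the waitpid status. The following is a minimal sketch under that assumption; the helper name is hypothetical and not part of the patch.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: true if status reports a syscall-stop, assuming
 * the tracer has set PTRACE_O_TRACESYSGOOD on this tracee. Without that
 * option the stop signal is plain SIGTRAP, and this check cannot tell a
 * syscall-stop from other SIGTRAP stops.
 */
bool is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```

The check does not say whether the stop is a syscall-enter-stop or a syscall-exit-stop; the tracer still has to track that itself.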
+.LP
+Syscall-enter-stop and syscall-exit-stop are indistinguishable from
+each other by tracer. Tracer needs to keep track of the sequence of
+ptrace-stops in order to not misinterpret syscall-enter-stop as
+syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is
+always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's
+death - no other kinds of ptrace-stop can occur in between.
+.LP
+If after syscall-enter-stop tracer uses restarting command other than
+PTRACE_SYSCALL, syscall-exit-stop is not generated.
+.LP
+PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code
+= SIGTRAP or (SIGTRAP | 0x80).
+.SS SINGLESTEP, SYSEMU, SYSEMU_SINGLESTEP stops
+(TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
+PTRACE_SYSEMU_SINGLESTEP)
+.SS Informational and restarting ptrace commands
+Most ptrace commands (all except ATTACH, TRACEME, KILL) require tracee
+to be in a ptrace-stop, otherwise they fail with ESRCH.
+.LP
+When tracee is in ptrace-stop, tracer can read and write data to tracee
+using informational commands. They leave tracee in ptrace-stopped state:
+.LP
+.nf
+longv = ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
+	ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
+	ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
+	ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
+	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
+	ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
+	ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
+	ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
+.fi
+.LP
+Note that some errors are not reported. For example, setting siginfo
+may have no effect in some ptrace-stops, yet the call may succeed
+(return 0 and not set errno); querying GETEVENTMSG may succeed
+and return some random value if current ptrace-stop is not documented
+as returning meaningful event message.
+.LP
+ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags) affects one tracee.
+Current flags are replaced. Flags are inherited by new tracees created
+and "auto-attached" via active PTRACE_O_TRACE[V]FORK or
+PTRACE_O_TRACECLONE options.
+.LP
+Another group of commands makes ptrace-stopped tracee run. They have
+the form:
+.LP
+	ptrace(PTRACE_cmd, pid, 0, sig);
+.LP
+where cmd is CONT, DETACH, SYSCALL, SINGLESTEP, SYSEMU, or
+SYSEMU_SINGLESTEP. If tracee is in signal-delivery-stop, sig is the
+signal to be injected. Otherwise, sig may be ignored (recommended
+practice is to always pass 0 in these cases).
+.SS Attaching and detaching
+A thread can be attached to tracer using ptrace(PTRACE_ATTACH, pid, 0,
+0) call. This also sends SIGSTOP to this thread. If tracer wants this
+SIGSTOP to have no effect, it needs to suppress it. Note that if other
+signals are concurrently sent to this thread during attach, tracer may
+see tracee enter signal-delivery-stop with other signal(s) first! The
+usual practice is to reinject these signals until SIGSTOP is seen, then
+suppress SIGSTOP injection. The design bug here is that attach and
+concurrent SIGSTOP are racing and concurrent SIGSTOP may be lost.
+.\" Describe how to attach to a thread which is already group-stopped.
+.LP
+Since attaching sends SIGSTOP and tracer usually suppresses it, this
+may cause stray EINTR return from the currently executing syscall in
+the tracee, as described in "signal injection and suppression" section.
+.LP
+ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a
+tracee. It continues to run (doesn't enter ptrace-stop). A common
+practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and
+allow parent (which is our tracer now) to observe our
+signal-delivery-stop.
+.LP
+If PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options are in effect,
+then children created by (vfork or clone(CLONE_VFORK)), (fork or
+clone(SIGCHLD)) and (other kinds of clone) respectively are
+automatically attached to the same tracer which traced their parent.
+SIGSTOP is delivered to them, causing them to enter
+signal-delivery-stop after they exit syscall which created them.
+.LP
+Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig).
+PTRACE_DETACH is a restarting operation, therefore it requires tracee
+to be in ptrace-stop. If tracee is in signal-delivery-stop, signal can
+be injected. Otherwise, sig parameter may be silently ignored.
+.LP
+If tracee is running when tracer wants to detach it, the usual solution
+is to send SIGSTOP (using tgkill, to make sure it goes to the correct
+thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP
+and then detach it (suppressing SIGSTOP injection). Design bug is that
+this can race with concurrent SIGSTOPs. Another complication is that
+tracee may enter other ptrace-stops and needs to be restarted and
+waited for again, until SIGSTOP is seen. Yet another complication is to
+be sure that tracee is not already ptrace-stopped, because no signal
+delivery happens while it is - not even SIGSTOP.
+.\" Describe how to detach from a group-stopped tracee so that it
+.\" doesn't run, but continues to wait for SIGCONT.
+.LP
+If tracer dies, all tracees are automatically detached and restarted,
+unless they were in group-stop. Handling of restart from group-stop is
+currently buggy, but "as planned" behavior is to leave tracee stopped
+and waiting for SIGCONT. If tracee is restarted from
+signal-delivery-stop, pending signal is injected.
+.SS execve under ptrace
+During execve, kernel destroys all other threads in the process, and
+resets execve'ing thread tid to tgid (process id). This looks very
+confusing to tracers:
+.LP
+All other threads stop in PTRACE_EVENT_EXIT stop, if requested by active
+ptrace option. Then all other threads except thread group leader report
+death as if they exited via exit syscall with exit code 0. Then
+PTRACE_EVENT_EXEC stop happens, if requested by active ptrace option.
+.\" (on which tracee - leader? execve-ing one?)
+.LP
+The execve-ing tracee changes its pid while it is in execve syscall.
+(Remember, under ptrace 'pid' returned from waitpid, or fed into ptrace
+calls, is tracee's tid). That is, pid is reset to process id, which
+coincides with thread group leader tid.
+.LP
+If thread group leader has reported its death by this time, for tracer
+this looks like dead thread leader "reappears from nowhere". If thread
+group leader was still alive, for tracer this may look as if thread
+group leader returns from a different syscall than it entered, or even
+"returned from syscall even though it was not in any syscall". If
+thread group leader was not traced (or was traced by a different
+tracer), during execve it will appear as if it has become a tracee of
+the tracer of execve-ing tracee. All these effects are the artifacts of
+pid change.
+.LP
+PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this
+case. It enables PTRACE_EVENT_EXEC stop which occurs before execve
+syscall returns.
+.LP
+Pid change happens before PTRACE_EVENT_EXEC stop, not after.
+.LP
+When tracer receives PTRACE_EVENT_EXEC stop notification, it is
+guaranteed that except this tracee and thread group leader, no other
+threads from the process are alive.
+.LP
+On receiving this notification, tracer should clean up all its internal
+data structures about all threads of this process, and retain only one
+data structure, one which describes single still running tracee, with
+pid = tgid = process id.
+.LP
+Currently, there is no way to retrieve former pid of execve-ing tracee.
+If tracer doesn't keep track of its tracees' thread group relations, it
+may be unable to know which tracee execve-ed and therefore no longer
+exists under old pid due to pid change.
+.LP
+Example: two threads execve at the same time:
+.LP
+.nf
+*** we get syscall-enter-stop in thread 1: **
+PID1 execve("/bin/foo", "foo" <unfinished ...>
+*** we issue PTRACE_SYSCALL for thread 1 **
+*** we get syscall-enter-stop in thread 2: **
+PID2 execve("/bin/bar", "bar" <unfinished ...>
+*** we issue PTRACE_SYSCALL for thread 2 **
+*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
+*** we get syscall-exit-stop for PID0: **
+PID0 <... execve resumed> )             = 0
+.fi
+.LP
+In this situation there is no way to know which execve succeeded.
+.LP
+If PTRACE_O_TRACEEXEC option is NOT in effect for the execve-ing
+tracee, kernel delivers an extra SIGTRAP to tracee after execve syscall
+returns. This is an ordinary signal (similar to one which can be
+generated by "kill -TRAP"), not a special kind of ptrace-stop.
+GETSIGINFO on it has si_code = 0 (SI_USER). It can be blocked by signal
+mask, and thus can happen (much) later.
+.LP
+Usually, tracer (for example, strace) would not want to show this extra
+post-execve SIGTRAP signal to the user, and would suppress its delivery
+to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal).
+However, determining *which* SIGTRAP to suppress is not easy. Setting
+PTRACE_O_TRACEEXEC option and thus suppressing this extra SIGTRAP is
+the recommended approach.
+.SS Real parent
+Ptrace API (ab)uses standard Unix parent/child signaling over waitpid.
+This used to cause real parent of the process to stop receiving several
+kinds of waitpid notifications when child process is traced by some
+other process.
+.LP
+Many of these bugs have been fixed, but as of 2.6.38 several still
+exist - see BUGS section below.
+.LP
+As of 2.6.38, the following is believed to work correctly:
+.LP
+* exit/death by signal is reported first to tracer, then, when tracer
+consumes waitpid result, to real parent (to real parent only when the
+whole multi-threaded process exits). If they are the same process, the
+report is sent only once.
 .SH "RETURN VALUE"
 On success,
 .B PTRACE_PEEK*
@@ -415,7 +897,7 @@ register.
 .TP
 .B EFAULT
 There was an attempt to read from or write to an invalid area in
-the parent's or child's memory,
+the tracer's or tracee's memory,
 probably because the area wasn't mapped or accessible.
 Unfortunately, under Linux, different variations of this fault
 will return
@@ -429,14 +911,14 @@ An attempt was made to set an invalid option.
 .TP
 .B EIO
 \fIrequest\fP is invalid, or an attempt was made to read from or
-write to an invalid area in the parent's or child's memory,
+write to an invalid area in the tracer's or tracee's memory,
 or there was a word-alignment violation,
 or an invalid signal was specified during a restart request.
 .TP
 .B EPERM
 The specified process cannot be traced.
 This could be because the
-parent has insufficient privileges (the required capability is
+tracer has insufficient privileges (the required capability is
 .BR CAP_SYS_PTRACE );
 unprivileged processes cannot trace processes that they
 cannot send signals to or those running
@@ -461,10 +943,11 @@ This means that unneeded trailing arguments may be omitted,
 though doing so makes use of undocumented
 .BR gcc (1)
 behavior.
-.LP
-.BR init (8),
-the process with PID 1, may not be traced.
-.LP
+.\" Not true anymore:
+.\" .LP
+.\" .BR init (8),
+.\" the process with PID 1, may not be traced.
+.\" .LP
 The layout of the contents of memory and the USER area are quite OS- and
 architecture-specific.
 The offset supplied, and the data returned,
@@ -474,30 +957,31 @@ might not entirely match with the definition of
 .LP
 The size of a "word" is determined by the OS variant
 (e.g., for 32-bit Linux it is 32 bits, etc.).
-.LP
-Tracing causes a few subtle differences in the semantics of
-traced processes.
-For example, if a process is attached to with
-.BR PTRACE_ATTACH ,
-its original parent can no longer receive notification via
-.BR wait (2)
-when it stops, and there is no way for the new parent to
-effectively simulate this notification.
-.LP
-When the parent receives an event with
-.B PTRACE_EVENT_*
-set,
-the child is not in the normal signal delivery path.
-This means the parent cannot do
-.BR ptrace (PTRACE_CONT)
-with a signal or
-.BR ptrace (PTRACE_KILL).
-.BR kill (2)
-with a
-.B SIGKILL
-signal can be used instead to kill the child process
-after receiving one of these messages.
-.LP
+.\" Covered in more details above:
+.\" .LP
+.\" Tracing causes a few subtle differences in the semantics of
+.\" traced processes.
+.\" For example, if a process is attached to with
+.\" .BR PTRACE_ATTACH ,
+.\" its original parent can no longer receive notification via
+.\" .BR wait (2)
+.\" when it stops, and there is no way for the new parent to
+.\" effectively simulate this notification.
+.\" .LP
+.\" When the parent receives an event with
+.\" .B PTRACE_EVENT_*
+.\" set,
+.\" the tracee is not in the normal signal delivery path.
+.\" This means the parent cannot do
+.\" .BR ptrace (PTRACE_CONT)
+.\" with a signal or
+.\" .BR ptrace (PTRACE_KILL).
+.\" .BR kill (2)
+.\" with a
+.\" .B SIGKILL
+.\" signal can be used instead to kill the tracee
+.\" after receiving one of these messages.
+.\" .LP
 This page documents the way the
 .BR ptrace ()
 call works currently in Linux.
@@ -505,14 +989,6 @@ Its behavior differs noticeably on other flavors of UNIX.
 In any case, use of
 .BR ptrace ()
 is highly OS- and architecture-specific.
-.LP
-The SunOS man page describes
-.BR ptrace ()
-as "unique and arcane", which it is.
-The proc-based debugging interface
-present in Solaris 2 implements a superset of
-.BR ptrace ()
-functionality in a more powerful and uniform way.
 .SH BUGS
 On hosts with 2.6 kernel headers,
 .B PTRACE_SETOPTIONS
@@ -525,6 +1001,25 @@ This can be worked around by redefining
 to
 .BR PTRACE_OLDSETOPTIONS ,
 if that is defined.
+.LP
+Group-stop notifications are sent to tracer, but not to real parent.
+Last confirmed on 2.6.38.6.
+.LP
+If thread group leader is traced and exits by calling exit syscall,
+PTRACE_EVENT_EXIT stop will happen for it (if requested), but
+subsequent WIFEXITED notification will not be delivered until all other
+threads exit. As explained above, if one of other threads execve's,
+thread group leader death will *never* be reported. If execve-ed thread
+is not traced by this tracer, tracer will never know that execve
+happened.
+One possible workaround is to detach thread group leader instead of
+restarting it in this case. Last confirmed on 2.6.38.6.
+.\" ^^^ need to test/verify this scenario
+.LP
+SIGKILL signal may still cause PTRACE_EVENT_EXIT stop before actual
+signal death. This may be changed in the future - SIGKILL is meant to
+always immediately kill tasks even under ptrace. Last confirmed on
+2.6.38.6.
 .SH "SEE ALSO"
 .BR gdb (1),
 .BR strace (1),


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-07-21 11:09 [PATCH] man ptrace: add extended description of various ptrace quirks Denys Vlasenko
@ 2011-07-21 16:51 ` Oleg Nesterov
  2011-07-21 18:00   ` [PATCH 0/1] (Was: man ptrace: add extended description of various ptrace quirks) Oleg Nesterov
  2011-09-21  5:10   ` [PATCH] man ptrace: add extended description of various ptrace quirks Michael Kerrisk
  2011-09-25  6:10 ` Michael Kerrisk
  2011-09-29 19:08 ` Michael Kerrisk
  2 siblings, 2 replies; 18+ messages in thread
From: Oleg Nesterov @ 2011-07-21 16:51 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: mtk.manpages, Jan Kratochvil, linux-kernel, Tejun Heo

On 07/21, Denys Vlasenko wrote:
>
> Deleted several outright false statements:
> - pid 1 can be traced
> - tracer is not shown as parent in ps output
> - PTRACE_ATTACH is not "the same behavior as if tracee had done
>   a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.
> - SIGSTOP _can_ be injected.

Yes, this is correct, thanks.

> +Tracer can not assume that tracee ALWAYS ends its life by reporting
> +WIFEXITED(status) or WIFSIGNALED(status).
> +.LP
> +.\" or can it? Do we include such a promise into ptrace API?

IIRC, we already discussed this... The traced group leader can
disappear during mt-exec, otherwise the tracee can never go away
silently.

> +Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This
> +operation is deprecated, use kill(SIGKILL) or tgkill(SIGKILL) instead.
> +The problem with this operation is that it requires tracee to be in
> +signal-delivery-stop, otherwise it may not work (may complete
> +successfully but won't kill the tracee),

In short, ptrace(PTRACE_KILL) is more or less ptrace(PTRACE_CONT, SIGKILL),
but it always returns 0. IOW, it never worked as described in the man
page. And I guess today nobody can explain why PTRACE_KILL exists.

Oleg.



* [PATCH 0/1] (Was: man ptrace: add extended description of various ptrace quirks)
  2011-07-21 16:51 ` Oleg Nesterov
@ 2011-07-21 18:00   ` Oleg Nesterov
  2011-07-21 18:00     ` [PATCH 1/1] ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever Oleg Nesterov
  2011-09-21  5:10   ` [PATCH] man ptrace: add extended description of various ptrace quirks Michael Kerrisk
  1 sibling, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2011-07-21 18:00 UTC (permalink / raw)
  To: Denys Vlasenko, Tejun Heo; +Cc: mtk.manpages, Jan Kratochvil, linux-kernel

On 07/21, Oleg Nesterov wrote:
>
> IIRC, we already discussed this... The traced group leader can
> disappear during mt-exec,

Hmm. But the tracer should not block in do_wait() in this case
anyway...

Oleg.



* [PATCH 1/1] ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
  2011-07-21 18:00   ` [PATCH 0/1] (Was: man ptrace: add extended description of various ptrace quirks) Oleg Nesterov
@ 2011-07-21 18:00     ` Oleg Nesterov
  2011-07-22  8:44       ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2011-07-21 18:00 UTC (permalink / raw)
  To: Denys Vlasenko, Tejun Heo; +Cc: mtk.manpages, Jan Kratochvil, linux-kernel

Test-case:

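	#include <assert.h>
	#include <pthread.h>
	#include <signal.h>
	#include <unistd.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>

	void *tfunc(void *arg)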
	{
		execvp("true", NULL);
		return NULL;
	}

	int main(void)
	{
		int pid;

		if (fork()) {
			pthread_t t;

			kill(getpid(), SIGSTOP);

			pthread_create(&t, NULL, tfunc, NULL);

			for (;;)
				pause();
		}

		pid = getppid();
		assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);

		while (wait(NULL) > 0)
			ptrace(PTRACE_CONT, pid, 0,0);

		return 0;
	}

It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.

Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 fs/exec.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- ptrace/fs/exec.c~ptrace_mt_exec_wait_hang	2011-07-17 20:16:36.000000000 +0200
+++ ptrace/fs/exec.c	2011-07-21 19:56:22.000000000 +0200
@@ -967,6 +967,14 @@ static int de_thread(struct task_struct 
 
 		BUG_ON(leader->exit_state != EXIT_ZOMBIE);
 		leader->exit_state = EXIT_DEAD;
+
+		/*
+		 * We are going to release_task()->ptrace_unlink() silently,
+		 * the tracer can sleep in do_wait(). EXIT_DEAD guarantees
+		 * the tracer won't block again waiting for this thread.
+		 */
+		if (unlikely(leader->ptrace))
+			__wake_up_parent(leader, leader->parent);
 		write_unlock_irq(&tasklist_lock);
 
 		release_task(leader);



* Re: [PATCH 1/1] ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
  2011-07-21 18:00     ` [PATCH 1/1] ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever Oleg Nesterov
@ 2011-07-22  8:44       ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2011-07-22  8:44 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Denys Vlasenko, mtk.manpages, Jan Kratochvil, linux-kernel

Hello,

On Thu, Jul 21, 2011 at 08:00:43PM +0200, Oleg Nesterov wrote:
> It is racy, exit_notify() does __wake_up_parent() too. But in the
> likely case it triggers the problem: de_thread() does release_task()
> and the old leader goes away without the notification, the tracer
> sleeps in do_wait() without children/tracees.
> 
> Change de_thread() to do __wake_up_parent(traced_leader->parent).
> Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
> EXIT_DEAD threads do not exist from do_wait's pov.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Nice catch as always. :)

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-07-21 16:51 ` Oleg Nesterov
  2011-07-21 18:00   ` [PATCH 0/1] (Was: man ptrace: add extended description of various ptrace quirks) Oleg Nesterov
@ 2011-09-21  5:10   ` Michael Kerrisk
  2011-09-23  9:31     ` Denys Vlasenko
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Kerrisk @ 2011-09-21  5:10 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo, Michael Kerrisk

Denys,

On Thu, Jul 21, 2011 at 6:51 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 07/21, Denys Vlasenko wrote:
>>
>> Deleted several outright false statements:
>> - pid 1 can be traced
>> - tracer is not shown as parent in ps output
>> - PTRACE_ATTACH is not "the same behavior as if tracee had done
>>   a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.
>> - SIGSTOP _can_ be injected.
>
> Yes, this is correct, thanks.
>
>> +Tracer can not assume that tracee ALWAYS ends its life by reporting
>> +WIFEXITED(status) or WIFSIGNALED(status).
>> +.LP
>> +.\" or can it? Do we include such a promise into ptrace API?
>
> IIRC, we already discussed this... The traced group leader can
> disappear during mt-exec, otherwise the tracee can never go away
> silently.
>
>> +Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This
>> +operation is deprecated, use kill(SIGKILL) or tgkill(SIGKILL) instead.
>> +The problem with this operation is that it requires tracee to be in
>> +signal-delivery-stop, otherwise it may not work (may complete
>> +successfully but won't kill the tracee),
>
> In short, ptrace(PTRACE_KILL) is more or less ptrace(PTRACE_CONT, SIGKILL),
> but it always returns 0. IOW, it never worked as described in the man
> page. And I guess today nobody can explain why PTRACE_KILL exists.
>
> Oleg.

Does your patch need any revision in the light of Oleg's comments?

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-09-21  5:10   ` [PATCH] man ptrace: add extended description of various ptrace quirks Michael Kerrisk
@ 2011-09-23  9:31     ` Denys Vlasenko
  0 siblings, 0 replies; 18+ messages in thread
From: Denys Vlasenko @ 2011-09-23  9:31 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo, Michael Kerrisk

On Wed, Sep 21, 2011 at 7:10 AM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
>>> +Tracer can not assume that tracee ALWAYS ends its life by reporting
>>> +WIFEXITED(status) or WIFSIGNALED(status).
>>> +.LP
>>> +.\" or can it? Do we include such a promise into ptrace API?
>>
>> IIRC, we already discussed this... The traced group leader can
>> disappear during mt-exec, otherwise the tracee can never go away
>> silently.

From userspace perspective, traced group leader doesn't go away
silently, it's execing thread who goes away silently.

> Does your patch need any revision in the light of Oleg's comments?

No, it does not. Please commit it to manpage git.

-- 
vda


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-07-21 11:09 [PATCH] man ptrace: add extended description of various ptrace quirks Denys Vlasenko
  2011-07-21 16:51 ` Oleg Nesterov
@ 2011-09-25  6:10 ` Michael Kerrisk
  2011-09-29 19:08 ` Michael Kerrisk
  2 siblings, 0 replies; 18+ messages in thread
From: Michael Kerrisk @ 2011-09-25  6:10 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo, linux-man

Hello Denys,

On Thu, Jul 21, 2011 at 1:09 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> Hi Michael,
>
> Please apply attached patch which updates ptrace manpage.
> (I'm not sending it inline, google web mail might mangle it. Sorry).

Thanks. This is a fantastic piece of work! I'm working on the patch
now; it'll probably take me a few days before I have done with it.
I'll send you the new version for review when I'm done.

> Changes include:
>
> s/parent/tracer/g, s/child/tracee/g - ptrace interface now
> is sufficiently cleaned up to not treat tracing process as parent.
>
> Deleted several outright false statements:
> - pid 1 can be traced

For changes like this, it's useful to record when the change occurred,
rather than just deleting the old text. I'm taking it that the change
occurred in 2.6.26, with commit
00cd5c37afd5f431ac186dd131705048c0a11fdb. Right?

> - tracer is not shown as parent in ps output

Was this a change triggered by a kernel revision? If so, which kernel version?

> - PTRACE_ATTACH is not "the same behavior as if tracee had done
>  a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.

Again, was this a change triggered by a kernel revision? If so, which
kernel version?

> - SIGSTOP _can_ be injected.

Again, was this a change triggered by a kernel revision? If so, which
kernel version?

> - Removed mentions of SunOS and Solaris as irrelevant.

Yes, seems fair enough to me.

> - Added a few more known bugs.

Thanks.

> Added a large block of text in DESCRIPTION which doesn't focus
> on mechanical description of each flag and operation, but rather
> tries to describe a bigger picture. The targeted audience is
> a person which is reasonably knowledgeable in Unix but did not
> spend years working with ptrace, and thus may be unaware of its quirks.
> This text went through several iterations of review by Oleg Nesterov
> and Tejun Heo.

Really, really good that you added that. Thank you.

> This block of text intentionally uses as little markup as possible,
> otherwise future modifications to it will be very hard to make.

Sorry, here I have to disagree ;-). Maintaining consistency across the
almost 1000 pages in man-pages is important for many reasons:
* Reader comfort; all man pages should look fairly consistent.
* Some scripts rely on consistent formatting to produce nicely rendered output
* Some global edits and replaces, and checking scripts, are more
likely to work well when formatting is consistent
* Consistent formatting probably also is helpful for downstream translations.

So, for these reasons, mark-up really should be applied. I'll do that
now, and if you need assistance with mark-up for future revisions, I'm
very happy to help out.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-07-21 11:09 [PATCH] man ptrace: add extended description of various ptrace quirks Denys Vlasenko
  2011-07-21 16:51 ` Oleg Nesterov
  2011-09-25  6:10 ` Michael Kerrisk
@ 2011-09-29 19:08 ` Michael Kerrisk
  2011-09-30 14:14   ` Denys Vlasenko
  2011-09-30 14:28   ` Denys Vlasenko
  2 siblings, 2 replies; 18+ messages in thread
From: Michael Kerrisk @ 2011-09-29 19:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: mtk.manpages, Oleg Nesterov, Jan Kratochvil, linux-kernel,
	Tejun Heo, linux-man, Heiko Carstens, Chuck Ebbert, Blaisorblade,
	Daniel Jacobowitz

[-- Attachment #1: Type: text/plain, Size: 56616 bytes --]

[CC+=linux-man + a few other possibly interested individuals]

Hello Denys, (Oleg, Tejun),

On Thu, Jul 21, 2011 at 1:09 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> Hi Michael,
>
> Please apply attached patch which updates ptrace manpage.
> (I'm not sending it inline, google web mail might mangle it. Sorry).

Thanks once again for this nice piece of work. Some comments, and a
revised page below.

> Changes include:
>
> s/parent/tracer/g, s/child/tracee/g - ptrace interface now
> is sufficiently cleaned up to not treat tracing process as parent.

Thanks!

> Deleted several outright false statements:
> - pid 1 can be traced

It looks to me as though this was once true, and I amended the page
accordingly. (Man-pages documents not just current behavior, but
historical behavior too.)

> - tracer is not shown as parent in ps output

Was this true at one time? If yes, then we should document past and
current behavior, and note when the change occurred.

> - PTRACE_ATTACH is not "the same behavior as if tracee had done
>  a PTRACE_TRACEME": PTRACE_ATTACH delivers a SIGSTOP.

Okay.

> - SIGSTOP _can_ be injected.

Was this true at one time? If yes, then we should document past and
current behavior, and note when the change occurred.

In the Linux 2.4 sources, I see the following in
arch/i386/kernel/signal.c::do_signal():

                        /* The debugger continued.  Ignore SIGSTOP.  */
                        if (signr == SIGSTOP)
                                continue;

Did that code prevent SIGSTOP being injected in the 2.4 series?

> - Removed mentions of SunOS and Solaris as irrelevant.

Okay.

> - Added a few more known bugs.

Thanks.

> Added a large block of text in DESCRIPTION which doesn't focus
> on mechanical description of each flag and operation, but rather
> tries to describe a bigger picture. The targeted audience is
> a person which is reasonably knowledgeable in Unix but did not
> spend years working with ptrace, and thus may be unaware of its quirks.
> This text went through several iterations of review by Oleg Nesterov
> and Tejun Heo.

Thanks. That's a great addition!

> This block of text intentionally uses as little markup as possible,
> otherwise future modifications to it will be very hard to make.

As I noted in an earlier mail, there are many very good reasons to
have the formatting. I've applied suitable formatting, and I can help
with formatting for future revisions to the text.

So, I took your patch, and then did a global edit of the page to fix
various pieces (in the existing text, as well as do some language
clean-ups for the new text). In the process, I found a number of
pieces that are still unclear (some in the old text, some in your new
text). I also made some changes to your text that I'd like you to
check. I've marked each of these with FIXME below. Could you please
take a look at the FIXMEs, and write me a comment for each of these.
(I appreciate that in some cases, especially for the existing text,
you may not have a handy answer Denys, but if you (and others) can
give any help, that would be great.)

Rather than you writing a new patch to this version of the page, I
think it might be easiest if you just replied to the FIXMEs inline
below, then I can revise the page in the light of your comments.

Thanks,

Michael

PS Note that I also added a copyright notice for your changes.

PPS Note also that there's a FIXME at the top of the file for
PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_LISTEN. That's
informational for me, not a request to you (though if someone wants to
write documentation for those three new flags, I am happy to receive it).

.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright (c) 1993 Michael Haardt <michael@moria.de>
.\" Fri Apr  2 11:32:09 MET DST 1993
.\"
.\" and changes Copyright (C) 1999 Mike Coleman (mkc@acm.org)
.\" -- major revision to fully document ptrace semantics per recent Linux
.\"    kernel (2.2.10) and glibc (2.1.2)
.\" Sun Nov  7 03:18:35 CST 1999
.\"
.\" and Copyright (c) 2011, Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, write to the Free
.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111,
.\" USA.
.\"
.\" Modified Fri Jul 23 23:47:18 1993 by Rik Faith <faith@cs.unc.edu>
.\" Modified Fri Jan 31 16:46:30 1997 by Eric S. Raymond <esr@thyrsus.com>
.\" Modified Thu Oct  7 17:28:49 1999 by Andries Brouwer <aeb@cwi.nl>
.\" Modified, 27 May 2004, Michael Kerrisk <mtk.manpages@gmail.com>
.\"     Added notes on capability requirements
.\"
.\" 2006-03-24, Chuck Ebbert <76306.1226@compuserve.com>
.\"    Added    PTRACE_SETOPTIONS, PTRACE_GETEVENTMSG, PTRACE_GETSIGINFO,
.\"        PTRACE_SETSIGINFO, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP
.\"    (Thanks to Blaisorblade, Daniel Jacobowitz and others who helped.)
.\" 2011-09, major update by Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" FIXME (later): Linux 3.1 adds PTRACE_SEIZE, PTRACE_INTERRUPT, and
.\" PTRACE_LISTEN.
.\"
.TH PTRACE 2 2011-09-26 "Linux" "Linux Programmer's Manual"
.SH NAME
ptrace \- process trace
.SH SYNOPSIS
.nf
.B #include <sys/ptrace.h>
.sp
.BI "long ptrace(enum __ptrace_request " request ", pid_t " pid ", "
.BI "            void *" addr ", void *" data );
.fi
.SH DESCRIPTION
The
.BR ptrace ()
system call provides a means by which one process (the "tracer")
may observe and control the execution of another process (the "tracee"),
and examine and change the tracee's memory and registers.
It is primarily used to implement breakpoint debugging and system
call tracing.
.LP
A tracee first needs to be attached to the tracer.
Attachment and subsequent commands are per thread:
in a multithreaded process,
every thread can be individually attached to a
(potentially different) tracer,
or left not attached and thus not debugged.
Therefore, "tracee" always means "(one) thread",
never "a (possibly multithreaded) process".
Ptrace commands are always sent to
a specific tracee using a call of the form

    ptrace(PTRACE_foo, pid, ...)

where
.I pid
is the thread ID of the corresponding Linux thread.
.LP
A process can initiate a trace by calling
.BR fork (2)
and having the resulting child do a
.BR PTRACE_TRACEME ,
followed (typically) by an
.BR execve (2).
Alternatively, one process may commence tracing another process using
.BR PTRACE_ATTACH .
.LP
While being traced, the tracee will stop each time a signal is delivered,
even if the signal is being ignored.
(An exception is
.BR SIGKILL ,
which has its usual effect.)
The tracer will be notified at its next call to
.BR waitpid (2)
(or one of the related "wait" system calls)
and may inspect and modify the tracee while it is stopped.
The tracer then causes the tracee to continue,
optionally ignoring the delivered signal
(or even delivering a different signal instead).
.LP
When the tracer is finished tracing, it can cause the tracee to continue
executing in a normal, untraced mode via
.BR PTRACE_DETACH .
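.LP
As a minimal sketch of the life cycle described above (not a complete
tracer), the following C fragment forks a child that requests tracing with
.BR PTRACE_TRACEME ,
waits for its first stop, and then detaches.
The helper name
.I trace_once
is hypothetical, and the child stops itself with
.BR raise (3)
only to keep the example short;
a real tracee would normally go on to call
.BR execve (2).
.nf

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the tracee's exit code, or -1 on error. */
int trace_once(void)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        /* Tracee: ask to be traced by the parent, then stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    /* Tracer: the SIGSTOP shows up as a stop reported by waitpid(). */
    if (waitpid(child, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;

    /* The stopped tracee could be inspected or modified here. */

    /* Detach with signal 0, so the SIGSTOP is not delivered. */
    if (ptrace(PTRACE_DETACH, child, NULL, NULL) == -1) {
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
        return -1;
    }
    if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```
.fi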
.LP
The value of
.I request
determines the action to be performed:
.TP
.B PTRACE_TRACEME
Indicate that this process is to be traced by its parent.
Any signal (except
.BR SIGKILL )
delivered to this process will cause it to stop and its
parent to be notified via
.BR waitpid (2).
In addition, all subsequent calls to
.BR execve (2)
by the traced process will cause a
.B SIGTRAP
to be sent to it,
giving the parent a chance to gain control before the new program
begins execution.
A process probably shouldn't make this request if its parent
isn't expecting to trace it.
.RI ( pid ,
.IR addr ,
and
.IR data
are ignored.)
.LP
The
.B PTRACE_TRACEME
request is used only by the tracee;
the remaining requests are used only by the tracer.
In the following requests,
.I pid
specifies the thread ID of the tracee to be acted on.
For requests other than
.BR PTRACE_KILL ,
the tracee must be stopped.
.TP
.BR PTRACE_PEEKTEXT ", " PTRACE_PEEKDATA
Read a word at the address
.I addr
in the tracee's memory, returning the word as the result of the
.BR ptrace ()
call.
Linux does not have separate text and data address spaces,
so these two requests are currently equivalent.
.RI ( data
is ignored.)
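.IP
Because the word is returned as the function result,
a return value of \-1 does not by itself indicate an error.
A minimal sketch of one way to handle this, assuming the usual convention
of clearing
.I errno
before the call (the wrapper name
.I peek_word
is hypothetical):
.nf

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Hypothetical wrapper: read one word at addr in the tracee.
 * Returns 0 and stores the word in *out, or -1 with errno set. */
int peek_word(pid_t pid, void *addr, long *out)
{
    long word;

    errno = 0;  /* -1 may be a legitimate word; errno disambiguates */
    word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (word == -1 && errno != 0)
        return -1;
    *out = word;
    return 0;
}
```
.fi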
.TP
.B PTRACE_PEEKUSER
.\" PTRACE_PEEKUSR in kernel source, but glibc uses PTRACE_PEEKUSER,
.\" and that is the name that seems common on other systems.
Read a word at offset
.I addr
in the tracee's USER area,
which holds the registers and other information about the process
(see
.IR <sys/user.h> ).
The word is returned as the result of the
.BR ptrace ()
call.
Typically, the offset must be word-aligned, though this might vary by
architecture.
See NOTES.
.RI ( data
is ignored.)
.TP
.BR PTRACE_POKETEXT ", " PTRACE_POKEDATA
Copy the word
.I data
to the address
.I addr
in the tracee's memory.
As for
.BR PTRACE_PEEKTEXT
and
.BR PTRACE_PEEKDATA ,
these two requests are currently equivalent.
.TP
.B PTRACE_POKEUSER
.\" PTRACE_POKEUSR in kernel source, but glibc uses PTRACE_POKEUSER,
.\" and that is the name that seems common on other systems.
Copy the word
.I data
to offset
.I addr
in the tracee's USER area.
As for
.BR PTRACE_PEEKUSER ,
the offset must typically be word-aligned.
In order to maintain the integrity of the kernel,
some modifications to the USER area are disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.TP
.BR PTRACE_GETREGS ", " PTRACE_GETFPREGS
Copy the tracee's general purpose or floating-point registers,
respectively, to the address
.I data
in the tracer.
See
.I <sys/user.h>
for information on the format of this data.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_GETSIGINFO " (since Linux 2.3.99-pre6)"
Retrieve information about the signal that caused the stop.
Copy a
.I siginfo_t
structure (see
.BR sigaction (2))
from the tracee to the address
.I data
in the tracer.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETREGS ", " PTRACE_SETFPREGS
Copy the tracee's general purpose or floating-point registers,
respectively, from the address
.I data
in the tracer.
As for
.BR PTRACE_POKEUSER ,
some general purpose register modifications may be disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETSIGINFO " (since Linux 2.3.99-pre6)"
Set signal information:
copy a
.I siginfo_t
structure from the address
.I data
in the tracer to the tracee.
This will affect only signals that would normally be delivered to
the tracee and were caught by the tracer.
It may be difficult to tell
these normal signals from synthetic signals generated by
.BR ptrace ()
itself.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETOPTIONS " (since Linux 2.4.6; see BUGS for caveats)"
Set ptrace options from
.IR data .
.RI ( addr
is ignored.)
.IR data
is interpreted as a bit mask of options,
which are specified by the following flags:
.RS
.TP
.BR PTRACE_O_TRACESYSGOOD " (since Linux 2.4.6)"
When delivering system call traps, set bit 7 in the signal number
(i.e., deliver
.IR "SIGTRAP|0x80" ).
This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
.RB ( PTRACE_O_TRACESYSGOOD
may not work on all architectures.)
.\" FIXME Please check. In the following paragraphs, I substituted language
.\" such as:
.\"     Stop tracee at next fork(2) call with SIGTRAP|PTRACE_EVENT_FORK<<8
.\" with:
.\"     Stop tracee at next fork(2) call... A subsequent PTRACE_GETSIGINFO
.\"     on the stopped tracee will return a siginfo_t structure with si_code
.\"     set to SIGTRAP|PTRACE_EVENT_FORK<<8.
.\"
.\" Is this change correct?
.\"
.TP
.BR PTRACE_O_TRACEFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR fork (2)
and automatically start tracing the newly forked process,
which will start with a
.BR SIGSTOP .
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_FORK<<8 .
The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEVFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR vfork (2)
and automatically start tracing the newly vforked process,
which will start with a
.BR SIGSTOP .
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_VFORK<<8 .
The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACECLONE " (since Linux 2.5.46)"
Stop the tracee at the next
.BR clone (2)
and automatically start tracing the newly cloned process,
which will start with a
.BR SIGSTOP .
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_CLONE<<8 .
The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
This option may not catch
.BR clone (2)
calls in all cases.
If the tracee calls
.BR clone (2)
with the
.B CLONE_VFORK
flag,
.B PTRACE_EVENT_VFORK
will be delivered instead
if
.B PTRACE_O_TRACEVFORK
is set; otherwise if the tracee calls
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
.B PTRACE_EVENT_FORK
will be delivered if
.B PTRACE_O_TRACEFORK
is set.
.TP
.BR PTRACE_O_TRACEEXEC " (since Linux 2.5.46)"
Stop the tracee at the next
.BR execve (2).
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_EXEC<<8 .
.TP
.BR PTRACE_O_TRACEVFORKDONE " (since Linux 2.5.60)"
Stop the tracee at the completion of the next
.BR vfork (2).
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_VFORK_DONE<<8 .
The PID of the new process can (since Linux 2.6.18) be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEEXIT " (since Linux 2.5.60)"
Stop the tracee at exit.
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_EXIT<<8 .
The tracee's exit status can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
The tracee is stopped early during process exit,
when registers are still available,
allowing the tracer to see where the exit occurred,
whereas the normal exit notification is done after the process
is finished exiting.
Even though context is available,
the tracer cannot prevent the exit from happening at this point.
.RE
.TP
.BR PTRACE_GETEVENTMSG " (since Linux 2.5.46)"
Retrieve a message (as an
.IR "unsigned long" )
about the ptrace event
that just happened, placing it at the address
.I data
in the tracer.
For
.BR PTRACE_EVENT_EXIT ,
this is the tracee's exit status.
For
.BR PTRACE_EVENT_FORK ,
.BR PTRACE_EVENT_VFORK ,
.BR PTRACE_EVENT_VFORK_DONE ,
and
.BR PTRACE_EVENT_CLONE ,
this is the PID of the new process.
.RI ( addr
is ignored.)
.TP
.B PTRACE_CONT
Restart the stopped tracee process.
If
.I data
is nonzero,
it is interpreted as the number of a signal to be delivered to the tracee;
otherwise, no signal is delivered.
Thus, for example, the tracer can control
whether a signal sent to the tracee is delivered or not.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSCALL ", " PTRACE_SINGLESTEP
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but arrange for the tracee to be stopped at
the next entry to or exit from a system call,
or after execution of a single instruction, respectively.
(The tracee will also, as usual, be stopped upon receipt of a signal.)
From the tracer's perspective, the tracee will appear to have been
stopped by receipt of a
.BR SIGTRAP .
So, for
.BR PTRACE_SYSCALL ,
for example, the idea is to inspect
the arguments to the system call at the first stop,
then do another
.B PTRACE_SYSCALL
and inspect the return value of the system call at the second stop.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSEMU ", " PTRACE_SYSEMU_SINGLESTEP " (since Linux 2.6.14)"
For
.BR PTRACE_SYSEMU ,
continue and stop on entry to the next system call,
which will not be executed.
For
.BR PTRACE_SYSEMU_SINGLESTEP ,
do the same but also singlestep if not a system call.
This call is used by programs like
User Mode Linux that want to emulate all the tracee's system calls.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored;
not supported on all architectures.)
.TP
.B PTRACE_KILL
Send the tracee a
.B SIGKILL
to terminate it.
.RI ( addr
and
.I data
are ignored.)
.IP
.I This operation is deprecated; do not use it!
Instead, send a
.BR SIGKILL
directly using
.BR kill (2)
or
.BR tgkill (2).
The problem with
.B PTRACE_KILL
is that it requires the tracee to be in signal-delivery-stop,
otherwise it may not work
(i.e., may complete successfully but won't kill the tracee).
By contrast, sending a
.B SIGKILL
directly has no such limitation.
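.IP
A minimal sketch of the recommended alternative, using
.BR kill (2)
and then reaping the tracee.
The helper name
.I kill_and_reap
is hypothetical, and the loop assumes that an earlier ptrace-stop
notification may still be waiting to be collected:
.nf

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: kill pid with SIGKILL and reap it.
 * Returns the terminating signal, 0 if it had already exited,
 * or -1 on error. */
int kill_and_reap(pid_t pid)
{
    int status;

    if (kill(pid, SIGKILL) == -1)
        return -1;
    /* Skip any stop notification that was queued before the kill. */
    do {
        if (waitpid(pid, &status, __WALL) == -1)
            return -1;
    } while (!WIFSIGNALED(status) && !WIFEXITED(status));
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```
.fi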
.\" mtk: Commented out the following. It doesn't belong in the man page.
.\" .LP
.\" [Note: deprecation suggested by Oleg Nesterov. He prefers to deprecate
.\" it instead of describing (and needing to support) PTRACE_KILL's quirks.]
.TP
.B PTRACE_ATTACH
Attach to the process specified in
.IR pid ,
making it a tracee of the calling process.
.\" FIXME So, was the following EVER true? IF it was,
.\"       we should reinstate the text and add mention of
.\"       the kernel version where the behaviour changed.
.\"
.\" Not true: (removed by dv)
.\" ; the behavior of the tracee is as if it had done a
.\" .BR PTRACE_TRACEME .
.\" The calling process actually becomes the parent of the tracee
.\" process for most purposes (e.g., it will receive
.\" notification of tracee events and appears in
.\" .BR ps (1)
.\" output as the tracee's parent), but a
.\" .BR getppid (2)
.\" by the tracee will still return the PID of the original parent.
The tracee is sent a
.BR SIGSTOP ,
but will not necessarily have stopped
by the completion of this call; use
.BR waitpid (2)
to wait for the tracee to stop.
See the "Attaching and detaching" subsection for additional information.
.RI ( addr
and
.I data
are ignored.)
.TP
.B PTRACE_DETACH
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but first detach from it.
Under Linux, a tracee can be detached in this way regardless
of which method was used to initiate tracing.
.RI ( addr
is ignored.)
.\"
.\" In the text below, we decided to avoid prettifying the text with markup:
.\" it would make the source nearly impossible to edit, and we _do_ intend
.\" to edit it often, in order to keep it updated:
.\" ptrace API is full of quirks, no need to compound this situation by
.\" making it excruciatingly painful to document them!
.\"
.SS Death under ptrace
When a (possibly multithreaded) process receives a killing signal
(one whose disposition is set to
.B SIG_DFL
and whose default action is to kill the process),
all threads exit.
Tracees report their death to their tracer(s).
Notification of this event is delivered via
.BR waitpid (2).
.LP
Note that the killing signal will first cause signal-delivery-stop
(on one tracee only),
and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will death from the signal happen on
.I all
tracees within a multithreaded process.
(The term "signal-delivery-stop" is explained below.)
.LP
.B SIGKILL
operates similarly, with exceptions.
No signal-delivery-stop is generated for
.B SIGKILL
and therefore the tracer can't suppress it.
.B SIGKILL
kills even within system calls
(syscall-exit-stop is not generated prior to death by
.BR SIGKILL ).
The net effect is that
.B SIGKILL
always kills the process (all its threads),
even if some threads of the process are ptraced.
.LP
When the tracee calls
.BR _exit (2),
it reports its death to its tracer.
Other threads are not affected.
.LP
When any thread executes
.BR exit_group (2),
every tracee in its thread group reports its death to its tracer.
.LP
If the
.B PTRACE_O_TRACEEXIT
option is on,
.B PTRACE_EVENT_EXIT
will happen before actual death.
This applies to exits via
.BR exit (2),
.BR exit_group (2),
and signal deaths (except
.BR SIGKILL ),
and when threads are torn down on
.BR execve (2)
in a multithreaded process.
.LP
The tracer cannot assume that the ptrace-stopped tracee exists.
There are many scenarios when the tracee may die while stopped (such as
.BR SIGKILL ).
Therefore, the tracer must be prepared to handle an
.B ESRCH
error on any ptrace operation.
Unfortunately, the same error is returned if the tracee
exists but is not ptrace-stopped
(for commands which require a stopped tracee),
or if it is not traced by the process which issued the ptrace call.
The tracer needs to keep track of the stopped/running state of the tracee,
and interpret
.B ESRCH
as "tracee died unexpectedly" only if it knows that the tracee has
been observed to enter ptrace-stop.
Note that there is no guarantee that
.I waitpid(WNOHANG)
will reliably report the tracee's death status if a
ptrace operation returned
.BR ESRCH .
.I waitpid(WNOHANG)
may return 0 instead.
In other words, the tracee may be "not yet fully dead",
but already refusing ptrace requests.
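.LP
The bookkeeping described above can be sketched as follows.
This is only an illustration: the structure and the names
.I tracee_state
and
.I interpret_ptrace_errno
are hypothetical, and a real tracer would keep one such record per tracee,
updating it whenever
.BR waitpid (2)
reports a stop and whenever the tracee is restarted.
.nf

```c
#include <errno.h>

/* Hypothetical per-tracee bookkeeping kept by the tracer. */
struct tracee_state {
    int in_ptrace_stop;  /* set on a reported stop, cleared on restart */
};

enum esrch_meaning {
    NOT_ESRCH,           /* some other error, or no error at all */
    TRACEE_DIED,         /* tracee was known to be stopped: it died */
    TRACEE_NOT_STOPPED   /* tracee is running, or is not our tracee */
};

/* Interpret the errno value left by a failed ptrace() call. */
enum esrch_meaning interpret_ptrace_errno(const struct tracee_state *t,
                                          int err)
{
    if (err != ESRCH)
        return NOT_ESRCH;
    return t->in_ptrace_stop ? TRACEE_DIED : TRACEE_NOT_STOPPED;
}
```
.fi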
.LP
The tracer can't assume that the tracee
.I always
ends its life by reporting
.I WIFEXITED(status)
or
.IR WIFSIGNALED(status) .
.LP
.\"     or can it? Do we include such a promise into ptrace API?
.\"
.\" FIXME: The preceding comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.SS Stopped states
A tracee can be in one of two states: running or stopped.
.LP
There are many kinds of stopped state, and in ptrace
discussions they are often conflated.
Therefore, it is important to use precise terms.
.LP
In this manual page, any stopped state in which the tracee is ready
to accept ptrace commands from the tracer is called
.IR ptrace-stop .
Ptrace-stops can
be further subdivided into
.IR signal-delivery-stop ,
.IR group-stop ,
.IR syscall-stop ,
and so on.
These stopped states are described in detail below.
.LP
When the running tracee enters ptrace-stop, it notifies its tracer using
.BR waitpid (2)
(or one of the other "wait" system calls).
Most of this manual page assumes that the tracer waits with:
.LP
    pid = waitpid(pid_or_minus_1, &status, __WALL);
.LP
Ptrace-stopped tracees are reported as returns with
.I pid
greater than 0 and
.I WIFSTOPPED(status)
true.
.LP
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.\"     Do we require __WALL usage, or will just using 0 be ok? Are the
.\"     rules different if user wants to use waitid? Will waitid require
.\"     WEXITED?
.\"
.LP
.\" FIXME: Is the following comment "__WALL... implies" true?
The
.B __WALL
flag does not include the
.B WSTOPPED
and
.B WEXITED
flags, but implies their functionality.
.LP
Setting the
.B WCONTINUED
flag when calling
.BR waitpid (2)
is not recommended: the "continued" state is per-process and
consuming it can confuse the real parent of the tracee.
.LP
Use of the
.B WNOHANG
flag may cause
.BR waitpid (2)
to return 0 ("no wait results available yet")
even if the tracer knows there should be a notification.
Example:
.nf

    kill(tracee, SIGKILL);
    waitpid(tracee, &status, __WALL | WNOHANG);
.fi
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.\"     waitid usage? WNOWAIT?
.\"     describe how wait notifications queue (or not queue)
.LP
The following kinds of ptrace-stops exist: signal-delivery-stops,
group-stops, PTRACE_EVENT stops, syscall-stops,
.\"
.\" FIXME: mtk: the following text appears to be incomplete.
.\"        Do you want to add anything?
.\"
and the stops caused by PTRACE_SINGLESTEP, PTRACE_SYSEMU, and
PTRACE_SYSEMU_SINGLESTEP.
They all are reported by
.BR waitpid (2)
with
.I WIFSTOPPED(status)
true.
They may be differentiated by examining the value
.IR status>>8 ,
and if there is ambiguity in that value, by querying
.BR PTRACE_GETSIGINFO .
.\"
.\" FIXME What is the purpose of the following sentence? Is it to warn
.\"       the reader not to use WSTOPSIG()? If so, we should make that
.\"       point more explicitly.
(Note: the
.I WSTOPSIG(status)
macro returns the value
.IR "((status>>8)\ &\ 0xff)" .)
.SS Signal-delivery-stop
When a (possibly multithreaded) process receives any signal except
.BR SIGKILL ,
the kernel selects an arbitrary thread which handles the signal.
(If the signal is generated with
.BR tgkill (2),
the target thread can be explicitly selected by the caller.)
If the selected thread is traced, it enters signal-delivery-stop.
At this point, the signal is not yet delivered to the process,
and can be suppressed by the tracer.
If the tracer doesn't suppress the signal,
.\"
.\" FIXME: I added the word "restart" to the following line. Okay?
it passes the signal to the tracee in the next ptrace restart request.
This second step of signal delivery is called
.I "signal injection"
in this manual page.
Note that if the signal is blocked,
signal-delivery-stop doesn't happen until the signal is unblocked,
with the usual exception that
.B SIGSTOP
can't be blocked.
.LP
Signal-delivery-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal returned by
.IR WSTOPSIG(status) .
If the stopping signal is
.BR SIGTRAP ,
this may be a different kind of ptrace-stop;
see the "Syscall-stops" and "execve" sections below for details.
If
.I WSTOPSIG(status)
returns a stopping signal, this may be a group-stop; see below.
.SS Signal injection and suppression
After signal-delivery-stop is observed by the tracer,
the tracer should restart the tracee with the call
.LP
    ptrace(PTRACE_restart, pid, 0, sig)
.LP
where
.B PTRACE_restart
is one of the restarting ptrace requests.
If
.I sig
is 0, then a signal is not delivered.
Otherwise, the signal
.I sig
is delivered.
This operation is called
.I "signal injection"
in this manual page, to distinguish it from signal-delivery-stop.
.LP
Note that the
.I sig
value may be different from the
.I WSTOPSIG(status)
value: the tracer can cause a different signal to be injected.
.LP
Note that a suppressed signal still causes system calls to return
prematurely.
Restartable system calls will be restarted (the tracer will
observe the tracee to execute
.BR restart_syscall (2)
if the tracer uses
.BR PTRACE_SYSCALL );
non-restartable system calls may fail with
.B EINTR
even though no observable signal is injected to the tracee.
.LP
Note that restarting ptrace commands issued in ptrace-stops other than
signal-delivery-stop are not guaranteed to inject a signal, even if
.I sig
is nonzero.
No error is reported; a nonzero
.I sig
may simply be ignored.
Ptrace users should not try to "create a new signal" this way: use
.BR tgkill (2)
instead.
.LP
.\"
.\" FIXME: the referrent of "This" in the next line is not clear.
.\"        What does "This" refer to?
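The fact that a nonzero
.I sig
may be silently ignored when the tracee is restarted from a ptrace-stop
other than signal-delivery-stop is a cause of confusion among ptrace users.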
One typical scenario is that the tracer observes group-stop,
mistakes it for signal-delivery-stop, restarts the tracee with

    ptrace(PTRACE_restart, pid, 0, stopsig)

with the intention of injecting
.IR stopsig ,
but
.I stopsig
gets ignored and the tracee continues to run.
.LP
The
.B SIGCONT
signal has a side effect of waking up (all threads of)
a group-stopped process.
This side effect happens before signal-delivery-stop.
The tracer can't suppress this side effect (it can
only suppress signal injection, which only causes the
.BR SIGCONT
handler to not be executed in the tracee, if such a handler is installed).
In fact, waking up from group-stop may be followed by
signal-delivery-stop for signal(s)
.I other than
.BR SIGCONT ,
if they were pending when
.B SIGCONT
was delivered.
In other words,
.B SIGCONT
may not be the first signal observed by the tracee after it was sent.
.LP
Stopping signals cause (all threads of) a process to enter group-stop.
This side effect happens after signal injection, and therefore can be
suppressed by the tracer.
.LP
.B PTRACE_GETSIGINFO
can be used to retrieve a
.I siginfo_t
structure which corresponds to the delivered signal.
.B PTRACE_SETSIGINFO
may be used to modify it.
If
.B PTRACE_SETSIGINFO
has been used to alter
.IR siginfo_t ,
the
.I si_signo
field and the
.I sig
parameter in the restarting command must match,
otherwise the result is undefined.
.SS Group-stop
When a (possibly multithreaded) process receives a stopping signal,
all threads stop.
If some threads are traced, they enter a group-stop.
Note that the stopping signal will first cause signal-delivery-stop
(on one tracee only), and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will group-stop be initiated on
.I all
tracees within the multithreaded process.
As usual, every tracee reports its group-stop separately
to the corresponding tracer.
.LP
Group-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal available via
.IR WSTOPSIG(status) .
The same result is returned by some other classes of ptrace-stops,
therefore the recommended practice is to perform the call
.LP
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
.LP
The call can be avoided if the signal is not
.BR SIGSTOP ,
.BR SIGTSTP ,
.BR SIGTTIN ,
or
.BR SIGTTOU ;
only these four signals are stopping signals.
If the tracer sees something else, it can't be a group-stop.
Otherwise, the tracer needs to call
.BR PTRACE_GETSIGINFO .
If
.B PTRACE_GETSIGINFO
fails with
.BR EINVAL ,
then it is definitely a group-stop.
(Other failure codes are possible, such as
.B ESRCH
("no such process") if a
.B SIGKILL
killed the tracee.)
.LP
As of kernel 2.6.38,
after the tracer sees the tracee ptrace-stop and until it
restarts or kills it, the tracee will not run,
and will not send notifications (except
.B SIGKILL
death) to the tracer, even if the tracer enters into another
.BR waitpid (2)
call.
.LP
.\"
.\" FIXME ??? referrent of "it" in the next line is unclear
.\"        What does "it" refer to?
Currently, the kernel behavior described in the previous paragraph
causes a problem with transparent handling of stopping
signals: if the tracer restarts the tracee after group-stop,
.B SIGSTOP
is effectively ignored: the tracee doesn't remain stopped, it runs.
If the tracer doesn't restart the tracee before entering into the next
.BR waitpid (2),
future
.B SIGCONT
signals will not be reported to the tracer.
This would cause
.B SIGCONT
to have no effect.
.SS PTRACE_EVENT stops
If the tracer sets
.B PTRACE_O_TRACE_*
options, the tracee will enter ptrace-stops called
.B PTRACE_EVENT
stops.
.LP
.B PTRACE_EVENT
stops are observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, and
.I WSTOPSIG(status)
returns
.BR SIGTRAP .
An additional bit is set in the higher byte of the status word:
the value
.I status>>8
will be

    (SIGTRAP | PTRACE_EVENT_foo << 8).

The following events exist:
.TP
.B PTRACE_EVENT_VFORK
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag.
When the tracee is continued after this stop, it will wait for the child to
exit/exec before continuing its execution
(in other words, the usual behavior on
.BR vfork (2)).
.TP
.B PTRACE_EVENT_FORK
Stop before return from
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD .
.TP
.B PTRACE_EVENT_CLONE
Stop before return from
.BR clone (2).
.TP
.B PTRACE_EVENT_VFORK_DONE
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
but after the child unblocked this tracee by exiting or execing.
.LP
For all four stops described above,
the stop occurs in the parent (i.e., the tracee),
not in the newly created thread.
.BR PTRACE_GETEVENTMSG
can be used to retrieve the new thread's ID.
.TP
.B PTRACE_EVENT_EXEC
Stop before return from
.BR execve (2).
.TP
.B PTRACE_EVENT_EXIT
Stop before exit (including death from
.BR exit_group (2)),
signal death, or exit caused by
.BR execve (2)
in a multithreaded process.
.B PTRACE_GETEVENTMSG
returns the exit status.
Registers can be examined
(unlike when "real" exit happens).
The tracee is still alive; it needs to be
.BR PTRACE_CONT ed
or
.BR PTRACE_DETACH ed
to finish exiting.
.LP
.B PTRACE_GETSIGINFO
on
.B PTRACE_EVENT
stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.IR "(event<<8)\ |\ SIGTRAP" .
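.LP
As an illustration of the status layout described in this section,
the following sketch extracts the event code from a wait status and
fetches the ID of a newly created thread.
The function names
.I ptrace_event_of
and
.I new_child_of
are hypothetical.
.nf

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Returns the PTRACE_EVENT_* code carried by a wait status,
 * or 0 if the status does not describe a PTRACE_EVENT stop. */
int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    /* status>>8 is (SIGTRAP | event<<8), so event is in status>>16. */
    return (status >> 16) & 0xff;
}

/* Hypothetical helper: after a fork, vfork or clone event stop,
 * fetch the ID of the new thread. Returns -1 on error. */
pid_t new_child_of(pid_t pid)
{
    unsigned long msg;

    if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, &msg) == -1)
        return -1;
    return (pid_t)msg;
}
```
.fi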
.SS Syscall-stops
If the tracee was restarted by
.BR PTRACE_SYSCALL ,
the tracee enters
syscall-enter-stop just prior to entering any system call.
If the tracer restarts the tracee with
.BR PTRACE_SYSCALL ,
the tracee enters syscall-exit-stop when the system call is finished,
or if it is interrupted by a signal.
(That is, signal-delivery-stop never happens between syscall-enter-stop
and syscall-exit-stop; it happens
.I after
syscall-exit-stop.)
.LP
Other possibilities are that the tracee may stop in a
.B PTRACE_EVENT
stop, exit (if it entered
.BR _exit (2)
or
.BR exit_group (2)),
be killed by
.BR SIGKILL ,
or die silently (if it is a thread group leader, the
.BR execve (2)
happened in another thread,
and that thread is not traced by the same tracer;
this situation is discussed later).
.LP
Syscall-enter-stop and syscall-exit-stop are observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, and
.I WSTOPSIG(status)
giving
.BR SIGTRAP .
If the
.B PTRACE_O_TRACESYSGOOD
option was set by the tracer, then
.I WSTOPSIG(status)
will give the value
.IR "(SIGTRAP\ |\ 0x80)" .
.LP
Syscall-stops can be distinguished from signal-delivery-stop with
.B SIGTRAP
by querying
.BR PTRACE_GETSIGINFO
for the following cases:
.TP
.IR si_code " <= 0"
.B SIGTRAP
.\" FIXME: Please confirm this is okay: I changed
.\"        "the usual suspects" to "by a system call". Okay?
.\"        Shouldn't we also add kill(2) here?
was sent by a system call
.RB ( tgkill (2),
.BR sigqueue (3),
etc.)
.TP
.IR si_code " == SI_KERNEL (0x80)"
.B SIGTRAP
was sent by the kernel.
.TP
.IR si_code " == SIGTRAP or " si_code " == (SIGTRAP|0x80)"
This is a syscall-stop.
.LP
However, syscall-stops happen very often (twice per system call),
and performing
.B PTRACE_GETSIGINFO
for every syscall-stop may be somewhat expensive.
.LP
.\"
.\" FIXME referrent of "them" in next line ???
.\"       What does "them" refer to?
Some architectures allow the cases to be distinguished
by examining registers.
For example, on x86,
.I rax
==
.RB - ENOSYS
in syscall-enter-stop.
Since
.B SIGTRAP
(like any other signal) always happens
.I after
syscall-exit-stop,
and at this point
.I rax
almost never contains
.RB - ENOSYS ,
the
.B SIGTRAP
looks like "syscall-stop which is not syscall-enter-stop";
in other words, it looks like a
"stray syscall-exit-stop" and can be detected this way.
But such detection is fragile and is best avoided.
.LP
Using the
.B PTRACE_O_TRACESYSGOOD
.\"
.\" FIXME Below: "is the recommended method" for WHAT?
option is the recommended method to distinguish syscall-stops
from other kinds of ptrace-stops,
since it is reliable and does not incur a performance penalty.
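.LP
A minimal sketch of this approach follows.
The function names are hypothetical;
.I enable_sysgood
assumes that the tracee is currently in ptrace-stop and that no other
options need to be preserved, since
.B PTRACE_SETOPTIONS
replaces the current flags.
.nf

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Ask the kernel to mark syscall-stops with bit 7 of the signal
 * number. Returns 0 on success, -1 on error. */
int enable_sysgood(pid_t pid)
{
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1)
        return -1;
    return 0;
}

/* With PTRACE_O_TRACESYSGOOD set, only a syscall-stop reports
 * SIGTRAP|0x80 as its stopping signal. */
int is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```
.fi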
.LP
Syscall-enter-stop and syscall-exit-stop are
indistinguishable from each other by the tracer.
The tracer needs to keep track of the sequence of
ptrace-stops in order to not misinterpret syscall-enter-stop as
syscall-exit-stop or vice versa.
The rule is that syscall-enter-stop is
always followed by syscall-exit-stop,
.B PTRACE_EVENT
stop or the tracee's death;
no other kinds of ptrace-stop can occur in between.
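.LP
One way to keep track of the sequence is a per-tracee flag that is
toggled on every syscall-stop.
The sketch below is only an illustration with hypothetical names.
It assumes that
.B PTRACE_O_TRACESYSGOOD
is set, that the tracee is always restarted with
.BR PTRACE_SYSCALL ,
and that tracing began while the tracee was outside a system call;
if any of these does not hold, the flag has to be resynchronized by
other means.
.nf

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical per-tracee bookkeeping. */
struct syscall_tracker {
    int in_syscall;  /* 1 between syscall-enter-stop and exit-stop */
};

enum stop_kind { NOT_SYSCALL_STOP, SYSCALL_ENTER, SYSCALL_EXIT };

/* Feed every wait status of one tracee through this function. */
enum stop_kind track_stop(struct syscall_tracker *t, int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != (SIGTRAP | 0x80))
        return NOT_SYSCALL_STOP;  /* other stops leave the flag alone */
    t->in_syscall = !t->in_syscall;
    return t->in_syscall ? SYSCALL_ENTER : SYSCALL_EXIT;
}
```
.fi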
.LP
If after syscall-enter-stop,
the tracer uses a restarting command other than
.BR PTRACE_SYSCALL ,
syscall-exit-stop is not generated.
.LP
.B PTRACE_GETSIGINFO
on syscall-stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.B SIGTRAP
or
.IR (SIGTRAP|0x80) .
.SS PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops
.\"
.\" FIXME The following TODO is unresolved
.\"       Do you want to add anything?
.\"
(TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
PTRACE_SYSEMU_SINGLESTEP)
.SS Informational and restarting ptrace commands
Most ptrace commands (all except
.BR PTRACE_ATTACH ,
.BR PTRACE_TRACEME ,
and
.BR PTRACE_KILL )
require the tracee to be in a ptrace-stop, otherwise they fail with
.BR ESRCH .
.LP
When the tracee is in ptrace-stop,
the tracer can read and write data to
the tracee using informational commands.
These commands leave the tracee in ptrace-stopped state:
.LP
.nf
    ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
    ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
    ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
.fi
.LP
Note that some errors are not reported.
For example, setting signal information
.RI ( siginfo )
may have no effect in some ptrace-stops, yet the call may succeed
(return 0 and not set
.IR errno );
querying
.B PTRACE_GETEVENTMSG
may succeed and return some random value if current ptrace-stop
is not documented as returning a meaningful event message.
.LP
The call

    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

affects one tracee.
The tracee's current flags are replaced.
Flags are inherited by new tracees created and "auto-attached" via active
.BR PTRACE_O_TRACEFORK ,
.BR PTRACE_O_TRACEVFORK ,
or
.BR PTRACE_O_TRACECLONE
options.
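.LP
Because the flags are replaced rather than merged, one possible approach
is to keep the complete mask in one place and always pass all of it.
A sketch, with hypothetical helper names and an arbitrarily chosen option set:
.LP
.nf
```c
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* The full option set this hypothetical tracer wants.  PTRACE_SETOPTIONS
 * replaces the tracee's current flags, so always pass the complete mask. */
static long wanted_options(void)
{
    return PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK
         | PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXEC;
}

/* Returns 0 on success, -1 with errno set on failure.
 * The tracee must be in a ptrace-stop. */
static long set_options(pid_t tid)
{
    return ptrace(PTRACE_SETOPTIONS, tid, NULL, (void *)wanted_options());
}
```
.fi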
.LP
Another group of commands makes the ptrace-stopped tracee run.
They have the form:
.LP
    ptrace(PTRACE_cmd, pid, 0, sig);
.LP
where
.I cmd
is
.BR PTRACE_CONT ,
.BR PTRACE_DETACH ,
.BR PTRACE_SYSCALL ,
.BR PTRACE_SINGLESTEP ,
.BR PTRACE_SYSEMU ,
or
.BR PTRACE_SYSEMU_SINGLESTEP .
If the tracee is in signal-delivery-stop,
.I sig
is the signal to be injected (if it is nonzero).
Otherwise,
.I sig
may be ignored.
(Recommended practice is to always pass 0 in these cases.)
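.LP
One way to choose the sig argument is sketched below.
It is a simplification resting on two assumptions: that
.B PTRACE_O_TRACESYSGOOD
is in effect, and that PTRACE_EVENT stops carry the event code in
bits 16 and up of the wait status.
It does not tell signal-delivery-stop apart from group-stop.
The name restart_sig is hypothetical.
.LP
.nf
```c
#include <signal.h>
#include <sys/wait.h>

/* Chooses the "sig" argument for a restarting command, given the
 * waitpid() status of the stop being left. */
static int restart_sig(int status)
{
    int sig, event;

    if (!WIFSTOPPED(status))
        return 0;
    sig = WSTOPSIG(status);
    event = (int)((unsigned int)status >> 16);  /* nonzero for event stops */
    if (event != 0 || sig == (SIGTRAP | 0x80))
        return 0;   /* not a signal-delivery-stop: pass 0 */
    return sig;     /* treat as signal-delivery-stop: inject the signal */
}
```
.fi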
.SS Attaching and detaching
A thread can be attached to the tracer using the call

    ptrace(PTRACE_ATTACH, pid, 0, 0);

This also sends
.B SIGSTOP
to this thread.
If the tracer wants this
.B SIGSTOP
to have no effect, it needs to suppress it.
Note that if other signals are concurrently sent to
this thread during attach,
the tracer may see the tracee enter signal-delivery-stop
with other signal(s) first!
The usual practice is to reinject these signals until
.B SIGSTOP
is seen, then suppress
.B SIGSTOP
injection.
.\"
.\" FIXME I significantly rewrote the following sentence to try to make it
.\" clearer. Is the meaning still preserved?
The design bug here is that a ptrace attach and a concurrently delivered
.B SIGSTOP
may race and the concurrent
.B SIGSTOP
may be lost.
.\"
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"	   Do you want to add any text?
.\"
.\"      Describe how to attach to a thread which is already group-stopped.
.LP
Since attaching sends
.B SIGSTOP
and the tracer usually suppresses it, this may cause a stray
.B EINTR
return from the currently executing system call in the tracee,
as described in the "signal injection and suppression" section.
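.LP
The attach procedure described above might look roughly like this.
It is only a sketch: the function name is hypothetical, the __WALL flag to
.BR waitpid (2)
is an assumption made so that threads created by clone are also waited for,
and the race with a concurrent SIGSTOP mentioned above is not addressed.
.LP
.nf
```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Attaches to thread "tid" and waits for the SIGSTOP sent by
 * PTRACE_ATTACH, reinjecting any other signals seen first.
 * Returns 0 with the tracee left in signal-delivery-stop for SIGSTOP,
 * or -1 on error. */
static int attach_and_wait(pid_t tid)
{
    int status;

    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) == -1)
        return -1;
    for (;;) {
        if (waitpid(tid, &status, __WALL) == -1)
            return -1;
        if (!WIFSTOPPED(status))
            return -1;              /* tracee died during attach */
        if (WSTOPSIG(status) == SIGSTOP)
            return 0;               /* caller restarts with sig == 0 */
        /* Some other signal arrived first: reinject it. */
        if (ptrace(PTRACE_CONT, tid, NULL,
                   (void *)(long)WSTOPSIG(status)) == -1)
            return -1;
    }
}
```
.fi
.LP
On success the caller suppresses the SIGSTOP by passing 0 as sig to its
next restarting command.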
.LP
The request

    ptrace(PTRACE_TRACEME, 0, 0, 0);

turns the calling thread into a tracee.
The thread continues to run (doesn't enter ptrace-stop).
A common practice is to follow the
.B PTRACE_TRACEME
with

    raise(SIGSTOP);

and allow the parent (which is our tracer now) to observe our
signal-delivery-stop.
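.LP
A sketch of this common practice, seen from the tracer's side, is given below.
The function name start_traced is hypothetical, and error handling is
kept to a minimum.
.LP
.nf
```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that becomes a tracee of the caller, stops itself with
 * SIGSTOP, and then executes argv[0].  Returns the child's PID once the
 * caller has observed the signal-delivery-stop, or -1 on error. */
static pid_t start_traced(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);          /* let the tracer see us before exec */
        execvp(argv[0], argv);
        _exit(127);              /* exec failed */
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP)
        return -1;
    return pid;
}
```
.fi
.LP
On return the tracee is stopped before its execve call, which gives the
tracer a chance to set options before restarting it.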
.LP
If the
.BR PTRACE_O_TRACEFORK ,
.BR PTRACE_O_TRACEVFORK ,
or
.BR PTRACE_O_TRACECLONE
options are in effect, then children created by, respectively,
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
and other kinds of
.BR clone (2),
are automatically attached to the same tracer which traced their parent.
.B SIGSTOP
is delivered to the children, causing them to enter
signal-delivery-stop after they exit the system call which created them.
.LP
Detaching of the tracee is performed by:

    ptrace(PTRACE_DETACH, pid, 0, sig);

.B PTRACE_DETACH
is a restarting operation;
therefore it requires the tracee to be in ptrace-stop.
If the tracee is in signal-delivery-stop, a signal can be injected.
Otherwise, the
.I sig
parameter may be silently ignored.
.LP
If the tracee is running when the tracer wants to detach it,
the usual solution is to send
.B SIGSTOP
(using
.BR tgkill (2),
to make sure it goes to the correct thread),
wait for the tracee to stop in signal-delivery-stop for
.B SIGSTOP
and then detach it (suppressing
.B SIGSTOP
injection).
A design bug is that this can race with concurrent
.BR SIGSTOP s.
Another complication is that the tracee may enter other ptrace-stops
and needs to be restarted and waited for again, until
.B SIGSTOP
is seen.
Yet another complication is to be sure that
the tracee is not already ptrace-stopped,
because no signal delivery happens while it is\(emnot even
.BR SIGSTOP .
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       Do you want to add anything?
.\"
.\"     Describe how to detach from a group-stopped tracee so that it
.\"     doesn't run, but continues to wait for SIGCONT.
.\"
.LP
If the tracer dies, all tracees are automatically detached and restarted,
unless they were in group-stop.
Handling of restart from group-stop is
.\" FIXME: Define currently
currently buggy, but the
.\" FIXME: Planned for when?
"as planned" behavior is to leave tracee stopped and waiting for
.BR SIGCONT .
If the tracee is restarted from signal-delivery-stop,
the pending signal is injected.
.SS execve(2) under ptrace
.\" clone(2) THREAD_CLONE says:
.\"     If  any  of the threads in a thread group performs an execve(2),
.\"     then all threads other than the thread group leader are terminated,
.\"     and the new program is executed in the thread group leader.
.\"
.\" FIXME mtk-addition:  please check: I added the following piece to
.\"       clarify that multithreaded here means clone()+CLONE_THREAD
.\"
When one thread in a multithreaded process
(i.e., a thread group consisting of threads created using the
.BR clone (2)
.B CLONE_THREAD
flag) calls
.\" FIXME end-mtk-addition
.\"
.BR execve (2),
the kernel destroys all other threads in the process,
.\" In kernel 3.1 sources, see fs/exec.c::de_thread()
and resets the thread ID of the execing thread to the
thread group ID (process ID).
.\"
.\" FIXME mtk-addition:  please check: I added the following piece:
(Or, to put things another way, when a multithreaded process does an
.BR execve (2),
the kernel makes it look as though the
.BR execve (2)
occurred in the thread group leader, regardless of which thread did the
.BR execve (2).)
.\" FIXME end-mtk-addition
.\"
This resetting of the thread ID looks very confusing to tracers:
.IP * 3
All other threads stop in
.\" FIXME: mtk: What is "PTRACE_EXIT stop"?
.\"        Should that be "PTRACE_EVENT_EXIT stop"?
.B PTRACE_EVENT_EXIT
stop,
.\" FIXME: mtk: In the next line, "by active ptrace option" is unclear.
.\"        What does it mean?
if the
.B PTRACE_O_TRACEEXIT
option is in effect.
Then all other threads except the thread group leader report
death as if they exited via
.BR _exit (2)
with exit code 0.
Then
.B PTRACE_EVENT_EXEC
.\" FIXME: mtk: In the next line, "by active ptrace option" is unclear
.\"        What does it mean?
stop happens, if the
.B PTRACE_O_TRACEEXEC
option is in effect.
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       (on which tracee - leader? execve-ing one?)
.\"
.\" FIXME: Please check: at various places in the following,
.\"        I have changed "pid" to "[the tracee's] thread ID"
.\"        Is that okay?
.IP *
The execing tracee changes its thread ID while it is in the
.BR execve (2).
(Remember, under ptrace, the "pid" returned from
.BR waitpid (2),
or fed into ptrace calls, is the tracee's thread ID.)
That is, the tracee's thread ID is reset to be the same as its process ID,
which is the same as the thread group leader's thread ID.
.IP *
If the thread group leader has reported its death by this time,
it appears to the tracer that
the dead thread group leader "reappears from nowhere".
If the thread group leader was still alive,
for the tracer this may look as if thread group leader
returns from a different system call than it entered,
or even "returned from a system call even though
it was not in any system call".
If the thread group leader was not traced
(or was traced by a different tracer), then during
.BR execve (2)
it will appear as if it has become a tracee of
the tracer of the execing tracee.
.LP
All of the above effects are the artifacts of
the thread ID change in the tracee.
.LP
The
.B PTRACE_O_TRACEEXEC
option is the recommended tool for dealing with this situation.
It enables
.B PTRACE_EVENT_EXEC
stop, which occurs before
.BR execve (2)
returns.
.\" FIXME Following on from the previous sentences,
.\"       can/should we add a few more words on how
.\"       PTRACE_EVENT_EXEC stop helps us deal with this situation?
.LP
The thread ID change happens before
.B PTRACE_EVENT_EXEC
stop, not after.
.LP
When the tracer receives
.B PTRACE_EVENT_EXEC
stop notification,
it is guaranteed that, except for this tracee and the thread group leader,
no other threads from the process are alive.
.LP
On receiving the
.B PTRACE_EVENT_EXEC
stop notification,
the tracer should clean up all its internal
data structures describing the threads of this process,
and retain only one data structure\(emone which
describes the single still running tracee, with

    thread ID == thread group ID == process ID.
.LP
Currently, there is no way to retrieve the former
thread ID of the execing tracee.
If the tracer doesn't keep track of its tracees' thread group relations,
it may be unable to know which tracee execed and therefore no longer
exists under the old thread ID due to a thread ID change.
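.LP
The cleanup on PTRACE_EVENT_EXEC stop can be sketched as follows,
assuming the tracer keeps a flat table that records the thread group ID
of every tracee.
The structure and the function name are hypothetical.
.LP
.nf
```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical tracer-side record of one tracee (thread). */
struct tracee {
    pid_t tid;    /* thread ID, as used in ptrace() and waitpid() */
    pid_t tgid;   /* thread group ID (process ID) */
};

/* Called on PTRACE_EVENT_EXEC stop reported for "pid".  Drops every
 * record belonging to that process and keeps a single one with
 * tid == tgid == pid.  Returns the new number of records. */
static size_t on_exec_event(struct tracee *tab, size_t n, pid_t pid)
{
    size_t i, kept = 0;

    for (i = 0; i < n; i++)
        if (tab[i].tgid != pid)
            tab[kept++] = tab[i];   /* unrelated process: keep */
    if (kept == n)
        return n;                   /* no record for this process */
    tab[kept].tid = pid;            /* the sole surviving thread */
    tab[kept].tgid = pid;
    return kept + 1;
}
```
.fi
.LP
Recording the thread group ID at attach time is what makes this possible,
since the former thread ID of the execing tracee cannot be retrieved.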
.LP
Example: two threads call
.BR execve (2)
at the same time:
.LP
.nf
*** we get syscall-enter-stop in thread 1: ***
PID1 execve("/bin/foo", "foo" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 1 ***
*** we get syscall-enter-stop in thread 2: ***
PID2 execve("/bin/bar", "bar" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 2 ***
*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL ***
*** we get syscall-exit-stop for PID0: ***
PID0 <... execve resumed> )             = 0
.fi
.LP
In this situation, there is no way to know which
.BR execve (2)
succeeded.
.LP
If the
.B PTRACE_O_TRACEEXEC
option is
.I not
in effect for the execing tracee, the kernel delivers an extra
.B SIGTRAP
to the tracee after
.BR execve (2)
returns.
This is an ordinary signal (similar to one which can be
generated by
.IR "kill -TRAP" ),
not a special kind of ptrace-stop.
Employing
.B PTRACE_GETSIGINFO
for this signal returns
.I si_code
set to 0
.RI ( SI_USER ).
This signal may be blocked by the tracee's signal mask,
and thus may be delivered (much) later.
.LP
Usually, the tracer (for example,
.BR strace (1))
would not want to show this extra post-execve
.B SIGTRAP
signal to the user, and would suppress its delivery to the tracee (if
.B SIGTRAP
is set to
.BR SIG_DFL ,
it is a killing signal).
However, determining
.I which
.B SIGTRAP
to suppress is not easy.
Setting the
.B PTRACE_O_TRACEEXEC
option and thus suppressing this extra
.B SIGTRAP
is the recommended approach.
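.LP
With the option set, the exec is reported as a PTRACE_EVENT stop instead
of a plain SIGTRAP, and the tracer can recognize it from the wait status.
The sketch below assumes the usual encoding, in which the event code
occupies bits 16 and up of the status; the function name is hypothetical.
.LP
.nf
```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Returns the PTRACE_EVENT_* code carried by a waitpid() status,
 * or 0 if the status is not a PTRACE_EVENT stop. */
static int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (int)((unsigned int)status >> 16);
}
```
.fi
.LP
A result equal to PTRACE_EVENT_EXEC identifies the exec stop;
a result of 0 with SIGTRAP as the stop signal is some other trap.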
.SS Real parent
The ptrace API (ab)uses the standard UNIX parent/child signaling over
.BR waitpid (2).
This used to cause the real parent of the process to stop receiving
several kinds of
.BR waitpid (2)
notifications when the child process is traced by some other process.
.LP
Many of these bugs have been fixed, but as of Linux 2.6.38 several still
exist; see BUGS below.
.LP
As of Linux 2.6.38, the following is believed to work correctly:
.IP * 3
exit/death by signal is reported first to the tracer, then, when the tracer
consumes the
.BR waitpid (2)
result, to the real parent (to the real parent only when the
whole multithreaded process exits).
.\"
.\" FIXME mtk: Please check: In the next line,
.\" I changed "they" to "the tracer and the real parent". Okay?
If the tracer and the real parent are the same process,
the report is sent only once.
.SH "RETURN VALUE"
On success,
.B PTRACE_PEEK*
requests return the requested data,
while other requests return zero.
On error, all requests return \-1, and
.I errno
is set appropriately.
Since the value returned by a successful
.B PTRACE_PEEK*
request may be \-1, the caller must clear
.I errno
before the call, and then check it afterward
to determine whether or not an error occurred.
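.LP
A sketch of that errno protocol, wrapped in a hypothetical helper named
peek_word, could look like this:
.LP
.nf
```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Reads one word from the tracee's memory at "addr" into *out.
 * Returns 0 on success, -1 on error with errno set. */
static int peek_word(pid_t tid, void *addr, long *out)
{
    long word;

    errno = 0;                       /* -1 may be a valid word */
    word = ptrace(PTRACE_PEEKDATA, tid, addr, NULL);
    if (word == -1 && errno != 0)
        return -1;
    *out = word;
    return 0;
}
```
.fi
.LP
Returning the word through a pointer keeps the error indication separate
from the data.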
.SH ERRORS
.TP
.B EBUSY
(i386 only) There was an error with allocating or freeing a debug register.
.TP
.B EFAULT
There was an attempt to read from or write to an invalid area in
the tracer's or the tracee's memory,
probably because the area wasn't mapped or accessible.
Unfortunately, under Linux, different variations of this fault
will return
.B EIO
or
.B EFAULT
more or less arbitrarily.
.TP
.B EINVAL
An attempt was made to set an invalid option.
.TP
.B EIO
.I request
is invalid, or an attempt was made to read from or
write to an invalid area in the tracer's or the tracee's memory,
or there was a word-alignment violation,
or an invalid signal was specified during a restart request.
.TP
.B EPERM
The specified process cannot be traced.
This could be because the
tracer has insufficient privileges (the required capability is
.BR CAP_SYS_PTRACE );
unprivileged processes cannot trace processes that they
cannot send signals to or those running
set-user-ID/set-group-ID programs, for obvious reasons.
.\"
.\" FIXME I reworked the mention of init here to note
.\" when the behavior changed for tracing init(8). Okay?
Alternatively, the process may already be being traced,
or (on kernels before 2.6.26) be
.BR init (8)
(PID 1).
.TP
.B ESRCH
The specified process does not exist, or is not currently being traced
by the caller, or is not stopped
(for requests that require a stopped tracee).
.SH "CONFORMING TO"
SVr4, 4.3BSD.
.SH NOTES
Although arguments to
.BR ptrace ()
are interpreted according to the prototype given,
glibc currently declares
.BR ptrace ()
as a variadic function with only the
.I request
argument fixed.
This means that unneeded trailing arguments may be omitted,
though doing so makes use of undocumented
.BR gcc (1)
behavior.
.\" FIXME Please review. I reinstated the following, noting the
.\" kernel version number where it ceased to be true
.LP
In Linux kernels before 2.6.26,
.\" See commit 00cd5c37afd5f431ac186dd131705048c0a11fdb
.BR init (8),
the process with PID 1, may not be traced.
.LP
The layout of the contents of memory and the USER area are
quite operating-system- and architecture-specific.
The offset supplied, and the data returned,
might not entirely match with the definition of
.IR "struct user" .
.\" See http://lkml.org/lkml/2008/5/8/375
.LP
The size of a "word" is determined by the operating-system variant
(e.g., for 32-bit Linux it is 32 bits, etc.).
.\" FIXME So, can we just remove the following text?
.\"
.\" Covered in more details above: (removed by dv)
.\" .LP
.\" Tracing causes a few subtle differences in the semantics of
.\" traced processes.
.\" For example, if a process is attached to with
.\" .BR PTRACE_ATTACH ,
.\" its original parent can no longer receive notification via
.\" .BR waitpid (2)
.\" when it stops, and there is no way for the new parent to
.\" effectively simulate this notification.
.\" .LP
.\" When the parent receives an event with
.\" .B PTRACE_EVENT_*
.\" set,
.\" the tracee is not in the normal signal delivery path.
.\" This means the parent cannot do
.\" .BR ptrace (PTRACE_CONT)
.\" with a signal or
.\" .BR ptrace (PTRACE_KILL).
.\" .BR kill (2)
.\" with a
.\" .B SIGKILL
.\" signal can be used instead to kill the tracee
.\" after receiving one of these messages.
.\" .LP
This page documents the way the
.BR ptrace ()
call works currently in Linux.
Its behavior differs noticeably on other flavors of UNIX.
In any case, use of
.BR ptrace ()
is highly specific to the operating system and architecture.
.SH BUGS
On hosts with 2.6 kernel headers,
.B PTRACE_SETOPTIONS
is declared with a different value than the one for 2.4.
This leads to applications compiled with 2.6 kernel
headers failing when run on 2.4 kernels.
This can be worked around by redefining
.B PTRACE_SETOPTIONS
to
.BR PTRACE_OLDSETOPTIONS ,
if that is defined.
.LP
Group-stop notifications are sent to the tracer, but not to the real parent.
Last confirmed on 2.6.38.6.
.LP
.\"
.\" FIXME Does "exits" in the following mean
.\" just "_exit(2)" or both "_exit(2) and exit_group(2)"?
If a thread group leader is traced and exits by calling
.BR _exit (2),
a
.B PTRACE_EVENT_EXIT
stop will happen for it (if requested), but the subsequent
.B WIFEXITED
notification will not be delivered until all other threads exit.
As explained above, if one of the other threads calls
.BR execve (2),
the death of the thread group leader will
.I never
be reported.
If the execed thread is not traced by this tracer,
the tracer will never know that
.BR execve (2)
happened.
One possible workaround is to
.B PTRACE_DETACH
the thread group leader instead of restarting it in this case.
Last confirmed on 2.6.38.6.
.\"        ^^^ need to test/verify this scenario
.\" FIXME: mtk: the preceding comment seems to be unresolved?
.\"        Do you want to add anything?
.LP
A
.B SIGKILL
signal may still cause a
.B PTRACE_EVENT_EXIT
stop before actual signal death.
This may be changed in the future;
.B SIGKILL
is meant to always immediately kill tasks even under ptrace.
Last confirmed on 2.6.38.6.
.SH "SEE ALSO"
.BR gdb (1),
.BR strace (1),
.BR clone (2),
.BR execve (2),
.BR fork (2),
.BR gettid (2),
.BR sigaction (2),
.BR tgkill (2),
.BR vfork (2),
.BR waitpid (2),
.BR exec (3),
.BR capabilities (7),
.BR signal (7)


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

[-- Attachment #2: ptrace.2 --]
[-- Type: application/octet-stream, Size: 50922 bytes --]

.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright (c) 1993 Michael Haardt <michael@moria.de>
.\" Fri Apr  2 11:32:09 MET DST 1993
.\" and Copyright (c) 2011, Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" changes Copyright 1999 Mike Coleman (mkc@acm.org)
.\" -- major revision to fully document ptrace semantics per recent Linux
.\"    kernel (2.2.10) and glibc (2.1.2)
.\" Sun Nov  7 03:18:35 CST 1999
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, write to the Free
.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111,
.\" USA.
.\"
.\" Modified Fri Jul 23 23:47:18 1993 by Rik Faith <faith@cs.unc.edu>
.\" Modified Fri Jan 31 16:46:30 1997 by Eric S. Raymond <esr@thyrsus.com>
.\" Modified Thu Oct  7 17:28:49 1999 by Andries Brouwer <aeb@cwi.nl>
.\" Modified, 27 May 2004, Michael Kerrisk <mtk.manpages@gmail.com>
.\"     Added notes on capability requirements
.\"
.\" 2006-03-24, Chuck Ebbert <76306.1226@compuserve.com>
.\"    Added    PTRACE_SETOPTIONS, PTRACE_GETEVENTMSG, PTRACE_GETSIGINFO,
.\"        PTRACE_SETSIGINFO, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP
.\"    (Thanks to Blaisorblade, Daniel Jacobowitz and others who helped.)
.\" 2011-09, major update by Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" FIXME (later): Linux 3.1 adds PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_LISTEN.
.\"
.TH PTRACE 2 2011-09-26 "Linux" "Linux Programmer's Manual"
.SH NAME
ptrace \- process trace
.SH SYNOPSIS
.nf
.B #include <sys/ptrace.h>
.sp
.BI "long ptrace(enum __ptrace_request " request ", pid_t " pid ", "
.BI "            void *" addr ", void *" data );
.fi
.SH DESCRIPTION
The
.BR ptrace ()
system call provides a means by which one process (the "tracer")
may observe and control the execution of another process (the "tracee"),
and examine and change the tracee's memory and registers.
It is primarily used to implement breakpoint debugging and system
call tracing.
.LP
A tracee first needs to be attached to the tracer.
Attachment and subsequent commands are per thread:
in a multithreaded process,
every thread can be individually attached to a
(potentially different) tracer,
or left not attached and thus not debugged.
Therefore, "tracee" always means "(one) thread",
never "a (possibly multithreaded) process".
Ptrace commands are always sent to
a specific tracee using a call of the form

    ptrace(PTRACE_foo, pid, ...)

where
.I pid
is the thread ID of the corresponding Linux thread.
.LP
A process can initiate a trace by calling
.BR fork (2)
and having the resulting child do a
.BR PTRACE_TRACEME ,
followed (typically) by an
.BR execve (2).
Alternatively, one process may commence tracing another process using
.BR PTRACE_ATTACH .
.LP
While being traced, the tracee will stop each time a signal is delivered,
even if the signal is being ignored.
(An exception is
.BR SIGKILL ,
which has its usual effect.)
The tracer will be notified at its next call to
.BR waitpid (2)
(or one of the related "wait" system calls)
and may inspect and modify the tracee while it is stopped.
The tracer then causes the tracee to continue,
optionally ignoring the delivered signal
(or even delivering a different signal instead).
.LP
When the tracer is finished tracing, it can cause the tracee to continue
executing in a normal, untraced mode via
.BR PTRACE_DETACH .
.LP
The value of
.I request
determines the action to be performed:
.TP
.B PTRACE_TRACEME
Indicate that this process is to be traced by its parent.
Any signal (except
.BR SIGKILL )
delivered to this process will cause it to stop and its
parent to be notified via
.BR waitpid (2).
In addition, all subsequent calls to
.BR execve (2)
by the traced process will cause a
.B SIGTRAP
to be sent to it,
giving the parent a chance to gain control before the new program
begins execution.
A process probably shouldn't make this request if its parent
isn't expecting to trace it.
.RI ( pid ,
.IR addr ,
and
.IR data
are ignored.)
.LP
The
.B PTRACE_TRACEME
request is used only by the tracee;
the remaining requests are used only by the tracer.
In the following requests,
.I pid
specifies the thread ID of the tracee to be acted on.
For requests other than
.BR PTRACE_KILL ,
the tracee must be stopped.
.TP
.BR PTRACE_PEEKTEXT ", " PTRACE_PEEKDATA
Read a word at the address
.I addr
in the tracee's memory, returning the word as the result of the
.BR ptrace ()
call.
Linux does not have separate text and data address spaces,
so these two requests are currently equivalent.
.RI ( data
is ignored.)
.TP
.B PTRACE_PEEKUSER
.\" PTRACE_PEEKUSR in kernel source, but glibc uses PTRACE_PEEKUSER,
.\" and that is the name that seems common on other systems.
Read a word at offset
.I addr
in the tracee's USER area,
which holds the registers and other information about the process
(see
.IR <sys/user.h> ).
The word is returned as the result of the
.BR ptrace ()
call.
Typically, the offset must be word-aligned, though this might vary by
architecture.
See NOTES.
.RI ( data
is ignored.)
.TP
.BR PTRACE_POKETEXT ", " PTRACE_POKEDATA
Copy the word
.I data
to the address
.I addr
in the tracee's memory.
As for
.BR PTRACE_PEEKTEXT 
and
.BR PTRACE_PEEKDATA ,
these two requests are currently equivalent.
.TP
.B PTRACE_POKEUSER
.\" PTRACE_POKEUSR in kernel source, but glibc uses PTRACE_POKEUSER,
.\" and that is the name that seems common on other systems.
Copy the word
.I data
to offset
.I addr
in the tracee's USER area.
As for
.BR PTRACE_PEEKUSER ,
the offset must typically be word-aligned.
In order to maintain the integrity of the kernel,
some modifications to the USER area are disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.TP
.BR PTRACE_GETREGS ", " PTRACE_GETFPREGS
Copy the tracee's general purpose or floating-point registers,
respectively, to the address
.I data
in the tracer.
See
.I <sys/user.h>
for information on the format of this data.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_GETSIGINFO " (since Linux 2.3.99-pre6)"
Retrieve information about the signal that caused the stop.
Copy a
.I siginfo_t
structure (see
.BR sigaction (2))
from the tracee to the address
.I data
in the tracer.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETREGS ", " PTRACE_SETFPREGS
Copy the tracee's general purpose or floating-point registers,
respectively, from the address
.I data
in the tracer.
As for
.BR PTRACE_POKEUSER ,
some general purpose register modifications may be disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETSIGINFO " (since Linux 2.3.99-pre6)"
Set signal information:
copy a
.I siginfo_t
structure from the address
.I data
in the tracer to the tracee.
This will affect only signals that would normally be delivered to
the tracee and were caught by the tracer.
It may be difficult to tell
these normal signals from synthetic signals generated by
.BR ptrace ()
itself.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETOPTIONS " (since Linux 2.4.6; see BUGS for caveats)"
Set ptrace options from
.IR data .
.RI ( addr
is ignored.)
.IR data
is interpreted as a bit mask of options,
which are specified by the following flags:
.RS
.TP
.BR PTRACE_O_TRACESYSGOOD " (since Linux 2.4.6)"
When delivering system call traps, set bit 7 in the signal number
(i.e., deliver
.IR "SIGTRAP|0x80" ).
This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
.RB ( PTRACE_O_TRACESYSGOOD
may not work on all architectures.)
.\" FIXME Please check. In the following paragraphs, I substituted language
.\" such as:
.\"     Stop tracee at next fork(2) call with SIGTRAP|PTRACE_EVENT_FORK<<8
.\" with:
.\"     Stop tracee at next fork(2) call... A subsequent PTRACE_GETSIGINFO
.\"     on the stopped tracee will return a siginfo_t structure with si_code
.\"     set to SIGTRAP|PTRACE_EVENT_FORK<<8.
.\"
.\" Is this change correct?
.\"
.TP
.BR PTRACE_O_TRACEFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR fork (2)
and automatically start tracing the newly forked process,
which will start with a
.BR SIGSTOP .
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_FORK<<8 .
The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEVFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR vfork (2)
and automatically start tracing the newly vforked process,
which will start with a
.BR SIGSTOP .
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_VFORK<<8 .
The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACECLONE " (since Linux 2.5.46)"
Stop the tracee at the next
.BR clone (2)
and automatically start tracing the newly cloned process,
which will start with a
.BR SIGSTOP .
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_CLONE<<8 .
The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
This option may not catch
.BR clone (2)
calls in all cases.
If the tracee calls
.BR clone (2)
with the
.B CLONE_VFORK
flag,
.B PTRACE_EVENT_VFORK
will be delivered instead
if
.B PTRACE_O_TRACEVFORK
is set; otherwise if the tracee calls
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
.B PTRACE_EVENT_FORK
will be delivered if
.B PTRACE_O_TRACEFORK
is set.
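.IP
The selection rule above can be written down as a small function.
This is a sketch of the rule as described here, not kernel code; the
function name and the EXIT_SIGNAL_MASK macro are hypothetical, and the
fallback value of CLONE_VFORK is the one normally supplied by <sched.h>.
.IP
.nf
```c
#include <signal.h>
#include <sys/ptrace.h>

#ifndef CLONE_VFORK
#define CLONE_VFORK 0x00004000   /* normally provided by <sched.h> */
#endif
#define EXIT_SIGNAL_MASK 0xff    /* low byte of clone flags: exit signal */

/* Predicts which PTRACE_EVENT_* stop a clone(2) call with the given
 * flags produces under the given PTRACE_O_* options; 0 means none. */
static int expected_clone_event(unsigned long clone_flags, long options)
{
    if (clone_flags & CLONE_VFORK)
        return (options & PTRACE_O_TRACEVFORK) ? PTRACE_EVENT_VFORK : 0;
    if ((clone_flags & EXIT_SIGNAL_MASK) == SIGCHLD)
        return (options & PTRACE_O_TRACEFORK) ? PTRACE_EVENT_FORK : 0;
    return (options & PTRACE_O_TRACECLONE) ? PTRACE_EVENT_CLONE : 0;
}
```
.fi
.IP
A tracer that wants to see every new child therefore sets all three options.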
.TP
.BR PTRACE_O_TRACEEXEC " (since Linux 2.5.46)"
Stop the tracee at the next
.BR execve (2).
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_EXEC<<8 .
.TP
.BR PTRACE_O_TRACEVFORKDONE " (since Linux 2.5.60)"
Stop the tracee at the completion of the next
.BR vfork (2).
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_VFORK_DONE<<8 .
The PID of the new process can (since Linux 2.6.18) be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEEXIT " (since Linux 2.5.60)"
Stop the tracee at exit.
A subsequent
.B PTRACE_GETSIGINFO
on the stopped tracee will return a
.I siginfo_t
structure with
.I si_code
set to
.IR SIGTRAP|PTRACE_EVENT_EXIT<<8 .
The tracee's exit status can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
The tracee is stopped early during process exit,
when registers are still available,
allowing the tracer to see where the exit occurred,
whereas the normal exit notification is done after the process
is finished exiting.
Even though context is available,
the tracer cannot prevent the exit from happening at this point.
.RE
.TP
.BR PTRACE_GETEVENTMSG " (since Linux 2.5.46)"
Retrieve a message (as an
.IR "unsigned long" )
about the ptrace event
that just happened, placing it at the address
.I data
in the tracer.
For
.BR PTRACE_EVENT_EXIT ,
this is the tracee's exit status.
For
.BR PTRACE_EVENT_FORK ,
.BR PTRACE_EVENT_VFORK ,
.BR PTRACE_EVENT_VFORK_DONE ,
and
.BR PTRACE_EVENT_CLONE ,
this is the PID of the new process.
.RI ( addr
is ignored.)
.TP
.B PTRACE_CONT
Restart the stopped tracee process.
If
.I data
is nonzero,
it is interpreted as the number of a signal to be delivered to the tracee;
otherwise, no signal is delivered.
Thus, for example, the tracer can control
whether a signal sent to the tracee is delivered or not.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSCALL ", " PTRACE_SINGLESTEP
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but arrange for the tracee to be stopped at
the next entry to or exit from a system call,
or after execution of a single instruction, respectively.
(The tracee will also, as usual, be stopped upon receipt of a signal.)
From the tracer's perspective, the tracee will appear to have been
stopped by receipt of a
.BR SIGTRAP .
So, for
.BR PTRACE_SYSCALL ,
for example, the idea is to inspect
the arguments to the system call at the first stop,
then do another
.B PTRACE_SYSCALL
and inspect the return value of the system call at the second stop.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSEMU ", " PTRACE_SYSEMU_SINGLESTEP " (since Linux 2.6.14)"
For
.BR PTRACE_SYSEMU ,
continue and stop on entry to the next system call,
which will not be executed.
For
.BR PTRACE_SYSEMU_SINGLESTEP ,
do the same but also singlestep if not a system call.
This call is used by programs like
User Mode Linux that want to emulate all the tracee's system calls.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored;
not supported on all architectures.)
.TP
.B PTRACE_KILL
Send the tracee a
.B SIGKILL
to terminate it.
.RI ( addr
and
.I data
are ignored.)
.IP
.I This operation is deprecated; do not use it!
Instead, send a
.BR SIGKILL
directly using
.BR kill (2)
or
.BR tgkill (2).
The problem with
.B PTRACE_KILL
is that it requires the tracee to be in signal-delivery-stop,
otherwise it may not work
(i.e., may complete successfully but won't kill the tracee).
By contrast, sending a
.B SIGKILL
directly has no such limitation.
.\" mtk: Commented out the following. It doesn't belong in the man page.
.\" .LP
.\" [Note: deprecation suggested by Oleg Nesterov. He prefers to deprecate
.\" it instead of describing (and needing to support) PTRACE_KILL's quirks.]
.TP
.B PTRACE_ATTACH
Attach to the process specified in
.IR pid ,
making it a tracee of the calling process.
.\" FIXME So, was the following EVER true? IF it was,
.\"       we should reinstate the text and add mention of
.\"       the kernel version where the behaviour changed.
.\"
.\" Not true: (removed by dv)
.\" ; the behavior of the tracee is as if it had done a
.\" .BR PTRACE_TRACEME .
.\" The calling process actually becomes the parent of the tracee
.\" process for most purposes (e.g., it will receive
.\" notification of tracee events and appears in
.\" .BR ps (1)
.\" output as the tracee's parent), but a
.\" .BR getppid (2)
.\" by the tracee will still return the PID of the original parent.
The tracee is sent a
.BR SIGSTOP ,
but will not necessarily have stopped
by the completion of this call; use
.BR waitpid (2)
to wait for the tracee to stop.
See the "Attaching and detaching" subsection for additional information.
.RI ( addr
and
.I data
are ignored.)
.TP
.B PTRACE_DETACH
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but first detach from it.
Under Linux, a tracee can be detached in this way regardless
of which method was used to initiate tracing.
.RI ( addr
is ignored.)
.\"
.\" In the text below, we decided to avoid prettifying the text with markup:
.\" it would make the source nearly impossible to edit, and we _do_ intend
.\" to edit it often, in order to keep it updated:
.\" ptrace API is full of quirks, no need to compound this situation by
.\" making it excruciatingly painful to document them!
.\"
.SS Death under ptrace
When a (possibly multithreaded) process receives a killing signal
(one whose disposition is set to
.B SIG_DFL
and whose default action is to kill the process),
all threads exit.
Tracees report their death to their tracer(s).
Notification of this event is delivered via
.BR waitpid (2).
.LP
Note that the killing signal will first cause signal-delivery-stop
(on one tracee only),
and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will death from the signal happen on
.I all
tracees within a multithreaded process.
(The term "signal-delivery-stop" is explained below.)
.LP
.B SIGKILL
operates similarly, with exceptions.
No signal-delivery-stop is generated for
.B SIGKILL
and therefore the tracer can't suppress it.
.B SIGKILL
kills even within system calls
(syscall-exit-stop is not generated prior to death by
.BR SIGKILL ).
The net effect is that
.B SIGKILL
always kills the process (all its threads),
even if some threads of the process are ptraced.
.LP
When the tracee calls
.BR _exit (2),
it reports its death to its tracer.
Other threads are not affected.
.LP
When any thread executes
.BR exit_group (2),
every tracee in its thread group reports its death to its tracer.
.LP
If the
.B PTRACE_O_TRACEEXIT
option is on,
.B PTRACE_EVENT_EXIT
will happen before actual death.
This applies to exits via
.BR _exit (2),
.BR exit_group (2),
and signal deaths (except
.BR SIGKILL ),
and when threads are torn down on
.BR execve (2)
in a multithreaded process.
.LP
The tracer cannot assume that the ptrace-stopped tracee exists.
There are many scenarios when the tracee may die while stopped (such as
.BR SIGKILL ).
Therefore, the tracer must be prepared to handle an 
.B ESRCH
error on any ptrace operation.
Unfortunately, the same error is returned if the tracee
exists but is not ptrace-stopped
(for commands which require a stopped tracee),
or if it is not traced by the process which issued the ptrace call.
The tracer needs to keep track of the stopped/running state of the tracee,
and interpret
.B ESRCH
as "tracee died unexpectedly" only if it knows that the tracee has
been observed to enter ptrace-stop.
Note that there is no guarantee that
.I waitpid(WNOHANG)
will reliably report the tracee's death status if a
ptrace operation returned
.BR ESRCH .
.I waitpid(WNOHANG)
may return 0 instead.
In other words, the tracee may be "not yet fully dead",
but already refusing ptrace requests.
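.LP
The bookkeeping described above can be sketched as follows.
This is only an illustration, not part of the ptrace API:
the structure, the enumeration, and the function name are hypothetical,
and the tracer is assumed to set
.I known_stopped
when
.BR waitpid (2)
reports a ptrace-stop and to clear it on every restarting request.
.nf
```c
#include <errno.h>

/* Hypothetical per-tracee record kept by the tracer. */
struct tracee_state {
    int known_stopped;  /* 1 after waitpid(2) reported a ptrace-stop,
                           0 after any restarting request */
};

enum ptrace_failure {
    FAILURE_OTHER,          /* errno was not ESRCH */
    FAILURE_TRACEE_DIED,    /* ESRCH although the tracee was stopped */
    FAILURE_NOT_STOPPED     /* ESRCH, tracee running or not ours */
};

/* Interpret the errno value left by a failed ptrace request. */
enum ptrace_failure
interpret_ptrace_errno(const struct tracee_state *t, int err)
{
    if (err != ESRCH)
        return FAILURE_OTHER;
    return t->known_stopped ? FAILURE_TRACEE_DIED : FAILURE_NOT_STOPPED;
}
```
.fi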
.LP
The tracer can't assume that the tracee
.I always
ends its life by reporting
.I WIFEXITED(status)
or
.IR WIFSIGNALED(status) .
.LP
.\"     or can it? Do we include such a promise into ptrace API?
.\"
.\" FIXME: The preceding comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.SS Stopped states
A tracee can be in two states: running or stopped.
.LP
There are many kinds of states when the tracee is stopped, and in ptrace
discussions they are often conflated.
Therefore, it is important to use precise terms.
.LP
In this manual page, any stopped state in which the tracee is ready
to accept ptrace commands from the tracer is called
.IR ptrace-stop .
Ptrace-stops can
be further subdivided into
.IR signal-delivery-stop ,
.IR group-stop ,
.IR syscall-stop ,
and so on.
These stopped states are described in detail below.
.LP
When the running tracee enters ptrace-stop, it notifies its tracer using
.BR waitpid (2)
(or one of the other "wait" system calls).
Most of this manual page assumes that the tracer waits with:
.LP
    pid = waitpid(pid_or_minus_1, &status, __WALL);
.LP
Ptrace-stopped tracees are reported as returns with
.I pid
greater than 0 and
.I WIFSTOPPED(status)
true.
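.LP
A minimal sketch of such a wait helper is shown below.
It is an illustration under stated assumptions, not a definitive implementation:
the function name and its return convention are hypothetical,
and the only policy it adds is to retry when
.BR waitpid (2)
is interrupted by a signal in the tracer.
.nf
```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Wait for the next notification from one tracee (or from any tracee
 * if pid is -1).  Returns 1 if the tracee is now in a ptrace-stop,
 * 0 if it has terminated, and -1 on error (errno is set by waitpid).
 */
int
wait_for_tracee(pid_t pid, int *status)
{
    pid_t ret;

    do {
        ret = waitpid(pid, status, __WALL);
    } while (ret == -1 && errno == EINTR);  /* retry if interrupted */

    if (ret == -1)
        return -1;
    return WIFSTOPPED(*status) ? 1 : 0;
}
```
.fi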
.LP
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.\"     Do we require __WALL usage, or will just using 0 be ok? Are the
.\"     rules different if user wants to use waitid? Will waitid require
.\"     WEXITED?
.\"
.LP
.\" FIXME: Is the following comment "__WALL... implies" true?
The
.B __WALL
flag does not include the
.B WSTOPPED
and
.B WEXITED
flags, but implies their functionality.
.LP
Setting the
.B WCONTINUED
flag when calling
.BR waitpid (2)
is not recommended: the "continued" state is per-process and
consuming it can confuse the real parent of the tracee.
.LP
Use of the
.B WNOHANG
flag may cause
.BR waitpid (2)
to return 0 ("no wait results available yet")
even if the tracer knows there should be a notification.
Example:
.nf

    kill(tracee, SIGKILL);
    waitpid(tracee, &status, __WALL | WNOHANG);
.fi
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.\"     waitid usage? WNOWAIT?
.\"     describe how wait notifications queue (or not queue)
.LP
The following kinds of ptrace-stops exist: signal-delivery-stops,
group-stops, PTRACE_EVENT stops, syscall-stops,
.\"
.\" FIXME: mtk: the following text appears to be incomplete.
.\"        Do you want to add anything?
.\"
and the stops caused by PTRACE_SINGLESTEP, PTRACE_SYSEMU, and
PTRACE_SYSEMU_SINGLESTEP.
They all are reported by
.BR waitpid (2)
with
.I WIFSTOPPED(status)
true.
They may be differentiated by examining the value
.IR status>>8 ,
and if there is ambiguity in that value, by querying
.BR PTRACE_GETSIGINFO .
.\"
.\" FIXME What is the purpose of the following sentence? Is it to warn
.\"       the reader not to use WSTOPSIG()? If so, we should make that
.\"       point more explicitly.
(Note: the
.I WSTOPSIG(status)
macro returns the value
.IR "((status>>8)\ &\ 0xff)" .)
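.LP
The first step of that differentiation could look like the following sketch.
The enumeration and the function name are hypothetical;
the 0x80 bit is assumed to come from the
.B PTRACE_O_TRACESYSGOOD
option described below,
and a result of STOP_SIGTRAP or STOP_SIGNAL still needs the
.B PTRACE_GETSIGINFO
query (or the tracer's own bookkeeping) to be resolved fully.
.nf
```c
#include <signal.h>
#include <sys/wait.h>

enum stop_kind {
    STOP_NONE,      /* status does not describe a stop */
    STOP_EVENT,     /* PTRACE_EVENT stop */
    STOP_SYSCALL,   /* syscall-stop with PTRACE_O_TRACESYSGOOD */
    STOP_SIGTRAP,   /* plain SIGTRAP: ambiguous */
    STOP_SIGNAL     /* signal-delivery-stop or group-stop */
};

/* Classify a wait status by looking at status>>8 only. */
enum stop_kind
classify_stop(int status)
{
    int code;

    if (!WIFSTOPPED(status))
        return STOP_NONE;

    code = status >> 8;
    if (code == (SIGTRAP | 0x80))
        return STOP_SYSCALL;
    if ((code & 0xff) == SIGTRAP && (code >> 8) != 0)
        return STOP_EVENT;      /* event number in the higher byte */
    if (code == SIGTRAP)
        return STOP_SIGTRAP;
    return STOP_SIGNAL;
}
```
.fi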
.SS Signal-delivery-stop
When a (possibly multithreaded) process receives any signal except
.BR SIGKILL ,
the kernel selects an arbitrary thread which handles the signal.
(If the signal is generated with
.BR tgkill (2),
the target thread can be explicitly selected by the caller.)
If the selected thread is traced, it enters signal-delivery-stop.
At this point, the signal is not yet delivered to the process,
and can be suppressed by the tracer.
If the tracer doesn't suppress the signal,
.\"
.\" FIXME: I added the word "restart" to the following line. Okay?
it passes the signal to the tracee in the next ptrace restart request.
This second step of signal delivery is called
.I "signal injection"
in this manual page.
Note that if the signal is blocked,
signal-delivery-stop doesn't happen until the signal is unblocked,
with the usual exception that
.B SIGSTOP
can't be blocked.
.LP
Signal-delivery-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal returned by
.IR WSTOPSIG(status) .
If the stopping signal is
.BR SIGTRAP ,
this may be a different kind of ptrace-stop;
see the "Syscall-stops" and "execve" sections below for details.
If
.I WSTOPSIG(status)
returns a stopping signal, this may be a group-stop; see below.
.SS Signal injection and suppression
After signal-delivery-stop is observed by the tracer,
the tracer should restart the tracee with the call
.LP
    ptrace(PTRACE_restart, pid, 0, sig)
.LP
where
.B PTRACE_restart
is one of the restarting ptrace requests.
If
.I sig
is 0, then a signal is not delivered.
Otherwise, the signal
.I sig
is delivered.
This operation is called
.I "signal injection"
in this manual page, to distinguish it from signal-delivery-stop.
.LP
Note that the
.I sig
value may be different from the
.I WSTOPSIG(status)
value: the tracer can cause a different signal to be injected.
.LP
Note that a suppressed signal still causes system calls to return
prematurely.
Restartable system calls will be restarted (the tracer will
observe the tracee to execute
.BR restart_syscall (2)
if the tracer uses
.BR PTRACE_SYSCALL );
non-restartable system calls may fail with
.B EINTR
even though no observable signal is injected to the tracee.
.LP
Note that restarting ptrace commands issued in ptrace-stops other than
signal-delivery-stop are not guaranteed to inject a signal, even if
.I sig
is nonzero.
No error is reported; a nonzero
.I sig
may simply be ignored.
Ptrace users should not try to "create a new signal" this way: use
.BR tgkill (2)
instead.
.LP
.\"
.\" FIXME: the referrent of "This" in the next line is not clear.
.\"        What does "This" refer to?
The fact that a nonzero
.I sig
may be silently ignored is a cause of confusion among ptrace users.
One typical scenario is that the tracer observes group-stop,
mistakes it for signal-delivery-stop, restarts the tracee with

    ptrace(PTRACE_restart, pid, 0, stopsig)

with the intention of injecting
.IR stopsig ,
but
.I stopsig
gets ignored and the tracee continues to run.
.LP
The
.B SIGCONT
signal has a side effect of waking up (all threads of)
a group-stopped process.
This side effect happens before signal-delivery-stop.
The tracer can't suppress this side effect (it can
only suppress signal injection, which only causes the
.BR SIGCONT
handler to not be executed in the tracee, if such a handler is installed).
In fact, waking up from group-stop may be followed by
signal-delivery-stop for signal(s)
.I other than
.BR SIGCONT ,
if they were pending when
.B SIGCONT
was delivered.
In other words,
.B SIGCONT
may not be the first signal observed by the tracee after it was sent.
.LP
Stopping signals cause (all threads of) a process to enter group-stop.
This side effect happens after signal injection, and therefore can be
suppressed by the tracer.
.LP
.B PTRACE_GETSIGINFO
can be used to retrieve a
.I siginfo_t
structure which corresponds to the delivered signal.
.B PTRACE_SETSIGINFO
may be used to modify it.
If
.B PTRACE_SETSIGINFO
has been used to alter
.IR siginfo_t ,
the
.I si_signo
field and the
.I sig
parameter in the restarting command must match,
otherwise the result is undefined.
.SS Group-stop
When a (possibly multithreaded) process receives a stopping signal,
all threads stop.
If some threads are traced, they enter a group-stop.
Note that the stopping signal will first cause signal-delivery-stop
(on one tracee only), and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will group-stop be initiated on
.I all
tracees within the multithreaded process.
As usual, every tracee reports its group-stop separately
to the corresponding tracer.
.LP
Group-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal available via
.IR WSTOPSIG(status) .
The same result is returned by some other classes of ptrace-stops,
therefore the recommended practice is to perform the call
.LP
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
.LP
The call can be avoided if the signal is not
.BR SIGSTOP ,
.BR SIGTSTP ,
.BR SIGTTIN ,
or
.BR SIGTTOU ;
only these four signals are stopping signals.
If the tracer sees something else, it can't be a group-stop.
Otherwise, the tracer needs to call
.BR PTRACE_GETSIGINFO .
If
.B PTRACE_GETSIGINFO
fails with
.BR EINVAL ,
then it is definitely a group-stop.
(Other failure codes are possible, such as
.B ESRCH
("no such process") if a
.B SIGKILL
killed the tracee.)
.LP
As of kernel 2.6.38,
after the tracer sees the tracee ptrace-stop and until it
restarts or kills it, the tracee will not run,
and will not send notifications (except
.B SIGKILL
death) to the tracer, even if the tracer enters into another
.BR waitpid (2)
call.
.LP
.\"
.\" FIXME ??? referrent of "it" in the next line is unclear
.\"        What does "it" refer to?
Currently, the kernel behavior described in the previous paragraph
causes a problem with transparent handling of stopping
signals: if the tracer restarts the tracee after group-stop,
.B SIGSTOP
is effectively ignored: the tracee doesn't remain stopped, it runs.
If the tracer doesn't restart the tracee before entering into the next
.BR waitpid (2),
future
.B SIGCONT
signals will not be reported to the tracer.
This would cause
.B SIGCONT
to have no effect.
.SS PTRACE_EVENT stops
If the tracer sets
.B PTRACE_O_TRACE*
options, the tracee will enter ptrace-stops called
.B PTRACE_EVENT
stops.
.LP
.B PTRACE_EVENT
stops are observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true,
and
.I WSTOPSIG(status)
returns
.BR SIGTRAP .
An additional bit is set in the higher byte of the status word:
the value
.I status>>8
will be

    (SIGTRAP | PTRACE_EVENT_foo << 8).

The following events exist:
.TP
.B PTRACE_EVENT_VFORK
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag.
When the tracee is continued after this stop, it will wait for the child to
exit/exec before continuing its execution
(in other words, the usual behavior on
.BR vfork (2)).
.TP
.B PTRACE_EVENT_FORK
Stop before return from
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD .
.TP
.B PTRACE_EVENT_CLONE
Stop before return from
.BR clone (2).
.TP
.B PTRACE_EVENT_VFORK_DONE
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
but after the child unblocked this tracee by exiting or execing.
.LP
For all four stops described above,
the stop occurs in the parent (i.e., the tracee),
not in the newly created thread.
.BR PTRACE_GETEVENTMSG
can be used to retrieve the new thread's ID.
.TP
.B PTRACE_EVENT_EXEC
Stop before return from
.BR execve (2).
.TP
.B PTRACE_EVENT_EXIT
Stop before exit (including death from
.BR exit_group (2)),
signal death, or exit caused by
.BR execve (2)
in a multithreaded process.
.B PTRACE_GETEVENTMSG
returns the exit status.
Registers can be examined
(unlike when "real" exit happens).
The tracee is still alive; it needs to be
.BR PTRACE_CONT ed
or
.BR PTRACE_DETACH ed
to finish exiting.
.LP
.B PTRACE_GETSIGINFO
on
.B PTRACE_EVENT
stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.IR "(event<<8)\ |\ SIGTRAP" .
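.LP
Extracting the event number from the wait status can be sketched as follows.
Both function names are hypothetical.
The second helper merely restates, for the four fork-like events,
the statement above that
.B PTRACE_GETEVENTMSG
yields the new thread's ID;
it is a reading of this page, not a kernel interface.
.nf
```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* number encoded in status, or 0 if
 * status does not describe a PTRACE_EVENT stop.
 */
int
ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return status >> 16;    /* higher byte of status>>8 */
}

/* Nonzero if PTRACE_GETEVENTMSG gives a new thread ID for event. */
int
eventmsg_is_new_tid(int event)
{
    return event == PTRACE_EVENT_FORK ||
           event == PTRACE_EVENT_VFORK ||
           event == PTRACE_EVENT_CLONE ||
           event == PTRACE_EVENT_VFORK_DONE;
}
```
.fi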
.SS Syscall-stops
If the tracee was restarted by
.BR PTRACE_SYSCALL ,
the tracee enters
syscall-enter-stop just prior to entering any system call.
If the tracer restarts the tracee with
.BR PTRACE_SYSCALL ,
the tracee enters syscall-exit-stop when the system call is finished,
or if it is interrupted by a signal.
(That is, signal-delivery-stop never happens between syscall-enter-stop
and syscall-exit-stop; it happens
.I after
syscall-exit-stop.)
.LP
Other possibilities are that the tracee may stop in a
.B PTRACE_EVENT
stop, exit (if it entered
.BR _exit (2)
or
.BR exit_group (2)),
be killed by
.BR SIGKILL ,
or die silently (if it is a thread group leader, the
.BR execve (2)
happened in another thread,
and that thread is not traced by the same tracer;
this situation is discussed later).
.LP
Syscall-enter-stop and syscall-exit-stop are observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, and
.I WSTOPSIG(status)
giving
.BR SIGTRAP .
If the
.B PTRACE_O_TRACESYSGOOD
option was set by the tracer, then
.I WSTOPSIG(status)
will give the value
.IR "(SIGTRAP\ |\ 0x80)" .
.LP
Syscall-stops can be distinguished from signal-delivery-stop with
.B SIGTRAP
by querying
.BR PTRACE_GETSIGINFO
for the following cases:
.TP
.IR si_code " <= 0"
.B SIGTRAP
.\" FIXME: Please confirm this is okay: I changed
.\"        "the usual suspects" to "by a system call". Okay?
.\"        Shouldn't we also add kill(2) here?
was sent by a system call
.RB ( tgkill (2),
.BR sigqueue (3),
etc.)
.TP
.IR si_code " == SI_KERNEL (0x80)"
.B SIGTRAP
was sent by the kernel.
.TP
.IR si_code " == SIGTRAP or " si_code " == (SIGTRAP|0x80)"
This is a syscall-stop.
.LP
However, syscall-stops happen very often (twice per system call),
and performing
.B PTRACE_GETSIGINFO
for every syscall-stop may be somewhat expensive.
.LP
.\"
.\" FIXME referrent of "them" in next line ???
.\"       What does "them" refer to?
Some architectures allow the cases to be distinguished
by examining registers.
For example, on x86,
.I rax
==
.RB - ENOSYS
in syscall-enter-stop.
Since
.B SIGTRAP
(like any other signal) always happens
.I after
syscall-exit-stop,
and at this point
.I rax
almost never contains
.RB - ENOSYS ,
the
.B SIGTRAP
looks like "syscall-stop which is not syscall-enter-stop";
in other words, it looks like a
"stray syscall-exit-stop" and can be detected this way.
But such detection is fragile and is best avoided.
.LP
Using the
.B PTRACE_O_TRACESYSGOOD
.\"
.\" FIXME Below: "is the recommended method" for WHAT?
option is the recommended method to distinguish syscall-stops
from other kinds of ptrace-stops,
since it is reliable and does not incur a performance penalty.
.LP
Syscall-enter-stop and syscall-exit-stop are
indistinguishable from each other by the tracer.
The tracer needs to keep track of the sequence of
ptrace-stops in order to not misinterpret syscall-enter-stop as
syscall-exit-stop or vice versa.
The rule is that syscall-enter-stop is
always followed by syscall-exit-stop,
.B PTRACE_EVENT
stop or the tracee's death;
no other kinds of ptrace-stop can occur in between.
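.LP
One way to keep track of this sequence is sketched below.
The sketch rests on several assumptions:
the
.B PTRACE_O_TRACESYSGOOD
option is set, the tracee is always restarted with
.BR PTRACE_SYSCALL ,
and tracing began while the tracee was not inside a system call.
The type and function names are hypothetical.
.nf
```c
#include <signal.h>
#include <sys/wait.h>

enum syscall_stop {
    NOT_SYSCALL_STOP,
    SYSCALL_ENTER_STOP,
    SYSCALL_EXIT_STOP
};

/* Hypothetical per-tracee record kept by the tracer. */
struct syscall_tracker {
    int in_syscall;     /* nonzero between enter-stop and exit-stop */
};

/* Update the record from a wait status and name the stop seen. */
enum syscall_stop
note_wait_status(struct syscall_tracker *t, int status)
{
    if (!WIFSTOPPED(status)) {
        t->in_syscall = 0;              /* the tracee has died */
        return NOT_SYSCALL_STOP;
    }
    if (WSTOPSIG(status) != (SIGTRAP | 0x80))
        return NOT_SYSCALL_STOP;        /* e.g. a PTRACE_EVENT stop */

    t->in_syscall = !t->in_syscall;     /* enter and exit alternate */
    return t->in_syscall ? SYSCALL_ENTER_STOP : SYSCALL_EXIT_STOP;
}
```
.fi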
.LP
If after syscall-enter-stop,
the tracer uses a restarting command other than
.BR PTRACE_SYSCALL ,
syscall-exit-stop is not generated.
.LP
.B PTRACE_GETSIGINFO
on syscall-stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.B SIGTRAP
or
.IR (SIGTRAP|0x80) .
.SS PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops
.\"
.\" FIXME The following TODO is unresolved
.\"       Do you want to add anything?
.\"
(TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
PTRACE_SYSEMU_SINGLESTEP)
.SS Informational and restarting ptrace commands
Most ptrace commands (all except
.BR PTRACE_ATTACH ,
.BR PTRACE_TRACEME ,
and
.BR PTRACE_KILL )
require the tracee to be in a ptrace-stop, otherwise they fail with
.BR ESRCH .
.LP
When the tracee is in ptrace-stop,
the tracer can read and write data to
the tracee using informational commands.
These commands leave the tracee in ptrace-stopped state:
.LP
.nf
    ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
    ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
    ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
.fi
.LP
Note that some errors are not reported.
For example, setting signal information
.RI ( siginfo )
may have no effect in some ptrace-stops, yet the call may succeed
(return 0 and not set
.IR errno );
querying
.B PTRACE_GETEVENTMSG
may succeed and return some random value if the current ptrace-stop
is not documented as returning a meaningful event message.
.LP
The call

    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
    
affects one tracee.
The tracee's current flags are replaced.
Flags are inherited by new tracees created and "auto-attached" via active
.BR PTRACE_O_TRACEFORK ,
.BR PTRACE_O_TRACEVFORK ,
or
.BR PTRACE_O_TRACECLONE
options.
.LP
Another group of commands makes the ptrace-stopped tracee run.
They have the form:
.LP
    ptrace(PTRACE_cmd, pid, 0, sig);
.LP
where
.I cmd
is
.BR PTRACE_CONT ,
.BR PTRACE_DETACH ,
.BR PTRACE_SYSCALL ,
.BR PTRACE_SINGLESTEP ,
.BR PTRACE_SYSEMU ,
or
.BR PTRACE_SYSEMU_SINGLESTEP .
If the tracee is in signal-delivery-stop,
.I sig
is the signal to be injected (if it is nonzero).
Otherwise,
.I sig
may be ignored.
(Recommended practice is to always pass 0 in these cases.)
.SS Attaching and detaching
A thread can be attached to the tracer using the call

    ptrace(PTRACE_ATTACH, pid, 0, 0);

This also sends
.B SIGSTOP
to this thread.
If the tracer wants this
.B SIGSTOP
to have no effect, it needs to suppress it.
Note that if other signals are concurrently sent to
this thread during attach,
the tracer may see the tracee enter signal-delivery-stop
with other signal(s) first!
The usual practice is to reinject these signals until
.B SIGSTOP
is seen, then suppress
.B SIGSTOP
injection.
.\"
.\" FIXME I significantly rewrote the following sentence to try to make it
.\" clearer. Is the meaning still preserved?
The design bug here is that a ptrace attach and a concurrently delivered
.B SIGSTOP
may race and the concurrent
.B SIGSTOP
may be lost.
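.LP
The usual practice described above can be sketched as follows.
The function name is hypothetical, and the sketch assumes a freshly
attached tracee with no ptrace options set, so that every stop seen
here is taken to be a signal-delivery-stop;
it does not retry
.BR waitpid (2)
on
.B EINTR
and does not solve the race with a concurrent
.BR SIGSTOP .
.nf
```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Attach to pid and return 0 once its SIGSTOP has been observed.
 * The tracee is left in signal-delivery-stop; the caller suppresses
 * the SIGSTOP by restarting with sig == 0.  Returns -1 on failure.
 */
int
attach_and_wait(pid_t pid)
{
    int status;

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;

    for (;;) {
        if (waitpid(pid, &status, __WALL) != pid)
            return -1;
        if (!WIFSTOPPED(status))
            return -1;                  /* the tracee died */
        if (WSTOPSIG(status) == SIGSTOP)
            return 0;
        /* Another signal arrived first: reinject it. */
        if (ptrace(PTRACE_CONT, pid, NULL,
                   (void *)(long)WSTOPSIG(status)) == -1)
            return -1;
    }
}
```
.fi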
.\"
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"	   Do you want to add any text?
.\"
.\"      Describe how to attach to a thread which is already group-stopped.
.LP
Since attaching sends
.B SIGSTOP
and the tracer usually suppresses it, this may cause a stray
.B EINTR
return from the currently executing system call in the tracee,
as described in the "Signal injection and suppression" section.
.LP
The request

    ptrace(PTRACE_TRACEME, 0, 0, 0);

turns the calling thread into a tracee.
The thread continues to run (doesn't enter ptrace-stop).
A common practice is to follow the
.B PTRACE_TRACEME
with

    raise(SIGSTOP);

and allow the parent (which is our tracer now) to observe our
signal-delivery-stop.
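.LP
Put together, this common practice might look like the following sketch.
The function name is hypothetical, and the tracer here does nothing
except suppress the initial
.B SIGSTOP
and collect the exit status of the tracee;
a real tracer would set options and inspect the tracee at the first stop.
.nf
```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that asks to be traced, observe its
 * signal-delivery-stop for SIGSTOP, suppress that signal, and
 * return the exit status of the child (or -1 on failure).
 */
int
run_traced_child(int exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {                     /* child: the future tracee */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);                 /* let the tracer catch up */
        _exit(exit_code);
    }

    /* parent: the tracer */
    if (waitpid(pid, &status, __WALL) != pid)
        return -1;
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP)
        return -1;                      /* not the stop we expected */

    /* Restart with sig == 0, which suppresses the SIGSTOP. */
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
        return -1;

    if (waitpid(pid, &status, __WALL) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```
.fi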
.LP
If the
.BR PTRACE_O_TRACEVFORK ,
.BR PTRACE_O_TRACEFORK ,
or
.BR PTRACE_O_TRACECLONE
options are in effect, then children created by, respectively,
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
and other kinds of
.BR clone (2),
are automatically attached to the same tracer which traced their parent.
.B SIGSTOP
is delivered to the children, causing them to enter
signal-delivery-stop after they exit the system call which created them.
.LP
Detaching of the tracee is performed by:

    ptrace(PTRACE_DETACH, pid, 0, sig);

.B PTRACE_DETACH
is a restarting operation;
therefore it requires the tracee to be in ptrace-stop.
If the tracee is in signal-delivery-stop, a signal can be injected.
Otherwise, the
.I sig
parameter may be silently ignored.
.LP
If the tracee is running when the tracer wants to detach it,
the usual solution is to send
.B SIGSTOP
(using
.BR tgkill (2),
to make sure it goes to the correct thread),
wait for the tracee to stop in signal-delivery-stop for
.B SIGSTOP
and then detach it (suppressing
.B SIGSTOP
injection).
A design bug is that this can race with concurrent
.BR SIGSTOP s.
Another complication is that the tracee may enter other ptrace-stops
and needs to be restarted and waited for again, until
.B SIGSTOP
is seen.
Yet another complication is to be sure that
the tracee is not already ptrace-stopped,
because no signal delivery happens while it is\(emnot even
.BR SIGSTOP .
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       Do you want to add anything?
.\"
.\"     Describe how to detach from a group-stopped tracee so that it
.\"     doesn't run, but continues to wait for SIGCONT.
.\"
.LP
If the tracer dies, all tracees are automatically detached and restarted,
unless they were in group-stop.
Handling of restart from group-stop is
.\" FIXME: Define currently
currently buggy, but the
.\" FIXME: Planned for when?
"as planned" behavior is to leave tracee stopped and waiting for
.BR SIGCONT .
If the tracee is restarted from signal-delivery-stop,
the pending signal is injected.
.SS execve(2) under ptrace
.\" clone(2) THREAD_CLONE says:
.\"     If  any  of the threads in a thread group performs an execve(2),
.\"     then all threads other than the thread group leader are terminated,
.\"     and the new program is executed in the thread group leader.  
.\"
.\" FIXME mtk-addition:  please check: I added the following piece to
.\"       clarify that multithreaded here means clone()+CLONE_THREAD
.\"
When one thread in a multithreaded process
(i.e., a thread group consisting of threads created using the
.BR clone (2)
.B CLONE_THREAD
flag) calls
.\" FIXME end-mtk-addition
.\"
.BR execve (2),
the kernel destroys all other threads in the process,
.\" In kernel 3.1 sources, see fs/exec.c::de_thread()
and resets the thread ID of the execing thread to the
thread group ID (process ID).
.\"
.\" FIXME mtk-addition:  please check: I added the following piece:
(Or, to put things another way, when a multithreaded process does an
.BR execve (2),
the kernel makes it look as though the
.BR execve (2)
occurred in the thread group leader, regardless of which thread did the
.BR execve (2).)
.\" FIXME end-mtk-addition
.\"
This resetting of the thread ID looks very confusing to tracers:
.IP * 3
All other threads stop in
.\" FIXME: mtk: What is "PTRACE_EXIT stop"?
.\"        Should that be "PTRACE_EVENT_EXIT stop"?
.B PTRACE_EVENT_EXIT
stop,
.\" FIXME: mtk: In the next line, "by active ptrace option" is unclear.
.\"        What does it mean?
if the
.B PTRACE_O_TRACEEXIT
option is in effect.
Then all other threads except the thread group leader report
death as if they exited via
.BR _exit (2)
with exit code 0.
Then
.B PTRACE_EVENT_EXEC
.\" FIXME: mtk: In the next line, "by active ptrace option" is unclear
.\"        What does it mean?
stop happens, if the
.B PTRACE_O_TRACEEXEC
option is in effect.
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       (on which tracee - leader? execve-ing one?)
.\" 
.\" FIXME: Please check: at various places in the following,
.\"        I have changed "pid" to "[the tracee's] thead ID"
.\"        Is that okay?
.IP *
The execing tracee changes its thread ID while it is in the
.BR execve (2).
(Remember, under ptrace, the "pid" returned from
.BR waitpid (2),
or fed into ptrace calls, is the tracee's thread ID.)
That is, the tracee's thread ID is reset to be the same as its process ID,
which is the same as the thread group leader's thread ID.
.IP *
If the thread group leader has reported its death by this time,
it appears to the tracer that
the dead thread group leader "reappears from nowhere".
If the thread group leader was still alive,
for the tracer this may look as if the thread group leader
returns from a different system call than it entered,
or even "returned from a system call even though
it was not in any system call".
If the thread group leader was not traced
(or was traced by a different tracer), then during
.BR execve (2)
it will appear as if it has become a tracee of
the tracer of the execing tracee.
.LP
All of the above effects are the artifacts of
the thread ID change in the tracee.
.LP
The
.B PTRACE_O_TRACEEXEC
option is the recommended tool for dealing with this situation.
It enables
.B PTRACE_EVENT_EXEC
stop, which occurs before
.BR execve (2)
returns.
.\" FIXME Following on from the previous sentences,
.\"       can/should we add a few more words on how
.\"       PTRACE_EVENT_EXEC stop helps us deal with this situation?
.LP
The thread ID change happens before
.B PTRACE_EVENT_EXEC
stop, not after.
.LP
When the tracer receives
.B PTRACE_EVENT_EXEC
stop notification,
it is guaranteed that, except for this tracee and the thread group leader,
no other threads from the process are alive.
.LP
On receiving the
.B PTRACE_EVENT_EXEC
stop notification,
the tracer should clean up all its internal
data structures describing the threads of this process,
and retain only one data structure\(emone which
describes the single still running tracee, with

    thread ID == thread group ID == process id.
.LP
Currently, there is no way to retrieve the former
thread ID of the execing tracee.
If the tracer doesn't keep track of its tracees' thread group relations,
it may be unable to know which tracee execed and therefore no longer
exists under the old thread ID due to a thread ID change.
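.LP
The cleanup recommended above can be sketched as follows,
assuming the tracer records a thread group ID for each tracee.
The table layout and the function name are hypothetical;
a real tracer would also free any per-thread resources it holds.
.nf
```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-tracee record kept by the tracer. */
struct tracee {
    pid_t tid;      /* thread ID as reported by waitpid(2) */
    pid_t tgid;     /* thread group ID (process ID) */
};

/*
 * Called on PTRACE_EVENT_EXEC stop reported for thread ID tgid.
 * Drops every record of that thread group and, if any was dropped,
 * adds back a single record with tid == tgid.  Returns the new
 * number of records in tab.
 */
size_t
prune_after_exec(struct tracee *tab, size_t n, pid_t tgid)
{
    size_t i, kept = 0;

    for (i = 0; i < n; i++)
        if (tab[i].tgid != tgid)
            tab[kept++] = tab[i];

    if (kept < n) {             /* at least one slot was freed */
        tab[kept].tid = tgid;
        tab[kept].tgid = tgid;
        kept++;
    }
    return kept;
}
```
.fi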
.LP
Example: two threads call
.BR execve (2)
at the same time:
.LP
.nf
*** we get syscall-enter-stop in thread 1: ***
PID1 execve("/bin/foo", "foo" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 1 ***
*** we get syscall-enter-stop in thread 2: ***
PID2 execve("/bin/bar", "bar" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 2 ***
*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL ***
*** we get syscall-exit-stop for PID0: ***
PID0 <... execve resumed> )             = 0
.fi
.LP
In this situation, there is no way to know which
.BR execve (2)
succeeded.
.LP
If the
.B PTRACE_O_TRACEEXEC
option is
.I not
in effect for the execing tracee, the kernel delivers an extra
.B SIGTRAP
to the tracee after
.BR execve (2)
returns.
This is an ordinary signal (similar to one which can be
generated by
.IR "kill -TRAP" ),
not a special kind of ptrace-stop.
Employing
.B PTRACE_GETSIGINFO
for this signal returns
.I si_code
set to 0
.RI ( SI_USER ).
This signal may be blocked by signal mask,
and thus may be delivered (much) later.
.LP
Usually, the tracer (for example,
.BR strace (1))
would not want to show this extra post-execve
.B SIGTRAP
signal to the user, and would suppress its delivery to the tracee (if
.B SIGTRAP
is set to
.BR SIG_DFL ,
it is a killing signal).
However, determining 
.I which
.B SIGTRAP
to suppress is not easy.
Setting the
.B PTRACE_O_TRACEEXEC
option and thus suppressing this extra
.B SIGTRAP
is the recommended approach.
.SS Real parent
The ptrace API (ab)uses the standard UNIX parent/child signaling over
.BR waitpid (2).
This used to cause the real parent of the process to stop receiving
several kinds of
.BR waitpid (2)
notifications when the child process is traced by some other process.
.LP
Many of these bugs have been fixed, but as of Linux 2.6.38 several still
exist; see BUGS below.
.LP
As of Linux 2.6.38, the following is believed to work correctly:
.IP * 3
exit/death by signal is reported first to the tracer, then, when the tracer
consumes the
.BR waitpid (2)
result, to the real parent (to the real parent only when the
whole multithreaded process exits).
.\"
.\" FIXME mtk: Please check: In the next line, 
.\" I changed "they" to "the tracer and the real parent". Okay?
If the tracer and the real parent are the same process,
the report is sent only once.
.SH "RETURN VALUE"
On success,
.B PTRACE_PEEK*
requests return the requested data,
while other requests return zero.
On error, all requests return \-1, and
.I errno
is set appropriately.
Since the value returned by a successful
.B PTRACE_PEEK*
request may be \-1, the caller must clear
.I errno
before the call, and then check it afterward
to determine whether or not an error occurred.
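.LP
A wrapper along the following lines keeps that check in one place.
It is a sketch only: the function name and its return convention
are hypothetical, and
.B PTRACE_PEEKDATA
is used as one example of a
.B PTRACE_PEEK*
request.
.nf
```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/*
 * Read one word from the tracee.  Returns 0 and stores the word in
 * *out on success; returns -1 with errno set on failure.
 */
int
peek_word(pid_t pid, void *addr, long *out)
{
    long word;

    errno = 0;                  /* clear before the call */
    word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (word == -1 && errno != 0)
        return -1;              /* a genuine error */
    *out = word;                /* may legitimately be -1 */
    return 0;
}
```
.fi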
.SH ERRORS
.TP
.B EBUSY
(i386 only) There was an error with allocating or freeing a debug register.
.TP
.B EFAULT
There was an attempt to read from or write to an invalid area in
the tracer's or the tracee's memory,
probably because the area wasn't mapped or accessible.
Unfortunately, under Linux, different variations of this fault
will return
.B EIO
or
.B EFAULT
more or less arbitrarily.
.TP
.B EINVAL
An attempt was made to set an invalid option.
.TP
.B EIO
.I request
is invalid, or an attempt was made to read from or
write to an invalid area in the tracer's or the tracee's memory,
or there was a word-alignment violation,
or an invalid signal was specified during a restart request.
.TP
.B EPERM
The specified process cannot be traced.
This could be because the
tracer has insufficient privileges (the required capability is
.BR CAP_SYS_PTRACE );
unprivileged processes cannot trace processes that they
cannot send signals to or those running
set-user-ID/set-group-ID programs, for obvious reasons.
.\" 
.\" FIXME I reworked the mention of init here to note
.\" when the behavior changed for tracing init(8). Okay?
Alternatively, the process may already be being traced,
or (on kernels before 2.6.26) be
.BR init (8)
(PID 1).
.TP
.B ESRCH
The specified process does not exist, or is not currently being traced
by the caller, or is not stopped
(for requests that require a stopped tracee).
.SH "CONFORMING TO"
SVr4, 4.3BSD.
.SH NOTES
Although arguments to
.BR ptrace ()
are interpreted according to the prototype given,
glibc currently declares
.BR ptrace ()
as a variadic function with only the
.I request
argument fixed.
This means that unneeded trailing arguments may be omitted,
though doing so makes use of undocumented
.BR gcc (1)
behavior.
.\" FIXME Please review. I reinstated the following, noting the
.\" kernel version number where it ceased to be true
.LP
In Linux kernels before 2.6.26,
.\" See commit 00cd5c37afd5f431ac186dd131705048c0a11fdb
.BR init (8),
the process with PID 1, may not be traced.
.LP
The layout of the contents of memory and the USER area are
quite operating-system- and architecture-specific.
The offset supplied, and the data returned,
might not entirely match with the definition of
.IR "struct user" .
.\" See http://lkml.org/lkml/2008/5/8/375
.LP
The size of a "word" is determined by the operating-system variant
(e.g., for 32-bit Linux it is 32 bits, etc.).
.\" FIXME So, can we just remove the following text?
.\"
.\" Covered in more details above: (removed by dv)
.\" .LP
.\" Tracing causes a few subtle differences in the semantics of
.\" traced processes.
.\" For example, if a process is attached to with
.\" .BR PTRACE_ATTACH ,
.\" its original parent can no longer receive notification via
.\" .BR waitpid (2)
.\" when it stops, and there is no way for the new parent to
.\" effectively simulate this notification.
.\" .LP
.\" When the parent receives an event with
.\" .B PTRACE_EVENT_*
.\" set,
.\" the tracee is not in the normal signal delivery path.
.\" This means the parent cannot do
.\" .BR ptrace (PTRACE_CONT)
.\" with a signal or
.\" .BR ptrace (PTRACE_KILL).
.\" .BR kill (2)
.\" with a
.\" .B SIGKILL
.\" signal can be used instead to kill the tracee
.\" after receiving one of these messages.
.\" .LP
This page documents the way the
.BR ptrace ()
call works currently in Linux.
Its behavior differs noticeably on other flavors of UNIX.
In any case, use of
.BR ptrace ()
is highly specific to the operating system and architecture.
.SH BUGS
On hosts with 2.6 kernel headers,
.B PTRACE_SETOPTIONS
is declared with a different value than the one for 2.4.
This leads to applications compiled with 2.6 kernel
headers failing when run on 2.4 kernels.
This can be worked around by redefining
.B PTRACE_SETOPTIONS
to
.BR PTRACE_OLDSETOPTIONS ,
if that is defined.
.LP
Group-stop notifications are sent to the tracer, but not to the real parent.
Last confirmed on 2.6.38.6.
.LP
.\" 
.\" FIXME Does "exits" in the following mean
.\" just "_exit(2)" or both "_exit(2) and exit_group(2)"?
If a thread group leader is traced and exits by calling
.BR _exit (2),
a
.B PTRACE_EVENT_EXIT
stop will happen for it (if requested), but the subsequent
.B WIFEXITED
notification will not be delivered until all other threads exit.
As explained above, if one of the other threads calls
.BR execve (2),
the death of the thread group leader will
.I never
be reported.
If the execed thread is not traced by this tracer,
the tracer will never know that
.BR execve (2)
happened.
One possible workaround is to
.B PTRACE_DETACH
the thread group leader instead of restarting it in this case.
Last confirmed on 2.6.38.6.
.\"        ^^^ need to test/verify this scenario
.\" FIXME: mtk: the preceding comment seems to be unresolved?
.\"        Do you want to add anything?
.LP
A
.B SIGKILL
signal may still cause a
.B PTRACE_EVENT_EXIT
stop before actual signal death.
This may be changed in the future;
.B SIGKILL
is meant to always immediately kill tasks even under ptrace.
Last confirmed on 2.6.38.6.
.SH "SEE ALSO"
.BR gdb (1),
.BR strace (1),
.BR clone (2),
.BR execve (2),
.BR fork (2),
.BR gettid (2),
.BR sigaction (2),
.BR tgkill (2),
.BR vfork (2),
.BR waitpid (2),
.BR exec (3),
.BR capabilities (7),
.BR signal (7)


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-09-29 19:08 ` Michael Kerrisk
@ 2011-09-30 14:14   ` Denys Vlasenko
  2011-10-03  5:27     ` Michael Kerrisk
  2011-09-30 14:28   ` Denys Vlasenko
  1 sibling, 1 reply; 18+ messages in thread
From: Denys Vlasenko @ 2011-09-30 14:14 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Chuck Ebbert, Blaisorblade,
	Daniel Jacobowitz

On Thu, Sep 29, 2011 at 9:08 PM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
> [CC+=linux-man + a few other possibly interested individuals]
>
> Hello Denys, (Oleg, Tejun),
>
> On Thu, Jul 21, 2011 at 1:09 PM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
>> Hi Michael,
>>
>> Please apply attached patch which updates ptrace manpage.
>> (I'm not sending it inline, google web mail might mangle it. Sorry).
>
> Thanks once again for this nice piece of work. Some comments, and a
> revised page below.
>
>> Changes include:
>>
>> s/parent/tracer/g, s/child/tracee/g - ptrace interface now
>> is sufficiently cleaned up to not treat tracing process as parent.
>
> Thanks!
>
>> Deleted several outright false statements:
>> - pid 1 can be traced
>
> It looks to me as though this was once true, and I amended the page
> accordingly. (Man-pages documents not just current behavior, but
> historical behavior too.)
>
>> - tracer is not shown as parent in ps output
>
> Was this true at one time? If yes, then we should document past and
> current behavior, and note when the change occurred.

It stopped being true VERY long ago (think Linux 2.2 or even before).


>> - SIGSTOP _can_ be injected.
>
> Was this true at one time? If yes, then we should document past and
> current behavior, and note when the change occurred.
>
> In the Linux 2.4 sources, I see the following in
> arch/i386/kernel/signal.c::do_signal():
>
>                        /* The debugger continued.  Ignore SIGSTOP.  */
>                        if (signr == SIGSTOP)
>                                continue;
>
> Did that code prevent SIGSTOP being injected in the 2.4 series?

Looks like it is indeed the code.


> Rather than you writing a new patch to this version of the page, I
> think it might be easiest if you just replied to the FIXMEs inline
> below, then I can revise the page in the light of your comments.

Ok --


.\" FIXME Please check. In the following paragraphs, I substituted language
.\" such as:
.\"     Stop tracee at next fork(2) call with SIGTRAP|PTRACE_EVENT_FORK<<8
.\" with:
.\"     Stop tracee at next fork(2) call... A subsequent PTRACE_GETSIGINFO
.\"     on the stopped tracee will return a siginfo_t structure with si_code
.\"     set to SIGTRAP|PTRACE_EVENT_FORK<<8.
.\"
.\" Is this change correct?

No, it is not correct.
SIGTRAP|PTRACE_EVENT_FORK<<8 value is returned in waitpid status word.
See "PTRACE_EVENT stops" section.

No need to do PTRACE_GETSIGINFO.
Remember, requiring PTRACE_GETSIGINFO on every ptrace stop
is a performance hit.



.B PTRACE_ATTACH
Attach to the process specified in
.IR pid ,
making it a tracee of the calling process.
.\" FIXME So, was the following EVER true? IF it was,
.\"       we should reinstate the text and add mention of
.\"       the kernel version where the behaviour changed.
.\"
.\" Not true: (removed by dv)
.\" ; the behavior of the tracee is as if it had done a
.\" .BR PTRACE_TRACEME .
.\" The calling process actually becomes the parent of the tracee
.\" process for most purposes (e.g., it will receive
.\" notification of tracee events and appears in
.\" .BR ps (1)
.\" output as the tracee's parent), but a
.\" .BR getppid (2)
.\" by the tracee will still return the PID of the original parent.

I think it isn't true in non-ancient 2.4 and in 2.6/3.x.
Basically, it's not true for any Linux in practical use.



The tracer can't assume that the tracee
.I always
ends its life by reporting
.I WIFEXITED(status)
or
.IR WIFSIGNALED(status) .
.LP
.\"     or can it? Do we include such a promise into ptrace API?
.\"
.\" FIXME: The preceding comment seems to be unresolved?
.\"        Do you want to add anything?
.\"

I know of at least one case where a tracee disappears without ever
reporting WIFEXITED(status) or WIFSIGNALED(status): if a thread other
than the thread group leader execs, it disappears. Its pid will never
be seen again, and any subsequent ptrace stops will be reported under
the thread group leader's pid.

Maybe this example (and all other examples we will discover later)
should be explicitly given here.



.\" FIXME: mtk: the following comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.\"     Do we require __WALL usage, or will just using 0 be ok? Are the
.\"     rules different if user wants to use waitid? Will waitid require
.\"     WEXITED?
.\"

Still not resolved. For now, I can only say that with __WALL, it works.
With 0, I am not 100% sure there aren't ugly corner cases.
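
For illustration, a tracer wait loop that sticks to the variant known to
work (__WALL) might look like the sketch below. The helper name
wait_for_tracee is made up for this example, and it assumes Linux with
glibc, where __WALL is visible through <sys/wait.h>.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: wait for any child or tracee, clone()d threads
 * included. Returns the pid whose state changed, or -1 with errno set
 * (ECHILD once there is nothing left to wait for).
 */
pid_t wait_for_tracee(int *status)
{
    pid_t pid;

    do {
        /* __WALL: report all children regardless of their exit signal. */
        pid = waitpid(-1, status, __WALL);
    } while (pid == -1 && errno == EINTR);   /* retry if interrupted */

    return pid;
}
```

Whether passing 0 instead of __WALL is equally safe is exactly the open
question above, so the sketch does not try to answer it.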



.LP
.\" FIXME: Is the following comment "__WALL... implies" true?
The
.B __WALL
flag does not include the
.B WSTOPPED
and
.B WEXITED
flags, but implies their functionality.

Yes, it's true.



They may be differentiated by examining the value
.IR status>>8 ,
and if there is ambiguity in that value, by querying
.BR PTRACE_GETSIGINFO .
.\"
.\" FIXME What is the purpose of the following sentence? Is it to warn
.\"       the reader not to use WSTOPSIG()? If so, we should make that
.\"       point more explicitly.
(Note: the
.I WSTOPSIG(status)
macro returns the value
.IR "(status>>8)\ &\ 0xff)" .)

Yes. The purpose of this text is to say that WSTOPSIG() can't be used
to check for PTRACE_EVENT stops. This won't work:

    if (WSTOPSIG(status) == (SIGTRAP | (PTRACE_EVENT_foo << 8))) ...

There are no macros for this; one needs to open-code it:

    unsigned sig_and_event = status >> 8;
    if (sig_and_event == (SIGTRAP | (PTRACE_EVENT_foo << 8))) ...
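
A minimal sketch of that open-coded check wrapped in a reusable helper,
assuming Linux with glibc headers. The name ptrace_event_of is
hypothetical; it returns the PTRACE_EVENT_* code carried in a waitpid
status word, or 0 if the stop is not a PTRACE_EVENT stop.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: extract the PTRACE_EVENT_* code from a waitpid
 * status word. Returns 0 for anything that is not a PTRACE_EVENT stop.
 */
int ptrace_event_of(int status)
{
    unsigned sig_and_event;

    if (!WIFSTOPPED(status))
        return 0;

    /* Bits 8..15 hold the stop signal, bits 16..23 the event code. */
    sig_and_event = (unsigned)status >> 8;
    if ((sig_and_event & 0xff) != SIGTRAP)
        return 0;

    return (int)(sig_and_event >> 8);
}
```

A tracer could then test ptrace_event_of(status) == PTRACE_EVENT_FORK
right after waitpid returns, without a PTRACE_GETSIGINFO round trip.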



If the tracer doesn't suppress the signal,
.\"
.\" FIXME: I added the word "restart" to the following line. Okay?
it passes the signal to the tracee in the next ptrace restart request.

Yes, it's ok.



.\"
.\" FIXME: the referent of "This" in the next line is not clear.
.\"        What does "This" refer to?
This is a cause of confusion among ptrace users.
One typical scenario is that the tracer observes group-stop,
mistakes it for signal-delivery-stop, restarts the tracee with
    ptrace(PTRACE_rest, pid, 0, stopsig)
with the intention of injecting
.IR stopsig ,
but
.I stopsig
gets ignored and the tracee continues to run.

"This" refers to the ptrace behavior of ignoring the 'sig' argument
of restarting ptrace commands when the ptrace-stop is not a
signal-delivery-stop. The confusion even reached the ptrace manpage:
it used to claim that SIGSTOP can't be injected because people were
trying to inject it in the wrong ptrace-stop, which of course doesn't work.



As of kernel 2.6.38,
after the tracer sees the tracee ptrace-stop and until it
restarts or kills it, the tracee will not run,
and will not send notifications (except
.B SIGKILL
death) to the tracer, even if the tracer enters into another
.BR waitpid (2)
call.
.LP
.\"
.\" FIXME ??? referent of "it" in the next line is unclear
.\"        What does "it" refer to?
Currently, it causes a problem with transparent handling of stopping
signals: if the tracer restarts the tracee after group-stop,
.B SIGSTOP
is effectively ignored: the tracee doesn't remain stopped, it runs.
If the tracer doesn't restart the tracee before entering into the next
.BR waitpid (2),
future
.B SIGCONT
signals will not be reported to the tracer.
This would cause
.B SIGCONT
to have no effect.

"it" refers to ptrace behavior versus group-stops and SIGCONT,
as described. Feel free to rephrase.



Syscall-stops can be distinguished from signal-delivery-stop with
.B SIGTRAP
by querying
.BR PTRACE_GETSIGINFO
for the following cases:
.TP
.IR si_code " <= 0"
.B SIGTRAP
.\" FIXME: Please confirm this is okay: I changed
.\"        "the usual suspects" to "by a system call". Okay?
.\"        Shouldn't we also add kill(2) here?
was sent by a system call
.RB ( tgkill (2),
.BR sigqueue (3),
etc.)

No, it is not ok. Please consult the sigaction(2) manpage and
/usr/include/bits/siginfo.h.
For example, si_code == SI_TIMER (-2) can be sent by timer
expiration, which is not a system call. There are many other signal
sources which are not syscalls.



However, syscall-stops happen very often (twice per system call),
and performing
.B PTRACE_GETSIGINFO
for every syscall-stop may be somewhat expensive.
.LP
.\"
.\" FIXME referent of "them" in next line ???
.\"       What does "them" refer to?
Some architectures allow the cases to be distinguished
by examining registers.
For example, on x86,
.I rax
==
.RB - ENOSYS
in syscall-enter-stop.

I don't see the word "them" anywhere in that line...



.\"
.\" FIXME I significantly rewrote the following sentence to try to make it
.\" clearer. Is the meaning still preserved?
The design bug here is that a ptrace attach and a concurrently delivered
.B SIGSTOP
may race and the concurrent
.B SIGSTOP
may be lost.

Yes, it looks ok.



.SS execve(2) under ptrace
.\" clone(2) THREAD_CLONE says:
.\"     If  any  of the threads in a thread group performs an execve(2),
.\"     then all threads other than the thread group leader are terminated,
.\"     and the new program is executed in the thread group leader...
.\"
.\" FIXME mtk-addition:  please check: I added the following piece to
.\"       clarify that multithreaded here means clone()+CLONE_THREAD
.\"
When one thread in a multithreaded process
(i.e., a thread group consisting of threads created using the
.BR clone (2)
.B CLONE_THREAD
flag) calls
.\" FIXME end-mtk-addition
.\"

I think this addition is not necessary. If someone reached this point
reading the documentation and he still doesn't understand what is meant
by 'multithreaded' in this context, no amount of clarification will help
that person...



.\"
.\" FIXME mtk-addition:  please check: I added the following piece:
(Or, to put things another way, when a multithreaded process does an
.BR execve (2),
the kernel makes it look as though the
.BR execve (2)
occurred in the thread group leader, regardless of which thread did the
.BR execve (2).)
.\" FIXME end-mtk-addition
.\"

This is not exactly true. If the tracer is tracking syscall entry/exit
(e.g. strace),
the entry into execve will be seen happening in the execing thread,
but the corresponding syscall exit will happen in the thread group leader.
This doesn't look exactly like "execve occurred in the thread group leader"!



All other threads stop in
.\" FIXME: mtk: What is "PTRACE_EXIT stop"?
.\"        Should that be "PTRACE_EVENT_EXIT stop"?
.B PTRACE_EXIT
stop,

Correct, it was meant to be PTRACE_EVENT_EXIT.



.\" FIXME: mtk: In the next line, "by active ptrace option" is unclear.
.\"        What does it mean?
if requested by active ptrace option.

It means "if the PTRACE_O_TRACEEXIT option was turned on".



.\"
.\" FIXME Does "exits" in the following mean
.\" just "_exit(2)" or both "_exit(2) and exit_group(2)"?
If a thread group leader is traced and exits by calling
.BR _exit (2),
a
.B PTRACE_EVENT_EXIT
stop will happen for it (if requested), but the subsequent
.B WIFEXITED
notification will not be delivered until all other threads exit.

Here "exits" means any kind of death - _exit, exit_group,
signal death. Signal death and exit_group cases are trivial,
though: since signal death and exit_group kill all other threads
too, "until all other threads exit" thing happens rather soon
in these cases. Therefore, only _exit presents observably
puzzling behavior to ptrace users: thread leader _exit's,
but WIFEXITED isn't reported! We are trying to explain here
why it is so.



(Note: the
.I WSTOPSIG(status)
macro returns the value
.IR "(status>>8)\ &\ 0xff)" .)

Unpaired parentheses in the fragment above.
I suggest:

.IR "((status>>8)\ &\ 0xff)" .)

-- 
vda


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-09-29 19:08 ` Michael Kerrisk
  2011-09-30 14:14   ` Denys Vlasenko
@ 2011-09-30 14:28   ` Denys Vlasenko
  2011-10-03  5:35     ` Michael Kerrisk
  1 sibling, 1 reply; 18+ messages in thread
From: Denys Vlasenko @ 2011-09-30 14:28 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Chuck Ebbert, Blaisorblade,
	Daniel Jacobowitz

On Thu, Sep 29, 2011 at 9:08 PM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
> So, I took your patch, and then did a global edit of the page to fix
> various pieces (in the existing text, as well as do some language
> clean-ups for the new text). In the process, I found a number of
> pieces that are still unclear (some in the old text, some in your new
> text). I also made some changes to your text that I'd like you to
> check. I've marked each of these with FIXME below. Could you please
> take a look at the FIXMEs, and write me a comment for each of these.
> (I appreciate that in some cases, especially for the existing text,
> you may not have a handy answer Denys, but if you (and others) can
> give any help, that would be great.)
>
> Rather than you writing a new patch to this version of the page, I
> think it might be easiest if you just replied to the FIXMEs inline
> below, then I can revise the page in the light of your comments.

       Another group of commands makes the ptrace-stopped  tracee  run.   They
       have the form:

           ptrace(PTRACE_cmd, pid, 0, sig);

       where cmd is PTRACE_CONT, PTRACE_DETACH, PTRACE_SYSCALL,
       PTRACE_SINGLESTEP, PTRACE_SYSEMU, or PTRACE_SYSEMU_SINGLESTEP.

Cosmetics: cmd is, of course, CONT, DETACH,..., not PTRACE_CONT,
PTRACE_DETACH...


       If the tracee  is
       in  signal-delivery-stop,  sig  is  the signal to be injected (if it is
       nonzero).  Otherwise, sig may be ignored.  (Recommended practice is  to
       always pass 0 in these cases.)

Looks like (my) text in the last sentence is confusing. I meant:
"If you are restarting the tracee from a ptrace-stop other than
signal-delivery-stop, recommended practice is to always pass
sig == 0".
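
As a sketch of that practice, assuming the tracer has already worked out
(by whatever means) whether the current stop is a signal-delivery-stop,
the restart could be written as below. The helper names are hypothetical,
and PTRACE_SYSCALL stands in for any of the restarting commands.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Pick the 'sig' argument for a restart request: forward the stop
 * signal only in signal-delivery-stop, pass 0 in every other stop.
 */
int restart_sig(int status, int in_signal_delivery_stop)
{
    return in_signal_delivery_stop ? WSTOPSIG(status) : 0;
}

/* Hypothetical helper: resume the tracee until its next ptrace-stop. */
long resume_tracee(pid_t pid, int status, int in_signal_delivery_stop)
{
    int sig = restart_sig(status, in_signal_delivery_stop);

    /* glibc declares ptrace() as variadic, so pass pointer-sized args. */
    return ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)sig);
}
```

A tracer that wants to suppress a signal instead of forwarding it would
pass 0 in signal-delivery-stop as well.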

-- 
vda


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-09-30 14:14   ` Denys Vlasenko
@ 2011-10-03  5:27     ` Michael Kerrisk
  2012-02-13 22:02       ` Denys Vlasenko
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk @ 2011-10-03  5:27 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Blaisorblade, Daniel Jacobowitz

Hi Denys,

Thanks for the detailed responses. Some comments to your remarks
below, and a couple of open questions (marked "????"). If you send me
the answers, then I can get another draft for review.

On Fri, Sep 30, 2011 at 4:14 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Thu, Sep 29, 2011 at 9:08 PM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
>> [CC+=linux-man + a few other possibly interested individuals]
>>
>> Hello Denys, (Oleg, Tejun),
>>
>> On Thu, Jul 21, 2011 at 1:09 PM, Denys Vlasenko
>> <vda.linux@googlemail.com> wrote:
>>> Hi Michael,
>>>
>>> Please apply attached patch which updates ptrace manpage.
>>> (I'm not sending it inline, google web mail might mangle it. Sorry).
>>
>> Thanks once again for this nice piece of work. Some comments, and a
>> revised page below.
>>
>>> Changes include:
>>>
>>> s/parent/tracer/g, s/child/tracee/g - ptrace interface now
>>> is sufficiently cleaned up to not treat tracing process as parent.
>>
>> Thanks!
>>
>>> Deleted several outright false statements:
>>> - pid 1 can be traced
>>
>> It looks to me as though this was once true, and I amended the page
>> accordingly. (Man-pages documents not just current behavior, but
>> historical behavior too.)
>>
>>> - tracer is not shown as parent in ps output
>>
>> Was this true at one time? If yes, then we should document past and
>> current behavior, and note when the change occurred.
>
> It stopped being true VERY long ago (think Linux 2.2 or even before).

Okay -- then I think we can drop it. (I'd ideally have wanted a note
somewhere describing the ancient behavior, and when it changed. But
when it's that long ago, it may be too much trouble to make the
details accurate.)


>>> - SIGSTOP _can_ be injected.
>>
>> Was this true at one time? If yes, then we should document past and
>> current behavior, and note when the change occurred.
>>
>> In the Linux 2.4 sources, I see the following in
>> arch/i386/kernel/signal.c::do_signal():
>>
>>                        /* The debugger continued.  Ignore SIGSTOP.  */
>>                        if (signr == SIGSTOP)
>>                                continue;
>>
>> Did that code prevent SIGSTOP being injected in the 2.4 series?
>
> Looks like it is indeed the code.

????
Sorry -- I'm not quite clear there. You're confirming that SIGSTOP
could not be injected in 2.4, right?


>> Rather than you writing a new patch to this version of the page, I
>> think it might be easiest if you just replied to the FIXMEs inline
>> below, then I can revise the page in the light of your comments.
>
> Ok --
>
>
> .\" FIXME Please check. In the following paragraphs, I substituted language
> .\" such as:
> .\"     Stop tracee at next fork(2) call with SIGTRAP|PTRACE_EVENT_FORK<<8
> .\" with:
> .\"     Stop tracee at next fork(2) call... A subsequent PTRACE_GETSIGINFO
> .\"     on the stopped tracee will return a siginfo_t structure with si_code
> .\"     set to SIGTRAP|PTRACE_EVENT_FORK<<8.
> .\"
> .\" Is this change correct?
>
> No, it is not correct.
> SIGTRAP|PTRACE_EVENT_FORK<<8 value is returned in waitpid status word.
> See "PTRACE_EVENT stops" section.
>
> No need to do PTRACE_GETSIGINFO.
> Remember, requiring PTRACE_GETSIGINFO on every ptrace stop
> is a performance hit.

Thanks. So I'll change that sentence (and the others):

A subsequent PTRACE_GETSIGINFO on the stopped tracee will return a
siginfo_t structure with si_code set to SIGTRAP|PTRACE_EVENT_FORK<<8.

to:

A waitpid() by the tracer will return SIGTRAP|PTRACE_EVENT_FORK<<8 as
the status of the tracee.

> .B PTRACE_ATTACH
> Attach to the process specified in
> .IR pid ,
> making it a tracee of the calling process.
> .\" FIXME So, was the following EVER true? IF it was,
> .\"       we should reinstate the text and add mention of
> .\"       the kernel version where the behaviour changed.
> .\"
> .\" Not true: (removed by dv)
> .\" ; the behavior of the tracee is as if it had done a
> .\" .BR PTRACE_TRACEME .
> .\" The calling process actually becomes the parent of the tracee
> .\" process for most purposes (e.g., it will receive
> .\" notification of tracee events and appears in
> .\" .BR ps (1)
> .\" output as the tracee's parent), but a
> .\" .BR getppid (2)
> .\" by the tracee will still return the PID of the original parent.
>
> I think it isn't true in non-ancient 2.4 and in 2.6/3.x.
> Basically, it's not true for any Linux in practical use.

Okay.

> The tracer can't assume that the tracee
> .I always
> ends its life by reporting
> .I WIFEXITED(status)
> or
> .IR WIFSIGNALED(status) .
> .LP
> .\"     or can it? Do we include such a promise into ptrace API?
> .\"
> .\" FIXME: The preceding comment seems to be unresolved?
> .\"        Do you want to add anything?
> .\"
>
> I know at least one case when tracee disappears without ever reporting
> WIFEXITED(status) or WIFSIGNALED(status): if thread other than
> thread group leader execs, it disappears - its pid will never be seen
> again, any subsequent ptrace stops will be reported under thread group
> leader's pid.
>
> Maybe this example (and all other examples we will discover later)
> should be explicitly given here.

Okay. I've added that text.

The tracer can't assume that the tracee
.I always
ends its life by reporting
.I WIFEXITED(status)
or
.IR WIFSIGNALED(status) ;
there are cases where this does not occur.
For example, if a thread other than the thread group leader does an
.BR execve (2),
it disappears;
its PID will never be seen again,
and any subsequent ptrace stops will be reported under
the thread group leader's PID.


> .\" FIXME: mtk: the following comment seems to be unresolved?
> .\"        Do you want to add anything?
> .\"
> .\"     Do we require __WALL usage, or will just using 0 be ok? Are the
> .\"     rules different if user wants to use waitid? Will waitid require
> .\"     WEXITED?
> .\"
>
> Still not resolved. For now, I can only say that with __WALL, it works.
> With 0, I am not 100% sure there aren't ugly corner cases.

Okay.


> .LP
> .\" FIXME: Is the following comment "__WALL... implies" true?
> The
> .B __WALL
> flag does not include the
> .B WSTOPPED
> and
> .B WEXITED
> flags, but implies their functionality.
>
> Yes, it's true.

Okay.


> They may be differentiated by examining the value
> .IR status>>8 ,
> and if there is ambiguity in that value, by querying
> .BR PTRACE_GETSIGINFO .
> .\"
> .\" FIXME What is the purpose of the following sentence? Is it to warn
> .\"       the reader not to use WSTOPSIG()? If so, we should make that
> .\"       point more explicitly.
> (Note: the
> .I WSTOPSIG(status)
> macro returns the value
> .IR "(status>>8)\ &\ 0xff)" .)
>
> Yes. The purpose of this text is to say that WSTOPSIG() can't be used
> to check for PTRACE_EVENT stops. This won't work:
>
>    if (WSTOPSIG(status) == (SIGTRAP | (PTRACE_EVENT_foo << 8))) ...
>
> There are no macros for this, one needs to open-code it:
>
>    unsigned sig_and_event = status >> 8;
>    if (sig_and_event == (SIGTRAP | (PTRACE_EVENT_foo << 8))) ...

Thanks. I added some words to the page to make this clear to the reader.


> If the tracer doesn't suppress the signal,
> .\"
> .\" FIXME: I added the word "restart" to the following line. Okay?
> it passes the signal to the tracee in the next ptrace restart request.
>
> Yes, it's ok.

Thanks.


> .\"
> .\" FIXME: the referrent of "This" in the next line is not clear.
> .\"        What does "This" refer to?
> This is a cause of confusion among ptrace users.
> One typical scenario is that the tracer observes group-stop,
> mistakes it for signal-delivery-stop, restarts the tracee with
>    ptrace(PTRACE_rest, pid, 0, stopsig)
> with the intention of injecting
> .IR stopsig ,
> but
> .I stopsig
> gets ignored and the tracee continues to run.
>
> "This" refers to the ptrace behavior of ignoring 'sig' argument
> on restarting ptrace commands if ptrace-stop is not a
> signal-delivery-stop. The confusion even reached ptrace manpage.
> The reason manpage used to claim that SIGSTOP
> can't be injected is because people were trying to inject it
> in the wrong ptrace-stop, which of course doesn't work.

Okay. I replaced "This" with some of your words above.


> As of kernel 2.6.38,
> after the tracer sees the tracee ptrace-stop and until it
> restarts or kills it, the tracee will not run,
> and will not send notifications (except
> .B SIGKILL
> death) to the tracer, even if the tracer enters into another
> .BR waitpid (2)
> call.
> .LP
> .\"
> .\" FIXME ??? referrent of "it" in the next line is unclear
> .\"        What does "it" refer to?
> Currently, it causes a problem with transparent handling of stopping
> signals: if the tracer restarts the tracee after group-stop,
> .B SIGSTOP
> is effectively ignored: the tracee doesn't remain stopped, it runs.
> If the tracer doesn't restart the tracee before entering into the next
> .BR waitpid (2),
> future
> .B SIGCONT
> signals will not be reported to the tracer.
> This would cause
> .B SIGCONT
> to have no effect.
>
> "it" refers to ptrace behavior versus group-stops and SIGCONT,
> as described. Feel free to rephrase.

????
Help! I'm still having problems here. The problem may possibly be
this: when one uses a pronoun like "it" in English, it's normally a
back reference to some text already given. Is this "it" a back
reference (In that case, could you please send me a rewritten version
of the sentence that replaces "it" by some descriptive text), or is it
a reference to the current paragraph (in other words, should this
paragraph rather start with the words "Currently, there is a problem
with...")?


> Syscall-stops can be distinguished from signal-delivery-stop with
> .B SIGTRAP
> by querying
> .BR PTRACE_GETSIGINFO
> for the following cases:
> .TP
> .IR si_code " <= 0"
> .B SIGTRAP
> .\" FIXME: Please confirm this is okay: I changed
> .\"        "the usual suspects" to "by a system call". Okay?
> .\"        Shouldn't we also add kill(2) here?
> was sent by a system call
> .RB ( tgkill (2),
> .BR sigqueue (3),
> etc.)
>
> No, it is not ok. Please consult sigaction(2) manpage and
> /usr/include/bits/siginfo.h
> For example, si_code == SI_TIMER (-2) can be sent by timer
> expiration, which is not a system call. There are many other signal
> sources which are not syscalls.

Okay. So how about the following:

was delivered as a result of a userspace action,
for example, a direct system call
.RB ( tgkill (2),
.BR kill (2),
.BR sigqueue (3),
etc.),
expiration of a POSIX timer,
change of state on a POSIX message queue,
or completion of an asynchronous I/O request.


> However, syscall-stops happen very often (twice per system call),
> and performing
> .B PTRACE_GETSIGINFO
> for every syscall-stop may be somewhat expensive.
> .LP
> .\"
> .\" FIXME referrent of "them" in next line ???
> .\"       What does "them" refer to?
> Some architectures allow the cases to be distinguished
> by examining registers.
> For example, on x86,
> .I rax
> ==
> .RB - ENOSYS
> in syscall-enter-stop.
>
> I don't see word "them" anywhere in that line...

Hmmm -- not sure what happened there. Ignore!


> .\"
> .\" FIXME I significantly rewrote the following sentence to try to make it
> .\" clearer. Is the meaning still preserved?
> The design bug here is that a ptrace attach and a concurrently delivered
> .B SIGSTOP
> may race and the concurrent
> .B SIGSTOP
> may be lost.
>
> Yes, it looks ok.

Thanks.


> .SS execve(2) under ptrace
> .\" clone(2) THREAD_CLONE says:
> .\"     If  any  of the threads in a thread group performs an execve(2),
> .\"     then all threads other than the thread group leader are terminated,
> .\"     and the new program is executed in the thread group leader...
> .\"
> .\" FIXME mtk-addition:  please check: I added the following piece to
> .\"       clarify that multithreaded here means clone()+CLONE_THREAD
> .\"
> When one thread in a multithreaded process
> (i.e., a thread group consisting of threads created using the
> .BR clone (2)
> .B CLONE_THREAD
> flag) calls
> .\" FIXME end-mtk-addition
> .\"
>
> I think this addition is not necessary. If someone reached this point
> reading the documentation and he still doesn't understand what is meant
> by 'multithreaded' in this context, no amount of clarification will help
> that person...

Agreed, the text comes too late.

But (because the definition of thread is different in different
contexts--for example Pthreads) I think it's needed to make clear to
the reader what the definition of "multithreaded process" is for
purposes of the discussion on this man page. So, I moved that text
much closer to the start of this page.


> .\"
> .\" FIXME mtk-addition:  please check: I added the following piece:
> (Or, to put things another way, when a multithreaded process does an
> .BR execve (2),
> the kernel makes it look as though the
> .BR execve (2)
> occurred in the thread group leader, regardless of which thread did the
> .BR execve (2).)
> .\" FIXME end-mtk-addition
> .\"
>
> This is not exactly true. If tracer is tracking syscall entry/exit
> (e.g. strace),
> the entry into execve will be seen happening in the execing thread,
> but corresponding syscall exit will happen in thread leader.
> This doesn't look exactly like "execve occurred in the thread group leader"!

True. I've changed it to:

(Or, to put things another way, when a multithreaded process does an
.BR execve (2),
at completion of the call, it appears as though the
.BR execve (2)
occurred in the thread group leader, regardless of which thread did the
.BR execve (2).)


> All other threads stop in
> .\" FIXME: mtk: What is "PTRACE_EXIT stop"?
> .\"        Should that be "PTRACE_EVENT_EXIT stop"?
> .B PTRACE_EXIT
> stop,
>
> Correct, it was meant to be PTRACE_EVENT_EXIT.

Thanks.


> .\" FIXME: mtk: In the next line, "by active ptrace option" is unclear.
> .\"        What does it mean?
> if requested by active ptrace option.
>
> It means "if PTRACE_O_TRACEEXIT option was turned on".

Thanks; I made that change, and a similar one just below for PTRACE_EVENT_EXEC.


> .\"
> .\" FIXME Does "exits" in the following mean
> .\" just "_exit(2)" or both "_exit(2) and exit_group(2)"?
> If a thread group leader is traced and exits by calling
> .BR _exit (2),
> a
> .B PTRACE_EVENT_EXIT
> stop will happen for it (if requested), but the subsequent
> .B WIFEXITED
> notification will not be delivered until all other threads exit.
>
> Here "exits" means any kind of death - _exit, exit_group,
> signal death. Signal death and exit_group cases are trivial,
> though: since signal death and exit_group kill all other threads
> too, "until all other threads exit" thing happens rather soon
> in these cases. Therefore, only _exit presents observably
> puzzling behavior to ptrace users: thread leader _exit's,
> but WIFEXITED isn't reported! We are trying to explain here
> why it is so.

Okay -- thanks.


> (Note: the
> .I WSTOPSIG(status)
> macro returns the value
> .IR "(status>>8)\ &\ 0xff)" .)
>
> Unpaired parentheses in the fragment above.
> I suggest:
>
> .IR "((status>>8)\ &\ 0xff)" .)

Thanks. Fixed now.
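
For reference, a minimal sketch of what that note says, assuming the
Linux/glibc encoding of wait statuses. The helper names
stop_signal_by_hand() and make_stop_status() are made up for
illustration and are not part of any API:

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Hand-expanded form of WSTOPSIG(), as given in the note above.
 * Assumption: the Linux/glibc status encoding. */
int stop_signal_by_hand(int status)
{
    return (status >> 8) & 0xff;
}

/* Build a "stopped by sig" status the way Linux encodes it:
 * the low byte 0x7f marks a stop, the next byte carries the signal. */
int make_stop_status(int sig)
{
    assert(sig > 0 && sig <= 0xff);
    return (sig << 8) | 0x7f;
}
```

On such a hand-built status, the macro and the expanded expression
should agree.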

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-09-30 14:28   ` Denys Vlasenko
@ 2011-10-03  5:35     ` Michael Kerrisk
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Kerrisk @ 2011-10-03  5:35 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Chuck Ebbert, Blaisorblade,
	Daniel Jacobowitz

Hi Denys,

On Fri, Sep 30, 2011 at 4:28 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Thu, Sep 29, 2011 at 9:08 PM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
>> So, I took your patch, and then did a global edit of the page to fix
>> various pieces (in the existing text, as well as do some language
>> clean-ups for the new text). In the process, I found a number of
>> pieces that are still unclear (some in the old text, some in your new
>> text). I also made some changes to your text that I'd like you to
>> check. I've marked each of these with FIXME below. Could you please
>> take a look at the FIXMEs, and write me a comment for each of these.
>> (I appreciate that in some cases, especially for the existing text,
>> you may not have a handy answer Denys, but if you (and others) can
>> give any help, that would be great.)
>>
>> Rather than you writing a new patch to this version of the page, I
>> think it might be easiest if you just replied to the FIXMEs inline
>> below, then I can revise the page in the light of your comments.
>
>       Another group of commands makes the ptrace-stopped  tracee  run.   They
>       have the form:
>
>           ptrace(PTRACE_cmd, pid, 0, sig);
>
>       where  cmd  is  PTRACE_CONT, PTRACE_DETACH, PTRACE_SYSCALL, PTRACE_SIN-
>       GLESTEP, PTRACE_SYSEMU, or PTRACE_SYSEMU_SINGLESTEP.
>
> Cosmetics: cmd is, of course, CONT, DETACH,..., not PTRACE_CONT,
> PTRACE_DETACH...

Yes. But what I did to fix is change the ptrace call to:

      ptrace(cmd, pid, 0, sig);

(Having these constants shown without the "PTRACE_" prefix is a little
confusing.)

>       If the tracee  is
>       in  signal-delivery-stop,  sig  is  the signal to be injected (if it is
>       nonzero).  Otherwise, sig may be ignored.  (Recommended practice is  to
>       always pass 0 in these cases.)
>
> Looks like (my) text in last sentence is confusing. I meant:
> "If you are restarting the tracee from a ptrace-stop other than
> signal-delivery-stop, recommended practice is  to always pass
> sig == 0".

Okay -- I made that change.
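
A rough sketch of that recommendation, under stated assumptions: the
flag in_signal_delivery_stop is something the tracer is assumed to
have worked out for itself from the waitpid(2) status, and both
function names are hypothetical. The prototype follows the SYNOPSIS of
the page (glibc's enum __ptrace_request):

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Choose the 4th ptrace() argument for a restarting request:
 * the signal to inject in signal-delivery-stop, otherwise 0. */
long restart_sig_arg(int in_signal_delivery_stop, int sig)
{
    assert(sig >= 0);
    return in_signal_delivery_stop ? (long)sig : 0L;
}

/* Restart a ptrace-stopped tracee with PTRACE_CONT, PTRACE_SYSCALL,
 * PTRACE_DETACH, and so on. Returns what ptrace() returns. */
long restart_tracee(enum __ptrace_request cmd, pid_t pid,
                    int in_signal_delivery_stop, int sig)
{
    long arg = restart_sig_arg(in_signal_delivery_stop, sig);

    return ptrace(cmd, pid, (void *)0, (void *)arg);
}
```

Keeping the choice of sig in a separate function makes the rule easy
to check without a live tracee.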

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2011-10-03  5:27     ` Michael Kerrisk
@ 2012-02-13 22:02       ` Denys Vlasenko
  2012-02-26 18:25         ` Michael Kerrisk
  0 siblings, 1 reply; 18+ messages in thread
From: Denys Vlasenko @ 2012-02-13 22:02 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Blaisorblade, Daniel Jacobowitz

On Mon, Oct 3, 2011 at 7:27 AM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
> Thanks for the detailed responses. Some comments to your remarks
> below, and a couple of open questions (marked "????"). If you send me
> the answers, then I can get another draft for review.
>
>>>> - SIGSTOP _can_ be injected.
>>>
>>> Was this true at one time? If yes, then we should document past and
>>> current behavior, and note when the change occurred.
>>>
>>> In the Linux 2.4 sources, I see the following in
>>> arch/i386/kernel/signal.c::do_signal():
>>>
>>>                        /* The debugger continued.  Ignore SIGSTOP.  */
>>>                        if (signr == SIGSTOP)
>>>                                continue;
>>>
>>> Did that code prevent SIGSTOP being injected in the 2.4 series?
>>
>> Looks like it is indeed the code.
>
> ????
> Sorry -- I'm not quite clear there. You're confirming that SIGSTOP
> could not be injected in 2.4, right?

Yes. In 2.4, SIGSTOP can't be injected.



>> No need to do PTRACE_GETSIGINFO.
>> Remember, requiring PTRACE_GETSIGINFO on every ptrace stop
>> is a performance hit.
>
> Thanks. So I'll change that sentence (and the others):
>
> A subsequent PTRACE_GETSIGINFO on the stopped tracee will return a
> siginfo_t structure with si_code set to SIGTRAP|PTRACE_EVENT_FORK<<8.
>
> to:
>
> A waitpid() by the tracer will return SIGTRAP|PTRACE_EVENT_FORK<<8 as
> the status of the tracee.

The word "status" above is ambiguous. Is it the waitpid status?
Is it si_code field in PTRACE_GETSIGINFO result?
We probably need to be ridiculously verbose here
to avoid confusion:

"A waitpid() by the tracer will return status value which
will have SIGTRAP | (PTRACE_EVENT_FORK << 8) in its
most significant 24 bits. IOW: (status >> 8) will be equal to
SIGTRAP | (PTRACE_EVENT_FORK << 8)."




>> As of kernel 2.6.38,
>> after the tracer sees the tracee ptrace-stop and until it
>> restarts or kills it, the tracee will not run,
>> and will not send notifications (except
>> .B SIGKILL
>> death) to the tracer, even if the tracer enters into another
>> .BR waitpid (2)
>> call.
>> .LP
>> .\"
>> .\" FIXME ??? referent of "it" in the next line is unclear
>> .\"        What does "it" refer to?
>> Currently, it causes a problem with transparent handling of stopping
>> signals: if the tracer restarts the tracee after group-stop,
>> .B SIGSTOP
>> is effectively ignored: the tracee doesn't remain stopped, it runs.
>> If the tracer doesn't restart the tracee before entering into the next
>> .BR waitpid (2),
>> future
>> .B SIGCONT
>> signals will not be reported to the tracer.
>> This would cause
>> .B SIGCONT
>> to have no effect.
>>
>> "it" refers to ptrace behavior versus group-stops and SIGCONT,
>> as described. Feel free to rephrase.
>
> ????
> Help! I'm still having problems here. The problem may possibly be
> this: when one uses a pronoun like "it" in English, it's normally a
> back reference to some text already given. Is this "it" a back
> reference (In that case, could you please send me a rewritten version
> of the sentence that replaces "it" by some descriptive text), or is it
> a reference to the current paragraph (in other words, should this
paragraph rather start with the words "Currently, there is a problem
> with...")?

I think replacing "it" with "this kernel behavior" will do:

"Currently, this kernel behavior causes a problem with transparent
handling of stopping signals: if the tracer restarts the tracee
after group-stop, the stopping signal is effectively ignored:
the tracee doesn't remain stopped, it runs. ..."

(^^^^^^ also, replaced SIGSTOP with "the stopping signal" -
all stopping signals are equally affected).


>> No, it is not ok. Please consult sigaction(2) manpage and
>> /usr/include/bits/siginfo.h
>> For example, si_code == SI_TIMER (-2) can be sent by timer
>> expiration, which is not a system call. There are many other signal
>> sources which are not syscalls.
>
> Okay. So how about the following:
>
> was delivered as a result of a userspace action,
> for example, a direct system call
> .RB ( tgkill (2),
> .BR kill (2),
> .BR sigqueue (3),
> etc.),
> expiration of a POSIX timer,
> change of state on a POSIX message queue,
> or completion of an asynchronous I/O request.

Yes, this looks ok.



-- 
vda


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2012-02-13 22:02       ` Denys Vlasenko
@ 2012-02-26 18:25         ` Michael Kerrisk
  2012-02-26 18:42           ` Michael Kerrisk
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk @ 2012-02-26 18:25 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Blaisorblade, Daniel Jacobowitz

Hello Denys,

Thanks for these comments.

On Tue, Feb 14, 2012 at 11:02 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Mon, Oct 3, 2011 at 7:27 AM, Michael Kerrisk <mtk.manpages@gmail.com> wrote:
>> Thanks for the detailed responses. Some comments to your remarks
>> below, and a couple of open questions (marked "????"). If you send me
>> the answers, then I can get another draft for review.
>>
>>>>> - SIGSTOP _can_ be injected.
>>>>
>>>> Was this true at one time? If yes, then we should document past and
>>>> current behavior, and note when the change occurred.
>>>>
>>>> In the Linux 2.4 sources, I see the following in
>>>> arch/i386/kernel/signal.c::do_signal():
>>>>
>>>>                        /* The debugger continued.  Ignore SIGSTOP.  */
>>>>                        if (signr == SIGSTOP)
>>>>                                continue;
>>>>
>>>> Did that code prevent SIGSTOP being injected in the 2.4 series?
>>>
>>> Looks like it is indeed the code.
>>
>> ????
>> Sorry -- I'm not quite clear there. You're confirming that SIGSTOP
>> could not be injected in 2.4, right?
>
> Yes. In 2.4, SIGSTOP can't be injected.

Okay -- I added some words to (what I hope is) an appropriate place in
the page. Can you please check this in the next draft.

>>> No need to do PTRACE_GETSIGINFO.
>>> Remember, requiring PTRACE_GETSIGINFO on every ptrace stop
>>> is a performance hit.
>>
>> Thanks. So I'll change that sentence (and the others):
>>
>> A subsequent PTRACE_GETSIGINFO on the stopped tracee will return a
>> siginfo_t structure with si_code set to SIGTRAP|PTRACE_EVENT_FORK<<8.
>>
>> to:
>>
>> A waitpid() by the tracer will return SIGTRAP|PTRACE_EVENT_FORK<<8 as
>> the status of the tracee.
>
> The word "status" above is ambiguous. Is it the waitpid status?
> Is it si_code field in PTRACE_GETSIGINFO result?
> We probably need to be ridiculously verbose here
> to avoid confusion:
>
> "A waitpid() by the tracer will return status value which
> will have SIGTRAP | (PTRACE_EVENT_FORK << 8) in its
> most significant 24 bits. IOW: (status >> 8) will be equal to
> SIGTRAP | (PTRACE_EVENT_FORK << 8)."


That's a bit repetitious, so I simplified to sentences of the form:

A waitpid(2) by the tracer will return a status value such that

      status>>8 == (SIGTRAP | (PTRACE_EVENT_FORK<<8))
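
A minimal sketch of how a tracer might decode such a status, assuming
the Linux/glibc wait-status encoding. The names event_stop_status()
and ptrace_event_from_status() are invented for this illustration:

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Build the status that waitpid(2) would report for a ptrace event
 * stop (illustration only; normally the kernel produces this). */
int event_stop_status(int event)
{
    assert(event > 0 && event <= 0xff);
    return ((SIGTRAP | (event << 8)) << 8) | 0x7f;
}

/* Return the PTRACE_EVENT_* code carried in a waitpid(2) status,
 * or 0 if the status is not a ptrace event stop. */
int ptrace_event_from_status(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (status >> 16) & 0xffff;
}
```

A plain SIGTRAP stop, and a syscall stop reported as SIGTRAP|0x80,
both decode to 0 here, so only event stops yield a nonzero code.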

>>> As of kernel 2.6.38,
>>> after the tracer sees the tracee ptrace-stop and until it
>>> restarts or kills it, the tracee will not run,
>>> and will not send notifications (except
>>> .B SIGKILL
>>> death) to the tracer, even if the tracer enters into another
>>> .BR waitpid (2)
>>> call.
>>> .LP
>>> .\"
>>> .\" FIXME ??? referent of "it" in the next line is unclear
>>> .\"        What does "it" refer to?
>>> Currently, it causes a problem with transparent handling of stopping
>>> signals: if the tracer restarts the tracee after group-stop,
>>> .B SIGSTOP
>>> is effectively ignored: the tracee doesn't remain stopped, it runs.
>>> If the tracer doesn't restart the tracee before entering into the next
>>> .BR waitpid (2),
>>> future
>>> .B SIGCONT
>>> signals will not be reported to the tracer.
>>> This would cause
>>> .B SIGCONT
>>> to have no effect.
>>>
>>> "it" refers to ptrace behavior versus group-stops and SIGCONT,
>>> as described. Feel free to rephrase.
>>
>> ????
>> Help! I'm still having problems here. The problem may possibly be
>> this: when one uses a pronoun like "it" in English, it's normally a
>> back reference to some text already given. Is this "it" a back
>> reference (In that case, could you please send me a rewritten version
>> of the sentence that replaces "it" by some descriptive text), or is it
>> a reference to the current paragraph (in other words, should this
>> paragraph rather start with the words "Currently, there is a problem
>> with...")?
>
> I think replacing "it" with "this kernel behavior" will do:

That helps, but still it's a bit unclear. I'll leave you a question in
the next draft.

> "Currently, this kernel behavior causes a problem with transparent
> handling of stopping signals: if the tracer restarts the tracee
> after group-stop, the stopping signal is effectively ignored:
> the tracee doesn't remain stopped, it runs. ..."
>
> (^^^^^^ also, replaced SIGSTOP with "the stopping signal" -
> all stopping signals are equally affected).

Okay -- I made that change also.

>>> No, it is not ok. Please consult sigaction(2) manpage and
>>> /usr/include/bits/siginfo.h
>>> For example, si_code == SI_TIMER (-2) can be sent by timer
>>> expiration, which is not a system call. There are many other signal
>>> sources which are not syscalls.
>>
>> Okay. So how about the following:
>>
>> was delivered as a result of a userspace action,
>> for example, a direct system call
>> .RB ( tgkill (2),
>> .BR kill (2),
>> .BR sigqueue (3),
>> etc.),
>> expiration of a POSIX timer,
>> change of state on a POSIX message queue,
>> or completion of an asynchronous I/O request.
>
> Yes, this looks ok.

Good.

I will shortly send you another draft for review.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2012-02-26 18:25         ` Michael Kerrisk
@ 2012-02-26 18:42           ` Michael Kerrisk
  2012-02-27  0:58             ` Denys Vlasenko
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk @ 2012-02-26 18:42 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Blaisorblade, Daniel Jacobowitz

[-- Attachment #1: Type: text/plain, Size: 51174 bytes --]

Hello Denys,

Below is another iteration of the ptrace.2 page with your new
material. Could you please take a look at the page in general, and the
FIXMEs in particular? (I'd like to get specific input from you on all
of the FIXMEs, if possible.)

Thanks,

Michael

.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright (c) 1993 Michael Haardt <michael@moria.de>
.\" Fri Apr  2 11:32:09 MET DST 1993
.\"
.\" and changes Copyright (C) 1999 Mike Coleman (mkc@acm.org)
.\" -- major revision to fully document ptrace semantics per recent Linux
.\"    kernel (2.2.10) and glibc (2.1.2)
.\" Sun Nov  7 03:18:35 CST 1999
.\"
.\" and Copyright (c) 2011, Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, write to the Free
.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111,
.\" USA.
.\"
.\" Modified Fri Jul 23 23:47:18 1993 by Rik Faith <faith@cs.unc.edu>
.\" Modified Fri Jan 31 16:46:30 1997 by Eric S. Raymond <esr@thyrsus.com>
.\" Modified Thu Oct  7 17:28:49 1999 by Andries Brouwer <aeb@cwi.nl>
.\" Modified, 27 May 2004, Michael Kerrisk <mtk.manpages@gmail.com>
.\"     Added notes on capability requirements
.\"
.\" 2006-03-24, Chuck Ebbert <76306.1226@compuserve.com>
.\"    Added    PTRACE_SETOPTIONS, PTRACE_GETEVENTMSG, PTRACE_GETSIGINFO,
.\"        PTRACE_SETSIGINFO, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP
.\"    (Thanks to Blaisorblade, Daniel Jacobowitz and others who helped.)
.\" 2011-09, major update by Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" FIXME (later): Linux 3.1 adds PTRACE_SEIZE, PTRACE_INTERRUPT,
.\"                and PTRACE_LISTEN.
.\"
.TH PTRACE 2 2012-02-27 "Linux" "Linux Programmer's Manual"
.SH NAME
ptrace \- process trace
.SH SYNOPSIS
.nf
.B #include <sys/ptrace.h>
.sp
.BI "long ptrace(enum __ptrace_request " request ", pid_t " pid ", "
.BI "            void *" addr ", void *" data );
.fi
.SH DESCRIPTION
The
.BR ptrace ()
system call provides a means by which one process (the "tracer")
may observe and control the execution of another process (the "tracee"),
and examine and change the tracee's memory and registers.
It is primarily used to implement breakpoint debugging and system
call tracing.
.LP
A tracee first needs to be attached to the tracer.
Attachment and subsequent commands are per thread:
in a multithreaded process,
every thread can be individually attached to a
(potentially different) tracer,
or left not attached and thus not debugged.
Therefore, "tracee" always means "(one) thread",
never "a (possibly multithreaded) process".
Ptrace commands are always sent to
a specific tracee using a call of the form

    ptrace(PTRACE_foo, pid, ...)

where
.I pid
is the thread ID of the corresponding Linux thread.
.LP
(Note that in this page, a "multithreaded process"
means a thread group consisting of threads created using the
.BR clone (2)
.B CLONE_THREAD
flag.)
.LP
A process can initiate a trace by calling
.BR fork (2)
and having the resulting child do a
.BR PTRACE_TRACEME ,
followed (typically) by an
.BR execve (2).
Alternatively, one process may commence tracing another process using
.BR PTRACE_ATTACH .
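
The fork-and-PTRACE_TRACEME pattern might be sketched as follows.
This is an illustration under assumptions, not a reference
implementation: the function name trace_child_first_stop() is
hypothetical, and the child stops itself with raise(SIGSTOP) where a
real tracee would typically go on to call execve(2).

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that asks to be traced, then observe its first stop.
 * Returns the stop signal seen by the tracer, or -1 on any failure. */
int trace_child_first_stop(void)
{
    int status, sig;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        /* Tracee side: request tracing by the parent. */
        if (ptrace(PTRACE_TRACEME, 0, (void *)0, (void *)0) == -1)
            _exit(1);
        /* Stop so that the tracer can take control. */
        raise(SIGSTOP);
        _exit(0);
    }

    assert(child > 0);          /* tracer side from here on */

    /* Wait for the tracee to report a stop. */
    if (waitpid(child, &status, 0) == -1) {
        kill(child, SIGKILL);
        return -1;
    }
    if (!WIFSTOPPED(status))
        return -1;              /* child exited, e.g. TRACEME failed */

    sig = WSTOPSIG(status);

    /* End of the demonstration: kill and reap the tracee. */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return sig;
}
```

If tracing is permitted in the environment, the value returned is
expected to be SIGSTOP.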
.LP
While being traced, the tracee will stop each time a signal is delivered,
even if the signal is being ignored.
(An exception is
.BR SIGKILL ,
which has its usual effect.)
The tracer will be notified at its next call to
.BR waitpid (2)
(or one of the related "wait" system calls); that call will return a
.I status
value containing information that indicates
the cause of the stop in the tracee.
While the tracee is stopped,
the tracer can use various ptrace requests to inspect and modify the tracee.
The tracer then causes the tracee to continue,
optionally ignoring the delivered signal
(or even delivering a different signal instead).
.LP
When the tracer is finished tracing, it can cause the tracee to continue
executing in a normal, untraced mode via
.BR PTRACE_DETACH .
.LP
The value of
.I request
determines the action to be performed:
.TP
.B PTRACE_TRACEME
Indicate that this process is to be traced by its parent.
Any signal (except
.BR SIGKILL )
delivered to this process will cause it to stop and its
parent to be notified via
.BR waitpid (2).
In addition, all subsequent calls to
.BR execve (2)
by the traced process will cause a
.B SIGTRAP
to be sent to it,
giving the parent a chance to gain control before the new program
begins execution.
A process probably shouldn't make this request if its parent
isn't expecting to trace it.
.RI ( pid ,
.IR addr ,
and
.IR data
are ignored.)
.LP
The
.B PTRACE_TRACEME
request is used only by the tracee;
the remaining requests are used only by the tracer.
In the following requests,
.I pid
specifies the thread ID of the tracee to be acted on.
For requests other than
.BR PTRACE_KILL ,
the tracee must be stopped.
.TP
.BR PTRACE_PEEKTEXT ", " PTRACE_PEEKDATA
Read a word at the address
.I addr
in the tracee's memory, returning the word as the result of the
.BR ptrace ()
call.
Linux does not have separate text and data address spaces,
so these two requests are currently equivalent.
.RI ( data
is ignored.)
.TP
.B PTRACE_PEEKUSER
.\" PTRACE_PEEKUSR in kernel source, but glibc uses PTRACE_PEEKUSER,
.\" and that is the name that seems common on other systems.
Read a word at offset
.I addr
in the tracee's USER area,
which holds the registers and other information about the process
(see
.IR <sys/user.h> ).
The word is returned as the result of the
.BR ptrace ()
call.
Typically, the offset must be word-aligned, though this might vary by
architecture.
See NOTES.
.RI ( data
is ignored.)
.TP
.BR PTRACE_POKETEXT ", " PTRACE_POKEDATA
Copy the word
.I data
to the address
.I addr
in the tracee's memory.
As for
.BR PTRACE_PEEKTEXT
and
.BR PTRACE_PEEKDATA ,
these two requests are currently equivalent.
.TP
.B PTRACE_POKEUSER
.\" PTRACE_POKEUSR in kernel source, but glibc uses PTRACE_POKEUSER,
.\" and that is the name that seems common on other systems.
Copy the word
.I data
to offset
.I addr
in the tracee's USER area.
As for
.BR PTRACE_PEEKUSER ,
the offset must typically be word-aligned.
In order to maintain the integrity of the kernel,
some modifications to the USER area are disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.TP
.BR PTRACE_GETREGS ", " PTRACE_GETFPREGS
Copy the tracee's general purpose or floating-point registers,
respectively, to the address
.I data
in the tracer.
See
.I <sys/user.h>
for information on the format of this data.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_GETSIGINFO " (since Linux 2.3.99-pre6)"
Retrieve information about the signal that caused the stop.
Copy a
.I siginfo_t
structure (see
.BR sigaction (2))
from the tracee to the address
.I data
in the tracer.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETREGS ", " PTRACE_SETFPREGS
Copy the tracee's general purpose or floating-point registers,
respectively, from the address
.I data
in the tracer.
As for
.BR PTRACE_POKEUSER ,
some general purpose register modifications may be disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETSIGINFO " (since Linux 2.3.99-pre6)"
Set signal information:
copy a
.I siginfo_t
structure from the address
.I data
in the tracer to the tracee.
This will affect only signals that would normally be delivered to
the tracee and were caught by the tracer.
It may be difficult to tell
these normal signals from synthetic signals generated by
.BR ptrace ()
itself.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETOPTIONS " (since Linux 2.4.6; see BUGS for caveats)"
Set ptrace options from
.IR data .
.RI ( addr
is ignored.)
.IR data
is interpreted as a bit mask of options,
which are specified by the following flags:
.RS
.TP
.BR PTRACE_O_TRACESYSGOOD " (since Linux 2.4.6)"
When delivering system call traps, set bit 7 in the signal number
(i.e., deliver
.IR "SIGTRAP|0x80" ).
This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
.RB ( PTRACE_O_TRACESYSGOOD
may not work on all architectures.)
.TP
.BR PTRACE_O_TRACEFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR fork (2)
and automatically start tracing the newly forked process,
which will start with a
.BR SIGSTOP .
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_FORK<<8))
.fi

The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEVFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR vfork (2)
and automatically start tracing the newly vforked process,
which will start with a
.BR SIGSTOP .
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK<<8))
.fi

The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACECLONE " (since Linux 2.5.46)"
Stop the tracee at the next
.BR clone (2)
and automatically start tracing the newly cloned process,
which will start with a
.BR SIGSTOP .
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8))
.fi

The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
This option may not catch
.BR clone (2)
calls in all cases.
If the tracee calls
.BR clone (2)
with the
.B CLONE_VFORK
flag,
.B PTRACE_EVENT_VFORK
will be delivered instead
if
.B PTRACE_O_TRACEVFORK
is set; otherwise if the tracee calls
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
.B PTRACE_EVENT_FORK
will be delivered if
.B PTRACE_O_TRACEFORK
is set.
.TP
.BR PTRACE_O_TRACEEXEC " (since Linux 2.5.46)"
Stop the tracee at the next
.BR execve (2).
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
.fi

.TP
.BR PTRACE_O_TRACEVFORKDONE " (since Linux 2.5.60)"
Stop the tracee at the completion of the next
.BR vfork (2).
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK_DONE<<8))
.fi

The PID of the new process can (since Linux 2.6.18) be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEEXIT " (since Linux 2.5.60)"
Stop the tracee at exit.
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXIT<<8))
.fi

The tracee's exit status can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
The tracee is stopped early during process exit,
when registers are still available,
allowing the tracer to see where the exit occurred,
whereas the normal exit notification is done after the process
is finished exiting.
Even though context is available,
the tracer cannot prevent the exit from happening at this point.
.RE
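
As an illustration of the options above, a tracer might build the
bit mask and recognize system call stops roughly like this. The
sketch assumes the Linux/glibc wait-status encoding, and the function
names are made up for the example:

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* An option mask that a tracer might pass as the data argument
 * of PTRACE_SETOPTIONS. */
long example_trace_options(void)
{
    return PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEFORK |
           PTRACE_O_TRACEEXEC | PTRACE_O_TRACEEXIT;
}

/* With PTRACE_O_TRACESYSGOOD set, a system call stop is reported
 * as a stop by (SIGTRAP | 0x80). */
int is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}

/* Usage illustration on hand-built status values. */
void check_syscall_stop_examples(void)
{
    assert(is_syscall_stop(((SIGTRAP | 0x80) << 8) | 0x7f));
    assert(!is_syscall_stop((SIGTRAP << 8) | 0x7f));
}
```

The mask would be applied to a stopped tracee with a call of the form
ptrace(PTRACE_SETOPTIONS, pid, 0, mask).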
.TP
.BR PTRACE_GETEVENTMSG " (since Linux 2.5.46)"
Retrieve a message (as an
.IR "unsigned long" )
about the ptrace event
that just happened, placing it at the address
.I data
in the tracer.
For
.BR PTRACE_EVENT_EXIT ,
this is the tracee's exit status.
For
.BR PTRACE_EVENT_FORK ,
.BR PTRACE_EVENT_VFORK ,
.BR PTRACE_EVENT_VFORK_DONE ,
and
.BR PTRACE_EVENT_CLONE ,
this is the PID of the new process.
.RI ( addr
is ignored.)
.TP
.B PTRACE_CONT
Restart the stopped tracee process.
If
.I data
is nonzero,
it is interpreted as the number of a signal to be delivered to the tracee;
otherwise, no signal is delivered.
Thus, for example, the tracer can control
whether a signal sent to the tracee is delivered or not.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSCALL ", " PTRACE_SINGLESTEP
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but arrange for the tracee to be stopped at
the next entry to or exit from a system call,
or after execution of a single instruction, respectively.
(The tracee will also, as usual, be stopped upon receipt of a signal.)
From the tracer's perspective, the tracee will appear to have been
stopped by receipt of a
.BR SIGTRAP .
So, for
.BR PTRACE_SYSCALL ,
for example, the idea is to inspect
the arguments to the system call at the first stop,
then do another
.B PTRACE_SYSCALL
and inspect the return value of the system call at the second stop.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSEMU ", " PTRACE_SYSEMU_SINGLESTEP " (since Linux 2.6.14)"
For
.BR PTRACE_SYSEMU ,
continue and stop on entry to the next system call,
which will not be executed.
For
.BR PTRACE_SYSEMU_SINGLESTEP ,
do the same but also singlestep if not a system call.
This call is used by programs like
User Mode Linux that want to emulate all the tracee's system calls.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored;
not supported on all architectures.)
.TP
.B PTRACE_KILL
Send the tracee a
.B SIGKILL
to terminate it.
.RI ( addr
and
.I data
are ignored.)
.IP
.I This operation is deprecated; do not use it!
Instead, send a
.BR SIGKILL
directly using
.BR kill (2)
or
.BR tgkill (2).
The problem with
.B PTRACE_KILL
is that it requires the tracee to be in signal-delivery-stop,
otherwise it may not work
(i.e., may complete successfully but won't kill the tracee).
By contrast, sending a
.B SIGKILL
directly has no such limitation.
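
A minimal sketch of the recommended alternative, sending SIGKILL
directly. The helper names are hypothetical, and the demonstration
uses an ordinary untraced child only so that it can run without any
ptrace setup:

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Terminate a tracee by sending SIGKILL directly rather than
 * relying on PTRACE_KILL. Returns 0 on success, -1 on error. */
int kill_tracee(pid_t pid)
{
    return kill(pid, SIGKILL);
}

/* Demonstration on a forked child: returns the signal that
 * terminated the child, or -1 on failure. */
int kill_demo(void)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        for (;;)
            pause();            /* wait until killed */
    }

    assert(child > 0);          /* parent from here on */

    if (kill_tracee(child) == -1)
        return -1;
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Note that kill(2) acts on the whole process; for a single thread of a
multithreaded tracee, tgkill(2) would be the call to look at.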
.\" [Note from Denys Vlasenko:
.\"     deprecation suggested by Oleg Nesterov. He prefers to deprecate it
.\"     instead of describing (and needing to support) PTRACE_KILL's quirks.]
.TP
.B PTRACE_ATTACH
Attach to the process specified in
.IR pid ,
making it a tracee of the calling process.
.\" No longer true (removed by Denys Vlasenko, 2011, who remarks:
.\"        "I think it isn't true in non-ancient 2.4 and in 2.6/3.x.
.\"         Basically, it's not true for any Linux in practical use.
.\" ; the behavior of the tracee is as if it had done a
.\" .BR PTRACE_TRACEME .
.\" The calling process actually becomes the parent of the tracee
.\" process for most purposes (e.g., it will receive
.\" notification of tracee events and appears in
.\" .BR ps (1)
.\" output as the tracee's parent), but a
.\" .BR getppid (2)
.\" by the tracee will still return the PID of the original parent.
The tracee is sent a
.BR SIGSTOP ,
but will not necessarily have stopped
by the completion of this call; use
.BR waitpid (2)
to wait for the tracee to stop.
See the "Attaching and detaching" subsection for additional information.
.RI ( addr
and
.I data
are ignored.)
.TP
.B PTRACE_DETACH
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but first detach from it.
Under Linux, a tracee can be detached in this way regardless
of which method was used to initiate tracing.
.RI ( addr
is ignored.)
.\"
.\" In the text below, we decided to avoid prettifying the text with markup:
.\" it would make the source nearly impossible to edit, and we _do_ intend
.\" to edit it often, in order to keep it updated:
.\" ptrace API is full of quirks, no need to compound this situation by
.\" making it excruciatingly painful to document them!
.\"
.SS Death under ptrace
When a (possibly multithreaded) process receives a killing signal
(one whose disposition is set to
.B SIG_DFL
and whose default action is to kill the process),
all threads exit.
Tracees report their death to their tracer(s).
Notification of this event is delivered via
.BR waitpid (2).
.LP
Note that the killing signal will first cause signal-delivery-stop
(on one tracee only),
and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will death from the signal happen on
.I all
tracees within a multithreaded process.
(The term "signal-delivery-stop" is explained below.)
.LP
.B SIGKILL
operates similarly, with exceptions.
No signal-delivery-stop is generated for
.B SIGKILL
and therefore the tracer can't suppress it.
.B SIGKILL
kills even within system calls
(syscall-exit-stop is not generated prior to death by
.BR SIGKILL ).
The net effect is that
.B SIGKILL
always kills the process (all its threads),
even if some threads of the process are ptraced.
.LP
When the tracee calls
.BR _exit (2),
it reports its death to its tracer.
Other threads are not affected.
.LP
When any thread executes
.BR exit_group (2),
every tracee in its thread group reports its death to its tracer.
.LP
If the
.B PTRACE_O_TRACEEXIT
option is on,
.B PTRACE_EVENT_EXIT
will happen before actual death.
This applies to exits via
.BR _exit (2),
.BR exit_group (2),
and signal deaths (except
.BR SIGKILL ),
and when threads are torn down on
.BR execve (2)
in a multithreaded process.
.LP
The tracer cannot assume that the ptrace-stopped tracee exists.
There are many scenarios when the tracee may die while stopped (such as
.BR SIGKILL ).
Therefore, the tracer must be prepared to handle an
.B ESRCH
error on any ptrace operation.
Unfortunately, the same error is returned if the tracee
exists but is not ptrace-stopped
(for commands which require a stopped tracee),
or if it is not traced by the process which issued the ptrace call.
The tracer needs to keep track of the stopped/running state of the tracee,
and interpret
.B ESRCH
as "tracee died unexpectedly" only if it knows that the tracee has
been observed to enter ptrace-stop.
Note that there is no guarantee that
.I waitpid(WNOHANG)
will reliably report the tracee's death status if a
ptrace operation returned
.BR ESRCH .
.I waitpid(WNOHANG)
may return 0 instead.
In other words, the tracee may be "not yet fully dead",
but already refusing ptrace requests.
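.LP
The bookkeeping rule above can be sketched in C as follows.
This is only an illustration, not part of the ptrace API:
the structure, the enumeration, and the function name are hypothetical,
and a real tracer would fold this state into its own per-tracee records.
.nf

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-tracee bookkeeping kept by the tracer. */
struct tracee_state {
    bool attached;  /* we are this thread's tracer */
    bool stopped;   /* it was seen in ptrace-stop and not restarted since */
};

enum esrch_meaning {
    NOT_ESRCH,      /* some other error, or no error at all */
    TRACEE_DIED,    /* it vanished while we believed it was stopped */
    TRACER_BUG      /* request issued while not attached or not stopped */
};

/* Interpret the errno value left behind by a failed ptrace() request. */
static enum esrch_meaning
interpret_ptrace_errno(const struct tracee_state *t, int err)
{
    if (err != ESRCH)
        return NOT_ESRCH;
    if (t->attached && t->stopped)
        return TRACEE_DIED;
    return TRACER_BUG;
}
```
.fi
.LP
Even when the result is TRACEE_DIED, the tracer still has to keep waiting
for the death notification, since it may not be available yet.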
.LP
The tracer can't assume that the tracee
.I always
ends its life by reporting
.I WIFEXITED(status)
or
.IR WIFSIGNALED(status) ;
there are cases where this does not occur.
For example, if a thread other than thread group leader does an
.BR execve (2),
it disappears;
its PID will never be seen again,
and any subsequent ptrace stops will be reported under
the thread group leader's PID.
.SS Stopped states
A tracee can be in two states: running or stopped.
.LP
There are many kinds of states when the tracee is stopped, and in ptrace
discussions they are often conflated.
Therefore, it is important to use precise terms.
.LP
In this manual page, any stopped state in which the tracee is ready
to accept ptrace commands from the tracer is called
.IR ptrace-stop .
Ptrace-stops can
be further subdivided into
.IR signal-delivery-stop ,
.IR group-stop ,
.IR syscall-stop ,
and so on.
These stopped states are described in detail below.
.LP
When the running tracee enters ptrace-stop, it notifies its tracer using
.BR waitpid (2)
(or one of the other "wait" system calls).
Most of this manual page assumes that the tracer waits with:
.LP
    pid = waitpid(pid_or_minus_1, &status, __WALL);
.LP
Ptrace-stopped tracees are reported as returns with
.I pid
greater than 0 and
.I WIFSTOPPED(status)
true.
.\" Denys Vlasenko:
.\"     Do we require __WALL usage, or will just using 0 be ok? (With 0,
.\"     I am not 100% sure there aren't ugly corner cases.) Are the
.\"     rules different if user wants to use waitid? Will waitid require
.\"     WEXITED?
.\"
.LP
The
.B __WALL
flag does not include the
.B WSTOPPED
and
.B WEXITED
flags, but implies their functionality.
.LP
Setting the
.B WCONTINUED
flag when calling
.BR waitpid (2)
is not recommended: the "continued" state is per-process and
consuming it can confuse the real parent of the tracee.
.LP
Use of the
.B WNOHANG
flag may cause
.BR waitpid (2)
to return 0 ("no wait results available yet")
even if the tracer knows there should be a notification.
Example:
.nf

    kill(tracee, SIGKILL);
    waitpid(tracee, &status, __WALL | WNOHANG);
.fi
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.\"     waitid usage? WNOWAIT?
.\"     describe how wait notifications queue (or not queue)
.LP
The following kinds of ptrace-stops exist: signal-delivery-stops,
group-stops, PTRACE_EVENT stops, and syscall-stops.
.\"
.\" FIXME: mtk: the following text ("[, PTRACE_SINGLESTEP...") is incomplete.
.\"        Do you want to add anything?
.\"
(Stops caused by PTRACE_SINGLESTEP, PTRACE_SYSEMU, and
PTRACE_SYSEMU_SINGLESTEP are not yet documented here.)
They all are reported by
.BR waitpid (2)
with
.I WIFSTOPPED(status)
true.
They may be differentiated by examining the value
.IR status>>8 ,
and if there is ambiguity in that value, by querying
.BR PTRACE_GETSIGINFO .
(Note: the
.I WSTOPSIG(status)
macro can't be used to perform this examination,
because it returns the value
.IR "(status>>8)\ &\ 0xff" .)
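.LP
As a rough sketch of this examination, the following C function sorts a
wait status into broad categories.
The enumeration and the function name are made up for this example,
and the syscall-stop test assumes that the tracer has set
PTRACE_O_TRACESYSGOOD (described below).
.nf

```c
#include <signal.h>
#include <sys/ptrace.h>     /* PTRACE_EVENT_* values to compare *event with */
#include <sys/wait.h>

enum stop_kind {
    NOT_A_STOP,             /* WIFSTOPPED(status) is false */
    EVENT_STOP,             /* PTRACE_EVENT stop; *event says which one */
    SYSCALL_STOP,           /* recognizable only with PTRACE_O_TRACESYSGOOD */
    SIGNAL_OR_GROUP_STOP    /* PTRACE_GETSIGINFO is needed to tell them apart */
};

static enum stop_kind classify_stop(int status, int *event)
{
    *event = 0;
    if (!WIFSTOPPED(status))
        return NOT_A_STOP;

    int hi = status >> 8;   /* stop signal plus the extra event byte */

    if ((hi >> 8) != 0 && (hi & 0xff) == SIGTRAP) {
        *event = hi >> 8;
        return EVENT_STOP;
    }
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return SYSCALL_STOP;
    return SIGNAL_OR_GROUP_STOP;
}
```
.fi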
.SS Signal-delivery-stop
When a (possibly multithreaded) process receives any signal except
.BR SIGKILL ,
the kernel selects an arbitrary thread which handles the signal.
(If the signal is generated with
.BR tgkill (2),
the target thread can be explicitly selected by the caller.)
If the selected thread is traced, it enters signal-delivery-stop.
At this point, the signal is not yet delivered to the process,
and can be suppressed by the tracer.
If the tracer doesn't suppress the signal,
it passes the signal to the tracee in the next ptrace restart request.
This second step of signal delivery is called
.I "signal injection"
in this manual page.
Note that if the signal is blocked,
signal-delivery-stop doesn't happen until the signal is unblocked,
with the usual exception that
.B SIGSTOP
can't be blocked.
.LP
Signal-delivery-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal returned by
.IR WSTOPSIG(status) .
If the stopping signal is
.BR SIGTRAP ,
this may be a different kind of ptrace-stop;
see the "Syscall-stops" and "execve" sections below for details.
If
.I WSTOPSIG(status)
returns a stopping signal, this may be a group-stop; see below.
.SS Signal injection and suppression
After signal-delivery-stop is observed by the tracer,
the tracer should restart the tracee with the call
.LP
    ptrace(PTRACE_restart, pid, 0, sig)
.LP
where
.B PTRACE_restart
is one of the restarting ptrace requests.
If
.I sig
is 0, then a signal is not delivered.
Otherwise, the signal
.I sig
is delivered.
This operation is called
.I "signal injection"
in this manual page, to distinguish it from signal-delivery-stop.
.LP
The
.I sig
value may be different from the
.I WSTOPSIG(status)
value: the tracer can cause a different signal to be injected.
.LP
Note that a suppressed signal still causes system calls to return
prematurely.
Restartable system calls will be restarted (the tracer will
observe the tracee to execute
.BR restart_syscall (2)
if the tracer uses
.BR PTRACE_SYSCALL );
non-restartable system calls may fail with
.B EINTR
even though no observable signal is injected to the tracee.
.LP
Restarting ptrace commands issued in ptrace-stops other than
signal-delivery-stop are not guaranteed to inject a signal, even if
.I sig
is nonzero.
No error is reported; a nonzero
.I sig
may simply be ignored.
Ptrace users should not try to "create a new signal" this way: use
.BR tgkill (2)
instead.
.LP
The fact that signal injection requests may be ignored
when restarting the tracee after
ptrace stops that are not signal-delivery-stops
is a cause of confusion among ptrace users.
One typical scenario is that the tracer observes group-stop,
mistakes it for signal-delivery-stop, restarts the tracee with
.LP
    ptrace(PTRACE_restart, pid, 0, stopsig)
.LP
with the intention of injecting
.IR stopsig ,
but
.I stopsig
gets ignored and the tracee continues to run.
.LP
The
.B SIGCONT
signal has a side effect of waking up (all threads of)
a group-stopped process.
This side effect happens before signal-delivery-stop.
The tracer can't suppress this side effect (it can
only suppress signal injection, which only causes the
.BR SIGCONT
handler to not be executed in the tracee, if such a handler is installed).
In fact, waking up from group-stop may be followed by
signal-delivery-stop for signal(s)
.I other than
.BR SIGCONT ,
if they were pending when
.B SIGCONT
was delivered.
In other words,
.B SIGCONT
may not be the first signal observed by the tracee after it was sent.
.LP
Stopping signals cause (all threads of) a process to enter group-stop.
This side effect happens after signal injection, and therefore can be
suppressed by the tracer.
.LP
In Linux 2.4 and earlier, the
.B SIGSTOP
signal can't be injected.
.\" In the Linux 2.4 sources, in arch/i386/kernel/signal.c::do_signal(),
.\" there is:
.\"
.\"             /* The debugger continued.  Ignore SIGSTOP.  */
.\"             if (signr == SIGSTOP)
.\"                     continue;
.LP
.B PTRACE_GETSIGINFO
can be used to retrieve a
.I siginfo_t
structure which corresponds to the delivered signal.
.B PTRACE_SETSIGINFO
may be used to modify it.
If
.B PTRACE_SETSIGINFO
has been used to alter
.IR siginfo_t ,
the
.I si_signo
field and the
.I sig
parameter in the restarting command must match,
otherwise the result is undefined.
.SS Group-stop
When a (possibly multithreaded) process receives a stopping signal,
all threads stop.
If some threads are traced, they enter a group-stop.
Note that the stopping signal will first cause signal-delivery-stop
(on one tracee only), and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will group-stop be initiated on
.I all
tracees within the multithreaded process.
As usual, every tracee reports its group-stop separately
to the corresponding tracer.
.LP
Group-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal available via
.IR WSTOPSIG(status) .
The same result is returned by some other classes of ptrace-stops,
therefore the recommended practice is to perform the call
.LP
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
.LP
The call can be avoided if the signal is not
.BR SIGSTOP ,
.BR SIGTSTP ,
.BR SIGTTIN ,
or
.BR SIGTTOU ;
only these four signals are stopping signals.
If the tracer sees something else, it can't be a group-stop.
Otherwise, the tracer needs to call
.BR PTRACE_GETSIGINFO .
If
.B PTRACE_GETSIGINFO
fails with
.BR EINVAL ,
then it is definitely a group-stop.
(Other failure codes are possible, such as
.B ESRCH
("no such process") if a
.B SIGKILL
killed the tracee.)
.LP
As of kernel 2.6.38,
after the tracer sees the tracee ptrace-stop and until it
restarts or kills it, the tracee will not run,
and will not send notifications (except
.B SIGKILL
death) to the tracer, even if the tracer enters into another
.BR waitpid (2)
call.
.LP
.\" FIXME It is unclear what "this kernel behavior" refers to.
.\" Can show me exactly which piece of text above or below is
.\" referred to when you say "this kernel behavior"?
The kernel behavior described in the previous paragraph
causes a problem with transparent handling of stopping signals.
If the tracer restarts the tracee after group-stop,
the stopping signal
is effectively ignored: the tracee doesn't remain stopped, it runs.
If the tracer doesn't restart the tracee before entering into the next
.BR waitpid (2),
future
.B SIGCONT
signals will not be reported to the tracer;
those
.B SIGCONT
signals would then have no effect on the tracee.
.SS PTRACE_EVENT stops
If the tracer sets
.B PTRACE_O_TRACE_*
options, the tracee will enter ptrace-stops called
.B PTRACE_EVENT
stops.
.LP
.B PTRACE_EVENT
stops are observed by the tracer as
.BR waitpid (2)
returning with
.IR WIFSTOPPED(status) ,
and
.I WSTOPSIG(status)
returns
.BR SIGTRAP .
An additional bit is set in the higher byte of the status word:
the value
.I status>>8
will be

    (SIGTRAP | PTRACE_EVENT_foo << 8).

The following events exist:
.TP
.B PTRACE_EVENT_VFORK
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag.
When the tracee is continued after this stop, it will wait for the child to
exit/exec before continuing its execution
(in other words, the usual behavior on
.BR vfork (2)).
.TP
.B PTRACE_EVENT_FORK
Stop before return from
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD .
.TP
.B PTRACE_EVENT_CLONE
Stop before return from
.BR clone (2).
.TP
.B PTRACE_EVENT_VFORK_DONE
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
but after the child unblocked this tracee by exiting or execing.
.LP
For all four stops described above,
the stop occurs in the parent (i.e., the tracee),
not in the newly created thread.
.BR PTRACE_GETEVENTMSG
can be used to retrieve the new thread's ID.
.TP
.B PTRACE_EVENT_EXEC
Stop before return from
.BR execve (2).
.TP
.B PTRACE_EVENT_EXIT
Stop before exit (including death from
.BR exit_group (2)),
signal death, or exit caused by
.BR execve (2)
in a multithreaded process.
.B PTRACE_GETEVENTMSG
returns the exit status.
Registers can be examined
(unlike when "real" exit happens).
The tracee is still alive; it needs to be
.BR PTRACE_CONT ed
or
.BR PTRACE_DETACH ed
to finish exiting.
.LP
.B PTRACE_GETSIGINFO
on
.B PTRACE_EVENT
stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.IR "(event<<8)\ |\ SIGTRAP" .
.SS Syscall-stops
If the tracee was restarted by
.BR PTRACE_SYSCALL ,
the tracee enters
syscall-enter-stop just prior to entering any system call.
If the tracer restarts the tracee with
.BR PTRACE_SYSCALL ,
the tracee enters syscall-exit-stop when the system call is finished,
or if it is interrupted by a signal.
(That is, signal-delivery-stop never happens between syscall-enter-stop
and syscall-exit-stop; it happens
.I after
syscall-exit-stop.)
.LP
Other possibilities are that the tracee may stop in a
.B PTRACE_EVENT
stop, exit (if it entered
.BR _exit (2)
or
.BR exit_group (2)),
be killed by
.BR SIGKILL ,
or die silently (if it is a thread group leader, the
.BR execve (2)
happened in another thread,
and that thread is not traced by the same tracer;
this situation is discussed later).
.LP
Syscall-enter-stop and syscall-exit-stop are observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, and
.I WSTOPSIG(status)
giving
.BR SIGTRAP .
If the
.B PTRACE_O_TRACESYSGOOD
option was set by the tracer, then
.I WSTOPSIG(status)
will give the value
.IR "(SIGTRAP\ |\ 0x80)" .
.LP
Syscall-stops can be distinguished from signal-delivery-stop with
.B SIGTRAP
by querying
.BR PTRACE_GETSIGINFO
for the following cases:
.TP
.IR si_code " <= 0"
.B SIGTRAP
was delivered as a result of a userspace action,
for example, a system call
.RB ( tgkill (2),
.BR kill (2),
.BR sigqueue (3),
etc.),
expiration of a POSIX timer,
change of state on a POSIX message queue,
or completion of an asynchronous I/O request.
.TP
.IR si_code " == SI_KERNEL (0x80)"
.B SIGTRAP
was sent by the kernel.
.TP
.IR si_code " == SIGTRAP or " si_code " == (SIGTRAP|0x80)"
This is a syscall-stop.
.LP
However, syscall-stops happen very often (twice per system call),
and performing
.B PTRACE_GETSIGINFO
for every syscall-stop may be somewhat expensive.
.LP
Some architectures allow the cases to be distinguished
by examining registers.
For example, on x86,
.I rax
==
.RB - ENOSYS
in syscall-enter-stop.
Since
.B SIGTRAP
(like any other signal) always happens
.I after
syscall-exit-stop,
and at this point
.I rax
almost never contains
.RB - ENOSYS ,
the
.B SIGTRAP
looks like "syscall-stop which is not syscall-enter-stop";
in other words, it looks like a
"stray syscall-exit-stop" and can be detected this way.
But such detection is fragile and is best avoided.
.LP
Using the
.B PTRACE_O_TRACESYSGOOD
option is the recommended method to distinguish syscall-stops
from other kinds of ptrace-stops,
since it is reliable and does not incur a performance penalty.
.LP
Syscall-enter-stop and syscall-exit-stop are
indistinguishable from each other by the tracer.
The tracer needs to keep track of the sequence of
ptrace-stops in order to not misinterpret syscall-enter-stop as
syscall-exit-stop or vice versa.
The rule is that syscall-enter-stop is
always followed by syscall-exit-stop,
.B PTRACE_EVENT
stop or the tracee's death;
no other kinds of ptrace-stop can occur in between.
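.LP
One simple way to keep track of this sequence is a per-tracee flag that is
flipped on every syscall-stop.
The following C sketch uses hypothetical names and assumes that the tracer
calls on_syscall_stop() for each syscall-stop it observes and
on_restart() each time it restarts the tracee.
.nf

```c
#include <stdbool.h>

/* Hypothetical per-tracee state; initialize in_syscall to false. */
struct syscall_tracker {
    bool in_syscall;    /* true between syscall-enter-stop and -exit-stop */
};

enum syscall_phase { SYSCALL_ENTER, SYSCALL_EXIT };

/* Call once for every syscall-stop reported for this tracee. */
static enum syscall_phase on_syscall_stop(struct syscall_tracker *t)
{
    t->in_syscall = !t->in_syscall;
    return t->in_syscall ? SYSCALL_ENTER : SYSCALL_EXIT;
}

/*
 * Call when restarting the tracee.  If a request other than
 * PTRACE_SYSCALL is used, no syscall-exit-stop will be generated,
 * so the next syscall-stop seen will be a syscall-enter-stop again.
 */
static void on_restart(struct syscall_tracker *t, bool used_ptrace_syscall)
{
    if (!used_ptrace_syscall)
        t->in_syscall = false;
}
```
.fi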
.LP
If after syscall-enter-stop,
the tracer uses a restarting command other than
.BR PTRACE_SYSCALL ,
syscall-exit-stop is not generated.
.LP
.B PTRACE_GETSIGINFO
on syscall-stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.B SIGTRAP
or
.IR (SIGTRAP|0x80) .
.SS PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops
.\"
.\" FIXME The following TODO is unresolved
.\"       Do you want to add anything, or (less good) do we just
.\"       convert this into a comment in the source indicating
.\"       that these points still need to be documented?
.\"
(TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
PTRACE_SYSEMU_SINGLESTEP)
.SS Informational and restarting ptrace commands
Most ptrace commands (all except
.BR PTRACE_ATTACH ,
.BR PTRACE_TRACEME ,
and
.BR PTRACE_KILL )
require the tracee to be in a ptrace-stop, otherwise they fail with
.BR ESRCH .
.LP
When the tracee is in ptrace-stop,
the tracer can read and write data to
the tracee using informational commands.
These commands leave the tracee in ptrace-stopped state:
.LP
.nf
    ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
    ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
    ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
.fi
.LP
Note that some errors are not reported.
For example, setting signal information
.RI ( siginfo )
may have no effect in some ptrace-stops, yet the call may succeed
(return 0 and not set
.IR errno );
querying
.B PTRACE_GETEVENTMSG
may succeed and return some random value if current ptrace-stop
is not documented as returning a meaningful event message.
.LP
The call

    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

affects one tracee.
The tracee's current flags are replaced.
Flags are inherited by new tracees created and "auto-attached" via active
.BR PTRACE_O_TRACEFORK ,
.BR PTRACE_O_TRACEVFORK ,
or
.BR PTRACE_O_TRACECLONE
options.
.LP
Another group of commands makes the ptrace-stopped tracee run.
They have the form:
.LP
    ptrace(cmd, pid, 0, sig);
.LP
where
.I cmd
is
.BR PTRACE_CONT ,
.BR PTRACE_DETACH ,
.BR PTRACE_SYSCALL ,
.BR PTRACE_SINGLESTEP ,
.BR PTRACE_SYSEMU ,
or
.BR PTRACE_SYSEMU_SINGLESTEP .
If the tracee is in signal-delivery-stop,
.I sig
is the signal to be injected (if it is nonzero).
Otherwise,
.I sig
may be ignored.
(When restarting a tracee from a ptrace-stop other than signal-delivery-stop,
recommended practice is to always pass 0 in
.IR sig .)
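.LP
The recommended practice can be captured in a tiny C helper that chooses
the value to pass as
.I sig
in a restarting request.
The function name and its parameters are hypothetical;
the tracer is assumed to have already worked out which kind of
ptrace-stop it is looking at.
.nf

```c
#include <stdbool.h>

/*
 * stopsig is WSTOPSIG(status) from the stop being left;
 * suppress is true when the tracer wants to discard the signal.
 */
static int sig_for_restart(bool in_signal_delivery_stop, int stopsig,
                           bool suppress)
{
    if (!in_signal_delivery_stop)
        return 0;       /* a nonzero value might be silently ignored */
    return suppress ? 0 : stopsig;
}
```
.fi
.LP
The result would then be passed as the last argument of, for example,
a PTRACE_CONT or PTRACE_SYSCALL request.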
.SS Attaching and detaching
A thread can be attached to the tracer using the call

    ptrace(PTRACE_ATTACH, pid, 0, 0);

This also sends
.B SIGSTOP
to this thread.
If the tracer wants this
.B SIGSTOP
to have no effect, it needs to suppress it.
Note that if other signals are concurrently sent to
this thread during attach,
the tracer may see the tracee enter signal-delivery-stop
with other signal(s) first!
The usual practice is to reinject these signals until
.B SIGSTOP
is seen, then suppress
.B SIGSTOP
injection.
The design bug here is that a ptrace attach and a concurrently delivered
.B SIGSTOP
may race and the concurrent
.B SIGSTOP
may be lost.
.\"
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"	   Do you want to add any text?
.\"
.\"      Describe how to attach to a thread which is already group-stopped.
.LP
Since attaching sends
.B SIGSTOP
and the tracer usually suppresses it, this may cause a stray
.B EINTR
return from the currently executing system call in the tracee,
as described in the "Signal injection and suppression" section.
.LP
The request

    ptrace(PTRACE_TRACEME, 0, 0, 0);

turns the calling thread into a tracee.
The thread continues to run (doesn't enter ptrace-stop).
A common practice is to follow the
.B PTRACE_TRACEME
with

    raise(SIGSTOP);

and allow the parent (which is our tracer now) to observe our
signal-delivery-stop.
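.LP
A minimal C sketch of this common practice follows.
The function name and the exit code 127 are choices made for this example
only, and a real tracee would normally go on to call
.BR execve (2)
rather than exit.
.nf

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that asks to be traced and then stops itself.
 * Returns the child's ID once it is in signal-delivery-stop for
 * SIGSTOP, or -1 if that state could not be reached.
 */
static pid_t spawn_stopped_tracee(void)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);         /* could not become a tracee */
        raise(SIGSTOP);         /* let the parent observe the stop */
        _exit(0);               /* a real tracee would execve() here */
    }

    if (waitpid(pid, &status, __WALL) == -1)
        return -1;
    if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP)
        return pid;
    if (WIFSTOPPED(status)) {   /* unexpected stop: give up on the child */
        kill(pid, SIGKILL);
        waitpid(pid, &status, __WALL);
    }
    return -1;
}
```
.fi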
.LP
If the
.BR PTRACE_O_TRACEVFORK ,
.BR PTRACE_O_TRACEFORK ,
or
.BR PTRACE_O_TRACECLONE
options are in effect, then children created by, respectively,
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
and other kinds of
.BR clone (2),
are automatically attached to the same tracer which traced their parent.
.B SIGSTOP
is delivered to the children, causing them to enter
signal-delivery-stop after they exit the system call which created them.
.LP
Detaching of the tracee is performed by:

    ptrace(PTRACE_DETACH, pid, 0, sig);

.B PTRACE_DETACH
is a restarting operation;
therefore it requires the tracee to be in ptrace-stop.
If the tracee is in signal-delivery-stop, a signal can be injected.
Otherwise, the
.I sig
parameter may be silently ignored.
.LP
If the tracee is running when the tracer wants to detach it,
the usual solution is to send
.B SIGSTOP
(using
.BR tgkill (2),
to make sure it goes to the correct thread),
wait for the tracee to stop in signal-delivery-stop for
.B SIGSTOP
and then detach it (suppressing
.B SIGSTOP
injection).
A design bug is that this can race with concurrent
.BR SIGSTOP s.
Another complication is that the tracee may enter other ptrace-stops
and needs to be restarted and waited for again, until
.B SIGSTOP
is seen.
Yet another complication is to be sure that
the tracee is not already ptrace-stopped,
because no signal delivery happens while it is\(emnot even
.BR SIGSTOP .
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       Do you want to add anything?
.\"
.\"     Describe how to detach from a group-stopped tracee so that it
.\"     doesn't run, but continues to wait for SIGCONT.
.\"
.LP
If the tracer dies, all tracees are automatically detached and restarted,
unless they were in group-stop.
Handling of restart from group-stop is
.\" FIXME: Define currently
currently buggy, but the
.\" FIXME: Planned for when? And should applications be designed
.\" in some way so as to allow for this future change?
"as planned" behavior is to leave tracee stopped and waiting for
.BR SIGCONT .
If the tracee is restarted from signal-delivery-stop,
the pending signal is injected.
.SS execve(2) under ptrace
.\" clone(2) CLONE_THREAD says:
.\"     If  any  of the threads in a thread group performs an execve(2),
.\"     then all threads other than the thread group leader are terminated,
.\"     and the new program is executed in the thread group leader.
.\"
When one thread in a multithreaded process calls
.BR execve (2),
the kernel destroys all other threads in the process,
.\" In kernel 3.1 sources, see fs/exec.c::de_thread()
and resets the thread ID of the execing thread to the
thread group ID (process ID).
(Or, to put things another way, when a multithreaded process does an
.BR execve (2),
at completion of the call, it appears as though the
.BR execve (2)
occurred in the thread group leader, regardless of which thread did the
.BR execve (2).)
This resetting of the thread ID looks very confusing to tracers:
.IP * 3
All other threads stop in
.B PTRACE_EVENT_EXIT
stop,
if the
.BR PTRACE_O_TRACEEXIT
option was turned on.
Then all other threads except the thread group leader report
death as if they exited via
.BR _exit (2)
with exit code 0.
Then a
.B PTRACE_EVENT_EXEC
stop happens, if the
.BR PTRACE_O_TRACEEXEC
option was turned on.
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       (on which tracee - leader? execve-ing one?)
.\"
.\" FIXME: Please check: at various places in the following,
.\"        I have changed "pid" to "[the tracee's] thread ID"
.\"        Is that okay?
.IP *
The execing tracee changes its thread ID while it is in the
.BR execve (2).
(Remember, under ptrace, the "pid" returned from
.BR waitpid (2),
or fed into ptrace calls, is the tracee's thread ID.)
That is, the tracee's thread ID is reset to be the same as its process ID,
which is the same as the thread group leader's thread ID.
.IP *
If the thread group leader has reported its death by this time,
it appears to the tracer that
the dead thread leader "reappears from nowhere".
If the thread group leader was still alive,
for the tracer this may look as if thread group leader
returns from a different system call than it entered,
or even "returned from a system call even though
it was not in any system call".
If the thread group leader was not traced
(or was traced by a different tracer), then during
.BR execve (2)
it will appear as if it has become a tracee of
the tracer of the execing tracee.
.LP
All of the above effects are the artifacts of
the thread ID change in the tracee.
.LP
The
.B PTRACE_O_TRACEEXEC
option is the recommended tool for dealing with this situation.
It enables
.B PTRACE_EVENT_EXEC
stop, which occurs before
.BR execve (2)
returns.
.\" FIXME Following on from the previous sentences,
.\"       can/should we add a few more words on how
.\"       PTRACE_EVENT_EXEC stop helps us deal with this situation?
.LP
The thread ID change happens before
.B PTRACE_EVENT_EXEC
stop, not after.
.LP
When the tracer receives
.B PTRACE_EVENT_EXEC
stop notification,
it is guaranteed that except for this tracee and the thread group leader,
no other threads from the process are alive.
.LP
On receiving the
.B PTRACE_EVENT_EXEC
stop notification,
the tracer should clean up all its internal
data structures describing the threads of this process,
and retain only one data structure\(emone which
describes the single still running tracee, with

    thread ID == thread group ID == process ID.
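.LP
Assuming the tracer keeps a flat table of traced threads that records
each thread's thread group ID, the cleanup could be sketched in C as follows.
The structure and the function name are hypothetical and are not part of
any API.
.nf

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical tracer-side record of one traced thread. */
struct tracee {
    pid_t tid;      /* thread ID, as returned by waitpid(2) */
    pid_t tgid;     /* thread group ID (process ID) */
};

/*
 * Handle a PTRACE_EVENT_EXEC stop reported for pid: drop every record
 * that belongs to that process and keep a single record with
 * tid == tgid == pid.  Returns the new number of records in tab.
 */
static size_t forget_threads_on_exec(struct tracee *tab, size_t n, pid_t pid)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++)
        if (tab[i].tgid != pid)
            tab[out++] = tab[i];    /* unrelated process: keep it */

    if (out == n)
        return n;                   /* nothing was known about this process */

    tab[out].tid = pid;             /* the one surviving thread */
    tab[out].tgid = pid;
    return out + 1;
}
```
.fi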
.LP
Currently, there is no way to retrieve the former
thread ID of the execing tracee.
If the tracer doesn't keep track of its tracees' thread group relations,
it may be unable to know which tracee execed and therefore no longer
exists under the old thread ID due to a thread ID change.
.LP
Example: two threads call
.BR execve (2)
at the same time:
.LP
.nf
*** we get syscall-enter-stop in thread 1: **
PID1 execve("/bin/foo", "foo" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 1 **
*** we get syscall-enter-stop in thread 2: **
PID2 execve("/bin/bar", "bar" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 2 **
*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
*** we get syscall-exit-stop for PID0: **
PID0 <... execve resumed> )             = 0
.fi
.LP
In this situation, there is no way to know which
.BR execve (2)
succeeded.
.LP
If the
.B PTRACE_O_TRACEEXEC
option is
.I not
in effect for the execing tracee, the kernel delivers an extra
.B SIGTRAP
to the tracee after
.BR execve (2)
returns.
This is an ordinary signal (similar to one which can be
generated by
.IR "kill -TRAP" ),
not a special kind of ptrace-stop.
Employing
.B PTRACE_GETSIGINFO
for this signal returns
.I si_code
set to 0
.RI ( SI_USER ).
This signal may be blocked by signal mask,
and thus may be delivered (much) later.
.LP
Usually, the tracer (for example,
.BR strace (1))
would not want to show this extra post-execve
.B SIGTRAP
signal to the user, and would suppress its delivery to the tracee (if
.B SIGTRAP
is set to
.BR SIG_DFL ,
it is a killing signal).
However, determining
.I which
.B SIGTRAP
to suppress is not easy.
Setting the
.B PTRACE_O_TRACEEXEC
option and thus suppressing this extra
.B SIGTRAP
is the recommended approach.
.SS Real parent
The ptrace API (ab)uses the standard UNIX parent/child signaling over
.BR waitpid (2).
This used to cause the real parent of the process to stop receiving
several kinds of
.BR waitpid (2)
notifications when the child process is traced by some other process.
.LP
Many of these bugs have been fixed, but as of Linux 2.6.38 several still
exist; see BUGS below.
.LP
As of Linux 2.6.38, the following is believed to work correctly:
.IP * 3
exit/death by signal is reported first to the tracer, then,
when the tracer consumes the
.BR waitpid (2)
result, to the real parent (to the real parent only when the
whole multithreaded process exits).
.\"
.\" FIXME mtk: Please check: In the next line,
.\" I changed "they" to "the tracer and the real parent". Okay?
If the tracer and the real parent are the same process,
the report is sent only once.
.SH "RETURN VALUE"
On success,
.B PTRACE_PEEK*
requests return the requested data,
while other requests return zero.
On error, all requests return \-1, and
.I errno
is set appropriately.
Since the value returned by a successful
.B PTRACE_PEEK*
request may be \-1, the caller must clear
.I errno
before the call, and then check it afterward
to determine whether or not an error occurred.
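.LP
The following C sketch shows one way to apply this rule to
PTRACE_PEEKDATA.
The wrapper name and its calling convention are invented for this example.
.nf

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/*
 * Read one word at addr in the tracee.  Returns 0 and stores the word
 * in *out on success; returns -1 with errno set on failure.
 */
static int peek_word(pid_t pid, void *addr, long *out)
{
    long word;

    errno = 0;                  /* so that a returned -1 can be judged */
    word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (word == -1 && errno != 0)
        return -1;              /* a real error, not the data value -1 */
    *out = word;
    return 0;
}
```
.fi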
.SH ERRORS
.TP
.B EBUSY
(i386 only) There was an error with allocating or freeing a debug register.
.TP
.B EFAULT
There was an attempt to read from or write to an invalid area in
the tracer's or the tracee's memory,
probably because the area wasn't mapped or accessible.
Unfortunately, under Linux, different variations of this fault
will return
.B EIO
or
.B EFAULT
more or less arbitrarily.
.TP
.B EINVAL
An attempt was made to set an invalid option.
.TP
.B EIO
.I request
is invalid, or an attempt was made to read from or
write to an invalid area in the tracer's or the tracee's memory,
or there was a word-alignment violation,
or an invalid signal was specified during a restart request.
.TP
.B EPERM
The specified process cannot be traced.
This could be because the
tracer has insufficient privileges (the required capability is
.BR CAP_SYS_PTRACE );
unprivileged processes cannot trace processes that they
cannot send signals to or those running
set-user-ID/set-group-ID programs, for obvious reasons.
.\"
.\" FIXME I reworked the discussion of init below to note
.\" the kernel version (2.6.26) when the behavior changed for
.\" tracing init(8). Okay?
Alternatively, the process may already be being traced,
or (on kernels before 2.6.26) be
.BR init (8)
(PID 1).
.TP
.B ESRCH
The specified process does not exist, or is not currently being traced
by the caller, or is not stopped
(for requests that require a stopped tracee).
.SH "CONFORMING TO"
SVr4, 4.3BSD.
.SH NOTES
Although arguments to
.BR ptrace ()
are interpreted according to the prototype given,
glibc currently declares
.BR ptrace ()
as a variadic function with only the
.I request
argument fixed.
This means that unneeded trailing arguments may be omitted,
though doing so makes use of undocumented
.BR gcc (1)
behavior.
.\" FIXME Please review. I reinstated the following, noting the
.\" kernel version number where it ceased to be true
.LP
In Linux kernels before 2.6.26,
.\" See commit 00cd5c37afd5f431ac186dd131705048c0a11fdb
.BR init (8),
the process with PID 1, may not be traced.
.LP
The layout of the contents of memory and the USER area are
quite operating-system- and architecture-specific.
The offset supplied, and the data returned,
might not entirely match with the definition of
.IR "struct user" .
.\" See http://lkml.org/lkml/2008/5/8/375
.LP
The size of a "word" is determined by the operating-system variant
(e.g., for 32-bit Linux it is 32 bits).
.\" FIXME So, can we just remove the following text (rather than
.\" just commenting it out)?
.\"
.\" Covered in more details above: (removed by dv)
.\" .LP
.\" Tracing causes a few subtle differences in the semantics of
.\" traced processes.
.\" For example, if a process is attached to with
.\" .BR PTRACE_ATTACH ,
.\" its original parent can no longer receive notification via
.\" .BR waitpid (2)
.\" when it stops, and there is no way for the new parent to
.\" effectively simulate this notification.
.\" .LP
.\" When the parent receives an event with
.\" .B PTRACE_EVENT_*
.\" set,
.\" the tracee is not in the normal signal delivery path.
.\" This means the parent cannot do
.\" .BR ptrace (PTRACE_CONT)
.\" with a signal or
.\" .BR ptrace (PTRACE_KILL).
.\" .BR kill (2)
.\" with a
.\" .B SIGKILL
.\" signal can be used instead to kill the tracee
.\" after receiving one of these messages.
.\" .LP
This page documents the way the
.BR ptrace ()
call works currently in Linux.
Its behavior differs noticeably on other flavors of UNIX.
In any case, use of
.BR ptrace ()
is highly specific to the operating system and architecture.
.SH BUGS
On hosts with 2.6 kernel headers,
.B PTRACE_SETOPTIONS
is declared with a different value than the one for 2.4.
This leads to applications compiled with 2.6 kernel
headers failing when run on 2.4 kernels.
This can be worked around by redefining
.B PTRACE_SETOPTIONS
to
.BR PTRACE_OLDSETOPTIONS ,
if that is defined.
.LP
Group-stop notifications are sent to the tracer, but not to the real parent.
Last confirmed on 2.6.38.6.
.LP
If a thread group leader is traced and exits by calling
.BR _exit (2),
.\" Note from Denys Vlasenko:
.\"     Here "exits" means any kind of death - _exit, exit_group,
.\"     signal death. Signal death and exit_group cases are trivial,
.\"     though: since signal death and exit_group kill all other threads
.\"     too, "until all other threads exit" thing happens rather soon
.\"     in these cases. Therefore, only _exit presents observably
.\"     puzzling behavior to ptrace users: thread leader _exit's,
.\"     but WIFEXITED isn't reported! We are trying to explain here
.\"     why it is so.
a
.B PTRACE_EVENT_EXIT
stop will happen for it (if requested), but the subsequent
.B WIFEXITED
notification will not be delivered until all other threads exit.
As explained above, if one of the other threads calls
.BR execve (2),
the death of the thread group leader will
.I never
be reported.
If the execed thread is not traced by this tracer,
the tracer will never know that
.BR execve (2)
happened.
One possible workaround is to
.B PTRACE_DETACH
the thread group leader instead of restarting it in this case.
Last confirmed on 2.6.38.6.
.\"        ^^^ need to test/verify this scenario
.\" FIXME: mtk: the preceding comment seems to be unresolved?
.\"        Do you want to add anything?
.LP
A
.B SIGKILL
signal may still cause a
.B PTRACE_EVENT_EXIT
stop before actual signal death.
This may be changed in the future;
.B SIGKILL
is meant to always immediately kill tasks even under ptrace.
Last confirmed on 2.6.38.6.
.SH "SEE ALSO"
.BR gdb (1),
.BR strace (1),
.BR clone (2),
.BR execve (2),
.BR fork (2),
.BR gettid (2),
.BR sigaction (2),
.BR tgkill (2),
.BR vfork (2),
.BR waitpid (2),
.BR exec (3),
.BR capabilities (7),
.BR signal (7)

[-- Attachment #2: ptrace.2 --]
[-- Type: application/octet-stream, Size: 50922 bytes --]

.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright (c) 1993 Michael Haardt <michael@moria.de>
.\" Fri Apr  2 11:32:09 MET DST 1993
.\"
.\" and changes Copyright (C) 1999 Mike Coleman (mkc@acm.org)
.\" -- major revision to fully document ptrace semantics per recent Linux
.\"    kernel (2.2.10) and glibc (2.1.2)
.\" Sun Nov  7 03:18:35 CST 1999
.\"
.\" and Copyright (c) 2011, Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, write to the Free
.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111,
.\" USA.
.\"
.\" Modified Fri Jul 23 23:47:18 1993 by Rik Faith <faith@cs.unc.edu>
.\" Modified Fri Jan 31 16:46:30 1997 by Eric S. Raymond <esr@thyrsus.com>
.\" Modified Thu Oct  7 17:28:49 1999 by Andries Brouwer <aeb@cwi.nl>
.\" Modified, 27 May 2004, Michael Kerrisk <mtk.manpages@gmail.com>
.\"     Added notes on capability requirements
.\"
.\" 2006-03-24, Chuck Ebbert <76306.1226@compuserve.com>
.\"    Added    PTRACE_SETOPTIONS, PTRACE_GETEVENTMSG, PTRACE_GETSIGINFO,
.\"        PTRACE_SETSIGINFO, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP
.\"    (Thanks to Blaisorblade, Daniel Jacobowitz and others who helped.)
.\" 2011-09, major update by Denys Vlasenko <vda.linux@googlemail.com>
.\"
.\" FIXME (later): Linux 3.1 adds PTRACE_SEIZE, PTRACE_INTERRUPT,
.\"                and PTRACE_LISTEN.
.\"
.TH PTRACE 2 2012-02-27 "Linux" "Linux Programmer's Manual"
.SH NAME
ptrace \- process trace
.SH SYNOPSIS
.nf
.B #include <sys/ptrace.h>
.sp
.BI "long ptrace(enum __ptrace_request " request ", pid_t " pid ", "
.BI "            void *" addr ", void *" data );
.fi
.SH DESCRIPTION
The
.BR ptrace ()
system call provides a means by which one process (the "tracer")
may observe and control the execution of another process (the "tracee"),
and examine and change the tracee's memory and registers.
It is primarily used to implement breakpoint debugging and system
call tracing.
.LP
A tracee first needs to be attached to the tracer.
Attachment and subsequent commands are per thread:
in a multithreaded process,
every thread can be individually attached to a
(potentially different) tracer,
or left not attached and thus not debugged.
Therefore, "tracee" always means "(one) thread",
never "a (possibly multithreaded) process".
Ptrace commands are always sent to
a specific tracee using a call of the form

    ptrace(PTRACE_foo, pid, ...)

where
.I pid
is the thread ID of the corresponding Linux thread.
.LP
(Note that in this page, a "multithreaded process"
means a thread group consisting of threads created using the
.BR clone (2)
.B CLONE_THREAD
flag.)
.LP
A process can initiate a trace by calling
.BR fork (2)
and having the resulting child do a
.BR PTRACE_TRACEME ,
followed (typically) by an
.BR execve (2).
Alternatively, one process may commence tracing another process using
.BR PTRACE_ATTACH .
.LP
While being traced, the tracee will stop each time a signal is delivered,
even if the signal is being ignored.
(An exception is
.BR SIGKILL ,
which has its usual effect.)
The tracer will be notified at its next call to
.BR waitpid (2)
(or one of the related "wait" system calls); that call will return a
.I status
value containing information that indicates
the cause of the stop in the tracee.
While the tracee is stopped,
the tracer can use various ptrace requests to inspect and modify the tracee.
The tracer then causes the tracee to continue,
optionally ignoring the delivered signal
(or even delivering a different signal instead).
.LP
When the tracer is finished tracing, it can cause the tracee to continue
executing in a normal, untraced mode via
.BR PTRACE_DETACH .
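.LP
The sequence described above (fork, PTRACE_TRACEME, wait for a stop,
restart) can be sketched in C as follows.
This is a minimal illustration under stated assumptions, not a reference
implementation: the helper name trace_child_until_exit() is hypothetical,
and having the child call raise(SIGSTOP) is just one convenient way
to give the tracer a first stop to observe.
.nf

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any API): fork a child that asks to be
 * traced, stops itself with SIGSTOP and then exits with exit_code.
 * Returns the exit status seen by the tracer, or -1 on error.
 */
int trace_child_until_exit(int exit_code)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        /* Tracee: pid, addr and data are ignored for PTRACE_TRACEME. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);         /* reported to the tracer as a stop */
        _exit(exit_code);
    }

    /* Tracer: the first notification should be the SIGSTOP stop. */
    if (waitpid(child, &status, __WALL) != child || !WIFSTOPPED(status))
        return -1;

    /* Restart with data == 0, so that the SIGSTOP is not delivered. */
    if (ptrace(PTRACE_CONT, child, NULL, NULL) == -1) {
        kill(child, SIGKILL);
        waitpid(child, &status, __WALL);
        return -1;
    }

    /* The next notification should be the normal exit of the tracee. */
    if (waitpid(child, &status, __WALL) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```
.fi
.LP
A real tracer would loop on waitpid(2) and handle every kind of stop;
this sketch expects exactly one stop followed by a normal exit.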
.LP
The value of
.I request
determines the action to be performed:
.TP
.B PTRACE_TRACEME
Indicate that this process is to be traced by its parent.
Any signal (except
.BR SIGKILL )
delivered to this process will cause it to stop and its
parent to be notified via
.BR waitpid (2).
In addition, all subsequent calls to
.BR execve (2)
by the traced process will cause a
.B SIGTRAP
to be sent to it,
giving the parent a chance to gain control before the new program
begins execution.
A process probably shouldn't make this request if its parent
isn't expecting to trace it.
.RI ( pid ,
.IR addr ,
and
.IR data
are ignored.)
.LP
The
.B PTRACE_TRACEME
request is used only by the tracee;
the remaining requests are used only by the tracer.
In the following requests,
.I pid
specifies the thread ID of the tracee to be acted on.
For requests other than
.B PTRACE_ATTACH
and
.BR PTRACE_KILL ,
the tracee must be stopped.
.TP
.BR PTRACE_PEEKTEXT ", " PTRACE_PEEKDATA
Read a word at the address
.I addr
in the tracee's memory, returning the word as the result of the
.BR ptrace ()
call.
Linux does not have separate text and data address spaces,
so these two requests are currently equivalent.
.RI ( data
is ignored.)
.TP
.B PTRACE_PEEKUSER
.\" PTRACE_PEEKUSR in kernel source, but glibc uses PTRACE_PEEKUSER,
.\" and that is the name that seems common on other systems.
Read a word at offset
.I addr
in the tracee's USER area,
which holds the registers and other information about the process
(see
.IR <sys/user.h> ).
The word is returned as the result of the
.BR ptrace ()
call.
Typically, the offset must be word-aligned, though this might vary by
architecture.
See NOTES.
.RI ( data
is ignored.)
.TP
.BR PTRACE_POKETEXT ", " PTRACE_POKEDATA
Copy the word
.I data
to the address
.I addr
in the tracee's memory.
As for
.BR PTRACE_PEEKTEXT 
and
.BR PTRACE_PEEKDATA ,
these two requests are currently equivalent.
.TP
.B PTRACE_POKEUSER
.\" PTRACE_POKEUSR in kernel source, but glibc uses PTRACE_POKEUSER,
.\" and that is the name that seems common on other systems.
Copy the word
.I data
to offset
.I addr
in the tracee's USER area.
As for
.BR PTRACE_PEEKUSER ,
the offset must typically be word-aligned.
In order to maintain the integrity of the kernel,
some modifications to the USER area are disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.TP
.BR PTRACE_GETREGS ", " PTRACE_GETFPREGS
Copy the tracee's general purpose or floating-point registers,
respectively, to the address
.I data
in the tracer.
See
.I <sys/user.h>
for information on the format of this data.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_GETSIGINFO " (since Linux 2.3.99-pre6)"
Retrieve information about the signal that caused the stop.
Copy a
.I siginfo_t
structure (see
.BR sigaction (2))
from the tracee to the address
.I data
in the tracer.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETREGS ", " PTRACE_SETFPREGS
Modify the tracee's general purpose or floating-point registers,
respectively, from the address
.I data
in the tracer.
As for
.BR PTRACE_POKEUSER ,
some general purpose register modifications may be disallowed.
.\" FIXME In the preceding sentence, which modifications are disallowed,
.\" and when they are disallowed, how does userspace discover that fact?
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETSIGINFO " (since Linux 2.3.99-pre6)"
Set signal information:
copy a
.I siginfo_t
structure from the address
.I data
in the tracer to the tracee.
This will affect only signals that would normally be delivered to
the tracee and were caught by the tracer.
It may be difficult to tell
these normal signals from synthetic signals generated by
.BR ptrace ()
itself.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SETOPTIONS " (since Linux 2.4.6; see BUGS for caveats)"
Set ptrace options from
.IR data .
.RI ( addr
is ignored.)
.IR data
is interpreted as a bit mask of options,
which are specified by the following flags:
.RS
.TP
.BR PTRACE_O_TRACESYSGOOD " (since Linux 2.4.6)"
When delivering system call traps, set bit 7 in the signal number
(i.e., deliver
.IR "SIGTRAP|0x80" ).
This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
.RB ( PTRACE_O_TRACESYSGOOD
may not work on all architectures.)
.TP
.BR PTRACE_O_TRACEFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR fork (2)
and automatically start tracing the newly forked process,
which will start with a
.BR SIGSTOP .
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_FORK<<8))
.fi

The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEVFORK " (since Linux 2.5.46)"
Stop the tracee at the next
.BR vfork (2)
and automatically start tracing the newly vforked process,
which will start with a
.BR SIGSTOP .
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK<<8))
.fi

The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACECLONE " (since Linux 2.5.46)"
Stop the tracee at the next
.BR clone (2)
and automatically start tracing the newly cloned process,
which will start with a
.BR SIGSTOP .
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8))
.fi

The PID of the new process can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
This option may not catch
.BR clone (2)
calls in all cases.
If the tracee calls
.BR clone (2)
with the
.B CLONE_VFORK
flag,
.B PTRACE_EVENT_VFORK
will be delivered instead
if
.B PTRACE_O_TRACEVFORK
is set; otherwise if the tracee calls
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
.B PTRACE_EVENT_FORK
will be delivered if
.B PTRACE_O_TRACEFORK
is set.
.TP
.BR PTRACE_O_TRACEEXEC " (since Linux 2.5.46)"
Stop the tracee at the next
.BR execve (2).
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
.fi

.TP
.BR PTRACE_O_TRACEVFORKDONE " (since Linux 2.5.60)"
Stop the tracee at the completion of the next
.BR vfork (2).
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK_DONE<<8))
.fi

The PID of the new process can (since Linux 2.6.18) be retrieved with
.BR PTRACE_GETEVENTMSG .
.TP
.BR PTRACE_O_TRACEEXIT " (since Linux 2.5.60)"
Stop the tracee at exit.
A
.BR waitpid (2)
by the tracer will return a
.I status
value such that

.nf
  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXIT<<8))
.fi

The tracee's exit status can be retrieved with
.BR PTRACE_GETEVENTMSG .
.IP
The tracee is stopped early during process exit,
when registers are still available,
allowing the tracer to see where the exit occurred,
whereas the normal exit notification is done after the process
is finished exiting.
Even though context is available,
the tracer cannot prevent the exit from happening at this point.
.RE
.TP
.BR PTRACE_GETEVENTMSG " (since Linux 2.5.46)"
Retrieve a message (as an
.IR "unsigned long" )
about the ptrace event
that just happened, placing it at the address
.I data
in the tracer.
For
.BR PTRACE_EVENT_EXIT ,
this is the tracee's exit status.
For
.BR PTRACE_EVENT_FORK ,
.BR PTRACE_EVENT_VFORK ,
.BR PTRACE_EVENT_VFORK_DONE ,
and
.BR PTRACE_EVENT_CLONE ,
this is the PID of the new process.
.RI (  addr
is ignored.)
.TP
.B PTRACE_CONT
Restart the stopped tracee process.
If
.I data
is nonzero,
it is interpreted as the number of a signal to be delivered to the tracee;
otherwise, no signal is delivered.
Thus, for example, the tracer can control
whether a signal sent to the tracee is delivered or not.
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSCALL ", " PTRACE_SINGLESTEP
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but arrange for the tracee to be stopped at
the next entry to or exit from a system call,
or after execution of a single instruction, respectively.
(The tracee will also, as usual, be stopped upon receipt of a signal.)
From the tracer's perspective, the tracee will appear to have been
stopped by receipt of a
.BR SIGTRAP .
So, for
.BR PTRACE_SYSCALL ,
for example, the idea is to inspect
the arguments to the system call at the first stop,
then do another
.B PTRACE_SYSCALL
and inspect the return value of the system call at the second stop.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored.)
.TP
.BR PTRACE_SYSEMU ", " PTRACE_SYSEMU_SINGLESTEP " (since Linux 2.6.14)"
For
.BR PTRACE_SYSEMU ,
continue and stop on entry to the next system call,
which will not be executed.
For
.BR PTRACE_SYSEMU_SINGLESTEP ,
do the same but also singlestep if not a system call.
This call is used by programs like
User Mode Linux that want to emulate all the tracee's system calls.
The
.I data
argument is treated as for
.BR PTRACE_CONT .
.RI ( addr
is ignored;
not supported on all architectures.)
.TP
.B PTRACE_KILL
Send the tracee a
.B SIGKILL
to terminate it.
.RI ( addr
and
.I data
are ignored.)
.IP
.I This operation is deprecated; do not use it!
Instead, send a
.BR SIGKILL
directly using
.BR kill (2)
or
.BR tgkill (2).
The problem with
.B PTRACE_KILL
is that it requires the tracee to be in signal-delivery-stop,
otherwise it may not work
(i.e., may complete successfully but won't kill the tracee).
By contrast, sending a
.B SIGKILL
directly has no such limitation.
.\" [Note from Denys Vlasenko:
.\"     deprecation suggested by Oleg Nesterov. He prefers to deprecate it
.\"     instead of describing (and needing to support) PTRACE_KILL's quirks.]
.TP
.B PTRACE_ATTACH
Attach to the process specified in
.IR pid ,
making it a tracee of the calling process.
.\" No longer true (removed by Denys Vlasenko, 2011, who remarks:
.\"        "I think it isn't true in non-ancient 2.4 and in 2.6/3.x.
.\"         Basically, it's not true for any Linux in practical use.
.\" ; the behavior of the tracee is as if it had done a
.\" .BR PTRACE_TRACEME .
.\" The calling process actually becomes the parent of the tracee
.\" process for most purposes (e.g., it will receive
.\" notification of tracee events and appears in
.\" .BR ps (1)
.\" output as the tracee's parent), but a
.\" .BR getppid (2)
.\" by the tracee will still return the PID of the original parent.
The tracee is sent a
.BR SIGSTOP ,
but will not necessarily have stopped
by the completion of this call; use
.BR waitpid (2)
to wait for the tracee to stop.
See the "Attaching and detaching" subsection for additional information.
.RI ( addr
and
.I data
are ignored.)
.TP
.B PTRACE_DETACH
Restart the stopped tracee as for
.BR PTRACE_CONT ,
but first detach from it.
Under Linux, a tracee can be detached in this way regardless
of which method was used to initiate tracing.
.RI ( addr
is ignored.)
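.LP
As an illustration of the requests listed above, the following sketch
reads one word from a stopped tracee with PTRACE_PEEKDATA and then
terminates the tracee with kill(2) instead of the deprecated PTRACE_KILL.
The names marker, peek_word() and peek_marker_in_child() are hypothetical.
Because a successful peek may legitimately return \-1, the sketch
clears errno before the call and checks it afterwards.
.nf

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical test word; the child inherits it at the same address. */
static long marker = 0x1234;

/*
 * Read one word at addr in a stopped tracee.  Returns 0 and stores the
 * word in *out, or returns -1 with errno set.
 */
int peek_word(pid_t pid, void *addr, long *out)
{
    long word;

    errno = 0;
    word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (word == -1 && errno != 0)
        return -1;
    *out = word;
    return 0;
}

/* Fork a traced child, read marker from its memory, then kill it. */
long peek_marker_in_child(void)
{
    int status;
    long word = -1;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(0);
    }

    if (waitpid(child, &status, __WALL) != child || !WIFSTOPPED(status))
        return -1;              /* child never stopped; nothing to read */

    if (peek_word(child, &marker, &word) == -1)
        word = -1;

    /* Terminate with kill(2) rather than the deprecated PTRACE_KILL. */
    kill(child, SIGKILL);
    waitpid(child, &status, __WALL);
    return word;
}
```
.fi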
.\"
.\" In the text below, we decided to avoid prettifying the text with markup:
.\" it would make the source nearly impossible to edit, and we _do_ intend
.\" to edit it often, in order to keep it updated:
.\" ptrace API is full of quirks, no need to compound this situation by
.\" making it excruciatingly painful to document them!
.\"
.SS Death under ptrace
When a (possibly multithreaded) process receives a killing signal
(one whose disposition is set to
.B SIG_DFL
and whose default action is to kill the process),
all threads exit.
Tracees report their death to their tracer(s).
Notification of this event is delivered via
.BR waitpid (2).
.LP
Note that the killing signal will first cause signal-delivery-stop
(on one tracee only),
and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will death from the signal happen on
.I all
tracees within a multithreaded process.
(The term "signal-delivery-stop" is explained below.)
.LP
.B SIGKILL
operates similarly, with exceptions.
No signal-delivery-stop is generated for
.B SIGKILL
and therefore the tracer can't suppress it.
.B SIGKILL
kills even within system calls
(syscall-exit-stop is not generated prior to death by
.BR SIGKILL ).
The net effect is that
.B SIGKILL
always kills the process (all its threads),
even if some threads of the process are ptraced.
.LP
When the tracee calls
.BR _exit (2),
it reports its death to its tracer.
Other threads are not affected.
.LP
When any thread executes
.BR exit_group (2),
every tracee in its thread group reports its death to its tracer.
.LP
If the
.B PTRACE_O_TRACEEXIT
option is on,
.B PTRACE_EVENT_EXIT
will happen before actual death.
This applies to exits via
.BR exit (2),
.BR exit_group (2),
and signal deaths (except
.BR SIGKILL ),
and when threads are torn down on
.BR execve (2)
in a multithreaded process.
.LP
The tracer cannot assume that the ptrace-stopped tracee exists.
There are many scenarios when the tracee may die while stopped (such as
.BR SIGKILL ).
Therefore, the tracer must be prepared to handle an 
.B ESRCH
error on any ptrace operation.
Unfortunately, the same error is returned if the tracee
exists but is not ptrace-stopped
(for commands which require a stopped tracee),
or if it is not traced by the process which issued the ptrace call.
The tracer needs to keep track of the stopped/running state of the tracee,
and interpret
.B ESRCH
as "tracee died unexpectedly" only if it knows that the tracee has
been observed to enter ptrace-stop.
Note that there is no guarantee that
.I waitpid(WNOHANG)
will reliably report the tracee's death status if a
ptrace operation returned
.BR ESRCH .
.I waitpid(WNOHANG)
may return 0 instead.
In other words, the tracee may be "not yet fully dead",
but already refusing ptrace requests.
.LP
The tracer can't assume that the tracee
.I always
ends its life by reporting
.I WIFEXITED(status)
or
.IR WIFSIGNALED(status) ;
there are cases where this does not occur.
For example, if a thread other than thread group leader does an
.BR execve (2),
it disappears;
its PID will never be seen again,
and any subsequent ptrace stops will be reported under
the thread group leader's PID.
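.LP
The bookkeeping needed to interpret ESRCH can be sketched as follows.
The structure and function names are hypothetical; the only point is
that the tracer records whether it has seen the tracee enter ptrace-stop
and consults that record when a ptrace request fails.
.nf

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical per-tracee record kept by the tracer. */
struct tracee {
    pid_t pid;
    int   in_ptrace_stop;   /* set when waitpid() reports a stop,
                               cleared whenever the tracee is restarted */
};

enum esrch_meaning {
    NOT_ESRCH,              /* some other error, or no error at all */
    TRACEE_DIED,            /* tracee was known to be stopped: it died */
    ESRCH_AMBIGUOUS         /* tracee was running, gone, or not ours */
};

/* Interpret the errno value left by a failed ptrace request. */
enum esrch_meaning interpret_ptrace_errno(const struct tracee *t, int err)
{
    if (err != ESRCH)
        return NOT_ESRCH;
    return t->in_ptrace_stop ? TRACEE_DIED : ESRCH_AMBIGUOUS;
}
```
.fi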
.SS Stopped states
A tracee can be in two states: running or stopped.
.LP
There are many kinds of states when the tracee is stopped, and in ptrace
discussions they are often conflated.
Therefore, it is important to use precise terms.
.LP
In this manual page, any stopped state in which the tracee is ready
to accept ptrace commands from the tracer is called
.IR ptrace-stop .
Ptrace-stops can
be further subdivided into
.IR signal-delivery-stop ,
.IR group-stop ,
.IR syscall-stop ,
and so on.
These stopped states are described in detail below.
.LP
When the running tracee enters ptrace-stop, it notifies its tracer using
.BR waitpid (2)
(or one of the other "wait" system calls).
Most of this manual page assumes that the tracer waits with:
.LP
    pid = waitpid(pid_or_minus_1, &status, __WALL);
.LP
Ptrace-stopped tracees are reported as returns with
.I pid
greater than 0 and
.I WIFSTOPPED(status)
true.
.\" Denys Vlasenko:
.\"     Do we require __WALL usage, or will just using 0 be ok? (With 0,
.\"     I am not 100% sure there aren't ugly corner cases.) Are the
.\"     rules different if user wants to use waitid? Will waitid require
.\"     WEXITED?
.\"
.LP
The
.B __WALL
flag does not include the
.B WSTOPPED
and
.B WEXITED
flags, but implies their functionality.
.LP
Setting the
.B WCONTINUED
flag when calling
.BR waitpid (2)
is not recommended: the "continued" state is per-process and
consuming it can confuse the real parent of the tracee.
.LP
Use of the
.B WNOHANG
flag may cause
.BR waitpid (2)
to return 0 ("no wait results available yet")
even if the tracer knows there should be a notification.
Example:
.nf

    kill(tracee, SIGKILL);
    waitpid(tracee, &status, __WALL | WNOHANG);
.fi
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"        Do you want to add anything?
.\"
.\"     waitid usage? WNOWAIT?
.\"     describe how wait notifications queue (or not queue)
.LP
The following kinds of ptrace-stops exist: signal-delivery-stops,
group-stop, PTRACE_EVENT stops, syscall-stops
.\"
.\" FIXME: mtk: the following text ("[, PTRACE_SINGLESTEP...") is incomplete.
.\"        Do you want to add anything?
.\"
[, PTRACE_SINGLESTEP, PTRACE_SYSEMU,
PTRACE_SYSEMU_SINGLESTEP].
They all are reported by
.BR waitpid (2)
with
.I WIFSTOPPED(status)
true.
They may be differentiated by examining the value
.IR status>>8 ,
and if there is ambiguity in that value, by querying
.BR PTRACE_GETSIGINFO .
(Note: the
.I WSTOPSIG(status)
macro can't be used to perform this examination,
because it returns the value
.IR "(status>>8)\ &\ 0xff" .)
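.LP
A first-level classification of the status value might look like the
following sketch.
The enum and function names are hypothetical.
The second helper simply returns status>>8, which, unlike
WSTOPSIG(status), keeps the bits stored above the signal number.
.nf

```c
#include <sys/wait.h>

enum wait_kind { TRACEE_EXITED, TRACEE_KILLED, TRACEE_STOPPED, TRACEE_OTHER };

/* Classify a status returned by waitpid(pid, &status, __WALL). */
enum wait_kind classify_wait_status(int status)
{
    if (WIFEXITED(status))
        return TRACEE_EXITED;
    if (WIFSIGNALED(status))
        return TRACEE_KILLED;
    if (WIFSTOPPED(status))
        return TRACEE_STOPPED;      /* some kind of ptrace-stop */
    return TRACEE_OTHER;            /* for example, WIFCONTINUED */
}

/*
 * Value used to tell ptrace-stops apart.  WSTOPSIG(status) keeps only
 * the low 8 bits of this value and so loses any PTRACE_EVENT code.
 */
int stop_code(int status)
{
    return status >> 8;
}
```
.fi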
.SS Signal-delivery-stop
When a (possibly multithreaded) process receives any signal except
.BR SIGKILL ,
the kernel selects an arbitrary thread which handles the signal.
(If the signal is generated with
.BR tgkill (2),
the target thread can be explicitly selected by the caller.)
If the selected thread is traced, it enters signal-delivery-stop.
At this point, the signal is not yet delivered to the process,
and can be suppressed by the tracer.
If the tracer doesn't suppress the signal,
it passes the signal to the tracee in the next ptrace restart request.
This second step of signal delivery is called
.I "signal injection"
in this manual page.
Note that if the signal is blocked,
signal-delivery-stop doesn't happen until the signal is unblocked,
with the usual exception that
.B SIGSTOP
can't be blocked.
.LP
Signal-delivery-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal returned by
.IR WSTOPSIG(status) .
If the stopping signal is
.BR SIGTRAP ,
this may be a different kind of ptrace-stop;
see the "Syscall-stops" and "execve" sections below for details.
If
.I WSTOPSIG(status)
returns a stopping signal, this may be a group-stop; see below.
.SS Signal injection and suppression
After signal-delivery-stop is observed by the tracer,
the tracer should restart the tracee with the call
.LP
    ptrace(PTRACE_restart, pid, 0, sig)
.LP
where
.B PTRACE_restart
is one of the restarting ptrace requests.
If
.I sig
is 0, then a signal is not delivered.
Otherwise, the signal
.I sig
is delivered.
This operation is called
.I "signal injection"
in this manual page, to distinguish it from signal-delivery-stop.
.LP
The
.I sig
value may be different from the
.I WSTOPSIG(status)
value: the tracer can cause a different signal to be injected.
.LP
Note that a suppressed signal still causes system calls to return
prematurely.
Restartable system calls will be restarted (the tracer will
observe the tracee to execute
.BR restart_syscall (2)
if the tracer uses
.BR PTRACE_SYSCALL );
non-restartable system calls may fail with
.B EINTR
even though no observable signal is injected to the tracee.
.LP
Restarting ptrace commands issued in ptrace-stops other than
signal-delivery-stop are not guaranteed to inject a signal, even if
.I sig
is nonzero.
No error is reported; a nonzero
.I sig
may simply be ignored.
Ptrace users should not try to "create a new signal" this way: use
.BR tgkill (2)
instead.
.LP
The fact that signal injection requests may be ignored
when restarting the tracee after
ptrace stops that are not signal-delivery-stops
is a cause of confusion among ptrace users.
One typical scenario is that the tracer observes group-stop,
mistakes it for signal-delivery-stop, restarts the tracee with

    ptrace(PTRACE_restart, pid, 0, stopsig)

with the intention of injecting
.IR stopsig ,
but
.I stopsig
gets ignored and the tracee continues to run.
.LP
The
.B SIGCONT
signal has a side effect of waking up (all threads of)
a group-stopped process.
This side effect happens before signal-delivery-stop.
The tracer can't suppress this side effect (it can
only suppress signal injection, which only causes the
.BR SIGCONT
handler to not be executed in the tracee, if such a handler is installed).
In fact, waking up from group-stop may be followed by
signal-delivery-stop for signal(s)
.I other than
.BR SIGCONT ,
if they were pending when
.B SIGCONT
was delivered.
In other words,
.B SIGCONT
may not be the first signal observed by the tracee after it was sent.
.LP
Stopping signals cause (all threads of) a process to enter group-stop.
This side effect happens after signal injection, and therefore can be
suppressed by the tracer.
.LP
In Linux 2.4 and earlier, the
.B SIGSTOP
signal can't be injected.
.\" In the Linux 2.4 sources, in arch/i386/kernel/signal.c::do_signal(),
.\" there is:
.\" 
.\"             /* The debugger continued.  Ignore SIGSTOP.  */
.\"             if (signr == SIGSTOP)
.\"                     continue;
.LP
.B PTRACE_GETSIGINFO
can be used to retrieve a
.I siginfo_t
structure which corresponds to the delivered signal.
.B PTRACE_SETSIGINFO
may be used to modify it.
If
.B PTRACE_SETSIGINFO
has been used to alter
.IR siginfo_t ,
the
.I si_signo
field and the
.I sig
parameter in the restarting command must match,
otherwise the result is undefined.
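.LP
The restart rule described in this subsection can be sketched as follows.
Both function names are hypothetical, and PTRACE_CONT stands in for
whichever restarting request the tracer actually uses.
The sketch passes a nonzero signal only when the tracer knows the tracee
is in signal-delivery-stop, and passes 0 in every other ptrace-stop.
.nf

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/*
 * Signal to pass in a restart request.  Only signal-delivery-stop
 * reliably honors a nonzero sig, so 0 is used for every other stop.
 */
int sig_for_restart(int in_signal_delivery_stop, int sig)
{
    return in_signal_delivery_stop ? sig : 0;
}

/* Restart a stopped tracee; returns 0, or -1 with errno set. */
int restart_tracee(pid_t pid, int in_signal_delivery_stop, int sig)
{
    int inject = sig_for_restart(in_signal_delivery_stop, sig);

    if (ptrace(PTRACE_CONT, pid, NULL, (void *)(intptr_t)inject) == -1)
        return -1;
    return 0;
}
```
.fi
.LP
To suppress a signal in signal-delivery-stop, the tracer would call
restart_tracee(pid, 1, 0); to create a new signal, it would use
.BR tgkill (2)
rather than a restart request.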
.SS Group-stop
When a (possibly multithreaded) process receives a stopping signal,
all threads stop.
If some threads are traced, they enter a group-stop.
Note that the stopping signal will first cause signal-delivery-stop
(on one tracee only), and only after it is injected by the tracer
(or after it was dispatched to a thread which isn't traced),
will group-stop be initiated on
.I all
tracees within the multithreaded process.
As usual, every tracee reports its group-stop separately
to the corresponding tracer.
.LP
Group-stop is observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, with the stopping signal available via
.IR WSTOPSIG(status) .
The same result is returned by some other classes of ptrace-stops,
therefore the recommended practice is to perform the call
.LP
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
.LP
The call can be avoided if the signal is not
.BR SIGSTOP ,
.BR SIGTSTP ,
.BR SIGTTIN ,
or
.BR SIGTTOU ;
only these four signals are stopping signals.
If the tracer sees something else, it can't be a group-stop.
Otherwise, the tracer needs to call
.BR PTRACE_GETSIGINFO .
If
.B PTRACE_GETSIGINFO
fails with
.BR EINVAL ,
then it is definitely a group-stop.
(Other failure codes are possible, such as
.B ESRCH
("no such process") if a
.B SIGKILL
killed the tracee.)
.LP
As of kernel 2.6.38,
after the tracer sees the tracee ptrace-stop and until it
restarts or kills it, the tracee will not run,
and will not send notifications (except
.B SIGKILL
death) to the tracer, even if the tracer enters into another
.BR waitpid (2)
call.
.LP
.\" FIXME It is unclear what "this kernel behavior" refers to.
.\" Can show me exactly which piece of text above or below is
.\" referred to when you say "this kernel behavior"?
Currently, this kernel behavior
causes a problem with transparent handling of stopping signals:
if the tracer restarts the tracee after group-stop,
the stopping signal
is effectively ignored\(emthe tracee doesn't remain stopped, it runs.
If the tracer doesn't restart the tracee before entering into the next
.BR waitpid (2),
future
.B SIGCONT
signals will not be reported to the tracer.
This would cause
.B SIGCONT
to have no effect.
.SS PTRACE_EVENT stops
If the tracer sets
.B PTRACE_O_TRACE_*
options, the tracee will enter ptrace-stops called
.B PTRACE_EVENT
stops.
.LP
.B PTRACE_EVENT
stops are observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true,
and
.I WSTOPSIG(status)
returns
.BR SIGTRAP .
An additional bit is set in the higher byte of the status word:
the value
.I status>>8
will be

    (SIGTRAP | PTRACE_EVENT_foo << 8).

The following events exist:
.TP
.B PTRACE_EVENT_VFORK
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag.
When the tracee is continued after this stop, it will wait for the child to
exit/exec before continuing its execution
(in other words, the usual behavior on
.BR vfork (2)).
.TP
.B PTRACE_EVENT_FORK
Stop before return from
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD .
.TP
.B PTRACE_EVENT_CLONE
Stop before return from
.BR clone (2).
.TP
.B PTRACE_EVENT_VFORK_DONE
Stop before return from
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
but after the child unblocked this tracee by exiting or execing.
.LP
For all four stops described above,
the stop occurs in the parent (i.e., the tracee),
not in the newly created thread.
.BR PTRACE_GETEVENTMSG
can be used to retrieve the new thread's ID.
.TP
.B PTRACE_EVENT_EXEC
Stop before return from
.BR execve (2).
.TP
.B PTRACE_EVENT_EXIT
Stop before exit (including death from
.BR exit_group (2)),
signal death, or exit caused by
.BR execve (2)
in a multithreaded process.
.B PTRACE_GETEVENTMSG
returns the exit status.
Registers can be examined
(unlike when "real" exit happens).
The tracee is still alive; it needs to be
.BR PTRACE_CONT ed
or
.BR PTRACE_DETACH ed
to finish exiting.
.LP
.B PTRACE_GETSIGINFO
on
.B PTRACE_EVENT
stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.IR "(event<<8)\ |\ SIGTRAP" .
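.LP
As an illustration of the status encoding described in this section,
the event code could be extracted from a
.BR waitpid (2)
status roughly as follows.
The function name
.IR ptrace_event_from_status ()
is hypothetical; this is a sketch, not a definitive implementation.
.nf
```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Returns the PTRACE_EVENT_* code carried by a waitpid(2) status,
 * or 0 if the status does not describe a PTRACE_EVENT stop.
 */
static int ptrace_event_from_status(int status)
{
    /* PTRACE_EVENT stops are reported as stops with signal SIGTRAP. */
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    /* status>>8 is (SIGTRAP | event << 8), so the event is status>>16. */
    return (status >> 16) & 0xffff;
}
```
.fi
.LP
A return value of 0 covers both ordinary
.B SIGTRAP
stops and syscall-stops, which carry no event code.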
.SS Syscall-stops
If the tracee was restarted by
.BR PTRACE_SYSCALL ,
the tracee enters
syscall-enter-stop just prior to entering any system call.
If the tracer restarts the tracee with
.BR PTRACE_SYSCALL ,
the tracee enters syscall-exit-stop when the system call is finished,
or if it is interrupted by a signal.
(That is, signal-delivery-stop never happens between syscall-enter-stop
and syscall-exit-stop; it happens
.I after
syscall-exit-stop.)
.LP
Other possibilities are that the tracee may stop in a
.B PTRACE_EVENT
stop, exit (if it entered
.BR _exit (2)
or
.BR exit_group (2)),
be killed by
.BR SIGKILL ,
or die silently (if it is a thread group leader, the
.BR execve (2)
happened in another thread,
and that thread is not traced by the same tracer;
this situation is discussed later).
.LP
Syscall-enter-stop and syscall-exit-stop are observed by the tracer as
.BR waitpid (2)
returning with
.I WIFSTOPPED(status)
true, and
.I WSTOPSIG(status)
giving
.BR SIGTRAP .
If the
.B PTRACE_O_TRACESYSGOOD
option was set by the tracer, then
.I WSTOPSIG(status)
will give the value
.IR "(SIGTRAP\ |\ 0x80)" .
.LP
Syscall-stops can be distinguished from signal-delivery-stop with
.B SIGTRAP
by querying
.BR PTRACE_GETSIGINFO
for the following cases:
.TP
.IR si_code " <= 0"
.B SIGTRAP
was delivered as a result of a userspace action,
for example, a system call
.RB ( tgkill (2),
.BR kill (2),
.BR sigqueue (3),
etc.),
expiration of a POSIX timer,
change of state on a POSIX message queue,
or completion of an asynchronous I/O request.
.TP
.IR si_code " == SI_KERNEL (0x80)"
.B SIGTRAP
was sent by the kernel.
.TP
.IR si_code " == SIGTRAP or " si_code " == (SIGTRAP|0x80)"
This is a syscall-stop.
.LP
However, syscall-stops happen very often (twice per system call),
and performing
.B PTRACE_GETSIGINFO
for every syscall-stop may be somewhat expensive.
.LP
Some architectures allow the cases to be distinguished
by examining registers.
For example, on x86,
.I rax
==
.RB - ENOSYS
in syscall-enter-stop.
Since
.B SIGTRAP
(like any other signal) always happens
.I after
syscall-exit-stop,
and at this point
.I rax
almost never contains
.RB - ENOSYS ,
the
.B SIGTRAP
looks like "syscall-stop which is not syscall-enter-stop";
in other words, it looks like a
"stray syscall-exit-stop" and can be detected this way.
But such detection is fragile and is best avoided.
.LP
Using the
.B PTRACE_O_TRACESYSGOOD
.\"
.\" FIXME Below: "is the recommended method" for WHAT?
option is the recommended method,
since it is reliable and does not incur a performance penalty.
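.LP
With
.B PTRACE_O_TRACESYSGOOD
in effect, recognizing a syscall-stop needs only the
.BR waitpid (2)
status.
A minimal sketch, assuming the option has already been set on the tracee
(the function name
.IR is_syscall_stop ()
is hypothetical):
.nf
```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Returns nonzero if `status` reports a syscall-stop.
 * Assumption: PTRACE_O_TRACESYSGOOD was set on this tracee, so
 * syscall-stops are reported with signal number (SIGTRAP | 0x80).
 */
static int is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```
.fi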
.LP
Syscall-enter-stop and syscall-exit-stop are
indistinguishable from each other by the tracer.
The tracer needs to keep track of the sequence of
ptrace-stops in order to not misinterpret syscall-enter-stop as
syscall-exit-stop or vice versa.
The rule is that syscall-enter-stop is
always followed by syscall-exit-stop,
.B PTRACE_EVENT
stop or the tracee's death;
no other kinds of ptrace-stop can occur in between.
.LP
If after syscall-enter-stop,
the tracer uses a restarting command other than
.BR PTRACE_SYSCALL ,
syscall-exit-stop is not generated.
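.LP
One way a tracer might keep track of the enter/exit sequence is a
per-tracee flag that is toggled on every syscall-stop and cleared when
the tracee is restarted with anything other than
.BR PTRACE_SYSCALL .
The structure and function names below are hypothetical;
this is a sketch under those assumptions.
.nf
```c
#include <sys/ptrace.h>

/* Per-tracee bookkeeping kept by the tracer. */
struct tracee_state {
    int in_syscall;   /* nonzero between syscall-enter-stop and exit-stop */
};

/* Call on every syscall-stop. Returns 1 for enter-stop, 0 for exit-stop. */
static int note_syscall_stop(struct tracee_state *t)
{
    t->in_syscall = !t->in_syscall;
    return t->in_syscall;
}

/*
 * Call before every restart. If the tracee is restarted with a request
 * other than PTRACE_SYSCALL, no syscall-exit-stop will be generated,
 * so the next syscall-stop must be treated as an enter-stop again.
 */
static void note_restart(struct tracee_state *t, int request)
{
    if (request != PTRACE_SYSCALL)
        t->in_syscall = 0;
}
```
.fi
.LP
The flag is deliberately left untouched across
.B PTRACE_EVENT
stops, since a syscall-exit-stop can still follow them.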
.LP
.B PTRACE_GETSIGINFO
on syscall-stops returns
.B SIGTRAP
in
.IR si_signo ,
with
.I si_code
set to
.B SIGTRAP
or
.IR (SIGTRAP|0x80) .
.SS PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops
.\"
.\" FIXME The following TODO is unresolved
.\"       Do you want to add anything, or (less good) do we just
.\"       convert this into a comment in the source indicating
.\"       that these points still need to be documented?
.\"
(TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
PTRACE_SYSEMU_SINGLESTEP)
.SS Informational and restarting ptrace commands
Most ptrace commands (all except
.BR PTRACE_ATTACH ,
.BR PTRACE_TRACEME ,
and
.BR PTRACE_KILL )
require the tracee to be in a ptrace-stop, otherwise they fail with
.BR ESRCH .
.LP
When the tracee is in ptrace-stop,
the tracer can read and write data to
the tracee using informational commands.
These commands leave the tracee in ptrace-stopped state:
.LP
.nf
    ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
    ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
    ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
.fi
.LP
Note that some errors are not reported.
For example, setting signal information
.RI ( siginfo )
may have no effect in some ptrace-stops, yet the call may succeed
(return 0 and not set
.IR errno );
querying
.B PTRACE_GETEVENTMSG
may succeed and return some random value if current ptrace-stop
is not documented as returning a meaningful event message.
.LP
The call

    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
    
affects one tracee.
The tracee's current flags are replaced.
Flags are inherited by new tracees created and "auto-attached" via active
.BR PTRACE_O_TRACEFORK ,
.BR PTRACE_O_TRACEVFORK ,
or
.BR PTRACE_O_TRACECLONE
options.
.LP
Another group of commands makes the ptrace-stopped tracee run.
They have the form:
.LP
    ptrace(cmd, pid, 0, sig);
.LP
where
.I cmd
is
.BR PTRACE_CONT ,
.BR PTRACE_DETACH ,
.BR PTRACE_SYSCALL ,
.BR PTRACE_SINGLESTEP ,
.BR PTRACE_SYSEMU ,
or
.BR PTRACE_SYSEMU_SINGLESTEP .
If the tracee is in signal-delivery-stop,
.I sig
is the signal to be injected (if it is nonzero).
Otherwise,
.I sig
may be ignored.
(When restarting a tracee from a ptrace-stop other than signal-delivery-stop,
recommended practice is to always pass 0 in
.IR sig .)
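.LP
The choice of
.I sig
for a restarting command could be sketched as follows.
The function name
.IR signal_for_restart ()
is hypothetical, and the sketch assumes that
.B PTRACE_O_TRACESYSGOOD
is set and that the caller has already determined (for example, via
.BR PTRACE_GETSIGINFO )
whether the stop is a group-stop.
.nf
```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Returns the value to pass as `sig` when restarting a tracee that
 * reported `status`. Only a signal-delivery-stop yields a nonzero
 * signal; every other kind of ptrace-stop yields 0.
 */
static int signal_for_restart(int status, int group_stop)
{
    int sig;

    if (!WIFSTOPPED(status) || group_stop)
        return 0;
    sig = WSTOPSIG(status);
    if (sig == (SIGTRAP | 0x80))
        return 0;                       /* syscall-stop */
    if (sig == SIGTRAP && (status >> 16) != 0)
        return 0;                       /* PTRACE_EVENT stop */
    return sig;                         /* signal-delivery-stop: inject */
}
```
.fi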
.SS Attaching and detaching
A thread can be attached to the tracer using the call

    ptrace(PTRACE_ATTACH, pid, 0, 0);

This also sends
.B SIGSTOP
to this thread.
If the tracer wants this
.B SIGSTOP
to have no effect, it needs to suppress it.
Note that if other signals are concurrently sent to
this thread during attach,
the tracer may see the tracee enter signal-delivery-stop
with other signal(s) first!
The usual practice is to reinject these signals until
.B SIGSTOP
is seen, then suppress
.B SIGSTOP
injection.
The design bug here is that a ptrace attach and a concurrently delivered
.B SIGSTOP
may race and the concurrent
.B SIGSTOP
may be lost.
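.LP
The usual attach sequence described above might look roughly like this.
The names
.IR attach_stop_action ()
and
.IR attach_thread ()
are hypothetical; the sketch ignores
.B EINTR
from
.BR waitpid (2)
and does not address the race with a concurrent
.BR SIGSTOP .
.nf
```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Decide what to do with a stop seen while waiting for the attach SIGSTOP.
 * Returns 0 when SIGSTOP arrived (attach is complete; suppress it),
 * a positive signal number that should be reinjected, or -1 if the
 * tracee is no longer stopped (it exited or was killed).
 */
static int attach_stop_action(int status)
{
    if (!WIFSTOPPED(status))
        return -1;
    if (WSTOPSIG(status) == SIGSTOP)
        return 0;
    return WSTOPSIG(status);
}

/* Attach to thread `tid`; returns 0 on success, -1 on failure. */
int attach_thread(pid_t tid)
{
    int status, sig;

    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) == -1)
        return -1;
    for (;;) {
        if (waitpid(tid, &status, __WALL) == -1)
            return -1;
        sig = attach_stop_action(status);
        if (sig <= 0)
            return sig;
        /* Some other signal arrived first: reinject it and wait again. */
        if (ptrace(PTRACE_CONT, tid, NULL, (void *)(long)sig) == -1)
            return -1;
    }
}
```
.fi
.LP
On success the tracee is left in signal-delivery-stop for
.BR SIGSTOP ;
restarting it with a
.I sig
of 0 suppresses the signal.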
.\"
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"	   Do you want to add any text?
.\"
.\"      Describe how to attach to a thread which is already group-stopped.
.LP
Since attaching sends
.B SIGSTOP
and the tracer usually suppresses it, this may cause a stray
.B EINTR
return from the currently executing system call in the tracee,
as described in the "signal injection and suppression" section.
.LP
The request

    ptrace(PTRACE_TRACEME, 0, 0, 0);

turns the calling thread into a tracee.
The thread continues to run (doesn't enter ptrace-stop).
A common practice is to follow the
.B PTRACE_TRACEME
with

    raise(SIGSTOP);

and allow the parent (which is our tracer now) to observe our
signal-delivery-stop.
.LP
If the
.BR PTRACE_O_TRACEVFORK ,
.BR PTRACE_O_TRACEFORK ,
or
.BR PTRACE_O_TRACECLONE
options are in effect, then children created by, respectively,
.BR vfork (2)
or
.BR clone (2)
with the
.B CLONE_VFORK
flag,
.BR fork (2)
or
.BR clone (2)
with the exit signal set to
.BR SIGCHLD ,
and other kinds of
.BR clone (2),
are automatically attached to the same tracer which traced their parent.
.B SIGSTOP
is delivered to the children, causing them to enter
signal-delivery-stop after they exit the system call which created them.
.LP
Detaching of the tracee is performed by:

    ptrace(PTRACE_DETACH, pid, 0, sig);

.B PTRACE_DETACH
is a restarting operation;
therefore it requires the tracee to be in ptrace-stop.
If the tracee is in signal-delivery-stop, a signal can be injected.
Otherwise, the
.I sig
parameter may be silently ignored.
.LP
If the tracee is running when the tracer wants to detach it,
the usual solution is to send
.B SIGSTOP
(using
.BR tgkill (2),
to make sure it goes to the correct thread),
wait for the tracee to stop in signal-delivery-stop for
.B SIGSTOP
and then detach it (suppressing
.B SIGSTOP
injection).
A design bug is that this can race with concurrent
.BR SIGSTOP s.
Another complication is that the tracee may enter other ptrace-stops
and needs to be restarted and waited for again, until
.B SIGSTOP
is seen.
Yet another complication is to be sure that
the tracee is not already ptrace-stopped,
because no signal delivery happens while it is\(emnot even
.BR SIGSTOP .
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       Do you want to add anything?
.\"
.\"     Describe how to detach from a group-stopped tracee so that it
.\"     doesn't run, but continues to wait for SIGCONT.
.\"
.LP
If the tracer dies, all tracees are automatically detached and restarted,
unless they were in group-stop.
Handling of restart from group-stop is
.\" FIXME: Define currently
currently buggy, but the
.\" FIXME: Planned for when? And should applications be designed
.\" in some way so as to allow for this future change?
"as planned" behavior is to leave tracee stopped and waiting for
.BR SIGCONT .
If the tracee is restarted from signal-delivery-stop,
the pending signal is injected.
.SS execve(2) under ptrace
.\" clone(2) CLONE_THREAD says:
.\"     If  any  of the threads in a thread group performs an execve(2),
.\"     then all threads other than the thread group leader are terminated,
.\"     and the new program is executed in the thread group leader.
.\"
When one thread in a multithreaded process calls
.BR execve (2),
the kernel destroys all other threads in the process,
.\" In kernel 3.1 sources, see fs/exec.c::de_thread()
and resets the thread ID of the execing thread to the
thread group ID (process ID).
(Or, to put things another way, when a multithreaded process does an
.BR execve (2),
at completion of the call, it appears as though the
.BR execve (2)
occurred in the thread group leader, regardless of which thread did the
.BR execve (2).)
This resetting of the thread ID looks very confusing to tracers:
.IP * 3
All other threads stop in
.B PTRACE_EVENT_EXIT
stop,
if the
.BR PTRACE_O_TRACEEXIT
option was turned on.
Then all other threads except the thread group leader report
death as if they exited via
.BR _exit (2)
with exit code 0.
Then a
.B PTRACE_EVENT_EXEC
stop happens, if the
.BR PTRACE_O_TRACEEXEC
option was turned on.
.\" FIXME: mtk: the following comment seems to be unresolved?
.\"       (on which tracee - leader? execve-ing one?)
.\"
.\" FIXME: Please check: at various places in the following,
.\"        I have changed "pid" to "[the tracee's] thread ID"
.\"        Is that okay?
.IP *
The execing tracee changes its thread ID while it is in the
.BR execve (2).
(Remember, under ptrace, the "pid" returned from
.BR waitpid (2),
or fed into ptrace calls, is the tracee's thread ID.)
That is, the tracee's thread ID is reset to be the same as its process ID,
which is the same as the thread group leader's thread ID.
.IP *
If the thread group leader has reported its death by this time,
it appears to the tracer that
the dead thread leader "reappears from nowhere".
If the thread group leader was still alive,
for the tracer this may look as if the thread group leader
returns from a different system call than it entered,
or even "returned from a system call even though
it was not in any system call".
If the thread group leader was not traced
(or was traced by a different tracer), then during
.BR execve (2)
it will appear as if it has become a tracee of
the tracer of the execing tracee.
.LP
All of the above effects are the artifacts of
the thread ID change in the tracee.
.LP
The
.B PTRACE_O_TRACEEXEC
option is the recommended tool for dealing with this situation.
It enables
.B PTRACE_EVENT_EXEC
stop, which occurs before
.BR execve (2)
returns.
.\" FIXME Following on from the previous sentences,
.\"       can/should we add a few more words on how
.\"       PTRACE_EVENT_EXEC stop helps us deal with this situation?
.LP
The thread ID change happens before
.B PTRACE_EVENT_EXEC
stop, not after.
.LP
When the tracer receives
.B PTRACE_EVENT_EXEC
stop notification,
it is guaranteed that, except for this tracee and the thread group leader,
no other threads from the process are alive.
.LP
On receiving the
.B PTRACE_EVENT_EXEC
stop notification,
the tracer should clean up all its internal
data structures describing the threads of this process,
and retain only one data structure\(emone which
describes the single still running tracee, with

    thread ID == thread group ID == process id.
.LP
Currently, there is no way to retrieve the former
thread ID of the execing tracee.
If the tracer doesn't keep track of its tracees' thread group relations,
it may be unable to know which tracee execed and therefore no longer
exists under the old thread ID due to a thread ID change.
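.LP
The cleanup recommended above could be sketched like this,
assuming the tracer keeps a simple table of tracees that records
each thread's thread group ID.
The structure and the function
.IR collapse_threads_on_exec ()
are hypothetical and serve only as an illustration.
.nf
```c
#include <stddef.h>
#include <sys/types.h>

/* One tracer-side record per traced thread; tid == 0 marks a free slot. */
struct tracee {
    pid_t tid;    /* thread ID, as used with waitpid(2) and ptrace(2) */
    pid_t tgid;   /* thread group ID (process ID) */
};

/*
 * Call on PTRACE_EVENT_EXEC for process `tgid`. Keeps exactly one record
 * for the process, with tid == tgid, and frees all others.
 * Returns the number of records that were dropped.
 */
static size_t collapse_threads_on_exec(struct tracee *tab, size_t n,
                                       pid_t tgid)
{
    size_t i, dropped = 0;
    int kept = 0;

    for (i = 0; i < n; i++) {
        if (tab[i].tid == 0 || tab[i].tgid != tgid)
            continue;
        if (!kept) {
            tab[i].tid = tgid;   /* the surviving tracee now has this ID */
            kept = 1;
        } else {
            tab[i].tid = 0;
            tab[i].tgid = 0;
            dropped++;
        }
    }
    return dropped;
}
```
.fi
.LP
Recording the thread group ID for every tracee is what makes this
cleanup possible at all.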
.LP
Example: two threads call
.BR execve (2)
at the same time:
.LP
.nf
*** we get syscall-enter-stop in thread 1: ***
PID1 execve("/bin/foo", "foo" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 1 ***
*** we get syscall-enter-stop in thread 2: ***
PID2 execve("/bin/bar", "bar" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 2 ***
*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL ***
*** we get syscall-exit-stop for PID0: ***
PID0 <... execve resumed> )             = 0
.fi
.LP
In this situation, there is no way to know which
.BR execve (2)
succeeded.
.LP
If the
.B PTRACE_O_TRACEEXEC
option is
.I not
in effect for the execing tracee, the kernel delivers an extra
.B SIGTRAP
to the tracee after
.BR execve (2)
returns.
This is an ordinary signal (similar to one which can be
generated by
.IR "kill -TRAP" ),
not a special kind of ptrace-stop.
Employing
.B PTRACE_GETSIGINFO
for this signal returns
.I si_code
set to 0
.RI ( SI_USER ).
This signal may be blocked by signal mask,
and thus may be delivered (much) later.
.LP
Usually, the tracer (for example,
.BR strace (1))
would not want to show this extra post-execve
.B SIGTRAP
signal to the user, and would suppress its delivery to the tracee (if
.B SIGTRAP
is set to
.BR SIG_DFL ,
it is a killing signal).
However, determining 
.I which
.B SIGTRAP
to suppress is not easy.
Setting the
.B PTRACE_O_TRACEEXEC
option and thus suppressing this extra
.B SIGTRAP
is the recommended approach.
.SS Real parent
The ptrace API (ab)uses the standard UNIX parent/child signaling over
.BR waitpid (2).
This used to cause the real parent of the process to stop receiving
several kinds of
.BR waitpid (2)
notifications when the child process is traced by some other process.
.LP
Many of these bugs have been fixed, but as of Linux 2.6.38 several still
exist; see BUGS below.
.LP
As of Linux 2.6.38, the following is believed to work correctly:
.IP * 3
exit/death by signal is reported first to the tracer, then,
when the tracer consumes the
.BR waitpid (2)
result, to the real parent (to the real parent only when the
whole multithreaded process exits).
.\"
.\" FIXME mtk: Please check: In the next line, 
.\" I changed "they" to "the tracer and the real parent". Okay?
If the tracer and the real parent are the same process,
the report is sent only once.
.SH "RETURN VALUE"
On success,
.B PTRACE_PEEK*
requests return the requested data,
while other requests return zero.
On error, all requests return \-1, and
.I errno
is set appropriately.
Since the value returned by a successful
.B PTRACE_PEEK*
request may be \-1, the caller must clear
.I errno
before the call, and then check it afterward
to determine whether or not an error occurred.
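.LP
The
.I errno
convention for
.B PTRACE_PEEK*
requests can be wrapped in a small helper.
The name
.IR peek_word ()
is hypothetical; this is a minimal sketch of the convention described above.
.nf
```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/*
 * Reads one word from the tracee's memory at `addr`.
 * Returns 0 and stores the word in *out on success;
 * returns -1 with errno set on failure.
 */
static int peek_word(pid_t pid, void *addr, long *out)
{
    long val;

    errno = 0;                          /* -1 may be a valid data word */
    val = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (val == -1 && errno != 0)
        return -1;
    *out = val;
    return 0;
}
```
.fi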
.SH ERRORS
.TP
.B EBUSY
(i386 only) There was an error with allocating or freeing a debug register.
.TP
.B EFAULT
There was an attempt to read from or write to an invalid area in
the tracer's or the tracee's memory,
probably because the area wasn't mapped or accessible.
Unfortunately, under Linux, different variations of this fault
will return
.B EIO
or
.B EFAULT
more or less arbitrarily.
.TP
.B EINVAL
An attempt was made to set an invalid option.
.TP
.B EIO
.I request
is invalid, or an attempt was made to read from or
write to an invalid area in the tracer's or the tracee's memory,
or there was a word-alignment violation,
or an invalid signal was specified during a restart request.
.TP
.B EPERM
The specified process cannot be traced.
This could be because the
tracer has insufficient privileges (the required capability is
.BR CAP_SYS_PTRACE );
unprivileged processes cannot trace processes that they
cannot send signals to or those running
set-user-ID/set-group-ID programs, for obvious reasons.
.\" 
.\" FIXME I reworked the discussion of init below to note
.\" the kernel version (2.6.26) when the behavior changed for
.\" tracing init(8). Okay?
Alternatively, the process may already be being traced,
or (on kernels before 2.6.26) be
.BR init (8)
(PID 1).
.TP
.B ESRCH
The specified process does not exist, or is not currently being traced
by the caller, or is not stopped
(for requests that require a stopped tracee).
.SH "CONFORMING TO"
SVr4, 4.3BSD.
.SH NOTES
Although arguments to
.BR ptrace ()
are interpreted according to the prototype given,
glibc currently declares
.BR ptrace ()
as a variadic function with only the
.I request
argument fixed.
This means that unneeded trailing arguments may be omitted,
though doing so makes use of undocumented
.BR gcc (1)
behavior.
.\" FIXME Please review. I reinstated the following, noting the
.\" kernel version number where it ceased to be true
.LP
In Linux kernels before 2.6.26,
.\" See commit 00cd5c37afd5f431ac186dd131705048c0a11fdb
.BR init (8),
the process with PID 1, may not be traced.
.LP
The layout of the contents of memory and the USER area are
quite operating-system- and architecture-specific.
The offset supplied, and the data returned,
might not entirely match with the definition of
.IR "struct user" .
.\" See http://lkml.org/lkml/2008/5/8/375
.LP
The size of a "word" is determined by the operating-system variant
(e.g., for 32-bit Linux it is 32 bits, etc.).
.\" FIXME So, can we just remove the following text (rather than
.\" just commenting it out)?
.\"
.\" Covered in more details above: (removed by dv)
.\" .LP
.\" Tracing causes a few subtle differences in the semantics of
.\" traced processes.
.\" For example, if a process is attached to with
.\" .BR PTRACE_ATTACH ,
.\" its original parent can no longer receive notification via
.\" .BR waitpid (2)
.\" when it stops, and there is no way for the new parent to
.\" effectively simulate this notification.
.\" .LP
.\" When the parent receives an event with
.\" .B PTRACE_EVENT_*
.\" set,
.\" the tracee is not in the normal signal delivery path.
.\" This means the parent cannot do
.\" .BR ptrace (PTRACE_CONT)
.\" with a signal or
.\" .BR ptrace (PTRACE_KILL).
.\" .BR kill (2)
.\" with a
.\" .B SIGKILL
.\" signal can be used instead to kill the tracee
.\" after receiving one of these messages.
.\" .LP
This page documents the way the
.BR ptrace ()
call works currently in Linux.
Its behavior differs noticeably on other flavors of UNIX.
In any case, use of
.BR ptrace ()
is highly specific to the operating system and architecture.
.SH BUGS
On hosts with 2.6 kernel headers,
.B PTRACE_SETOPTIONS
is declared with a different value than the one for 2.4.
This leads to applications compiled with 2.6 kernel
headers failing when run on 2.4 kernels.
This can be worked around by redefining
.B PTRACE_SETOPTIONS
to
.BR PTRACE_OLDSETOPTIONS ,
if that is defined.
.LP
Group-stop notifications are sent to the tracer, but not to real parent.
Last confirmed on 2.6.38.6.
.LP
If a thread group leader is traced and exits by calling
.BR _exit (2),
.\" Note from Denys Vlasenko:
.\"     Here "exits" means any kind of death - _exit, exit_group,
.\"     signal death. Signal death and exit_group cases are trivial,
.\"     though: since signal death and exit_group kill all other threads
.\"     too, "until all other threads exit" thing happens rather soon
.\"     in these cases. Therefore, only _exit presents observably
.\"     puzzling behavior to ptrace users: thread leader _exit's,
.\"     but WIFEXITED isn't reported! We are trying to explain here
.\"     why it is so.
a
.B PTRACE_EVENT_EXIT
stop will happen for it (if requested), but the subsequent
.B WIFEXITED
notification will not be delivered until all other threads exit.
As explained above, if one of the other threads calls
.BR execve (2),
the death of the thread group leader will
.I never
be reported.
If the execed thread is not traced by this tracer,
the tracer will never know that
.BR execve (2)
happened.
One possible workaround is to
.B PTRACE_DETACH
the thread group leader instead of restarting it in this case.
Last confirmed on 2.6.38.6.
.\"        ^^^ need to test/verify this scenario
.\" FIXME: mtk: the preceding comment seems to be unresolved?
.\"        Do you want to add anything?
.LP
A
.B SIGKILL
signal may still cause a
.B PTRACE_EVENT_EXIT
stop before actual signal death.
This may be changed in the future;
.B SIGKILL
is meant to always immediately kill tasks even under ptrace.
Last confirmed on 2.6.38.6.
.SH "SEE ALSO"
.BR gdb (1),
.BR strace (1),
.BR clone (2),
.BR execve (2),
.BR fork (2),
.BR gettid (2),
.BR sigaction (2),
.BR tgkill (2),
.BR vfork (2),
.BR waitpid (2),
.BR exec (3),
.BR capabilities (7),
.BR signal (7)

* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2012-02-26 18:42           ` Michael Kerrisk
@ 2012-02-27  0:58             ` Denys Vlasenko
  2012-03-05 17:33               ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 18+ messages in thread
From: Denys Vlasenko @ 2012-02-27  0:58 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Blaisorblade, Daniel Jacobowitz

On Sunday 26 February 2012 19:42, Michael Kerrisk wrote:
> Hello Denys,
> 
> Below is another iteration of the ptrace.2 page with your new
> material. Could you please take a look at the page in general, and the
> FIXMEs in particular? (I'd like to get specific input from you on all
> of the FIXMEs, if possible.)
> 
> Thanks,
> 
> Michael

...
...

> As for
> .BR PTRACE_PEEKUSER ,
> the offset must typically be word-aligned.
> In order to maintain the integrity of the kernel,
> some modifications to the USER area are disallowed.
> .\" FIXME In the preceding sentence, which modifications are disallowed,
> .\" and when they are disallowed, how does userspace discover that fact?
...
> As for
> .BR PTRACE_POKEUSER ,
> some general purpose register modifications may be disallowed.
> .\" FIXME In the preceding sentence, which modifications are disallowed,
> .\" and when they are disallowed, how does userspace discover that fact?

I don't know the answer to this question.


> Use of the
> .B WNOHANG
> flag may cause
> .BR waitpid (2)
> to return 0 ("no wait results available yet")
> even if the tracer knows there should be a notification.
> Example:
> .nf
> 
>     kill(tracee, SIGKILL);
>     waitpid(tracee, &status, __WALL | WNOHANG);
> .fi
> .\" FIXME: mtk: the following comment seems to be unresolved?
> .\"        Do you want to add anything?
> .\"
> .\"     waitid usage? WNOWAIT?
> .\"     describe how wait notifications queue (or not queue)

I did not experiment with waitid and WNOWAIT flag yet.


> .LP
> The following kinds of ptrace-stops exist: signal-delivery-stops,
> group-stop, PTRACE_EVENT stops, syscall-stops
> .\"
> .\" FIXME: mtk: the following text ("[, PTRACE_SINGLESTEP...") is incomplete.
> .\"        Do you want to add anything?
> .\"
> [, PTRACE_SINGLESTEP, PTRACE_SYSEMU,
> PTRACE_SYSEMU_SINGLESTEP].

I am not familiar enough with these ptrace commands, can't add anything useful.
You can just remove the [...] part for now.


> As of kernel 2.6.38,
> after the tracer sees the tracee ptrace-stop and until it
> restarts or kills it, the tracee will not run,
> and will not send notifications (except
> .B SIGKILL
> death) to the tracer, even if the tracer enters into another
> .BR waitpid (2)
> call.
> .LP
> .\" FIXME It is unclear what "this kernel behavior" refers to.
> .\" Can show me exactly which piece of text above or below is
> .\" referred to when you say "this kernel behavior"?
> Currently, this kernel behavior
> causes a problem with transparent handling of stopping signals:
> if the tracer restarts the tracee after group-stop,
> the stopping signal
> is effectively ignored\(emthe tracee doesn't remain stopped, it runs.
> If the tracer doesn't restart the tracee before entering into the next
> .BR waitpid (2),
> future
> .B SIGCONT
> signals will not be reported to the tracer.
> This would cause
> .B SIGCONT
> to have no effect.

You seem to be asking this question repeatedly. I tried to give you
the answer several times. I don't know what is unclear here.

Ok, I will try to explain it yet again.

Let's say a tracee receives stopping signal and stops.
Tracer sees this stop via waitpid() status.
It determines that it is a group-stop.

After this, tracer has two options: (1) execute ptrace(PTRACE_CONT)
on the tracee before going back to waitpid'ing, or (2) don't
do ptrace(PTRACE_CONT), and go back to waitpid'ing.

Both options are bad: in option (1), tracee will start running,
so the stop signal will not have its intended effect.
In option (2), tracee will be stopped FOREVER - SIGCONT won't be able
to start it again.

> Currently, this kernel behavior
> causes a problem with transparent handling of stopping signals:
> if the tracer restarts the tracee after group-stop,
> the stopping signal
> is effectively ignored

I am not a native English speaker. Please rephrase
this text fragment so that it sounds understandable to you.
I would agree to any version of it by now.


> But such detection is fragile and is best avoided.
> .LP
> Using the
> .B PTRACE_O_TRACESYSGOOD
> .\"
> .\" FIXME Below: "is the recommended method" for WHAT?
> option is the recommended method,
> since it is reliable and does not incur a performance penalty.

It is the recommended method to distinguish syscall-stops
from other kinds of ptrace-stops.


> If after syscall-enter-stop,
> the tracer uses a restarting command other than
> .BR PTRACE_SYSCALL ,
> syscall-exit-stop is not generated.
> .LP
> .B PTRACE_GETSIGINFO
> on syscall-stops returns
> .B SIGTRAP
> in
> .IR si_signo ,
> with
> .I si_code
> set to
> .B SIGTRAP
> or
> .IR (SIGTRAP|0x80) .
> .SS PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops
> .\"
> .\" FIXME The following TODO is unresolved
> .\"       Do you want to add anything, or (less good) do we just
> .\"       convert this into a comment in the source indicating
> .\"       that these points still need to be documented?
> .\"
> (TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
> PTRACE_SYSEMU_SINGLESTEP)

I am not familiar enough with these ptrace commands, can't add anything useful.
You can just remove the (...) part for now.


> The design bug here is that a ptrace attach and a concurrently delivered
> .B SIGSTOP
> may race and the concurrent
> .B SIGSTOP
> may be lost.
> .\"
> .\" FIXME: mtk: the following comment seems to be unresolved?
> .\"	   Do you want to add any text?
> .\"
> .\"      Describe how to attach to a thread which is already group-stopped.

No, I don't have anything useful to add here right now.


> Another complication is that the tracee may enter other ptrace-stops
> and needs to be restarted and waited for again, until
> .B SIGSTOP
> is seen.
> Yet another complication is to be sure that
> the tracee is not already ptrace-stopped,
> because no signal delivery happens while it is\(emnot even
> .BR SIGSTOP .
> .\" FIXME: mtk: the following comment seems to be unresolved?
> .\"       Do you want to add anything?
> .\"
> .\"     Describe how to detach from a group-stopped tracee so that it
> .\"     doesn't run, but continues to wait for SIGCONT.

No, I don't have anything useful to add here right now.


> If the tracer dies, all tracees are automatically detached and restarted,
> unless they were in group-stop.
> Handling of restart from group-stop is
> .\" FIXME: Define currently
> currently buggy, but the
> .\" FIXME: Planned for when? And should applications be designed
> .\" in some way so as to allow for this future change?
> "as planned" behavior is to leave tracee stopped and waiting for
> .BR SIGCONT .

It means that current kernels are known to have bugs in this area:
if tracer exits, group-stopped tracees may start running.


> Then a
> .B PTRACE_EVENT_EXEC
> stop happens, if the
> .BR PTRACE_O_TRACEEXEC
> option was turned on.
> .\" FIXME: mtk: the following comment seems to be unresolved?
> .\"       (on which tracee - leader? execve-ing one?)

At this point, pid change has already occurred.
Currently, rendered manpage looks like this:

*  All   other   threads   stop   in  PTRACE_EVENT_EXIT  stop,  if  the
   PTRACE_O_TRACEEXIT option was turned on.   Then  all  other  threads
   except  the  thread  group leader report death as if they exited via
   _exit(2) with exit code 0.  Then a PTRACE_EVENT_EXEC  stop  happens,
   if the PTRACE_O_TRACEEXEC option was turned on.

*  The  execing  tracee  changes  its  thread  ID  while  it  is in the
   execve(2).  (Remember, under ptrace, the "pid" returned  from  wait-
   pid(2),  or fed into ptrace calls, is the tracee's thread ID.)  That
   is, the tracee's thread ID is reset to be the same  as  its  process
   ID, which is the same as the thread group leader's thread ID.

*  If  the  thread group leader has reported its death by this time...


I suggest creating a new bullet point after the second one,
and moving "Then a PTRACE_EVENT_EXEC stop happens, if the
PTRACE_O_TRACEEXEC option was turned on" text into it.

This will clearly indicate that by this time, pid has changed.

There is a bit of text below:

> The thread ID change happens before
> .B PTRACE_EVENT_EXEC
> stop, not after.

which will be made redundant by the above change and can be deleted.



> .\" FIXME: Please check: at various places in the following,
> .\"        I have changed "pid" to "[the tracee's] thead ID"
> .\"        Is that okay?
> .IP *
> The execing tracee changes its thread ID while it is in the
> .BR execve (2).
> (Remember, under ptrace, the "pid" returned from
> .BR waitpid (2),
> or fed into ptrace calls, is the tracee's thread ID.)
> That is, the tracee's thread ID is reset to be the same as its process ID,
> which is the same as the thread group leader's thread ID.

Yes, the text looks ok to me.


> The
> .B PTRACE_O_TRACEEXEC
> option is the recommended tool for dealing with this situation.
> It enables
> .B PTRACE_EVENT_EXEC
> stop, which occurs before
> .BR execve (2)
> returns.
> .\" FIXME Following on from the previous sentences,
> .\"       can/should we add a few more words on how
> .\"       PTRACE_EVENT_EXEC stop helps us deal with this situation?
> .LP

I propose the following text:

The PTRACE_O_TRACEEXEC option is the recommended tool for dealing with
this situation. First, it enables PTRACE_EVENT_EXEC stop, which occurs
before execve(2) returns. In this stop, the tracer can use a
ptrace(PTRACE_GETEVENTMSG) call to retrieve the tracee's former thread ID.
(This feature was introduced in Linux 3.0.)
Second, the PTRACE_O_TRACEEXEC option disables legacy SIGTRAP generation
on execve.
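
For illustration only, here is a minimal sketch of how a tracer might use
this stop. The helper names is_exec_event_stop() and former_tid_at_exec()
are hypothetical and not part of the proposed man page text; the sketch
assumes the tracee was set up with PTRACE_O_TRACEEXEC and that `status`
came from waitpid(2).

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Nonzero if `status` (from waitpid) reports a PTRACE_EVENT_EXEC stop. */
int is_exec_event_stop(int status)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8));
}

/*
 * Hypothetical helper: in a PTRACE_EVENT_EXEC stop, store the tracee's
 * pre-exec thread ID in *former_tid and return 0. Returns -1 if `status`
 * is not an exec stop or if the ptrace call fails.
 * Assumes Linux 3.0 or later, where the event message carries that ID.
 */
int former_tid_at_exec(pid_t tracee, int status, pid_t *former_tid)
{
    unsigned long msg = 0;

    if (!is_exec_event_stop(status))
        return -1;
    if (ptrace(PTRACE_GETEVENTMSG, tracee, NULL, &msg) == -1)
        return -1;
    *former_tid = (pid_t)msg;
    return 0;
}
```

A tracer that keeps per-thread state keyed by thread ID could use the
returned value to find the old entry and re-key it to the new thread ID.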



> As of Linux 2.6.38, the following is believed to work correctly:
> .IP * 3
> exit/death by signal is reported first to the tracer, then,
> when the tracer consumes the
> .BR waitpid (2)
> result, to the real parent (to the real parent only when the
> whole multithreaded process exits).
> .\"
> .\" FIXME mtk: Please check: In the next line,
> .\" I changed "they" to "the tracer and the real parent". Okay?
> If the tracer and the real parent are the same process,
> the report is sent only once.

Yes, this change is ok.


> .B EPERM
> The specified process cannot be traced.
> This could be because the
> tracer has insufficient privileges (the required capability is
> .BR CAP_SYS_PTRACE );
> unprivileged processes cannot trace processes that they
> cannot send signals to or those running
> set-user-ID/set-group-ID programs, for obvious reasons.
> .\"
> .\" FIXME I reworked the discussion of init below to note
> .\" the kernel version (2.6.26) when the behavior changed for
> .\" tracing init(8). Okay?
> Alternatively, the process may already be being traced,
> or (on kernels before 2.6.26) be
> .BR init (8)
> (PID 1).

Yes, this change is ok.


> glibc currently declares
> .BR ptrace ()
> as a variadic function with only the
> .I request
> argument fixed.
> This means that unneeded trailing arguments may be omitted,
> though doing so makes use of undocumented
> .BR gcc (1)
> behavior.
> .\" FIXME Please review. I reinstated the following, noting the
> .\" kernel version number where it ceased to be true
> .LP
> In Linux kernels before 2.6.26,
> .\" See commit 00cd5c37afd5f431ac186dd131705048c0a11fdb
> .BR init (8),
> the process with PID 1, may not be traced.

Yes, this change is ok.


> .\" FIXME So, can we just remove the following text (rather than
> .\" just commenting it out)?
> .\"
> .\" Covered in more details above: (removed by dv)
> .\" .LP
> .\" Tracing causes a few subtle differences in the semantics of
> .\" traced processes.
> .\" For example, if a process is attached to with
> .\" .BR PTRACE_ATTACH ,
> .\" its original parent can no longer receive notification via
> .\" .BR waitpid (2)
> .\" when it stops, and there is no way for the new parent to
> .\" effectively simulate this notification.
> .\" .LP
> .\" When the parent receives an event with
> .\" .B PTRACE_EVENT_*
> .\" set,
> .\" the tracee is not in the normal signal delivery path.
> .\" This means the parent cannot do
> .\" .BR ptrace (PTRACE_CONT)
> .\" with a signal or
> .\" .BR ptrace (PTRACE_KILL).
> .\" .BR kill (2)
> .\" with a
> .\" .B SIGKILL
> .\" signal can be used instead to kill the tracee
> .\" after receiving one of these messages.
> .\" .LP

Yes, let's remove this comment.


> If a thread group leader is traced and exits by calling
> .BR _exit (2),
> .\" Note from Denys Vlasenko:
> .\"     Here "exits" means any kind of death - _exit, exit_group,
> .\"     signal death. Signal death and exit_group cases are trivial,
> .\"     though: since signal death and exit_group kill all other threads
> .\"     too, "until all other threads exit" thing happens rather soon
> .\"     in these cases. Therefore, only _exit presents observably
> .\"     puzzling behavior to ptrace users: thread leader _exit's,
> .\"     but WIFEXITED isn't reported! We are trying to explain here
> .\"     why it is so.
> a
> .B PTRACE_EVENT_EXIT
> stop will happen for it (if requested), but the subsequent
> .B WIFEXITED
> notification will not be delivered until all other threads exit.
> As explained above, if one of the other threads calls
> .BR execve (2),
> the death of the thread group leader will
> .I never
> be reported.
> If the execed thread is not traced by this tracer,
> the tracer will never know that
> .BR execve (2)
> happened.
> One possible workaround is to
> .B PTRACE_DETACH
> the thread group leader instead of restarting it in this case.
> Last confirmed on 2.6.38.6.
> .\"        ^^^ need to test/verify this scenario
> .\" FIXME: mtk: the preceding comment seems to be unresolved?
> .\"        Do you want to add anything?

No, I don't have anything useful to add here right now.


-- 
vda


* Re: [PATCH] man ptrace: add extended description of various ptrace quirks
  2012-02-27  0:58             ` Denys Vlasenko
@ 2012-03-05 17:33               ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2012-03-05 17:33 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Jan Kratochvil, linux-kernel, Tejun Heo,
	linux-man, Heiko Carstens, Blaisorblade, Daniel Jacobowitz

Hi Denys,

On Mon, Feb 27, 2012 at 1:58 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Sunday 26 February 2012 19:42, Michael Kerrisk wrote:
>> Hello Denys,
>>
>> Below is another iteration of the ptrace.2 page with your new
>> material. Could you please take a look at the page in general, and the
>> FIXMEs in particular? (I'd like to get specific input from you on all
>> of the FIXMEs, if possible.)
>>
>> Thanks,
>>
>> Michael
>
> ...
> ...
>
>> As for
>> .BR PTRACE_PEEKUSER ,
>> the offset must typically be word-aligned.
>> In order to maintain the integrity of the kernel,
>> some modifications to the USER area are disallowed.
>> .\" FIXME In the preceding sentence, which modifications are disallowed,
>> .\" and when they are disallowed, how does userspace discover that fact?
> ...
>> As for
>> .BR PTRACE_POKEUSER ,
>> some general purpose register modifications may be disallowed.
>> .\" FIXME In the preceding sentence, which modifications are disallowed,
>> .\" and when they are disallowed, how does userspace discover that fact?
>
> I don't know the answer to this question.

Okay -- I'll just leave the FIXME there for future reference.

>> Use of the
>> .B WNOHANG
>> flag may cause
>> .BR waitpid (2)
>> to return 0 ("no wait results available yet")
>> even if the tracer knows there should be a notification.
>> Example:
>> .nf
>>
>>     kill(tracee, SIGKILL);
>>     waitpid(tracee, &status, __WALL | WNOHANG);
>> .fi
>> .\" FIXME: mtk: the following comment seems to be unresolved?
>> .\"        Do you want to add anything?
>> .\"
>> .\"     waitid usage? WNOWAIT?
>> .\"     describe how wait notifications queue (or not queue)
>
> I did not experiment with waitid and WNOWAIT flag yet.

Okay -- I'll just leave the FIXME there for future reference.
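
As a rough illustration of the WNOHANG caveat quoted above, the sketch
below waits without WNOHANG after sending SIGKILL, so that a death
notification that is not yet available is waited for rather than missed.
The helpers kill_and_reap() and demo_kill_and_reap() are hypothetical
names, and the demo uses a plain (untraced) child purely to stay
self-contained; this is one possible approach, not a statement of what
the man page recommends.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: kill `pid` and block until its death is reported.
 * Returns the wait status, or -1 on error.
 */
int kill_and_reap(pid_t pid)
{
    int status;

    if (kill(pid, SIGKILL) == -1)
        return -1;
    for (;;) {
        /* No WNOHANG: block until a notification is actually available. */
        if (waitpid(pid, &status, __WALL) == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* Skip any stop notifications; only a death ends the loop. */
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return status;
    }
}

/* Self-contained demo: fork a child that only sleeps, then kill and reap it. */
int demo_kill_and_reap(void)
{
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();
    }
    return kill_and_reap(pid);
}
```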

>> .LP
>> The following kinds of ptrace-stops exist: signal-delivery-stops,
>> group-stop, PTRACE_EVENT stops, syscall-stops
>> .\"
>> .\" FIXME: mtk: the following text ("[, PTRACE_SINGLESTEP...") is incomplete.
>> .\"        Do you want to add anything?
>> .\"
>> [, PTRACE_SINGLESTEP, PTRACE_SYSEMU,
>> PTRACE_SYSEMU_SINGLESTEP].
>
> I am not familiar enough with these ptrace commands, can't add anything useful.
> You can just remove the [...] part for now.

Actually, I think I'll leave it in. See below.

>> As of kernel 2.6.38,
>> after the tracer sees the tracee ptrace-stop and until it
>> restarts or kills it, the tracee will not run,
>> and will not send notifications (except
>> .B SIGKILL
>> death) to the tracer, even if the tracer enters into another
>> .BR waitpid (2)
>> call.
>> .LP
>> .\" FIXME It is unclear what "this kernel behavior" refers to.
>> .\" Can you show me exactly which piece of text above or below is
>> .\" referred to when you say "this kernel behavior"?
>> Currently, this kernel behavior
>> causes a problem with transparent handling of stopping signals:
>> if the tracer restarts the tracee after group-stop,
>> the stopping signal
>> is effectively ignored\(emthe tracee doesn't remain stopped, it runs.
>> If the tracer doesn't restart the tracee before entering into the next
>> .BR waitpid (2),
>> future
>> .B SIGCONT
>> signals will not be reported to the tracer.
>> This would cause
>> .B SIGCONT
>> to have no effect.
>
> You seem to be asking this question repeatedly. I tried to give you
> the answer several times. I don't know what is unclear here.
>
> Ok, I will try to explain it yet again.
>
> Let's say a tracee receives stopping signal and stops.
> Tracer sees this stop via waitpid() status.
> It determines that it is a group-stop.
>
> After this, the tracer has two options: (1) execute ptrace(PTRACE_CONT)
> on the tracee before going back to waitpid'ing, or (2) don't
> do ptrace(PTRACE_CONT), and go back to waitpid'ing.
>
> Both options are bad: in option (1), the tracee will start running,
> so the stopping signal does not have its intended effect.
> In option (2), the tracee will be stopped FOREVER: SIGCONT won't be able
> to start it again.

Okay -- as discussed in a chat. I think the main point to bring out
here is that "This kernel behavior" means "The kernel behavior
described in the previous paragraph". I'll reword to make that clear.


>> Currently, this kernel behavior
>> causes a problem with transparent handling of stopping signals:
>> if the tracer restarts the tracee after group-stop,
>> the stopping signal
>> is effectively ignored
>
> I am not a native English speaker. Please rephrase
> this text fragment so that it sounds understandable to you.
> I would agree to any version of it by now.

Done.

>> But such detection is fragile and is best avoided.
>> .LP
>> Using the
>> .B PTRACE_O_TRACESYSGOOD
>> .\"
>> .\" FIXME Below: "is the recommended method" for WHAT?
>> option is the recommended method,
>> since it is reliable and does not incur a performance penalty.
>
> It is the recommended method to distinguish syscall-stops
> from other kinds of ptrace-stops.

Okay -- I added those words.
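
A minimal sketch of that distinction, assuming the tracer has set
PTRACE_O_TRACESYSGOOD so that syscall-stops are reported with signal
number (SIGTRAP | 0x80). The enum and the function classify_stop() are
hypothetical names for illustration; the EVENT_STOP test relies on the
event number being stored above the low 16 bits of the status word.

```c
#include <signal.h>
#include <sys/wait.h>

enum stop_kind { NOT_STOPPED, SYSCALL_STOP, EVENT_STOP, OTHER_STOP };

/* Classify a waitpid status, assuming PTRACE_O_TRACESYSGOOD is in effect. */
enum stop_kind classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return NOT_STOPPED;
    /* With TRACESYSGOOD, bit 7 of the signal number marks a syscall-stop. */
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return SYSCALL_STOP;
    /* PTRACE_EVENT stops: SIGTRAP plus a nonzero event number above bit 15. */
    if (WSTOPSIG(status) == SIGTRAP && (status >> 16) != 0)
        return EVENT_STOP;
    /* Signal-delivery-stops and group-stops are not told apart here. */
    return OTHER_STOP;
}
```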

>> If after syscall-enter-stop,
>> the tracer uses a restarting command other than
>> .BR PTRACE_SYSCALL ,
>> syscall-exit-stop is not generated.
>> .LP
>> .B PTRACE_GETSIGINFO
>> on syscall-stops returns
>> .B SIGTRAP
>> in
>> .IR si_signo ,
>> with
>> .I si_code
>> set to
>> .B SIGTRAP
>> or
>> .IR (SIGTRAP|0x80) .
>> .SS PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops
>> .\"
>> .\" FIXME The following TODO is unresolved
>> .\"       Do you want to add anything, or (less good) do we just
>> .\"       convert this into a comment in the source indicating
>> .\"       that these points still need to be documented?
>> .\"
>> (TODO: document stops occurring with PTRACE_SINGLESTEP, PTRACE_SYSEMU,
>> PTRACE_SYSEMU_SINGLESTEP)
>
> I am not familiar enough with these ptrace commands, can't add anything useful.
> You can just remove the (...) part for now.

In fact, I think I'll leave a piece of text here in the man page to
note that these stops exist, but are not yet documented.


>> The design bug here is that a ptrace attach and a concurrently delivered
>> .B SIGSTOP
>> may race and the concurrent
>> .B SIGSTOP
>> may be lost.
>> .\"
>> .\" FIXME: mtk: the following comment seems to be unresolved?
>> .\"      Do you want to add any text?
>> .\"
>> .\"      Describe how to attach to a thread which is already group-stopped.
>
> No, I don't have anything useful to add here right now.

Okay -- I'll just leave the FIXME there for future reference.

>> Another complication is that the tracee may enter other ptrace-stops
>> and needs to be restarted and waited for again, until
>> .B SIGSTOP
>> is seen.
>> Yet another complication is to be sure that
>> the tracee is not already ptrace-stopped,
>> because no signal delivery happens while it is\(emnot even
>> .BR SIGSTOP .
>> .\" FIXME: mtk: the following comment seems to be unresolved?
>> .\"       Do you want to add anything?
>> .\"
>> .\"     Describe how to detach from a group-stopped tracee so that it
>> .\"     doesn't run, but continues to wait for SIGCONT.
>
> No, I don't have anything useful to add here right now.

Okay -- I'll just leave the FIXME there for future reference.

>> If the tracer dies, all tracees are automatically detached and restarted,
>> unless they were in group-stop.
>> Handling of restart from group-stop is
>> .\" FIXME: Define currently
>> currently buggy, but the
>> .\" FIXME: Planned for when? And should applications be designed
>> .\" in some way so as to allow for this future change?
>> "as planned" behavior is to leave tracee stopped and waiting for
>> .BR SIGCONT .
>
> It means that current kernels are known to have bugs in this area:
> if tracer exits, group-stopped tracees may start running.

Okay.

>> Then a
>> .B PTRACE_EVENT_EXEC
>> stop happens, if the
>> .BR PTRACE_O_TRACEEXEC
>> option was turned on.
>> .\" FIXME: mtk: the following comment seems to be unresolved?
>> .\"       (on which tracee - leader? execve-ing one?)
>
> At this point, pid change has already occurred.
> Currently, rendered manpage looks like this:
>
> *  All other threads stop in PTRACE_EVENT_EXIT stop, if the
>   PTRACE_O_TRACEEXIT option was turned on.  Then all other threads
>   except the thread group leader report death as if they exited via
>   _exit(2) with exit code 0.  Then a PTRACE_EVENT_EXEC stop happens,
>   if the PTRACE_O_TRACEEXEC option was turned on.
>
> *  The execing tracee changes its thread ID while it is in the
>   execve(2).  (Remember, under ptrace, the "pid" returned from
>   waitpid(2), or fed into ptrace calls, is the tracee's thread ID.)  That
>   is, the tracee's thread ID is reset to be the same as its process
>   ID, which is the same as the thread group leader's thread ID.
>
> *  If  the  thread group leader has reported its death by this time...
>
>
> I suggest creating a new bullet point after the second one,
> and moving "Then a PTRACE_EVENT_EXEC stop happens, if the
> PTRACE_O_TRACEEXEC option was turned on" text into it.
>
> This will clearly indicate that by this time, pid has changed.

Done.

> There is a bit of text below:
>
>> The thread ID change happens before
>> .B PTRACE_EVENT_EXEC
>> stop, not after.
>
> which will be made redundant by the above change and can be deleted.

I deleted it.


>> .\" FIXME: Please check: at various places in the following,
>> .\"        I have changed "pid" to "[the tracee's] thread ID"
>> .\"        Is that okay?
>> .IP *
>> The execing tracee changes its thread ID while it is in the
>> .BR execve (2).
>> (Remember, under ptrace, the "pid" returned from
>> .BR waitpid (2),
>> or fed into ptrace calls, is the tracee's thread ID.)
>> That is, the tracee's thread ID is reset to be the same as its process ID,
>> which is the same as the thread group leader's thread ID.
>
> Yes, the text looks ok to me.

Okay.

>> The
>> .B PTRACE_O_TRACEEXEC
>> option is the recommended tool for dealing with this situation.
>> It enables
>> .B PTRACE_EVENT_EXEC
>> stop, which occurs before
>> .BR execve (2)
>> returns.
>> .\" FIXME Following on from the previous sentences,
>> .\"       can/should we add a few more words on how
>> .\"       PTRACE_EVENT_EXEC stop helps us deal with this situation?
>> .LP
>
> I propose the following text:
>
> The PTRACE_O_TRACEEXEC option is the recommended tool for dealing with
> this situation. First, it enables PTRACE_EVENT_EXEC stop, which occurs
> before execve(2) returns. In this stop, the tracer can use a
> ptrace(PTRACE_GETEVENTMSG) call to retrieve the tracee's former thread ID.
> (This feature was introduced in Linux 3.0.)
> Second, the PTRACE_O_TRACEEXEC option disables legacy SIGTRAP generation
> on execve.

Thanks. I added that text.

>> As of Linux 2.6.38, the following is believed to work correctly:
>> .IP * 3
>> exit/death by signal is reported first to the tracer, then,
>> when the tracer consumes the
>> .BR waitpid (2)
>> result, to the real parent (to the real parent only when the
>> whole multithreaded process exits).
>> .\"
>> .\" FIXME mtk: Please check: In the next line,
>> .\" I changed "they" to "the tracer and the real parent". Okay?
>> If the tracer and the real parent are the same process,
>> the report is sent only once.
>
> Yes, this change is ok.

Thanks.

>> .B EPERM
>> The specified process cannot be traced.
>> This could be because the
>> tracer has insufficient privileges (the required capability is
>> .BR CAP_SYS_PTRACE );
>> unprivileged processes cannot trace processes that they
>> cannot send signals to or those running
>> set-user-ID/set-group-ID programs, for obvious reasons.
>> .\"
>> .\" FIXME I reworked the discussion of init below to note
>> .\" the kernel version (2.6.26) when the behavior changed for
>> .\" tracing init(8). Okay?
>> Alternatively, the process may already be being traced,
>> or (on kernels before 2.6.26) be
>> .BR init (8)
>> (PID 1).
>
> Yes, this change is ok.

Thanks.

>> glibc currently declares
>> .BR ptrace ()
>> as a variadic function with only the
>> .I request
>> argument fixed.
>> This means that unneeded trailing arguments may be omitted,
>> though doing so makes use of undocumented
>> .BR gcc (1)
>> behavior.
>> .\" FIXME Please review. I reinstated the following, noting the
>> .\" kernel version number where it ceased to be true
>> .LP
>> In Linux kernels before 2.6.26,
>> .\" See commit 00cd5c37afd5f431ac186dd131705048c0a11fdb
>> .BR init (8),
>> the process with PID 1, may not be traced.
>
> Yes, this change is ok.

Thanks.

>> .\" FIXME So, can we just remove the following text (rather than
>> .\" just commenting it out)?
>> .\"
>> .\" Covered in more details above: (removed by dv)
>> .\" .LP
>> .\" Tracing causes a few subtle differences in the semantics of
>> .\" traced processes.
>> .\" For example, if a process is attached to with
>> .\" .BR PTRACE_ATTACH ,
>> .\" its original parent can no longer receive notification via
>> .\" .BR waitpid (2)
>> .\" when it stops, and there is no way for the new parent to
>> .\" effectively simulate this notification.
>> .\" .LP
>> .\" When the parent receives an event with
>> .\" .B PTRACE_EVENT_*
>> .\" set,
>> .\" the tracee is not in the normal signal delivery path.
>> .\" This means the parent cannot do
>> .\" .BR ptrace (PTRACE_CONT)
>> .\" with a signal or
>> .\" .BR ptrace (PTRACE_KILL).
>> .\" .BR kill (2)
>> .\" with a
>> .\" .B SIGKILL
>> .\" signal can be used instead to kill the tracee
>> .\" after receiving one of these messages.
>> .\" .LP
>
> Yes, let's remove this comment.

Done.

>> If a thread group leader is traced and exits by calling
>> .BR _exit (2),
>> .\" Note from Denys Vlasenko:
>> .\"     Here "exits" means any kind of death - _exit, exit_group,
>> .\"     signal death. Signal death and exit_group cases are trivial,
>> .\"     though: since signal death and exit_group kill all other threads
>> .\"     too, "until all other threads exit" thing happens rather soon
>> .\"     in these cases. Therefore, only _exit presents observably
>> .\"     puzzling behavior to ptrace users: thread leader _exit's,
>> .\"     but WIFEXITED isn't reported! We are trying to explain here
>> .\"     why it is so.
>> a
>> .B PTRACE_EVENT_EXIT
>> stop will happen for it (if requested), but the subsequent
>> .B WIFEXITED
>> notification will not be delivered until all other threads exit.
>> As explained above, if one of the other threads calls
>> .BR execve (2),
>> the death of the thread group leader will
>> .I never
>> be reported.
>> If the execed thread is not traced by this tracer,
>> the tracer will never know that
>> .BR execve (2)
>> happened.
>> One possible workaround is to
>> .B PTRACE_DETACH
>> the thread group leader instead of restarting it in this case.
>> Last confirmed on 2.6.38.6.
>> .\"        ^^^ need to test/verify this scenario
>> .\" FIXME: mtk: the preceding comment seems to be unresolved?
>> .\"        Do you want to add anything?
>
> No, I don't have anything useful to add here right now.

Okay -- I'll just leave the FIXME there for future reference.

So, I think this update is ready to go into the next man-pages
release. Thanks for all of this work, Denys. It's a great improvement
to the page.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

