[RFC,0/6] Sleepable tracepoints

Message ID 20201023195352.26269-1-mjeanson@efficios.com

Message

Michael Jeanson Oct. 23, 2020, 7:53 p.m. UTC
When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use-case for tracers. However, tracepoints
currently disable preemption around iteration on the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow specific tracer
probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
called from sleepable context, and convert the system call enter/exit
instrumentation to sleepable tracepoints.

This series only implements the tracepoint infrastructure required to
allow tracers to handle page faults. Modifying each tracer to handle
those page faults would be a next step after we all agree on this piece
of instrumentation infrastructure.

This patchset is based on v5.9.1.

Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org

Mathieu Desnoyers (1):
  tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

Michael Jeanson (5):
  tracing: introduce sleepable tracepoints
  tracing: ftrace: add support for sleepable tracepoints
  tracing: bpf-trace: add support for sleepable tracepoints
  tracing: perf: add support for sleepable tracepoints
  tracing: convert sys_enter/exit to sleepable tracepoints

 include/linux/tracepoint-defs.h |  11 ++++
 include/linux/tracepoint.h      | 104 +++++++++++++++++++++-----------
 include/trace/bpf_probe.h       |  23 ++++++-
 include/trace/define_trace.h    |   7 +++
 include/trace/events/syscalls.h |   4 +-
 include/trace/perf.h            |  26 ++++++--
 include/trace/trace_events.h    |  79 ++++++++++++++++++++++--
 init/Kconfig                    |   1 +
 kernel/trace/bpf_trace.c        |   5 +-
 kernel/trace/trace_events.c     |  15 ++++-
 kernel/trace/trace_syscalls.c   |  68 +++++++++++++--------
 kernel/tracepoint.c             | 104 +++++++++++++++++++++++++-------
 12 files changed, 351 insertions(+), 96 deletions(-)

Comments

peter enderborg Oct. 26, 2020, 12:05 p.m. UTC | #1
On 10/23/20 9:53 PM, Michael Jeanson wrote:
> When invoked from system call enter/exit instrumentation, accessing
> user-space data is a common use-case for tracers. However, tracepoints
> currently disable preemption around iteration on the registered
> tracepoint probes and invocation of the probe callbacks, which prevents
> tracers from handling page faults.
>
> Extend the tracepoint and trace event APIs to allow specific tracer
> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
> called from sleepable context, and convert the system call enter/exit
> instrumentation to sleepable tracepoints.

Will this not be a problem for analysis of the trace? It gets two
relevant times, one when it is called and one when it returns.

It makes it harder to correlate the order in which things happen.

And what about tracing of contexts that are already not preemptible?

E.g. the same tracepoint is used in different places and contexts.


Mathieu Desnoyers Oct. 26, 2020, 2:59 p.m. UTC | #2
----- On Oct 26, 2020, at 8:05 AM, peter enderborg peter.enderborg@sony.com wrote:

> On 10/23/20 9:53 PM, Michael Jeanson wrote:
>> When invoked from system call enter/exit instrumentation, accessing
>> user-space data is a common use-case for tracers. However, tracepoints
>> currently disable preemption around iteration on the registered
>> tracepoint probes and invocation of the probe callbacks, which prevents
>> tracers from handling page faults.
>>
>> Extend the tracepoint and trace event APIs to allow specific tracer
>> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
>> called from sleepable context, and convert the system call enter/exit
>> instrumentation to sleepable tracepoints.
> 
> Will this not be a problem for analysis of the trace? It gets two
> relevant times, one when it is called and one when it returns.

It will depend on what the tracer chooses to do. If we call the side-effect
of what is being traced a "transaction" (e.g. actually opening a file
descriptor and adding it to a process's file descriptor table as the result
of an open(2) system call), we have to consider that already today the
timestamp which we get is either slightly before or after the actual
side-effect of the transaction in the kernel. That is true even without
being preemptable.

Sometimes it's not relevant to have a tracepoint before and after the
transaction, e.g. when all we care about is to know that the transaction
has successfully happened or not.

In the case of system calls, we have sys_enter and sys_exit to mark the
beginning and end of the "transaction". Whatever side-effects are done by
the system call happens in between.

I think the question here is whether it matters that page faults triggered
by accessing system call input parameters happen only after we trace a
"system call entry" event. If the tracers care, then it would be up to them
to first trace that "system call entry" event, and have a separate event
for the argument payload. But there are other ways to identify whether page
faults happen within the system call or from user-space, for instance by
using the instruction pointer associated with the page fault. So when
observing page faults happening before sys_enter, but associated with a
kernel instruction pointer, a trace analysis tool could theoretically
figure out who is to blame for that page fault, *if* it cares.
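
To make the instruction-pointer idea concrete, here is a small sketch of what
an analysis tool could do. Everything in it is hypothetical: the event layout,
the function names, and in particular the KERNEL_SPACE_START boundary, which
assumes an x86-64 style split where kernel addresses sit in the upper half of
the address space. A real tool would take that boundary from the traced
architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: x86-64 style layout, kernel addresses live in the upper half. */
#define KERNEL_SPACE_START UINT64_C(0xffff800000000000)

enum fault_origin { FAULT_FROM_USER, FAULT_FROM_KERNEL };

/* Hypothetical trace record for a page fault event. */
struct page_fault_event {
	uint64_t timestamp;	/* trace clock value */
	uint64_t ip;		/* instruction pointer that took the fault */
};

/* Attribute a fault to kernel or user code based on its instruction pointer. */
static enum fault_origin classify_fault(const struct page_fault_event *ev)
{
	return ev->ip >= KERNEL_SPACE_START ? FAULT_FROM_KERNEL : FAULT_FROM_USER;
}

/*
 * Count the faults taken by kernel code in a window of events, e.g. the
 * page faults recorded just before a sys_enter event.
 */
static size_t count_kernel_faults(const struct page_fault_event *evs, size_t n)
{
	size_t i, count = 0;

	assert(evs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (classify_fault(&evs[i]) == FAULT_FROM_KERNEL)
			count++;
	}
	return count;
}
```

A non-zero count for the window preceding sys_enter would point at faults
taken while reading the system call arguments rather than at faults caused by
the application itself.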

> 
> It makes it harder to correlate the order in which things happen.

The alternative is to have partial payloads like LTTng does today for
system call arguments. If reading a string from userspace (e.g. open(2)
file name) requires taking a page fault, LTTng truncates the string. This
is pretty bad for automated analysis as well.

> 
> And what about tracing of contexts that are already not preemptible?

The sleepable tracepoints are only meant to be used in contexts which can sleep.
For tracepoints placed in non-preemptible contexts, those should never take
a page fault to begin with.

> 
> E.g. the same tracepoint is used in different places and contexts.

As for the possibility that a given tracepoint "name" could be registered to
by both sleepable and non-sleepable tracer probes, I would like to see an
actual use-case for this. I don't have any.

I can envision that some tracer code will want to be allowed to work in
both sleepable and non-sleepable context, e.g. take page faults in
sleepable context (and called from a sleepable tracepoint), but have a
truncation behavior when called from non-sleepable context. This can actually
be done by looking at the new "TRACEPOINT_MAYSLEEP" tp flag. Passing the
tp_flags to code shared between sleepable and non-sleepable probes would allow
the callee to know whether it can take a page fault or not.
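
As a rough userspace illustration of that last point (this is not code from
the series), the sketch below models a helper shared between sleepable and
non-sleepable probes. Only the TRACEPOINT_MAYSLEEP name comes from the series.
Its numeric value here, the user_page model, and the helper names
copy_user_string() and fault_in() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Flag name taken from the series; the value is an assumption for this sketch. */
#define TRACEPOINT_MAYSLEEP (1U << 0)

/* Hypothetical model of user memory: a string whose page may not be resident. */
struct user_page {
	const char *data;	/* NUL-terminated contents */
	bool resident;		/* false: an access would take a page fault */
};

/* Hypothetical stand-in for handling a page fault; a real handler may sleep. */
static void fault_in(struct user_page *page)
{
	page->resident = true;
}

/*
 * Copy a user string into dst, which is always NUL-terminated on return.
 * Returns true if the payload is complete, false if it was truncated
 * (buffer too small, or a fault was needed but the caller cannot sleep).
 */
static bool copy_user_string(unsigned int tp_flags, struct user_page *page,
			     char *dst, size_t dst_size)
{
	size_t len, n;

	assert(dst_size > 0);

	if (!page->resident) {
		if (!(tp_flags & TRACEPOINT_MAYSLEEP)) {
			/* Non-sleepable caller: emit an empty, truncated payload. */
			dst[0] = '\0';
			return false;
		}
		/* Sleepable caller: taking the fault is allowed. */
		fault_in(page);
	}

	len = strlen(page->data);
	n = len < dst_size ? len : dst_size - 1;
	memcpy(dst, page->data, n);
	dst[n] = '\0';
	return n == len;
}
```

A probe attached to a sleepable tracepoint would pass TRACEPOINT_MAYSLEEP in
tp_flags and get the full string, while the same helper called with the flag
cleared falls back to truncation.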

Thanks,

Mathieu