[ANNOUNCE][PATCH 0 of 4] zedtrace, a general-purpose binary tracer

* [ANNOUNCE][PATCH 0 of 4] zedtrace, a general-purpose binary tracer
@ 2009-02-27  9:00 Tom Zanussi
  2009-02-27 16:46 ` Frederic Weisbecker
  2009-02-28 21:00 ` Christoph Hellwig
  0 siblings, 2 replies; 5+ messages in thread
From: Tom Zanussi @ 2009-02-27  9:00 UTC (permalink / raw)
  To: linux-kernel

Since everyone seems to be posting binary tracers lately, I thought I'd
post the current version of mine too.  It's called zedtrace and is
based on utt, the generic tracing framework derived from the powerful
blktrace tracing infrastructure.  The details are below and in the
code, but in a nutshell what it does, in addition to providing the
standard tracing-to-disk/network/live-mode per-cpu tracing enabled by
utt, is to provide a very efficient way to move data pretty much
directly from the kernel into native variables defined in
high-level-language event-handling scripts, where they can then be
manipulated using the full power of the high-level-language being
used.  This is done via self-describing events, the definition of which
make for very efficient logging functions (basically 'direct to the
wire', avoiding anything like run-time binary printk() overhead).  The
design also makes it trivially easy to add new event definitions when
new tracepoints come online, and also to create new userspace
high-level-language event-handling support (see the zedperl code, which
should be easily ported to create a zedpython or zedruby, for
instance).

I haven't had much time to write many useful scripts yet, though
hopefully the few examples described below would give some idea of what
kinds of things could be done.  I'll probably start by porting my
version of blktrace-in-the-kernel that I did a couple of years ago:
http://sources.redhat.com/ml/systemtap/2007-q1/msg00485.html

Here's the blurb from the config option, which describes it in a little
more detail.  Following that are some examples and usage details.

zedtrace is a low-overhead general-purpose kernel tracing option
that allows you to selectively trace any/all tracepoints currently
defined in the kernel, with generic kernel or userspace filtering, as
desired.  Trace events from different subsystems
e.g. sched/block/workqueue appear together seamlessly in the trace
output, allowing interactions between subsystems to be identified and
providing an overall view of system-wide activity.

All trace events are self-describing; there are a number of userspace
tools available in the zedtrace tarball that take advantage of this
and make all of the events and event fields available to
general-purpose language interpreters as native types and event
handler funtions.  For instance, the 'zedperl' event interpreter
allows the full power of Perl to be used to aggregate and analyze
either a saved set of per-cpu trace files or or a live trace stream in
real-time.  There are also simpler utilities that can be used to do
the same thing but that simply print the data in a generic way.  In
any case, because the data is self-describing, these tools don't need
to be recompiled when new events are added to the kernel.  Users can
also trivially write their own front-ends to display and/or
aggregate/analyze the trace stream in custom ways by plugging in to
the modular Zed architecture; interpreters for other languages such as
Python, Ruby, etc. are expected soon.

Because Zed is based on the best-of-breed blktrace-derived UTT
userspace tracing library, users can trace not only to local disk but
over the network in several different ways, and as mentioned can also
do 'live tracing' using the combined per-cpu data stream.

See kernel/trace/trace_binary/zedtrace.h for the current set of event
definitions, which can be used as examples for adding your own.

See http://utt.sourceforge.net/zedtrace.html for userspace tools,
examples and further documentation.

Usage
-----

The simplest thing to do is trace to disk, which you can do by simply
invoking zedtrace with a set of events to trace.  To find out what
events are available, use the -t zedtrace option:

# ./zedtrace -t

Available tracepoints:

sched
  sched:kthread_stop
  sched:kthread_stop_ret
  sched:wait_task
  sched:wakeup
  sched:wakeup_new
  sched:switch
  sched:migrate_task
  sched:process_free
  sched:process_exit
  sched:process_wait
  sched:process_fork
  sched:signal_send
block
  block:rq_abort
  block:rq_insert
  block:rq_issue
  block:rq_requeue
  block:rq_complete
  block:bio_bounce
  block:bio_complete
  block:bio_backmerge
  block:bio_frontmerge
  block:bio_queue
  block:getrq
  block:sleeprq
  block:plug
  block:unplug_io
  block:unplug_timer
  block:split
  block:remap
workqueue
  workqueue:insertion
  workqueue:execution
  workqueue:creation
  workqueue:destruction

You can specify any or all of these to tell zedtrace which events to
trace.  You can specify complete subsystems to trace, using the
top-level (unindented) items in the event listing, or individual
events using the same syntax as displayed in the listing e.g.

# ./zedtrace sched block workqueue
# ./zedtrace sched:switch sched:process_fork

This results in a set of per-cpu trace files in the current directory:

-rw-r--r--  1 root root   18432 2009-02-26 22:36 trace.zed.0
-rw-r--r--  1 root root   39152 2009-02-26 22:36 trace.zed.1

You can use zedprint to print all the events along with the raw values
for each event, by simply specifying the first common part of the
per-cpu trace files (e.g. trace).  Because events are self-describing,
you can use the same zedprint executable to print any event even if it
was added to the kernel at a later time (the same is also true for
zedperl as well, see below).  Note that the output of zedprint is not
pretty, and it wasn't meant to be - it's a simple program used to
generically dump trace records:

# ./zedprint trace

wakeup               cpu = 1, time =     4.107217716, sequence = 2991, wakeup_pid = 12172, success = 1, 
switch               cpu = 1, time =     4.107237155, sequence = 2992, prev_pid = 0, next_pid = 12172, 
remap                cpu = 1, time =     4.115181242, sequence = 2993, dev = 8388614, what = 1048591,
		     rw = 0, errors = 0, sector = 212057636, bytes = 61440, pdu_len = 16, pdu[0] = blob, 
bio_queue            cpu = 1, time =     4.115183032, sequence = 2994, dev = 0, what = 1048577, rw = 0,
		     errors = 0, sector = 212057636, bytes = 61440, 
getrq                cpu = 1, time =     4.115190306, sequence = 2995, dev = 0, what = 1048580, rw = 0,
		     errors = 0, sector = 212057636, bytes = 61440, 
plug                 cpu = 1, time =     4.115194683, sequence = 2996, dev = 0, what = 1048585, 
rq_insert            cpu = 1, time =     4.115198413, sequence = 2997, dev = 8388608, what = 17891340,
		     rw = 0, errors = 0, sector = 212057636, bytes = 61440, pdu_len = 0, pdu[0] = blob, 
unplug_io            cpu = 1, time =     4.115204344, sequence = 2998, dev = 0, what = 1048586, pdu_len = 8, pdu[0] = blob, 
rq_issue             cpu = 1, time =     4.115211814, sequence = 2999, dev = 8388608, what = 21037063, rw = 0,
		     errors = 0, sector = 212057636, bytes = 61440, pdu_len = 0, pdu[0] = blob, 
switch               cpu = 1, time =     4.115240756, sequence = 3000, prev_pid = 12172, next_pid = 0, 
insertion            cpu = 1, time =     4.117371692, sequence = 3001, pid = 2872, 
wakeup               cpu = 1, time =     4.117378249, sequence = 3002, wakeup_pid = 2872, success = 1, 
switch               cpu = 1, time =     4.117382526, sequence = 3003, prev_pid = 0, next_pid = 2872, 
execution            cpu = 1, time =     4.117384668, sequence = 3004, pid = 2872, 
switch               cpu = 1, time =     4.117397292, sequence = 3005, prev_pid = 2872, next_pid = 0, 
wakeup               cpu = 0, time =     4.119145967, sequence = 500, wakeup_pid = 12159, success = 1, 
switch               cpu = 0, time =     4.119152068, sequence = 501, prev_pid = 0, next_pid = 12159, 
switch               cpu = 0, time =     4.119161459, sequence = 502, prev_pid = 12159, next_pid = 0, 
rq_complete          cpu = 1, time =     4.125063131, sequence = 3006, dev = 8388608, what = 25231368, rw = 0,
		     errors = 0, sector = 212057636, bytes = 61440, pdu_len = 0, pdu[0] = blob, 
wakeup               cpu = 1, time =     4.125068423, sequence = 3007, wakeup_pid = 12172, success = 1, 
switch               cpu = 1, time =     4.125084887, sequence = 3008, prev_pid = 0, next_pid = 12172, 

You can also pass the same trace files to more powerful analysis tools
such as zedperl, which is a tool that contains an embedded perl
interpreter that makes the trace data available to event handler
functions contained in arbitrary user scripts.  The self-describing
event data here allows the trace fields to be passed directly to the
script as native param variables.  Here's an example of a typical
handler function as used in the printall.pl script (which has handlers
for all known event types, as does countall.pl).  The handler function
starts with the subsystem name followed by the event name.  The first
'my' line defines the parameters that will be passed to the function
by the scripting engine:

sub sched::switch {
    my ($cpu, $secs, $nsecs, $sequence, $prev_pid, $next_pid) = @_;

    $this = (caller(0)) [3];
    print_header($this, $cpu, $secs, $nsecs, $sequence);
    printf("prev_pid=%-5u prev_pid=%-5u\n", $prev_pid, $next_pid);
}

The full power of Perl is now at your disposal for processing the
event data.

The trace files from the previous example can be passed instead to
zedperl and processed using the printall.pl script.  The output in
this case isn't much prettier, mainly because the script is
unfinished, but in any case printing is the least interesting thing
you can do with the data.

# ./zedperl trace printall.pl

event                    cpu   timestamp   sequence
-----                    ---   ---------   --------
sched::wakeup            1     4.107217716     2991 wake_pid=12172 success=1
sched::switch            1     4.107237155     2992 prev_pid=0     prev_pid=12172
block::remap             1     4.115181242     2993 dev=8,6     pid=x rwbs=R  212057636 + 120 <- r.dev=(x,x) r.sector=x
block::bio_queue         1     4.115183032     2994 dev=0,0     pid=x rwbs=R  sector+size=212057636 + 120 []
block::getrq             1     4.115190306     2995 dev=0,0     pid=x rwbs=R  sector+size=212057636 + 120 []
block::plug              1     4.115194683     2996 dev=0,0     pid=x rwbs=N  []
block::rq_insert         1     4.115198413     2997 dev=8,0     pid=x rwbs=R  sector+size=212057636 + 120 []
block::unplug_io         1     4.115204344     2998 dev=0,0     pid=x rwbs=N  [] 99
block::rq_issue          1     4.115211814     2999 dev=8,0     pid=x rwbs=R  sector+size=212057636 + 120 []
sched::switch            1     4.115240756     3000 prev_pid=12172 prev_pid=0    
sched::switch            1     4.117371692     3001 pid=2872  wq_cpu=0
sched::wakeup            1     4.117378249     3002 wake_pid=2872  success=1
sched::switch            1     4.117382526     3003 prev_pid=0     prev_pid=2872 
sched::switch            1     4.117384668     3004 pid=2872  wq_cpu=0
sched::switch            1     4.117397292     3005 prev_pid=2872  prev_pid=0    
sched::wakeup            0     4.119145967      500 wake_pid=12159 success=1
sched::switch            0     4.119152068      501 prev_pid=0     prev_pid=12159
sched::switch            0     4.119161459      502 prev_pid=12159 prev_pid=0    
block::rq_complete       1     4.125063131     3006 dev=8,0     pid=x rwbs=R  sector+size=212057636 + 120 [0]
sched::wakeup            1     4.125068423     3007 wake_pid=12172 success=1
sched::switch            1     4.125084887     3008 prev_pid=0     prev_pid=12172

zedtrace can also be used in a pipeline with zedperl or zedprint to do
'live tracing', where the output doesn't go to per-cpu files on disk
but instead the per-cpu data coming from the kernel is combined in
real-time and processed by the scripted handlers as it comes in.
Running the following results in the printed event data being
displayed on the console as it comes in:

# ./zedtrace sched block workqueue -o - | ./zedperl -i - printall.pl

NOTE: known bug/todo: sometimes get broken pipes when starting, try a
couple more times if that happens.

Of course, if printing were the only thing you could do, it wouldn't
be too interesting.  Since you now have the power of a full-fledged
language at your disposal, you can do literally anything the language
allows with the trace data coming in.  The countall.pl script isn't
too interesting either, but it does start to show the potential of the
kinds of things you can do.  Here's the same handler as above but in
the context of the countall script:

sub sched::switch {
    my ($cpu, $secs, $nsecs, $sequence, $prev_pid, $next_pid) = @_;

    $this = (caller(0)) [3];
    $counts{$this}++;
}

Invoking the following will simply count each event as it comes in
(counts are hashed by handler name) and on Ctrl-C will dump out the
counts contained in the hash.  The trace_begin() and trace_end()
functions are special-purpose functions that are invoked, if present,
at the beginning end of the trace - here the trace_end() function
dumps out the results:

# ./zedtrace sched block workqueue -o - | ./zedperl -i - countall.pl
zedperl started with Perl script countall.pl
^Cevent counts:
  block::unplug_timer: 5
  block::bio_backmerge: 46
  block::remap: 89
  block::rq_insert: 411
  block::unplug_io: 5
  sched::wakeup_new: 65
  block::rq_issue: 411
  sched::migrate_task: 172
  sched::process_exit: 65
  sched::wakeup: 7844
  block::bio_queue: 89
  block::getrq: 411
  sched::process_wait: 145
  sched::signal_send: 79
  block::rq_complete: 89
  sched::process_free: 65
  block::plug: 383
  workqueue::execution: 1850
  sched::switch: 13226
  workqueue::insertion: 1850
  sched::process_fork: 65

The same thing can be accomplished without the pipeline as such:

# ./zedtrace sched block workqueue
% ./zedperl trace countall.pl

A slightly more interesting example is the workqueue-stats.pl script,
which basically replicates the kernel's ftrace-based workqueue tracer.
The workqueue-stats script is very small and simple but it shows the
power of being able to use a real language for either live- or
post-processed tracing - it uses an array of nested hashes to
accomplish its results, which is something beyond even DTrace's
capabilities.

# ./zedtrace workqueue -o - | ./zedperl -i - workqueue-stats.pl

or

# ./zedtrace workqueue
# ./zedperl trace workqueue-stats.pl

zedperl started with Perl script workqueue-stats.pl
workqueue counts:
cpu    ins   exec    pid
  0      4      6   2872
  0     90     90     10
  1    867    864   2872
  1     91     91     11
  1      3      3     73
  1    630    630   1184
  1     90     90   5516

Because zedtrace userspace code is based on libutt, which is a
generalized version of the powerful blktrace userspace code, it also
inherits from it the ability to trace over the network e.g. on the
server side:

root@bookworm:/home/trz/zedtrace# ./zedtrace -l
server: waiting for connections...
server: connection from 127.0.0.1
server: end of run for trace
Channel: trace
  CPU  0:                    0 events,       18 KiB data
  CPU  1:                    0 events,       39 KiB data
  Total:                     0 events (dropped 0),       57 KiB data

And on the client side:

root@bookworm:/home/trz/zedtrace# ./zedtrace -h localhost sched
zed: connecting to localhost
zed: connected!
^CChannel: trace
  CPU  0:                    0 events,       18 KiB data
  CPU  1:                    0 events,       39 KiB data
  Total:                     0 events (dropped 0),       57 KiB data

This produces a directory on the server with a name made up of the
client's IP address and timestamp, e.g. 127.0.0.1-2009-02-27-04:36:06,
that contains the trace files.  These can be processed using the same
tools as on a local system.

TODO: need to save event descriptions in remote directory, for now you
have to manually copy them over.

Finally, because even the most efficient tracing system can become
overwhelmed at times by large volumes of trace data, zedtrace provides
a generic filtering capability in the kernel.  To make use of it,
you'll need to know a little about the internals of how event
descriptions are represented in the debugfs filesystem.  All of the
event descriptions are arranged by subsystem, as defined in the event
definitions in the kernel, and live under
/sys/kernel/debug/zed/trace/events:

# ls -al /sys/kernel/debug/zed/trace/events
total 0
drwxr-xr-x 6 root root 0 2009-02-25 15:55 .
drwxr-xr-x 3 root root 0 2009-02-25 22:38 ..
---------- 1 root root 0 2009-02-25 15:55 all
drwxr-xr-x 2 root root 0 2009-02-25 15:55 block
drwxr-xr-x 2 root root 0 2009-02-25 15:55 power
drwxr-xr-x 2 root root 0 2009-02-25 15:55 sched
drwxr-xr-x 2 root root 0 2009-02-25 15:55 workqueue

TODO: for now, some of the code assumes the /sys/kernel/debug path,
need to remove that restriction.

Contained within each 'subsystem' directory is a list of all the
events available for that subsystem along with an 'all' file e.g. for
the 'sched' subsystem, here's what you should see:

# ls -al /sys/kernel/debug/zed/trace/events/sched
total 0
drwxr-xr-x 2 root root 0 2009-02-25 15:55 .
drwxr-xr-x 6 root root 0 2009-02-25 15:55 ..
---------- 1 root root 0 2009-02-25 22:37 all
---------- 1 root root 0 2009-02-25 15:55 kthread_stop_ret_trace
---------- 1 root root 0 2009-02-25 15:55 kthread_stop_trace
---------- 1 root root 0 2009-02-25 15:55 migrate_task_trace
---------- 1 root root 0 2009-02-25 15:55 process_exit_trace
---------- 1 root root 0 2009-02-25 15:55 process_fork_trace
---------- 1 root root 0 2009-02-25 15:55 process_free_trace
---------- 1 root root 0 2009-02-25 15:55 process_wait_trace
---------- 1 root root 0 2009-02-25 15:55 signal_send_trace
---------- 1 root root 0 2009-02-25 15:55 switch_trace
---------- 1 root root 0 2009-02-25 15:55 wait_task_trace
---------- 1 root root 0 2009-02-25 15:55 wakeup_new_trace
---------- 1 root root 0 2009-02-25 15:55 wakeup_trace

To enable or disable all events for a subsystem, you write 0 or 1 to
the 'all' file.  To enable or disable a particular event, you write 0
or 1 to the relevant event file:

# echo 1 > /sys/kernel/debug/zed/trace/events/all
# echo 0 > /sys/kernel/debug/zed/trace/events/signal_send_trace

If you cat an event description file, you'll see a bunch of
information about the event's fields and state:

~# cat /sys/kernel/debug/zed/trace/events/sched/switch_trace
switch_trace    5       40      disabled
        magic   u32     0       4
        id      u16     4       2
        cpu     u32     8       4
        time    timestamp       16      8
        sequence        u32     24      4
        prev_pid        pid_t   28      4
        next_pid        pid_t   32      4
filters:
        none

At the end of each event description, there's a filter: section, which
displays the state of any filtering set for that event.  Any field in
the event can be filtered on using statements containing '==' or '!='.
There may be up to 8 such statement for an event and they can be
combined using 'and' or 'or'.  Only events that pass the compound
statement are passed to userspace; failing events are simply
discarded.  As with enabling/disabling, echoing the statements to the
'all' file will cause the filter to be applied to all events in the
subsystem, and echoing to an individual event will affect only that
event.  To clear the filter on an event, simply disable the event.
The statement below will allow only block events on cpu1 and for
device 8388608 through:

# echo "cpu == 1" > /sys/kernel/debug/zed/trace/events/block/all
# echo "and dev != 8388608" > /sys/kernel/debug/zed/trace/events/block/all
# cat /sys/kernel/debug/zed/trace/events/block/bio_backmerge_trace

Note that the syntax is exact, unforgiving, and not particularly
user-friendly - it's meant mainly to be used internally by the tracing
tools similarly to how zedtrace handles event enable/disable i.e. the
user never knows of their existence but through a nicer interface
exposed by the tool.

bio_backmerge_trace     19      64      disabled
        magic   u32     0       4
        id      u16     4       2
        cpu     u32     8       4
        time    timestamp       16      8
        sequence        u32     24      4
        dev     dev_t   28      4
        what    u32     32      4
        rw      int     36      4
        errors  int     40      4
        sector  sector_t        48      8
        bytes   int     56      4
filters:
        cpu == 1
        and dev != 8388608

It's a blunt instrument for filtering, but is meant to be
coarse-grained - finer grained filtering is basically what the rest of
the infrastructure is there for. ;-)

^ permalink raw reply	[flat|nested] 5+ messages in thread