linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-03  0:01 Alexei Starovoitov
  2015-03-03  1:18 ` Tom Zanussi
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2015-03-03  0:01 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 11:55 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
>
> I disagree that it would be rarely used.  In fact, it would probably
> cover about 80% of the use cases that people initially use things like
> systemtap or dtrace for, which I guess is what ebpf is shooting for.

'hist' style won't solve any of the use cases I'm targeting with bpf.
So, imo, 'hist' being 80% of dtrace is far from reality...
but let's agree to disagree. it's not that important.
I'm not saying don't do 'hist' at all.
I'm only suggesting to do it differently.

> I'm also looking at systems that have very little memory and 8Mb of
> storage to work with, so streaming it all to userspace and
> post-processing won't really work on those systems.

I'm not suggesting to post-process. Quite the opposite.
Let programs do ++ in the kernel, since that's what
your patch 12 is doing, but in a hard coded way.

> With some thought, though, I think the ebpf system/interpreter could be
> made smart enough to recognize the simple patterns represented by the
> hist triggers, and reuse them internally.  So ftrace users get their
> command-line version and it's also something ebpf can reuse.

I'm saying keep the command line version of hist, but let
user space process it.
I don't buy the argument that you must run it in busybox
without any extra tools.
If you're in busybox, the system is likely idle, so nothing
to trace/analyze. If you have some user space apps,
then it equally easy to add 'hist->bpf' tool.

>> to embedded argument. Why add this to kernel
>> when bpf programs can do the same on demand?
>
> Because this demonstrates that all those things can be done without
> introducing an interpreter into the mix, so why bother with the
> interpreter?

because interpreter is done once for all use cases,
whereas custom 'hist' code is doing only one thing for one use case.

>> the kernel ABI exposure is much smaller.
>> So would you consider working together on adding
>> clean bpf+tracepoints infra and corresponding
>> user space bits?
>> We can have small user space parser/compiler for
>> 'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
>> strings that will convert it into bpf program and you'll
>> be able to use it in embedded setups ?
>
> Yeah, wouldn't be averse to working together to create a clean bpf
> +tracepoints infrastructure - I think creating a reusable component like
> this would be a good first step.

great.
>From the program you can emit the same text format
as in your 'cat hist' example.
But it will not be a part of stable kernel ABI, which I think is
one of the main advantages to do such printing from programs
instead of kernel C code.
If you decide to extend what is being printed, you can
tweak 'hist->bpf' tool and print something else.
No one will complain, whereas when you would want
to extend the format of 'hist' file printed by kernel, you'd
need to consider all user tools that are parsing it.
Like we saw in systrace example...

> BTW, I've actually tried to play around with the BPF samples/, but it
> seems they're not actually hooked into the system i.e. the samples
> Makefile doesn't build them, and it even looks for tools/llvm that's not
> there.  I got as far as getting the latest llvm from the location
> mentioened in one of the bpf commit messages, but gave up after it told
> me 'invalid target bpf'.  And I couldn't find any documentation on how
> to set it all up - did I just miss that?

the comment next to 'tool/llvm' says 'point to your llvm' :)
so yes, to build C examples one need to install latest llvm trunk.
If you're saying that existing bpf stuff is hard to use, then yes.
I completely agree. It is hard to use. We're working on it.
The user bits can be improved gradually unlike kernel/user
boundary. Once you set it to be 'hist' file format it will stay
forever.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-03  2:31 Alexei Starovoitov
  2015-03-03 15:47 ` Tom Zanussi
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2015-03-03  2:31 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 5:18 PM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>>
>> I'm saying keep the command line version of hist, but let
>> user space process it.
>> I don't buy the argument that you must run it in busybox
>> without any extra tools.
>> If you're in busybox, the system is likely idle, so nothing
>> to trace/analyze. If you have some user space apps,
>> then it equally easy to add 'hist->bpf' tool.
>>
>
> How about systems that run a single statically linked process with no
> shell (but a service that can read and write files like/event/trigger
> and event/hist)?  We'd still like to be able to trace those systems.

hmm, there is no shell and one cannot do
'echo hist.. > trigger; cat hist' , but there is a demon
that can accept remote commands ?
Then would make even more sense to pass bpf programs
from remote host and consume aggregated data remotely.
This on-host demon will be tiny wrapper on top of bpf syscall.
Quite a bit more flexible :)

> Well, I'd say writing BPF 'assembly' to do anything isn't something more
> than a few users in the world would even consider, so that's completely
> out. Which means the only practical way to use it is via the C
> interface.  But getting that set up properly doesn't seem
> straightforward either - it isn't something the Makefile will help with,
> and there's no documentation on how one might do it.

I'm not proposing to use asm or C for this 'hist->bpf' tool.
Keep proposed 'hist:keys=...:vals=...' syntax and generate
bpf program on the fly based on this string.
So this user tool will take string, generate program, load
and attach it. Everything based on this single string input.
With the examples you mentioned in docs, it's not hard.
It will be more involved than patch 12, but way more generic
and easily extensible when 'hist:keys=...' string would need
to be extended.

> So I tweaked the Makefile to get samples/bpf in the build (I mean the
> directory is there under samples/, so why do I need to add it to the
> Makefile myself?) and tried building which failed until I tweaked
> something else to get it to find the right headers, etc.  Finally I got
> it building the userspace stuff but then found out I needed my own llvm
> to get the kernel modules built, so searched and found your llvm tree
> which I thought would configure the bpf backend automatically, but
> apparently not, since it then failed with llc: invalid target 'bpf'
> which is where I gave up.  Do I need to configure with --target=bpf or
> something like that?  I don't know, and know nothing about llvm, so am
> kind of stuck.

hmm. 'make samples/bpf/' should work out of the box.
only llvm needs to installed.
To install llvm:
$git clone http://llvm.org/git/llvm.git
$mkdir llvm/bld; cd llvm/bld
if you like cmake:
$cmake -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=BPF ..
$make
if you like autoconf:
$../configure --enable-experimental-targets=BPF
$make

That's typical for any new backend. Hopefully soon it will lose
'experimental' status.

> I really do want to try doing something with it, and I understand that
> you're working on improving the user experience, but at this point it
> seems users have to jump through a lot of hoops just to get a minimally
> working setup.  Even a small paragraph with some basic instructions
> would help.  Or maybe it's just me, and it works for everyone else out
> of the box.

Thank you for feedback. We'll add llvm build instructions to the doc.
The goal for llvm is to be able to do
'apt-get install llvm-3.7' and bpf will be part of default install.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 19:36 Alexei Starovoitov
  2015-03-02 19:45 ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 11:31 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2 Mar 2015 11:14:54 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> I think we both want to see in-kernel aggregation.
>> This 'hist' stuff is trying to do counting and even map sorting
>> in the kernel, whereas with bpf programs I'm moving
>> all of these decisions to user space.
>> I understand your desire to avoid any user level scripts
>> and do everything via 'cat' and debugfs, but imo that's
>> very limiting. I think it's better to do slim user space
>> scripting language that can translate to bpf even in
>> embedded setups. Then users will be able to aggregate
>> whatever they like, whereas with 'hist' approach
>> they're limited to simple counters.
>> trace_events_trigger.c - 1466 lines - that's quite a bit
>> of code that will be rarely used. Kinda goes counter
>> to embedded argument. Why add this to kernel
>> when bpf programs can do the same on demand?
>
> At Collab, a lot of people came to me telling me they love the debugfs
> system. Because it's everywhere they go. You remove that, you remove
> most users (including myself).
>
>
>> Also the arguments about stable ABI apply as well.
>> The format of 'hist' file would need to be stable, so will
>> be hard to extend it. With bpf programs doing aggregation
>> the kernel ABI exposure is much smaller.
>
> I disagree with this statement.
>
>> So would you consider working together on adding
>> clean bpf+tracepoints infra and corresponding
>> user space bits?
>> We can have small user space parser/compiler for
>> 'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
>> strings that will convert it into bpf program and you'll
>> be able to use it in embedded setups ?
>
> Make sure it also works on android.

the sad part is that tracing is off on android...
so all of tracefs is actually compiled out there...
and probably will be if we keep adding rarely used code to it.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 19:14 Alexei Starovoitov
  2015-03-02 19:31 ` Steven Rostedt
  2015-03-02 19:55 ` Tom Zanussi
  0 siblings, 2 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:14 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 8:00 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>
>   # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
>         /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
>
>   # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
>
>   key: common_pid:bash[3112], id:sys_write                     vals: count:69
>   key: common_pid:bash[3112], id:sys_rt_sigprocmask            vals: count:218

Hi Tom,

I think we both want to see in-kernel aggregation.
This 'hist' stuff is trying to do counting and even map sorting
in the kernel, whereas with bpf programs I'm moving
all of these decisions to user space.
I understand your desire to avoid any user level scripts
and do everything via 'cat' and debugfs, but imo that's
very limiting. I think it's better to do slim user space
scripting language that can translate to bpf even in
embedded setups. Then users will be able to aggregate
whatever they like, whereas with 'hist' approach
they're limited to simple counters.
trace_events_trigger.c - 1466 lines - that's quite a bit
of code that will be rarely used. Kinda goes counter
to embedded argument. Why add this to kernel
when bpf programs can do the same on demand?
Also the arguments about stable ABI apply as well.
The format of 'hist' file would need to be stable, so will
be hard to extend it. With bpf programs doing aggregation
the kernel ABI exposure is much smaller.
So would you consider working together on adding
clean bpf+tracepoints infra and corresponding
user space bits?
We can have small user space parser/compiler for
'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
strings that will convert it into bpf program and you'll
be able to use it in embedded setups ?

Thanks

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 16:00 Tom Zanussi
  2015-03-03  2:25 ` Masami Hiramatsu
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

This is v2 of my previously posted 'hashtriggers' patchset [1], but
renamed to 'hist triggers' following feedback from v1.

Since then, the kernel has gained a tracing map implementation in the
form of bpf_map, which this patchset makes a bit more generic, exports
and uses (as tracing_map_*, still in the bpf syscall file however).

A large part of the initial hash triggers implementation was devoted
to a map implementation and general-purpose hashing functions, which
have now been subsumed by the bpf maps.  I've completely redone the
trigger patches themselves to work on top of tracing_map.  The result
is a much simpler and easier-to-review patchset that's able to focus
more directly on the problem at hand.

The new version addresses all the comments from the previous review,
including changing the name from hash->hist, adding separate 'hist'
files for the output, and moving the examples into Documentation.

This patchset also includes a couple other new and related triggers,
enable_hist and disable_hist, very similar to the existing
enable_event/disable_event triggers used to automatically enable and
disable events based on a triggering condition, but in this case
allowing hist triggers to be enabled and disabled in the same way.

The only problem with using the bpf_map implementation for this is
that it uses kmalloc internally, which causes problems when trying to
trace kmalloc itself.  I'm guessing the ebpf tracing code would also
share this problem e.g. when using bpf_maps from probes on kmalloc().
This patchset attempts a solution to that problem (by adding a
gfp_flag and changing the kmem memory allocation tracepoints to
conditional variants) for checking for it in for but I'm not sure it's
the best way to address it.

There are a couple of important bits of functionality that were
present in v1 but dropped in v2 mainly because I'm still trying to
figure out the best way to accomplish those things using the bpf_map
implementation.

The first is support for compound keys.  Currently, maps can only be
keyed on a single event field, whereas in v1 they could be keyed on
multiple keys.  With support for compound keys, you can create much
more interesting output, such as for example per-pid lists of
syscalls or read counts e.g.:

  # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
        /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger

  # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist

  key: common_pid:bash[3112], id:sys_write		       vals: count:69
  key: common_pid:bash[3112], id:sys_rt_sigprocmask	       vals: count:218

  key: common_pid:update-notifier[3164], id:sys_poll	       vals: count:37
  key: common_pid:update-notifier[3164], id:sys_recvfrom       vals: count:118

  key: common_pid:deja-dup-monito[3194], id:sys_sendto	       vals: count:1
  key: common_pid:deja-dup-monito[3194], id:sys_read	       vals: count:4
  key: common_pid:deja-dup-monito[3194], id:sys_poll	       vals: count:8
  key: common_pid:deja-dup-monito[3194], id:sys_recvmsg	       vals: count:8
  key: common_pid:deja-dup-monito[3194], id:sys_getegid	       vals: count:8

  key: common_pid:emacs[3275], id:sys_fsync		       vals: count:1
  key: common_pid:emacs[3275], id:sys_open		       vals: count:1
  key: common_pid:emacs[3275], id:sys_symlink		       vals: count:2
  key: common_pid:emacs[3275], id:sys_poll		       vals: count:23
  key: common_pid:emacs[3275], id:sys_select		       vals: count:23
  key: common_pid:emacs[3275], id:unknown_syscall	       vals: count:34
  key: common_pid:emacs[3275], id:sys_ioctl		       vals: count:60
  key: common_pid:emacs[3275], id:sys_rt_sigprocmask	       vals: count:116

  key: common_pid:cat[3323], id:sys_munmap		       vals: count:1
  key: common_pid:cat[3323], id:sys_fadvise64		       vals: count:1

Related to that is support for sorting on multiple fields.  Currently,
you can sort using only a primary key.  Being able to sort on multiple
or at least a secondary key is indispensible for seeing trends when
displaying multiple values.

[1] http://thread.gmane.org/gmane.linux.kernel/1673551

Changes from v1:
 - completely rewritten on top of tracing_map (renamed and exported bpf_map)
 - added map clearing and client ops to tracing_map
 - changed the name from 'hash' triggers to 'hist' triggers
 - added new trigger 'pause' feature
 - added new enable_hist and disable_hist triggers
 - added usage for hist/enable_hist/disable hist to tracing/README
 - moved examples into Documentation/trace/event.txt
 - added ___GFP_NOTRACE, kmalloc/kfree macros, and conditional kmem tracepoints

The following changes since commit 49058038a12cfd9044146a1bf4b286781268d5c9:

  ring-buffer: Do not wake up a splice waiter when page is not full (2015-02-24 14:00:41 -0600)

are available in the git repository at:

  git://git.yoctoproject.org/linux-yocto-contrib.git tzanussi/hist-triggers-v2
  http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/log/?h=tzanussi/hist-triggers-v2

Tom Zanussi (15):
  tracing: Make ftrace_event_field checking functions available
  tracing: Add event record param to trigger_ops.func()
  tracing: Add get_syscall_name()
  bpf: Export bpf map functionality as trace_map_*
  bpf: Export a map-clearing function
  bpf: Add tracing_map client ops
  mm: Add ___GFP_NOTRACE
  tracing: Make kmem memory allocation tracepoints conditional
  tracing: Add kmalloc/kfree macros
  bpf: Make tracing_map use kmalloc/kfree_notrace()
  tracing: Add a per-event-trigger 'paused' field
  tracing: Add 'hist' event trigger command
  tracing: Add sorting to hist triggers
  tracing: Add enable_hist/disable_hist triggers
  tracing: Add 'hist' trigger Documentation

 Documentation/trace/events.txt      |  870 +++++++++++++++++++++
 include/linux/bpf.h                 |   15 +
 include/linux/ftrace_event.h        |    9 +-
 include/linux/gfp.h                 |    3 +-
 include/linux/slab.h                |   61 +-
 include/trace/events/kmem.h         |   28 +-
 kernel/bpf/arraymap.c               |   16 +
 kernel/bpf/hashtab.c                |   39 +-
 kernel/bpf/syscall.c                |  193 ++++-
 kernel/trace/trace.c                |   48 ++
 kernel/trace/trace.h                |   25 +-
 kernel/trace/trace_events.c         |    3 +
 kernel/trace/trace_events_filter.c  |   15 +-
 kernel/trace/trace_events_trigger.c | 1466 ++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_syscalls.c       |   11 +
 mm/slab.c                           |   45 +-
 mm/slob.c                           |   45 +-
 mm/slub.c                           |   47 +-
 18 files changed, 2795 insertions(+), 144 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-03-09 11:41 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-03  0:01 [PATCH v2 00/15] tracing: 'hist' triggers Alexei Starovoitov
2015-03-03  1:18 ` Tom Zanussi
  -- strict thread matches above, loose matches on Subject: below --
2015-03-03  2:31 Alexei Starovoitov
2015-03-03 15:47 ` Tom Zanussi
2015-03-04  2:22   ` Alexei Starovoitov
2015-03-04  5:01     ` Tom Zanussi
2015-03-02 19:36 Alexei Starovoitov
2015-03-02 19:45 ` Steven Rostedt
2015-03-02 19:49   ` Karim Yaghmour
2015-03-02 20:33     ` Alexei Starovoitov
2015-03-02 20:37       ` Karim Yaghmour
2015-03-02 20:39       ` Steven Rostedt
2015-03-02 20:43         ` Steven Rostedt
2015-03-02 20:48           ` Alexei Starovoitov
2015-03-02 20:52             ` Karim Yaghmour
2015-03-02 20:54             ` Karim Yaghmour
2015-03-02 19:14 Alexei Starovoitov
2015-03-02 19:31 ` Steven Rostedt
2015-03-02 19:55 ` Tom Zanussi
2015-03-09 11:38   ` He Kuang
2015-03-02 16:00 Tom Zanussi
2015-03-03  2:25 ` Masami Hiramatsu
2015-03-03 14:47   ` Tom Zanussi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).