linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-01-29 18:55 Alexei Starovoitov
  0 siblings, 0 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-29 18:55 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML

On Thu, Jan 29, 2015 at 4:35 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>> Right.  I think bpf programs belong to a user process but events are
>>> global resource.  Maybe you also need to consider attaching bpf
>>> program via perf (ioctl?) interface..
>>
>> yes. I did. Please see my reply to Masami.
>> ioctl only works for tracepoints.
>
> What was the problem of kprobes then? :)

Looks like I misread the logic of attaching a filter via perf ioctl.
Looking at it again it seems to be a major change in design:
Instead of adding into ftrace_raw_* helpers, I would add
to perf_trace_* helpers which are very stack heavy
because of 'pt_regs'
Ex: perf_trace_kfree_skb() is using 224 bytes of stack
whereas ftrace_raw_event_kfree_skb() only 80.
which doesn't help in my quest for lowest overhead.
And the discussion about soft- and auto- enable/disable
becomes meaningless, since there are no such things
when it goes through perf events.
I guess it means no hooks through tracefs...
Anyway, I'll hook it up and see which way is cleaner.

* Re: [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-01-29  7:04 Alexei Starovoitov
  2015-01-29 12:35 ` Namhyung Kim
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-29  7:04 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML

On Wed, Jan 28, 2015 at 10:41 PM, Namhyung Kim <namhyung@kernel.org> wrote:
>
> I think it's not a problem of bpf.  A user process can be killed
> anytime while it enabled events without bpf.  The only thing it should
> care about is the auto-unload IMHO.

ok. I think it does indeed make sense to decouple the logic.
We can add 'auto_enable' file to achieve desired Ctrl-C behavior.
While the 'auto_enable' file is open the event will be enabled
and writes to 'enable' file will be ignored.
As soon as file closes, the event is auto-disabled.
Then user space will use 'bpf' file to attach/auto-unload
and 'auto_enable' file together.
Seems there would be a use for such 'auto_enable'
without bpf as well.
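
A minimal user-space model of the 'auto_enable' semantics proposed above,
only to make the state transitions concrete. The struct and function names
are hypothetical and this is not kernel code; it assumes a single opener.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one trace event; names are illustrative only. */
struct event_model {
	bool enabled;    /* what the 'enable' file would report */
	bool auto_open;  /* true while 'auto_enable' is held open */
};

/* open("auto_enable"): the event is forced on while the file is held. */
void auto_enable_open(struct event_model *ev)
{
	ev->auto_open = true;
	ev->enabled = true;
}

/* Write to 'enable': ignored while 'auto_enable' is open. */
bool enable_write(struct event_model *ev, bool on)
{
	if (ev->auto_open)
		return false;   /* write ignored */
	ev->enabled = on;
	return true;
}

/* close("auto_enable"), including the implicit close on process exit. */
void auto_enable_release(struct event_model *ev)
{
	assert(ev->auto_open);
	ev->auto_open = false;
	ev->enabled = false;    /* auto-disable */
}
```

Because the release path also runs when the process is killed, Ctrl-C would
leave the event disabled in this model.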

> I'm okay for not calling bpf program in NMI but not for disabling events.
>
> Suppose a user was collecting an event (including in NMI) and then
> [s]he also wanted to run a bpf program.  So [s]he wrote a program
> that always returns 1.  But after attaching the program, it didn't record
> the event in NMI..  Isn't that a problem?

ok, I think 'if (in_nmi()) return 1;' will work then, right?
Or you're thinking something else ?
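
A rough user-space sketch of that behavior, assuming the program is simply
skipped in NMI context and the event is kept. The 'nmi' parameter stands in
for the kernel's in_nmi(), and all names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an attached program: returns 0 to drop. */
typedef unsigned int (*filter_prog_fn)(void *ctx);

/*
 * Model of the decision: in NMI the program is not run and the event
 * is still recorded (return 1); otherwise the program decides.
 */
unsigned int filter_call_model(bool nmi, filter_prog_fn prog, void *ctx)
{
	assert(prog != NULL);
	if (nmi)
		return 1;       /* program skipped, event kept */
	return prog(ctx);
}

/* Two trivial programs for illustration. */
unsigned int always_drop(void *ctx) { (void)ctx; return 0; }
unsigned int always_keep(void *ctx) { (void)ctx; return 1; }
```

With this variant, attaching a program that always returns 1 does not change
which events are recorded, in NMI or out of it.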

> Right.  I think bpf programs belong to a user process but events are
> global resource.  Maybe you also need to consider attaching bpf
> program via perf (ioctl?) interface..

yes. I did. Please see my reply to Masami.
ioctl only works for tracepoints.

* Re: [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-01-29  6:39 Alexei Starovoitov
  0 siblings, 0 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-29  6:39 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Namhyung Kim, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Linux API,
	Network Development, LKML

On Wed, Jan 28, 2015 at 9:39 PM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
>
> Maybe, would we need a reference counter for each event? :)

when we would want multiple users to attach different programs
to the same event, then yes.
Right now I'd rather have things simple.

> Actually, ftrace event is not similar to perf-event which ktap
> is based on, ftrace event interface is always exported via
> debugfs, this means users can share the event for different
> usage.

yes. I've been thinking of extending perf_event ioctl to attach programs,
but right now it's only supporting tracepoints and kprobe
seems not trivial to add.
So I went for tracefs style of attaching for now.

> One possible other solution is to add a instance-lock interface
> for each ftrace instance and lock it by bpf. Then, other users
> can not enable/disable the events in the instance.

the user space can synchronize itself via flock. kernel doesn't
need to arbitrate. If one user process attached a program
that auto-enabled an event and another process did
'echo 0 > enable', it's fine. I think it's a feature instead of a bug.
Both users are root anyway.
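
For illustration, cooperating user processes could arbitrate with flock(2)
roughly like this. It is only a sketch: the lock file path is whatever the
tools agree on (an assumption, nothing in the patch set defines one), and
claim_event() is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take an exclusive advisory lock on 'lock_path' without
 * blocking. Returns an open fd holding the lock, or -1 if the file
 * cannot be opened or another holder already has the lock.
 * The lock is dropped when the fd is closed, including at process exit.
 */
int claim_event(const char *lock_path)
{
	int fd;

	assert(lock_path != NULL);
	fd = open(lock_path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The lock is advisory, so a process that does not call claim_event() can
still do 'echo 0 > enable', which matches the point above that the kernel
does not arbitrate.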

The more we talk about it, the more I like a new 'bpf' file
approach within tracefs (that I've mentioned in cover letter)
with auto-enable/disable logic to make it clear that
it's different from traditional global 'filter' file.

* Re: [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-01-29  4:40 Alexei Starovoitov
  2015-01-29  5:39 ` Masami Hiramatsu
  2015-01-29  6:41 ` Namhyung Kim
  0 siblings, 2 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-29  4:40 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML

On Wed, Jan 28, 2015 at 4:46 PM, Namhyung Kim <namhyung@kernel.org> wrote:
>>
>> +static int event_filter_release(struct inode *inode, struct file *filp)
>> +{
>> +     struct ftrace_event_file *file;
>> +     char buf[2] = "0";
>> +
>> +     mutex_lock(&event_mutex);
>> +     file = event_file_data(filp);
>> +     if (file) {
>> +             if (file->flags & TRACE_EVENT_FL_BPF) {
>> +                     /* auto-disable the filter */
>> +                     ftrace_event_enable_disable(file, 0);
>
> Hmm.. what if user already enabled an event, attached a bpf filter and
> then detached the filter - I'm not sure we can always auto-disable
> it..

why not?
I think it makes sense auto enable/disable, since that
is cleaner user experience.
Otherwise Ctrl-C of the user process will have bpf program dangling.
not good. If we auto-unload bpf program only, it's equally bad.
Since Ctrl-C of the process will auto-unload only
and will keep tracepoint enabled which will be spamming
the trace buffer.

>> +unsigned int trace_filter_call_bpf(struct event_filter *filter, void *ctx)
>> +{
>> +     unsigned int ret;
>> +
>> +     if (in_nmi()) /* not supported yet */
>> +             return 0;
>
> But doesn't this mean to auto-disable all attached events during NMI
> as returning 0 will prevent the event going to ring buffer?

well, it means that if tracepoint fired during nmi the program
won't be called and event won't be sent to trace buffer.
The program might be broken (like divide by zero) and
it will self-terminate with 'return 0'
so zero should be the safest return value that
causes minimum disturbance to the whole system overall.

> I think it'd be better to keep an attached event in a soft-disabled
> state like event trigger and give control of enabling to users..

I think it suffers from the same Ctrl-C issue.
Say, attaching bpf program activates tracepoint and keeps
it in soft-disabled. Then user space clears soft-disabled.
Then user Ctrl-C it. Now bpf program must auto-detach
and unload, since prog_fd is closing.
If we don't completely deactivate tracepoint, then
Ctrl-C will leave the system in a state
different from what it was before user process started running.
I think we must avoid such situation.
'kill pid' should be completely cleaning all resources
that user process was using.
Yes. It's different from typical usage of /sys/.../tracing
that has all global knobs, but, imo, it's cleaner this way.

* [PATCH v2 linux-trace 0/8] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-28  4:06 Alexei Starovoitov
  2015-01-28  4:06 ` [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-28  4:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, linux-api, netdev, linux-kernel

Hi Steven,

This patch set is for linux-trace/for-next
It adds ability to attach eBPF programs to tracepoints, syscalls and kprobes.

The programs are run after soft_disabled() check, but before trace_buffer
is allocated to have minimal impact on a system, which can be demonstrated
by 'dd if=/dev/zero of=/dev/null count=5000000' test:
1.19343 s, 2.1 GB/s - no tracing (raw baseline)
1.53301 s, 1.7 GB/s - echo 1 > enable
1.62742 s, 1.6 GB/s - echo cnt==1234 > filter
1.23418 s, 2.1 GB/s - attached bpf program does 'return 0'
1.25890 s, 2.0 GB/s - attached bpf program does 'map[log2(count)]++'
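
To read the timings above as relative overhead, a small helper like the
following can be used. overhead_pct() is a hypothetical name introduced for
this sketch; the inputs are the elapsed times listed above.

```c
#include <assert.h>

/* Relative slowdown of 'measured' versus 'baseline', in percent. */
double overhead_pct(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (measured - baseline) / baseline * 100.0;
}
```

With the figures above, overhead_pct(1.19343, 1.53301) is roughly 28% for
plain 'echo 1 > enable', while overhead_pct(1.19343, 1.23418) is roughly 3.4%
for the attached program that does 'return 0'.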

Though tracex1 example is an example of event/filter equivalent logic,
should we create a new file '/sys/.../tracing/events/.../bpf' and use
that for attaching instead of overloading 'filter' file meaning?
That will move bpf related logic out of trace_events_filter.c into new
file and we'll be able to use both bpf program as a 'pre filter' and
existing filter code that runs on allocated trace_buffer at the same time.
In this patch set bpf programs co-exist with TP_printk and triggers.

Anyway, resending with accumulated fixes:
V1->V2:
- dropped bpf_dump_stack() and bpf_printk() helpers,
  trigger 'stacktrace' can be used instead of bpf_dump_stack()
- disabled running programs in_nmi
- other minor cleanups

V1 cover letter:
----------------
Mechanism of attaching:
- load program via bpf() syscall and receive program_fd
- event_fd = open("/sys/kernel/debug/tracing/events/.../filter")
- write 'bpf-123' to event_fd where 123 is program_fd
- program will be attached to particular event and event automatically enabled
- close(event_fd) will detach bpf program from event and event disabled
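
The attach steps above could look roughly like this from user space. This is
a sketch of the interface as proposed in this cover letter only; prog_fd is
assumed to come from an earlier bpf() program load that is not shown, and
both function names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Format the "bpf-<fd>" command string; returns its length or -1. */
int format_attach_cmd(char *buf, size_t len, int prog_fd)
{
	int n;

	assert(buf != NULL && prog_fd >= 0);
	n = snprintf(buf, len, "bpf-%d", prog_fd);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Open the event's 'filter' file and write the attach command.
 * Returns the open event fd, or -1 on error. Per the steps above the
 * caller keeps the fd open while tracing, and close() detaches the
 * program and disables the event.
 */
int attach_prog(const char *filter_path, int prog_fd)
{
	char cmd[32];
	int n, event_fd;

	n = format_attach_cmd(cmd, sizeof(cmd), prog_fd);
	if (n < 0)
		return -1;
	event_fd = open(filter_path, O_WRONLY);
	if (event_fd < 0)
		return -1;
	if (write(event_fd, cmd, (size_t)n) != n) {
		close(event_fd);
		return -1;
	}
	return event_fd;
}
```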

Program attach point and input arguments:
- programs attached to kprobes receive 'struct pt_regs *' as an input.
  See tracex4_kern.c that demonstrates how users can write a C program like:
  SEC("events/kprobes/sys_write")
  int bpf_prog4(struct pt_regs *regs)
  {
     long write_size = regs->dx;
     // here the user needs to know the prototype of sys_write() from kernel
     // sources and the x64 calling convention to know that register $rdx
     // contains the 3rd argument to sys_write(), which is 'size_t count'

  it's obviously architecture dependent, but allows building sophisticated
  user tools on top, that can see from debug info of vmlinux which variables
  are in which registers or stack locations and fetch it from there.
  'perf probe' can potentially use this hook to generate programs in user space
  and insert them instead of letting kernel parse string during kprobe creation.

- programs attached to tracepoints and syscalls receive 'struct bpf_context *':
  u64 arg1, arg2, ..., arg6;
  for syscalls they match syscall arguments.
  for tracepoints these args match arguments passed to tracepoint.
  For example:
  trace_sched_migrate_task(p, new_cpu); from sched/core.c
  arg1 <- p        which is 'struct task_struct *'
  arg2 <- new_cpu  which is 'unsigned int'
  arg3..arg6 = 0
  the program can use bpf_fetch_u8/16/32/64/ptr() helpers to walk 'task_struct'
  or any other kernel data structures.
  These helpers are using probe_kernel_read() similar to 'perf probe' which is
  not 100% safe in both cases, but good enough.
  To access task_struct's pid inside 'sched_migrate_task' tracepoint
  the program can do:
  struct task_struct *task = (struct task_struct *)ctx->arg1;
  u32 pid = bpf_fetch_u32(&task->pid);
  Since struct layout is kernel configuration specific such programs are not
  portable and require access to kernel headers to be compiled,
  but in this case we don't need debug info.
  llvm with bpf backend will statically compute task->pid offset as a constant
  based on kernel headers only.
  The example of this arbitrary pointer walking is tracex1_kern.c
  which does skb->dev->name == "lo" filtering.
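
The tracepoint case above (ctx->arg1 cast to a task pointer, then a fetch of
task->pid at a compile-time offset) can be modeled in user space as below.
All types here are hypothetical stand-ins: the real task_struct layout is
configuration specific, and the kernel helper is described as wrapping
probe_kernel_read() rather than memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for task_struct; not the kernel layout. */
struct task_model {
	long state;
	int pid;
};

/* Mirrors the 'u64 arg1, ..., arg6' context described above. */
struct ctx_model {
	uint64_t arg1, arg2, arg3, arg4, arg5, arg6;
};

/*
 * Stand-in for bpf_fetch_u32(): copy 4 bytes from an arbitrary address.
 * memcpy() cannot fail safely on a bad pointer, so this only models
 * the data flow, not the fault handling.
 */
uint32_t fetch_u32_model(const void *addr)
{
	uint32_t v;

	assert(addr != NULL);
	memcpy(&v, addr, sizeof(v));
	return v;
}

/* Equivalent of: task = (struct task_struct *)ctx->arg1; fetch(&task->pid) */
uint32_t ctx_task_pid(const struct ctx_model *ctx)
{
	const struct task_model *task =
		(const struct task_model *)(uintptr_t)ctx->arg1;

	/*
	 * &task->pid is task + offsetof(struct task_model, pid), a
	 * constant the compiler derives from the struct definition alone.
	 */
	return fetch_u32_model(&task->pid);
}
```

No debug info is involved: the offset comes from the header that defines the
struct, which is also why such a program only matches kernels built with the
same layout.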

In all cases the programs are called before trace buffer is allocated to
minimize the overhead, since we want to filter huge number of events, but
buffer alloc/free and argument copy for every event is too costly.
Theoretically we can invoke programs after buffer is allocated, but it
doesn't seem needed, since above approach is faster and achieves the same.

Note, tracepoint/syscall and kprobe programs are two different types:
BPF_PROG_TYPE_TRACING_FILTER and BPF_PROG_TYPE_KPROBE_FILTER,
since they expect different input.
Both use the same set of helper functions:
- map access (lookup/update/delete)
- fetch (probe_kernel_read wrappers)
- memcmp (probe_kernel_read + memcmp)
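
To illustrate why the kprobe input is architecture dependent, here is a
sketch of mapping an argument index to a register under the x86-64 System V
function calling convention. struct regs_model is a hypothetical subset of
pt_regs, and nth_arg() is an illustrative helper, not one of the helpers
listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of x86-64 pt_regs: argument registers only. */
struct regs_model {
	unsigned long di, si, dx, cx, r8, r9;
};

/*
 * Fetch the n-th (1-based) integer argument of a function called with
 * the x86-64 System V convention: rdi, rsi, rdx, rcx, r8, r9.
 * Returns 0 on success, -1 if n is out of range. Arguments past the
 * sixth live on the stack and are not modeled here.
 */
int nth_arg(const struct regs_model *regs, int n, unsigned long *out)
{
	assert(regs != NULL && out != NULL);
	switch (n) {
	case 1: *out = regs->di; return 0;
	case 2: *out = regs->si; return 0;
	case 3: *out = regs->dx; return 0;
	case 4: *out = regs->cx; return 0;
	case 5: *out = regs->r8; return 0;
	case 6: *out = regs->r9; return 0;
	default: return -1;
	}
}
```

For sys_write(fd, buf, count) the third argument is in dx, which is what the
tracex4 fragment in the cover letter reads. On another architecture the table
would be different.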

Portability:
- kprobe programs are architecture dependent and need user scripting
  language like ktap/stap/dtrace/perf that will dynamically generate
  them based on debug info in vmlinux
- tracepoint programs are architecture independent, but if arbitrary pointer
  walking (with fetch() helpers) is used, they need data struct layout to match.
  Debug info is not necessary
- for networking use case we need to access 'struct sk_buff' fields in portable
  way (user space needs to fetch packet length without knowing skb->len offset),
  so for some frequently used data structures we will add helper functions
  or pseudo instructions to access them. I've hacked few ways specifically
  for skb, but abandoned them in favor of more generic type/field infra.
  That work is still wip. Not part of this set.
  Once it's ready tracepoint programs that access common data structs
  will be kernel independent.

Program return value:
- programs return 0 to discard an event
- and return non-zero to proceed with event (allocate trace buffer, copy
  arguments there and print it eventually in trace_pipe in traditional way)
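
The ordering described above (soft-disabled check, then the program, then
buffer allocation only for events that survive) can be modeled like this.
It is a user-space sketch with hypothetical names; the counter stands in for
trace buffer allocation and argument copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int (*prog_fn)(void *ctx);

struct event_path_model {
	bool soft_disabled;
	prog_fn prog;                     /* NULL when nothing is attached */
	unsigned long buffers_allocated;  /* counts the costly step */
};

/* Returns true if the event was recorded. */
bool fire_event(struct event_path_model *ev, void *ctx)
{
	assert(ev != NULL);
	if (ev->soft_disabled)
		return false;
	if (ev->prog && ev->prog(ctx) == 0)
		return false;             /* discarded before any allocation */
	ev->buffers_allocated++;          /* buffer alloc + argument copy */
	return true;
}

/* Example program: keep only events whose integer payload is even. */
unsigned int keep_even(void *ctx)
{
	return (*(int *)ctx % 2) == 0;
}
```

Events the program rejects never reach the allocation step, which is where
the savings in the dd numbers are expected to come from.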

Examples:
- dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
  to dropmon tool
- tracex1_kern.c - does net/netif_receive_skb event filtering
  for skb->dev->name == "lo" condition
- tracex2_kern.c - same kfree_skb() accounting like dropmon, but now in C
  plus computes histogram of all write sizes from sys_write syscall
  and prints the histogram in userspace
- tracex3_kern.c - most sophisticated example that computes IO latency
  between block/block_rq_issue and block/block_rq_complete events
  and prints 'heatmap' using gray shades of text terminal.
  Useful to analyze disk performance.
- tracex4_kern.c - computes histogram of write sizes from sys_write syscall
  using kprobe mechanism instead of syscall. Since kprobe is optimized into
  ftrace the overhead of instrumentation is smaller than in example 2.
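
The write-size histogram used by tracex2 and tracex4 amounts to
'map[log2(count)]++'. A plain-array sketch of that bucketing follows; it
assumes a floor(log2) bucket and uses a fixed array where the samples would
use a bpf map, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 64

/* floor(log2(v)) for v > 0; returns 0 for v == 0 as a convention here. */
unsigned int log2_bucket(unsigned long v)
{
	unsigned int r = 0;

	while ((v >>= 1) != 0)
		r++;
	return r;
}

/* Equivalent of map[log2(count)]++ on a plain array. */
void hist_add(unsigned long hist[NBUCKETS], unsigned long count)
{
	unsigned int b = log2_bucket(count);

	assert(b < NBUCKETS);
	hist[b]++;
}
```

Aggregating into buckets like this inside the kernel means only the bucket
counters have to be read back by user space, not every event.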

The user space tools like ktap/dtrace/systemtap/perf that have access
to debug info would probably want to use kprobe attachment point, since kprobe
can be inserted anywhere and all registers are available in the program.
tracepoint attachments are useful without debug info, so standalone tools
like iosnoop will use them.

The main difference vs existing perf_probe/ftrace infra is in kernel aggregation
and conditional walking of arbitrary data structures.

Thanks!

Alexei Starovoitov (8):
  tracing: attach eBPF programs to tracepoints and syscalls
  tracing: allow eBPF programs to call ktime_get_ns()
  samples: bpf: simple tracing example in eBPF assembler
  samples: bpf: simple tracing example in C
  samples: bpf: counting example for kfree_skb tracepoint and write
    syscall
  samples: bpf: IO latency analysis (iosnoop/heatmap)
  tracing: attach eBPF programs to kprobe/kretprobe
  samples: bpf: simple kprobe example

 include/linux/ftrace_event.h       |    6 ++
 include/trace/bpf_trace.h          |   25 +++++
 include/trace/ftrace.h             |   29 ++++++
 include/uapi/linux/bpf.h           |    9 ++
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace.c           |  178 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h               |    3 +
 kernel/trace/trace_events.c        |   33 ++++++-
 kernel/trace/trace_events_filter.c |   83 ++++++++++++++++-
 kernel/trace/trace_kprobe.c        |   11 ++-
 kernel/trace/trace_syscalls.c      |   31 +++++++
 samples/bpf/Makefile               |   18 ++++
 samples/bpf/bpf_helpers.h          |   14 +++
 samples/bpf/bpf_load.c             |   62 +++++++++++--
 samples/bpf/bpf_load.h             |    3 +
 samples/bpf/dropmon.c              |  129 ++++++++++++++++++++++++++
 samples/bpf/tracex1_kern.c         |   28 ++++++
 samples/bpf/tracex1_user.c         |   24 +++++
 samples/bpf/tracex2_kern.c         |   71 ++++++++++++++
 samples/bpf/tracex2_user.c         |   95 +++++++++++++++++++
 samples/bpf/tracex3_kern.c         |   92 +++++++++++++++++++
 samples/bpf/tracex3_user.c         |  150 ++++++++++++++++++++++++++++++
 samples/bpf/tracex4_kern.c         |   36 ++++++++
 samples/bpf/tracex4_user.c         |   83 +++++++++++++++++
 25 files changed, 1206 insertions(+), 9 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c
 create mode 100644 samples/bpf/dropmon.c
 create mode 100644 samples/bpf/tracex1_kern.c
 create mode 100644 samples/bpf/tracex1_user.c
 create mode 100644 samples/bpf/tracex2_kern.c
 create mode 100644 samples/bpf/tracex2_user.c
 create mode 100644 samples/bpf/tracex3_kern.c
 create mode 100644 samples/bpf/tracex3_user.c
 create mode 100644 samples/bpf/tracex4_kern.c
 create mode 100644 samples/bpf/tracex4_user.c

-- 
1.7.9.5



end of thread, other threads:[~2015-01-29 18:55 UTC | newest]

Thread overview: 9+ messages
2015-01-29 18:55 [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov
  -- strict thread matches above, loose matches on Subject: below --
2015-01-29  7:04 Alexei Starovoitov
2015-01-29 12:35 ` Namhyung Kim
2015-01-29  6:39 Alexei Starovoitov
2015-01-29  4:40 Alexei Starovoitov
2015-01-29  5:39 ` Masami Hiramatsu
2015-01-29  6:41 ` Namhyung Kim
2015-01-28  4:06 [PATCH v2 linux-trace 0/8] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
2015-01-28  4:06 ` [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov
2015-01-29  0:46   ` Namhyung Kim
