* [PATCH v3 linux-trace 0/8] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-02-10  3:45 ` Alexei Starovoitov
From: Alexei Starovoitov @ 2015-02-10  3:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, linux-api, netdev, linux-kernel

Hi Steven,

This patch set is for linux-trace/for-next.
It adds the ability to attach eBPF programs to tracepoints, syscalls and kprobes.
Obviously too late for 3.20, but please review. I'll rebase and repost when
the merge window closes.

The main difference in V3 is a different attach mechanism:
- load program via bpf() syscall and receive prog_fd
- event_fd = perf_event_open()
- ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) to attach program to event
- close(event_fd) will destroy event and detach the program
The kernel diff became smaller and in general this approach is cleaner
(thanks to Masami and Namhyung for suggesting it)
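
From user space the attach steps could look roughly like the sketch below.
It is not code from this series. It assumes Linux headers that provide
PERF_EVENT_IOC_SET_BPF in <linux/perf_event.h>, that prog_fd came from an
earlier bpf() program load, and that event_id was read from the event's
'id' file in tracefs. attach_prog_to_event() is a made-up helper name.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: attach an already loaded program (prog_fd) to the
 * event with the given tracefs id.
 * Returns event_fd on success, -1 on failure. */
static int attach_prog_to_event(int prog_fd, unsigned long long event_id)
{
	struct perf_event_attr attr;
	int event_fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.size = sizeof(attr);
	attr.config = event_id;
	attr.sample_type = PERF_SAMPLE_RAW;
	attr.sample_period = 1;
	attr.wakeup_events = 1;

	/* event_fd = perf_event_open(): all tasks (pid -1) on cpu 0 */
	event_fd = (int)syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (event_fd < 0)
		return -1;

	/* attach the program to the event */
	if (ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
		close(event_fd);	/* destroys the event again */
		return -1;
	}

	/* caller detaches later with close(event_fd) */
	return event_fd;
}
```

Since the program's lifetime is tied to event_fd, a tool that exits or
crashes leaves nothing attached behind.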

The programs are run before the ring buffer is allocated to have minimal
impact on the system, which can be demonstrated by the
'dd if=/dev/zero of=/dev/null count=20000000' test:
4.80074 s, 2.1 GB/s - no tracing (raw base line)
5.62705 s, 1.8 GB/s - attached bpf program does 'map[log2(count)]++' without JIT
5.05963 s, 2.0 GB/s - attached bpf program does 'map[log2(count)]++' with JIT
4.91715 s, 2.1 GB/s - attached bpf program does 'return 0'

perf record -e skb:sys_write dd if=/dev/zero of=/dev/null count=20000000
8.75686 s, 1.2 GB/s
Warning: Processed 20355236 events and lost 44 chunks!

perf record -e skb:sys_write --filter cnt==1234 dd if=/dev/zero of=/dev/null count=20000000
5.69732 s, 1.8 GB/s

6.13730 s, 1.7 GB/s - echo 1 > /sys/../events/skb/sys_write/enable
6.50091 s, 1.6 GB/s - echo 'cnt == 1234' > /sys/../events/skb/sys_write/filter

(skb:sys_write is a temporary tracepoint in write() syscall)

So the overhead of a realistic bpf program is 5.05963/4.80074 = ~5%,
which is lower than perf_event filtering: 5.69732/4.80074 = ~19%
or ftrace filtering: 6.50091/4.80074 = ~35%
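
For reference, the ratios above are plain 'traced time over baseline time'.
A tiny helper (overhead_pct() is just an illustrative name) makes the
arithmetic explicit. With the dd timings quoted above it gives about 5.4%
for the JITed bpf program, 18.7% for the perf filter and 35.4% for the
ftrace filter.

```c
#include <assert.h>

/* Relative slowdown in percent: (traced / baseline - 1) * 100. */
static double overhead_pct(double traced_sec, double baseline_sec)
{
	assert(baseline_sec > 0.0);
	return (traced_sec / baseline_sec - 1.0) * 100.0;
}
```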

V2->V3:
- changed program attach interface from tracefs into perf_event ioctl
- rewrote user space helpers to use perf_events
- rewrote tracex1 example to use mmap-ed ring_buffer instead of trace_pipe
- as suggested by Arnaldo renamed bpf_memcmp to bpf_probe_memcmp to better
  indicate function logic
- added ifdefs to make bpf check a nop when CONFIG_BPF_SYSCALL is not set

V1->V2:
- dropped bpf_dump_stack() and bpf_printk() helpers
- disabled running programs in_nmi
- other minor cleanups

Program attach point and input arguments:
- programs attached to kprobes receive 'struct pt_regs *' as an input.
  See tracex4_kern.c that demonstrates how users can write a C program like:
  SEC("events/kprobes/sys_write")
  int bpf_prog4(struct pt_regs *regs)
  {
     long write_size = regs->dx;
     // here the user needs to know the prototype of sys_write() from kernel
     // sources and the x64 calling convention to know that register $rdx
     // contains the 3rd argument to sys_write(), which is 'size_t count'
     ...
  }

  This is obviously architecture dependent, but it allows building sophisticated
  user tools on top that can see from the debug info of vmlinux which variables
  are in which registers or stack locations and fetch them from there.
  'perf probe' can potentially use this hook to generate programs in user space
  and insert them instead of letting the kernel parse a string during kprobe creation.

- programs attached to tracepoints and syscalls receive 'struct bpf_context *':
  u64 arg1, arg2, ..., arg6;
  for syscalls they match syscall arguments.
  for tracepoints these args match arguments passed to tracepoint.
  For example:
  trace_sched_migrate_task(p, new_cpu); from sched/core.c
  arg1 <- p        which is 'struct task_struct *'
  arg2 <- new_cpu  which is 'unsigned int'
  arg3..arg6 = 0
  the program can use bpf_fetch_u8/16/32/64/ptr() helpers to walk 'task_struct'
  or any other kernel data structures.
  These helpers are using probe_kernel_read() similar to 'perf probe' which is
  not 100% safe in both cases, but good enough.
  To access task_struct's pid inside 'sched_migrate_task' tracepoint
  the program can do:
  struct task_struct *task = (struct task_struct *)ctx->arg1;
  u32 pid = bpf_fetch_u32(&task->pid);
  Since struct layout is kernel configuration specific such programs are not
  portable and require access to kernel headers to be compiled,
  but in this case we don't need debug info.
  llvm with bpf backend will statically compute task->pid offset as a constant
  based on kernel headers only.
  The example of this arbitrary pointer walking is tracex1_kern.c
  which does skb->dev->name == "lo" filtering.
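
To make the tracepoint case concrete, here is a user-space model of the
sched_migrate_task example above. It is only a sketch: task_struct_model
is a made-up three-field stand-in for the real task_struct, and
model_fetch_u32() imitates bpf_fetch_u32() with a plain memcpy(), whereas
the kernel helper goes through probe_kernel_read().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Six u64 arguments, as described for 'struct bpf_context'. */
struct bpf_context {
	uint64_t arg1, arg2, arg3, arg4, arg5, arg6;
};

/* Hypothetical stand-in for the kernel's task_struct. */
struct task_struct_model {
	long state;
	int prio;
	int pid;
};

/* Model of bpf_fetch_u32(): a fixed-size copy from an arbitrary address. */
static uint32_t model_fetch_u32(const void *unsafe_ptr)
{
	uint32_t val;

	memcpy(&val, unsafe_ptr, sizeof(val));
	return val;
}

/* Program body for sched_migrate_task: arg1 carries the task pointer. */
static uint32_t migrate_task_pid(const struct bpf_context *ctx)
{
	const struct task_struct_model *task =
		(const struct task_struct_model *)(uintptr_t)ctx->arg1;

	assert(task != NULL);
	/* &task->pid is task + offsetof(..., pid), a compile-time constant */
	return model_fetch_u32(&task->pid);
}
```

The offset of 'pid' is baked in when the program is compiled, which is
why such a program only matches kernels with the same struct layout.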

In all cases the programs are called before ring buffer is allocated to
minimize the overhead, since we want to filter huge number of events, but
perf_trace_buf_prepare/submit and argument copy for every event is too costly.

Note, tracepoint/syscall and kprobe programs are two different types:
BPF_PROG_TYPE_TRACEPOINT and BPF_PROG_TYPE_KPROBE,
since they expect different input.
Both use the same set of helper functions:
- map access (lookup/update/delete)
- fetch (probe_kernel_read wrappers)
- probe_memcmp (probe_kernel_read + memcmp)
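
As an illustration of how the fetch and probe_memcmp helpers combine, the
skb->dev->name == "lo" filter can be modelled in user space as below.
This is a sketch under assumptions: the two structs are reduced,
hypothetical stand-ins for sk_buff and net_device, and both helpers are
imitated with memcpy() instead of probe_kernel_read().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structs. */
struct net_device_model { char name[16]; };
struct sk_buff_model { struct net_device_model *dev; };

/* Model of the pointer fetch helper: copy one pointer-sized value. */
static const void *model_fetch_ptr(const void *unsafe_ptr)
{
	const void *val;

	memcpy(&val, unsafe_ptr, sizeof(val));
	return val;
}

/* Model of probe_memcmp: copy into a local buffer first, then memcmp.
 * Returns 0 on match; sizes above the local buffer are refused. */
static int model_probe_memcmp(const void *unsafe_ptr, const void *safe_ptr,
			      size_t size)
{
	char buf[64];

	if (size > sizeof(buf))
		return -1;
	memcpy(buf, unsafe_ptr, size);
	return memcmp(buf, safe_ptr, size);
}

/* Filter: non-zero only for packets whose device is named "lo". */
static int filter_lo(const struct sk_buff_model *skb)
{
	static const char devname[] = "lo";
	const struct net_device_model *dev = model_fetch_ptr(&skb->dev);

	/* sizeof includes the trailing NUL, so "lo0" does not match */
	return model_probe_memcmp(dev->name, devname, sizeof(devname)) == 0;
}
```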

Portability:
- kprobe programs are architecture dependent and need user scripting
  language like ktap/stap/dtrace/perf that will dynamically generate
  them based on debug info in vmlinux
- tracepoint programs are architecture independent, but if arbitrary pointer
  walking (with fetch() helpers) is used, they need data struct layout to match.
  Debug info is not necessary
- for the networking use case we need to access 'struct sk_buff' fields in a portable
  way (user space needs to fetch packet length without knowing the layout of sk_buff),
  so for some frequently used data structures there will be a way to access them
  efficiently without bpf_fetch* helpers. Once it's ready, tracepoint programs
  that access common data structs will be kernel independent.

Program return value:
- programs return 0 to discard an event
- and return non-zero to proceed with event (get ring buffer, copy
  arguments there and pass to user space via mmap-ed area)
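
The return value convention can be modelled like this. It is a sketch
only: ring_model and trace_event_model() are invented names, and the
counter increment stands in for the real 'reserve buffer, copy arguments,
submit' path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bpf_context {
	uint64_t arg1, arg2, arg3, arg4, arg5, arg6;
};

typedef unsigned int (*prog_fn)(const struct bpf_context *ctx);

/* Stand-in for the ring buffer: only counts events that reach it. */
struct ring_model {
	unsigned long events;
};

/* Run the program first; pay for the buffer only if it says so. */
static void trace_event_model(prog_fn prog, const struct bpf_context *ctx,
			      struct ring_model *rb)
{
	assert(ctx != NULL && rb != NULL);
	if (prog && prog(ctx) == 0)
		return;		/* discarded: nothing reserved or copied */
	rb->events++;		/* reserve + copy + submit would go here */
}

/* Example program for write(): keep only events with count == 1234. */
static unsigned int only_count_1234(const struct bpf_context *ctx)
{
	return ctx->arg3 == 1234;
}
```

With no program attached (prog == NULL) every event is recorded, which
matches the plain tracing behaviour.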

Examples:
- dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
  to dropmon tool
- tracex1_kern.c - does net/netif_receive_skb event filtering
  for skb->dev->name == "lo" condition
  tracex1_user.c - receives PERF_SAMPLE_RAW events into mmap-ed buffer and
  prints them
- tracex2_kern.c - same kfree_skb() accounting like dropmon, but now in C
  plus computes histogram of all write sizes from sys_write syscall
  and prints the histogram in userspace
- tracex3_kern.c - most sophisticated example that computes IO latency
  between block/block_rq_issue and block/block_rq_complete events
  and prints 'heatmap' using gray shades of text terminal.
  Useful to analyze disk performance.
- tracex4_kern.c - computes histogram of write sizes from sys_write syscall
  using kprobe mechanism instead of syscall. Since kprobe is optimized into
  ftrace the overhead of instrumentation is smaller than in example 2.
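
The write-size histogram in the examples above is the 'map[log2(count)]++'
pattern. A user-space sketch of it follows. This is not the code from the
samples: a plain array stands in for the bpf map, and log2_slot() and
hist_add() are illustrative names. The loop is there for readability; as
far as I know the verifier rejects loops, so an in-kernel program would
need a loop-free form.

```c
#include <assert.h>
#include <stdint.h>

#define HIST_SLOTS 64

/* floor(log2(v)); v == 0 is mapped to slot 0 as well. */
static unsigned int log2_slot(uint64_t v)
{
	unsigned int slot = 0;

	while (v >>= 1)
		slot++;
	return slot;
}

/* Plain array standing in for a bpf map keyed by slot index. */
static void hist_add(uint64_t hist[HIST_SLOTS], uint64_t write_size)
{
	unsigned int slot = log2_slot(write_size);

	assert(slot < HIST_SLOTS);
	hist[slot]++;
}
```

User space then only has to read 64 counters instead of one event per
write().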

User space tools like ktap/dtrace/systemtap/perf that have access
to debug info would probably want to use the kprobe attachment point, since a kprobe
can be inserted anywhere and all registers are available in the program.
Tracepoint attachments are useful without debug info, so standalone tools
like iosnoop will use them.

The main difference vs existing perf_probe/ftrace infra is in kernel aggregation
and conditional walking of arbitrary data structures.

Thanks!

Alexei Starovoitov (8):
  tracing: attach eBPF programs to tracepoints and syscalls
  tracing: allow eBPF programs to call ktime_get_ns()
  samples: bpf: simple tracing example in eBPF assembler
  samples: bpf: simple tracing example in C
  samples: bpf: counting example for kfree_skb tracepoint and write
    syscall
  samples: bpf: IO latency analysis (iosnoop/heatmap)
  tracing: attach eBPF programs to kprobe/kretprobe
  samples: bpf: simple kprobe example

 include/linux/bpf.h             |    6 +-
 include/linux/ftrace_event.h    |   14 +++
 include/trace/bpf_trace.h       |   25 +++++
 include/trace/ftrace.h          |   31 +++++++
 include/uapi/linux/bpf.h        |    9 ++
 include/uapi/linux/perf_event.h |    1 +
 kernel/events/core.c            |   58 ++++++++++++
 kernel/trace/Makefile           |    1 +
 kernel/trace/bpf_trace.c        |  194 +++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_kprobe.c     |   10 +-
 kernel/trace/trace_syscalls.c   |   35 +++++++
 samples/bpf/Makefile            |   18 ++++
 samples/bpf/bpf_helpers.h       |   14 +++
 samples/bpf/bpf_load.c          |  136 +++++++++++++++++++++++++--
 samples/bpf/bpf_load.h          |   12 +++
 samples/bpf/dropmon.c           |  143 +++++++++++++++++++++++++++++
 samples/bpf/libbpf.c            |    7 ++
 samples/bpf/libbpf.h            |    4 +
 samples/bpf/tracex1_kern.c      |   28 ++++++
 samples/bpf/tracex1_user.c      |   50 ++++++++++
 samples/bpf/tracex2_kern.c      |   71 ++++++++++++++
 samples/bpf/tracex2_user.c      |   95 +++++++++++++++++++
 samples/bpf/tracex3_kern.c      |   98 ++++++++++++++++++++
 samples/bpf/tracex3_user.c      |  152 ++++++++++++++++++++++++++++++
 samples/bpf/tracex4_kern.c      |   36 ++++++++
 samples/bpf/tracex4_user.c      |   83 +++++++++++++++++
 26 files changed, 1321 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c
 create mode 100644 samples/bpf/dropmon.c
 create mode 100644 samples/bpf/tracex1_kern.c
 create mode 100644 samples/bpf/tracex1_user.c
 create mode 100644 samples/bpf/tracex2_kern.c
 create mode 100644 samples/bpf/tracex2_user.c
 create mode 100644 samples/bpf/tracex3_kern.c
 create mode 100644 samples/bpf/tracex3_user.c
 create mode 100644 samples/bpf/tracex4_kern.c
 create mode 100644 samples/bpf/tracex4_user.c

-- 
1.7.9.5


* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-10  5:51 ` Alexei Starovoitov
From: Alexei Starovoitov @ 2015-02-10  5:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML

On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Looks like this is entirely perf based and does not interact with
> ftrace at all. In other words, it's perf not tracing.
>
> It makes more sense to go through tip than the tracing tree.

well, all of the earlier series were based on ftrace only,
but I was given convincing enough arguments that
perf_event_open+ioctl is a better interface :)
Ok. will rebase on tip in the next version.

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-10  6:10 ` Alexei Starovoitov
From: Alexei Starovoitov @ 2015-02-10  6:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML

On Mon, Feb 9, 2015 at 9:13 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>                                                                       \
>> +     if (prog) {                                                     \
>> +             __maybe_unused const u64 z = 0;                         \
>> +             struct bpf_context __ctx = ((struct bpf_context) {      \
>> +                             __BPF_CAST6(args, z, z, z, z, z)        \
>
> Note, there is no guarantee that args is at most 6. For example, in
> drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the
> trace_event brcms_txstatus has 8 args.
>
> But I guess that's OK if you do not need those last args, right?

yeah, some tracepoints pass a lot of things.
That's rare and in most of the cases they can be fetched
from parent structure.
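
To illustrate the 'first six args, zero padded' behaviour outside the
kernel, here is a small stand-alone sketch. It is not the __BPF_CAST6
macro from the patch, just one way to get the same effect with a variadic
macro. FIRST6 and CTX_INIT are made-up names.

```c
#include <assert.h>
#include <stdint.h>

struct bpf_context {
	uint64_t arg1, arg2, arg3, arg4, arg5, arg6;
};

/* Keep the first six arguments and ignore whatever follows. */
#define FIRST6(a, b, c, d, e, f, ...) \
	{ (uint64_t)(a), (uint64_t)(b), (uint64_t)(c), \
	  (uint64_t)(d), (uint64_t)(e), (uint64_t)(f) }

/* Append six zeros so short argument lists are padded; needs >= 1 arg.
 * Integer arguments only in this sketch (pointers would need a cast). */
#define CTX_INIT(...) FIRST6(__VA_ARGS__, 0, 0, 0, 0, 0, 0)
```

So 'struct bpf_context ctx = CTX_INIT(a, b);' leaves arg3..arg6 at zero,
and an 8-argument tracepoint silently loses its last two arguments.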

> I'm nervous about showing args of tracepoints too, because we don't want
> that to become a strict ABI either.

One can argue that current TP_printk format is already an ABI,
because somebody might be parsing the text output.
so in some cases we cannot change tracepoints without
somebody complaining that his tool broke.
In other cases tracepoints are used for debugging only
and no one will notice when they change...
It was and still is a grey area.
bpf doesn't change any of that.
It actually makes addition of new tracepoints easier.
In the future we might add a tracepoint and pass a single
pointer to interesting data struct to it. bpf programs will walk
data structures 'as safe modules' via bpf_fetch*() methods
without exposing it as ABI.
whereas today we pass a lot of fields to tracepoints and
make all of these fields immutable.

To me tracepoints are like gdb breakpoints.
and bpf programs like live debugger that examine things.

the next step is to be able to write bpf scripts on the fly
without leaving debugger. Something like perf probe +
editor + live execution. Truly like gdb for kernel.
while kernel is running.

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-10 19:53 Alexei Starovoitov
From: Alexei Starovoitov @ 2015-02-10 19:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds

On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 9 Feb 2015 22:10:45 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> One can argue that current TP_printk format is already an ABI,
>> because somebody might be parsing the text output.
>
> If somebody does, then it is an ABI. Luckily, it's not that useful to
> parse, thus it hasn't been an issue. As Linus has stated in the past,
> it's not that we can't change ABI interfaces, its just that we can not
> change them if there's a user space application that depends on it.

there are already tools that parse trace_pipe:
https://github.com/brendangregg/perf-tools

> and expect some events to have specific fields. Now we can add new
> fields, or even remove fields that no user space tool is using. This is
> because today, tools use libtraceevent to parse the event data.

not all tools use libtraceevent.
gdb calls perf_event_open directly:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
and parses PERF_RECORD_SAMPLE as a binary.
In this case it's branch records, but I think we never said anywhere
that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
in this particular order.

> This is why I'm nervous about exporting the parameters of the trace
> event call. Right now, those parameters can always change, because the
> only way to know they exist is by looking at the code. And currently,
> there's no way to interact with those parameters. Once we have eBPF in
> mainline, we now have a way to interact with the parameters and if
> those parameters change, then the eBPF program will break, and if eBPF
> can be part of a user space tool, that will break that tool and
> whatever change in the trace point that caused this breakage would have
> to be reverted. IOW, this can limit development in the kernel.

it can limit development unless we say that bpf programs
that attach to tracepoints are not part of ABI.
Easy enough to add large comment similar to perf_event.h

> Al Viro currently does not let any tracepoint in VFS because he doesn't
> want the internals of that code locked to an ABI. He's right to be
> worried.

Same with networking bits. We don't want tracepoints to limit
kernel development, but we want debuggability and kernel
analytics.
All existing tracepoints defined via DEFINE_EVENT should
not be an ABI.
But some maintainers think of them as ABI, whereas others
are using them freely. imo it's time to remove ambiguity.

The idea for new style of tracepoints is the following:
introduce new macro: DEFINE_EVENT_USER
and let programs attach to them.
These tracepoint will receive one or two pointers to important
structs only. They will not have TP_printk, assign and fields.
The placement and arguments to these new tracepoints
will be an ABI.
All existing tracepoints are not.

The main reason to attach to tracepoint is that they are
accessible without debug info (unlike kprobe)
Another reason is speed. tracepoints are much faster than
optimized kprobes and for real-time analytics the speed
is critical.

The position of new tracepoints and their arguments
will be an ABI, and the programs attached to them can be of both kinds.
If program is using bpf_fetch*() helpers it obviously
wants to access internal data structures, so
it's really nothing more, but 'safe kernel module'
and kernel layout specific.
Both old and new tracepoints + programs will be used
for live kernel debugging.

If program is accessing user-ized data structures then
it is portable and will run on any kernel.
In uapi header we can define:
struct task_struct_user {
  int pid;
  int prio;
};
and let bpf programs access it via real 'struct task_struct *'
pointer passed into tracepoint.
bpf loader will translate offsets and sizes used inside
the program into real task_struct's offsets and loads.
(all structs are read-only of course)
programs will be fast and kernel independent.
They will be used for analytics (latency, etc)
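
A rough model of that offset translation, to show what I mean. This is
only a sketch of the idea, not an existing loader feature:
task_struct_model is an invented internal layout, and field_map,
translate_off() and load_int() are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The stable view proposed above. */
struct task_struct_user {
	int pid;
	int prio;
};

/* Hypothetical internal layout; free to change between kernels. */
struct task_struct_model {
	long state;
	int prio;
	void *stack;
	int pid;
};

struct field_map {
	size_t user_off;
	size_t real_off;
	size_t size;
};

static const struct field_map task_map[] = {
	{ offsetof(struct task_struct_user, pid),
	  offsetof(struct task_struct_model, pid), sizeof(int) },
	{ offsetof(struct task_struct_user, prio),
	  offsetof(struct task_struct_model, prio), sizeof(int) },
};

/* Map an access to the _user struct onto the real struct.
 * Returns -1 for accesses that do not hit an exported field. */
static long translate_off(size_t user_off, size_t size)
{
	size_t i;

	for (i = 0; i < sizeof(task_map) / sizeof(task_map[0]); i++)
		if (task_map[i].user_off == user_off &&
		    task_map[i].size == size)
			return (long)task_map[i].real_off;
	return -1;
}

/* Read-only load through the translated offset. */
static int load_int(const struct task_struct_model *task, size_t user_off)
{
	long real = translate_off(user_off, sizeof(int));
	int val;

	assert(real >= 0);
	memcpy(&val, (const char *)task + real, sizeof(val));
	return val;
}
```

Only the table has to follow the kernel when the real struct is
reshuffled; the program keeps using the _user offsets.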

>> so in some cases we cannot change tracepoints without
>> somebody complaining that his tool broke.
>> In other cases tracepoints are used for debugging only
>> and no one will notice when they change...
>> It was and still a grey area.
>
> Not really. If a tool uses the tracepoint, it can lock that tracepoint
> down. This is exactly what latencytop did. It happened, it's not a
> hypothetical situation.

correct.

>> bpf doesn't change any of that.
>> It actually makes addition of new tracepoints easier.
>
> I totally disagree. It adds more ways to see inside the kernel, and if
> user space depends on this, it adds more ways the kernel can not change.
>
> It comes down to how robust eBPF is with the internals of the kernel
> changing. If we limit eBPF to system call tracepoints only, that's
> fine because those have the same ABI as the system call itself. I'm
> worried about the internal tracepoints for scheduling, irqs, file
> systems, etc.

agree. we need to make it clear that existing tracepoints
+ programs is not ABI.

>> In the future we might add a tracepoint and pass a single
>> pointer to interesting data struct to it. bpf programs will walk
>> data structures 'as safe modules' via bpf_fetch*() methods
>> without exposing it as ABI.
>
> Will this work if that structure changes? When the field we are looking
> for no longer exists?

bpf_fetch*() is the same mechanism as perf probe.
If there is a mistake by user space tools, the program
will be reading some junk, but it won't be crashing.
To be able to debug live kernel we need to see everywhere.
Same way as systemtap loads kernel modules to walk
things inside kernel, bpf programs walk pointers with
bpf_fetch*().
I'm saying that if program is using bpf_fetch*()
it wants to see kernel internals and obviously depends
on particular kernel layout.

>> whereas today we pass a lot of fields to tracepoints and
>> make all of these fields immutable.
>
> The parameters passed to the tracepoint are not shown to userspace and
> can change at will. Now, we present the final parsing of the parameters
> that convert to fields. As all currently known tools uses
> libtraceevent.a, and parse the format files, those fields can move
> around and even change in size. The structures are not immutable. The
> fields are locked down if user space relies on them. But they can move
> about within the tracepoint, because the parsing allows for it.
>
> Remember, these are processed fields. The result of TP_fast_assign()
> and what gets put into the ring buffer. Now what is passed to the
> actual tracepoint is not visible by userspace, and in lots of cases, it
> is just a pointer to some structure. What eBPF brings to the table is a
> way to access this structure from user space. What keeps a structured
> passed to a tracepoint from becoming immutable if there's a eBPF
> program that expects it to have a specific field?

agree. that's fair.
I'm proposing to treat bpf programs that attach to existing
tracepoints as kernel modules that carry no ABI claims.

>> and bpf programs like live debugger that examine things.
>
> If bpf programs only dealt with kprobes, I may agree. But tracepoints
> have already been proven to be a type of ABI. If we open another window
> into the kernel, this can screw us later. It's better to solve this now
> than when we are fighting with Linus over user space breakage.

I'm not sure what's more needed other than adding
large comments into documentation, man pages and sample
code that bpf+existing tracepoint is not an ABI.

> What we need is to know if eBPF programs are modules or a user space
> interface. If they are a user interface then we need to be extremely
> careful here. If they are treated the same as modules, then it would
> not add any API. But that hasn't been settled yet, even if we have a
> comment in the kernel.
>
> Maybe what we should do is to make eBPF pass the kernel version it was
> made for (with all the mod version checks). If it doesn't match, fail
> to load it. Perhaps the more eBPF is limited like modules are, the
> better chance we have that no eBPF program creates a new ABI.

it's easy to add kernel version check and it will be equally
easy for user space to hack it.
imo comments in documentation and samples is good enough.

also not all bpf programs are equal.
bpf+existing tracepoint is not ABI
bpf+new tracepoint is ABI if programs are not using bpf_fetch
bpf+syscall is ABI if programs are not using bpf_fetch
bpf+kprobe is not ABI
bpf+sockets is ABI
At the end we want most of the programs to be written
without assuming anything about kernel internals.
But for live kernel debugging we will write programs
very specific to given kernel layout.

We can categorize the above unambiguously via
bpf program type.
Programs with:
BPF_PROG_TYPE_TRACEPOINT - not ABI
BPF_PROG_TYPE_KPROBE - not ABI
BPF_PROG_TYPE_SOCKET_FILTER - ABI

for my proposed 'new tracepoints' we can add type:
BPF_PROG_TYPE_TRACEPOINT_USER - ABI
and disallow calls to bpf_fetch*() for them.
To make it more strict we can do kernel version check
for all prog types that are 'not ABI', but is it really necessary?
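
Spelled out as a table in code, the proposal would look something like
this. It is a model only: the MODEL_* names are placeholders for the
BPF_PROG_TYPE_* values, and the TRACEPOINT_USER entry is just the type
proposed in this mail.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholders for the program types discussed above. */
enum prog_type_model {
	MODEL_SOCKET_FILTER,
	MODEL_KPROBE,
	MODEL_TRACEPOINT,
	MODEL_TRACEPOINT_USER,	/* proposed, does not exist yet */
	MODEL_TYPE_MAX,
};

/* Does attaching a program of this type create an ABI promise? */
static bool prog_type_is_abi(enum prog_type_model type)
{
	assert(type < MODEL_TYPE_MAX);
	return type == MODEL_SOCKET_FILTER || type == MODEL_TRACEPOINT_USER;
}

/* bpf_fetch*() walks kernel internals, so only the two 'not ABI'
 * tracing types get it in this model. */
static bool fetch_helpers_allowed(enum prog_type_model type)
{
	assert(type < MODEL_TYPE_MAX);
	return type == MODEL_KPROBE || type == MODEL_TRACEPOINT;
}
```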

To summarize and consolidate other threads:
- I will remove reading of PERF_SAMPLE_RAW in tracex1 example.
it's really orthogonal to this whole discussion.
- will add more comments throughout that just because
programs can read tracepoint arguments, they shouldn't
make any assumptions that args stay as-is from version to version
- will work on a patch to demonstrate how a few in-kernel
structures can be user-ized and how programs can access
them in a version-independent way

btw the concept of user-ized data structures already exists
with classic bpf, since 'A = load -0x1000' is translated into
'A = skb->protocol'. I'm thinking of something similar
but more generic and less obscure.

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-11  0:22 Alexei Starovoitov
From: Alexei Starovoitov @ 2015-02-11  0:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, Eric W. Biederman

On Tue, Feb 10, 2015 at 1:53 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 10 Feb 2015 11:53:22 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Mon, 9 Feb 2015 22:10:45 -0800
>> > Alexei Starovoitov <ast@plumgrid.com> wrote:
>> >
>> >> One can argue that current TP_printk format is already an ABI,
>> >> because somebody might be parsing the text output.
>> >
>> > If somebody does, then it is an ABI. Luckily, it's not that useful to
>> > parse, thus it hasn't been an issue. As Linus has stated in the past,
>> > it's not that we can't change ABI interfaces, its just that we can not
>> > change them if there's a user space application that depends on it.
>>
>> there are already tools that parse trace_pipe:
>> https://github.com/brendangregg/perf-tools
>
> Yep, and if this becomes a standard, then any change that makes
> trace_pipe different will be reverted.

I think reading of trace_pipe is widespread.

>> > and expect some events to have specific fields. Now we can add new
>> > fields, or even remove fields that no user space tool is using. This is
>> > because today, tools use libtraceevent to parse the event data.
>>
>> not all tools use libtraceevent.
>> gdb calls perf_event_open directly:
>> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
>> and parses PERF_RECORD_SAMPLE as a binary.
>> In this case it's branch records, but I think we never said anywhere
>> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
>> in this particular order.
>
> What particular order? Note, that's a hardware event, not a software
> one.

yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
when attr.sample_type = IP | ADDR whereas this is an
internal order of 'if' statements inside perf_output_sample()...
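
To show the kind of dependency I mean: a parser that does not hard-code
'ip then addr' has to walk the sample_type bits in the same order the
kernel emits the fields. A minimal sketch follows, assuming the fields
come out in ascending bit order for these four bits. The SAMPLE_* values
mirror PERF_SAMPLE_IP/TID/TIME/ADDR from the uapi header; sample_model
and parse_sample() are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same values as PERF_SAMPLE_IP, _TID, _TIME, _ADDR. */
#define SAMPLE_IP	(1U << 0)
#define SAMPLE_TID	(1U << 1)
#define SAMPLE_TIME	(1U << 2)
#define SAMPLE_ADDR	(1U << 3)

struct sample_model {
	uint64_t ip;
	uint64_t pid_tid;	/* u32 pid and u32 tid packed in one word */
	uint64_t time;
	uint64_t addr;
};

/* Walk the sample body one u64 at a time, in bit order.
 * Returns the number of u64 words consumed. */
static size_t parse_sample(uint64_t sample_type, const uint64_t *body,
			   struct sample_model *out)
{
	size_t n = 0;

	assert(body != NULL && out != NULL);
	if (sample_type & SAMPLE_IP)
		out->ip = body[n++];
	if (sample_type & SAMPLE_TID)
		out->pid_tid = body[n++];
	if (sample_type & SAMPLE_TIME)
		out->time = body[n++];
	if (sample_type & SAMPLE_ADDR)
		out->addr = body[n++];
	return n;
}
```

A tool that reads body[0] and body[1] directly works for IP | ADDR, but
silently reads 'time' as 'addr' once TIME is added to sample_type.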

>> But some maintainers think of them as ABI, whereas others
>> are using them freely. imo it's time to remove ambiguity.
>
> I would love to, and have brought this up at Kernel Summit more than
> once with no solution out of it.

let's try it again at plumbers in august?

For now I'm going to drop bpf+tracepoints, since it's so controversial
and go with bpf+syscall and bpf+kprobe only.

Hopefully by august it will be clear what bpf+kprobes can do
and why I'm excited about bpf+tracepoints in the future.

>> The idea for new style of tracepoints is the following:
>> introduce new macro: DEFINE_EVENT_USER
>> and let programs attach to them.
>
> We actually have that today. But it's TRACE_EVENT_FLAGS(), although
> that should be cleaned up a bit. Frederic added it to label events that
> are safe for perf non root. It seems to be used currently only for
> syscalls.

I didn't mean to let unprivileged apps use it.
Better name probably would be: DEFINE_EVENT_BPF
and only let bpf programs use it.

>> These tracepoint will receive one or two pointers to important
>> structs only. They will not have TP_printk, assign and fields.
>> The placement and arguments to these new tracepoints
>> will be an ABI.
>> All existing tracepoints are not.
>
> TP_printk() is not really an issue.

I think it is. The way things are printed is the most
visible part of tracepoints and I suspect maintainers don't
want to add new ones, because internal fields are printed
and users do parse trace_pipe.
Recent discussion about tcp instrumentation was
about adding new tracepoints and a module to print them.
As soon as something like this is in, the next question
'what we're going to do when more arguments need
to be printed'...

imo the solution is DEFINE_EVENT_BPF that doesn't
print anything and a bpf program to process it.
Then programs decide what they like to pass to
user space and in what format.
The kernel/user ABI is the position of this new tracepoint
and arguments that are passed in.
bpf program takes care of walking pointers, extracting
interesting fields, aggregating and passing them to
user in the format specific to this particular program.
Then when user wants to collect more fields it just
changes the program and corresponding userland side.

>> The position of new tracepoints and their arguments
>> will be an ABI and the programs can be both.
>
> You means "special tracepoints" one that does export the arguments?
>
> Question is, how many maintainers will add these, knowing that they
> will have to be forever maintained as is.

Obviously we would need to prove usefulness of
tracepoint_bpf before they're accepted.
Hopefully bpf+kprobe will make an impression on its own
and users would want similar but without debug info.

>> If program is using bpf_fetch*() helpers it obviously
>> wants to access internal data structures, so
>> it's really nothing more, but 'safe kernel module'
>> and kernel layout specific.
>> Both old and new tracepoints + programs will be used
>> for live kernel debugging.
>>
>> If program is accessing user-ized data structures then
>
> Technically, the TP_struct__entry is a user-ized structure.
>
>> it is portable and will run on any kernel.
>> In uapi header we can define:
>> struct task_struct_user {
>>   int pid;
>>   int prio;
>
> Here's a perfect example of something that looks stable to show to
> user space, but is really a pimple that is hiding cancer.
>
> Lets start with pid. We have name spaces. What pid will be put there?
> We have to show the pid of the name space it is under.
>
> Then we have prio. What is prio in the DEADLINE scheduler. It is rather
> meaningless. Also, it is meaningless in SCHED_OTHER.
>
> Also note that even for SCHED_FIFO, the prio is used differently in the
> kernel than it is in userspace. For the kernel, lower is higher.

well, ->prio and ->pid are already printed by sched tracepoints
and their meaning depends on the scheduler. So users are already taking that
into account.
I'm not suggesting to preserve the meaning of 'pid' semantically
in all cases. That's not what users would want anyway.
I want to allow programs to access important fields and print
them in more generic way than current TP_printk does.
Then exposed ABI of such tracepoint_bpf is smaller than
with current tracepoints.

>> };
>> and let bpf programs access it via real 'struct task_struct *'
>> pointer passed into tracepoint.
>> bpf loader will translate offsets and sizes used inside
>> the program into real task_struct's offsets and loads.
>
> It would need to do more that that. It may have to calculate the value
> that it returns, as the internal value may be different with different
> kernels.

back to 'prio'... the 'prio' accessible from the program
should be the same 'prio' that we're storing inside task_struct.
No extra conversions.
The bpf-script for sched analytics would want to see
the actual value to make sense of it.

> But what if the userspace tool depends on that value returning
> something meaningful. If it was meaningful in the past, it will have to
> be meaningful in the future, even if the internals of the kernel make
> it otherwise.

in some cases... yes. if we make a poor choice of
selecting fields for this 'task_struct_user' then some fields
may disappear from real task_struct, and placeholders
will be left in _user version.
so exposed fields need to be thought through.

In case of networking some of these choices were
already made by classic bpf. Fields:
protocol, pkttype, ifindex, hatype, mark, hash, queue_mapping
have to be reference-able from 'skb' pointer.
They can move from skb to some other struct, change
their sizes, location within sk_buff, etc
They can even disappear, but user visible
struct sk_buff_user will still contain the above.

> eBPF is very flexible, which means it is bound to have someone use it
> in a way you never dreamed of, and that will be what bites you in the
> end (pun intended).

understood :)
let's start slow then with bpf+syscall and bpf+kprobe only.

>> also not all bpf programs are equal.
>> bpf+existing tracepoint is not ABI
>
> Why not?

well, because we want to see more tracepoints in the kernel.
We're already struggling to add more.

>> bpf+new tracepoint is ABI if programs are not using bpf_fetch
>
> How is this different?

the new ones will be explicit by definition.

>> bpf+syscall is ABI if programs are not using bpf_fetch
>
> Well, this is easy. As syscalls are ABI, and the tracepoints for them
> match the ABI, it by default becomes an ABI.
>
>> bpf+kprobe is not ABI
>
> Right.
>
>> bpf+sockets is ABI
>
> Right, because sockets themselves are ABI.
>
>> At the end we want most of the programs to be written
>> without assuming anything about kernel internals.
>> But for live kernel debugging we will write programs
>> very specific to given kernel layout.
>
> And here lies the trick. How do we differentiate applications for
> everyday use from debugging tools? There have been times when debugging
> tools have shown themselves as being so useful they become everyday
> use tools.
>
>>
>> We can categorize the above unambiguously via
>> bpf program type.
>> Programs with:
>> BPF_PROG_TYPE_TRACEPOINT - not ABI
>> BPF_PROG_TYPE_KPROBE - not ABI
>> BPF_PROG_TYPE_SOCKET_FILTER - ABI
>
> Again, what enforces this? (hint, it's Linus)
>
>
>>
>> for my proposed 'new tracepoints' we can add type:
>> BPF_PROG_TYPE_TRACEPOINT_USER - ABI
>> and disallow calls to bpf_fetch*() for them.
>> To make it more strict we can do kernel version check
>> for all prog types that are 'not ABI', but is it really necessary?
>
> If we have something that makes it difficult for a tool to work from
> one kernel to the next, or even with different configs, where that tool
> will never become a standard, then that should be good enough to keep
> it from dictating user ABI.
>
> To give you an example, we thought about scrambling the trace event
> field locations from boot to boot to keep tools from hard coding the
> event layout. This may sound crazy, but developers out there are crazy.
> And if you want to keep them from abusing interfaces, you just need to
> be a bit more crazy than they are.

that is indeed crazy. the point is understood.

right now I cannot think of a solid way to prevent abuse
of bpf+tracepoint, so just going to drop it for now.
Cool things can be done with bpf+kprobe/syscall already.

>> To summarize and consolidate other threads:
>> - I will remove reading of PERF_SAMPLE_RAW in tracex1 example.
>> it's really orthogonal to this whole discussion.
>
> Or use libtraceevent ;-) We really need to finish that and package it
> up for distros.

sure. some sophisticated example can use it too.

>> - will add more comments throughout that just because
>> programs can read tracepoint arguments, they shouldn't
>> make any assumptions that args stay as-is from version to version
>
> We may need to find a way to actually keep it from being as is from
> version to version even if the users do not change.
>
>> - will work on a patch to demonstrate how few in-kernel
>> structures can be user-ized and how programs can access
>> them in version-independent way
>
> It will be interesting to see what kernel structures can be user-ized
> that are not already used by system calls.

well, for networking, few fields I mentioned above would
be enough for most of the programs.

>> btw the concept of user-ized data structures already exists
>> with classic bpf, since 'A = load -0x1000' is translated into
>> 'A = skb->protocol'. I'm thinking of something similar
>> but more generic and less obscure.
>
> I have to try to wrap my head around understanding the classic bpf, and
> how "load -0x1000" translates to "skb->protocol". Is that documented
> somewhere?

well, the magic constants are in uapi/linux/filter.h
load -0x1000 = skb->protocol
load -0x1000 + 4 = skb->pkt_type
load -0x1000 + 8 = skb->dev->ifindex
for eBPF I want to clean it up in a way that user program
will see:
struct sk_buff_user {
  int protocol;
  int pkt_type;
  int ifindex;
 ...
};
and C code of bpf program will look normal: skb->ifindex.
bpf loader will do verification and translation of
'load skb_ptr+12' into sequence of loads with correct
internal offsets.
So struct sk_buff_user is only an interface. It won't exist
in such layout in memory.
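
To make the translation idea above concrete, here is a minimal user-space sketch in C. It is only an illustration under stated assumptions: `struct fake_sk_buff`, `struct fake_net_device` and `load_user_field()` are hypothetical names invented for this example, and the real in-kernel layout and the real loader logic would look different. The only parts taken from the mail are the three fields of `struct sk_buff_user` and the idea that an access at a user-visible offset is rewritten into loads at internal offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* User-visible view, limited to the three fields named in the mail. */
struct sk_buff_user {
	int protocol;
	int pkt_type;
	int ifindex;
};

/* HYPOTHETICAL stand-ins for the in-kernel layout (not the real structs). */
struct fake_net_device {
	char name[16];
	int ifindex;
};

struct fake_sk_buff {
	unsigned long pad[3];		/* unrelated internal state */
	struct fake_net_device *dev;
	uint16_t protocol;
	uint8_t pkt_type;
};

/*
 * HYPOTHETICAL translation step: map an offset inside sk_buff_user to
 * the load(s) that fetch the same value from the internal layout.
 * Unknown offsets are rejected, roughly as a verifier would reject them.
 */
static int load_user_field(const struct fake_sk_buff *skb, size_t user_off,
			   int *out)
{
	switch (user_off) {
	case offsetof(struct sk_buff_user, protocol):
		*out = skb->protocol;
		return 0;
	case offsetof(struct sk_buff_user, pkt_type):
		*out = skb->pkt_type;
		return 0;
	case offsetof(struct sk_buff_user, ifindex):
		*out = skb->dev->ifindex;	/* two dependent loads */
		return 0;
	default:
		return -1;
	}
}
```

In this sketch a program written against `struct sk_buff_user` keeps working if `fake_sk_buff` is rearranged, because only the switch body has to follow the internal layout.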

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-11  3:04 Alexei Starovoitov
  2015-02-11  4:31   ` Steven Rostedt
  0 siblings, 1 reply; 70+ messages in thread
From: Alexei Starovoitov @ 2015-02-11  3:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, Eric W. Biederman

On Tue, Feb 10, 2015 at 4:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> >> But some maintainers think of them as ABI, whereas others
>> >> are using them freely. imo it's time to remove ambiguity.
>> >
>> > I would love to, and have brought this up at Kernel Summit more than
>> > once with no solution out of it.
>>
>> let's try it again at plumbers in august?
>
> Well, we need a statement from Linus. And it would be nice if we could
> also get Ingo involved in the discussion, but he seldom comes to
> anything but Kernel Summit.

+1

> BTW, I wonder if I could make a simple compiler in the kernel that
> would translate the current ftrace filters into a BPF program, where it
> could use the program and not use the current filter logic.

yep. I've sent that patch last year.
It converted pred_tree into bpf program.
I can try to dig it up. It doesn't provide extra programmability
though, just makes filtering logic much faster.

>> imo the solution is DEFINE_EVENT_BPF that doesn't
>> print anything and a bpf program to process it.
>
> You mean to be completely invisible to ftrace? And the debugfs/tracefs
> directory?

I mean it will be seen in tracefs to get 'id', but without enable/format/filter

>> I'm not suggesting to preserve the meaning of 'pid' semantically
>> in all cases. That's not what users would want anyway.
>> I want to allow programs to access important fields and print
>> them in more generic way than current TP_printk does.
>> Then exposed ABI of such tracepoint_bpf is smaller than
>> with current tracepoints.
>
> Again, this would mean they become invisible to ftrace, and even
> ftrace_dump_on_oops.

yes, since these new tracepoints have no meat inside them.
They're placeholders sitting idle and waiting for bpf to do
something useful with them.

> I'm not fully understanding what is to be exported by this new ABI. If
> the fields available will always be available, then why can't they
> appear in a TP_printk()?

say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
It will have only one argument 'skb'.
bpf program will read and print skb fields the way it likes
for particular tracing scenario.
So instead of making
TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
gso_type=%#x",...
the abi exposed via trace_pipe (as it is today),
the new tracepoint_bpf abi is presence of 'skb' pointer as one
and only argument to bpf program.
Future refactoring of netif_rx would need to guarantee
that trace_netif_rx_entry(skb) is called. that's it.
imo such tracepoints are much easier to deal with during
code changes.

Maybe some of the existing tracepoints like this one that
takes one argument can be marked 'bpf-ready', so that
programs can attach to them only.

>> let's start slow then with bpf+syscall and bpf+kprobe only.
>
> I'm fine with that.

thanks. will wait for merge window to close and will repost.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-11  6:33 ` Alexei Starovoitov
  0 siblings, 0 replies; 70+ messages in thread
From: Alexei Starovoitov @ 2015-02-11  6:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, Eric W. Biederman

On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > Again, this would mean they become invisible to ftrace, and even
>> > ftrace_dump_on_oops.
>>
>> yes, since these new tracepoints have no meat inside them.
>> They're placeholders sitting idle and waiting for bpf to do
>> something useful with them.
>
> Hmm, I have a patch somewhere (never posted it), that adds
> TRACE_MARKER(), which basically would just print that it was hit. But
> no data was passed to it. Something like that I would be more inclined
> to take. Then the question is, what can bpf access there? Could just be
> a place holder to add a "fast kprobe". This way, it can show up in
> trace events with enable and all that, it just won't have any data
> to print out besides the normal pid, flags, etc.
>
> But because it will inject a nop, that could be converted to a jump, it
> will give you the power of kprobes but with the speed of a tracepoint.

fair enough.
Something like TRACE_MARKER(arg1, arg2) that prints
it was hit without accessing the args would be enough.
Without any args it is indeed a 'fast kprobe' only.
Debug info would still be needed to access
function arguments.
On x64 function entry point and x64 abi make it easy
to access args, but i386 or kprobe in the middle
lose visibility when debug info is not available.
TRACE_MARKER (with few key args that function
is operating on) is enough to achieve roughly the same
as kprobe without debug info.

>> > I'm not fully understanding what is to be exported by this new ABI. If
>> > the fields available will always be available, then why can't they
>> > appear in a TP_printk()?
>>
>> say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
>> It will have only one argument 'skb'.
>> bpf program will read and print skb fields the way it likes
>> for particular tracing scenario.
>> So instead of making
>> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
>> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
>> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
>> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
>> gso_type=%#x",...
>> the abi exposed via trace_pipe (as it is today),
>> the new tracepoint_bpf abi is presence of 'skb' pointer as one
>> and only argument to bpf program.
>> Future refactoring of netif_rx would need to guarantee
>> that trace_netif_rx_entry(skb) is called. that's it.
>> imo such tracepoints are much easier to deal with during
>> code changes.
>
> But what can you access from that skb that is guaranteed to be there?
> If you say anything, then there's no reason it can't be added to the
> printk as well.

programs can access any field via bpf_fetch*() helpers which
make them kernel layout dependent or via user-ized sk_buff
with few fields which is portable.
In both cases kernel/user abi is only 'skb' pointer.
whether it's debugging program that needs full access
via fetch* helpers or portable program that uses stable api
it's up to program author.
Just like kprobes, it's clear, that if program is using
fetch* helpers it's doing it without any abi guarantees.
'perf probe' and 'bpf with fetch* helpers' are the same.
perf probe creates wrappers on top of probe_kernel_read and
bpf_fetch* helpers are wrappers on top of probe_kernel_read.
Complaints that 'my kprobe with flags=%cx mode=+4($stack)
stopped working in new kernel' are equivalent to complaints
that program with bpf_fetch* stopped working.

Whereas if program is using user-ized structs it will work
across kernel versions, though it will be able to see
only very limited slice of in-kernel data.

>> Maybe some of the existing tracepoints like this one that
>> takes one argument can be marked 'bpf-ready', so that
>> programs can attach to them only.
>
> I really hate the idea of adding tracepoints that ftrace can't use. It
> basically kills the entire busy box usage scenario, as boards that have
> extremely limited userspace still make full use of ftrace via the
> existing tracepoints.

agree. I think your trace_marker with few args is a good
middle ground.

> I still don't see the argument that adding data via the bpf functions
> is any different than adding those same items to fields in an event.
> Once you add a bpf function, then you must maintain those fields.
>
> Look, you can always add more to a TP_printk(), as that is standard
> with all text file kernel parsing. Like stat in /proc. Fields can not
> be removed, but more can always be appended to the end.
>
> Any tool that parses trace_pipe is broken if it can't handle extended
> fields. The api can be extended, and for text files, that is by
> appending to them.

I agree that any text parsing script should be able to cope
with additional args without problems. I think a fear of
<1% breakage is causing maintainers to avoid any changes
to tracepoints even when they just add few args to the end
of TP_printk.
When tracepoints stop printing and the only thing they see
is single pointer to a well known struct like sk_buff,
this fear of tracepoints should fade.
programs are not part of the kernel, so whatever they do
and print is not our headache. We only make sure that
interface between kernel and programs is stable.
In other words kernel ABI is what kernel exposes to
user space and to bpf programs. Though programs
are run inside the kernel, what they do is outside of
kernel abi. So when a program prints fields it is not our
problem, whereas when a tracepoint prints fields it's
kernel abi.
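
As an illustration of the "tools must cope with appended fields" point, here is a minimal C sketch of a parser that looks fields up by name rather than by position. `find_field()` is a hypothetical helper written for this example, and the `key=value` line format is only assumed to resemble TP_printk output such as the netif_rx example quoted above.

```c
#include <string.h>

/*
 * Look up "key=value" in a space-separated line and copy the value
 * into val (NUL-terminated, truncated to len - 1 bytes).
 * Fields are matched by name, so extra fields appended to the line
 * are skipped. Returns 0 on success, -1 if the key is absent.
 */
static int find_field(const char *line, const char *key, char *val, size_t len)
{
	size_t klen = strlen(key);
	const char *p = line;

	if (len == 0)
		return -1;

	while (*p) {
		const char *end;

		while (*p == ' ')
			p++;
		end = p;
		while (*end && *end != ' ')
			end++;

		/* The token must start with "key=" exactly. */
		if ((size_t)(end - p) > klen &&
		    strncmp(p, key, klen) == 0 && p[klen] == '=') {
			size_t vlen = (size_t)(end - p) - klen - 1;

			if (vlen >= len)
				vlen = len - 1;
			memcpy(val, p + klen + 1, vlen);
			val[vlen] = '\0';
			return 0;
		}
		p = end;
	}
	return -1;
}
```

A tool written this way gets the same answer for `len` whether or not a later kernel appends another field to the end of the line.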

ps I'll be traveling for the next few weeks, so
I apologize in advance for slow responses.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-14 22:48 ` Alexei Starovoitov
  0 siblings, 0 replies; 70+ messages in thread
From: Alexei Starovoitov @ 2015-02-14 22:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
	Network Development, LKML, Linus Torvalds, Eric W. Biederman

On Wed, Feb 11, 2015 at 5:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> We're compiling the BPF stuff against the 'current' kernel headers
> right?

the tracex1 example is pulling kernel headers to demonstrate
how bpf_fetch*() helpers can be used to walk kernel structures
without debug info.
The other examples don't need any internal headers.

> So would enforcing module versioning not be sufficient?

I'm going to redo the ex1 to use kprobe and some form of
version check. Indeed module-like versioning should
be enough.
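
One possible shape of such a version check, sketched in user-space C. This is an assumption about how it could look, not what the patch set does: `KVER()`, `release_code()` and `built_for_running_kernel()` are hypothetical names, the packing merely follows the familiar major/minor/patch layout, and real module versioning checks more than a release number.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Pack major.minor.patch into one integer (hypothetical helper macro). */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/*
 * Turn a release string such as "3.19.0-rc7+" into a packed code.
 * A missing patch level counts as 0. Returns -1 if unparsable.
 */
static long release_code(const char *release)
{
	unsigned int major, minor, patch = 0;

	if (sscanf(release, "%u.%u.%u", &major, &minor, &patch) < 2)
		return -1;
	return KVER((long)major, (long)minor, (long)patch);
}

/*
 * Return 1 if the running kernel matches the version the program was
 * built for, 0 otherwise. A loader could refuse layout-dependent
 * programs (those using fetch-style helpers) when this returns 0.
 */
static int built_for_running_kernel(long built_for)
{
	struct utsname u;

	if (uname(&u) != 0)
		return 0;
	return release_code(u.release) == built_for;
}
```

The design choice in this sketch is to fail closed: if the release string cannot be read or parsed, the program is treated as not matching.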

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-14 22:54 ` Alexei Starovoitov
  0 siblings, 0 replies; 70+ messages in thread
From: Alexei Starovoitov @ 2015-02-14 22:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, Eric W. Biederman

On Wed, Feb 11, 2015 at 7:51 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 10 Feb 2015 22:33:05 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>
>> fair enough.
>> Something like TRACE_MARKER(arg1, arg2) that prints
>> it was hit without accessing the args would be enough.
>> Without any args it is indeed a 'fast kprobe' only.
>> Debug info would still be needed to access
>> function arguments.
>> On x64 function entry point and x64 abi make it easy
>> to access args, but i386 or kprobe in the middle
>> lose visibility when debug info is not available.
>> TRACE_MARKER (with few key args that function
>> is operating on) is enough to achieve roughly the same
>> as kprobe without debug info.
>
> Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
> possibly a full trace event layout.
>
> The difference would be that the trace events do not show up unless you
> have "trace_debug" on the command line. This should prevent
> applications from depending on them.
>
> I could even do the nasty dmesg output like I do with trace_printk()s,
> that would definitely keep a production kernel from adding it by
> default.
>
> When trace_debug is not there, the trace points could still be accessed
> but perhaps only via bpf, or act like a simple trace marker.

I think that is a great idea!
Makes it clear that all prints are for debugging and
no abi guarantees.

> Note, if you need ids, I'd rather have them in another directory than
> tracefs. Make an eventfs perhaps that holds these. I'd rather keep tracefs
> simple.

indeed. makes sense. no reason to burn fs memory just
to get an id from name. maybe perf_event api can be
extended to look up id from name. I think perf will benefit as well.

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-14 23:02 ` Alexei Starovoitov
  0 siblings, 0 replies; 70+ messages in thread
From: Alexei Starovoitov @ 2015-02-14 23:02 UTC (permalink / raw)
  To: Hekuang
  Cc: Steven Rostedt, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
	Network Development, LKML, Linus Torvalds, Peter Zijlstra,
	Eric W. Biederman, wangnan0

On Wed, Feb 11, 2015 at 11:58 PM, Hekuang <hekuang@huawei.com> wrote:
>
>>> eBPF is very flexible, which means it is bound to have someone use it
>>> in a way you never dreamed of, and that will be what bites you in the
>>> end (pun intended).
>>
>> understood :)
>> let's start slow then with bpf+syscall and bpf+kprobe only.
>
>
> I think BPF + system calls/kprobes can meet our use case
> (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be
> improved.
>
> I suggest that you can improve bpf+kprobes when attached to function
> headers(or TRACE_MARKERS), make it convert pt_regs to bpf_ctx->arg1,
> arg2.., then top models and architectures can be separated by bpf.
>
> BPF bytecode is cross-platform, but what we can get by using bpf+kprobes
> is a 'regs->rdx' kind of information, such information is both
> architecture and kernel version related.

for kprobes in the middle of the function, kernel cannot
convert pt_regs into argN. Placement was decided by compiler
and can only be found in debug info.
I think bpf+kprobe will be using it when it is available.
When there is no debug info, kprobes will be limited
to function entry and mapping of regs/stack into
argN can be done by user space depending on architecture.
So user tracing scripts in some higher level language
can be kernel/arch independent when 'perf probe+bpf'
is loading them on the fly on the given machine.
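
A sketch of the user-space mapping described above, for a probe at function entry on x86-64. The register order (rdi, rsi, rdx, rcx, r8, r9 for the first six integer arguments) follows the System V AMD64 calling convention; `struct entry_regs` and `entry_arg()` are hypothetical names for this example and are not the kernel's `struct pt_regs` or any existing helper.

```c
#include <stdint.h>

/* HYPOTHETICAL snapshot of the argument registers at function entry. */
struct entry_regs {
	uint64_t rdi, rsi, rdx, rcx, r8, r9;
};

/*
 * Map argN (1-based) to the register that carries it at function
 * entry on x86-64. Only the first six integer/pointer arguments live
 * in registers; anything else is reported as unavailable (-1).
 */
static int entry_arg(const struct entry_regs *r, unsigned int n, uint64_t *out)
{
	switch (n) {
	case 1: *out = r->rdi; return 0;
	case 2: *out = r->rsi; return 0;
	case 3: *out = r->rdx; return 0;
	case 4: *out = r->rcx; return 0;
	case 5: *out = r->r8;  return 0;
	case 6: *out = r->r9;  return 0;
	default: return -1;
	}
}
```

Another architecture would supply its own table, which is why this step fits in user space: the script asks for arg3 and the loader on the given machine decides that means rdx.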

> We hope to establish some models for describing kernel procedures such
> as IO and network, which requires that it does not rely on architecture
> and does not rely on a specific kernel version as much as possible.

That's obviously a goal, but it requires a new approach to tracepoints.
I think a lot of great ideas were discussed in this thread, so I'm
hopeful that we'll come up with a solution that will satisfy even
Peter's strictest requirements :)

^ permalink raw reply	[flat|nested] 70+ messages in thread
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
@ 2015-02-23 18:55 ` Alexei Starovoitov
  0 siblings, 0 replies; 70+ messages in thread
From: Alexei Starovoitov @ 2015-02-23 18:55 UTC (permalink / raw)
  To: He Kuang
  Cc: Steven Rostedt, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
	Network Development, LKML, Linus Torvalds, Peter Zijlstra,
	Eric W. Biederman, wangnan0

On Mon, Feb 16, 2015 at 6:26 AM, He Kuang <hekuang@huawei.com> wrote:
> Hi, Alexei
>
> Another suggestion on bpf syscall interface. Currently, BPF +
> syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on
> commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this
> case, bpf bytecode cannot be loaded to the kernel.

I'm seeing a flurry of use cases for bpf in ovs, tc, tracing, etc
When it's all ready, we can turn that config on by default.

> If we turn the functionality of BPF_SYSCALL into a loadable module, then
> we can use it without any dependencies on the kernel. What about changing
> the bpf syscall to a /dev node or /sys file which can be exported by a
> kernel module?

I don't think we will allow extending bpf by modules.
'bpf in modules' is an interface that is too easy to abuse.
So all of bpf core, helper functions and program types will be builtin.

As far as bpf+tracing the plan is to do bpf+kprobe and bpf+syscalls
first. Then add right set of helpers to make sure that use cases
like 'tcp stack instrumentation' are fully addressed.
Then there were few great ideas of accelerating kprobes
with trace markers and debug tracepoints that we can do later.

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2015-02-23 18:55 UTC | newest]

