* [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs
@ 2018-09-19 22:39 Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov
  ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team

Hi All,

this patch set adds kernel and user space support to reveal short-lived
bpf programs. The kernel stores addr+bpf_prog_name information into the
perf ring buffer. Later 'perf report' can properly attribute the cpu
time to the program.

The following command was run before and after:
'perf record ./test_progs; perf report'

Before patch set:
# Overhead  Command     Shared Object      Symbol
# ........  ..........  .................  .......................................
#
    10.73%  test_progs  [kernel.kallsyms]  [k] __htab_map_lookup_elem
     8.16%  test_progs  [kernel.kallsyms]  [k] bpf_skb_set_tunnel_key
     7.90%  test_progs  [kernel.kallsyms]  [k] memcmp
     3.10%  test_progs  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
     2.57%  test_progs  [kernel.kallsyms]  [k] dst_release
     2.45%  test_progs  [kernel.kallsyms]  [k] do_check
     1.89%  test_progs  [kernel.kallsyms]  [k] bpf_xdp_adjust_head
     1.52%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa002cc5a
     1.22%  test_progs  [kernel.kallsyms]  [k] check_helper_mem_access
     0.99%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a6a8
     0.99%  test_progs  [kernel.kallsyms]  [k] kmem_cache_alloc_trace
     0.97%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a652
     0.95%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa0042d52
     0.95%  test_progs  [kernel.kallsyms]  [k] read_tsc
     0.81%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a660
     0.77%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa0042d69
     0.74%  test_progs  [kernel.kallsyms]  [k] percpu_array_map_lookup_elem
     0.69%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a64e

After:
# Overhead  Command     Shared Object      Symbol
# ........  ..........  .................  .............................................
#
    18.13%  test_progs  [kernel.kallsyms]  [k] bpf_prog_1accc788e7f04c38_balancer_ingres
    10.73%  test_progs  [kernel.kallsyms]  [k] __htab_map_lookup_elem
     7.94%  test_progs  [kernel.kallsyms]  [k] bpf_prog_20a05d8a586cf0e8_F
     7.80%  test_progs  [kernel.kallsyms]  [k] bpf_skb_set_tunnel_key
     4.64%  test_progs  [kernel.kallsyms]  [k] __sysfs_match_string
     3.61%  test_progs  [kernel.kallsyms]  [k] bpf_prog_9d89afa51f1dc0d7_F
     3.51%  test_progs  [kernel.kallsyms]  [k] bpf_prog_73b45270911a0294_F
     3.41%  test_progs  [kernel.kallsyms]  [k] bpf_prog_79be7b7bad8026bf_F
     3.26%  test_progs  [kernel.kallsyms]  [k] memcmp
     3.10%  test_progs  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
     2.89%  test_progs  [kernel.kallsyms]  [k] bpf_prog_57bf3d413b9b7455_F
     2.57%  test_progs  [kernel.kallsyms]  [k] dst_discard
     2.45%  test_progs  [kernel.kallsyms]  [k] do_check
     2.39%  test_progs  [kernel.kallsyms]  [k] bpf_prog_576cbdaac1a4d2f6_F
     1.82%  test_progs  [kernel.kallsyms]  [k] bpf_prog_53ade2ecbddaa85b_F
     1.70%  test_progs  [kernel.kallsyms]  [k] bpf_xdp_adjust_head
     1.32%  test_progs  [kernel.kallsyms]  [k] bpf_prog_0edc54822404a598_F

Important considerations:
- Before and after, the number of cpu samples is the same. No samples
  are lost. But perf cannot find the symbol by IP, so a lot of small
  0x00007fffa001a64e-like symbols appear towards the end of
  'perf report' and none of them look hot. In reality these IP addresses
  belong to a few bpf programs that were active at that time.
- a newly loaded bpf program can be JITed into the address space of an
  unloaded prog. A 7fffa001aXXX address at time X can belong to program
  A, but a similar 7fffa001aYYY address at time Y can belong to a
  different program B.
- event->mmap.pid == -1 is an existing indicator of a kernel event.
  event->mmap.tid == BPF_FS_MAGIC is an extension to indicate a bpf
  related event. Alternatively it's possible to introduce a new
  'enum perf_event_type' command specifically for bpf prog load/unload,
  but the existing RECORD_MMAP is very close, so the choice made by this
  patch set is to extend it.
- steps to land the set: Patches 1 and 2 need to go via the bpf-next
  tree, since we're adding support for better program names in exactly
  the same lines. Patch 3 can go in parallel into the perf tree. It has
  no effect without kernel support, and kernel support has no effect on
  old perf.

Peter, Arnaldo, Daniel, please review.

Alexei Starovoitov (3):
  perf/core: introduce perf_event_mmap_bpf_prog
  bpf: emit RECORD_MMAP events for bpf prog load/unload
  tools/perf: recognize and process RECORD_MMAP events for bpf progs

 include/linux/perf_event.h |  1 +
 kernel/bpf/core.c          | 22 +++++++++++++++++--
 kernel/events/core.c       | 44 +++++++++++++++++++++++++++++++++-----
 tools/perf/util/machine.c  | 27 +++++++++++++++++++++++
 tools/perf/util/symbol.c   | 13 +++++++++++
 tools/perf/util/symbol.h   |  1 +
 6 files changed, 101 insertions(+), 7 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 35+ messages in thread
* [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog
  2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov
@ 2018-09-19 22:39 ` Alexei Starovoitov
  2018-09-19 23:30   ` Song Liu
  2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov
  2 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team

introduce the perf_event_mmap_bpf_prog() helper to emit RECORD_MMAP
events into the perf ring buffer. It's used by the bpf load/unload logic
to notify user space of the addresses and names of JITed bpf programs.

Note that event->mmap.pid == -1 is an existing indicator of a kernel
event. In addition use event->mmap.tid == BPF_FS_MAGIC to indicate a bpf
related RECORD_MMAP event.

Alternatively it's possible to introduce a new 'enum perf_event_type'
command specifically for bpf prog load/unload, but the existing
RECORD_MMAP is very close, so the choice made by this patch is to
extend it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 44 +++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 53c500f0ca79..0e79af83138f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1113,6 +1113,7 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
 }
 
 extern void perf_event_mmap(struct vm_area_struct *vma);
+void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size);
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
 extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
 extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2a62b96600ad..c48244ddf993 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7152,7 +7152,7 @@ static int perf_event_mmap_match(struct perf_event *event,
 {
 	struct perf_mmap_event *mmap_event = data;
 	struct vm_area_struct *vma = mmap_event->vma;
-	int executable = vma->vm_flags & VM_EXEC;
+	int executable = !vma || vma->vm_flags & VM_EXEC;
 
 	return (!executable && event->attr.mmap_data) ||
 	       (executable && (event->attr.mmap || event->attr.mmap2));
@@ -7165,12 +7165,13 @@ static void perf_event_mmap_output(struct perf_event *event,
 	struct perf_output_handle handle;
 	struct perf_sample_data sample;
 	int size = mmap_event->event_id.header.size;
+	bool bpf_event = !mmap_event->vma;
 	int ret;
 
 	if (!perf_event_mmap_match(event, data))
 		return;
 
-	if (event->attr.mmap2) {
+	if (event->attr.mmap2 && !bpf_event) {
 		mmap_event->event_id.header.type = PERF_RECORD_MMAP2;
 		mmap_event->event_id.header.size += sizeof(mmap_event->maj);
 		mmap_event->event_id.header.size += sizeof(mmap_event->min);
@@ -7186,12 +7187,14 @@ static void perf_event_mmap_output(struct perf_event *event,
 	if (ret)
 		goto out;
 
-	mmap_event->event_id.pid = perf_event_pid(event, current);
-	mmap_event->event_id.tid = perf_event_tid(event, current);
+	if (!bpf_event) {
+		mmap_event->event_id.pid = perf_event_pid(event, current);
+		mmap_event->event_id.tid = perf_event_tid(event, current);
+	}
 
 	perf_output_put(&handle, mmap_event->event_id);
 
-	if (event->attr.mmap2) {
+	if (event->attr.mmap2 && !bpf_event) {
 		perf_output_put(&handle, mmap_event->maj);
 		perf_output_put(&handle, mmap_event->min);
 		perf_output_put(&handle, mmap_event->ino);
@@ -7448,6 +7451,37 @@ void perf_event_mmap(struct vm_area_struct *vma)
 	perf_event_mmap_event(&mmap_event);
 }
 
+void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size)
+{
+	struct perf_mmap_event mmap_event;
+
+	if (!atomic_read(&nr_mmap_events))
+		return;
+
+	if (!IS_ALIGNED(size, sizeof(u64))) {
+		WARN_ONCE(1, "size is not aligned\n");
+		return;
+	}
+
+	mmap_event = (struct perf_mmap_event){
+		.file_name = name,
+		.file_size = size,
+		.event_id  = {
+			.header = {
+				.type = PERF_RECORD_MMAP,
+				.misc = PERF_RECORD_MISC_KERNEL,
+				.size = sizeof(mmap_event.event_id) + size,
+			},
+			.pid = -1,		/* indicates kernel */
+			.tid = BPF_FS_MAGIC,	/* bpf mmap event */
+			.start = start,
+			.len = len,
+			.pgoff = start,
+		},
+	};
+	perf_iterate_sb(perf_event_mmap_output, &mmap_event, NULL);
+}
+
 void perf_event_aux_event(struct perf_event *event, unsigned long head,
 			  unsigned long size, u64 flags)
 {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog 2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov @ 2018-09-19 23:30 ` Song Liu 2018-09-20 0:53 ` Alexei Starovoitov 0 siblings, 1 reply; 35+ messages in thread From: Song Liu @ 2018-09-19 23:30 UTC (permalink / raw) To: Alexei Starovoitov Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team > On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote: > > introduce perf_event_mmap_bpf_prog() helper to emit RECORD_MMAP events > into perf ring buffer. > It's used by bpf load/unload logic to notify user space of addresses > and names of JITed bpf programs. > > Note that event->mmap.pid == -1 is an existing indicator of kernel event. > In addition use event->mmap.tid == BPF_FS_MAGIC to indicate bpf related > RECORD_MMAP event. > > Alternatively it's possible to introduce new 'enum perf_event_type' command > specificially for bpf prog load/unload, but existing RECORD_MMAP > is very close, so the choice made by this patch is to extend it. > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> I guess we should also use this for kernel modules load/unload? 
> --- > include/linux/perf_event.h | 1 + > kernel/events/core.c | 44 +++++++++++++++++++++++++++++++++----- > 2 files changed, 40 insertions(+), 5 deletions(-) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 53c500f0ca79..0e79af83138f 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1113,6 +1113,7 @@ static inline void perf_event_task_sched_out(struct task_struct *prev, > } > > extern void perf_event_mmap(struct vm_area_struct *vma); > +void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size); > extern struct perf_guest_info_callbacks *perf_guest_cbs; > extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); > extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 2a62b96600ad..c48244ddf993 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -7152,7 +7152,7 @@ static int perf_event_mmap_match(struct perf_event *event, > { > struct perf_mmap_event *mmap_event = data; > struct vm_area_struct *vma = mmap_event->vma; > - int executable = vma->vm_flags & VM_EXEC; > + int executable = !vma || vma->vm_flags & VM_EXEC; > > return (!executable && event->attr.mmap_data) || > (executable && (event->attr.mmap || event->attr.mmap2)); > @@ -7165,12 +7165,13 @@ static void perf_event_mmap_output(struct perf_event *event, > struct perf_output_handle handle; > struct perf_sample_data sample; > int size = mmap_event->event_id.header.size; > + bool bpf_event = !mmap_event->vma; > int ret; > > if (!perf_event_mmap_match(event, data)) > return; > > - if (event->attr.mmap2) { > + if (event->attr.mmap2 && !bpf_event) { > mmap_event->event_id.header.type = PERF_RECORD_MMAP2; > mmap_event->event_id.header.size += sizeof(mmap_event->maj); > mmap_event->event_id.header.size += sizeof(mmap_event->min); > @@ -7186,12 +7187,14 @@ static void 
perf_event_mmap_output(struct perf_event *event, > if (ret) > goto out; > > - mmap_event->event_id.pid = perf_event_pid(event, current); > - mmap_event->event_id.tid = perf_event_tid(event, current); > + if (!bpf_event) { > + mmap_event->event_id.pid = perf_event_pid(event, current); > + mmap_event->event_id.tid = perf_event_tid(event, current); > + } > > perf_output_put(&handle, mmap_event->event_id); > > - if (event->attr.mmap2) { > + if (event->attr.mmap2 && !bpf_event) { > perf_output_put(&handle, mmap_event->maj); > perf_output_put(&handle, mmap_event->min); > perf_output_put(&handle, mmap_event->ino); > @@ -7448,6 +7451,37 @@ void perf_event_mmap(struct vm_area_struct *vma) > perf_event_mmap_event(&mmap_event); > } > > +void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size) > +{ > + struct perf_mmap_event mmap_event; > + > + if (!atomic_read(&nr_mmap_events)) > + return; > + > + if (!IS_ALIGNED(size, sizeof(u64))) { > + WARN_ONCE(1, "size is not aligned\n"); > + return; > + } > + > + mmap_event = (struct perf_mmap_event){ > + .file_name = name, > + .file_size = size, > + .event_id = { > + .header = { > + .type = PERF_RECORD_MMAP, > + .misc = PERF_RECORD_MISC_KERNEL, > + .size = sizeof(mmap_event.event_id) + size, > + }, > + .pid = -1, /* indicates kernel */ > + .tid = BPF_FS_MAGIC, /* bpf mmap event */ > + .start = start, > + .len = len, > + .pgoff = start, > + }, > + }; > + perf_iterate_sb(perf_event_mmap_output, &mmap_event, NULL); > +} > + > void perf_event_aux_event(struct perf_event *event, unsigned long head, > unsigned long size, u64 flags) > { > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog 2018-09-19 23:30 ` Song Liu @ 2018-09-20 0:53 ` Alexei Starovoitov 0 siblings, 0 replies; 35+ messages in thread From: Alexei Starovoitov @ 2018-09-20 0:53 UTC (permalink / raw) To: Song Liu, Alexei Starovoitov Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team On 9/19/18 4:30 PM, Song Liu wrote: > > >> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote: >> >> introduce perf_event_mmap_bpf_prog() helper to emit RECORD_MMAP events >> into perf ring buffer. >> It's used by bpf load/unload logic to notify user space of addresses >> and names of JITed bpf programs. >> >> Note that event->mmap.pid == -1 is an existing indicator of kernel event. >> In addition use event->mmap.tid == BPF_FS_MAGIC to indicate bpf related >> RECORD_MMAP event. >> >> Alternatively it's possible to introduce new 'enum perf_event_type' command >> specificially for bpf prog load/unload, but existing RECORD_MMAP >> is very close, so the choice made by this patch is to extend it. >> >> Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > Acked-by: Song Liu <songliubraving@fb.com> > > I guess we should also use this for kernel modules load/unload? yes. that's possible. There is similar issue with modules today that get unloaded before 'perf report'. Synthetic record_mmap for modules that perf emits into perf.data has filename==module_name. It solves typical use case though. We can extend record_mmap further and let kernel emit them for module load/unload. Some new magic for 'tid' would be necessary. ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov
@ 2018-09-19 22:39 ` Alexei Starovoitov
  2018-09-19 23:44   ` Song Liu
  2018-09-20  8:44   ` Peter Zijlstra
  2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov
  2 siblings, 2 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team

use the perf_event_mmap_bpf_prog() helper to notify user space about
JITed bpf programs. Use the RECORD_MMAP perf event to tell user space
where the JITed bpf program was loaded. Use an empty program name as
the unload indication.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/core.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3f5bf1af0826..ddf11fdafd36 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
 	*symbol_end = addr + hdr->pages * PAGE_SIZE;
 }
 
-static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
+static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
 {
 	const char *end = sym + KSYM_NAME_LEN;
 
@@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
 	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
 	sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
 	if (prog->aux->name[0])
-		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
+		sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
 	else
 		*sym = 0;
+	return sym;
 }
 
 static __always_inline unsigned long
@@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
+	unsigned long symbol_start, symbol_end;
+	char buf[KSYM_NAME_LEN], *sym;
+
 	if (!bpf_prog_kallsyms_candidate(fp) ||
 	    !capable(CAP_SYS_ADMIN))
 		return;
 
+	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
+	sym = bpf_get_prog_name(fp, buf);
+	sym++; /* sym - buf is the length of the name including trailing 0 */
+	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
+		*sym++ = 0;
 	spin_lock_bh(&bpf_lock);
 	bpf_prog_ksym_node_add(fp->aux);
 	spin_unlock_bh(&bpf_lock);
+	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
+				 buf, sym - buf);
 }
 
 void bpf_prog_kallsyms_del(struct bpf_prog *fp)
 {
+	unsigned long symbol_start, symbol_end;
+	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
+	char buf[sizeof(u64)] = {};
+
 	if (!bpf_prog_kallsyms_candidate(fp))
 		return;
 
 	spin_lock_bh(&bpf_lock);
 	bpf_prog_ksym_node_del(fp->aux);
 	spin_unlock_bh(&bpf_lock);
+	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
+	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
+				 buf, sizeof(buf));
}
 
 static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov @ 2018-09-19 23:44 ` Song Liu 2018-09-20 0:59 ` Alexei Starovoitov 2018-09-20 8:44 ` Peter Zijlstra 1 sibling, 1 reply; 35+ messages in thread From: Song Liu @ 2018-09-19 23:44 UTC (permalink / raw) To: Alexei Starovoitov Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team > On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote: > > use perf_event_mmap_bpf_prog() helper to notify user space > about JITed bpf programs. > Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded. > Use empty program name as unload indication. > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > --- > kernel/bpf/core.c | 22 ++++++++++++++++++++-- > 1 file changed, 20 insertions(+), 2 deletions(-) > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index 3f5bf1af0826..ddf11fdafd36 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog, > *symbol_end = addr + hdr->pages * PAGE_SIZE; > } > > -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) > +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym) > { > const char *end = sym + KSYM_NAME_LEN; > > @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) > sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_"); > sym = bin2hex(sym, prog->tag, sizeof(prog->tag)); > if (prog->aux->name[0]) > - snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name); > + sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name); > else > *sym = 0; > + return sym; > } > > static __always_inline unsigned long > @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp) > > void bpf_prog_kallsyms_add(struct bpf_prog *fp) > { > + 
unsigned long symbol_start, symbol_end; > + char buf[KSYM_NAME_LEN], *sym; > + > if (!bpf_prog_kallsyms_candidate(fp) || > !capable(CAP_SYS_ADMIN)) > return; > > + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end); > + sym = bpf_get_prog_name(fp, buf); > + sym++; /* sym - buf is the length of the name including trailing 0 */ > + while (!IS_ALIGNED(sym - buf, sizeof(u64))) > + *sym++ = 0; nit: This logic feels a little weird to me. How about we wrap the extra logic in a separate function: size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog fp, char *buf) where the return value is the u64 aligned size. Other than this Acked-by: Song Liu <songliubraving@fb.com> > spin_lock_bh(&bpf_lock); > bpf_prog_ksym_node_add(fp->aux); > spin_unlock_bh(&bpf_lock); > + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start, > + buf, sym - buf); > } > > void bpf_prog_kallsyms_del(struct bpf_prog *fp) > { > + unsigned long symbol_start, symbol_end; > + /* mmap_record.filename cannot be NULL and has to be u64 aligned */ > + char buf[sizeof(u64)] = {}; > + > if (!bpf_prog_kallsyms_candidate(fp)) > return; > > spin_lock_bh(&bpf_lock); > bpf_prog_ksym_node_del(fp->aux); > spin_unlock_bh(&bpf_lock); > + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end); > + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start, > + buf, sizeof(buf)); > } > > static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr) > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-19 23:44 ` Song Liu @ 2018-09-20 0:59 ` Alexei Starovoitov 2018-09-20 5:48 ` Song Liu 0 siblings, 1 reply; 35+ messages in thread From: Alexei Starovoitov @ 2018-09-20 0:59 UTC (permalink / raw) To: Song Liu, Alexei Starovoitov Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team On 9/19/18 4:44 PM, Song Liu wrote: > > >> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote: >> >> use perf_event_mmap_bpf_prog() helper to notify user space >> about JITed bpf programs. >> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded. >> Use empty program name as unload indication. >> >> Signed-off-by: Alexei Starovoitov <ast@kernel.org> >> --- >> kernel/bpf/core.c | 22 ++++++++++++++++++++-- >> 1 file changed, 20 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c >> index 3f5bf1af0826..ddf11fdafd36 100644 >> --- a/kernel/bpf/core.c >> +++ b/kernel/bpf/core.c >> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog, >> *symbol_end = addr + hdr->pages * PAGE_SIZE; >> } >> >> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) >> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym) >> { >> const char *end = sym + KSYM_NAME_LEN; >> >> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) >> sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_"); >> sym = bin2hex(sym, prog->tag, sizeof(prog->tag)); >> if (prog->aux->name[0]) >> - snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name); >> + sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name); >> else >> *sym = 0; >> + return sym; >> } >> >> static __always_inline unsigned long >> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp) >> >> void bpf_prog_kallsyms_add(struct bpf_prog *fp) >> { >> + unsigned long 
symbol_start, symbol_end; >> + char buf[KSYM_NAME_LEN], *sym; >> + >> if (!bpf_prog_kallsyms_candidate(fp) || >> !capable(CAP_SYS_ADMIN)) >> return; >> >> + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end); >> + sym = bpf_get_prog_name(fp, buf); >> + sym++; /* sym - buf is the length of the name including trailing 0 */ >> + while (!IS_ALIGNED(sym - buf, sizeof(u64))) >> + *sym++ = 0; > > nit: This logic feels a little weird to me. How about we wrap the extra logic > in a separate function: > > size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog fp, char *buf) probably bpf_get_prog_name_u64_padded() ? would be cleaner indeed. Will send v2 once Peter and Arnaldo provide their feedback. Thanks for reviewing! ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-20 0:59 ` Alexei Starovoitov @ 2018-09-20 5:48 ` Song Liu 0 siblings, 0 replies; 35+ messages in thread From: Song Liu @ 2018-09-20 5:48 UTC (permalink / raw) To: Alexei Starovoitov Cc: Alexei Starovoitov, David S . Miller, daniel, peterz, acme, netdev, Kernel Team > On Sep 19, 2018, at 5:59 PM, Alexei Starovoitov <ast@fb.com> wrote: > > On 9/19/18 4:44 PM, Song Liu wrote: >> >> >>> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote: >>> >>> use perf_event_mmap_bpf_prog() helper to notify user space >>> about JITed bpf programs. >>> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded. >>> Use empty program name as unload indication. >>> >>> Signed-off-by: Alexei Starovoitov <ast@kernel.org> >>> --- >>> kernel/bpf/core.c | 22 ++++++++++++++++++++-- >>> 1 file changed, 20 insertions(+), 2 deletions(-) >>> >>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c >>> index 3f5bf1af0826..ddf11fdafd36 100644 >>> --- a/kernel/bpf/core.c >>> +++ b/kernel/bpf/core.c >>> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog, >>> *symbol_end = addr + hdr->pages * PAGE_SIZE; >>> } >>> >>> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) >>> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym) >>> { >>> const char *end = sym + KSYM_NAME_LEN; >>> >>> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) >>> sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_"); >>> sym = bin2hex(sym, prog->tag, sizeof(prog->tag)); >>> if (prog->aux->name[0]) >>> - snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name); >>> + sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name); >>> else >>> *sym = 0; >>> + return sym; >>> } >>> >>> static __always_inline unsigned long >>> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog 
*fp) >>> >>> void bpf_prog_kallsyms_add(struct bpf_prog *fp) >>> { >>> + unsigned long symbol_start, symbol_end; >>> + char buf[KSYM_NAME_LEN], *sym; >>> + >>> if (!bpf_prog_kallsyms_candidate(fp) || >>> !capable(CAP_SYS_ADMIN)) >>> return; >>> >>> + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end); >>> + sym = bpf_get_prog_name(fp, buf); >>> + sym++; /* sym - buf is the length of the name including trailing 0 */ >>> + while (!IS_ALIGNED(sym - buf, sizeof(u64))) >>> + *sym++ = 0; >> >> nit: This logic feels a little weird to me. How about we wrap the extra logic >> in a separate function: >> >> size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog fp, char *buf) > > probably bpf_get_prog_name_u64_padded() ? > would be cleaner indeed. bpf_get_prog_name_u64_padded does sound better. > Will send v2 once Peter and Arnaldo provide their feedback. > > Thanks for reviewing! ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov
  2018-09-19 23:44   ` Song Liu
@ 2018-09-20  8:44   ` Peter Zijlstra
  2018-09-20 13:25     ` Arnaldo Carvalho de Melo
  2018-09-20 13:36     ` Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2018-09-20 8:44 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David S . Miller, daniel, acme, netdev, kernel-team

On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> {
> +	unsigned long symbol_start, symbol_end;
> +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> +	char buf[sizeof(u64)] = {};
> +
> 	if (!bpf_prog_kallsyms_candidate(fp))
> 		return;
> 
> 	spin_lock_bh(&bpf_lock);
> 	bpf_prog_ksym_node_del(fp->aux);
> 	spin_unlock_bh(&bpf_lock);
> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +				 buf, sizeof(buf));
> }

So perf doesn't normally issue unmap events. We've talked about doing
that, but so far it's never really been needed, I think.

It feels a bit weird to start issuing unmap events for this.

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-20 8:44 ` Peter Zijlstra @ 2018-09-20 13:25 ` Arnaldo Carvalho de Melo 2018-09-20 13:56 ` Peter Zijlstra 2018-09-20 13:36 ` Peter Zijlstra 1 sibling, 1 reply; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-09-20 13:25 UTC (permalink / raw) To: Alexei Starovoitov Cc: Peter Zijlstra, David S . Miller, daniel, netdev, kernel-team Em Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra escreveu: > On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote: > > void bpf_prog_kallsyms_del(struct bpf_prog *fp) > > { > > + unsigned long symbol_start, symbol_end; > > + /* mmap_record.filename cannot be NULL and has to be u64 aligned */ > > + char buf[sizeof(u64)] = {}; > > + > > if (!bpf_prog_kallsyms_candidate(fp)) > > return; > > > > spin_lock_bh(&bpf_lock); > > bpf_prog_ksym_node_del(fp->aux); > > spin_unlock_bh(&bpf_lock); > > + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end); > > + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start, > > + buf, sizeof(buf)); > > } > > So perf doesn't normally issue unmap events.. We've talked about doing > that, but so far it's never really need needed I think. > I feels a bit weird to start issuing unmap events for this. For reference, this surfaced here: https://lkml.org/lkml/2017/1/27/452 Start of the thread, that involves postgresql, JIT, LLVM, perf is here: https://lkml.org/lkml/2016/12/10/1 PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due to having to cope with munmapping parts of existing mmaps, etc. I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for now it would be used just in this clean case for undoing a PERF_RECORD_MMAP for a BPF program. The ABI is already complicated, starting to use something called PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever, I think. - Arnaldo ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-20 13:25 ` Arnaldo Carvalho de Melo @ 2018-09-20 13:56 ` Peter Zijlstra 2018-09-21 3:14 ` Alexei Starovoitov 0 siblings, 1 reply; 35+ messages in thread From: Peter Zijlstra @ 2018-09-20 13:56 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Alexei Starovoitov, David S . Miller, daniel, netdev, kernel-team On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote: > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due > to having to cope with munmapping parts of existing mmaps, etc. > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for > now it would be used just in this clean case for undoing a > PERF_RECORD_MMAP for a BPF program. > > The ABI is already complicated, starting to use something called > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever, > I think. Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult part was getting the perf tool to dtrt for that use-case. But if we need unmap events, doing the unmap record now is the right thing. ^ permalink raw reply [flat|nested] 35+ messages in thread
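For illustration only: a PERF_RECORD_MUNMAP body could mirror the pid/tid/addr/len prefix of a PERF_RECORD_MMAP body without the filename. This layout is purely hypothetical (the record type discussed here never landed upstream), sketched in Python:

```python
import struct

def pack_munmap(pid: int, tid: int, addr: int, length: int) -> bytes:
    # hypothetical record body: pid/tid as signed 32-bit, addr/len as
    # unsigned 64-bit, little-endian, mirroring the fixed prefix of a
    # PERF_RECORD_MMAP body minus the filename
    return struct.pack("<iiQQ", pid, tid, addr, length)
```

Such a fixed-size body would sidestep the NUL-name trick entirely, at the cost of the new-bit/old-tool compatibility questions discussed below.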
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-20 13:56 ` Peter Zijlstra @ 2018-09-21 3:14 ` Alexei Starovoitov 2018-09-21 12:25 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 35+ messages in thread From: Alexei Starovoitov @ 2018-09-21 3:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Arnaldo Carvalho de Melo, Alexei Starovoitov, David S . Miller, daniel, netdev, kernel-team On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote: > On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote: > > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due > > to having to cope with munmapping parts of existing mmaps, etc. > > > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for > > now it would be used just in this clean case for undoing a > > PERF_RECORD_MMAP for a BPF program. > > > > The ABI is already complicated, starting to use something called > > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever, > > I think. > > Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult > part was getting the perf tool to dtrt for that use-case. But if we need > unmap events, doing the unmap record now is the right thing. Thanks for the pointers! The answer is a bit long. Please bear with me. I have considered adding MUNMAP to match the existing MMAP, but went without it because I didn't want to introduce a new bit in perf_event_attr and emit these new events in a misbalanced conditional way for prog load/unload. Old perf asks the kernel for mmap events via the mmap bit, so prog load events will be in perf.data, but old perf report won't recognize them anyway. Whereas new perf would certainly want to catch bpf events and will set both mmap and munmap bits. Then if I add a MUNMAP event without a new bit and emit MMAP/MUNMAP conditionally based on the single mmap bit, they will confuse old perf and it will print a warning about 'unknown events'. 
Both situations are ugly, hence I went with reusing the MMAP event for both load/unload. In that case old perf silently ignores them. Which is what I wanted. When we upgrade the kernel we cannot synchronize the kernel upgrade (or downgrade) with a user space perf package upgrade. Hence not confusing old perf is important. With a new kernel, new bpf mmap events get into perf.data and new perf picks them up. A few more considerations: I consider synthetic perf events to be non-ABI. Meaning they're emitted by perf user space into perf.data and there is a convention on names, but it's not a kernel ABI. Like RECORD_MMAP with event.filename == "[module_name]" is an indication for perf report to parse the elf/build-id of dso==module_name. There is no such support in the kernel. The kernel doesn't emit such events for module load/unload. If in the future we decide to extend the kernel with such events, they don't have to match what user space perf does today. Why is this important? To get to the next step. As Arnaldo pointed out, this patch set is missing support for JITed prog annotations and displaying asm code. Absolutely correct. This set only helps perf to reveal the names of bpf progs that _were_ running at the time of perf record, but there is no way yet for perf report to show the asm code of the prog that was running. In that sense bpf is drastically different from java, other jits and normal profiling. bpf JIT happens in the kernel and only the kernel knows the mapping between the original source and the JITed code. In addition there are bpf2bpf functions. In the future there will be bpf libraries, more type info, line number support, etc. I strongly believe perf RECORD_* events should NOT care about the development that happens on the bpf side. The only thing the kernel will be telling user space is that a bpf prog with prog_id=X was loaded. Then based on prog_id the 'perf record' phase can query the kernel for bpf related information. There is already a way to fetch the JITed image based on prog_id. 
Then perf will emit a synthetic RECORD_FOOBAR into perf.data that will contain bpf related info (like the complete JITed image) and perf report can process it later and annotate things in the UI. It may seem that there is a race here: when 'perf record' sees the 'bpf prog was loaded with prog_id=X' event, it will ask the kernel about prog_id=X, but that prog could be unloaded already. In that case the prog_id will not exist and perf record can ignore the event. So no race. The kernel doesn't need to pass all information about a bpf prog to user space via RECORD_*. Instead 'perf record' can emit synthetic events into perf.data. I was thinking of extending RECORD_MMAP with prog_id already (instead of passing kallsyms's bpf prog name in event->mmap.filename), but bpf functions don't have their own prog_id. Multiple bpf funcs with different JITed blobs are considered to be part of a single prog_id. So as step one I'm only extending RECORD_MMAP with the addr and kallsym name of the bpf function/prog. As step two the plan is to add a notification mechanism for prog load/unload that will include prog_id and design new synthetic RECORD_* events in perf user space that will contain JITed code, line info, BTF, etc. TLDR: step 1 (this patch set) A single bpf prog_load can call multiple bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only Similarly unload calls multiple bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only step 2 (future work) a single event for bpf prog_load with prog_id only. Either via perf ring buffer or ftrace or tracepoints or some other notification mechanism. It may seem that step 2 obsoletes step 1. It can, but I think it will complement it. There is a lot more code there and a lot more discussions to have. Step 1 is already a big improvement. Thoughts? ^ permalink raw reply [flat|nested] 35+ messages in thread
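For readers following the kallsym names in this thread (e.g. bpf_prog_1accc788e7f04c38_balancer_ingres in the cover letter): the kernel builds them as "bpf_prog_" plus the 16 hex digits of the program tag plus an optional program name. A simplified sketch of that formatting (Python; the real kernel code also enforces KSYM_NAME_LEN limits, which this omits):

```python
def bpf_ksym_name(tag: bytes, prog_name: str = "") -> str:
    # candidate kallsyms entry: bpf_prog_<tag as 16 hex digits>[_<name>]
    sym = "bpf_prog_" + tag.hex()
    if prog_name:
        sym += "_" + prog_name
    return sym
```

This is the string that step 1 above proposes to carry in event->mmap.filename for the load case.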
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-21 3:14 ` Alexei Starovoitov @ 2018-09-21 12:25 ` Arnaldo Carvalho de Melo 2018-09-21 13:55 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-09-21 12:25 UTC (permalink / raw) To: Alexei Starovoitov Cc: Peter Zijlstra, Alexei Starovoitov, David S . Miller, daniel, netdev, kernel-team Em Thu, Sep 20, 2018 at 08:14:46PM -0700, Alexei Starovoitov escreveu: > On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote: > > On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote: > > > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due > > > to having to cope with munmapping parts of existing mmaps, etc. > > > > > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for > > > now it would be used just in this clean case for undoing a > > > PERF_RECORD_MMAP for a BPF program. > > > > > > The ABI is already complicated, starting to use something called > > > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever, > > > I think. > > > > Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult > > part was getting the perf tool to dtrt for that use-case. But if we need > > unmap events, doing the unmap record now is the right thing. > > Thanks for the pointers! > The answer is a bit long. pls bear with me. Ditto with me :-) > I have considered adding MUNMAP to match existing MMAP, but went > without it because I didn't want to introduce new bit in perf_event_attr > and emit these new events in a misbalanced conditional way for prog load/unload. 
> Like old perf is asking kernel for mmap events via mmap bit, so prog load events By prog load events you mean that old perf, having perf_event_attr.mmap = 1 || perf_event_attr.mmap2 = 1 will cause the new kernel to emit PERF_RECORD_MMAP records for the range of addresses that a BPF program is being loaded on, right? > will be in perf.data, but old perf report won't recognize them anyway. Why not? It should look up the symbol and find it in the rb_tree of maps, with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the BPF core, no? It'll be an unresolved symbol, but a resolved map. > Whereas new perf would certainly want to catch bpf events and will set > both mmap and mumap bits. new perf with your code will find a symbol, not a map, because your code catches a special case PERF_RECORD_MMAP and instead of creating a 'struct map' will create a 'struct symbol' and insert it in the kallsyms 'struct map', right? > Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP > conditionally based on single mmap bit they will confuse old perf > and it will print warning about 'unknown events'. That is unfortunate and I'll turn that part into a pr_debug() > Both situations are ugly, hence I went with reuse of MMAP event > for both load/unload. So, it's doubly odd, i.e. MMAP used for mmap() and for munmap(), and the effect in the tooling is not to create or remove a 'struct map', but to alter an existing symbol table for the kallsyms map. > In such case old perf silently ignores them. Which is what I wanted. In theory the old perf should catch the PERF_RECORD_MMAP with a string in the filename part and insert a new map into the kernel mmap rb_tree, and then samples would be resolved to this map, but since there is no backing DSO with a symtab, it would stop at that, just stating that the map is called NAME-OF-BPF-PROGRAM. 
This is all from memory, possibly there is something in there that makes it ignore this PERF_RECORD_MMAP emitted by the BPF kernel code when loading a new program. > When we upgrade the kernel we cannot synchronize the kernel upgrade > (or downgrade) with user space perf package upgrade. sure > Hence not confusing old perf is important. Thanks for trying to achieve that, and it's a pity that the "unknown record" message is a pr_warning or pr_info and not a pr_debug(). > With new kernel new bpf mmap events get into perf.data and > new perf picks them up. > > Few more considerations: > > I consider synthetic perf events to be non-ABI. Meaning they're > emitted by perf user space into perf.data and there is a convention > on names, but it's not a kernel abi. Like RECORD_MMAP with > event.filename == "[module_name]" is an indication for perf report > to parse elf/build-id of dso==module_name. > There is no such support in the kernel. Kernel doesn't emit > such events for module load/unload. If in the future > we decide to extend kernel with such events they don't have > to match what user space perf does today. Right, that is another unfortunate state of affairs: kernel module load/unload should already be supported, reported by the kernel via a proper PERF_RECORD_MODULE_LOAD/UNLOAD > Why this is important? To get to next step. > As Arnaldo pointed out this patch set is missing support for > JITed prog annotations and displaying asm code. Absolutely correct. > This set only helps perf to reveal the names of bpf progs that _were_ > running at the time of perf record, but there is no way yet for > perf report to show asm code of the prog that was running. > In that sense bpf is drastically different from java, other jits > and normal profiling. > bpf JIT happens in the kernel and only kernel knows the mapping > between original source and JITed code. > In addition there are bpf2bpf functions. 
In the future there will > be bpf libraries, more type info, line number support, etc. > I strongly believe perf RECORD_* events should NOT care about > the development that happens on the bpf side. > The only thing kernel will be telling user space is that bpf prog > with prog_id=X was loaded. > Then based on prog_id the 'perf record' phase can query the kernel > for bpf related information. There is already a way to fetch > JITed image based on prog_id. > Then perf will emit synthetic RECORD_FOOBAR into perf.data > that will contain bpf related info (like complete JITed image) > and perf report can process it later and annotate things in UI. There is another longstanding TODO list entry: PERF_RECORD_MMAP records should include a build-id, to avoid either userspace getting confused when there is an update of some mmap DSO, for long running sessions, for instance, or to have to scan the just recorded perf.data file for DSOs with samples to then read it from the file system (more races). Have you ever considered having a build-id for bpf objects that could be used here? > It may seem that there is a race here. > Like when 'perf record' see 'bpf prog was loaded with prog_id=X' event > it will ask the kernel about prog_id=X, but that prog could be > unloaded already. > In such case prog_id will not exist and perf record can ignore such event. > So no race. > The kernel doesn't need to pass all information about bpf prog to > the user space via RECORD_*. Instead 'perf record' can emit > synthetic events into perf.data. > I was thinking to extend RECORD_MMAP with prog_id already Right, if prog_id is a uniquely identifying cookie that is based on the file contents, like files in a git repo, then this would serve for both bpf progs and for everything else. Will the JITed code from some BPF bytecode be different if the same bytecode is JITed in several machines, all having the exact same hardware? 
> (instead of passing kallsyms's bpf prog name in event->mmap.filename) > but bpf functions don't have their own prog_id. Multiple bpf funcs > with different JITed blobs are considered to be a part of single prog_id. > So as a step one I'm only extending RECORD_MMAP with addr and kallsym > name of bpf function/prog. > As a step two the plan is to add notification mechanism for prog load/unload > that will include prog_id and design new synthetic RECORD_* events in > perf user space that will contain JITed code, line info, BTF, etc. So, will the kernel JIT a bytecode, load it somewhere and run it, then, when unloading it, keep it somewhere (a filesystem with some limits, etc) so that at some later time (with some timeouts) tooling can, using its id/buildid cookie get the contents and symbol table to have a better annotation experience? Using /proc/kcore as right now we should be able to annotate the BPF JIT as it is loaded, in 'perf top', for instance, with your suggested code, because we would have a symbol resolved that spans the whole bpf program, I'll try applying your patch and seeing how it goes. > TLDR: > step 1 (this patch set) > Single bpf prog_load can call multiple > bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only > Similarly unload calls multiple > bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only > > step 2 (future work) > single event for bpf prog_load with prog_id only. > Either via perf ring buffer or ftrace or tracepoints or some > other notification mechanism. > > It may seem that step 2 obsoletes step 1. It can, but I think > it will complement it. There is a lot more code there and > a lot more discussions to have. > Step 1 is already big improvement. > > Thoughts? See above, and, bear with me too :-) - Arnaldo ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-21 12:25 ` Arnaldo Carvalho de Melo @ 2018-09-21 13:55 ` Peter Zijlstra 2018-09-21 13:59 ` Peter Zijlstra 2018-09-21 22:13 ` Alexei Starovoitov 2 siblings, 0 replies; 35+ messages in thread From: Peter Zijlstra @ 2018-09-21 13:55 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Alexei Starovoitov, Alexei Starovoitov, David S . Miller, daniel, netdev, kernel-team On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote: > > I consider synthetic perf events to be non-ABI. Meaning they're > > emitted by perf user space into perf.data and there is a convention > > on names, but it's not a kernel abi. Like RECORD_MMAP with > > event.filename == "[module_name]" is an indication for perf report > > to parse elf/build-id of dso==module_name. > > There is no such support in the kernel. Kernel doesn't emit > > such events for module load/unload. If in the future > > we decide to extend kernel with such events they don't have > > to match what user space perf does today. > > Right, that is another unfortunate state of affairs, kernel module > load/unload should already be supported, reported by the kernel via a > proper PERF_RECORD_MODULE_LOAD/UNLOAD Just wondering, is anyone actually doing enough module loading for this to matter? (asks the CONFIG_MODULES=n guy). I thought that was all a relatively static affair; you boot, you get loaded a few modules for present hardware, the end. Anyway, no real objection, just wonder if it's worth it. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-21 12:25 ` Arnaldo Carvalho de Melo 2018-09-21 13:55 ` Peter Zijlstra @ 2018-09-21 13:59 ` Peter Zijlstra 2018-09-21 22:13 ` Alexei Starovoitov 2 siblings, 0 replies; 35+ messages in thread From: Peter Zijlstra @ 2018-09-21 13:59 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Alexei Starovoitov, Alexei Starovoitov, David S . Miller, daniel, netdev, kernel-team On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote: > There is another longstanding TODO list entry: PERF_RECORD_MMAP records > should include a build-id I thought the problem was that the kernel doesn't have the build-id in the first place. So it cannot hand them out. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-21 12:25 ` Arnaldo Carvalho de Melo 2018-09-21 13:55 ` Peter Zijlstra 2018-09-21 13:59 ` Peter Zijlstra @ 2018-09-21 22:13 ` Alexei Starovoitov 2018-10-15 23:33 ` Song Liu 2 siblings, 1 reply; 35+ messages in thread From: Alexei Starovoitov @ 2018-09-21 22:13 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Peter Zijlstra, Alexei Starovoitov, David S . Miller, daniel, netdev, kernel-team On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote: > > > I have considered adding MUNMAP to match existing MMAP, but went > > without it because I didn't want to introduce new bit in perf_event_attr > > and emit these new events in a misbalanced conditional way for prog load/unload. > > Like old perf is asking kernel for mmap events via mmap bit, so prog load events > > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 || > perf_event_attr.mmap2 = 1 will cause the new kernel to emit > PERF_RECORD_MMAP records for the range of addresses that a BPF program > is being loaded on, right? right. it would be weird when prog load events are there, but not unload. > > will be in perf.data, but old perf report won't recognize them anyway. > > Why not? It should lookup the symbol and find it in the rb_tree of maps, > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the > BPF core, no? It'll be an unresolved symbol, but a resolved map. > > > Whereas new perf would certainly want to catch bpf events and will set > > both mmap and mumap bits. > > new perf with your code will find a symbol, not a map, because your code > catches a special case PERF_RECORD_MMAP and instead of creating a > 'struct map' will create a 'struct symbol' and insert it in the kallsyms > 'struct map', right? right. bpf progs are more similar to kernel functions than to modules. For modules it makes sense to create a new map and insert symbols into it. 
For bpf JITed images there is no DSO to parse. A single bpf elf file may contain multiple bpf progs and each prog may contain multiple bpf functions. They will be loaded at different times and will have different lifetimes. > In theory the old perf should catch the PERF_RECORD_MMAP with a string > in the filename part and insert a new map into the kernel mmap rb_tree, > and then samples would be resolved to this map, but since there is no > backing DSO with a symtab, it would stop at that, just stating that the > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly > there is something in there that makes it ignore this PERF_RECORD_MMAP > emitted by the BPF kernel code when loading a new program. In /proc/kcore there is already a section for the module range. Hence when perf processes bpf load/unload events the map is already created. Therefore patch 3 only searches for it and inserts a new symbol into it. In that sense the reuse of the RECORD_MMAP event for bpf progs is indeed not exactly clean, since no new map is created. It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events? Such an event could potentially be used for offline ksym resolution. perf could process /proc/kallsyms during perf record and emit all of them as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run on a different server and still find the right symbols. I guess we can do bpf-specific events too and keep RECORD_MMAP as-is. How about a single PERF_RECORD_BPF event with an internal flag for load/unload? > Right, that is another unfortunate state of affairs, kernel module > load/unload should already be supported, reported by the kernel via a > proper PERF_RECORD_MODULE_LOAD/UNLOAD I agree with Peter here. It would be nice, but low priority. Modules are mostly static: loaded once and they stay there. 
> There is another longstanding TODO list entry: PERF_RECORD_MMAP records > should include a build-id, to avoid either userspace getting confused > when there is an update of some mmap DSO, for long running sessions, for > instance, or to have to scan the just recorded perf.data file for DSOs > with samples to then read it from the file system (more races). > > Have you ever considered having a build-id for bpf objects that could be > used here? The build-id concept is not applicable to bpf. bpf elf files on disk don't correlate well with what is running in the kernel. The bpf bytestream is converted and optimized by the verifier. Then JITed. So the debug info left in the .o file and the original bpf bytestream in the .o are mostly useless. For bpf programs we have a 'program tag'. It is computed over the original bpf bytestream, so both the kernel and user space can compute it. In libbcc we use the /var/tmp/bcc/bpf_prog_TAG/ directory to store the original source code of the program, so users looking at kernel stack traces with bpf_prog_TAG can find the source. It's similar to a build-id, but not going to help perf to annotate the actual x86 instructions inside the JITed image and show source code. Since the JIT runs in the kernel this problem cannot be solved by user space only. It's a difficult problem and we have a plan to tackle that, but it's step 2. A bunch of infra is needed on the bpf side to preserve the association during src_in_C -> original bpf insns -> translated bpf insns -> JITed asm. Then report it back to user space and then teach perf to properly annotate progs. > Will the JITed code from some BPF bytecode be different if the same > bytecode is JITed in several machines, all having the exact same > hardware? Yes. JITed code will be different depending on sysctl knobs (like const blinding). So the same original bpf byte stream loaded at different times may have a different JITed image. Even without security features like blinding the JIT can be different. The bpf maps are separate from bpf progs. 
The bpf map is created first. Then the same bpf instruction stream (depending on map type that it uses) may be optimized by the verifier differently causing difference in JIT. > > (instead of passing kallsyms's bpf prog name in event->mmap.filename) > > but bpf functions don't have their own prog_id. Multiple bpf funcs > > with different JITed blobs are considered to be a part of single prog_id. > > So as a step one I'm only extending RECORD_MMAP with addr and kallsym > > name of bpf function/prog. > > As a step two the plan is to add notification mechanism for prog load/unload > > that will include prog_id and design new synthetic RECORD_* events in > > perf user space that will contain JITed code, line info, BTF, etc. > > So, will the kernel JIT a bytecode, load it somewhere and run it, then, > when unloading it, keep it somewhere (a filesystem with some limits, > etc) so that at some later time (with some timeouts) tooling can, using > its id/buildid cookie get the contents and symbol table to have a better > annotation experience? yes. The plan is to let perf fetch JITed image via BPF_OBJ_GET_INFO_BY_FD cmd during perf record run and store it inside perf.data in synthetic records. Then perf report can annotate it later. > Using /proc/kcore as right now we should be able to annotate the BPF JIT > as it is loaded, in 'perf top', for instance, with your suggested code, > because we would have a symbol resolved that spans the whole bpf > program, I'll try applying your patch and seeing how it goes. yes. /proc/kcore helps to annotate during perf report when progs are still loaded, but user experience sucks. Like for networking bpf performance testing the user needs to: - load the program - start blasting traffic to trigger prog execution - perf record -a sleep 5 - stop traffic - perf report -> to analyze what's going on in bpf prog - unload prog All steps are manual. 
We need to get to the point where we can: - perf record -a ./my_bpf_test - perf report where my_bpf_test will load, attach, run traffic, detach, unload. Most of selftests/bpf/* are done this way, but we cannot profile them at the moment. ^ permalink raw reply [flat|nested] 35+ messages in thread
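The 'program tag' discussed earlier in this message is, roughly, the first 8 bytes of a SHA-1 digest over the program's instruction stream (the kernel's bpf_prog_calc_tag() also canonicalizes map references before hashing, which this sketch omits):

```python
import hashlib

def prog_tag(insns: bytes) -> bytes:
    # simplified: hash the instruction stream and keep the first
    # 8 bytes of the SHA-1 digest, like the kernel's program tag
    return hashlib.sha1(insns).digest()[:8]
```

Because the tag is content-derived, both the kernel and user space can compute it independently, which is what lets libbcc key its /var/tmp/bcc/bpf_prog_TAG/ source cache on it.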
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-21 22:13 ` Alexei Starovoitov @ 2018-10-15 23:33 ` Song Liu 2018-10-16 23:43 ` David Ahern 0 siblings, 1 reply; 35+ messages in thread From: Song Liu @ 2018-10-15 23:33 UTC (permalink / raw) To: Alexei Starovoitov Cc: acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, kernel-team On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote: > > > > > I have considered adding MUNMAP to match existing MMAP, but went > > > without it because I didn't want to introduce new bit in perf_event_attr > > > and emit these new events in a misbalanced conditional way for prog load/unload. > > > Like old perf is asking kernel for mmap events via mmap bit, so prog load events > > > > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 || > > perf_event_attr.mmap2 = 1 will cause the new kernel to emit > > PERF_RECORD_MMAP records for the range of addresses that a BPF program > > is being loaded on, right? > > right. it would be weird when prog load events are there, but not unload. > > > > will be in perf.data, but old perf report won't recognize them anyway. > > > > Why not? It should lookup the symbol and find it in the rb_tree of maps, > > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the > > BPF core, no? It'll be an unresolved symbol, but a resolved map. > > > > > Whereas new perf would certainly want to catch bpf events and will set > > > both mmap and mumap bits. > > > > new perf with your code will find a symbol, not a map, because your code > > catches a special case PERF_RECORD_MMAP and instead of creating a > > 'struct map' will create a 'struct symbol' and insert it in the kallsyms > > 'struct map', right? > > right. > bpf progs are more similar to kernel functions than to modules. 
> For modules it makes sense to create a new map and insert symbols into it. > For bpf JITed images there is no DSO to parse. > Single bpf elf file may contain multiple bpf progs and each prog may contain > multiple bpf functions. They will be loaded at different time and > will have different life time. > > > In theory the old perf should catch the PERF_RECORD_MMAP with a string > > in the filename part and insert a new map into the kernel mmap rb_tree, > > and then samples would be resolved to this map, but since there is no > > backing DSO with a symtab, it would stop at that, just stating that the > > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly > > there is something in there that makes it ignore this PERF_RECORD_MMAP > > emitted by the BPF kernel code when loading a new program. > > In /proc/kcore there is already a section for module range. > Hence when perf processes bpf load/unload events the map is already created. > Therefore the patch 3 only searches for it and inserts new symbol into it. > > In that sense the reuse of RECORD_MMAP event for bpf progs is indeed > not exactly clean, since no new map is created. > It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ? > > Such event potentially can be used for offline ksym resolution. > perf could process /proc/kallsyms during perf record and emit all of them > as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run > on a different server and still find the right symbols. > > I guess, we can do bpf specific events too and keep RECORD_MMAP as-is. > How about single PERF_RECORD_BPF event with internal flag for load/unload ? > > > Right, that is another unfortunate state of affairs, kernel module > > load/unload should already be supported, reported by the kernel via a > > proper PERF_RECORD_MODULE_LOAD/UNLOAD > > I agree with Peter here. It would nice, but low priority. > modules are mostly static. Loaded once and stay there. 
> > > There is another longstanding TODO list entry: PERF_RECORD_MMAP records > > should include a build-id, to avoid either userspace getting confused > > when there is an update of some mmap DSO, for long running sessions, for > > instance, or to have to scan the just recorded perf.data file for DSOs > > with samples to then read it from the file system (more races). > > > > Have you ever considered having a build-id for bpf objects that could be > > used here? > > build-id concept is not applicable to bpf. > bpf elf files on the disc don't have good correlation with what is > running in the kernel. bpf bytestream is converted and optimized > by the verifier. Then JITed. > So debug info left in .o file and original bpf bytestream in .o are > mostly useless. > For bpf programs we have 'program tag'. It is computed over original > bpf bytestream, so both kernel and user space can compute it. > In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original > source code of the program, so users looking at kernel stack traces > with bpf_prog_TAG can find the source. > It's similar to build-id, but not going to help perf to annotate > actual x86 instructions inside JITed image and show src code. > Since JIT runs in the kernel this problem cannot be solved by user space only. > It's a difficult problem and we have a plan to tackle that, > but it's step 2. A bunch of infra is needed on bpf side to > preserve the association during src_in_C -> original bpf insns -> > translated bpf insns -> JITed asm. > Then report it back to user space and then teach perf to properly annotate progs. > > > Will the JITed code from some BPF bytecode be different if the same > > bytecode is JITed in several machines, all having the exact same > > hardware? > > Yes. JITed code will be different depending on sysctl knobs (like const blinding) > So the same original bpf byte stream loaded at different times > may have different JITed image. 
> > Even without security features like blinding the JIT can be different. > the bpf maps are separate from bpf progs. The bpf map is created first. > Then the same bpf instruction stream (depending on map type that it uses) > may be optimized by the verifier differently causing difference in JIT. > > > > (instead of passing kallsyms's bpf prog name in event->mmap.filename) > > > but bpf functions don't have their own prog_id. Multiple bpf funcs > > > with different JITed blobs are considered to be a part of single prog_id. > > > So as a step one I'm only extending RECORD_MMAP with addr and kallsym > > > name of bpf function/prog. > > > As a step two the plan is to add notification mechanism for prog load/unload > > > that will include prog_id and design new synthetic RECORD_* events in > > > perf user space that will contain JITed code, line info, BTF, etc. > > > > So, will the kernel JIT a bytecode, load it somewhere and run it, then, > > when unloading it, keep it somewhere (a filesystem with some limits, > > etc) so that at some later time (with some timeouts) tooling can, using > > its id/buildid cookie get the contents and symbol table to have a better > > annotation experience? > > yes. The plan is to let perf fetch JITed image via BPF_OBJ_GET_INFO_BY_FD cmd > during perf record run and store it inside perf.data in synthetic records. > Then perf report can annotate it later. Hi Peter and Arnaldo, I am working with Alexei on the idea of fetching BPF program information via BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT to perf_event_type, and dumped these events to the perf event ring buffer. I found that perf will not process events until the end of perf-record: root@virt-test:~# ~/perf record -ag -- sleep 10 ...... 
10 seconds later [ perf record: Woken up 34 times to write data ] machine__process_bpf_event: prog_id 6 loaded machine__process_bpf_event: prog_id 6 unloaded [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] In this example, the bpf program was loaded and then unloaded in another terminal. When machine__process_bpf_event() processes the load event, the bpf program is already unloaded. Therefore, machine__process_bpf_event() will not be able to get information about the program via BPF_OBJ_GET_INFO_BY_FD cmd. To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD as soon as perf gets the event from the kernel. I looked around the perf code for a while, but I haven't found a good example where some events are processed before the end of perf-record. Could you please help me with this? Thanks, Song ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-15 23:33 ` Song Liu @ 2018-10-16 23:43 ` David Ahern 2018-10-17 6:43 ` Song Liu 0 siblings, 1 reply; 35+ messages in thread From: David Ahern @ 2018-10-16 23:43 UTC (permalink / raw) To: Song Liu, Alexei Starovoitov Cc: acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, kernel-team On 10/15/18 4:33 PM, Song Liu wrote: > I am working with Alexei on the idea of fetching BPF program information via > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > to perf_event_type, and dumped these events to perf event ring buffer. > > I found that perf will not process event until the end of perf-record: > > root@virt-test:~# ~/perf record -ag -- sleep 10 > ...... 10 seconds later > [ perf record: Woken up 34 times to write data ] > machine__process_bpf_event: prog_id 6 loaded > machine__process_bpf_event: prog_id 6 unloaded > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > In this example, the bpf program was loaded and then unloaded in > another terminal. When machine__process_bpf_event() processes > the load event, the bpf program is already unloaded. Therefore, > machine__process_bpf_event() will not be able to get information > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > as soon as perf get the event from kernel. I looked around the perf > code for a while. But I haven't found a good example where some > events are processed before the end of perf-record. Could you > please help me with this? perf record does not process events as they are generated. Its sole job is pushing data from the maps to a file as fast as possible meaning in bulk based on current read and write locations. Adding code to process events will add significant overhead to the record command and will not really solve your race problem. 
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-16 23:43 ` David Ahern @ 2018-10-17 6:43 ` Song Liu 2018-10-17 12:11 ` Arnaldo Carvalho de Melo 2018-10-17 15:09 ` David Ahern 0 siblings, 2 replies; 35+ messages in thread From: Song Liu @ 2018-10-17 6:43 UTC (permalink / raw) To: David Ahern Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, kernel-team Hi David, On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote: > > On 10/15/18 4:33 PM, Song Liu wrote: > > I am working with Alexei on the idea of fetching BPF program information via > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > > to perf_event_type, and dumped these events to perf event ring buffer. > > > > I found that perf will not process event until the end of perf-record: > > > > root@virt-test:~# ~/perf record -ag -- sleep 10 > > ...... 10 seconds later > > [ perf record: Woken up 34 times to write data ] > > machine__process_bpf_event: prog_id 6 loaded > > machine__process_bpf_event: prog_id 6 unloaded > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > > > In this example, the bpf program was loaded and then unloaded in > > another terminal. When machine__process_bpf_event() processes > > the load event, the bpf program is already unloaded. Therefore, > > machine__process_bpf_event() will not be able to get information > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > > as soon as perf get the event from kernel. I looked around the perf > > code for a while. But I haven't found a good example where some > > events are processed before the end of perf-record. Could you > > please help me with this? > > perf record does not process events as they are generated. 
Its sole job > is pushing data from the maps to a file as fast as possible meaning in > bulk based on current read and write locations. > > Adding code to process events will add significant overhead to the > record command and will not really solve your race problem. Thanks for the comment. I agree that processing events while recording has significant overhead. In this case, perf user space needs to know details about the JITed BPF program. It is impossible to pass all these details to user space through the relatively stable ring_buffer API. Therefore, some processing of the data is necessary (get the bpf prog_id from the ring buffer, and then fetch program details via BPF_OBJ_GET_INFO_BY_FD). I have some ideas for processing important data with relatively low overhead. Let me try to implement it. Thanks again, Song ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 6:43 ` Song Liu @ 2018-10-17 12:11 ` Arnaldo Carvalho de Melo 2018-10-17 12:50 ` Arnaldo Carvalho de Melo 2018-11-02 1:08 ` Song Liu 2018-10-17 15:09 ` David Ahern 1 sibling, 2 replies; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-10-17 12:11 UTC (permalink / raw) To: Song Liu Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, Alexey Budankov, David S . Miller, Daniel Borkmann, Namhyung Kim, Jiri Olsa, Networking, kernel-team Adding Alexey, Jiri and Namhyung as they worked/are working on multithreading 'perf record'. Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu: > On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote: > > On 10/15/18 4:33 PM, Song Liu wrote: > > > I am working with Alexei on the idea of fetching BPF program information via > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > > > to perf_event_type, and dumped these events to perf event ring buffer. > > > I found that perf will not process event until the end of perf-record: > > > root@virt-test:~# ~/perf record -ag -- sleep 10 > > > ...... 10 seconds later > > > [ perf record: Woken up 34 times to write data ] > > > machine__process_bpf_event: prog_id 6 loaded > > > machine__process_bpf_event: prog_id 6 unloaded > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > > In this example, the bpf program was loaded and then unloaded in > > > another terminal. When machine__process_bpf_event() processes > > > the load event, the bpf program is already unloaded. Therefore, > > > machine__process_bpf_event() will not be able to get information > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > > > as soon as perf get the event from kernel. I looked around the perf > > > code for a while. 
But I haven't found a good example where some > > > events are processed before the end of perf-record. Could you > > > please help me with this? > > perf record does not process events as they are generated. Its sole job > > is pushing data from the maps to a file as fast as possible meaning in > > bulk based on current read and write locations. > > Adding code to process events will add significant overhead to the > > record command and will not really solve your race problem. > I agree that processing events while recording has significant overhead. > In this case, perf user space need to know details about the the jited BPF > program. It is impossible to pass all these details to user space through > the relatively stable ring_buffer API. Therefore, some processing of the > data is necessary (get bpf prog_id from ring buffer, and then fetch program > details via BPF_OBJ_GET_INFO_BY_FD. > I have some idea on processing important data with relatively low overhead. > Let me try implement it. Well, you could have a separate thread processing just those kinds of events, associate it with a dummy event where you only ask for PERF_RECORD_BPF_EVENTs. 
Here is how to setup the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY perf_event_attr: [root@seventh ~]# perf record -vv -e dummy sleep 01 ------------------------------------------------------------ perf_event_attr: type 1 size 112 config 0x9 { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|PERIOD disabled 1 inherit 1 mmap 1 comm 1 freq 1 enable_on_exec 1 task 1 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 ------------------------------------------------------------ sys_perf_event_open: pid 12046 cpu 0 group_fd -1 flags 0x8 = 4 sys_perf_event_open: pid 12046 cpu 1 group_fd -1 flags 0x8 = 5 sys_perf_event_open: pid 12046 cpu 2 group_fd -1 flags 0x8 = 6 sys_perf_event_open: pid 12046 cpu 3 group_fd -1 flags 0x8 = 8 mmap size 528384B perf event ring buffer mmapped per cpu Synthesizing TSC conversion information [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.014 MB perf.data ] [root@seventh ~]# [root@seventh ~]# perf evlist -v dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1 [root@seventh ~]# There is work ongoing in dumping one file per cpu and then, at post processing time merging all those files to get ordering, so one more file, for these VIP events, that require per-event processing would be ordered at that time with all the other per-cpu files. - Arnaldo ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 12:11 ` Arnaldo Carvalho de Melo @ 2018-10-17 12:50 ` Arnaldo Carvalho de Melo 2018-10-17 16:06 ` Song Liu 2018-11-02 1:08 ` Song Liu 1 sibling, 1 reply; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-10-17 12:50 UTC (permalink / raw) To: Song Liu Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, Alexey Budankov, David S . Miller, Daniel Borkmann, Namhyung Kim, Jiri Olsa, Networking, kernel-team Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu: > Adding Alexey, Jiri and Namhyung as they worked/are working on > multithreading 'perf record'. > > Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu: > > On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote: > > > On 10/15/18 4:33 PM, Song Liu wrote: > > > > I am working with Alexei on the idea of fetching BPF program information via > > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > > > > to perf_event_type, and dumped these events to perf event ring buffer. > > > > > I found that perf will not process event until the end of perf-record: > > > > > root@virt-test:~# ~/perf record -ag -- sleep 10 > > > > ...... 10 seconds later > > > > [ perf record: Woken up 34 times to write data ] > > > > machine__process_bpf_event: prog_id 6 loaded > > > > machine__process_bpf_event: prog_id 6 unloaded > > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > > > > In this example, the bpf program was loaded and then unloaded in > > > > another terminal. When machine__process_bpf_event() processes > > > > the load event, the bpf program is already unloaded. Therefore, > > > > machine__process_bpf_event() will not be able to get information > > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. 
> > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > > > > as soon as perf get the event from kernel. I looked around the perf > > > > code for a while. But I haven't found a good example where some > > > > events are processed before the end of perf-record. Could you > > > > please help me with this? > > > > perf record does not process events as they are generated. Its sole job > > > is pushing data from the maps to a file as fast as possible meaning in > > > bulk based on current read and write locations. > > > > Adding code to process events will add significant overhead to the > > > record command and will not really solve your race problem. > > > I agree that processing events while recording has significant overhead. > > In this case, perf user space need to know details about the the jited BPF > > program. It is impossible to pass all these details to user space through > > the relatively stable ring_buffer API. Therefore, some processing of the > > data is necessary (get bpf prog_id from ring buffer, and then fetch program > > details via BPF_OBJ_GET_INFO_BY_FD. > > > I have some idea on processing important data with relatively low overhead. > > Let me try implement it. > > Well, you could have a separate thread processing just those kinds of > events, associate it with a dummy event where you only ask for > PERF_RECORD_BPF_EVENTs. 
> > Here is how to setup the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY > perf_event_attr: > > [root@seventh ~]# perf record -vv -e dummy sleep 01 > ------------------------------------------------------------ > perf_event_attr: > type 1 > size 112 > config 0x9 > { sample_period, sample_freq } 4000 > sample_type IP|TID|TIME|PERIOD > disabled 1 > inherit 1 These you would have disabled, no need for PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT > mmap 1 > comm 1 > task 1 > mmap2 1 > comm_exec 1 > freq 1 > enable_on_exec 1 > sample_id_all 1 > exclude_guest 1 > ------------------------------------------------------------ > sys_perf_event_open: pid 12046 cpu 0 group_fd -1 flags 0x8 = 4 > sys_perf_event_open: pid 12046 cpu 1 group_fd -1 flags 0x8 = 5 > sys_perf_event_open: pid 12046 cpu 2 group_fd -1 flags 0x8 = 6 > sys_perf_event_open: pid 12046 cpu 3 group_fd -1 flags 0x8 = 8 > mmap size 528384B > perf event ring buffer mmapped per cpu > Synthesizing TSC conversion information > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.014 MB perf.data ] > [root@seventh ~]# > > [root@seventh ~]# perf evlist -v > dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1 > [root@seventh ~]# > > There is work ongoing in dumping one file per cpu and then, at post > processing time merging all those files to get ordering, so one more > file, for these VIP events, that require per-event processing would be > ordered at that time with all the other per-cpu files. > > - Arnaldo ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 12:50 ` Arnaldo Carvalho de Melo @ 2018-10-17 16:06 ` Song Liu 0 siblings, 0 replies; 35+ messages in thread From: Song Liu @ 2018-10-17 16:06 UTC (permalink / raw) To: arnaldo.melo Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, Alexey Budankov, David S . Miller, Daniel Borkmann, namhyung, Jiri Olsa, Networking, kernel-team On Wed, Oct 17, 2018 at 5:50 AM Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> wrote: > > Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu: > > Adding Alexey, Jiri and Namhyung as they worked/are working on > > multithreading 'perf record'. > > > > Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu: > > > On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote: > > > > On 10/15/18 4:33 PM, Song Liu wrote: > > > > > I am working with Alexei on the idea of fetching BPF program information via > > > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > > > > > to perf_event_type, and dumped these events to perf event ring buffer. > > > > > > > I found that perf will not process event until the end of perf-record: > > > > > > > root@virt-test:~# ~/perf record -ag -- sleep 10 > > > > > ...... 10 seconds later > > > > > [ perf record: Woken up 34 times to write data ] > > > > > machine__process_bpf_event: prog_id 6 loaded > > > > > machine__process_bpf_event: prog_id 6 unloaded > > > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > > > > > > In this example, the bpf program was loaded and then unloaded in > > > > > another terminal. When machine__process_bpf_event() processes > > > > > the load event, the bpf program is already unloaded. Therefore, > > > > > machine__process_bpf_event() will not be able to get information > > > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. 
> > > > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > > > > > as soon as perf get the event from kernel. I looked around the perf > > > > > code for a while. But I haven't found a good example where some > > > > > events are processed before the end of perf-record. Could you > > > > > please help me with this? > > > > > > perf record does not process events as they are generated. Its sole job > > > > is pushing data from the maps to a file as fast as possible meaning in > > > > bulk based on current read and write locations. > > > > > > Adding code to process events will add significant overhead to the > > > > record command and will not really solve your race problem. > > > > > I agree that processing events while recording has significant overhead. > > > In this case, perf user space need to know details about the the jited BPF > > > program. It is impossible to pass all these details to user space through > > > the relatively stable ring_buffer API. Therefore, some processing of the > > > data is necessary (get bpf prog_id from ring buffer, and then fetch program > > > details via BPF_OBJ_GET_INFO_BY_FD. > > > > > I have some idea on processing important data with relatively low overhead. > > > Let me try implement it. > > > > Well, you could have a separate thread processing just those kinds of > > events, associate it with a dummy event where you only ask for > > PERF_RECORD_BPF_EVENTs. 
> > > > Here is how to setup the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY > > perf_event_attr: > > > > [root@seventh ~]# perf record -vv -e dummy sleep 01 > > ------------------------------------------------------------ > > perf_event_attr: > > type 1 > > size 112 > > config 0x9 > > { sample_period, sample_freq } 4000 > > sample_type IP|TID|TIME|PERIOD > > disabled 1 > > inherit 1 > > These you would have disabled, no need for > PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT > > > mmap 1 > > comm 1 > > task 1 > > mmap2 1 > > comm_exec 1 > > Thanks Arnaldo! This looks better than my original idea (using POLLPRI to highlight special events). I will try to implement the BPF_EVENT in this direction. Song ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 12:11 ` Arnaldo Carvalho de Melo 2018-10-17 12:50 ` Arnaldo Carvalho de Melo @ 2018-11-02 1:08 ` Song Liu 1 sibling, 0 replies; 35+ messages in thread From: Song Liu @ 2018-11-02 1:08 UTC (permalink / raw) To: Arnaldo Melo Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, Alexey Budankov, David S . Miller, Daniel Borkmann, namhyung, Jiri Olsa, Networking, kernel-team Hi Arnaldo, On Wed, Oct 17, 2018 at 5:11 AM Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> wrote: > > Adding Alexey, Jiri and Namhyung as they worked/are working on > multithreading 'perf record'. I have read Alexey's work on enabling aio for perf-record (https://lkml.org/lkml/2018/10/15/169). But I feel it is not really related to this work. Did I miss anything here? For VIP events, I think we need more mmaps. Currently, the default setting uses 1 mmap per CPU. To capture VIP events, I think we need another mmap per CPU. The VIP events will be sent to the special mmap. Then, user space will process these events before the end of perf-record. Does this approach make sense? Thanks! Song ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 6:43 ` Song Liu 2018-10-17 12:11 ` Arnaldo Carvalho de Melo @ 2018-10-17 15:09 ` David Ahern 2018-10-17 16:18 ` Song Liu 2018-10-17 16:36 ` Alexei Starovoitov 1 sibling, 2 replies; 35+ messages in thread From: David Ahern @ 2018-10-17 15:09 UTC (permalink / raw) To: Song Liu Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, kernel-team On 10/16/18 11:43 PM, Song Liu wrote: > I agree that processing events while recording has significant overhead. > In this case, perf user space need to know details about the the jited BPF > program. It is impossible to pass all these details to user space through > the relatively stable ring_buffer API. Therefore, some processing of the > data is necessary (get bpf prog_id from ring buffer, and then fetch program > details via BPF_OBJ_GET_INFO_BY_FD. > > I have some idea on processing important data with relatively low overhead. > Let me try implement it. > As I understand it, you want this series: kernel: add event to perf buffer on bpf prog load userspace: perf reads the event and grabs information about the program from the fd Is that correct? Userspace is not awakened immediately when an event is added to the ring. It is awakened once the number of events crosses a watermark. That means there is an unknown - and potentially long - time window where the program can be unloaded before perf reads the event. So no matter what you do, expecting perf record to be able to process the event quickly is an unreasonable expectation. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 15:09 ` David Ahern @ 2018-10-17 16:18 ` Song Liu 2018-10-17 16:36 ` Alexei Starovoitov 0 siblings, 0 replies; 35+ messages in thread From: Song Liu @ 2018-10-17 16:18 UTC (permalink / raw) To: David Ahern Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, kernel-team Hi David, On Wed, Oct 17, 2018 at 8:09 AM David Ahern <dsahern@gmail.com> wrote: > > On 10/16/18 11:43 PM, Song Liu wrote: > > I agree that processing events while recording has significant overhead. > > In this case, perf user space need to know details about the the jited BPF > > program. It is impossible to pass all these details to user space through > > the relatively stable ring_buffer API. Therefore, some processing of the > > data is necessary (get bpf prog_id from ring buffer, and then fetch program > > details via BPF_OBJ_GET_INFO_BY_FD. > > > > I have some idea on processing important data with relatively low overhead. > > Let me try implement it. > > > > As I understand it, you want this series: > > kernel: add event to perf buffer on bpf prog load > > userspace: perf reads the event and grabs information about the program > from the fd > > Is that correct? Yes, this is correct. > > Userpsace is not awakened immediately when an event is added the the > ring. It is awakened once the number of events crosses a watermark. That > means there is an unknown - and potentially long - time window where the > program can be unloaded before perf reads the event. > > So no matter what you do expecting perf record to be able to process the > event quickly is an unreasonable expectation. In this case, we don't really need to gather the information immediately. We will lose information about some short-lived BPF programs. These BPF programs are less important for performance analysis anyway. 
I guess the only missing case is when some BPF program gets loaded and unloaded many times, so each instance is short-lived, but the overall perf impact is significant. I think this case is not interesting either, as BPF programs should not be used like this (considering the kernel needs to translate and JIT the program, unloading and reloading it soon afterwards doesn't make any sense). Thanks, Song ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 15:09 ` David Ahern 2018-10-17 16:18 ` Song Liu @ 2018-10-17 16:36 ` Alexei Starovoitov 2018-10-17 18:53 ` Arnaldo Carvalho de Melo 1 sibling, 1 reply; 35+ messages in thread From: Alexei Starovoitov @ 2018-10-17 16:36 UTC (permalink / raw) To: David Ahern, Song Liu Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, Kernel Team On 10/17/18 8:09 AM, David Ahern wrote: > On 10/16/18 11:43 PM, Song Liu wrote: >> I agree that processing events while recording has significant overhead. >> In this case, perf user space need to know details about the the jited BPF >> program. It is impossible to pass all these details to user space through >> the relatively stable ring_buffer API. Therefore, some processing of the >> data is necessary (get bpf prog_id from ring buffer, and then fetch program >> details via BPF_OBJ_GET_INFO_BY_FD. >> >> I have some idea on processing important data with relatively low overhead. >> Let me try implement it. >> > > As I understand it, you want this series: > > kernel: add event to perf buffer on bpf prog load > > userspace: perf reads the event and grabs information about the program > from the fd > > Is that correct? > > Userpsace is not awakened immediately when an event is added the the > ring. It is awakened once the number of events crosses a watermark. That > means there is an unknown - and potentially long - time window where the > program can be unloaded before perf reads the event. > > So no matter what you do expecting perf record to be able to process the > event quickly is an unreasonable expectation. yes... unless we go with the threaded model as Arnaldo suggested and use a single event as a watermark to wake up our perf thread. 
In such a case there is still a race window between user space waking up and doing a _single_ bpf_get_fd_from_id() call to hold that prog, and some other process trying to instantly unload the prog it just loaded. I think such a race window is extremely tiny, and if perf misses those load/unload events it's a good thing, since there is no chance that normal pmu event samples would be happening during prog execution. The alternative approach, with no race window at all, is to burden kernel RECORD_* events with _all_ information about the bpf prog: the JITed addresses, the JITed image itself, info about all subprogs, info about line info, all BTF data, etc. As I said earlier, I'm strongly against such RECORD_* bloating. Instead we need to find a way to process new RECORD_BPF events with a single prog_id field in perf user space with a minimal race, and the threaded approach sounds like a win to me. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 16:36 ` Alexei Starovoitov @ 2018-10-17 18:53 ` Arnaldo Carvalho de Melo 2018-10-17 19:08 ` Alexei Starovoitov 0 siblings, 1 reply; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-10-17 18:53 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, Kernel Team Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu: > On 10/17/18 8:09 AM, David Ahern wrote: > > On 10/16/18 11:43 PM, Song Liu wrote: > >> I agree that processing events while recording has significant overhead. > >> In this case, perf user space need to know details about the the jited BPF > >> program. It is impossible to pass all these details to user space through > >> the relatively stable ring_buffer API. Therefore, some processing of the > >> data is necessary (get bpf prog_id from ring buffer, and then fetch program > >> details via BPF_OBJ_GET_INFO_BY_FD. > >> > >> I have some idea on processing important data with relatively low overhead. > >> Let me try implement it. > >> > > > > As I understand it, you want this series: > > > > kernel: add event to perf buffer on bpf prog load > > > > userspace: perf reads the event and grabs information about the program > > from the fd > > > > Is that correct? > > > > Userpsace is not awakened immediately when an event is added the the > > ring. It is awakened once the number of events crosses a watermark. That > > means there is an unknown - and potentially long - time window where the > > program can be unloaded before perf reads the event. > > So no matter what you do expecting perf record to be able to process the > > event quickly is an unreasonable expectation. > yes... unless we go with threaded model as Arnaldo suggested and use > single event as a watermark to wakeup our perf thread. 
> In such case there is still a race window between user space waking up > and doing _single_ bpf_get_fd_from_id() call to hold that prog > and some other process trying to instantly unload the prog it > just loaded. > I think such race window is extremely tiny and if perf misses > those load/unload events it's a good thing, since there is no chance > that normal pmu event samples would be happening during prog execution. > The alternative approach with no race window at all is to burden kernel > RECORD_* events with _all_ information about bpf prog. Which is jited > addresses, jited image itself, info about all subprogs, info about line > info, all BTF data, etc. As I said earlier I'm strongly against such > RECORD_* bloating. > Instead we need to find a way to process new RECORD_BPF events with > single prog_id field in perf user space with minimal race > and threaded approach sounds like a win to me. There is another alternative, I think: put just a content-based hash, like a git commit id, into a new PERF_RECORD_MMAP3 record, and when the verifier does the JIT, etc, it stashes the content that BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by the kernel right after getting stuff from sys_bpf() and preparing it for use. Then we know that in (start, end) we have blob foo with content id, which we will use to retrieve information that augments what we know with just (start, end, id) and allows annotation, etc. That stash space for JITed stuff gets garbage collected from time to time, or is even completely disabled if the user is not interested in such augmentation, just like one can disable perf's ~/.debug/ directory hashed by build-id. I think with this we have no races: the PERF_RECORD_MMAP3 gets just what is in PERF_RECORD_MMAP2 plus some 20 extra bytes for such a content-based cookie, and we solve the other race we already have with kernel modules, DSOs, etc. 
I have mentioned this before, there were objections, perhaps this time I formulated in a different way that makes it more interesting? - Arnaldo ^ permalink raw reply [flat|nested] 35+ messages in thread
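Arnaldo's proposal above can be sketched in user-space terms: a content-based cookie computed over the JITed image keys a stash directory, the same way perf keys its ~/.debug/ cache by build-id. Note the sketch below is illustrative only — the `jit_content_cookie`/`jit_stash_path` names, the `/var/tmp/bpf_jit/` path, and the use of 64-bit FNV-1a as a stand-in for a real ~20-byte build-id (e.g. SHA-1) are all assumptions, not kernel API:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* 64-bit FNV-1a over the JITed image bytes -- a stand-in for the
 * ~20-byte content hash the proposed PERF_RECORD_MMAP3 would carry.
 * Hashing an empty image returns the FNV offset basis. */
static uint64_t jit_content_cookie(const unsigned char *img, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= img[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

/* Build the stash path a tool would derive from the cookie alone,
 * so (start, end, cookie) is enough to find the augmenting info
 * even after the program itself is unloaded. */
static void jit_stash_path(char *buf, size_t n, uint64_t cookie)
{
	snprintf(buf, n, "/var/tmp/bpf_jit/%016llx/info",
		 (unsigned long long)cookie);
}
```

The point of keying by content rather than by prog_id is that the lookup stays valid regardless of load/unload timing, which is what sidesteps the race being discussed.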
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 18:53 ` Arnaldo Carvalho de Melo @ 2018-10-17 19:08 ` Alexei Starovoitov 2018-10-17 21:31 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 35+ messages in thread From: Alexei Starovoitov @ 2018-10-17 19:08 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, Kernel Team On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote: > Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu: >> On 10/17/18 8:09 AM, David Ahern wrote: >>> On 10/16/18 11:43 PM, Song Liu wrote: >>>> I agree that processing events while recording has significant overhead. >>>> In this case, perf user space need to know details about the the jited BPF >>>> program. It is impossible to pass all these details to user space through >>>> the relatively stable ring_buffer API. Therefore, some processing of the >>>> data is necessary (get bpf prog_id from ring buffer, and then fetch program >>>> details via BPF_OBJ_GET_INFO_BY_FD. >>>> >>>> I have some idea on processing important data with relatively low overhead. >>>> Let me try implement it. >>>> >>> >>> As I understand it, you want this series: >>> >>> kernel: add event to perf buffer on bpf prog load >>> >>> userspace: perf reads the event and grabs information about the program >>> from the fd >>> >>> Is that correct? >>> >>> Userpsace is not awakened immediately when an event is added the the >>> ring. It is awakened once the number of events crosses a watermark. That >>> means there is an unknown - and potentially long - time window where the >>> program can be unloaded before perf reads the event. > >>> So no matter what you do expecting perf record to be able to process the >>> event quickly is an unreasonable expectation. > >> yes... 
unless we go with threaded model as Arnaldo suggested and use >> single event as a watermark to wakeup our perf thread. >> In such case there is still a race window between user space waking up >> and doing _single_ bpf_get_fd_from_id() call to hold that prog >> and some other process trying to instantly unload the prog it >> just loaded. >> I think such race window is extremely tiny and if perf misses >> those load/unload events it's a good thing, since there is no chance >> that normal pmu event samples would be happening during prog execution. > >> The alternative approach with no race window at all is to burden kernel >> RECORD_* events with _all_ information about bpf prog. Which is jited >> addresses, jited image itself, info about all subprogs, info about line >> info, all BTF data, etc. As I said earlier I'm strongly against such >> RECORD_* bloating. >> Instead we need to find a way to process new RECORD_BPF events with >> single prog_id field in perf user space with minimal race >> and threaded approach sounds like a win to me. > > There is another alternative, I think: put just a content based hash, > like a git commit id into a PERF_RECORD_MMAP3 new record, and when the > validator does the jit, etc, it stashes the content that > BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by > the kernel right after getting stuff from sys_bpf() and preparing it for > use, then we know that in (start, end) we have blob foo with content id, > that we will use to retrieve information that augments what we know with > just (start, end, id) and allows annotation, etc. > > That stash space for jitted stuff gets garbage collected from time to > time or is even completely disabled if the user is not interested in > such augmentation, just like one can do disabling perf's ~/.debug/ > directory hashed by build-id. 
> > I think with this we have no races, the PERF_RECORD_MMAP3 gets just what > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based > cookie and we solve the other race we already have with kernel modules, > DSOs, etc. > > I have mentioned this before, there were objections, perhaps this time I > formulated in a different way that makes it more interesting? that 'content based hash' we already have. It's called the program tag, and we already taught iovisor/bcc to stash that stuff into the /var/tmp/bcc/bpf_prog_TAG/ directory. Unfortunately that approach didn't work. The JITed image only exists in the kernel. It's there only when the program is loaded, and it's the one that matters the most for performance analysis, since sample IPs are pointing into it. Also, the line info mapping that user space knows is not helping much either, since the verifier optimizes the instructions and then the JIT does more. The debug_info <-> JIT relationship must be preserved by the kernel and returned to user space. The only other non-racy way is to keep all that info in the kernel after the program is unloaded, but that is even worse than bloating RECORD*s ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 19:08 ` Alexei Starovoitov @ 2018-10-17 21:31 ` Arnaldo Carvalho de Melo 2018-10-17 22:01 ` Alexei Starovoitov 0 siblings, 1 reply; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-10-17 21:31 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, Kernel Team Em Wed, Oct 17, 2018 at 07:08:37PM +0000, Alexei Starovoitov escreveu: > On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote: > > Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu: > >> On 10/17/18 8:09 AM, David Ahern wrote: > >>> On 10/16/18 11:43 PM, Song Liu wrote: > >>>> I agree that processing events while recording has significant overhead. > >>>> In this case, perf user space need to know details about the the jited BPF > >>>> program. It is impossible to pass all these details to user space through > >>>> the relatively stable ring_buffer API. Therefore, some processing of the > >>>> data is necessary (get bpf prog_id from ring buffer, and then fetch program > >>>> details via BPF_OBJ_GET_INFO_BY_FD. > >>>> > >>>> I have some idea on processing important data with relatively low overhead. > >>>> Let me try implement it. > >>>> > >>> > >>> As I understand it, you want this series: > >>> > >>> kernel: add event to perf buffer on bpf prog load > >>> > >>> userspace: perf reads the event and grabs information about the program > >>> from the fd > >>> > >>> Is that correct? > >>> > >>> Userpsace is not awakened immediately when an event is added the the > >>> ring. It is awakened once the number of events crosses a watermark. That > >>> means there is an unknown - and potentially long - time window where the > >>> program can be unloaded before perf reads the event. 
> > > >>> So no matter what you do expecting perf record to be able to process the > >>> event quickly is an unreasonable expectation. > > > >> yes... unless we go with threaded model as Arnaldo suggested and use > >> single event as a watermark to wakeup our perf thread. > >> In such case there is still a race window between user space waking up > >> and doing _single_ bpf_get_fd_from_id() call to hold that prog > >> and some other process trying to instantly unload the prog it > >> just loaded. > >> I think such race window is extremely tiny and if perf misses > >> those load/unload events it's a good thing, since there is no chance > >> that normal pmu event samples would be happening during prog execution. > >> The alternative approach with no race window at all is to burden kernel > >> RECORD_* events with _all_ information about bpf prog. Which is jited > >> addresses, jited image itself, info about all subprogs, info about line > >> info, all BTF data, etc. As I said earlier I'm strongly against such > >> RECORD_* bloating. > >> Instead we need to find a way to process new RECORD_BPF events with > >> single prog_id field in perf user space with minimal race > >> and threaded approach sounds like a win to me. > > There is another alternative, I think: put just a content based hash, > > like a git commit id into a PERF_RECORD_MMAP3 new record, and when the > > validator does the jit, etc, it stashes the content that > > BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by > > the kernel right after getting stuff from sys_bpf() and preparing it for > > use, then we know that in (start, end) we have blob foo with content id, > > that we will use to retrieve information that augments what we know with > > just (start, end, id) and allows annotation, etc. 
> > That stash space for jitted stuff gets garbage collected from time to > > time or is even completely disabled if the user is not interested in > > such augmentation, just like one can do disabling perf's ~/.debug/ > > directory hashed by build-id. > > I think with this we have no races, the PERF_RECORD_MMAP3 gets just what > > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based > > cookie and we solve the other race we already have with kernel modules, > > DSOs, etc. > > I have mentioned this before, there were objections, perhaps this time I > > formulated in a different way that makes it more interesting? > that 'content based hash' we already have. It's called program tag. But that was calculated by whom? Userspace? It can't do that, its the kernel that ultimately puts together, from what userspace gave it, what we need to do performance analysis, line numbers, etc. > and we already taught iovisor/bcc to stash that stuff into > /var/tmp/bcc/bpf_prog_TAG/ directory. > Unfortunately that approach didn't work. > JITed image only exists in the kernel. It's there only when That is why I said that _the_ kernel stashes that thing, not bcc/iovisor or perf, the kernel calculates the hash, and it also puts that into the PERF_RECORD_MMAP3, so the tool sees it and goes to get it from the place the kernel stashed it. > program is loaded and it's the one that matter the most for performance > analysis, since sample IPs are pointing into it. Agreed > Also the line info mapping that user space knows is not helping much > either, since verifier optimizes the instructions and then JIT > does more. The debug_info <-> JIT relationship must be preserved > by the kernel and returned to user space. Violent agreement :-) > The only other non-race way is to keep all that info in the kernel after > program is unloaded, but that is even worse than bloating RECORD*s Keep all that info in a file, as I described above. 
Or keep it for a while, to give that thread in userspace time to get it and tell the kernel that it can throw it away. It may well be that most of the time the 'perf record' thread catching those events picks that up and saves it in /var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded, no? - Arnaldo ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-10-17 21:31 ` Arnaldo Carvalho de Melo @ 2018-10-17 22:01 ` Alexei Starovoitov 0 siblings, 0 replies; 35+ messages in thread From: Alexei Starovoitov @ 2018-10-17 22:01 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra, Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking, Kernel Team On 10/17/18 2:31 PM, Arnaldo Carvalho de Melo wrote: > > Keep all that info in a file, as I described above. Or keep it for a > while, to give that thread in userspace time to get it and tell the > kernel that it can throw it away. stashing by the kernel into a file is a huge headache, since the format of the file becomes kernel ABI. Plus why have it in two places (in a file and in normal kernel data structures)? > It may well be that most of the time the 'perf record' thread catching > those events picks that up and saves it in > /var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded, > no? Whether perf user space stashes that info into perf.data as synthetic records or stashes it somewhere in /var/tmp/ sounds about equivalent to me. Both have their pros and cons. This is certainly a good topic to discuss further. But asking the kernel to keep JITed images and all relevant bpf data after a program has been unloaded sounds really scary to me. I'm struggling to think through the security implications of that. How long is the kernel supposed to keep that? Some timeout? As I explained already, the time it takes for perf to do a _single_ get_fd_by_id syscall when it sees RECORD_MMAP with a prog_id is pretty instant. All other syscalls to grab the JITed image and everything else can be done later. The prog will not go away because perf will hold an fd. 
If the prog was somehow unloaded before perf could do get_fd_by_id, no one cares about such programs, since there is close to zero chance that this program was attached to anything and absolutely no chance that it ran. ^ permalink raw reply [flat|nested] 35+ messages in thread
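For reference, the event convention this thread keeps circling back to — mmap.pid == -1 marking a kernel event, mmap.tid == BPF_FS_MAGIC marking it as bpf-related, and an empty filename meaning unload — condenses into a small user-space check. The struct below is a trimmed stand-in for perf's real mmap event layout, and the function name is illustrative, not part of perf:

```c
#include <stdint.h>
#include <assert.h>

#define BPF_FS_MAGIC 0xcafe4a11	/* value from include/uapi/linux/magic.h */

enum bpf_mmap_kind { NOT_BPF, BPF_PROG_LOAD, BPF_PROG_UNLOAD };

struct mmap_event {		/* trimmed to the fields the check needs */
	uint32_t pid, tid;
	uint64_t start, len;
	char filename[64];
};

/* pid == -1 is the existing marker for kernel events; tid == BPF_FS_MAGIC
 * extends it to mean "bpf prog load/unload". A non-empty filename carries
 * the JITed symbol name; an empty one means the prog at 'start' went away. */
static enum bpf_mmap_kind classify_bpf_event(const struct mmap_event *ev)
{
	if (ev->pid != (uint32_t)-1 || ev->tid != BPF_FS_MAGIC)
		return NOT_BPF;
	return ev->filename[0] ? BPF_PROG_LOAD : BPF_PROG_UNLOAD;
}
```

This is the same test that gates the new branch in machine__process_kernel_mmap_event() in patch 3/3 below.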
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload 2018-09-20 8:44 ` Peter Zijlstra 2018-09-20 13:25 ` Arnaldo Carvalho de Melo @ 2018-09-20 13:36 ` Peter Zijlstra 1 sibling, 0 replies; 35+ messages in thread From: Peter Zijlstra @ 2018-09-20 13:36 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: David S . Miller, daniel, acme, netdev, kernel-team On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote: > On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote: > > void bpf_prog_kallsyms_del(struct bpf_prog *fp) > > { > > + unsigned long symbol_start, symbol_end; > > + /* mmap_record.filename cannot be NULL and has to be u64 aligned */ > > + char buf[sizeof(u64)] = {}; > > + > > if (!bpf_prog_kallsyms_candidate(fp)) > > return; > > > > spin_lock_bh(&bpf_lock); > > bpf_prog_ksym_node_del(fp->aux); > > spin_unlock_bh(&bpf_lock); > > + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end); > > + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start, > > + buf, sizeof(buf)); > > } > > So perf doesn't normally issue unmap events.. We've talked about doing > that, but so far it's never really been needed, I think. > > It feels a bit weird to start issuing unmap events for this. FWIW: https://lkml.kernel.org/r/20170127130702.GI6515@twins.programming.kicks-ass.net has talk of PERF_RECORD_MUNMAP ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs 2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov 2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov 2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov @ 2018-09-19 22:39 ` Alexei Starovoitov 2018-09-20 13:36 ` Arnaldo Carvalho de Melo 2018-09-21 22:57 ` Song Liu 2 siblings, 2 replies; 35+ messages in thread From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw) To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team Recognize JITed bpf prog load/unload events. Add/remove kernel symbols accordingly. Signed-off-by: Alexei Starovoitov <ast@kernel.org> --- tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++ tools/perf/util/symbol.c | 13 +++++++++++++ tools/perf/util/symbol.h | 1 + 3 files changed, 41 insertions(+) diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c index c4acd2001db0..ae4f8a0fdc7e 100644 --- a/tools/perf/util/machine.c +++ b/tools/perf/util/machine.c @@ -25,6 +25,7 @@ #include "sane_ctype.h" #include <symbol/kallsyms.h> #include <linux/mman.h> +#include <linux/magic.h> static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock); @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine, enum dso_kernel_type kernel_type; bool is_kernel_mmap; + /* process JITed bpf programs load/unload events */ + if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) { + struct symbol *sym; + u64 ip; + + map = map_groups__find(&machine->kmaps, event->mmap.start); + if (!map) { + pr_err("No kernel map for IP %lx\n", event->mmap.start); + goto out_problem; + } + ip = event->mmap.start - map->start + map->pgoff; + if (event->mmap.filename[0]) { + sym = symbol__new(ip, event->mmap.len, 0, 0, + 
event->mmap.filename); + dso__insert_symbol(map->dso, sym); + } else { + if (symbols__erase(&map->dso->symbols, ip)) { + pr_err("No bpf prog at IP %lx/%lx\n", + event->mmap.start, ip); + goto out_problem; + } + dso__reset_find_symbol_cache(map->dso); + } + return 0; + } + /* If we have maps from kcore then we do not need or want any others */ if (machine__uses_kcore(machine)) return 0; diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c index d188b7588152..0653f313661d 100644 --- a/tools/perf/util/symbol.c +++ b/tools/perf/util/symbol.c @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip) return NULL; } +int symbols__erase(struct rb_root *symbols, u64 ip) +{ + struct symbol *s; + + s = symbols__find(symbols, ip); + if (!s) + return -ENOENT; + + rb_erase(&s->rb_node, symbols); + symbol__delete(s); + return 0; +} + static struct symbol *symbols__first(struct rb_root *symbols) { struct rb_node *n = rb_first(symbols); diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h index f25fae4b5743..92ef31953d9a 100644 --- a/tools/perf/util/symbol.h +++ b/tools/perf/util/symbol.h @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name); void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel); void symbols__insert(struct rb_root *symbols, struct symbol *sym); +int symbols__erase(struct rb_root *symbols, u64 ip); void symbols__fixup_duplicate(struct rb_root *symbols); void symbols__fixup_end(struct rb_root *symbols); void map_groups__fixup_end(struct map_groups *mg); -- 2.17.1 ^ permalink raw reply related [flat|nested] 35+ messages in thread
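The address translation in the hunk above (ip = event->mmap.start - map->start + map->pgoff) is plain arithmetic and can be checked in isolation. The struct below is a minimal stand-in for perf's kernel map — it is not the real struct map, just the two fields the translation uses:

```c
#include <stdint.h>
#include <assert.h>

/* Minimal stand-in for the perf kernel map fields used by the patch. */
struct kmap_stub {
	uint64_t start;		/* where the map begins in the address space */
	uint64_t pgoff;		/* dso-relative offset of that beginning */
};

/* Same translation as in machine__process_kernel_mmap_event(): turn the
 * event's absolute start address into the dso-relative address that the
 * dso's symbol rb-tree is keyed by. */
static uint64_t bpf_event_ip(const struct kmap_stub *map, uint64_t ev_start)
{
	return ev_start - map->start + map->pgoff;
}
```

When start == pgoff (an identity-mapped kernel map), the translation is a no-op; otherwise it rebases the sample IP into the dso's own coordinate space before symbols__insert()/symbols__erase() run.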
* Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs 2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov @ 2018-09-20 13:36 ` Arnaldo Carvalho de Melo 2018-09-20 14:05 ` Arnaldo Carvalho de Melo 2018-09-21 22:57 ` Song Liu 1 sibling, 1 reply; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-09-20 13:36 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: David S . Miller, daniel, peterz, netdev, kernel-team Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu: > Recognize JITed bpf prog load/unload events. > Add/remove kernel symbols accordingly. > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > --- > tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++ > tools/perf/util/symbol.c | 13 +++++++++++++ > tools/perf/util/symbol.h | 1 + > 3 files changed, 41 insertions(+) > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c > index c4acd2001db0..ae4f8a0fdc7e 100644 > --- a/tools/perf/util/machine.c > +++ b/tools/perf/util/machine.c > @@ -25,6 +25,7 @@ > #include "sane_ctype.h" > #include <symbol/kallsyms.h> > #include <linux/mman.h> > +#include <linux/magic.h> > > static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock); > > @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine, > enum dso_kernel_type kernel_type; > bool is_kernel_mmap; > > + /* process JITed bpf programs load/unload events */ > + if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) { So, this would be in machine__process_kernel_munmap-event(machine), etc, no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP. 
> + struct symbol *sym; > + u64 ip; > + > + map = map_groups__find(&machine->kmaps, event->mmap.start); > + if (!map) { > + pr_err("No kernel map for IP %lx\n", event->mmap.start); > + goto out_problem; > + } > + ip = event->mmap.start - map->start + map->pgoff; > + if (event->mmap.filename[0]) { > + sym = symbol__new(ip, event->mmap.len, 0, 0, > + event->mmap.filename); Humm, so the bpf program would be just one symbol... bpf-to-bpf calls will be to a different bpf program, right? /me goes to read https://lwn.net/Articles/741773/ "[PATCH bpf-next 00/13] bpf: introduce function calls" > + dso__insert_symbol(map->dso, sym); > + } else { > + if (symbols__erase(&map->dso->symbols, ip)) { > + pr_err("No bpf prog at IP %lx/%lx\n", > + event->mmap.start, ip); > + goto out_problem; > + } > + dso__reset_find_symbol_cache(map->dso); > + } > + return 0; > + } > + > /* If we have maps from kcore then we do not need or want any others */ > if (machine__uses_kcore(machine)) > return 0; > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c > index d188b7588152..0653f313661d 100644 > --- a/tools/perf/util/symbol.c > +++ b/tools/perf/util/symbol.c > @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip) > return NULL; > } > > +int symbols__erase(struct rb_root *symbols, u64 ip) > +{ > + struct symbol *s; > + > + s = symbols__find(symbols, ip); > + if (!s) > + return -ENOENT; > + > + rb_erase(&s->rb_node, symbols); > + symbol__delete(s); > + return 0; > +} > + > static struct symbol *symbols__first(struct rb_root *symbols) > { > struct rb_node *n = rb_first(symbols); > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h > index f25fae4b5743..92ef31953d9a 100644 > --- a/tools/perf/util/symbol.h > +++ b/tools/perf/util/symbol.h > @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name); > > void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel); > void 
symbols__insert(struct rb_root *symbols, struct symbol *sym); > +int symbols__erase(struct rb_root *symbols, u64 ip); > void symbols__fixup_duplicate(struct rb_root *symbols); > void symbols__fixup_end(struct rb_root *symbols); > void map_groups__fixup_end(struct map_groups *mg); > -- > 2.17.1 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs 2018-09-20 13:36 ` Arnaldo Carvalho de Melo @ 2018-09-20 14:05 ` Arnaldo Carvalho de Melo 0 siblings, 0 replies; 35+ messages in thread From: Arnaldo Carvalho de Melo @ 2018-09-20 14:05 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: David S . Miller, daniel, peterz, netdev, kernel-team Em Thu, Sep 20, 2018 at 10:36:17AM -0300, Arnaldo Carvalho de Melo escreveu: > Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu: > > Recognize JITed bpf prog load/unload events. > > Add/remove kernel symbols accordingly. > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > --- > > tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++ > > tools/perf/util/symbol.c | 13 +++++++++++++ > > tools/perf/util/symbol.h | 1 + > > 3 files changed, 41 insertions(+) > > > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c > > index c4acd2001db0..ae4f8a0fdc7e 100644 > > --- a/tools/perf/util/machine.c > > +++ b/tools/perf/util/machine.c > > @@ -25,6 +25,7 @@ > > #include "sane_ctype.h" > > #include <symbol/kallsyms.h> > > #include <linux/mman.h> > > +#include <linux/magic.h> > > > > static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock); > > > > @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine, > > enum dso_kernel_type kernel_type; > > bool is_kernel_mmap; > > > > + /* process JITed bpf programs load/unload events */ > > + if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) { > > > So, this would be in machine__process_kernel_munmap-event(machine), etc, > no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP. 
> > > + struct symbol *sym; > > + u64 ip; > > + > > + map = map_groups__find(&machine->kmaps, event->mmap.start); > > + if (!map) { > > + pr_err("No kernel map for IP %lx\n", event->mmap.start); > > + goto out_problem; > > + } > > + ip = event->mmap.start - map->start + map->pgoff; > > + if (event->mmap.filename[0]) { > > + sym = symbol__new(ip, event->mmap.len, 0, 0, > > + event->mmap.filename); > > Humm, so the bpf program would be just one symbol... bpf-to-bpf calls > will be to a different bpf program, right? > > /me goes to read https://lwn.net/Articles/741773/ > "[PATCH bpf-next 00/13] bpf: introduce function calls" After reading it, yeah, I think we need some way to access a symbol table for a BPF program, and also its binary so that we can do annotation, static (perf annotate) and live (perf top), was this already considered? I think one can get the binary for a program giving sufficient perms somehow, right? One other thing I need to catch up 8-) - Arnaldo > > + dso__insert_symbol(map->dso, sym); > > + } else { > > + if (symbols__erase(&map->dso->symbols, ip)) { > > + pr_err("No bpf prog at IP %lx/%lx\n", > > + event->mmap.start, ip); > > + goto out_problem; > > + } > > + dso__reset_find_symbol_cache(map->dso); > > + } > > + return 0; > > + } > > + > > /* If we have maps from kcore then we do not need or want any others */ > > if (machine__uses_kcore(machine)) > > return 0; > > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c > > index d188b7588152..0653f313661d 100644 > > --- a/tools/perf/util/symbol.c > > +++ b/tools/perf/util/symbol.c > > @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip) > > return NULL; > > } > > > > +int symbols__erase(struct rb_root *symbols, u64 ip) > > +{ > > + struct symbol *s; > > + > > + s = symbols__find(symbols, ip); > > + if (!s) > > + return -ENOENT; > > + > > + rb_erase(&s->rb_node, symbols); > > + symbol__delete(s); > > + return 0; > > +} > > + > > static struct 
symbol *symbols__first(struct rb_root *symbols) > > { > > struct rb_node *n = rb_first(symbols); > > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h > > index f25fae4b5743..92ef31953d9a 100644 > > --- a/tools/perf/util/symbol.h > > +++ b/tools/perf/util/symbol.h > > @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name); > > > > void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel); > > void symbols__insert(struct rb_root *symbols, struct symbol *sym); > > +int symbols__erase(struct rb_root *symbols, u64 ip); > > void symbols__fixup_duplicate(struct rb_root *symbols); > > void symbols__fixup_end(struct rb_root *symbols); > > void map_groups__fixup_end(struct map_groups *mg); > > -- > > 2.17.1 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs 2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov 2018-09-20 13:36 ` Arnaldo Carvalho de Melo @ 2018-09-21 22:57 ` Song Liu 1 sibling, 0 replies; 35+ messages in thread From: Song Liu @ 2018-09-21 22:57 UTC (permalink / raw) To: Alexei Starovoitov Cc: David S . Miller, Daniel Borkmann, Peter Zijlstra, acme, Networking, kernel-team On Wed, Sep 19, 2018 at 3:51 PM Alexei Starovoitov <ast@kernel.org> wrote: > > Recognize JITed bpf prog load/unload events. > Add/remove kernel symbols accordingly. > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> > --- > tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++ > tools/perf/util/symbol.c | 13 +++++++++++++ > tools/perf/util/symbol.h | 1 + > 3 files changed, 41 insertions(+) > > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c > index c4acd2001db0..ae4f8a0fdc7e 100644 > --- a/tools/perf/util/machine.c > +++ b/tools/perf/util/machine.c > @@ -25,6 +25,7 @@ > #include "sane_ctype.h" > #include <symbol/kallsyms.h> > #include <linux/mman.h> > +#include <linux/magic.h> > > static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock); > > @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine, > enum dso_kernel_type kernel_type; > bool is_kernel_mmap; > > + /* process JITed bpf programs load/unload events */ > + if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) { > + struct symbol *sym; > + u64 ip; > + > + map = map_groups__find(&machine->kmaps, event->mmap.start); > + if (!map) { > + pr_err("No kernel map for IP %lx\n", event->mmap.start); > + goto out_problem; > + } > + ip = event->mmap.start - map->start + map->pgoff; > + if (event->mmap.filename[0]) { > + sym = symbol__new(ip, event->mmap.len, 0, 0, > + 
event->mmap.filename); > + dso__insert_symbol(map->dso, sym); > + } else { > + if (symbols__erase(&map->dso->symbols, ip)) { > + pr_err("No bpf prog at IP %lx/%lx\n", > + event->mmap.start, ip); > + goto out_problem; > + } > + dso__reset_find_symbol_cache(map->dso); > + } > + return 0; > + } > + > /* If we have maps from kcore then we do not need or want any others */ > if (machine__uses_kcore(machine)) > return 0; > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c > index d188b7588152..0653f313661d 100644 > --- a/tools/perf/util/symbol.c > +++ b/tools/perf/util/symbol.c > @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip) > return NULL; > } > > +int symbols__erase(struct rb_root *symbols, u64 ip) > +{ > + struct symbol *s; > + > + s = symbols__find(symbols, ip); > + if (!s) > + return -ENOENT; > + > + rb_erase(&s->rb_node, symbols); > + symbol__delete(s); > + return 0; > +} > + > static struct symbol *symbols__first(struct rb_root *symbols) > { > struct rb_node *n = rb_first(symbols); > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h > index f25fae4b5743..92ef31953d9a 100644 > --- a/tools/perf/util/symbol.h > +++ b/tools/perf/util/symbol.h > @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name); > > void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel); > void symbols__insert(struct rb_root *symbols, struct symbol *sym); > +int symbols__erase(struct rb_root *symbols, u64 ip); > void symbols__fixup_duplicate(struct rb_root *symbols); > void symbols__fixup_end(struct rb_root *symbols); > void map_groups__fixup_end(struct map_groups *mg); > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2018-11-02 10:13 UTC | newest] Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov 2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov 2018-09-19 23:30 ` Song Liu 2018-09-20 0:53 ` Alexei Starovoitov 2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov 2018-09-19 23:44 ` Song Liu 2018-09-20 0:59 ` Alexei Starovoitov 2018-09-20 5:48 ` Song Liu 2018-09-20 8:44 ` Peter Zijlstra 2018-09-20 13:25 ` Arnaldo Carvalho de Melo 2018-09-20 13:56 ` Peter Zijlstra 2018-09-21 3:14 ` Alexei Starovoitov 2018-09-21 12:25 ` Arnaldo Carvalho de Melo 2018-09-21 13:55 ` Peter Zijlstra 2018-09-21 13:59 ` Peter Zijlstra 2018-09-21 22:13 ` Alexei Starovoitov 2018-10-15 23:33 ` Song Liu 2018-10-16 23:43 ` David Ahern 2018-10-17 6:43 ` Song Liu 2018-10-17 12:11 ` Arnaldo Carvalho de Melo 2018-10-17 12:50 ` Arnaldo Carvalho de Melo 2018-10-17 16:06 ` Song Liu 2018-11-02 1:08 ` Song Liu 2018-10-17 15:09 ` David Ahern 2018-10-17 16:18 ` Song Liu 2018-10-17 16:36 ` Alexei Starovoitov 2018-10-17 18:53 ` Arnaldo Carvalho de Melo 2018-10-17 19:08 ` Alexei Starovoitov 2018-10-17 21:31 ` Arnaldo Carvalho de Melo 2018-10-17 22:01 ` Alexei Starovoitov 2018-09-20 13:36 ` Peter Zijlstra 2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov 2018-09-20 13:36 ` Arnaldo Carvalho de Melo 2018-09-20 14:05 ` Arnaldo Carvalho de Melo 2018-09-21 22:57 ` Song Liu