* [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs
@ 2018-09-19 22:39 Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team

Hi All,

This patch set adds kernel and user space support to reveal
short-lived bpf programs.
The kernel stores addr+bpf_prog_name information into the perf ring buffer.
Later 'perf report' can properly attribute the cpu time to the program.

The following command was run before and after: 'perf record ./test_progs; perf report'

Before patch set:

# Overhead  Command     Shared Object      Symbol                                 
# ........  ..........  .................  .......................................
#
    10.73%  test_progs  [kernel.kallsyms]  [k] __htab_map_lookup_elem
     8.16%  test_progs  [kernel.kallsyms]  [k] bpf_skb_set_tunnel_key
     7.90%  test_progs  [kernel.kallsyms]  [k] memcmp
     3.10%  test_progs  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
     2.57%  test_progs  [kernel.kallsyms]  [k] dst_release
     2.45%  test_progs  [kernel.kallsyms]  [k] do_check
     1.89%  test_progs  [kernel.kallsyms]  [k] bpf_xdp_adjust_head
     1.52%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa002cc5a
     1.22%  test_progs  [kernel.kallsyms]  [k] check_helper_mem_access
     0.99%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a6a8
     0.99%  test_progs  [kernel.kallsyms]  [k] kmem_cache_alloc_trace
     0.97%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a652
     0.95%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa0042d52
     0.95%  test_progs  [kernel.kallsyms]  [k] read_tsc
     0.81%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a660
     0.77%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa0042d69
     0.74%  test_progs  [kernel.kallsyms]  [k] percpu_array_map_lookup_elem
     0.69%  test_progs  [kernel.kallsyms]  [k] 0x00007fffa001a64e

After:

# Overhead  Command     Shared Object      Symbol                                       
# ........  ..........  .................  .............................................
#
    18.13%  test_progs  [kernel.kallsyms]  [k] bpf_prog_1accc788e7f04c38_balancer_ingres
    10.73%  test_progs  [kernel.kallsyms]  [k] __htab_map_lookup_elem
     7.94%  test_progs  [kernel.kallsyms]  [k] bpf_prog_20a05d8a586cf0e8_F
     7.80%  test_progs  [kernel.kallsyms]  [k] bpf_skb_set_tunnel_key
     4.64%  test_progs  [kernel.kallsyms]  [k] __sysfs_match_string
     3.61%  test_progs  [kernel.kallsyms]  [k] bpf_prog_9d89afa51f1dc0d7_F
     3.51%  test_progs  [kernel.kallsyms]  [k] bpf_prog_73b45270911a0294_F
     3.41%  test_progs  [kernel.kallsyms]  [k] bpf_prog_79be7b7bad8026bf_F
     3.26%  test_progs  [kernel.kallsyms]  [k] memcmp
     3.10%  test_progs  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
     2.89%  test_progs  [kernel.kallsyms]  [k] bpf_prog_57bf3d413b9b7455_F
     2.57%  test_progs  [kernel.kallsyms]  [k] dst_discard
     2.45%  test_progs  [kernel.kallsyms]  [k] do_check
     2.39%  test_progs  [kernel.kallsyms]  [k] bpf_prog_576cbdaac1a4d2f6_F
     1.82%  test_progs  [kernel.kallsyms]  [k] bpf_prog_53ade2ecbddaa85b_F
     1.70%  test_progs  [kernel.kallsyms]  [k] bpf_xdp_adjust_head
     1.32%  test_progs  [kernel.kallsyms]  [k] bpf_prog_0edc54822404a598_F

Important considerations:

- Before and after, the number of cpu samples is the same. No samples are lost.
  But perf cannot find the symbol by IP, so a lot of small 0x00007fffa001a64e-like
  symbols appear towards the end of 'perf report' and none of them look hot.
  In reality these IP addresses belong to a few bpf programs that
  were active at that time.

- a newly loaded bpf program can be JITed into the address space of an unloaded prog.
  A 7fffa001aXXX address at time X can belong to program A, but a similar
  7fffa001aYYY address at time Y can belong to a different program B.

- event->mmap.pid == -1 is an existing indicator of a kernel event.
  event->mmap.tid == BPF_FS_MAGIC is an extension to indicate a bpf related event
  (see the user space sketch right after this list).
  Alternatively it's possible to introduce a new 'enum perf_event_type' command
  specifically for bpf prog load/unload, but the existing RECORD_MMAP
  is very close, so the choice made by this patch set is to extend it.

- steps to land the set:
  Patches 1 and 2 need to go via the bpf-next tree,
  since we're adding support for better program names in exactly the same lines.
  Patch 3 can go in parallel into the perf tree. It has no effect without kernel
  support, and kernel support has no effect on old perf.
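
For illustration, a minimal sketch (not part of the patches; the helper
name is made up) of how a user space consumer of the perf ring buffer
could tell these bpf RECORD_MMAP events apart, following the
PERF_RECORD_MMAP layout documented in include/uapi/linux/perf_event.h:

#include <linux/perf_event.h>
#include <linux/magic.h>

struct mmap_record {
	struct perf_event_header header;
	__u32 pid, tid;
	__u64 start, len, pgoff;
	char  filename[];
};

/* returns 1 for bpf prog load, -1 for unload, 0 for anything else */
static int bpf_mmap_record_kind(const struct mmap_record *rec)
{
	if (rec->header.type != PERF_RECORD_MMAP)
		return 0;
	/* pid == -1 marks a kernel event, tid == BPF_FS_MAGIC marks bpf */
	if (rec->pid != (__u32)-1 || rec->tid != BPF_FS_MAGIC)
		return 0;
	/* an empty program name is the unload indication */
	return rec->filename[0] ? 1 : -1;
}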

Peter, Arnaldo, Daniel, please review.

Alexei Starovoitov (3):
  perf/core: introduce perf_event_mmap_bpf_prog
  bpf: emit RECORD_MMAP events for bpf prog load/unload
  tools/perf: recognize and process RECORD_MMAP events for bpf progs

 include/linux/perf_event.h |  1 +
 kernel/bpf/core.c          | 22 +++++++++++++++++--
 kernel/events/core.c       | 44 +++++++++++++++++++++++++++++++++-----
 tools/perf/util/machine.c  | 27 +++++++++++++++++++++++
 tools/perf/util/symbol.c   | 13 +++++++++++
 tools/perf/util/symbol.h   |  1 +
 6 files changed, 101 insertions(+), 7 deletions(-)

-- 
2.17.1


* [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog
  2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov
@ 2018-09-19 22:39 ` Alexei Starovoitov
  2018-09-19 23:30   ` Song Liu
  2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov
  2 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team

introduce the perf_event_mmap_bpf_prog() helper to emit RECORD_MMAP events
into the perf ring buffer.
It's used by the bpf load/unload logic to notify user space of the addresses
and names of JITed bpf programs.

Note that event->mmap.pid == -1 is an existing indicator of a kernel event.
In addition, use event->mmap.tid == BPF_FS_MAGIC to indicate a bpf related
RECORD_MMAP event.

Alternatively it's possible to introduce a new 'enum perf_event_type' command
specifically for bpf prog load/unload, but the existing RECORD_MMAP
is very close, so the choice made by this patch is to extend it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 44 +++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 53c500f0ca79..0e79af83138f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1113,6 +1113,7 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
 }
 
 extern void perf_event_mmap(struct vm_area_struct *vma);
+void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size);
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
 extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
 extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2a62b96600ad..c48244ddf993 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7152,7 +7152,7 @@ static int perf_event_mmap_match(struct perf_event *event,
 {
 	struct perf_mmap_event *mmap_event = data;
 	struct vm_area_struct *vma = mmap_event->vma;
-	int executable = vma->vm_flags & VM_EXEC;
+	int executable = !vma || vma->vm_flags & VM_EXEC;
 
 	return (!executable && event->attr.mmap_data) ||
 	       (executable && (event->attr.mmap || event->attr.mmap2));
@@ -7165,12 +7165,13 @@ static void perf_event_mmap_output(struct perf_event *event,
 	struct perf_output_handle handle;
 	struct perf_sample_data sample;
 	int size = mmap_event->event_id.header.size;
+	bool bpf_event = !mmap_event->vma;
 	int ret;
 
 	if (!perf_event_mmap_match(event, data))
 		return;
 
-	if (event->attr.mmap2) {
+	if (event->attr.mmap2 && !bpf_event) {
 		mmap_event->event_id.header.type = PERF_RECORD_MMAP2;
 		mmap_event->event_id.header.size += sizeof(mmap_event->maj);
 		mmap_event->event_id.header.size += sizeof(mmap_event->min);
@@ -7186,12 +7187,14 @@ static void perf_event_mmap_output(struct perf_event *event,
 	if (ret)
 		goto out;
 
-	mmap_event->event_id.pid = perf_event_pid(event, current);
-	mmap_event->event_id.tid = perf_event_tid(event, current);
+	if (!bpf_event) {
+		mmap_event->event_id.pid = perf_event_pid(event, current);
+		mmap_event->event_id.tid = perf_event_tid(event, current);
+	}
 
 	perf_output_put(&handle, mmap_event->event_id);
 
-	if (event->attr.mmap2) {
+	if (event->attr.mmap2 && !bpf_event) {
 		perf_output_put(&handle, mmap_event->maj);
 		perf_output_put(&handle, mmap_event->min);
 		perf_output_put(&handle, mmap_event->ino);
@@ -7448,6 +7451,37 @@ void perf_event_mmap(struct vm_area_struct *vma)
 	perf_event_mmap_event(&mmap_event);
 }
 
+void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size)
+{
+	struct perf_mmap_event mmap_event;
+
+	if (!atomic_read(&nr_mmap_events))
+		return;
+
+	if (!IS_ALIGNED(size, sizeof(u64))) {
+		WARN_ONCE(1, "size is not aligned\n");
+		return;
+	}
+
+	mmap_event = (struct perf_mmap_event){
+		.file_name = name,
+		.file_size = size,
+		.event_id  = {
+			.header = {
+				.type = PERF_RECORD_MMAP,
+				.misc = PERF_RECORD_MISC_KERNEL,
+				.size = sizeof(mmap_event.event_id) + size,
+			},
+			.pid = -1, /* indicates kernel */
+			.tid = BPF_FS_MAGIC, /* bpf mmap event */
+			.start  = start,
+			.len    = len,
+			.pgoff  = start,
+		},
+	};
+	perf_iterate_sb(perf_event_mmap_output, &mmap_event, NULL);
+}
+
 void perf_event_aux_event(struct perf_event *event, unsigned long head,
 			  unsigned long size, u64 flags)
 {
-- 
2.17.1


* [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov
@ 2018-09-19 22:39 ` Alexei Starovoitov
  2018-09-19 23:44   ` Song Liu
  2018-09-20  8:44   ` Peter Zijlstra
  2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov
  2 siblings, 2 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team

use the perf_event_mmap_bpf_prog() helper to notify user space
about JITed bpf programs.
Use the RECORD_MMAP perf event to tell user space where a JITed bpf program
was loaded. Use an empty program name as the unload indication.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/core.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3f5bf1af0826..ddf11fdafd36 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
 	*symbol_end   = addr + hdr->pages * PAGE_SIZE;
 }
 
-static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
+static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
 {
 	const char *end = sym + KSYM_NAME_LEN;
 
@@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
 	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
 	sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
 	if (prog->aux->name[0])
-		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
+		sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
 	else
 		*sym = 0;
+	return sym;
 }
 
 static __always_inline unsigned long
@@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
+	unsigned long symbol_start, symbol_end;
+	char buf[KSYM_NAME_LEN], *sym;
+
 	if (!bpf_prog_kallsyms_candidate(fp) ||
 	    !capable(CAP_SYS_ADMIN))
 		return;
 
+	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
+	sym = bpf_get_prog_name(fp, buf);
+	sym++; /* sym - buf is the length of the name including trailing 0 */
+	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
+		*sym++ = 0;
 	spin_lock_bh(&bpf_lock);
 	bpf_prog_ksym_node_add(fp->aux);
 	spin_unlock_bh(&bpf_lock);
+	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
+				 buf, sym - buf);
 }
 
 void bpf_prog_kallsyms_del(struct bpf_prog *fp)
 {
+	unsigned long symbol_start, symbol_end;
+	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
+	char buf[sizeof(u64)] = {};
+
 	if (!bpf_prog_kallsyms_candidate(fp))
 		return;
 
 	spin_lock_bh(&bpf_lock);
 	bpf_prog_ksym_node_del(fp->aux);
 	spin_unlock_bh(&bpf_lock);
+	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
+	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
+				 buf, sizeof(buf));
 }
 
 static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
-- 
2.17.1


* [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs
  2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov
  2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov
@ 2018-09-19 22:39 ` Alexei Starovoitov
  2018-09-20 13:36   ` Arnaldo Carvalho de Melo
  2018-09-21 22:57   ` Song Liu
  2 siblings, 2 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-19 22:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, peterz, acme, netdev, kernel-team

Recognize JITed bpf prog load/unload events.
Add/remove kernel symbols accordingly.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++
 tools/perf/util/symbol.c  | 13 +++++++++++++
 tools/perf/util/symbol.h  |  1 +
 3 files changed, 41 insertions(+)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index c4acd2001db0..ae4f8a0fdc7e 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -25,6 +25,7 @@
 #include "sane_ctype.h"
 #include <symbol/kallsyms.h>
 #include <linux/mman.h>
+#include <linux/magic.h>
 
 static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock);
 
@@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine,
 	enum dso_kernel_type kernel_type;
 	bool is_kernel_mmap;
 
+	/* process JITed bpf programs load/unload events */
+	if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) {
+		struct symbol *sym;
+		u64 ip;
+
+		map = map_groups__find(&machine->kmaps, event->mmap.start);
+		if (!map) {
+			pr_err("No kernel map for IP %lx\n", event->mmap.start);
+			goto out_problem;
+		}
+		ip = event->mmap.start - map->start + map->pgoff;
+		if (event->mmap.filename[0]) {
+			sym = symbol__new(ip, event->mmap.len, 0, 0,
+					  event->mmap.filename);
+			dso__insert_symbol(map->dso, sym);
+		} else {
+			if (symbols__erase(&map->dso->symbols, ip)) {
+				pr_err("No bpf prog at IP %lx/%lx\n",
+				       event->mmap.start, ip);
+				goto out_problem;
+			}
+			dso__reset_find_symbol_cache(map->dso);
+		}
+		return 0;
+	}
+
 	/* If we have maps from kcore then we do not need or want any others */
 	if (machine__uses_kcore(machine))
 		return 0;
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index d188b7588152..0653f313661d 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip)
 	return NULL;
 }
 
+int symbols__erase(struct rb_root *symbols, u64 ip)
+{
+	struct symbol *s;
+
+	s = symbols__find(symbols, ip);
+	if (!s)
+		return -ENOENT;
+
+	rb_erase(&s->rb_node, symbols);
+	symbol__delete(s);
+	return 0;
+}
+
 static struct symbol *symbols__first(struct rb_root *symbols)
 {
 	struct rb_node *n = rb_first(symbols);
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index f25fae4b5743..92ef31953d9a 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name);
 
 void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel);
 void symbols__insert(struct rb_root *symbols, struct symbol *sym);
+int symbols__erase(struct rb_root *symbols, u64 ip);
 void symbols__fixup_duplicate(struct rb_root *symbols);
 void symbols__fixup_end(struct rb_root *symbols);
 void map_groups__fixup_end(struct map_groups *mg);
-- 
2.17.1


* Re: [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog
  2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov
@ 2018-09-19 23:30   ` Song Liu
  2018-09-20  0:53     ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Song Liu @ 2018-09-19 23:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team



> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> 
> introduce the perf_event_mmap_bpf_prog() helper to emit RECORD_MMAP events
> into the perf ring buffer.
> It's used by the bpf load/unload logic to notify user space of the addresses
> and names of JITed bpf programs.
> 
> Note that event->mmap.pid == -1 is an existing indicator of a kernel event.
> In addition, use event->mmap.tid == BPF_FS_MAGIC to indicate a bpf related
> RECORD_MMAP event.
> 
> Alternatively it's possible to introduce a new 'enum perf_event_type' command
> specifically for bpf prog load/unload, but the existing RECORD_MMAP
> is very close, so the choice made by this patch is to extend it.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Song Liu <songliubraving@fb.com>

I guess we should also use this for kernel module load/unload? 


> ---
> include/linux/perf_event.h |  1 +
> kernel/events/core.c       | 44 +++++++++++++++++++++++++++++++++-----
> 2 files changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 53c500f0ca79..0e79af83138f 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1113,6 +1113,7 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
> }
> 
> extern void perf_event_mmap(struct vm_area_struct *vma);
> +void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size);
> extern struct perf_guest_info_callbacks *perf_guest_cbs;
> extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
> extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 2a62b96600ad..c48244ddf993 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7152,7 +7152,7 @@ static int perf_event_mmap_match(struct perf_event *event,
> {
> 	struct perf_mmap_event *mmap_event = data;
> 	struct vm_area_struct *vma = mmap_event->vma;
> -	int executable = vma->vm_flags & VM_EXEC;
> +	int executable = !vma || vma->vm_flags & VM_EXEC;
> 
> 	return (!executable && event->attr.mmap_data) ||
> 	       (executable && (event->attr.mmap || event->attr.mmap2));
> @@ -7165,12 +7165,13 @@ static void perf_event_mmap_output(struct perf_event *event,
> 	struct perf_output_handle handle;
> 	struct perf_sample_data sample;
> 	int size = mmap_event->event_id.header.size;
> +	bool bpf_event = !mmap_event->vma;
> 	int ret;
> 
> 	if (!perf_event_mmap_match(event, data))
> 		return;
> 
> -	if (event->attr.mmap2) {
> +	if (event->attr.mmap2 && !bpf_event) {
> 		mmap_event->event_id.header.type = PERF_RECORD_MMAP2;
> 		mmap_event->event_id.header.size += sizeof(mmap_event->maj);
> 		mmap_event->event_id.header.size += sizeof(mmap_event->min);
> @@ -7186,12 +7187,14 @@ static void perf_event_mmap_output(struct perf_event *event,
> 	if (ret)
> 		goto out;
> 
> -	mmap_event->event_id.pid = perf_event_pid(event, current);
> -	mmap_event->event_id.tid = perf_event_tid(event, current);
> +	if (!bpf_event) {
> +		mmap_event->event_id.pid = perf_event_pid(event, current);
> +		mmap_event->event_id.tid = perf_event_tid(event, current);
> +	}
> 
> 	perf_output_put(&handle, mmap_event->event_id);
> 
> -	if (event->attr.mmap2) {
> +	if (event->attr.mmap2 && !bpf_event) {
> 		perf_output_put(&handle, mmap_event->maj);
> 		perf_output_put(&handle, mmap_event->min);
> 		perf_output_put(&handle, mmap_event->ino);
> @@ -7448,6 +7451,37 @@ void perf_event_mmap(struct vm_area_struct *vma)
> 	perf_event_mmap_event(&mmap_event);
> }
> 
> +void perf_event_mmap_bpf_prog(u64 start, u64 len, char *name, int size)
> +{
> +	struct perf_mmap_event mmap_event;
> +
> +	if (!atomic_read(&nr_mmap_events))
> +		return;
> +
> +	if (!IS_ALIGNED(size, sizeof(u64))) {
> +		WARN_ONCE(1, "size is not aligned\n");
> +		return;
> +	}
> +
> +	mmap_event = (struct perf_mmap_event){
> +		.file_name = name,
> +		.file_size = size,
> +		.event_id  = {
> +			.header = {
> +				.type = PERF_RECORD_MMAP,
> +				.misc = PERF_RECORD_MISC_KERNEL,
> +				.size = sizeof(mmap_event.event_id) + size,
> +			},
> +			.pid = -1, /* indicates kernel */
> +			.tid = BPF_FS_MAGIC, /* bpf mmap event */
> +			.start  = start,
> +			.len    = len,
> +			.pgoff  = start,
> +		},
> +	};
> +	perf_iterate_sb(perf_event_mmap_output, &mmap_event, NULL);
> +}
> +
> void perf_event_aux_event(struct perf_event *event, unsigned long head,
> 			  unsigned long size, u64 flags)
> {
> -- 
> 2.17.1
> 


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov
@ 2018-09-19 23:44   ` Song Liu
  2018-09-20  0:59     ` Alexei Starovoitov
  2018-09-20  8:44   ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Song Liu @ 2018-09-19 23:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team



> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> 
> use perf_event_mmap_bpf_prog() helper to notify user space
> about JITed bpf programs.
> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded.
> Use empty program name as unload indication.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> kernel/bpf/core.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 3f5bf1af0826..ddf11fdafd36 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
> 	*symbol_end   = addr + hdr->pages * PAGE_SIZE;
> }
> 
> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> {
> 	const char *end = sym + KSYM_NAME_LEN;
> 
> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> 	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
> 	sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
> 	if (prog->aux->name[0])
> -		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> +		sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> 	else
> 		*sym = 0;
> +	return sym;
> }
> 
> static __always_inline unsigned long
> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
> 
> void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> {
> +	unsigned long symbol_start, symbol_end;
> +	char buf[KSYM_NAME_LEN], *sym;
> +
> 	if (!bpf_prog_kallsyms_candidate(fp) ||
> 	    !capable(CAP_SYS_ADMIN))
> 		return;
> 
> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> +	sym = bpf_get_prog_name(fp, buf);
> +	sym++; /* sym - buf is the length of the name including trailing 0 */
> +	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
> +		*sym++ = 0;

nit: This logic feels a little weird to me. How about we wrap the extra logic
in a separate function:

size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog *fp, char *buf)

where the return value is the u64 aligned size. 

Other than this 

Acked-by: Song Liu <songliubraving@fb.com>

> 	spin_lock_bh(&bpf_lock);
> 	bpf_prog_ksym_node_add(fp->aux);
> 	spin_unlock_bh(&bpf_lock);
> +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +				 buf, sym - buf);
> }
> 
> void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> {
> +	unsigned long symbol_start, symbol_end;
> +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> +	char buf[sizeof(u64)] = {};
> +
> 	if (!bpf_prog_kallsyms_candidate(fp))
> 		return;
> 
> 	spin_lock_bh(&bpf_lock);
> 	bpf_prog_ksym_node_del(fp->aux);
> 	spin_unlock_bh(&bpf_lock);
> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +				 buf, sizeof(buf));
> }
> 
> static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
> -- 
> 2.17.1
> 


* Re: [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog
  2018-09-19 23:30   ` Song Liu
@ 2018-09-20  0:53     ` Alexei Starovoitov
  0 siblings, 0 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-20  0:53 UTC (permalink / raw)
  To: Song Liu, Alexei Starovoitov
  Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team

On 9/19/18 4:30 PM, Song Liu wrote:
>
>
>> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
>>
>> introduce perf_event_mmap_bpf_prog() helper to emit RECORD_MMAP events
>> into perf ring buffer.
>> It's used by bpf load/unload logic to notify user space of addresses
>> and names of JITed bpf programs.
>>
>> Note that event->mmap.pid == -1 is an existing indicator of kernel event.
>> In addition use event->mmap.tid == BPF_FS_MAGIC to indicate bpf related
>> RECORD_MMAP event.
>>
>> Alternatively it's possible to introduce new 'enum perf_event_type' command
>> specificially for bpf prog load/unload, but existing RECORD_MMAP
>> is very close, so the choice made by this patch is to extend it.
>>
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Song Liu <songliubraving@fb.com>
>
> I guess we should also use this for kernel modules load/unload?

yes, that's possible.
There is a similar issue with modules today: they can get unloaded
before 'perf report'.
The synthetic record_mmap for modules that perf emits into perf.data
has filename==module_name. It solves the typical use case though.
We can extend record_mmap further and let the kernel emit them
for module load/unload.
Some new magic for 'tid' would be necessary.


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-19 23:44   ` Song Liu
@ 2018-09-20  0:59     ` Alexei Starovoitov
  2018-09-20  5:48       ` Song Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-20  0:59 UTC (permalink / raw)
  To: Song Liu, Alexei Starovoitov
  Cc: David S . Miller, daniel, peterz, acme, netdev, Kernel Team

On 9/19/18 4:44 PM, Song Liu wrote:
>
>
>> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
>>
>> use perf_event_mmap_bpf_prog() helper to notify user space
>> about JITed bpf programs.
>> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded.
>> Use empty program name as unload indication.
>>
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
>> kernel/bpf/core.c | 22 ++++++++++++++++++++--
>> 1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 3f5bf1af0826..ddf11fdafd36 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
>> 	*symbol_end   = addr + hdr->pages * PAGE_SIZE;
>> }
>>
>> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>> {
>> 	const char *end = sym + KSYM_NAME_LEN;
>>
>> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>> 	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>> 	sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
>> 	if (prog->aux->name[0])
>> -		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>> +		sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>> 	else
>> 		*sym = 0;
>> +	return sym;
>> }
>>
>> static __always_inline unsigned long
>> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>>
>> void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>> {
>> +	unsigned long symbol_start, symbol_end;
>> +	char buf[KSYM_NAME_LEN], *sym;
>> +
>> 	if (!bpf_prog_kallsyms_candidate(fp) ||
>> 	    !capable(CAP_SYS_ADMIN))
>> 		return;
>>
>> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
>> +	sym = bpf_get_prog_name(fp, buf);
>> +	sym++; /* sym - buf is the length of the name including trailing 0 */
>> +	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
>> +		*sym++ = 0;
>
> nit: This logic feels a little weird to me. How about we wrap the extra logic
> in a separate function:
>
> size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog *fp, char *buf)

probably bpf_get_prog_name_u64_padded()?
That would be cleaner indeed.
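
For illustration, a minimal sketch of such a helper, assuming it wraps
the bpf_get_prog_name() from this patch (name and exact shape still to
be settled in v2):

/* returns the u64-aligned size of the name, including the padding */
static size_t bpf_get_prog_name_u64_padded(const struct bpf_prog *fp,
					   char *buf)
{
	char *sym = bpf_get_prog_name(fp, buf);

	sym++; /* sym - buf now counts the name including the trailing 0 */
	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
		*sym++ = 0; /* zero-pad up to the next u64 boundary */
	return sym - buf;
}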

Will send v2 once Peter and Arnaldo provide their feedback.

Thanks for reviewing!


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-20  0:59     ` Alexei Starovoitov
@ 2018-09-20  5:48       ` Song Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Song Liu @ 2018-09-20  5:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, David S . Miller, daniel, peterz, acme,
	netdev, Kernel Team



> On Sep 19, 2018, at 5:59 PM, Alexei Starovoitov <ast@fb.com> wrote:
> 
> On 9/19/18 4:44 PM, Song Liu wrote:
>> 
>> 
>>> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
>>> 
>>> use perf_event_mmap_bpf_prog() helper to notify user space
>>> about JITed bpf programs.
>>> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded.
>>> Use empty program name as unload indication.
>>> 
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> ---
>>> kernel/bpf/core.c | 22 ++++++++++++++++++++--
>>> 1 file changed, 20 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 3f5bf1af0826..ddf11fdafd36 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
>>> 	*symbol_end   = addr + hdr->pages * PAGE_SIZE;
>>> }
>>> 
>>> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> {
>>> 	const char *end = sym + KSYM_NAME_LEN;
>>> 
>>> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> 	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>>> 	sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
>>> 	if (prog->aux->name[0])
>>> -		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>>> +		sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>>> 	else
>>> 		*sym = 0;
>>> +	return sym;
>>> }
>>> 
>>> static __always_inline unsigned long
>>> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>>> 
>>> void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>> {
>>> +	unsigned long symbol_start, symbol_end;
>>> +	char buf[KSYM_NAME_LEN], *sym;
>>> +
>>> 	if (!bpf_prog_kallsyms_candidate(fp) ||
>>> 	    !capable(CAP_SYS_ADMIN))
>>> 		return;
>>> 
>>> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
>>> +	sym = bpf_get_prog_name(fp, buf);
>>> +	sym++; /* sym - buf is the length of the name including trailing 0 */
>>> +	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
>>> +		*sym++ = 0;
>> 
>> nit: This logic feels a little weird to me. How about we wrap the extra logic
>> in a separate function:
>> 
>> size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog *fp, char *buf)
> 
> probably bpf_get_prog_name_u64_padded()?
> That would be cleaner indeed.

bpf_get_prog_name_u64_padded does sound better. 

> Will send v2 once Peter and Arnaldo provide their feedback.
> 
> Thanks for reviewing!


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov
  2018-09-19 23:44   ` Song Liu
@ 2018-09-20  8:44   ` Peter Zijlstra
  2018-09-20 13:25     ` Arnaldo Carvalho de Melo
  2018-09-20 13:36     ` Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2018-09-20  8:44 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David S . Miller, daniel, acme, netdev, kernel-team

On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
>  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
>  {
> +	unsigned long symbol_start, symbol_end;
> +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> +	char buf[sizeof(u64)] = {};
> +
>  	if (!bpf_prog_kallsyms_candidate(fp))
>  		return;
>  
>  	spin_lock_bh(&bpf_lock);
>  	bpf_prog_ksym_node_del(fp->aux);
>  	spin_unlock_bh(&bpf_lock);
> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +				 buf, sizeof(buf));
>  }

So perf doesn't normally issue unmap events. We've talked about doing
that, but so far it's never really been needed, I think.

It feels a bit weird to start issuing unmap events for this.


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-20  8:44   ` Peter Zijlstra
@ 2018-09-20 13:25     ` Arnaldo Carvalho de Melo
  2018-09-20 13:56       ` Peter Zijlstra
  2018-09-20 13:36     ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-09-20 13:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, David S . Miller, daniel, netdev, kernel-team

Em Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra escreveu:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +	unsigned long symbol_start, symbol_end;
> > +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +	char buf[sizeof(u64)] = {};
> > +
> >  	if (!bpf_prog_kallsyms_candidate(fp))
> >  		return;
> >  
> >  	spin_lock_bh(&bpf_lock);
> >  	bpf_prog_ksym_node_del(fp->aux);
> >  	spin_unlock_bh(&bpf_lock);
> > +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +				 buf, sizeof(buf));
> >  }
> 
> So perf doesn't normally issue unmap events. We've talked about doing
> that, but so far it's never really been needed, I think.
 
> It feels a bit weird to start issuing unmap events for this.

For reference, this surfaced here:

https://lkml.org/lkml/2017/1/27/452

Start of the thread, that involves postgresql, JIT, LLVM, perf is here:

https://lkml.org/lkml/2016/12/10/1

PeterZ provided a patch introducing PERF_RECORD_MUNMAP; it went nowhere due
to having to cope with munmapping parts of existing mmaps, etc.

I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for
now it would be used just in this clean case for undoing a
PERF_RECORD_MMAP for a BPF program.

The ABI is already complicated; starting to use something called
PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever,
I think.

- Arnaldo


* Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs
  2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov
@ 2018-09-20 13:36   ` Arnaldo Carvalho de Melo
  2018-09-20 14:05     ` Arnaldo Carvalho de Melo
  2018-09-21 22:57   ` Song Liu
  1 sibling, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-09-20 13:36 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David S . Miller, daniel, peterz, netdev, kernel-team

Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu:
> Recognize JITed bpf prog load/unload events.
> Add/remove kernel symbols accordingly.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++
>  tools/perf/util/symbol.c  | 13 +++++++++++++
>  tools/perf/util/symbol.h  |  1 +
>  3 files changed, 41 insertions(+)
> 
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index c4acd2001db0..ae4f8a0fdc7e 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -25,6 +25,7 @@
>  #include "sane_ctype.h"
>  #include <symbol/kallsyms.h>
>  #include <linux/mman.h>
> +#include <linux/magic.h>
>  
>  static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock);
>  
> @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine,
>  	enum dso_kernel_type kernel_type;
>  	bool is_kernel_mmap;
>  
> +	/* process JITed bpf programs load/unload events */
> +	if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) {


So, this would be in machine__process_kernel_munmap_event(machine), etc;
no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP.

> +		struct symbol *sym;
> +		u64 ip;
> +
> +		map = map_groups__find(&machine->kmaps, event->mmap.start);
> +		if (!map) {
> +			pr_err("No kernel map for IP %lx\n", event->mmap.start);
> +			goto out_problem;
> +		}
> +		ip = event->mmap.start - map->start + map->pgoff;
> +		if (event->mmap.filename[0]) {
> +			sym = symbol__new(ip, event->mmap.len, 0, 0,
> +					  event->mmap.filename);

Humm, so the bpf program would be just one symbol... bpf-to-bpf calls
will be to a different bpf program, right? 

/me goes to read https://lwn.net/Articles/741773/
                 "[PATCH bpf-next 00/13] bpf: introduce function calls"

> +			dso__insert_symbol(map->dso, sym);
> +		} else {
> +			if (symbols__erase(&map->dso->symbols, ip)) {
> +				pr_err("No bpf prog at IP %lx/%lx\n",
> +				       event->mmap.start, ip);
> +				goto out_problem;
> +			}
> +			dso__reset_find_symbol_cache(map->dso);
> +		}
> +		return 0;
> +	}
> +
>  	/* If we have maps from kcore then we do not need or want any others */
>  	if (machine__uses_kcore(machine))
>  		return 0;
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index d188b7588152..0653f313661d 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip)
>  	return NULL;
>  }
>  
> +int symbols__erase(struct rb_root *symbols, u64 ip)
> +{
> +	struct symbol *s;
> +
> +	s = symbols__find(symbols, ip);
> +	if (!s)
> +		return -ENOENT;
> +
> +	rb_erase(&s->rb_node, symbols);
> +	symbol__delete(s);
> +	return 0;
> +}
> +
>  static struct symbol *symbols__first(struct rb_root *symbols)
>  {
>  	struct rb_node *n = rb_first(symbols);
> diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> index f25fae4b5743..92ef31953d9a 100644
> --- a/tools/perf/util/symbol.h
> +++ b/tools/perf/util/symbol.h
> @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name);
>  
>  void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel);
>  void symbols__insert(struct rb_root *symbols, struct symbol *sym);
> +int symbols__erase(struct rb_root *symbols, u64 ip);
>  void symbols__fixup_duplicate(struct rb_root *symbols);
>  void symbols__fixup_end(struct rb_root *symbols);
>  void map_groups__fixup_end(struct map_groups *mg);
> -- 
> 2.17.1


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-20  8:44   ` Peter Zijlstra
  2018-09-20 13:25     ` Arnaldo Carvalho de Melo
@ 2018-09-20 13:36     ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2018-09-20 13:36 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David S . Miller, daniel, acme, netdev, kernel-team

On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +	unsigned long symbol_start, symbol_end;
> > +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +	char buf[sizeof(u64)] = {};
> > +
> >  	if (!bpf_prog_kallsyms_candidate(fp))
> >  		return;
> >  
> >  	spin_lock_bh(&bpf_lock);
> >  	bpf_prog_ksym_node_del(fp->aux);
> >  	spin_unlock_bh(&bpf_lock);
> > +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +				 buf, sizeof(buf));
> >  }
> 
> So perf doesn't normally issue unmap events. We've talked about doing
> that, but so far it's never really been needed, I think.
> 
> It feels a bit weird to start issuing unmap events for this.

FWIW:

  https://lkml.kernel.org/r/20170127130702.GI6515@twins.programming.kicks-ass.net

has talk of PERF_RECORD_MUNMAP


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-20 13:25     ` Arnaldo Carvalho de Melo
@ 2018-09-20 13:56       ` Peter Zijlstra
  2018-09-21  3:14         ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2018-09-20 13:56 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Alexei Starovoitov, David S . Miller, daniel, netdev, kernel-team

On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> PeterZ provided a patch introducing PERF_RECORD_MUNMAP; it went nowhere due
> to having to cope with munmapping parts of existing mmaps, etc.
> 
> I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for
> now it would be used just in this clean case for undoing a
> PERF_RECORD_MMAP for a BPF program.
> 
> The ABI is already complicated; starting to use something called
> PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever,
> I think.

Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
part was getting the perf tool to dtrt for that use-case. But if we need
unmap events, doing the unmap record now is the right thing.
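
For reference, a minimal sketch of what such a record could look like
(purely hypothetical, mirroring the existing PERF_RECORD_MMAP layout
with the filename dropped):

/* hypothetical PERF_RECORD_MUNMAP payload, not part of any posted patch */
struct perf_munmap_record {
	struct perf_event_header header;	/* .type = PERF_RECORD_MUNMAP */
	__u32 pid, tid;
	__u64 start, len;			/* range being unmapped */
};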


* Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs
  2018-09-20 13:36   ` Arnaldo Carvalho de Melo
@ 2018-09-20 14:05     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-09-20 14:05 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David S . Miller, daniel, peterz, netdev, kernel-team

Em Thu, Sep 20, 2018 at 10:36:17AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Sep 19, 2018 at 03:39:35PM -0700, Alexei Starovoitov escreveu:
> > Recognize JITed bpf prog load/unload events.
> > Add/remove kernel symbols accordingly.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++
> >  tools/perf/util/symbol.c  | 13 +++++++++++++
> >  tools/perf/util/symbol.h  |  1 +
> >  3 files changed, 41 insertions(+)
> > 
> > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > index c4acd2001db0..ae4f8a0fdc7e 100644
> > --- a/tools/perf/util/machine.c
> > +++ b/tools/perf/util/machine.c
> > @@ -25,6 +25,7 @@
> >  #include "sane_ctype.h"
> >  #include <symbol/kallsyms.h>
> >  #include <linux/mman.h>
> > +#include <linux/magic.h>
> >  
> >  static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock);
> >  
> > @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine,
> >  	enum dso_kernel_type kernel_type;
> >  	bool is_kernel_mmap;
> >  
> > +	/* process JITed bpf programs load/unload events */
> > +	if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) {
> 
> 
> So, this would be in machine__process_kernel_munmap_event(machine), etc;
> no check for BPF_FS_MAGIC would be needed with a PERF_RECORD_MUNMAP.
> 
> > +		struct symbol *sym;
> > +		u64 ip;
> > +
> > +		map = map_groups__find(&machine->kmaps, event->mmap.start);
> > +		if (!map) {
> > +			pr_err("No kernel map for IP %lx\n", event->mmap.start);
> > +			goto out_problem;
> > +		}
> > +		ip = event->mmap.start - map->start + map->pgoff;
> > +		if (event->mmap.filename[0]) {
> > +			sym = symbol__new(ip, event->mmap.len, 0, 0,
> > +					  event->mmap.filename);
> 
> Humm, so the bpf program would be just one symbol... bpf-to-bpf calls
> will be to a different bpf program, right? 
> 
> /me goes to read https://lwn.net/Articles/741773/
>                  "[PATCH bpf-next 00/13] bpf: introduce function calls"

After reading it, yeah, I think we need some way to access a symbol
table for a BPF program, and also its binary so that we can do
annotation, static (perf annotate) and live (perf top). Was this already
considered? I think one can get the binary for a program given
sufficient perms somehow, right? One other thing I need to catch up on 8-)

- Arnaldo
 
> > +			dso__insert_symbol(map->dso, sym);
> > +		} else {
> > +			if (symbols__erase(&map->dso->symbols, ip)) {
> > +				pr_err("No bpf prog at IP %lx/%lx\n",
> > +				       event->mmap.start, ip);
> > +				goto out_problem;
> > +			}
> > +			dso__reset_find_symbol_cache(map->dso);
> > +		}
> > +		return 0;
> > +	}
> > +
> >  	/* If we have maps from kcore then we do not need or want any others */
> >  	if (machine__uses_kcore(machine))
> >  		return 0;
> > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> > index d188b7588152..0653f313661d 100644
> > --- a/tools/perf/util/symbol.c
> > +++ b/tools/perf/util/symbol.c
> > @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip)
> >  	return NULL;
> >  }
> >  
> > +int symbols__erase(struct rb_root *symbols, u64 ip)
> > +{
> > +	struct symbol *s;
> > +
> > +	s = symbols__find(symbols, ip);
> > +	if (!s)
> > +		return -ENOENT;
> > +
> > +	rb_erase(&s->rb_node, symbols);
> > +	symbol__delete(s);
> > +	return 0;
> > +}
> > +
> >  static struct symbol *symbols__first(struct rb_root *symbols)
> >  {
> >  	struct rb_node *n = rb_first(symbols);
> > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> > index f25fae4b5743..92ef31953d9a 100644
> > --- a/tools/perf/util/symbol.h
> > +++ b/tools/perf/util/symbol.h
> > @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name);
> >  
> >  void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel);
> >  void symbols__insert(struct rb_root *symbols, struct symbol *sym);
> > +int symbols__erase(struct rb_root *symbols, u64 ip);
> >  void symbols__fixup_duplicate(struct rb_root *symbols);
> >  void symbols__fixup_end(struct rb_root *symbols);
> >  void map_groups__fixup_end(struct map_groups *mg);
> > -- 
> > 2.17.1


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-20 13:56       ` Peter Zijlstra
@ 2018-09-21  3:14         ` Alexei Starovoitov
  2018-09-21 12:25           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-21  3:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Alexei Starovoitov, David S . Miller,
	daniel, netdev, kernel-team

On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > PeterZ provided a patch introducing PERF_RECORD_MUNMAP; it went nowhere due
> > to having to cope with munmapping parts of existing mmaps, etc.
> > 
> > I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for
> > now it would be used just in this clean case for undoing a
> > PERF_RECORD_MMAP for a BPF program.
> > 
> > The ABI is already complicated; starting to use something called
> > PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever,
> > I think.
> 
> Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> part was getting the perf tool to dtrt for that use-case. But if we need
> unmap events, doing the unmap record now is the right thing.

Thanks for the pointers!
The answer is a bit long. pls bear with me.

I have considered adding MUNMAP to match the existing MMAP, but went
without it because I didn't want to introduce a new bit in perf_event_attr
and emit these new events in an unbalanced, conditional way for prog load/unload.
Like old perf is asking kernel for mmap events via mmap bit, so prog load events
will be in perf.data, but old perf report won't recognize them anyway.
Whereas new perf would certainly want to catch bpf events and will set
both mmap and munmap bits.

Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP
conditionally based on single mmap bit they will confuse old perf
and it will print a warning about 'unknown events'.

Both situations are ugly, hence I went with reuse of MMAP event
for both load/unload.
In such case old perf silently ignores them. Which is what I wanted.
When we upgrade the kernel we cannot synchronize the kernel upgrade
(or downgrade) with user space perf package upgrade.
Hence not confusing old perf is important.
With new kernel new bpf mmap events get into perf.data and
new perf picks them up.

Few more considerations:

I consider synthetic perf events to be non-ABI. Meaning they're
emitted by perf user space into perf.data and there is a convention
on names, but it's not a kernel abi. Like RECORD_MMAP with
event.filename == "[module_name]" is an indication for perf report
to parse elf/build-id of dso==module_name.
There is no such support in the kernel. Kernel doesn't emit
such events for module load/unload. If in the future
we decide to extend kernel with such events they don't have
to match what user space perf does today.

Why is this important? To get to the next step.
As Arnaldo pointed out this patch set is missing support for
JITed prog annotations and displaying asm code. Absolutely correct.
This set only helps perf to reveal the names of bpf progs that _were_
running at the time of perf record, but there is no way yet for
perf report to show asm code of the prog that was running.
In that sense bpf is drastically different from java, other jits
and normal profiling.
bpf JIT happens in the kernel and only kernel knows the mapping
between original source and JITed code.
In addition there are bpf2bpf functions. In the future there will
be bpf libraries, more type info, line number support, etc.
I strongly believe perf RECORD_* events should NOT care about
the development that happens on the bpf side.
The only thing kernel will be telling user space is that bpf prog
with prog_id=X was loaded.
Then based on prog_id the 'perf record' phase can query the kernel
for bpf related information. There is already a way to fetch
JITed image based on prog_id.
Then perf will emit synthetic RECORD_FOOBAR into perf.data
that will contain bpf related info (like complete JITed image)
and perf report can process it later and annotate things in UI.
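
For illustration, a minimal user space sketch of that query, using the
existing BPF_PROG_GET_FD_BY_ID and BPF_OBJ_GET_INFO_BY_FD bpf syscall
commands (fetch_jited_image() is a made-up helper name; error handling
trimmed):

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* copy up to img_len bytes of JITed code for prog_id into img;
 * returns the JITed code size, or -1 if e.g. the prog is already gone
 */
static int fetch_jited_image(__u32 prog_id, void *img, __u32 img_len)
{
	struct bpf_prog_info info = {};
	union bpf_attr attr;
	int fd, err;

	memset(&attr, 0, sizeof(attr));
	attr.prog_id = prog_id;
	fd = syscall(__NR_bpf, BPF_PROG_GET_FD_BY_ID, &attr, sizeof(attr));
	if (fd < 0)
		return -1; /* prog already unloaded: ignore the event */

	info.jited_prog_len = img_len;
	info.jited_prog_insns = (__u64)(unsigned long)img;

	memset(&attr, 0, sizeof(attr));
	attr.info.bpf_fd = fd;
	attr.info.info_len = sizeof(info);
	attr.info.info = (__u64)(unsigned long)&info;
	err = syscall(__NR_bpf, BPF_OBJ_GET_INFO_BY_FD, &attr, sizeof(attr));
	close(fd);
	return err ? -1 : (int)info.jited_prog_len;
}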

It may seem that there is a race here.
Like when 'perf record' sees a 'bpf prog was loaded with prog_id=X' event
it will ask the kernel about prog_id=X, but that prog could be
unloaded already.
In such a case the prog_id will not exist and perf record can ignore the event.
So no race.
The kernel doesn't need to pass all information about bpf prog to
the user space via RECORD_*. Instead 'perf record' can emit
synthetic events into perf.data.
I was thinking of extending RECORD_MMAP with prog_id already
(instead of passing the kallsyms bpf prog name in event->mmap.filename)
but bpf functions don't have their own prog_id. Multiple bpf funcs
with different JITed blobs are considered to be part of a single prog_id.
So as step one I'm only extending RECORD_MMAP with the addr and kallsyms
name of the bpf function/prog.
As step two the plan is to add a notification mechanism for prog load/unload
that will include the prog_id, and to design new synthetic RECORD_* events in
perf user space that will contain JITed code, line info, BTF, etc.

TLDR:
step 1 (this patch set):
a single bpf prog_load can call
bpf_prog_kallsyms_add() multiple times -> RECORD_MMAP with addr+kallsym only.
Similarly unload calls
bpf_prog_kallsyms_del() multiple times -> RECORD_MMAP with addr only.

step 2 (future work):
a single event for bpf prog_load with prog_id only,
either via the perf ring buffer or ftrace or tracepoints or some
other notification mechanism.

It may seem that step 2 obsoletes step 1. It could, but I think
it will complement it. There is a lot more code there and
a lot more discussions to have.
Step 1 is already a big improvement.

Thoughts?


* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-21  3:14         ` Alexei Starovoitov
@ 2018-09-21 12:25           ` Arnaldo Carvalho de Melo
  2018-09-21 13:55             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-09-21 12:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Alexei Starovoitov, David S . Miller, daniel,
	netdev, kernel-team

Em Thu, Sep 20, 2018 at 08:14:46PM -0700, Alexei Starovoitov escreveu:
> On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > > PeterZ provided a patch introducing PERF_RECORD_MUNMAP; it went nowhere due
> > > to having to cope with munmapping parts of existing mmaps, etc.
> > > 
> > > I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for
> > > now it would be used just in this clean case for undoing a
> > > PERF_RECORD_MMAP for a BPF program.
> > > 
> > > The ABI is already complicated; starting to use something called
> > > PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever,
> > > I think.
> > 
> > Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> > part was getting the perf tool to dtrt for that use-case. But if we need
> > unmap events, doing the unmap record now is the right thing.
> 
> Thanks for the pointers!
> The answer is a bit long. pls bear with me.

Ditto with me :-)
 
> I have considered adding MUNMAP to match the existing MMAP, but went
> without it because I didn't want to introduce a new bit in perf_event_attr
> and emit these new events in an unbalanced, conditional way for prog load/unload.
> Like old perf is asking kernel for mmap events via mmap bit, so prog load events

By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
perf_event_attr.mmap2 = 1 will cause the new kernel to emit
PERF_RECORD_MMAP records for the range of addresses that a BPF program
is being loaded on, right?

> will be in perf.data, but old perf report won't recognize them anyway.

Why not? It should look up the symbol and find it in the rb_tree of maps,
with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
BPF core, no? It'll be an unresolved symbol, but a resolved map.

> Whereas new perf would certainly want to catch bpf events and will set
> both mmap and munmap bits.

new perf with your code will find a symbol, not a map, because your code
catches a special case PERF_RECORD_MMAP and instead of creating a
'struct map' will create a 'struct symbol' and insert it in the kallsyms
'struct map', right?
 
> Then if I add a MUNMAP event without a new bit and emit MMAP/MUNMAP
> conditionally based on the single mmap bit, they will confuse old perf
> and it will print a warning about 'unknown events'.

That is unfortunate, and I'll turn that part into a pr_debug().
 
> Both situations are ugly, hence I went with reusing the MMAP event
> for both load/unload.

So it's doubly odd, i.e. MMAP used both for mmap() and for munmap(), and the
effect in the tooling is not to create or remove a 'struct map', but to
alter an existing symbol table for the kallsyms map.

> In such a case old perf silently ignores them, which is what I wanted.

In theory the old perf should catch the PERF_RECORD_MMAP with a string
in the filename part and insert a new map into the kernel mmap rb_tree,
and then samples would be resolved to this map, but since there is no
backing DSO with a symtab, it would stop at that, just stating that the
map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
there is something in there that makes it ignore this PERF_RECORD_MMAP
emitted by the BPF kernel code when loading a new program.

> When we upgrade the kernel we cannot synchronize the kernel upgrade
> (or downgrade) with user space perf package upgrade.

sure

> Hence not confusing old perf is important.

Thanks for trying to achieve that, and it's a pity that the "unknown
record" message is a pr_warning or pr_info and not a pr_debug().

> With a new kernel, new bpf mmap events get into perf.data and
> new perf picks them up.
> 
> Few more considerations:
> 
> I consider synthetic perf events to be non-ABI. Meaning they're
> emitted by perf user space into perf.data and there is a convention
> on names, but it's not a kernel abi. Like RECORD_MMAP with
> event.filename == "[module_name]" is an indication for perf report
> to parse elf/build-id of dso==module_name.
> There is no such support in the kernel. Kernel doesn't emit
> such events for module load/unload. If in the future
> we decide to extend kernel with such events they don't have
> to match what user space perf does today.

Right, that is another unfortunate state of affairs, kernel module
load/unload should already be supported, reported by the kernel via a
proper PERF_RECORD_MODULE_LOAD/UNLOAD
 
> Why is this important? To get to the next step.
> As Arnaldo pointed out, this patch set is missing support for
> JITed prog annotations and displaying asm code. Absolutely correct.
> This set only helps perf to reveal the names of bpf progs that _were_
> running at the time of perf record, but there is no way yet for
> perf report to show the asm code of the prog that was running.
> In that sense bpf is drastically different from java, other jits
> and normal profiling.
> The bpf JIT happens in the kernel, and only the kernel knows the mapping
> between the original source and the JITed code.
> In addition there are bpf2bpf functions. In the future there will
> be bpf libraries, more type info, line number support, etc.
> I strongly believe perf RECORD_* events should NOT care about
> the development that happens on the bpf side.
> The only thing the kernel will be telling user space is that a bpf prog
> with prog_id=X was loaded.
> Then, based on prog_id, the 'perf record' phase can query the kernel
> for bpf related information. There is already a way to fetch
> the JITed image based on prog_id.
> Then perf will emit a synthetic RECORD_FOOBAR into perf.data
> that will contain bpf related info (like the complete JITed image),
> and perf report can process it later and annotate things in the UI.

There is another longstanding TODO list entry: PERF_RECORD_MMAP records
should include a build-id, to avoid either userspace getting confused
when some mmapped DSO is updated during a long running session, for
instance, or having to scan the just-recorded perf.data file for DSOs
with samples and then read them from the file system (more races).

Have you ever considered having a build-id for bpf objects that could be
used here?
 
> It may seem that there is a race here.
> Like when 'perf record' sees the 'bpf prog was loaded with prog_id=X' event,
> it will ask the kernel about prog_id=X, but that prog could be
> unloaded already.
> In such a case prog_id will not exist and perf record can ignore such an event.
> So no race.
> The kernel doesn't need to pass all information about bpf prog to
> the user space via RECORD_*. Instead 'perf record' can emit
> synthetic events into perf.data.
> I was thinking of extending RECORD_MMAP with prog_id

Right, if prog_id is a uniquely identifying cookie that is based on the
file contents, like files in a git repo, then this would serve for both
bpf progs and for everything else.

Will the JITed code from some BPF bytecode be different if the same
bytecode is JITed in several machines, all having the exact same
hardware?

> (instead of passing the kallsyms bpf prog name in event->mmap.filename),
> but bpf functions don't have their own prog_id. Multiple bpf funcs
> with different JITed blobs are considered to be part of a single prog_id.
> So as step one I'm only extending RECORD_MMAP with the addr and kallsym
> name of the bpf function/prog.
> As step two the plan is to add a notification mechanism for prog load/unload
> that will include prog_id, and to design new synthetic RECORD_* events in
> perf user space that will contain JITed code, line info, BTF, etc.

So, will the kernel JIT a bytecode, load it somewhere and run it, then,
when unloading it, keep it somewhere (a filesystem with some limits,
etc) so that at some later time (with some timeouts) tooling can, using
its id/buildid cookie, get the contents and symbol table to have a better
annotation experience?

Using /proc/kcore, as right now, we should be able to annotate the BPF JIT
while it is loaded, in 'perf top', for instance, with your suggested code,
because we would have a resolved symbol that spans the whole bpf
program. I'll try applying your patch and see how it goes.

> TLDR:
> step 1 (this patch set)
> Single bpf prog_load can call multiple
> bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only
> Similarly unload calls multiple
> bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only
> 
> step 2 (future work)
> single event for bpf prog_load with prog_id only.
> Either via perf ring buffer or ftrace or tracepoints or some
> other notification mechanism.
> 
> It may seem that step 2 obsoletes step 1. It can, but I think
> it will complement it. There is a lot more code there and
> a lot more discussions to have.
> Step 1 is already a big improvement.
> 
> Thoughts?

See above, and, bear with me too :-)

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-21 12:25           ` Arnaldo Carvalho de Melo
@ 2018-09-21 13:55             ` Peter Zijlstra
  2018-09-21 13:59             ` Peter Zijlstra
  2018-09-21 22:13             ` Alexei Starovoitov
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2018-09-21 13:55 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Alexei Starovoitov, Alexei Starovoitov, David S . Miller, daniel,
	netdev, kernel-team

On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> > I consider synthetic perf events to be non-ABI. Meaning they're
> > emitted by perf user space into perf.data and there is a convention
> > on names, but it's not a kernel abi. Like RECORD_MMAP with
> > event.filename == "[module_name]" is an indication for perf report
> > to parse elf/build-id of dso==module_name.
> > There is no such support in the kernel. Kernel doesn't emit
> > such events for module load/unload. If in the future
> > we decide to extend kernel with such events they don't have
> > to match what user space perf does today.
> 
> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

Just wondering, is anyone actually doing enough module loading for this
to matter? (asks the CONFIG_MODULES=n guy).

I thought that was all a relatively static affair; you boot, a few
modules get loaded for the present hardware, the end.

Anyway, no real objection, just wonder if it's worth it.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-21 12:25           ` Arnaldo Carvalho de Melo
  2018-09-21 13:55             ` Peter Zijlstra
@ 2018-09-21 13:59             ` Peter Zijlstra
  2018-09-21 22:13             ` Alexei Starovoitov
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2018-09-21 13:59 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Alexei Starovoitov, Alexei Starovoitov, David S . Miller, daniel,
	netdev, kernel-team

On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id

I thought the problem was that the kernel doesn't have the build-id in
the first place. So it cannot hand it out.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-21 12:25           ` Arnaldo Carvalho de Melo
  2018-09-21 13:55             ` Peter Zijlstra
  2018-09-21 13:59             ` Peter Zijlstra
@ 2018-09-21 22:13             ` Alexei Starovoitov
  2018-10-15 23:33               ` Song Liu
  2 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2018-09-21 22:13 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Alexei Starovoitov, David S . Miller, daniel,
	netdev, kernel-team

On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
>  
> > I have considered adding MUNMAP to match the existing MMAP, but went
> > without it because I didn't want to introduce a new bit in perf_event_attr
> > and emit these new events in an unbalanced, conditional way for prog load/unload.
> > Old perf asks the kernel for mmap events via the mmap bit, so prog load events
> 
> By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
> > perf_event_attr.mmap2 = 1, will cause the new kernel to emit
> PERF_RECORD_MMAP records for the range of addresses that a BPF program
> is being loaded on, right?

right. it would be weird when prog load events are there, but not unload.

> > will be in perf.data, but old perf report won't recognize them anyway.
> 
> > Why not? It should look up the symbol and find it in the rb_tree of maps,
> with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> BPF core, no? It'll be an unresolved symbol, but a resolved map.
> 
> > Whereas new perf would certainly want to catch bpf events and will set
> > both mmap and munmap bits.
> 
> new perf with your code will find a symbol, not a map, because your code
> catches a special case PERF_RECORD_MMAP and instead of creating a
> 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> 'struct map', right?

right.
bpf progs are more similar to kernel functions than to modules.
For modules it makes sense to create a new map and insert symbols into it.
For bpf JITed images there is no DSO to parse.
A single bpf elf file may contain multiple bpf progs, and each prog may contain
multiple bpf functions. They will be loaded at different times and
will have different lifetimes.

> In theory the old perf should catch the PERF_RECORD_MMAP with a string
> in the filename part and insert a new map into the kernel mmap rb_tree,
> and then samples would be resolved to this map, but since there is no
> backing DSO with a symtab, it would stop at that, just stating that the
> map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> there is something in there that makes it ignore this PERF_RECORD_MMAP
> emitted by the BPF kernel code when loading a new program.

In /proc/kcore there is already a section for the module range.
Hence, when perf processes bpf load/unload events, the map is already created.
Therefore patch 3 only searches for it and inserts the new symbol into it.

In that sense the reuse of the RECORD_MMAP event for bpf progs is indeed
not exactly clean, since no new map is created.
Would it be better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events?

Such events could potentially be used for offline ksym resolution:
perf could process /proc/kallsyms during perf record and emit all of them
as synthetic PERF_RECORD_INSERT_KSYM records into perf.data, so perf report
can run on a different server and still find the right symbols.

I guess we can do bpf-specific events too and keep RECORD_MMAP as-is.
How about a single PERF_RECORD_BPF event with an internal flag for load/unload?
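
Purely as an illustration of the shape such an event could have (none of
these names or fields are from a posted patch):

#include <linux/perf_event.h>	/* struct perf_event_header */
#include <linux/types.h>

struct perf_record_bpf_event {
	struct perf_event_header header;	/* .type = PERF_RECORD_BPF (new) */
	__u16	op;				/* load or unload */
	__u16	flags;
	__u32	prog_id;
	__u8	tag[8];				/* prog tag, see the notes below */
};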

> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

I agree with Peter here. It would be nice, but low priority.
Modules are mostly static: loaded once and they stay there.

> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id, to avoid either userspace getting confused
> when some mmapped DSO is updated during a long running session, for
> instance, or having to scan the just-recorded perf.data file for DSOs
> with samples and then read them from the file system (more races).
> 
> Have you ever considered having a build-id for bpf objects that could be
> used here?

The build-id concept is not applicable to bpf.
bpf elf files on disk don't have a good correlation with what is
running in the kernel. The bpf bytestream is converted and optimized
by the verifier, then JITed.
So the debug info left in the .o file and the original bpf bytestream
in the .o are mostly useless.
For bpf programs we have a 'program tag'. It is computed over the original
bpf bytestream, so both kernel and user space can compute it.
In libbcc we use the /var/tmp/bcc/bpf_prog_TAG/ directory to store the original
source code of the program, so users looking at kernel stack traces
with bpf_prog_TAG can find the source.
It's similar to a build-id, but it's not going to help perf annotate the
actual x86 instructions inside the JITed image and show the src code.
Since the JIT runs in the kernel, this problem cannot be solved by user space only.
It's a difficult problem and we have a plan to tackle it,
but that's step 2. A bunch of infra is needed on the bpf side to
preserve the association through src_in_C -> original bpf insns ->
translated bpf insns -> JITed asm,
then report it back to user space, and then teach perf to properly annotate progs.

> Will the JITed code from some BPF bytecode be different if the same
> bytecode is JITed in several machines, all having the exact same
> hardware?

Yes. JITed code will be different depending on sysctl knobs (like const blinding).
So the same original bpf byte stream loaded at different times
may have a different JITed image.

Even without security features like blinding, the JIT can be different:
bpf maps are separate from bpf progs. The bpf map is created first.
Then the same bpf instruction stream (depending on the map type it uses)
may be optimized by the verifier differently, causing differences in the JIT.

> > (instead of passing the kallsyms bpf prog name in event->mmap.filename),
> > but bpf functions don't have their own prog_id. Multiple bpf funcs
> > with different JITed blobs are considered to be part of a single prog_id.
> > So as step one I'm only extending RECORD_MMAP with the addr and kallsym
> > name of the bpf function/prog.
> > As step two the plan is to add a notification mechanism for prog load/unload
> > that will include prog_id, and to design new synthetic RECORD_* events in
> > perf user space that will contain JITed code, line info, BTF, etc.
> 
> So, will the kernel JIT a bytecode, load it somewhere and run it, then,
> when unloading it, keep it somewhere (a filesystem with some limits,
> etc) so that at some later time (with some timeouts) tooling can, using
> > its id/buildid cookie, get the contents and symbol table to have a better
> annotation experience?

yes. The plan is to let perf fetch the JITed image via the BPF_OBJ_GET_INFO_BY_FD
cmd during the perf record run and store it inside perf.data in synthetic records.
Then perf report can annotate it later.
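
A sketch of that record-side fetch with the existing libbpf helpers
(bpf_prog_get_fd_by_id() plus the usual two-call BPF_OBJ_GET_INFO_BY_FD
pattern); writing the synthetic record into perf.data is left out:

#include <bpf/bpf.h>	/* bpf_prog_get_fd_by_id(), bpf_obj_get_info_by_fd() */
#include <linux/bpf.h>	/* struct bpf_prog_info */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void *fetch_jited_image(__u32 prog_id, __u32 *img_len)
{
	struct bpf_prog_info info = {};
	__u32 len = sizeof(info);
	void *img = NULL;
	int fd;

	fd = bpf_prog_get_fd_by_id(prog_id);
	if (fd < 0)
		return NULL;	/* prog already unloaded: nothing to fetch */

	/* first call: learn how large the JITed image is */
	if (bpf_obj_get_info_by_fd(fd, &info, &len) || !info.jited_prog_len)
		goto out;

	*img_len = info.jited_prog_len;
	img = calloc(1, *img_len);
	if (!img)
		goto out;

	/* second call: have the kernel copy the image out */
	memset(&info, 0, sizeof(info));
	info.jited_prog_len = *img_len;
	info.jited_prog_insns = (__u64)(unsigned long)img;
	len = sizeof(info);
	if (bpf_obj_get_info_by_fd(fd, &info, &len)) {
		free(img);
		img = NULL;
	}
out:
	close(fd);
	return img;
}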

> Using /proc/kcore as right now we should be able to annotate the BPF JIT
> as it is loaded, in 'perf top', for instance, with your suggested code,
> because we would have a symbol resolved that spans the whole bpf
> program, I'll try applying your patch and seeing how it goes.

yes. /proc/kcore helps to annotate during perf report when progs
are still loaded, but the user experience sucks.
For networking bpf performance testing, for example, the user needs to:
- load the program
- start blasting traffic to trigger prog execution
- perf record -a sleep 5
- stop traffic
- perf report -> to analyze what's going on in bpf prog
- unload prog
All steps are manual.

We need to get to the point where we can:
- perf record -a ./my_bpf_test
- perf report
where my_bpf_test will load, attach, run traffic, detach, unload.
Most of selftests/bpf/* are done this way, but we cannot profile
them at the moment.
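
For reference, my_bpf_test above is just the usual selftest shape, sketched
here with libbpf (the object file name and prog type are placeholders):

#include <linux/bpf.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
	struct bpf_object *obj;
	int prog_fd;

	/* load + verify + JIT: this is where the kallsyms entries appear */
	if (bpf_prog_load("my_test.o", BPF_PROG_TYPE_XDP, &obj, &prog_fd))
		return 1;

	/* attach and drive traffic through the prog here */

	/* detach + unload: the kallsyms entries disappear again */
	bpf_object__close(obj);
	return 0;
}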

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs
  2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov
  2018-09-20 13:36   ` Arnaldo Carvalho de Melo
@ 2018-09-21 22:57   ` Song Liu
  1 sibling, 0 replies; 35+ messages in thread
From: Song Liu @ 2018-09-21 22:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, Daniel Borkmann, Peter Zijlstra, acme,
	Networking, kernel-team

On Wed, Sep 19, 2018 at 3:51 PM Alexei Starovoitov <ast@kernel.org> wrote:
>
> Recognize JITed bpf prog load/unload events.
> Add/remove kernel symbols accordingly.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/perf/util/machine.c | 27 +++++++++++++++++++++++++++
>  tools/perf/util/symbol.c  | 13 +++++++++++++
>  tools/perf/util/symbol.h  |  1 +
>  3 files changed, 41 insertions(+)
>
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index c4acd2001db0..ae4f8a0fdc7e 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -25,6 +25,7 @@
>  #include "sane_ctype.h"
>  #include <symbol/kallsyms.h>
>  #include <linux/mman.h>
> +#include <linux/magic.h>
>
>  static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock);
>
> @@ -1460,6 +1461,32 @@ static int machine__process_kernel_mmap_event(struct machine *machine,
>         enum dso_kernel_type kernel_type;
>         bool is_kernel_mmap;
>
> +       /* process JITed bpf programs load/unload events */
> +       if (event->mmap.pid == ~0u && event->mmap.tid == BPF_FS_MAGIC) {
> +               struct symbol *sym;
> +               u64 ip;
> +
> +               map = map_groups__find(&machine->kmaps, event->mmap.start);
> +               if (!map) {
> +                       pr_err("No kernel map for IP %lx\n", event->mmap.start);
> +                       goto out_problem;
> +               }
> +               ip = event->mmap.start - map->start + map->pgoff;
> +               if (event->mmap.filename[0]) {
> +                       sym = symbol__new(ip, event->mmap.len, 0, 0,
> +                                         event->mmap.filename);
> +                       dso__insert_symbol(map->dso, sym);
> +               } else {
> +                       if (symbols__erase(&map->dso->symbols, ip)) {
> +                               pr_err("No bpf prog at IP %lx/%lx\n",
> +                                      event->mmap.start, ip);
> +                               goto out_problem;
> +                       }
> +                       dso__reset_find_symbol_cache(map->dso);
> +               }
> +               return 0;
> +       }
> +
>         /* If we have maps from kcore then we do not need or want any others */
>         if (machine__uses_kcore(machine))
>                 return 0;
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index d188b7588152..0653f313661d 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -353,6 +353,19 @@ static struct symbol *symbols__find(struct rb_root *symbols, u64 ip)
>         return NULL;
>  }
>
> +int symbols__erase(struct rb_root *symbols, u64 ip)
> +{
> +       struct symbol *s;
> +
> +       s = symbols__find(symbols, ip);
> +       if (!s)
> +               return -ENOENT;
> +
> +       rb_erase(&s->rb_node, symbols);
> +       symbol__delete(s);
> +       return 0;
> +}
> +
>  static struct symbol *symbols__first(struct rb_root *symbols)
>  {
>         struct rb_node *n = rb_first(symbols);
> diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> index f25fae4b5743..92ef31953d9a 100644
> --- a/tools/perf/util/symbol.h
> +++ b/tools/perf/util/symbol.h
> @@ -310,6 +310,7 @@ char *dso__demangle_sym(struct dso *dso, int kmodule, const char *elf_name);
>
>  void __symbols__insert(struct rb_root *symbols, struct symbol *sym, bool kernel);
>  void symbols__insert(struct rb_root *symbols, struct symbol *sym);
> +int symbols__erase(struct rb_root *symbols, u64 ip);
>  void symbols__fixup_duplicate(struct rb_root *symbols);
>  void symbols__fixup_end(struct rb_root *symbols);
>  void map_groups__fixup_end(struct map_groups *mg);
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-09-21 22:13             ` Alexei Starovoitov
@ 2018-10-15 23:33               ` Song Liu
  2018-10-16 23:43                 ` David Ahern
  0 siblings, 1 reply; 35+ messages in thread
From: Song Liu @ 2018-10-15 23:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller,
	Daniel Borkmann, Networking, kernel-team

On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> >
> > > I have considered adding MUNMAP to match the existing MMAP, but went
> > > without it because I didn't want to introduce a new bit in perf_event_attr
> > > and emit these new events in an unbalanced, conditional way for prog load/unload.
> > > Old perf asks the kernel for mmap events via the mmap bit, so prog load events
> >
> > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
> > > perf_event_attr.mmap2 = 1, will cause the new kernel to emit
> > PERF_RECORD_MMAP records for the range of addresses that a BPF program
> > is being loaded on, right?
>
> right. it would be weird when prog load events are there, but not unload.
>
> > > will be in perf.data, but old perf report won't recognize them anyway.
> >
> > > Why not? It should look up the symbol and find it in the rb_tree of maps,
> > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> > BPF core, no? It'll be an unresolved symbol, but a resolved map.
> >
> > > Whereas new perf would certainly want to catch bpf events and will set
> > > both mmap and munmap bits.
> >
> > new perf with your code will find a symbol, not a map, because your code
> > catches a special case PERF_RECORD_MMAP and instead of creating a
> > 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> > 'struct map', right?
>
> right.
> bpf progs are more similar to kernel functions than to modules.
> For modules it makes sense to create a new map and insert symbols into it.
> For bpf JITed images there is no DSO to parse.
> A single bpf elf file may contain multiple bpf progs, and each prog may contain
> multiple bpf functions. They will be loaded at different times and
> will have different lifetimes.
>
> > In theory the old perf should catch the PERF_RECORD_MMAP with a string
> > in the filename part and insert a new map into the kernel mmap rb_tree,
> > and then samples would be resolved to this map, but since there is no
> > backing DSO with a symtab, it would stop at that, just stating that the
> > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> > there is something in there that makes it ignore this PERF_RECORD_MMAP
> > emitted by the BPF kernel code when loading a new program.
>
> In /proc/kcore there is already a section for the module range.
> Hence, when perf processes bpf load/unload events, the map is already created.
> Therefore patch 3 only searches for it and inserts the new symbol into it.
>
> In that sense the reuse of the RECORD_MMAP event for bpf progs is indeed
> not exactly clean, since no new map is created.
> Would it be better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events?
>
> Such events could potentially be used for offline ksym resolution:
> perf could process /proc/kallsyms during perf record and emit all of them
> as synthetic PERF_RECORD_INSERT_KSYM records into perf.data, so perf report
> can run on a different server and still find the right symbols.
>
> I guess we can do bpf-specific events too and keep RECORD_MMAP as-is.
> How about a single PERF_RECORD_BPF event with an internal flag for load/unload?
>
> > Right, that is another unfortunate state of affairs, kernel module
> > load/unload should already be supported, reported by the kernel via a
> > proper PERF_RECORD_MODULE_LOAD/UNLOAD
>
> I agree with Peter here. It would be nice, but low priority.
> Modules are mostly static: loaded once and they stay there.
>
> > There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> > should include a build-id, to avoid either userspace getting confused
> > when some mmapped DSO is updated during a long running session, for
> > instance, or having to scan the just-recorded perf.data file for DSOs
> > with samples and then read them from the file system (more races).
> >
> > Have you ever considered having a build-id for bpf objects that could be
> > used here?
>
> The build-id concept is not applicable to bpf.
> bpf elf files on disk don't have a good correlation with what is
> running in the kernel. The bpf bytestream is converted and optimized
> by the verifier, then JITed.
> So the debug info left in the .o file and the original bpf bytestream
> in the .o are mostly useless.
> For bpf programs we have a 'program tag'. It is computed over the original
> bpf bytestream, so both kernel and user space can compute it.
> In libbcc we use the /var/tmp/bcc/bpf_prog_TAG/ directory to store the original
> source code of the program, so users looking at kernel stack traces
> with bpf_prog_TAG can find the source.
> It's similar to a build-id, but it's not going to help perf annotate the
> actual x86 instructions inside the JITed image and show the src code.
> Since the JIT runs in the kernel, this problem cannot be solved by user space only.
> It's a difficult problem and we have a plan to tackle it,
> but that's step 2. A bunch of infra is needed on the bpf side to
> preserve the association through src_in_C -> original bpf insns ->
> translated bpf insns -> JITed asm,
> then report it back to user space, and then teach perf to properly annotate progs.
>
> > Will the JITed code from some BPF bytecode be different if the same
> > bytecode is JITed in several machines, all having the exact same
> > hardware?
>
> Yes. JITed code will be different depending on sysctl knobs (like const blinding).
> So the same original bpf byte stream loaded at different times
> may have a different JITed image.
>
> Even without security features like blinding, the JIT can be different:
> bpf maps are separate from bpf progs. The bpf map is created first.
> Then the same bpf instruction stream (depending on the map type it uses)
> may be optimized by the verifier differently, causing differences in the JIT.
>
> > > (instead of passing the kallsyms bpf prog name in event->mmap.filename),
> > > but bpf functions don't have their own prog_id. Multiple bpf funcs
> > > with different JITed blobs are considered to be part of a single prog_id.
> > > So as step one I'm only extending RECORD_MMAP with the addr and kallsym
> > > name of the bpf function/prog.
> > > As step two the plan is to add a notification mechanism for prog load/unload
> > > that will include prog_id, and to design new synthetic RECORD_* events in
> > > perf user space that will contain JITed code, line info, BTF, etc.
> >
> > So, will the kernel JIT a bytecode, load it somewhere and run it, then,
> > when unloading it, keep it somewhere (a filesystem with some limits,
> > etc) so that at some later time (with some timeouts) tooling can, using
> > > its id/buildid cookie, get the contents and symbol table to have a better
> > annotation experience?
>
> yes. The plan is to let perf fetch the JITed image via the BPF_OBJ_GET_INFO_BY_FD
> cmd during the perf record run and store it inside perf.data in synthetic records.
> Then perf report can annotate it later.

Hi Peter and Arnaldo,

I am working with Alexei on the idea of fetching BPF program information via
BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
to perf_event_type, and dumped these events to perf event ring buffer.

I found that perf will not process events until the end of perf-record:

root@virt-test:~# ~/perf record -ag -- sleep 10
...... 10 seconds later
[ perf record: Woken up 34 times to write data ]
machine__process_bpf_event: prog_id 6 loaded
machine__process_bpf_event: prog_id 6 unloaded
[ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]

In this example, the bpf program was loaded and then unloaded in
another terminal. When machine__process_bpf_event() processes
the load event, the bpf program is already unloaded. Therefore,
machine__process_bpf_event() will not be able to get information
about the program via BPF_OBJ_GET_INFO_BY_FD cmd.

To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
as soon as perf gets the event from the kernel. I looked around the perf
code for a while. But I haven't found a good example where some
events are processed before the end of perf-record. Could you
please help me with this?

Thanks,
Song

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-15 23:33               ` Song Liu
@ 2018-10-16 23:43                 ` David Ahern
  2018-10-17  6:43                   ` Song Liu
  0 siblings, 1 reply; 35+ messages in thread
From: David Ahern @ 2018-10-16 23:43 UTC (permalink / raw)
  To: Song Liu, Alexei Starovoitov
  Cc: acme, Peter Zijlstra, Alexei Starovoitov, David S . Miller,
	Daniel Borkmann, Networking, kernel-team

On 10/15/18 4:33 PM, Song Liu wrote:
> I am working with Alexei on the idea of fetching BPF program information via
> BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> to perf_event_type, and dumped these events to perf event ring buffer.
> 
> I found that perf will not process events until the end of perf-record:
> 
> root@virt-test:~# ~/perf record -ag -- sleep 10
> ...... 10 seconds later
> [ perf record: Woken up 34 times to write data ]
> machine__process_bpf_event: prog_id 6 loaded
> machine__process_bpf_event: prog_id 6 unloaded
> [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> 
> In this example, the bpf program was loaded and then unloaded in
> another terminal. When machine__process_bpf_event() processes
> the load event, the bpf program is already unloaded. Therefore,
> machine__process_bpf_event() will not be able to get information
> about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> 
> To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> as soon as perf gets the event from the kernel. I looked around the perf
> code for a while. But I haven't found a good example where some
> events are processed before the end of perf-record. Could you
> please help me with this?

perf record does not process events as they are generated. Its sole job
is pushing data from the maps to a file as fast as possible, meaning in
bulk, based on current read and write locations.

Adding code to process events will add significant overhead to the
record command and will not really solve your race problem.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-16 23:43                 ` David Ahern
@ 2018-10-17  6:43                   ` Song Liu
  2018-10-17 12:11                     ` Arnaldo Carvalho de Melo
  2018-10-17 15:09                     ` David Ahern
  0 siblings, 2 replies; 35+ messages in thread
From: Song Liu @ 2018-10-17  6:43 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov,
	David S . Miller, Daniel Borkmann, Networking, kernel-team

Hi David,

On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 10/15/18 4:33 PM, Song Liu wrote:
> > I am working with Alexei on the idea of fetching BPF program information via
> > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > to perf_event_type, and dumped these events to perf event ring buffer.
> >
> > I found that perf will not process events until the end of perf-record:
> >
> > root@virt-test:~# ~/perf record -ag -- sleep 10
> > ...... 10 seconds later
> > [ perf record: Woken up 34 times to write data ]
> > machine__process_bpf_event: prog_id 6 loaded
> > machine__process_bpf_event: prog_id 6 unloaded
> > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> >
> > In this example, the bpf program was loaded and then unloaded in
> > another terminal. When machine__process_bpf_event() processes
> > the load event, the bpf program is already unloaded. Therefore,
> > machine__process_bpf_event() will not be able to get information
> > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> >
> > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > as soon as perf gets the event from the kernel. I looked around the perf
> > code for a while. But I haven't found a good example where some
> > events are processed before the end of perf-record. Could you
> > please help me with this?
>
> perf record does not process events as they are generated. Its sole job
> is pushing data from the maps to a file as fast as possible, meaning in
> bulk, based on current read and write locations.
>
> Adding code to process events will add significant overhead to the
> record command and will not really solve your race problem.

Thanks for the comment.

I agree that processing events while recording has significant overhead.
In this case, perf user space needs to know details about the jited BPF
program. It is impossible to pass all these details to user space through
the relatively stable ring_buffer API. Therefore, some processing of the
data is necessary (get the bpf prog_id from the ring buffer, and then fetch
program details via BPF_OBJ_GET_INFO_BY_FD).

I have some ideas on processing important data with relatively low overhead.
Let me try to implement it.

Thanks again,
Song

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17  6:43                   ` Song Liu
@ 2018-10-17 12:11                     ` Arnaldo Carvalho de Melo
  2018-10-17 12:50                       ` Arnaldo Carvalho de Melo
  2018-11-02  1:08                       ` Song Liu
  2018-10-17 15:09                     ` David Ahern
  1 sibling, 2 replies; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-17 12:11 UTC (permalink / raw)
  To: Song Liu
  Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, Alexey Budankov, David S . Miller,
	Daniel Borkmann, Namhyung Kim, Jiri Olsa, Networking,
	kernel-team

Adding Alexey, Jiri and Namhyung as they worked/are working on
multithreading 'perf record'.

On Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu wrote:
> On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote:
> > On 10/15/18 4:33 PM, Song Liu wrote:
> > > I am working with Alexei on the idea of fetching BPF program information via
> > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > to perf_event_type, and dumped these events to perf event ring buffer.

> > > I found that perf will not process events until the end of perf-record:

> > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > ...... 10 seconds later
> > > [ perf record: Woken up 34 times to write data ]
> > > machine__process_bpf_event: prog_id 6 loaded
> > > machine__process_bpf_event: prog_id 6 unloaded
> > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]

> > > In this example, the bpf program was loaded and then unloaded in
> > > another terminal. When machine__process_bpf_event() processes
> > > the load event, the bpf program is already unloaded. Therefore,
> > > machine__process_bpf_event() will not be able to get information
> > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.

> > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > as soon as perf gets the event from the kernel. I looked around the perf
> > > code for a while. But I haven't found a good example where some
> > > events are processed before the end of perf-record. Could you
> > > please help me with this?

> > perf record does not process events as they are generated. Its sole job
> > is pushing data from the maps to a file as fast as possible, meaning in
> > bulk, based on current read and write locations.

> > Adding code to process events will add significant overhead to the
> > record command and will not really solve your race problem.

> I agree that processing events while recording has significant overhead.
> In this case, perf user space needs to know details about the jited BPF
> program. It is impossible to pass all these details to user space through
> the relatively stable ring_buffer API. Therefore, some processing of the
> data is necessary (get the bpf prog_id from the ring buffer, and then fetch
> program details via BPF_OBJ_GET_INFO_BY_FD).
 
> I have some ideas on processing important data with relatively low overhead.
> Let me try to implement it.

Well, you could have a separate thread processing just those kinds of
events, associate it with a dummy event where you only ask for
PERF_RECORD_BPF_EVENTs.

Here is how to set up the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
perf_event_attr:

[root@seventh ~]# perf record -vv -e dummy sleep 01
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x9
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|PERIOD
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 12046  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid 12046  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid 12046  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid 12046  cpu 3  group_fd -1  flags 0x8 = 8
mmap size 528384B
perf event ring buffer mmapped per cpu
Synthesizing TSC conversion information
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data ]
[root@seventh ~]#

[root@seventh ~]# perf evlist -v
dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1
[root@seventh ~]# 
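
In C, the attr for such a side channel could look roughly like this (the
bpf_event bit is hypothetical, standing in for however the new records end
up being requested):

#include <linux/perf_event.h>

struct perf_event_attr attr = {
	.size          = sizeof(attr),
	.type          = PERF_TYPE_SOFTWARE,	/* type 1 in the dump above */
	.config        = PERF_COUNT_SW_DUMMY,	/* config 0x9 in the dump above */
	.sample_id_all = 1,
	/* .bpf_event  = 1,	hypothetical bit asking for the new records */
};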

There is ongoing work on dumping one file per cpu and then, at
post-processing time, merging all those files to get ordering, so one more
file, for these VIP events that require per-event processing, would be
ordered at that time with all the other per-cpu files.

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 12:11                     ` Arnaldo Carvalho de Melo
@ 2018-10-17 12:50                       ` Arnaldo Carvalho de Melo
  2018-10-17 16:06                         ` Song Liu
  2018-11-02  1:08                       ` Song Liu
  1 sibling, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-17 12:50 UTC (permalink / raw)
  To: Song Liu
  Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, Alexey Budankov, David S . Miller,
	Daniel Borkmann, Namhyung Kim, Jiri Olsa, Networking,
	kernel-team

On Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo wrote:
> Adding Alexey, Jiri and Namhyung as they worked/are working on
> multithreading 'perf record'.
> 
> On Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu wrote:
> > On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote:
> > > On 10/15/18 4:33 PM, Song Liu wrote:
> > > > I am working with Alexei on the idea of fetching BPF program information via
> > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > > to perf_event_type, and dumped these events to perf event ring buffer.
> 
> > > > I found that perf will not process events until the end of perf-record:
> 
> > > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > > ...... 10 seconds later
> > > > [ perf record: Woken up 34 times to write data ]
> > > > machine__process_bpf_event: prog_id 6 loaded
> > > > machine__process_bpf_event: prog_id 6 unloaded
> > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> 
> > > > In this example, the bpf program was loaded and then unloaded in
> > > > another terminal. When machine__process_bpf_event() processes
> > > > the load event, the bpf program is already unloaded. Therefore,
> > > > machine__process_bpf_event() will not be able to get information
> > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> 
> > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > > as soon as perf gets the event from the kernel. I looked around the perf
> > > > code for a while. But I haven't found a good example where some
> > > > events are processed before the end of perf-record. Could you
> > > > please help me with this?
> 
> > > perf record does not process events as they are generated. Its sole job
> > > is pushing data from the maps to a file as fast as possible, meaning in
> > > bulk, based on current read and write locations.
> 
> > > Adding code to process events will add significant overhead to the
> > > record command and will not really solve your race problem.
> 
> > I agree that processing events while recording has significant overhead.
> > In this case, perf user space needs to know details about the jited BPF
> > program. It is impossible to pass all these details to user space through
> > the relatively stable ring_buffer API. Therefore, some processing of the
> > data is necessary (get the bpf prog_id from the ring buffer, and then fetch
> > program details via BPF_OBJ_GET_INFO_BY_FD).
> 
> > I have some ideas on processing important data with relatively low overhead.
> > Let me try to implement it.
> 
> Well, you could have a separate thread processing just those kinds of
> events, associate it with a dummy event where you only ask for
> PERF_RECORD_BPF_EVENTs.
> 
> Here is how to set up the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
> perf_event_attr:
> 
> [root@seventh ~]# perf record -vv -e dummy sleep 01
> ------------------------------------------------------------
> perf_event_attr:
>   type                             1
>   size                             112
>   config                           0x9
>   { sample_period, sample_freq }   4000
>   sample_type                      IP|TID|TIME|PERIOD
>   disabled                         1
>   inherit                          1

These you would have disabled, no need for
PERF_RECORD_{MMAP*,COMM,FORK,EXIT}, just PERF_RECORD_BPF_EVENT.

>   mmap                             1
>   comm                             1
>   task                             1
>   mmap2                            1
>   comm_exec                        1


>   freq                             1
>   enable_on_exec                   1
>   sample_id_all                    1
>   exclude_guest                    1
> ------------------------------------------------------------
> sys_perf_event_open: pid 12046  cpu 0  group_fd -1  flags 0x8 = 4
> sys_perf_event_open: pid 12046  cpu 1  group_fd -1  flags 0x8 = 5
> sys_perf_event_open: pid 12046  cpu 2  group_fd -1  flags 0x8 = 6
> sys_perf_event_open: pid 12046  cpu 3  group_fd -1  flags 0x8 = 8
> mmap size 528384B
> perf event ring buffer mmapped per cpu
> Synthesizing TSC conversion information
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.014 MB perf.data ]
> [root@seventh ~]#
> 
> [root@seventh ~]# perf evlist -v
> dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1
> [root@seventh ~]# 
> 
> There is ongoing work on dumping one file per cpu and then, at
> post-processing time, merging all those files to get ordering, so one more
> file, for these VIP events that require per-event processing, would be
> ordered at that time with all the other per-cpu files.
> 
> - Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17  6:43                   ` Song Liu
  2018-10-17 12:11                     ` Arnaldo Carvalho de Melo
@ 2018-10-17 15:09                     ` David Ahern
  2018-10-17 16:18                       ` Song Liu
  2018-10-17 16:36                       ` Alexei Starovoitov
  1 sibling, 2 replies; 35+ messages in thread
From: David Ahern @ 2018-10-17 15:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov,
	David S . Miller, Daniel Borkmann, Networking, kernel-team

On 10/16/18 11:43 PM, Song Liu wrote:
> I agree that processing events while recording has significant overhead.
> In this case, perf user space needs to know details about the jited BPF
> program. It is impossible to pass all these details to user space through
> the relatively stable ring_buffer API. Therefore, some processing of the
> data is necessary (get the bpf prog_id from the ring buffer, and then fetch
> program details via BPF_OBJ_GET_INFO_BY_FD).
> 
> I have some ideas on processing important data with relatively low overhead.
> Let me try to implement it.
> 

As I understand it, you want this series:

 kernel: add event to perf buffer on bpf prog load

 userspace: perf reads the event and grabs information about the program
from the fd

Is that correct?

Userspace is not awakened immediately when an event is added to the
ring. It is awakened once the number of events crosses a watermark. That
means there is an unknown - and potentially long - time window where the
program can be unloaded before perf reads the event.

So no matter what you do, expecting perf record to be able to process the
event quickly is an unreasonable expectation.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 12:50                       ` Arnaldo Carvalho de Melo
@ 2018-10-17 16:06                         ` Song Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Song Liu @ 2018-10-17 16:06 UTC (permalink / raw)
  To: arnaldo.melo
  Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, Alexey Budankov, David S . Miller,
	Daniel Borkmann, namhyung, Jiri Olsa, Networking, kernel-team

On Wed, Oct 17, 2018 at 5:50 AM Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> On Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo wrote:
> > Adding Alexey, Jiri and Namhyung as they worked/are working on
> > multithreading 'perf record'.
> >
> > On Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu wrote:
> > > On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote:
> > > > On 10/15/18 4:33 PM, Song Liu wrote:
> > > > > I am working with Alexei on the idea of fetching BPF program information via
> > > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > > > to perf_event_type, and dumped these events to perf event ring buffer.
> >
> > > > > I found that perf will not process events until the end of perf-record:
> >
> > > > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > > > ...... 10 seconds later
> > > > > [ perf record: Woken up 34 times to write data ]
> > > > > machine__process_bpf_event: prog_id 6 loaded
> > > > > machine__process_bpf_event: prog_id 6 unloaded
> > > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> >
> > > > > In this example, the bpf program was loaded and then unloaded in
> > > > > another terminal. When machine__process_bpf_event() processes
> > > > > the load event, the bpf program is already unloaded. Therefore,
> > > > > machine__process_bpf_event() will not be able to get information
> > > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> >
> > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > > > as soon as perf gets the event from the kernel. I looked around the perf
> > > > > code for a while. But I haven't found a good example where some
> > > > > events are processed before the end of perf-record. Could you
> > > > > please help me with this?
> >
> > > > perf record does not process events as they are generated. Its sole job
> > > > is pushing data from the maps to a file as fast as possible, meaning in
> > > > bulk, based on current read and write locations.
> >
> > > > Adding code to process events will add significant overhead to the
> > > > record command and will not really solve your race problem.
> >
> > > I agree that processing events while recording has significant overhead.
> > > In this case, perf user space needs to know details about the jited BPF
> > > program. It is impossible to pass all these details to user space through
> > > the relatively stable ring_buffer API. Therefore, some processing of the
> > > data is necessary (get the bpf prog_id from the ring buffer, and then fetch
> > > program details via BPF_OBJ_GET_INFO_BY_FD).
> >
> > > I have some ideas on processing important data with relatively low overhead.
> > > Let me try to implement it.
> >
> > Well, you could have a separate thread processing just those kinds of
> > events, associate it with a dummy event where you only ask for
> > PERF_RECORD_BPF_EVENTs.
> >
> > Here is how to set up the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
> > perf_event_attr:
> >
> > [root@seventh ~]# perf record -vv -e dummy sleep 01
> > ------------------------------------------------------------
> > perf_event_attr:
> >   type                             1
> >   size                             112
> >   config                           0x9
> >   { sample_period, sample_freq }   4000
> >   sample_type                      IP|TID|TIME|PERIOD
> >   disabled                         1
> >   inherit                          1
>
> These you would have disabled, no need for
> PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT
>
> >   mmap                             1
> >   comm                             1
> >   task                             1
> >   mmap2                            1
> >   comm_exec                        1
>
>

Thanks Arnaldo! This looks better than my original idea (using POLLPRI
to highlight special events). I will try to implement the BPF_EVENT in
this direction.

Song

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 15:09                     ` David Ahern
@ 2018-10-17 16:18                       ` Song Liu
  2018-10-17 16:36                       ` Alexei Starovoitov
  1 sibling, 0 replies; 35+ messages in thread
From: Song Liu @ 2018-10-17 16:18 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov,
	David S . Miller, Daniel Borkmann, Networking, kernel-team

Hi David,

On Wed, Oct 17, 2018 at 8:09 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 10/16/18 11:43 PM, Song Liu wrote:
> > I agree that processing events while recording has significant overhead.
> > In this case, perf user space needs to know details about the jited BPF
> > program. It is impossible to pass all these details to user space through
> > the relatively stable ring_buffer API. Therefore, some processing of the
> > data is necessary (get the bpf prog_id from the ring buffer, and then fetch
> > program details via BPF_OBJ_GET_INFO_BY_FD).
> >
> > I have some ideas on processing important data with relatively low overhead.
> > Let me try to implement it.
> >
>
> As I understand it, you want this series:
>
>  kernel: add event to perf buffer on bpf prog load
>
>  userspace: perf reads the event and grabs information about the program
> from the fd
>
> Is that correct?

Yes, this is correct.

>
> Userspace is not awakened immediately when an event is added to the
> ring. It is awakened once the number of events crosses a watermark. That
> means there is an unknown - and potentially long - time window where the
> program can be unloaded before perf reads the event.
>
> So no matter what you do, expecting perf record to be able to process the
> event quickly is an unreasonable expectation.

In this case, we don't really need to gather the information immediately. We
will lose information about some short-lived BPF programs. These BPF
programs are less important for the performance analysis anyway. I guess
the only missing case is when some BPF program gets loaded/unloaded many
times, so each instance is short-lived, but the overall perf impact is
significant. I think this case is not interesting either, as BPF programs
should not be used like this (considering the kernel needs to translate and
jit the program, unloading and reloading soon doesn't make any sense).

Thanks,
Song

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 15:09                     ` David Ahern
  2018-10-17 16:18                       ` Song Liu
@ 2018-10-17 16:36                       ` Alexei Starovoitov
  2018-10-17 18:53                         ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2018-10-17 16:36 UTC (permalink / raw)
  To: David Ahern, Song Liu
  Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov,
	David S . Miller, Daniel Borkmann, Networking, Kernel Team

On 10/17/18 8:09 AM, David Ahern wrote:
> On 10/16/18 11:43 PM, Song Liu wrote:
>> I agree that processing events while recording has significant overhead.
>> In this case, perf user space needs to know details about the jited BPF
>> program. It is impossible to pass all these details to user space through
>> the relatively stable ring_buffer API. Therefore, some processing of the
>> data is necessary (get the bpf prog_id from the ring buffer, and then fetch
>> program details via BPF_OBJ_GET_INFO_BY_FD).
>>
>> I have some ideas on processing important data with relatively low overhead.
>> Let me try to implement it.
>>
>
> As I understand it, you want this series:
>
>  kernel: add event to perf buffer on bpf prog load
>
>  userspace: perf reads the event and grabs information about the program
> from the fd
>
> Is that correct?
>
> Userpsace is not awakened immediately when an event is added the the
> ring. It is awakened once the number of events crosses a watermark. That
> means there is an unknown - and potentially long - time window where the
> program can be unloaded before perf reads the event.
>
> So no matter what you do, expecting perf record to be able to process the
> event quickly is unreasonable.

yes... unless we go with the threaded model as Arnaldo suggested and use a
single event as the watermark to wake up our perf thread.
In such a case there is still a race window between user space waking up
and doing a _single_ bpf_get_fd_from_id() call to hold that prog,
and some other process trying to instantly unload the prog it
just loaded.
I think such a race window is extremely tiny, and if perf misses
those load/unload events it's a good thing, since there is no chance
that normal pmu event samples would be happening during prog execution.
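
The per-record wakeup itself is existing perf ABI, for what it's worth;
a sketch of just the relevant attr fields (assuming a dummy software
event is what would carry the new records, which is an assumption of
this sketch):

#include <linux/perf_event.h>
#include <string.h>

/* Sketch: make poll() on the ring buffer return after every single
 * record instead of waiting for a byte watermark. */
static void init_wakeup_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_DUMMY;	/* ring buffer carrier only */
	attr->watermark = 0;		/* wakeup_events counts records */
	attr->wakeup_events = 1;	/* wake the perf thread per record */
}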

The alternative approach, with no race window at all, is to burden kernel
RECORD_* events with _all_ information about the bpf prog: the jited
addresses, the jited image itself, info about all subprogs, line
info, all BTF data, etc. As I said earlier, I'm strongly against such
RECORD_* bloating.
Instead we need to find a way to process new RECORD_BPF events with a
single prog_id field in perf user space with minimal race,
and the threaded approach sounds like a win to me.
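
To make the size argument concrete, a purely hypothetical layout for such
a minimal record (the name and fields are illustrative, not proposed ABI):

#include <linux/perf_event.h>
#include <linux/types.h>

/* Illustrative only: a load/unload notification that carries just
 * enough for user space to do the get_fd_by_id() dance. */
struct perf_record_bpf_event {
	struct perf_event_header header;	/* e.g. PERF_RECORD_BPF */
	__u16	type;		/* prog load or unload */
	__u16	reserved;
	__u32	prog_id;	/* everything else is fetched later
				 * via BPF_OBJ_GET_INFO_BY_FD */
};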


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 16:36                       ` Alexei Starovoitov
@ 2018-10-17 18:53                         ` Arnaldo Carvalho de Melo
  2018-10-17 19:08                           ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-17 18:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Networking, Kernel Team

On Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov wrote:
> On 10/17/18 8:09 AM, David Ahern wrote:
> > On 10/16/18 11:43 PM, Song Liu wrote:
> >> I agree that processing events while recording has significant overhead.
> >> In this case, perf user space needs to know details about the jited BPF
> >> program. It is impossible to pass all these details to user space through
> >> the relatively stable ring_buffer API. Therefore, some processing of the
> >> data is necessary (get the bpf prog_id from the ring buffer, and then fetch
> >> program details via BPF_OBJ_GET_INFO_BY_FD).
> >>
> >> I have some ideas on processing the important data with relatively low
> >> overhead. Let me try to implement it.
> >>
> >
> > As I understand it, you want this series:
> >
> >  kernel: add event to perf buffer on bpf prog load
> >
> >  userspace: perf reads the event and grabs information about the program
> > from the fd
> >
> > Is that correct?
> >
> > Userspace is not awakened immediately when an event is added to the
> > ring. It is awakened once the number of events crosses a watermark. That
> > means there is an unknown - and potentially long - time window where the
> > program can be unloaded before perf reads the event.

> > So no matter what you do, expecting perf record to be able to process the
> > event quickly is unreasonable.
 
> yes... unless we go with the threaded model as Arnaldo suggested and use a
> single event as the watermark to wake up our perf thread.
> In such a case there is still a race window between user space waking up
> and doing a _single_ bpf_get_fd_from_id() call to hold that prog,
> and some other process trying to instantly unload the prog it
> just loaded.
> I think such a race window is extremely tiny, and if perf misses
> those load/unload events it's a good thing, since there is no chance
> that normal pmu event samples would be happening during prog execution.
 
> The alternative approach, with no race window at all, is to burden kernel
> RECORD_* events with _all_ information about the bpf prog: the jited
> addresses, the jited image itself, info about all subprogs, line
> info, all BTF data, etc. As I said earlier, I'm strongly against such
> RECORD_* bloating.
> Instead we need to find a way to process new RECORD_BPF events with a
> single prog_id field in perf user space with minimal race,
> and the threaded approach sounds like a win to me.

There is another alternative, I think: put just a content-based hash,
like a git commit id, into a new PERF_RECORD_MMAP3 record. When the
validator does the jit, etc., it stashes the content that
BPF_OBJ_GET_INFO_BY_FD would return somewhere, say a filesystem populated
by the kernel right after getting the program from sys_bpf() and preparing
it for use. Then we know that in (start, end) we have blob foo with that
content id, which we can use to retrieve information that augments what we
know from just (start, end, id) and allows annotation, etc.

That stash space for jitted stuff gets garbage collected from time to
time, or is even completely disabled if the user is not interested in
such augmentation, just like one can disable perf's ~/.debug/
directory hashed by build-id.

I think with this we have no races: the PERF_RECORD_MMAP3 gets just what
is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such a content-based
cookie, and we solve the other race we already have with kernel modules,
DSOs, etc.
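
To make that concrete, a hypothetical sketch of such a record;
PERF_RECORD_MMAP3 does not exist today, so the layout below is purely
illustrative:

#include <linux/perf_event.h>
#include <linux/types.h>

/* Illustrative only: PERF_RECORD_MMAP2's payload plus a 20-byte
 * content-based cookie, in the spirit of a build-id. */
struct perf_record_mmap3 {
	struct perf_event_header header;
	__u32	pid, tid;
	__u64	addr, len, pgoff;
	__u32	maj, min;
	__u64	ino, ino_generation;
	__u32	prot, flags;
	__u8	content_id[20];	/* e.g. SHA-1 of the stashed blob */
	char	filename[];
};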

I have mentioned this before and there were objections; perhaps this time
I have formulated it in a way that makes it more interesting?

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 18:53                         ` Arnaldo Carvalho de Melo
@ 2018-10-17 19:08                           ` Alexei Starovoitov
  2018-10-17 21:31                             ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2018-10-17 19:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Networking, Kernel Team

On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> On Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov wrote:
>> On 10/17/18 8:09 AM, David Ahern wrote:
>>> On 10/16/18 11:43 PM, Song Liu wrote:
>>>> I agree that processing events while recording has significant overhead.
>>>> In this case, perf user space needs to know details about the jited BPF
>>>> program. It is impossible to pass all these details to user space through
>>>> the relatively stable ring_buffer API. Therefore, some processing of the
>>>> data is necessary (get the bpf prog_id from the ring buffer, and then fetch
>>>> program details via BPF_OBJ_GET_INFO_BY_FD).
>>>>
>>>> I have some ideas on processing the important data with relatively low
>>>> overhead. Let me try to implement it.
>>>>
>>>
>>> As I understand it, you want this series:
>>>
>>>  kernel: add event to perf buffer on bpf prog load
>>>
>>>  userspace: perf reads the event and grabs information about the program
>>> from the fd
>>>
>>> Is that correct?
>>>
>>> Userspace is not awakened immediately when an event is added to the
>>> ring. It is awakened once the number of events crosses a watermark. That
>>> means there is an unknown - and potentially long - time window where the
>>> program can be unloaded before perf reads the event.
>
>>> So no matter what you do, expecting perf record to be able to process the
>>> event quickly is unreasonable.
>
>> yes... unless we go with the threaded model as Arnaldo suggested and use a
>> single event as the watermark to wake up our perf thread.
>> In such a case there is still a race window between user space waking up
>> and doing a _single_ bpf_get_fd_from_id() call to hold that prog,
>> and some other process trying to instantly unload the prog it
>> just loaded.
>> I think such a race window is extremely tiny, and if perf misses
>> those load/unload events it's a good thing, since there is no chance
>> that normal pmu event samples would be happening during prog execution.
>
>> The alternative approach, with no race window at all, is to burden kernel
>> RECORD_* events with _all_ information about the bpf prog: the jited
>> addresses, the jited image itself, info about all subprogs, line
>> info, all BTF data, etc. As I said earlier, I'm strongly against such
>> RECORD_* bloating.
>> Instead we need to find a way to process new RECORD_BPF events with a
>> single prog_id field in perf user space with minimal race,
>> and the threaded approach sounds like a win to me.
>
> There is another alternative, I think: put just a content-based hash,
> like a git commit id, into a new PERF_RECORD_MMAP3 record. When the
> validator does the jit, etc., it stashes the content that
> BPF_OBJ_GET_INFO_BY_FD would return somewhere, say a filesystem populated
> by the kernel right after getting the program from sys_bpf() and preparing
> it for use. Then we know that in (start, end) we have blob foo with that
> content id, which we can use to retrieve information that augments what we
> know from just (start, end, id) and allows annotation, etc.
>
> That stash space for jitted stuff gets garbage collected from time to
> time, or is even completely disabled if the user is not interested in
> such augmentation, just like one can disable perf's ~/.debug/
> directory hashed by build-id.
>
> I think with this we have no races: the PERF_RECORD_MMAP3 gets just what
> is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such a content-based
> cookie, and we solve the other race we already have with kernel modules,
> DSOs, etc.
>
> I have mentioned this before and there were objections; perhaps this time
> I have formulated it in a way that makes it more interesting?

that 'content based hash' we already have. It's called the program tag,
and we already taught iovisor/bcc to stash that stuff into the
/var/tmp/bcc/bpf_prog_TAG/ directory.
Unfortunately that approach didn't work.
The JITed image only exists in the kernel. It's there only while the
program is loaded, and it's the one that matters the most for performance
analysis, since sample IPs are pointing into it.
Also the line info mapping that user space knows is not helping much
either, since the verifier optimizes the instructions and then the JIT
does more. The debug_info <-> JIT relationship must be preserved
by the kernel and returned to user space.
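
For reference, reading the JITed image back is only possible while the
prog is alive, via a second BPF_OBJ_GET_INFO_BY_FD call; a sketch using
the existing bpf_prog_info fields (a privileged caller is assumed, since
the kernel hides the jited fields otherwise):

#include <linux/bpf.h>
#include <bpf/bpf.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: read back the JITed image of a loaded prog through an fd
 * already held on it; possible only while the prog is alive. */
static void *read_jited_image(int prog_fd, __u32 *image_len)
{
	struct bpf_prog_info info;
	__u32 len = sizeof(info);
	void *img;

	/* first call: learn how large the JITed image is */
	memset(&info, 0, sizeof(info));
	if (bpf_obj_get_info_by_fd(prog_fd, &info, &len) ||
	    !info.jited_prog_len)
		return NULL;
	*image_len = info.jited_prog_len;

	img = malloc(*image_len);
	if (!img)
		return NULL;

	/* second call: request only the JITed instructions */
	memset(&info, 0, sizeof(info));
	info.jited_prog_len = *image_len;
	info.jited_prog_insns = (__u64)(unsigned long)img;
	if (bpf_obj_get_info_by_fd(prog_fd, &info, &len)) {
		free(img);
		return NULL;
	}
	return img;
}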

The only other non-race way is to keep all that info in the kernel after
the program is unloaded, but that is even worse than bloating RECORD_*s.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 19:08                           ` Alexei Starovoitov
@ 2018-10-17 21:31                             ` Arnaldo Carvalho de Melo
  2018-10-17 22:01                               ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-17 21:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Networking, Kernel Team

On Wed, Oct 17, 2018 at 07:08:37PM +0000, Alexei Starovoitov wrote:
> On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> > On Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov wrote:
> >> On 10/17/18 8:09 AM, David Ahern wrote:
> >>> On 10/16/18 11:43 PM, Song Liu wrote:
> >>>> I agree that processing events while recording has significant overhead.
> >>>> In this case, perf user space needs to know details about the jited BPF
> >>>> program. It is impossible to pass all these details to user space through
> >>>> the relatively stable ring_buffer API. Therefore, some processing of the
> >>>> data is necessary (get the bpf prog_id from the ring buffer, and then fetch
> >>>> program details via BPF_OBJ_GET_INFO_BY_FD).
> >>>>
> >>>> I have some ideas on processing the important data with relatively low
> >>>> overhead. Let me try to implement it.
> >>>>
> >>>
> >>> As I understand it, you want this series:
> >>>
> >>>  kernel: add event to perf buffer on bpf prog load
> >>>
> >>>  userspace: perf reads the event and grabs information about the program
> >>> from the fd
> >>>
> >>> Is that correct?
> >>>
> >>> Userspace is not awakened immediately when an event is added to the
> >>> ring. It is awakened once the number of events crosses a watermark. That
> >>> means there is an unknown - and potentially long - time window where the
> >>> program can be unloaded before perf reads the event.
> >
> >>> So no matter what you do, expecting perf record to be able to process the
> >>> event quickly is unreasonable.
> >
> >> yes... unless we go with the threaded model as Arnaldo suggested and use a
> >> single event as the watermark to wake up our perf thread.
> >> In such a case there is still a race window between user space waking up
> >> and doing a _single_ bpf_get_fd_from_id() call to hold that prog,
> >> and some other process trying to instantly unload the prog it
> >> just loaded.
> >> I think such a race window is extremely tiny, and if perf misses
> >> those load/unload events it's a good thing, since there is no chance
> >> that normal pmu event samples would be happening during prog execution.

> >> The alternative approach, with no race window at all, is to burden kernel
> >> RECORD_* events with _all_ information about the bpf prog: the jited
> >> addresses, the jited image itself, info about all subprogs, line
> >> info, all BTF data, etc. As I said earlier, I'm strongly against such
> >> RECORD_* bloating.
> >> Instead we need to find a way to process new RECORD_BPF events with a
> >> single prog_id field in perf user space with minimal race,
> >> and the threaded approach sounds like a win to me.

> > There is another alternative, I think: put just a content-based hash,
> > like a git commit id, into a new PERF_RECORD_MMAP3 record. When the
> > validator does the jit, etc., it stashes the content that
> > BPF_OBJ_GET_INFO_BY_FD would return somewhere, say a filesystem populated
> > by the kernel right after getting the program from sys_bpf() and preparing
> > it for use. Then we know that in (start, end) we have blob foo with that
> > content id, which we can use to retrieve information that augments what we
> > know from just (start, end, id) and allows annotation, etc.

> > That stash space for jitted stuff gets garbage collected from time to
> > time, or is even completely disabled if the user is not interested in
> > such augmentation, just like one can disable perf's ~/.debug/
> > directory hashed by build-id.

> > I think with this we have no races: the PERF_RECORD_MMAP3 gets just what
> > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such a content-based
> > cookie, and we solve the other race we already have with kernel modules,
> > DSOs, etc.

> > I have mentioned this before and there were objections; perhaps this time
> > I have formulated it in a way that makes it more interesting?
 
> that 'content based hash' we already have. It's called the program tag,

But that was calculated by whom? Userspace? It can't do that; it's the
kernel that ultimately puts together, from what userspace gave it, what
we need to do performance analysis, line numbers, etc.

> and we already taught iovisor/bcc to stash that stuff into the
> /var/tmp/bcc/bpf_prog_TAG/ directory.
> Unfortunately that approach didn't work.
> The JITed image only exists in the kernel. It's there only while the

That is why I said that _the_ kernel stashes that thing, not bcc/iovisor
or perf: the kernel calculates the hash, and it also puts that into the
PERF_RECORD_MMAP3, so the tool sees it and goes to get it from the place
the kernel stashed it.

> program is loaded, and it's the one that matters the most for performance
> analysis, since sample IPs are pointing into it.

Agreed

> Also the line info mapping that user space knows is not helping much
> either, since the verifier optimizes the instructions and then the JIT
> does more. The debug_info <-> JIT relationship must be preserved
> by the kernel and returned to user space.

Violent agreement :-)
 
> The only other non-race way is to keep all that info in the kernel after
> the program is unloaded, but that is even worse than bloating RECORD_*s.

Keep all that info in a file, as I described above. Or keep it for a
while, to give that thread in userspace time to get it and tell the
kernel that it can throw it away.

It may well be that, most of the time, the 'perf record' thread catching
those events picks them up and saves them in
/var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded,
no?

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 21:31                             ` Arnaldo Carvalho de Melo
@ 2018-10-17 22:01                               ` Alexei Starovoitov
  0 siblings, 0 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2018-10-17 22:01 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Networking, Kernel Team

On 10/17/18 2:31 PM, Arnaldo Carvalho de Melo wrote:
>
> Keep all that info in a file, as I described above. Or keep it for a
> while, to give that thread in userspace time to get it and tell the
> kernel that it can throw it away.

stashing by the kernel into a file is a huge headache, since the format of
the file becomes kernel ABI.
Plus, why have it in two places (in a file and in normal kernel data
structures)?

> It may well be that, most of the time, the 'perf record' thread catching
> those events picks them up and saves them in
> /var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded,
> no?

Whether perf user space stashes that info into perf.data as
synthetic records or stashes it somewhere in /var/tmp/ sounds about
equivalent to me. Both have their pros and cons. This is certainly
a good topic to discuss further.

But asking the kernel to keep JITed images and all relevant bpf data
after a program has been unloaded sounds really scary to me.
I'm struggling to think through the security implications of that.
How long is the kernel supposed to keep it? Some timeout?

As I explained already, the time it takes for perf to do a
_single_ get_fd_by_id syscall when it sees a RECORD_MMAP with a prog_id
is negligible.
All the other syscalls to grab the JITed image and everything else can
be done later. The prog will not go away, because perf will hold an fd.
If a prog was somehow unloaded before perf could do get_fd_by_id,
no one cares about such a program, since there is close to zero
chance that it was attached to anything and absolutely
no chance that it ran.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
  2018-10-17 12:11                     ` Arnaldo Carvalho de Melo
  2018-10-17 12:50                       ` Arnaldo Carvalho de Melo
@ 2018-11-02  1:08                       ` Song Liu
  1 sibling, 0 replies; 35+ messages in thread
From: Song Liu @ 2018-11-02  1:08 UTC (permalink / raw)
  To: Arnaldo Melo
  Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, Alexey Budankov, David S . Miller,
	Daniel Borkmann, namhyung, Jiri Olsa, Networking, kernel-team

Hi Arnaldo,

On Wed, Oct 17, 2018 at 5:11 AM Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> Adding Alexey, Jiri and Namhyung as they worked/are working on
> multithreading 'perf record'.

I have read Alexey's work on enabling aio for perf-record
(https://lkml.org/lkml/2018/10/15/169).
But I feel it is not really related to this work. Did I miss anything here?

For VIP events, I think we need more mmaps. Currently, the default setting
uses 1 mmap per CPU. To capture VIP events, I think we need another mmap
per CPU. The VIP events will be sent to this special mmap. Then, user space
will process these events before the end of perf-record.
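
Roughly like the following sketch, where the extra per-CPU ring is opened
on a dummy software event (treating a dummy event as the 'VIP' channel is
just an assumption here, not an existing interface):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: open one extra ring buffer on the given CPU, dedicated to the
 * VIP events, and wake the reader on every record so they are processed
 * early. data_pages must be a power of two, per the perf mmap rules. */
static void *open_vip_ring(int cpu, size_t data_pages, int *fd_out)
{
	struct perf_event_attr attr;
	void *ring;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_DUMMY;	/* carries side-band records */
	attr.watermark = 0;
	attr.wakeup_events = 1;			/* wake up per record */

	fd = syscall(__NR_perf_event_open, &attr, -1 /* all threads */,
		     cpu, -1 /* no group */, 0);
	if (fd < 0)
		return NULL;

	/* perf mmap layout: one metadata page plus 2^n data pages */
	ring = mmap(NULL, (data_pages + 1) * sysconf(_SC_PAGESIZE),
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ring == MAP_FAILED) {
		close(fd);
		return NULL;
	}
	*fd_out = fd;
	return ring;
}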

Does this approach make sense?

Thanks!
Song

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-11-02 10:13 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-19 22:39 [PATCH bpf-next 0/3] perf, bpf: reveal invisible bpf programs Alexei Starovoitov
2018-09-19 22:39 ` [PATCH bpf-next 1/3] perf/core: introduce perf_event_mmap_bpf_prog Alexei Starovoitov
2018-09-19 23:30   ` Song Liu
2018-09-20  0:53     ` Alexei Starovoitov
2018-09-19 22:39 ` [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload Alexei Starovoitov
2018-09-19 23:44   ` Song Liu
2018-09-20  0:59     ` Alexei Starovoitov
2018-09-20  5:48       ` Song Liu
2018-09-20  8:44   ` Peter Zijlstra
2018-09-20 13:25     ` Arnaldo Carvalho de Melo
2018-09-20 13:56       ` Peter Zijlstra
2018-09-21  3:14         ` Alexei Starovoitov
2018-09-21 12:25           ` Arnaldo Carvalho de Melo
2018-09-21 13:55             ` Peter Zijlstra
2018-09-21 13:59             ` Peter Zijlstra
2018-09-21 22:13             ` Alexei Starovoitov
2018-10-15 23:33               ` Song Liu
2018-10-16 23:43                 ` David Ahern
2018-10-17  6:43                   ` Song Liu
2018-10-17 12:11                     ` Arnaldo Carvalho de Melo
2018-10-17 12:50                       ` Arnaldo Carvalho de Melo
2018-10-17 16:06                         ` Song Liu
2018-11-02  1:08                       ` Song Liu
2018-10-17 15:09                     ` David Ahern
2018-10-17 16:18                       ` Song Liu
2018-10-17 16:36                       ` Alexei Starovoitov
2018-10-17 18:53                         ` Arnaldo Carvalho de Melo
2018-10-17 19:08                           ` Alexei Starovoitov
2018-10-17 21:31                             ` Arnaldo Carvalho de Melo
2018-10-17 22:01                               ` Alexei Starovoitov
2018-09-20 13:36     ` Peter Zijlstra
2018-09-19 22:39 ` [PATCH perf 3/3] tools/perf: recognize and process RECORD_MMAP events for bpf progs Alexei Starovoitov
2018-09-20 13:36   ` Arnaldo Carvalho de Melo
2018-09-20 14:05     ` Arnaldo Carvalho de Melo
2018-09-21 22:57   ` Song Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.