linux-kernel.vger.kernel.org archive mirror
* [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
@ 2021-08-18  1:29 Song Liu
  2021-08-18  9:15 ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2021-08-18  1:29 UTC (permalink / raw)
  To: bpf, linux-kernel
  Cc: acme, peterz, mingo, kernel-team, Song Liu, Kan Liang, Like Xu,
	Alexey Budankov

The typical way to access LBR is via a hardware perf_event. For CPUs with
FREEZE_LBRS_ON_PMI support, the PMI can capture reliable LBR data. On the
other hand, LBR can also be useful in non-PMI scenarios. For example, in a
kretprobe or BPF fexit program, LBR can provide a lot of information on
what happened inside the function.

In this RFC, we try to enable LBR for BPF programs. This works like:
  1. Create a hardware perf_event with PERF_SAMPLE_BRANCH_* on each CPU;
  2. Call a new bpf helper (bpf_get_branch_trace) from the BPF program;
  3. Before calling the bpf program, the kernel stops LBR on the local CPU,
     makes a copy of the LBR, and resumes LBR;
  4. In the bpf program, the helper accesses the copy from #3.

Please see tools/testing/selftests/bpf/[progs|prog_tests]/get_branch_trace.c
for a detailed example. Note that this process is far from ideal, but it
allows quick prototyping of this feature.

AFAICT, the biggest challenge here is that we are now sharing LBR inside
and outside of PMI, which could trigger some interesting race conditions.
However, if we allow some level of missed/corrupted samples, this should
still be very useful.

Please share your thoughts and comments on this. Thanks in advance!

Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Like Xu <like.xu@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/events/intel/lbr.c                   | 16 ++++
 include/linux/filter.h                        |  3 +-
 include/linux/perf_event.h                    |  2 +
 include/uapi/linux/bpf.h                      |  8 ++
 kernel/bpf/trampoline.c                       |  3 +
 kernel/bpf/verifier.c                         |  7 ++
 kernel/events/core.c                          |  5 ++
 kernel/trace/bpf_trace.c                      | 27 ++++++
 net/bpf/test_run.c                            | 15 +++-
 tools/include/uapi/linux/bpf.h                |  8 ++
 .../bpf/prog_tests/get_branch_trace.c         | 82 +++++++++++++++++++
 .../selftests/bpf/progs/get_branch_trace.c    | 36 ++++++++
 tools/testing/selftests/bpf/trace_helpers.c   | 30 +++++++
 tools/testing/selftests/bpf/trace_helpers.h   |  5 ++
 14 files changed, 245 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/get_branch_trace.c
 create mode 100644 tools/testing/selftests/bpf/progs/get_branch_trace.c

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 9e6d6eaeb4cb6..e4e863f6fb93d 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1862,3 +1862,19 @@ EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
 struct event_constraint vlbr_constraint =
 	__EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT, (1ULL << INTEL_PMC_IDX_FIXED_VLBR),
 			  FIXED_EVENT_FLAGS, 1, 0, PERF_X86_EVENT_LBR_SELECT);
+
+DEFINE_PER_CPU(struct perf_branch_entry, bpf_lbr_entries[MAX_LBR_ENTRIES]);
+DEFINE_PER_CPU(int, bpf_lbr_cnt);
+
+int bpf_branch_record_read(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	intel_pmu_lbr_disable_all();
+	intel_pmu_lbr_read();
+	memcpy(this_cpu_ptr(&bpf_lbr_entries), cpuc->lbr_entries,
+	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
+	*this_cpu_ptr(&bpf_lbr_cnt) = x86_pmu.lbr_nr;
+	intel_pmu_lbr_enable_all(false);
+	return 0;
+}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7d248941ecea3..8c30712f56ab2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -575,7 +575,8 @@ struct bpf_prog {
 				has_callchain_buf:1, /* callchain buffer allocated? */
 				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
 				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
-				call_get_func_ip:1; /* Do we call get_func_ip() */
+				call_get_func_ip:1, /* Do we call get_func_ip() */
+				call_get_branch:1; /* Do we call get_branch_trace() */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fe156a8170aa3..cdc36fa5fda91 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -116,6 +116,8 @@ struct perf_branch_stack {
 	struct perf_branch_entry	entries[];
 };
 
+int bpf_branch_record_read(void);
+
 struct task_struct;
 
 /*
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c4f7892edb2b3..442cfef8d6bd5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4871,6 +4871,13 @@ union bpf_attr {
  * 	Return
  *		Value specified by user at BPF link creation/attachment time
  *		or 0, if it was not specified.
+ *
+ * long bpf_get_branch_trace(void *entries, u32 size, u64 flags)
+ *	Description
+ *		Get branch trace from hardware engines like Intel LBR.
+ * 	Return
+ *		> 0, # of entries.
+ *		**-EOPNOTSUPP**, the hardware/kernel does not support this function
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5048,6 +5055,7 @@ union bpf_attr {
 	FN(timer_cancel),		\
 	FN(get_func_ip),		\
 	FN(get_attach_cookie),		\
+	FN(get_branch_trace),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index fe1e857324e66..0526e187f9b21 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -566,6 +566,9 @@ u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
 {
 	rcu_read_lock();
 	migrate_disable();
+	if (prog->call_get_branch)
+		bpf_branch_record_read();
+
 	if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
 		inc_misses_counter(prog);
 		return 0;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f5a0077c99811..292d2b471892a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6446,6 +6446,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		env->prog->call_get_func_ip = true;
 	}
 
+	if (func_id == BPF_FUNC_get_branch_trace) {
+		if (env->prog->aux->sleepable) {
+			verbose(env, "sleepable progs cannot call get_branch_trace\n");
+			return -ENOTSUPP;
+		}
+		env->prog->call_get_branch = true;
+	}
 	if (changes_data)
 		clear_all_pkt_pointers(env);
 	return 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2d1e63dd97f23..4808ce268223d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13434,3 +13434,8 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	.threaded	= true,
 };
 #endif /* CONFIG_CGROUP_PERF */
+
+int __weak bpf_branch_record_read(void)
+{
+	return -EOPNOTSUPP;
+}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index cbc73c08c4a4e..85683bde1f991 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1002,6 +1002,29 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_pe = {
 	.arg1_type	= ARG_PTR_TO_CTX,
 };
 
+#ifndef MAX_LBR_ENTRIES
+#define MAX_LBR_ENTRIES		32
+#endif
+
+DECLARE_PER_CPU(struct perf_branch_entry, bpf_lbr_entries[MAX_LBR_ENTRIES]);
+DECLARE_PER_CPU(int, bpf_lbr_cnt);
+BPF_CALL_3(bpf_get_branch_trace, void *, buf, u32, size, u64, flags)
+{
+	memcpy(buf, *this_cpu_ptr(&bpf_lbr_entries),
+	       min_t(u32, size,
+		     sizeof(struct perf_branch_entry) * MAX_LBR_ENTRIES));
+	return *this_cpu_ptr(&bpf_lbr_cnt);
+}
+
+static const struct bpf_func_proto bpf_get_branch_trace_proto = {
+	.func		= bpf_get_branch_trace,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -1115,6 +1138,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_snprintf_proto;
 	case BPF_FUNC_get_func_ip:
 		return &bpf_get_func_ip_proto_tracing;
+	case BPF_FUNC_get_branch_trace:
+		return &bpf_get_branch_trace_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -1851,6 +1876,8 @@ void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
 {
 	cant_sleep();
 	rcu_read_lock();
+	if (prog->call_get_branch)
+		bpf_branch_record_read();
 	(void) bpf_prog_run(prog, args);
 	rcu_read_unlock();
 }
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2eb0e55ef54d2..60ff066ddab86 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -231,6 +231,18 @@ struct sock * noinline bpf_kfunc_call_test3(struct sock *sk)
 	return sk;
 }
 
+int noinline bpf_fexit_loop_test1(int n)
+{
+	int i, sum = 0;
+
+	/* the primary goal of this test is to test LBR. Create a lot of
+	 * branches in the function, so we can catch it easily.
+	 */
+	for (i = 0; i < n; i++)
+		sum += i;
+	return sum;
+}
+
 __diag_pop();
 
 ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
@@ -293,7 +305,8 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog,
 		    bpf_fentry_test5(11, (void *)12, 13, 14, 15) != 65 ||
 		    bpf_fentry_test6(16, (void *)17, 18, 19, (void *)20, 21) != 111 ||
 		    bpf_fentry_test7((struct bpf_fentry_test_t *)0) != 0 ||
-		    bpf_fentry_test8(&arg) != 0)
+		    bpf_fentry_test8(&arg) != 0 ||
+		    bpf_fexit_loop_test1(101) != 5050)
 			goto out;
 		break;
 	case BPF_MODIFY_RETURN:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c4f7892edb2b3..442cfef8d6bd5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4871,6 +4871,13 @@ union bpf_attr {
  * 	Return
  *		Value specified by user at BPF link creation/attachment time
  *		or 0, if it was not specified.
+ *
+ * long bpf_get_branch_trace(void *entries, u32 size, u64 flags)
+ *	Description
+ *		Get branch trace from hardware engines like Intel LBR.
+ * 	Return
+ *		> 0, # of entries.
+ *		**-EOPNOTSUPP**, the hardware/kernel does not support this function
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5048,6 +5055,7 @@ union bpf_attr {
 	FN(timer_cancel),		\
 	FN(get_func_ip),		\
 	FN(get_attach_cookie),		\
+	FN(get_branch_trace),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_trace.c b/tools/testing/selftests/bpf/prog_tests/get_branch_trace.c
new file mode 100644
index 0000000000000..509bef6e3edd2
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/get_branch_trace.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "get_branch_trace.skel.h"
+
+static int pfd_array[128] = {-1};  /* TODO: remove hardcoded 128 */
+
+static int create_perf_events(void)
+{
+	struct perf_event_attr attr = {0};
+	int cpu;
+
+	/* create perf event */
+	attr.size = sizeof(attr);
+	attr.type = PERF_TYPE_HARDWARE;
+	attr.config = PERF_COUNT_HW_CPU_CYCLES;
+	attr.freq = 1;
+	attr.sample_freq = 4000;
+	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
+	attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
+		PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
+	for (cpu = 0; cpu < libbpf_num_possible_cpus(); cpu++) {
+		pfd_array[cpu] = syscall(__NR_perf_event_open, &attr,
+					 -1, cpu, -1, PERF_FLAG_FD_CLOEXEC);
+		if (pfd_array[cpu] < 0)
+			break;
+	}
+	return cpu == 0;
+}
+
+static void close_perf_events(void)
+{
+	int cpu = 0;
+	int fd;
+
+	while (cpu < 128) {
+		fd = pfd_array[cpu++];
+		if (fd < 0)
+			break;
+		close(fd);
+	}
+}
+
+void test_get_branch_trace(void)
+{
+	struct get_branch_trace *skel = NULL;
+	int err, prog_fd;
+	__u32 retval;
+
+	if (create_perf_events()) {
+		test__skip();  /* system doesn't support LBR */
+		goto cleanup;
+	}
+
+	skel = get_branch_trace__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "get_branch_trace__open_and_load"))
+		goto cleanup;
+
+	err = kallsyms_find("bpf_fexit_loop_test1", &skel->bss->address_low);
+	if (!ASSERT_OK(err, "kallsyms_find"))
+		goto cleanup;
+
+	err = kallsyms_find_next("bpf_fexit_loop_test1", &skel->bss->address_high);
+	if (!ASSERT_OK(err, "kallsyms_find_next"))
+		goto cleanup;
+
+	err = get_branch_trace__attach(skel);
+	if (!ASSERT_OK(err, "get_branch_trace__attach"))
+		goto cleanup;
+
+	prog_fd = bpf_program__fd(skel->progs.test1);
+	err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+				NULL, 0, &retval, NULL);
+
+	if (!ASSERT_OK(err, "bpf_prog_test_run"))
+		goto cleanup;
+	ASSERT_GT(skel->bss->test1_hits, 5, "find_test1_in_lbr");
+
+cleanup:
+	get_branch_trace__destroy(skel);
+	close_perf_events();
+}
diff --git a/tools/testing/selftests/bpf/progs/get_branch_trace.c b/tools/testing/selftests/bpf/progs/get_branch_trace.c
new file mode 100644
index 0000000000000..cacfe25fd92e8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/get_branch_trace.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+__u64 test1_hits = 0;
+__u64 address_low = 0;
+__u64 address_high = 0;
+
+#define MAX_LBR_ENTRIES 32
+
+struct perf_branch_entry entries[MAX_LBR_ENTRIES] = {};
+
+static inline bool in_range(__u64 val)
+{
+	return (val >= address_low) && (val < address_high);
+}
+
+SEC("fexit/bpf_fexit_loop_test1")
+int BPF_PROG(test1, int n, int ret)
+{
+	long cnt, i;
+
+	cnt = bpf_get_branch_trace(entries, sizeof(entries), 0);
+
+	for (i = 0; i < MAX_LBR_ENTRIES; i++) {
+		if (i >= cnt)
+			break;
+		if (in_range(entries[i].from) && in_range(entries[i].to))
+			test1_hits++;
+	}
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index e7a19b04d4eaf..2926a3b626821 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -117,6 +117,36 @@ int kallsyms_find(const char *sym, unsigned long long *addr)
 	return err;
 }
 
+/* Find the address of the next symbol; this can be used to determine
+ * the end of a function.
+ */
+int kallsyms_find_next(const char *sym, unsigned long long *addr)
+{
+	char type, name[500];
+	unsigned long long value;
+	bool found = false;
+	int err = 0;
+	FILE *f;
+
+	f = fopen("/proc/kallsyms", "r");
+	if (!f)
+		return -EINVAL;
+
+	while (fscanf(f, "%llx %c %499s%*[^\n]\n", &value, &type, name) > 0) {
+		if (found) {
+			*addr = value;
+			goto out;
+		}
+		if (strcmp(name, sym) == 0)
+			found = true;
+	}
+	err = -ENOENT;
+
+out:
+	fclose(f);
+	return err;
+}
+
 void read_trace_pipe(void)
 {
 	int trace_fd;
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index d907b445524d5..bc8ed86105d94 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -16,6 +16,11 @@ long ksym_get_addr(const char *name);
 /* open kallsyms and find addresses on the fly, faster than load + search. */
 int kallsyms_find(const char *sym, unsigned long long *addr);
 
+/* Find the address of the next symbol; this can be used to determine
+ * the end of a function.
+ */
+int kallsyms_find_next(const char *sym, unsigned long long *addr);
+
 void read_trace_pipe(void);
 
 ssize_t get_uprobe_offset(const void *addr, ssize_t base);
-- 
2.30.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-18  1:29 [RFC] bpf: lbr: enable reading LBR from tracing bpf programs Song Liu
@ 2021-08-18  9:15 ` Peter Zijlstra
  2021-08-18 16:46   ` Song Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2021-08-18  9:15 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-kernel, acme, mingo, kernel-team, Kan Liang, Like Xu,
	Alexey Budankov

On Tue, Aug 17, 2021 at 06:29:37PM -0700, Song Liu wrote:
> The typical way to access LBR is via hardware perf_event. For CPUs with
> FREEZE_LBRS_ON_PMI support, PMI could capture reliable LBR. On the other
> hand, LBR could also be useful in non-PMI scenario. For example, in
> kretprobe or bpf fexit program, LBR could provide a lot of information
> on what happened with the function.
> 
> In this RFC, we try to enable LBR for BPF program. This works like:
>   1. Create a hardware perf_event with PERF_SAMPLE_BRANCH_* on each CPU;
>   2. Call a new bpf helper (bpf_get_branch_trace) from the BPF program;
>   3. Before calling this bpf program, the kernel stops LBR on local CPU,
>      make a copy of LBR, and resumes LBR;
>   4. In the bpf program, the helper access the copy from #3.
> 
> Please see tools/testing/selftests/bpf/[progs|prog_tests]/get_branch_trace.c
> for a detailed example. Note that this process is far from ideal, but it
> allows quick prototyping of this feature.
> 
> AFAICT, the biggest challenge here is that we are now sharing LBR in PMI
> and out of PMI, which could trigger some interesting race conditions.
> However, if we allow some level of missed/corrupted samples, this should
> still be very useful.
> 
> Please share your thoughts and comments on this. Thanks in advance!

> +int bpf_branch_record_read(void)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> +	intel_pmu_lbr_disable_all();
> +	intel_pmu_lbr_read();
> +	memcpy(this_cpu_ptr(&bpf_lbr_entries), cpuc->lbr_entries,
> +	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
> +	*this_cpu_ptr(&bpf_lbr_cnt) = x86_pmu.lbr_nr;
> +	intel_pmu_lbr_enable_all(false);
> +	return 0;
> +}

Urgghhh.. I so really hate BPF specials like this. Also, the PMI race
you describe is because you're doing abysmal layer violations. If you'd
have used perf_pmu_disable() that wouldn't have been a problem.

I'd much rather see a generic 'fake/inject' PMI facility, something that
works across the board and isn't tied to x86/intel.


* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-18  9:15 ` Peter Zijlstra
@ 2021-08-18 16:46   ` Song Liu
  2021-08-19 11:57     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2021-08-18 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, acme, mingo, Kernel Team, Kan Liang, Like Xu,
	Alexey Budankov

Hi Peter,

Thanks for your quick response!

> On Aug 18, 2021, at 2:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Aug 17, 2021 at 06:29:37PM -0700, Song Liu wrote:
>> The typical way to access LBR is via hardware perf_event. For CPUs with
>> FREEZE_LBRS_ON_PMI support, PMI could capture reliable LBR. On the other
>> hand, LBR could also be useful in non-PMI scenario. For example, in
>> kretprobe or bpf fexit program, LBR could provide a lot of information
>> on what happened with the function.
>> 
>> In this RFC, we try to enable LBR for BPF program. This works like:
>>  1. Create a hardware perf_event with PERF_SAMPLE_BRANCH_* on each CPU;
>>  2. Call a new bpf helper (bpf_get_branch_trace) from the BPF program;
>>  3. Before calling this bpf program, the kernel stops LBR on local CPU,
>>     make a copy of LBR, and resumes LBR;
>>  4. In the bpf program, the helper access the copy from #3.
>> 
>> Please see tools/testing/selftests/bpf/[progs|prog_tests]/get_branch_trace.c
>> for a detailed example. Note that this process is far from ideal, but it
>> allows quick prototyping of this feature.
>> 
>> AFAICT, the biggest challenge here is that we are now sharing LBR in PMI
>> and out of PMI, which could trigger some interesting race conditions.
>> However, if we allow some level of missed/corrupted samples, this should
>> still be very useful.
>> 
>> Please share your thoughts and comments on this. Thanks in advance!
> 
>> +int bpf_branch_record_read(void)
>> +{
>> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> +
>> +	intel_pmu_lbr_disable_all();
>> +	intel_pmu_lbr_read();
>> +	memcpy(this_cpu_ptr(&bpf_lbr_entries), cpuc->lbr_entries,
>> +	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
>> +	*this_cpu_ptr(&bpf_lbr_cnt) = x86_pmu.lbr_nr;
>> +	intel_pmu_lbr_enable_all(false);
>> +	return 0;
>> +}
> 
> Urgghhh.. I so really hate BPF specials like this.

I don't really like this design either. But it does show that LBR can be
very useful in non-PMI scenarios. 

> Also, the PMI race
> you describe is because you're doing abysmal layer violations. If you'd
> have used perf_pmu_disable() that wouldn't have been a problem.

Do you mean that instead of disabling/enabling the LBR, we should
disable/enable the whole PMU?

> 
> I'd much rather see a generic 'fake/inject' PMI facility, something that
> works across the board and isn't tied to x86/intel.

How would that work? Do we have a function to trigger a PMI from software,
and then gather the LBR data after the PMI? This does sound like a much
cleaner solution. Where can I find code examples that fake/inject a PMI?

There is another limitation right now: we need to enable LBR with a
hardware perf event (cycles, etc.). However, unless we use the event for
something else, it wastes a hardware counter. So I was thinking of allowing
a software event, i.e. a dummy event, to enable LBR. Does this idea sound
sane to you?

Thanks,
Song


* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-18 16:46   ` Song Liu
@ 2021-08-19 11:57     ` Peter Zijlstra
  2021-08-19 16:46       ` Song Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2021-08-19 11:57 UTC (permalink / raw)
  To: Song Liu
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, acme, mingo, Kernel Team, Kan Liang, Like Xu,
	Alexey Budankov

On Wed, Aug 18, 2021 at 04:46:32PM +0000, Song Liu wrote:

> > Urgghhh.. I so really hate BPF specials like this.
> 
> I don't really like this design either. But it does show that LBR can be
> very useful in non-PMI scenario. 
> 
> > Also, the PMI race
> > you describe is because you're doing abysmal layer violations. If you'd
> > have used perf_pmu_disable() that wouldn't have been a problem.
> 
> Do you mean instead of disable/enable lbr, we disable/enable the whole 
> pmu? 

Yep, that way you're serialized against PMIs. It's what all of the perf
core does.

> > I'd much rather see a generic 'fake/inject' PMI facility, something that
> > works across the board and isn't tied to x86/intel.
> 
> How would that work? Do we have a function to trigger PMI from software, 
> and then gather the LBR data after the PMI? This does sound like a much
> cleaner solution. Where can I find code examples that fake/inject PMI?

We don't yet have anything like it; but it would look a little like:

void perf_inject_event(struct perf_event *event, struct pt_regs *regs)
{
	struct perf_sample_data data;
	struct pmu *pmu = event->pmu;
	unsigned long flags;

	local_irq_save(flags);
	perf_pmu_disable(pmu);

	perf_sample_data_init(&data, 0, 0);
	/*
	 * XXX or a variant with more _ that starts at the overflow
	 * handler...
	 */
	__perf_event_overflow(event, 0, &data, regs);

	perf_pmu_enable(pmu);
	local_irq_restore(flags);
}

But please consider carefully, I haven't...

> There is another limitation right now: we need to enable LBR with a 
> hardware perf event (cycles, etc.). However, unless we use the event for 
> something else, it wastes a hardware counter. So I was thinking to allow
> software event, i.e. dummy event, to enable LBR. Does this idea sound 
> sane to you?

We have a VLBR dummy event, but I'm not sure it does exactly as you
want. However, we should also consider Power, which also has the branch
stack feature.

You can't really make a software event with LBR on, because then it
wouldn't be a software event anymore. You'll need some hybrid like
thing, which will be yuck and I suspect it needs arch support one way or
the other :/


* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-19 11:57     ` Peter Zijlstra
@ 2021-08-19 16:46       ` Song Liu
  2021-08-19 18:06         ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2021-08-19 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, acme, mingo, Kernel Team, Kan Liang, Like Xu,
	Alexey Budankov

Hi Peter,

Thanks for all the helpful information and insights!

> On Aug 19, 2021, at 4:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Aug 18, 2021 at 04:46:32PM +0000, Song Liu wrote:
> 
>>> Urgghhh.. I so really hate BPF specials like this.
>> 
>> I don't really like this design either. But it does show that LBR can be
>> very useful in non-PMI scenario. 
>> 
>>> Also, the PMI race
>>> you describe is because you're doing abysmal layer violations. If you'd
>>> have used perf_pmu_disable() that wouldn't have been a problem.
>> 
>> Do you mean instead of disable/enable lbr, we disable/enable the whole 
>> pmu? 
> 
> Yep, that way you're serialized against PMIs. It's what all of the perf
> core does.
> 
>>> I'd much rather see a generic 'fake/inject' PMI facility, something that
>>> works across the board and isn't tied to x86/intel.
>> 
>> How would that work? Do we have a function to trigger PMI from software, 
>> and then gather the LBR data after the PMI? This does sound like a much
>> cleaner solution. Where can I find code examples that fake/inject PMI?
> 
> We don't yet have anything like it; but it would look a little like:
> 
> void perf_inject_event(struct perf_event *event, struct pt_regs *regs)
> {
> 	struct perf_sample_data data;
> 	struct pmu *pmu = event->pmu;
> 	unsigned long flags;
> 
> 	local_irq_save(flags);
> 	perf_pmu_disable(pmu);
> 
> 	perf_sample_data_init(&data, 0, 0);
> 	/*
> 	 * XXX or a variant with more _ that starts at the overflow
> 	 * handler...
> 	 */
> 	__perf_event_overflow(event, 0, &data, regs);
> 
> 	perf_pmu_enable(pmu);
> 	local_irq_restore(flags);
> }
> 
> But please consider carefully, I haven't...

Hmm... This is a little weird to me. 
IIUC, we need to call perf_inject_event() after the software event, say
a kretprobe, triggers. So it's gonna look like:

  1. kretprobe trigger;
  2. handler calls perf_inject_event();
  3. PMI kicks in, and saves LBR;
  4. after the PMI, consumer of LBR uses the saved data;

However, given perf_inject_event() disables PMU, we can just save the LBR
right there? And it should be a lot easier? Something like:

  1. kretprobe triggers;
  2. handler calls perf_snapshot_lbr();
     2.1 perf_pmu_disable(pmu);
     2.2 saves LBR 
     2.3 perf_pmu_enable(pmu);
  3. consumer of LBR uses the saved data;

What is the downside of this approach? 

> 
>> There is another limitation right now: we need to enable LBR with a 
>> hardware perf event (cycles, etc.). However, unless we use the event for 
>> something else, it wastes a hardware counter. So I was thinking to allow
>> software event, i.e. dummy event, to enable LBR. Does this idea sound 
>> sane to you?
> 
> We have a VLBR dummy event, but I'm not sure it does exactly as you
> want. However, we should also consider Power, which also has the branch
> stack feature.

VLBR event does look similar to the use case we have. I will take a closer
look. Thanks for the pointer!

> 
> You can't really make a software event with LBR on, because then it
> wouldn't be a software event anymore. You'll need some hybrid like
> thing, which will be yuck and I suspect it needs arch support one way or
> the other :/

Yeah, I guess it could be an "LBR-only hardware event". All it needs to do 
is to keep LBR enabled (lbr_users++). I will try to keep the arch support
clean. 

Thanks,
Song



* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-19 16:46       ` Song Liu
@ 2021-08-19 18:06         ` Peter Zijlstra
  2021-08-19 18:22           ` Song Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2021-08-19 18:06 UTC (permalink / raw)
  To: Song Liu
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, acme, mingo, Kernel Team, Kan Liang, Like Xu,
	Alexey Budankov

On Thu, Aug 19, 2021 at 04:46:20PM +0000, Song Liu wrote:
> > void perf_inject_event(struct perf_event *event, struct pt_regs *regs)
> > {
> > 	struct perf_sample_data data;
> > 	struct pmu *pmu = event->pmu;
> > 	unsigned long flags;
> > 
> > 	local_irq_save(flags);
> > 	perf_pmu_disable(pmu);
> > 
> > 	perf_sample_data_init(&data, 0, 0);
> > 	/*
> > 	 * XXX or a variant with more _ that starts at the overflow
> > 	 * handler...
> > 	 */
> > 	__perf_event_overflow(event, 0, &data, regs);
> > 
> > 	perf_pmu_enable(pmu);
> > 	local_irq_restore(flags);
> > }
> > 
> > But please consider carefully, I haven't...
> 
> Hmm... This is a little weird to me. 
> IIUC, we need to call perf_inject_event() after the software event, say
> a kretprobe, triggers. So it gonna look like:
> 
>   1. kretprobe trigger;
>   2. handler calls perf_inject_event();
>   3. PMI kicks in, and saves LBR;

This doesn't actually happen. I overlooked the fact that we need the PMI
to fill out @data for us.

>   4. after the PMI, consumer of LBR uses the saved data;

Normal overflow handler will have data->br_stack set, but I now realize
that the 'pseudo' code above will not get that. We need to somehow get
the arch bits involved; again :/

> However, given perf_inject_event() disables PMU, we can just save the LBR
> right there? And it should be a lot easier? Something like:
> 
>   1. kretprobe triggers;
>   2. handler calls perf_snapshot_lbr();
>      2.1 perf_pmu_disable(pmu);
>      2.2 saves LBR 
>      2.3 perf_pmu_enable(pmu);
>   3. consumer of LBR uses the saved data;
> 
> What is the downside of this approach? 

It would be perf_snapshot_branch_stack() and would require a new
(optional) pmu::method to set up the branch stack.

And if we're going to be adding new pmu::methods then I figure one that
does the whole sample state might be more useful.


* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-19 18:06         ` Peter Zijlstra
@ 2021-08-19 18:22           ` Song Liu
  2021-08-19 18:27             ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2021-08-19 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, acme, mingo, Kernel Team, Kan Liang, Like Xu,
	Alexey Budankov

Hi Peter, 

> On Aug 19, 2021, at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Aug 19, 2021 at 04:46:20PM +0000, Song Liu wrote:
>>> void perf_inject_event(struct perf_event *event, struct pt_regs *regs)
>>> {
>>> 	struct perf_sample_data data;
>>> 	struct pmu *pmu = event->pmu;
>>> 	unsigned long flags;
>>> 
>>> 	local_irq_save(flags);
>>> 	perf_pmu_disable(pmu);
>>> 
>>> 	perf_sample_data_init(&data, 0, 0);
>>> 	/*
>>> 	 * XXX or a variant with more _ that starts at the overflow
>>> 	 * handler...
>>> 	 */
>>> 	__perf_event_overflow(event, 0, &data, regs);
>>> 
>>> 	perf_pmu_enable(pmu);
>>> 	local_irq_restore(flags);
>>> }
>>> 
>>> But please consider carefully, I haven't...
>> 
>> Hmm... This is a little weird to me. 
>> IIUC, we need to call perf_inject_event() after the software event, say
>> a kretprobe, triggers. So it gonna look like:
>> 
>>  1. kretprobe trigger;
>>  2. handler calls perf_inject_event();
>>  3. PMI kicks in, and saves LBR;
> 
> This doesn't actually happen. I overlooked the fact that we need the PMI
> to fill out @data for us.
> 
>>  4. after the PMI, consumer of LBR uses the saved data;
> 
> Normal overflow handler will have data->br_stack set, but I now realize
> that the 'pseudo' code above will not get that. We need to somehow get
> the arch bits involved; again :/
> 
>> However, given perf_inject_event() disables PMU, we can just save the LBR
>> right there? And it should be a lot easier? Something like:
>> 
>>  1. kretprobe triggers;
>>  2. handler calls perf_snapshot_lbr();
>>     2.1 perf_pmu_disable(pmu);
>>     2.2 saves LBR 
>>     2.3 perf_pmu_enable(pmu);
>>  3. consumer of LBR uses the saved data;
>> 
>> What is the downside of this approach? 
> 
> It would be perf_snapshot_branch_stack() and would require a new
> (optional) pmu::method to set up the branch stack.

I guess it would look like:

diff --git i/include/linux/perf_event.h w/include/linux/perf_event.h
index fe156a8170aa3..af379b7f18050 100644
--- i/include/linux/perf_event.h
+++ w/include/linux/perf_event.h
@@ -514,6 +514,9 @@ struct pmu {
         * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
         */
        int (*check_period)             (struct perf_event *event, u64 value); /* optional */
+
+       int (*snapshot_branch_stack)    (struct perf_event *event, /* TBD, maybe struct
+                                                                     perf_output_handle? */);
 };

 enum perf_addr_filter_action_t {
diff --git i/kernel/events/core.c w/kernel/events/core.c
index 2d1e63dd97f23..14aa5f7bccf1f 100644
--- i/kernel/events/core.c
+++ w/kernel/events/core.c
@@ -1207,6 +1207,19 @@ void perf_pmu_enable(struct pmu *pmu)
                pmu->pmu_enable(pmu);
 }

+int perf_snapshot_branch_stack(struct perf_event *event)
+{
+       struct pmu *pmu = event->pmu;
+       int ret;
+
+       if (!pmu->snapshot_branch_stack)
+               return -EOPNOTSUPP;
+       perf_pmu_disable(pmu);
+       ret = pmu->snapshot_branch_stack(event, ...);
+       perf_pmu_enable(pmu);
+       return ret;
+}
+
 static DEFINE_PER_CPU(struct list_head, active_ctx_list);

> 
> And if we're going to be adding new pmu::methods then I figure one that
> does the whole sample state might be more useful.

What do you mean by "whole sample state"? To integrate with the existing
perf_sample_data, like perf_output_sample()?

Thanks,
Song


* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-19 18:22           ` Song Liu
@ 2021-08-19 18:27             ` Peter Zijlstra
  2021-08-19 18:45               ` Song Liu
  2021-08-20  7:33               ` Song Liu
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2021-08-19 18:27 UTC (permalink / raw)
  To: Song Liu
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, acme, mingo, Kernel Team, Kan Liang, Like Xu,
	Alexey Budankov

On Thu, Aug 19, 2021 at 06:22:07PM +0000, Song Liu wrote:

> > And if we're going to be adding new pmu::methods then I figure one that
> > does the whole sample state might be more useful.
> 
> What do you mean by "whole sample state"? To integrate with the existing
> perf_sample_data, like perf_output_sample()?

Yeah, the PMI can/does set more than data->br_stack, but I'm now
thinking that without an actual PMI, much of that will not be possible.
br_stack is special here.

Oh well, carry on I suppose.


* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-19 18:27             ` Peter Zijlstra
@ 2021-08-19 18:45               ` Song Liu
  2021-08-20  7:33               ` Song Liu
  1 sibling, 0 replies; 10+ messages in thread
From: Song Liu @ 2021-08-19 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, acme, mingo, Kernel Team, Kan Liang, Like Xu,
	Alexey Budankov

Hi Peter,

> On Aug 19, 2021, at 11:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Aug 19, 2021 at 06:22:07PM +0000, Song Liu wrote:
> 
>>> And if we're going to be adding new pmu::methods then I figure one that
>>> does the whole sample state might be more useful.
>> 
>> What do you mean by "whole sample state"? To integrate with the existing
>> perf_sample_data, like perf_output_sample()?
> 
> Yeah, the PMI can/does set more than data->br_stack, but I'm now
> thinking that without an actual PMI, much of that will not be possible.
> br_stack is special here.
> 
> Oh well, carry on I suppose.

Thanks again for all the information and suggestions. I will try to put
them together in the code and send the patches.

Song


* Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
  2021-08-19 18:27             ` Peter Zijlstra
  2021-08-19 18:45               ` Song Liu
@ 2021-08-20  7:33               ` Song Liu
  1 sibling, 0 replies; 10+ messages in thread
From: Song Liu @ 2021-08-20  7:33 UTC (permalink / raw)
  To: Peter Zijlstra, acme, Arnaldo Carvalho de Melo
  Cc: open list:BPF (Safe dynamic programs and tools),
	linux-kernel, mingo, Kernel Team, Kan Liang

Hi Peter, 

> On Aug 19, 2021, at 11:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Aug 19, 2021 at 06:22:07PM +0000, Song Liu wrote:
> 
>>> And if we're going to be adding new pmu::methods then I figure one that
>>> does the whole sample state might be more useful.
>> 
>> What do you mean by "whole sample state"? To integrate with the existing
>> perf_sample_data, like perf_output_sample()?
> 
> Yeah, the PMI can/does set more than data->br_stack, but I'm now
> thinking that without an actual PMI, much of that will not be possible.
> br_stack is special here.
> 
> Oh well, carry on I suppose.

Here is another design choice that I would like to know your opinion on. 

Say we don't use BPF here; instead, we use perf_kprobe. On a kretprobe,
we can use branch_stack to figure out which branches we took before the
return: say, what happened when sys_perf_event_open returned -EINVAL?

To achieve this, we will need two events from the kernel's perspective:
one INTEL_FIXED_VLBR_EVENT attached to the hardware pmu, and a kretprobe
event. Let's call them the VLBR-event and the software-event, respectively.
When the software-event triggers, it needs to read the branch_stack
snapshot from the VLBR-event's pmu, which is kind of weird. We will need
some connection between these two events.

Also, to keep more useful data in the LBR registers, we want a minimal
number of branches between the software-event trigger and
perf_pmu_disable(hardware_pmu). I guess the best way is to add a pointer,
branch_stack_pmu, to perf_event, and use it directly when the software-
event triggers. Note that the pmu will not go away, so even if the VLBR-
event gets freed by accident, we will not crash the kernel (we may read
garbage data, though). Does this make sense to you?

The BPF case will be similar to this: we will replace the software-event
with a BPF program, and still need the VLBR-event on each CPU.

Another question is how we create the VLBR-event. One way is to create
it in user space and somehow pass the information to the software-event
or the BPF program. Another approach is to create it in the kernel with
perf_event_create_kernel_counter(). Which of the two do you like better?

Thanks,
Song


Thread overview: 10+ messages
2021-08-18  1:29 [RFC] bpf: lbr: enable reading LBR from tracing bpf programs Song Liu
2021-08-18  9:15 ` Peter Zijlstra
2021-08-18 16:46   ` Song Liu
2021-08-19 11:57     ` Peter Zijlstra
2021-08-19 16:46       ` Song Liu
2021-08-19 18:06         ` Peter Zijlstra
2021-08-19 18:22           ` Song Liu
2021-08-19 18:27             ` Peter Zijlstra
2021-08-19 18:45               ` Song Liu
2021-08-20  7:33               ` Song Liu
