* [PATCH v3 bpf-next 0/2] bpf: fix stackmap on perf_events with PEBS
@ 2020-07-16 22:59 Song Liu
  2020-07-16 22:59 ` [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF Song Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Song Liu @ 2020-07-16 22:59 UTC (permalink / raw)
  To: linux-kernel, bpf, netdev
  Cc: ast, daniel, kernel-team, john.fastabend, kpsingh, brouer,
	peterz, Song Liu

Calling get_perf_callchain() on perf_events from PEBS entries may cause
unwinder errors. To fix this issue, the perf subsystem fetches the
callchain early, and such perf_events are marked with
__PERF_SAMPLE_CALLCHAIN_EARLY. The same issue exists when a BPF program
calls get_perf_callchain() via helper functions. For more information,
please refer to the discussion in [1].

This set fixes the issue with new helper protos bpf_get_stackid_pe and
bpf_get_stack_pe, which reuse the early-collected callchain instead of
unwinding again.

[1] https://lore.kernel.org/lkml/ED7B9430-6489-4260-B3C5-9CFA2E3AA87A@fb.com/
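
For illustration, the kind of event that hits this path is a PEBS event
opened with PERF_SAMPLE_CALLCHAIN. A minimal user-space sketch follows; the
open_pebs_event() wrapper and the sampling values are illustrative only,
not taken from this series:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Sketch: a precise (PEBS) cycles event that asks perf to collect the
 * callchain early; this is the case the new *_pe helpers handle.
 */
static int open_pebs_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.freq = 1;
	attr.sample_freq = 999;
	attr.precise_ip = 2;				/* PEBS */
	attr.sample_type = PERF_SAMPLE_CALLCHAIN;	/* callchain fetched early */

	/* any process, this CPU (cpu 0), no group, no flags */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}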

Changes v2 => v3:
1. Fix handling of stackmap skip field. (Andrii)
2. Simplify the code in a few places. (Andrii)

Changes v1 => v2:
1. Simplify the design and avoid introducing new helper function. (Andrii)

Song Liu (2):
  bpf: separate bpf_get_[stack|stackid] for perf events BPF
  selftests/bpf: add callchain_stackid

 include/linux/bpf.h                           |   2 +
 kernel/bpf/stackmap.c                         | 202 ++++++++++++++++--
 kernel/trace/bpf_trace.c                      |   4 +-
 .../bpf/prog_tests/perf_event_stackmap.c      | 116 ++++++++++
 .../selftests/bpf/progs/perf_event_stackmap.c |  59 +++++
 5 files changed, 363 insertions(+), 20 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_event_stackmap.c
 create mode 100644 tools/testing/selftests/bpf/progs/perf_event_stackmap.c

--
2.24.1

* [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF
  2020-07-16 22:59 [PATCH v3 bpf-next 0/2] bpf: fix stackmap on perf_events with PEBS Song Liu
@ 2020-07-16 22:59 ` Song Liu
  2020-07-21 19:10   ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Song Liu @ 2020-07-16 22:59 UTC (permalink / raw)
  To: linux-kernel, bpf, netdev
  Cc: ast, daniel, kernel-team, john.fastabend, kpsingh, brouer,
	peterz, Song Liu

Calling get_perf_callchain() on perf_events from PEBS entries may cause
unwinder errors. To fix this issue, the callchain is fetched early. Such
perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY.

Similarly, calling bpf_get_[stack|stackid] on perf_events from PEBS may
also cause unwinder errors. To fix this, add separate versions of these
two helpers, bpf_get_[stack|stackid]_pe. These two helpers use the
callchain in bpf_perf_event_data_kern->data->callchain.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/bpf.h      |   2 +
 kernel/bpf/stackmap.c    | 202 +++++++++++++++++++++++++++++++++++----
 kernel/trace/bpf_trace.c |   4 +-
 3 files changed, 188 insertions(+), 20 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 54ad426dbea1a..cf8804e302257 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1643,6 +1643,8 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_get_task_stack_proto;
+extern const struct bpf_func_proto bpf_get_stackid_proto_pe;
+extern const struct bpf_func_proto bpf_get_stack_proto_pe;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 48d8e739975fa..7a03d1c24b25f 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -4,6 +4,7 @@
 #include <linux/bpf.h>
 #include <linux/jhash.h>
 #include <linux/filter.h>
+#include <linux/kernel.h>
 #include <linux/stacktrace.h>
 #include <linux/perf_event.h>
 #include <linux/elf.h>
@@ -387,11 +388,10 @@ get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
 #endif
 }
 
-BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
-	   u64, flags)
+static long __bpf_get_stackid(struct bpf_map *map,
+			      struct perf_callchain_entry *trace, u64 flags)
 {
 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
-	struct perf_callchain_entry *trace;
 	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
 	u32 max_depth = map->value_size / stack_map_data_size(map);
 	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
@@ -399,21 +399,9 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
 	u32 hash, id, trace_nr, trace_len;
 	bool user = flags & BPF_F_USER_STACK;
-	bool kernel = !user;
 	u64 *ips;
 	bool hash_matches;
 
-	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
-			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
-		return -EINVAL;
-
-	trace = get_perf_callchain(regs, init_nr, kernel, user,
-				   sysctl_perf_event_max_stack, false, false);
-
-	if (unlikely(!trace))
-		/* couldn't fetch the stack trace */
-		return -EFAULT;
-
 	/* get_perf_callchain() guarantees that trace->nr >= init_nr
 	 * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
 	 */
@@ -478,6 +466,30 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	return id;
 }
 
+BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
+	   u64, flags)
+{
+	u32 max_depth = map->value_size / stack_map_data_size(map);
+	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
+	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+	bool user = flags & BPF_F_USER_STACK;
+	struct perf_callchain_entry *trace;
+	bool kernel = !user;
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
+		return -EINVAL;
+
+	trace = get_perf_callchain(regs, init_nr, kernel, user,
+				   sysctl_perf_event_max_stack, false, false);
+
+	if (unlikely(!trace))
+		/* couldn't fetch the stack trace */
+		return -EFAULT;
+
+	return __bpf_get_stackid(map, trace, flags);
+}
+
 const struct bpf_func_proto bpf_get_stackid_proto = {
 	.func		= bpf_get_stackid,
 	.gpl_only	= true,
@@ -487,7 +499,86 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+static __u64 count_kernel_ip(struct perf_callchain_entry *trace)
+{
+	__u64 nr_kernel = 0;
+
+	while (nr_kernel < trace->nr) {
+		if (trace->ip[nr_kernel] == PERF_CONTEXT_USER)
+			break;
+		nr_kernel++;
+	}
+	return nr_kernel;
+}
+
+BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx,
+	   struct bpf_map *, map, u64, flags)
+{
+	struct perf_event *event = ctx->event;
+	struct perf_callchain_entry *trace;
+	bool has_kernel, has_user;
+	bool kernel, user;
+
+	/* perf_sample_data doesn't have callchain, use bpf_get_stackid */
+	if (!(event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
+		return bpf_get_stackid((unsigned long)(ctx->regs),
+				       (unsigned long) map, flags, 0, 0);
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
+		return -EINVAL;
+
+	user = flags & BPF_F_USER_STACK;
+	kernel = !user;
+
+	has_kernel = !event->attr.exclude_callchain_kernel;
+	has_user = !event->attr.exclude_callchain_user;
+
+	if ((kernel && !has_kernel) || (user && !has_user))
+		return -EINVAL;
+
+	trace = ctx->data->callchain;
+	if (unlikely(!trace))
+		return -EFAULT;
+
+	if (has_kernel && has_user) {
+		__u64 nr_kernel = count_kernel_ip(trace);
+		int ret;
+
+		if (kernel) {
+			__u64 nr = trace->nr;
+
+			trace->nr = nr_kernel;
+			ret = __bpf_get_stackid(map, trace, flags);
+
+			/* restore nr */
+			trace->nr = nr;
+		} else { /* user */
+			u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
+
+			skip += nr_kernel;
+			if (skip > BPF_F_SKIP_FIELD_MASK)
+				return -EFAULT;
+
+			flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
+			ret = __bpf_get_stackid(map, trace, flags);
+		}
+		return ret;
+	}
+	return __bpf_get_stackid(map, trace, flags);
+}
+
+const struct bpf_func_proto bpf_get_stackid_proto_pe = {
+	.func		= bpf_get_stackid_pe,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
+			    struct perf_callchain_entry *trace_in,
 			    void *buf, u32 size, u64 flags)
 {
 	u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
@@ -520,7 +611,9 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	else
 		init_nr = sysctl_perf_event_max_stack - num_elem;
 
-	if (kernel && task)
+	if (trace_in)
+		trace = trace_in;
+	else if (kernel && task)
 		trace = get_callchain_entry_for_task(task, init_nr);
 	else
 		trace = get_perf_callchain(regs, init_nr, kernel, user,
@@ -556,7 +649,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
 	   u64, flags)
 {
-	return __bpf_get_stack(regs, NULL, buf, size, flags);
+	return __bpf_get_stack(regs, NULL, NULL, buf, size, flags);
 }
 
 const struct bpf_func_proto bpf_get_stack_proto = {
@@ -574,7 +667,7 @@ BPF_CALL_4(bpf_get_task_stack, struct task_struct *, task, void *, buf,
 {
 	struct pt_regs *regs = task_pt_regs(task);
 
-	return __bpf_get_stack(regs, task, buf, size, flags);
+	return __bpf_get_stack(regs, task, NULL, buf, size, flags);
 }
 
 BTF_ID_LIST(bpf_get_task_stack_btf_ids)
@@ -591,6 +684,79 @@ const struct bpf_func_proto bpf_get_task_stack_proto = {
 	.btf_id		= bpf_get_task_stack_btf_ids,
 };
 
+BPF_CALL_4(bpf_get_stack_pe, struct bpf_perf_event_data_kern *, ctx,
+	   void *, buf, u32, size, u64, flags)
+{
+	struct perf_event *event = ctx->event;
+	struct perf_callchain_entry *trace;
+	bool has_kernel, has_user;
+	bool kernel, user;
+	int err = -EINVAL;
+
+	if (!(event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
+		return __bpf_get_stack(ctx->regs, NULL, NULL, buf, size, flags);
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_USER_BUILD_ID)))
+		goto clear;
+
+	user = flags & BPF_F_USER_STACK;
+	kernel = !user;
+
+	has_kernel = !event->attr.exclude_callchain_kernel;
+	has_user = !event->attr.exclude_callchain_user;
+
+	if ((kernel && !has_kernel) || (user && !has_user))
+		goto clear;
+
+	err = -EFAULT;
+	trace = ctx->data->callchain;
+	if (unlikely(!trace))
+		goto clear;
+
+	if (has_kernel && has_user) {
+		__u64 nr_kernel = count_kernel_ip(trace);
+		int ret;
+
+		if (kernel) {
+			__u64 nr = trace->nr;
+
+			trace->nr = nr_kernel;
+			ret = __bpf_get_stack(ctx->regs, NULL, trace, buf,
+					      size, flags);
+
+			/* restore nr */
+			trace->nr = nr;
+		} else { /* user */
+			u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
+
+			skip += nr_kernel;
+			if (skip > BPF_F_SKIP_FIELD_MASK)
+				goto clear;
+
+			flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
+			ret = __bpf_get_stack(ctx->regs, NULL, trace, buf,
+					      size, flags);
+		}
+		return ret;
+	}
+	return __bpf_get_stack(ctx->regs, NULL, trace, buf, size, flags);
+clear:
+	memset(buf, 0, size);
+	return err;
+
+}
+
+const struct bpf_func_proto bpf_get_stack_proto_pe = {
+	.func		= bpf_get_stack_pe,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 /* Called from eBPF program */
 static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3cc0dcb60ca20..cb91ef902cc43 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1411,9 +1411,9 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
-		return &bpf_get_stackid_proto_tp;
+		return &bpf_get_stackid_proto_pe;
 	case BPF_FUNC_get_stack:
-		return &bpf_get_stack_proto_tp;
+		return &bpf_get_stack_proto_pe;
 	case BPF_FUNC_perf_prog_read_value:
 		return &bpf_perf_prog_read_value_proto;
 	case BPF_FUNC_read_branch_records:
-- 
2.24.1
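
For reference, the following is a minimal sketch of a perf_event BPF program
that pe_prog_func_proto() now routes to the *_pe protos. The map sizing,
section name and program name follow common libbpf conventions and are
illustrative; they are not taken from the selftest in patch 2/2.

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

#ifndef PERF_MAX_STACK_DEPTH
#define PERF_MAX_STACK_DEPTH 127
#endif

struct {
	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
	__uint(max_entries, 16384);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, PERF_MAX_STACK_DEPTH * sizeof(__u64));
} stackmap SEC(".maps");

SEC("perf_event")
int on_sample(struct bpf_perf_event_data *ctx)
{
	/* On a PEBS event with PERF_SAMPLE_CALLCHAIN this call resolves to
	 * bpf_get_stackid_pe() and uses the callchain in
	 * bpf_perf_event_data_kern->data->callchain instead of unwinding
	 * from the sampled regs.
	 */
	bpf_get_stackid(ctx, &stackmap, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";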


* Re: [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF
  2020-07-16 22:59 ` [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF Song Liu
@ 2020-07-21 19:10   ` Alexei Starovoitov
  2020-07-21 22:40     ` Song Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2020-07-21 19:10 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, bpf, netdev, ast, daniel, kernel-team,
	john.fastabend, kpsingh, brouer, peterz

On Thu, Jul 16, 2020 at 03:59:32PM -0700, Song Liu wrote:
> +
> +BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx,
> +	   struct bpf_map *, map, u64, flags)
> +{
> +	struct perf_event *event = ctx->event;
> +	struct perf_callchain_entry *trace;
> +	bool has_kernel, has_user;
> +	bool kernel, user;
> +
> +	/* perf_sample_data doesn't have callchain, use bpf_get_stackid */
> +	if (!(event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))

what if the event was not created with PERF_SAMPLE_CALLCHAIN?
Calling the helper will still cause crashes, no?

> +		return bpf_get_stackid((unsigned long)(ctx->regs),
> +				       (unsigned long) map, flags, 0, 0);
> +
> +	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
> +			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
> +		return -EINVAL;
> +
> +	user = flags & BPF_F_USER_STACK;
> +	kernel = !user;
> +
> +	has_kernel = !event->attr.exclude_callchain_kernel;
> +	has_user = !event->attr.exclude_callchain_user;
> +
> +	if ((kernel && !has_kernel) || (user && !has_user))
> +		return -EINVAL;

this will break existing users in a way that will be very hard for them to debug.
If they happen to set exclude_callchain_* flags during perf_event_open,
the helpers will start failing at run-time.
One can argue that bpf_get_stack is already broken when precise_ip=1, but
this is still a change in behavior.
It also seems to be broken when PERF_SAMPLE_CALLCHAIN was not set at event
creation time but precise_ip=1 was.

> +
> +	trace = ctx->data->callchain;
> +	if (unlikely(!trace))
> +		return -EFAULT;
> +
> +	if (has_kernel && has_user) {

shouldn't it be || ?

> +		__u64 nr_kernel = count_kernel_ip(trace);
> +		int ret;
> +
> +		if (kernel) {
> +			__u64 nr = trace->nr;
> +
> +			trace->nr = nr_kernel;
> +			ret = __bpf_get_stackid(map, trace, flags);
> +
> +			/* restore nr */
> +			trace->nr = nr;
> +		} else { /* user */
> +			u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
> +
> +			skip += nr_kernel;
> +			if (skip > BPF_F_SKIP_FIELD_MASK)
> +				return -EFAULT;
> +
> +			flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
> +			ret = __bpf_get_stackid(map, trace, flags);
> +		}
> +		return ret;
> +	}
> +	return __bpf_get_stackid(map, trace, flags);
...
> +	if (has_kernel && has_user) {
> +		__u64 nr_kernel = count_kernel_ip(trace);
> +		int ret;
> +
> +		if (kernel) {
> +			__u64 nr = trace->nr;
> +
> +			trace->nr = nr_kernel;
> +			ret = __bpf_get_stack(ctx->regs, NULL, trace, buf,
> +					      size, flags);
> +
> +			/* restore nr */
> +			trace->nr = nr;
> +		} else { /* user */
> +			u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
> +
> +			skip += nr_kernel;
> +			if (skip > BPF_F_SKIP_FIELD_MASK)
> +				goto clear;
> +
> +			flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
> +			ret = __bpf_get_stack(ctx->regs, NULL, trace, buf,
> +					      size, flags);
> +		}

Looks like copy-paste. I think there should be a way to make it
into common helper.

I think the main issue is the wrong interaction with the event attr flags.
I think the verifier should detect that bpf_get_stack/bpf_get_stackid
were used and prevent attaching to a perf_event with attr.precise_ip=1
when PERF_SAMPLE_CALLCHAIN is not specified.
I was thinking whether attaching bpf to event can force setting of
PERF_SAMPLE_CALLCHAIN, but that would be a surprising behavior,
so not a good idea.
So the only thing left is to reject attach when bpf_get_stack is used
in two cases:
if attr.precise_ip=1 and PERF_SAMPLE_CALLCHAIN is not set.
  (since it will lead to crashes)
if attr.precise_ip=1 and PERF_SAMPLE_CALLCHAIN is set,
but exclude_callchain_[user|kernel]=1 is set too.
  (since it will lead to surprising behavior of bpf_get_stack)

Other ideas?
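
A rough sketch of such an attach-time check, for illustration only: it
assumes a verifier-set flag like prog->call_get_stack, which does not exist
in this series, and the error code is arbitrary.

/*
 * Sketch only: reject attaching a program that uses
 * bpf_get_stack/bpf_get_stackid to a precise_ip event that cannot
 * provide a usable callchain. prog->call_get_stack is hypothetical.
 */
static int perf_event_check_bpf_stack(struct perf_event *event,
				      struct bpf_prog *prog)
{
	if (!prog->call_get_stack || !event->attr.precise_ip)
		return 0;

	/* no sampled callchain to fall back on: would hit the unwinder bug */
	if (!(event->attr.sample_type & PERF_SAMPLE_CALLCHAIN))
		return -EPROTO;

	/* callchain is sampled but partially excluded: surprising results */
	if (event->attr.exclude_callchain_kernel ||
	    event->attr.exclude_callchain_user)
		return -EPROTO;

	return 0;
}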

* Re: [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF
  2020-07-21 19:10   ` Alexei Starovoitov
@ 2020-07-21 22:40     ` Song Liu
  2020-07-21 22:43       ` Alexei Starovoitov
  2020-07-22 15:40       ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Song Liu @ 2020-07-21 22:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: open list, bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, john.fastabend, kpsingh, brouer, peterz



> On Jul 21, 2020, at 12:10 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Thu, Jul 16, 2020 at 03:59:32PM -0700, Song Liu wrote:
>> +
>> +BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx,
>> +	   struct bpf_map *, map, u64, flags)
>> +{
>> +	struct perf_event *event = ctx->event;
>> +	struct perf_callchain_entry *trace;
>> +	bool has_kernel, has_user;
>> +	bool kernel, user;
>> +
>> +	/* perf_sample_data doesn't have callchain, use bpf_get_stackid */
>> +	if (!(event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
> 
> what if the event was not created with PERF_SAMPLE_CALLCHAIN?
> Calling the helper will still cause crashes, no?

Yeah, it may still crash. Somehow I messed up this logic...

> 
>> +		return bpf_get_stackid((unsigned long)(ctx->regs),
>> +				       (unsigned long) map, flags, 0, 0);
>> +
>> +	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
>> +			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
>> +		return -EINVAL;
>> +
>> +	user = flags & BPF_F_USER_STACK;
>> +	kernel = !user;
>> +
>> +	has_kernel = !event->attr.exclude_callchain_kernel;
>> +	has_user = !event->attr.exclude_callchain_user;
>> +
>> +	if ((kernel && !has_kernel) || (user && !has_user))
>> +		return -EINVAL;
> 
> this will break existing users in a way that will be very hard for them to debug.
> If they happen to set exclude_callchain_* flags during perf_event_open,
> the helpers will start failing at run-time.
> One can argue that bpf_get_stack is already broken when precise_ip=1, but
> this is still a change in behavior.
> It also seems to be broken when PERF_SAMPLE_CALLCHAIN was not set at event
> creation time but precise_ip=1 was.
> 
>> +
>> +	trace = ctx->data->callchain;
>> +	if (unlikely(!trace))
>> +		return -EFAULT;
>> +
>> +	if (has_kernel && has_user) {
> 
> shouldn't it be || ?

It should be &&. We only need to adjust the callchain when it contains
both kernel and user stacks.

> 
>> +		__u64 nr_kernel = count_kernel_ip(trace);
>> +		int ret;
>> +
>> +		if (kernel) {
>> +			__u64 nr = trace->nr;
>> +
>> +			trace->nr = nr_kernel;
>> +			ret = __bpf_get_stackid(map, trace, flags);
>> +
>> +			/* restore nr */
>> +			trace->nr = nr;
>> +		} else { /* user */
>> +			u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
>> +
>> +			skip += nr_kernel;
>> +			if (skip > BPF_F_SKIP_FIELD_MASK)
>> +				return -EFAULT;
>> +
>> +			flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
>> +			ret = __bpf_get_stackid(map, trace, flags);
>> +		}
>> +		return ret;
>> +	}
>> +	return __bpf_get_stackid(map, trace, flags);
> ...
>> +	if (has_kernel && has_user) {
>> +		__u64 nr_kernel = count_kernel_ip(trace);
>> +		int ret;
>> +
>> +		if (kernel) {
>> +			__u64 nr = trace->nr;
>> +
>> +			trace->nr = nr_kernel;
>> +			ret = __bpf_get_stack(ctx->regs, NULL, trace, buf,
>> +					      size, flags);
>> +
>> +			/* restore nr */
>> +			trace->nr = nr;
>> +		} else { /* user */
>> +			u64 skip = flags & BPF_F_SKIP_FIELD_MASK;
>> +
>> +			skip += nr_kernel;
>> +			if (skip > BPF_F_SKIP_FIELD_MASK)
>> +				goto clear;
>> +
>> +			flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
>> +			ret = __bpf_get_stack(ctx->regs, NULL, trace, buf,
>> +					      size, flags);
>> +		}
> 
> Looks like copy-paste. I think there should be a way to make it
> into common helper.

I thought about moving this logic to a helper. But we are calling
__bpf_get_stackid() above and __bpf_get_stack() here, so we can't
easily put all the logic in one big helper, and multiple small helpers
look messy (to me).
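
For illustration, one possible factoring (a sketch only, not part of this
series; stack_map_fixup_trace() is a made-up name) would do the trace->nr
and flags adjustment in one place and leave the __bpf_get_stackid() /
__bpf_get_stack() call at each site:

/*
 * Sketch: shared fixup for the "callchain has both kernel and user
 * frames" case. The caller does the actual __bpf_get_stackid() or
 * __bpf_get_stack() call and then restores trace->nr = saved_nr.
 */
static int stack_map_fixup_trace(struct perf_callchain_entry *trace,
				 u64 *flags, bool kernel, u64 *saved_nr)
{
	u64 nr_kernel = count_kernel_ip(trace);

	*saved_nr = trace->nr;
	if (kernel) {
		/* keep only the kernel frames */
		trace->nr = nr_kernel;
	} else {
		/* skip over the kernel frames */
		u64 skip = (*flags & BPF_F_SKIP_FIELD_MASK) + nr_kernel;

		if (skip > BPF_F_SKIP_FIELD_MASK)
			return -EFAULT;
		*flags = (*flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
	}
	return 0;
}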

> 
> I think the main issue is the wrong interaction with the event attr flags.
> I think the verifier should detect that bpf_get_stack/bpf_get_stackid
> were used and prevent attaching to a perf_event with attr.precise_ip=1
> when PERF_SAMPLE_CALLCHAIN is not specified.
> I was thinking whether attaching bpf to event can force setting of
> PERF_SAMPLE_CALLCHAIN, but that would be a surprising behavior,
> so not a good idea.
> So the only thing left is to reject attach when bpf_get_stack is used
> in two cases:
> if attr.precise_ip=1 and PERF_SAMPLE_CALLCHAIN is not set.
>  (since it will lead to crashes)

We only need to block precise_ip >= 2. precise_ip == 1 is OK. 

> if attr.precise_ip=1 and PERF_SAMPLE_CALLCHAIN is set,
> but exclude_callchain_[user|kernel]=1 is set too.
>  (since it will lead to surprising behavior of bpf_get_stack)
> 
> Other ideas?

Yes, this sounds good. 

Thanks,
Song


* Re: [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF
  2020-07-21 22:40     ` Song Liu
@ 2020-07-21 22:43       ` Alexei Starovoitov
  2020-07-21 22:51         ` Song Liu
  2020-07-22 15:40       ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2020-07-21 22:43 UTC (permalink / raw)
  To: Song Liu
  Cc: open list, bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, john.fastabend, kpsingh, brouer, peterz

On Tue, Jul 21, 2020 at 3:40 PM Song Liu <songliubraving@fb.com> wrote:
>
> We only need to block precise_ip >= 2. precise_ip == 1 is OK.

Are you sure?
intel_pmu_hw_config() has:
if (event->attr.precise_ip) {
    if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
            event->attr.sample_type |= __PERF_SAMPLE_CALLCHAIN_EARLY;
}

* Re: [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF
  2020-07-21 22:43       ` Alexei Starovoitov
@ 2020-07-21 22:51         ` Song Liu
  0 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2020-07-21 22:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: open list, bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, john.fastabend, kpsingh, brouer, peterz



> On Jul 21, 2020, at 3:43 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Tue, Jul 21, 2020 at 3:40 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> We only need to block precise_ip >= 2. precise_ip == 1 is OK.
> 
> Are you sure?
> intel_pmu_hw_config() has:
> if (event->attr.precise_ip) {
>    if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
>            event->attr.sample_type |= __PERF_SAMPLE_CALLCHAIN_EARLY;
> }

The bit that breaks the unwinder is in setup_pebs_fixed_sample_data():

                if (x86_pmu.intel_cap.pebs_format >= 2) {
                        set_linear_ip(regs, pebs->real_ip);
                        regs->flags |= PERF_EFLAGS_EXACT;
                } 

"real_ip" causes the issue. 

But on second thought, it is probably better to also block precise_ip == 1,
to match the logic in intel_pmu_hw_config().

Thanks,
Song

* Re: [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF
  2020-07-21 22:40     ` Song Liu
  2020-07-21 22:43       ` Alexei Starovoitov
@ 2020-07-22 15:40       ` Peter Zijlstra
  2020-07-22 16:49         ` Song Liu
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2020-07-22 15:40 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, open list, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, john.fastabend,
	kpsingh, brouer

On Tue, Jul 21, 2020 at 10:40:19PM +0000, Song Liu wrote:

> We only need to block precise_ip >= 2. precise_ip == 1 is OK. 

Uuuh, how? Anything PEBS would have the same problem. Sure, precise_ip
== 1 will not correct the IP, but the stack will not match regardless.

You need IP,SP(,BP) to be a consistent set _AND_ have it match the
current stack; PEBS simply cannot do that, because the regs get recorded
(much) earlier than the PMI and the stack can have changed in the
meantime.


* Re: [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF
  2020-07-22 15:40       ` Peter Zijlstra
@ 2020-07-22 16:49         ` Song Liu
  0 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2020-07-22 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, open list, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, john.fastabend,
	kpsingh, brouer



> On Jul 22, 2020, at 8:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Jul 21, 2020 at 10:40:19PM +0000, Song Liu wrote:
> 
>> We only need to block precise_ip >= 2. precise_ip == 1 is OK. 
> 
> Uuuh, how? Anything PEBS would have the same problem. Sure, precise_ip
> == 1 will not correct the IP, but the stack will not match regardless.
> 
> You need IP,SP(,BP) to be a consistent set _AND_ have it match the
> current stack; PEBS simply cannot do that, because the regs get recorded
> (much) earlier than the PMI and the stack can have changed in the
> meantime.
> 

By "OK", I meant unwinder will not report error (in my tests). For 
accurate stack, we should do the same for precise_ip == 1. 

Thanks,
Song
