* using skip>0 with bpf_get_stack()
@ 2021-05-28 22:16 Eugene Loh
  2021-06-01 21:48 ` Yonghong Song
  0 siblings, 1 reply; 7+ messages in thread
From: Eugene Loh @ 2021-05-28 22:16 UTC (permalink / raw)
  To: bpf

I have a question about bpf_get_stack().  I'm interested in the case
         skip > 0
         user_build_id == 0
         num_elem < sysctl_perf_event_max_stack

The function sets
         init_nr = sysctl_perf_event_max_stack - num_elem;
which means that get_perf_callchain() will return "num_elem" stack 
frames.  Then, since we skip "skip" frames, we'll fill the user buffer 
with only "num_elem - skip" frames, the remaining frames being zero-filled.

For example, let's say the call stack is
         leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6

Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask 
skip==2.  I would expect to skip 2 frames then get 4 frames, getting back:
         foo1  foo2  foo3  foo4

Instead, I get
         foo1  foo2  0  0
i.e., 2 frames are skipped, but the last 2 slots are left zeroed.

I think the init_nr computation should be:

-       if (sysctl_perf_event_max_stack < num_elem)
+       if (sysctl_perf_event_max_stack <= num_elem + skip)
                 init_nr = 0;
         else
-               init_nr = sysctl_perf_event_max_stack - num_elem;
+               init_nr = sysctl_perf_event_max_stack - num_elem - skip;

Incidentally, the return value of the function is presumably the size of 
the returned data.  Would it make sense to say so in 
include/uapi/linux/bpf.h?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: using skip>0 with bpf_get_stack()
  2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
@ 2021-06-01 21:48 ` Yonghong Song
  2021-06-26  1:22   ` Eugene Loh
  0 siblings, 1 reply; 7+ messages in thread
From: Yonghong Song @ 2021-06-01 21:48 UTC (permalink / raw)
  To: Eugene Loh, bpf



On 5/28/21 3:16 PM, Eugene Loh wrote:
> I have a question about bpf_get_stack().  I'm interested in the case
>          skip > 0
>          user_build_id == 0
>          num_elem < sysctl_perf_event_max_stack
> 
> The function sets
>          init_nr = sysctl_perf_event_max_stack - num_elem;
> which means that get_perf_callchain() will return "num_elem" stack 
> frames.  Then, since we skip "skip" frames, we'll fill the user buffer 
> with only "num_elem - skip" frames, the remaining frames being filled zero.
> 
> For example, let's say the call stack is
>          leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
> 
> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask 
> skip==2.  I would expect to skip 2 frames then get 4 frames, getting back:
>          foo1  foo2  foo3  foo4
> 
> Instead, I get
>          foo1  foo2  0  0
> skipping 2 frames but also leaving frames zeroed out.

Thanks for reporting. I looked at the code and it does seem that we may
have a kernel bug when skip != 0. Basically, as you described, we
initially collect num_elem stack frames and later reduce that by skip,
so we end up with num_elem - skip frames, as you calculated above.

> 
> I think the init_nr computation should be:
> 
> -       if (sysctl_perf_event_max_stack < num_elem)
> +       if (sysctl_perf_event_max_stack <= num_elem + skip)
>                  init_nr = 0;
>          else
> -               init_nr = sysctl_perf_event_max_stack - num_elem;
> +               init_nr = sysctl_perf_event_max_stack - num_elem - skip;

A rough check suggests this is one correct way to fix the issue.

> Incidentally, the return value of the function is presumably the size of 
> the returned data.  Would it make sense to say so in 
> include/uapi/linux/bpf.h?

The current documentation says:
  *      Return
  *              A non-negative value equal to or less than *size* on success,
  *              or a negative error in case of failure.

I think you can improve this with a more precise description, e.g., a
non-negative value giving the copied **buf** length.

Could you submit a patch for this? Thanks!



* Re: using skip>0 with bpf_get_stack()
  2021-06-01 21:48 ` Yonghong Song
@ 2021-06-26  1:22   ` Eugene Loh
  2021-06-29  3:33     ` Yonghong Song
  0 siblings, 1 reply; 7+ messages in thread
From: Eugene Loh @ 2021-06-26  1:22 UTC (permalink / raw)
  To: bpf

[-- Attachment #1: Type: text/plain, Size: 2708 bytes --]


On 6/1/21 5:48 PM, Yonghong Song wrote:
>
>
> On 5/28/21 3:16 PM, Eugene Loh wrote:
>> I have a question about bpf_get_stack(). I'm interested in the case
>>          skip > 0
>>          user_build_id == 0
>>          num_elem < sysctl_perf_event_max_stack
>>
>> The function sets
>>          init_nr = sysctl_perf_event_max_stack - num_elem;
>> which means that get_perf_callchain() will return "num_elem" stack 
>> frames.  Then, since we skip "skip" frames, we'll fill the user 
>> buffer with only "num_elem - skip" frames, the remaining frames being 
>> filled zero.
>>
>> For example, let's say the call stack is
>>          leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>
>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask 
>> skip==2.  I would expect to skip 2 frames then get 4 frames, getting 
>> back:
>>          foo1  foo2  foo3  foo4
>>
>> Instead, I get
>>          foo1  foo2  0  0
>> skipping 2 frames but also leaving frames zeroed out.
>
> Thanks for reporting. I looked at codes and it does seem that we may
> have a kernel bug when skip != 0. Basically as you described,
> initially we collected num_elem stacks and later on we reduced by skip
> so we got num_elem - skip as you calculated in the above.
>
>>
>> I think the init_nr computation should be:
>>
>> -       if (sysctl_perf_event_max_stack < num_elem)
>> +       if (sysctl_perf_event_max_stack <= num_elem + skip)
>>                  init_nr = 0;
>>          else
>> -               init_nr = sysctl_perf_event_max_stack - num_elem;
>> +               init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>
> A rough check looks like this is one correct way to fix the issue.
>
>> Incidentally, the return value of the function is presumably the size 
>> of the returned data.  Would it make sense to say so in 
>> include/uapi/linux/bpf.h?
>
> The current documentation says:
>  *      Return
>  *              A non-negative value equal to or less than *size* on 
> success,
>  *              or a negative error in case of failure.
>
> I think you can improve with more precise description such that
> a non-negative value for the copied **buf** length.
>
> Could you submit a patch for this? Thanks!

Sure.  Thanks for looking at this and sorry about the long delay getting 
back to you.

Could you take a look at the attached, proposed patch?  As you see in 
the commit message, I'm unclear about the bpf_get_stack*_pe() variants.  
They might use an earlier-constructed callchain, and I do not know how 
init_nr was set for them.

[-- Attachment #2: 0001-bpf-Adjust-BPF-stack-helper-functions-to-accommodate.patch --]
[-- Type: text/plain, Size: 4186 bytes --]

From 70da9057d7fa7dda76e1b0861b8a0174078434ea Mon Sep 17 00:00:00 2001
From: Eugene Loh <eugene.loh@oracle.com>
Date: Fri, 25 Jun 2021 15:18:46 -0700
Subject: [PATCH] bpf: Adjust BPF stack helper functions to accommodate skip>0

Let's say that the caller has storage for num_elem stack frames.  Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip>0, one keeps only num_elem-skip frames.

Change the computation of init_nr so that num_elem+skip stack frames are
walked (and hence num_elem frames are kept).

[NB:  I am unsure of the bpf_get_stack*_pe() variants, which in the case of
__PERF_SAMPLE_CALLCHAIN_EARLY use ctx->data->callchain, which was walked
earlier.  I am unclear how init_nr was chosen for it.]

Change the comment on bpf_get_stack() in the header file to be more
explicit what the return value means.

Signed-off-by: Eugene Loh <eugene.loh@oracle.com>
---
 include/uapi/linux/bpf.h |  4 ++--
 kernel/bpf/stackmap.c    | 26 +++++++++++++++-----------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 79c893310492..7c7b93e1db90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2183,8 +2183,8 @@ union bpf_attr {
  *
  * 			# sysctl kernel.perf_event_max_stack=<new value>
  * 	Return
- * 		A non-negative value equal to or less than *size* on success,
- * 		or a negative error in case of failure.
+ * 		The non-negative copied *buf* length equal to or less than
+ * 		*size* on success, or a negative error in case of failure.
  *
  * long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
  * 	Description
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index be35bfb7fb13..e2a193581550 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -249,23 +249,30 @@ get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
 #endif
 }
 
+static u32 get_init_nr(u32 nelem, u64 flags)
+{
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+
+	if (sysctl_perf_event_max_stack <= nelem + skip)
+		return 0;
+	else
+		return sysctl_perf_event_max_stack - nelem - skip;
+}
+
 static long __bpf_get_stackid(struct bpf_map *map,
 			      struct perf_callchain_entry *trace, u64 flags)
 {
 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
 	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
 	u32 max_depth = map->value_size / stack_map_data_size(map);
-	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
-	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+	u32 init_nr;
 	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
 	u32 hash, id, trace_nr, trace_len;
 	bool user = flags & BPF_F_USER_STACK;
 	u64 *ips;
 	bool hash_matches;
 
-	/* get_perf_callchain() guarantees that trace->nr >= init_nr
-	 * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
-	 */
+	init_nr = get_init_nr(max_depth, flags);
 	trace_nr = trace->nr - init_nr;
 
 	if (trace_nr <= skip)
@@ -331,8 +338,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
 	u32 max_depth = map->value_size / stack_map_data_size(map);
-	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
-	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+	u32 init_nr;
 	bool user = flags & BPF_F_USER_STACK;
 	struct perf_callchain_entry *trace;
 	bool kernel = !user;
@@ -341,6 +347,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
 		return -EINVAL;
 
+	init_nr = get_init_nr(max_depth, flags);
 	trace = get_perf_callchain(regs, init_nr, kernel, user,
 				   sysctl_perf_event_max_stack, false, false);
 
@@ -458,10 +465,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		goto err_fault;
 
 	num_elem = size / elem_size;
-	if (sysctl_perf_event_max_stack < num_elem)
-		init_nr = 0;
-	else
-		init_nr = sysctl_perf_event_max_stack - num_elem;
+	init_nr = get_init_nr(num_elem, flags);
 
 	if (trace_in)
 		trace = trace_in;
-- 
2.27.0



* Re: using skip>0 with bpf_get_stack()
  2021-06-26  1:22   ` Eugene Loh
@ 2021-06-29  3:33     ` Yonghong Song
  2022-03-04 20:37       ` Namhyung Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Yonghong Song @ 2021-06-29  3:33 UTC (permalink / raw)
  To: Eugene Loh, bpf



On 6/25/21 6:22 PM, Eugene Loh wrote:
> 
> On 6/1/21 5:48 PM, Yonghong Song wrote:
>>
>>
>> On 5/28/21 3:16 PM, Eugene Loh wrote:
>>> I have a question about bpf_get_stack(). I'm interested in the case
>>>          skip > 0
>>>          user_build_id == 0
>>>          num_elem < sysctl_perf_event_max_stack
>>>
>>> The function sets
>>>          init_nr = sysctl_perf_event_max_stack - num_elem;
>>> which means that get_perf_callchain() will return "num_elem" stack 
>>> frames.  Then, since we skip "skip" frames, we'll fill the user 
>>> buffer with only "num_elem - skip" frames, the remaining frames being 
>>> filled zero.
>>>
>>> For example, let's say the call stack is
>>>          leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>>
>>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask 
>>> skip==2.  I would expect to skip 2 frames then get 4 frames, getting 
>>> back:
>>>          foo1  foo2  foo3  foo4
>>>
>>> Instead, I get
>>>          foo1  foo2  0  0
>>> skipping 2 frames but also leaving frames zeroed out.
>>
>> Thanks for reporting. I looked at codes and it does seem that we may
>> have a kernel bug when skip != 0. Basically as you described,
>> initially we collected num_elem stacks and later on we reduced by skip
>> so we got num_elem - skip as you calculated in the above.
>>
>>>
>>> I think the init_nr computation should be:
>>>
>>> -       if (sysctl_perf_event_max_stack < num_elem)
>>> +       if (sysctl_perf_event_max_stack <= num_elem + skip)
>>>                  init_nr = 0;
>>>          else
>>> -               init_nr = sysctl_perf_event_max_stack - num_elem;
>>> +               init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>>
>> A rough check looks like this is one correct way to fix the issue.
>>
>>> Incidentally, the return value of the function is presumably the size 
>>> of the returned data.  Would it make sense to say so in 
>>> include/uapi/linux/bpf.h?
>>
>> The current documentation says:
>>  *      Return
>>  *              A non-negative value equal to or less than *size* on 
>> success,
>>  *              or a negative error in case of failure.
>>
>> I think you can improve with more precise description such that
>> a non-negative value for the copied **buf** length.
>>
>> Could you submit a patch for this? Thanks!
> 
> Sure.  Thanks for looking at this and sorry about the long delay getting 
> back to you.
> 
> Could you take a look at the attached, proposed patch?  As you see in 
> the commit message, I'm unclear about the bpf_get_stack*_pe() variants. 
> They might use an earlier-constructed callchain, and I do not know how
> init_nr was set for them.

I think the bpf_get_stackid() and __bpf_get_stackid() implementations
are correct. Did you find any issues?

For bpf_get_stack_pe(), see:

https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
I think you should not change the bpf_get_stack() function.
__bpf_get_stack() is used by both bpf_get_stack() and bpf_get_stack_pe().
In bpf_get_stack_pe(), the callchain is fetched by the perf event
infrastructure when event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY
is set.

Just focus on __bpf_get_stack(). We could refactor __bpf_get_stackid(),
but unless we have a bug, I don't see that it is necessary.

It would be good if you could add a test for the change; there is a
stacktrace test, prog_tests/stacktrace_map.c, that you can look at and
add a subtest to.

Next time, please submit a formal patch with `git send-email ...` to
this list. That way it is easier to review than an attachment.


* Re: using skip>0 with bpf_get_stack()
  2021-06-29  3:33     ` Yonghong Song
@ 2022-03-04 20:37       ` Namhyung Kim
  2022-03-04 20:50         ` Eugene Loh
  0 siblings, 1 reply; 7+ messages in thread
From: Namhyung Kim @ 2022-03-04 20:37 UTC (permalink / raw)
  To: Eugene Loh; +Cc: Yonghong Song, bpf

Hello,

On Mon, Jun 28, 2021 at 08:33:11PM -0700, Yonghong Song wrote:
> 
> 
> On 6/25/21 6:22 PM, Eugene Loh wrote:
> > 
> > On 6/1/21 5:48 PM, Yonghong Song wrote:
> > > Could you submit a patch for this? Thanks!
> > 
> > Sure.  Thanks for looking at this and sorry about the long delay getting
> > back to you.
> > 
> > Could you take a look at the attached, proposed patch?  As you see in
> > the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
> > They might use an earlier-constructed callchain, and I do not know how
> > init_nr was set for them.
> 
> I think bpf_get_stackid() and __bpf_get_stackid() implementation is correct.
> Did you find any issues?
> 
> For bpf_get_stack_pe, see:
> 
> https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
> I think you should not change bpf_get_stack() function.
> __bpf_get_stack() is used by bpf_get_stack() and bpf_get_stack_pe().
> In bpf_get_stack_pe(), callchain is fetched by perf event infrastructure
> if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY is true.
> 
> Just focus on __bpf_get_stack(). We could factor __bpf_get_stackid(),
> but unless we have a bug, I didn't see it is necessary.
> 
> It will be good if you can add a test for the change, there is a stacktrace
> test prog_tests/stacktrace_map.c, you can take a look,
> and you can add a subtest there.
> 
> Next time, you can submit a formal patch with `git send-email ...` to
> this alias. This way it is easier to review compared to attachment.

Any updates on this?  I'm hitting the same issue and found this before
sending a fix.

Thanks,
Namhyung


* Re: using skip>0 with bpf_get_stack()
  2022-03-04 20:37       ` Namhyung Kim
@ 2022-03-04 20:50         ` Eugene Loh
  2022-03-04 21:16           ` Namhyung Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Eugene Loh @ 2022-03-04 20:50 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: Yonghong Song, bpf

No update from me.  A fix would be great.

On 3/4/22 3:37 PM, Namhyung Kim wrote:
> Hello,
>
> On Mon, Jun 28, 2021 at 08:33:11PM -0700, Yonghong Song wrote:
>>
>> On 6/25/21 6:22 PM, Eugene Loh wrote:
>>> On 6/1/21 5:48 PM, Yonghong Song wrote:
>>>> Could you submit a patch for this? Thanks!
>>> Sure.  Thanks for looking at this and sorry about the long delay getting
>>> back to you.
>>>
>>> Could you take a look at the attached, proposed patch?  As you see in
>>> the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
>>> They might use an earlier-constructed callchain, and I do not know how
>>> init_nr was set for them.
>> I think bpf_get_stackid() and __bpf_get_stackid() implementation is correct.
>> Did you find any issues?
>>
>> For bpf_get_stack_pe, see:
>>
>> https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
>> I think you should not change bpf_get_stack() function.
>> __bpf_get_stack() is used by bpf_get_stack() and bpf_get_stack_pe().
>> In bpf_get_stack_pe(), callchain is fetched by perf event infrastructure
>> if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY is true.
>>
>> Just focus on __bpf_get_stack(). We could factor __bpf_get_stackid(),
>> but unless we have a bug, I didn't see it is necessary.
>>
>> It will be good if you can add a test for the change, there is a stacktrace
>> test prog_tests/stacktrace_map.c, you can take a look,
>> and you can add a subtest there.
>>
>> Next time, you can submit a formal patch with `git send-email ...` to
>> this alias. This way it is easier to review compared to attachment.
> Any updates on this?  I'm hitting the same issue and found this before
> sending a fix.
>
> Thanks,
> Namhyung


* Re: using skip>0 with bpf_get_stack()
  2022-03-04 20:50         ` Eugene Loh
@ 2022-03-04 21:16           ` Namhyung Kim
  0 siblings, 0 replies; 7+ messages in thread
From: Namhyung Kim @ 2022-03-04 21:16 UTC (permalink / raw)
  To: Eugene Loh; +Cc: Yonghong Song, bpf

On Fri, Mar 4, 2022 at 12:50 PM Eugene Loh <eugene.loh@oracle.com> wrote:
>
> No update from me.  A fix would be great.

Ok, I will think about the _pe variants more.

Thanks,
Namhyung


end of thread, other threads:[~2022-03-04 21:16 UTC | newest]

Thread overview: 7+ messages
2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
2021-06-01 21:48 ` Yonghong Song
2021-06-26  1:22   ` Eugene Loh
2021-06-29  3:33     ` Yonghong Song
2022-03-04 20:37       ` Namhyung Kim
2022-03-04 20:50         ` Eugene Loh
2022-03-04 21:16           ` Namhyung Kim
