* using skip>0 with bpf_get_stack()
@ 2021-05-28 22:16 Eugene Loh
2021-06-01 21:48 ` Yonghong Song
0 siblings, 1 reply; 7+ messages in thread
From: Eugene Loh @ 2021-05-28 22:16 UTC (permalink / raw)
To: bpf
I have a question about bpf_get_stack(). I'm interested in the case
skip > 0
user_build_id == 0
num_elem < sysctl_perf_event_max_stack
The function sets
init_nr = sysctl_perf_event_max_stack - num_elem;
which means that get_perf_callchain() will return "num_elem" stack
frames. Then, since we skip "skip" frames, we'll fill the user buffer
with only "num_elem - skip" frames, the remaining frames being filled zero.
For example, let's say the call stack is
leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask
skip==2. I would expect to skip 2 frames then get 4 frames, getting back:
foo1 foo2 foo3 foo4
Instead, I get
foo1 foo2 0 0
skipping 2 frames but also leaving frames zeroed out.
I think the init_nr computation should be:
- if (sysctl_perf_event_max_stack < num_elem)
+ if (sysctl_perf_event_max_stack <= num_elem + skip)
init_nr = 0;
else
- init_nr = sysctl_perf_event_max_stack - num_elem;
+ init_nr = sysctl_perf_event_max_stack - num_elem - skip;
Incidentally, the return value of the function is presumably the size of
the returned data. Would it make sense to say so in
include/uapi/linux/bpf.h?
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: using skip>0 with bpf_get_stack()
2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
@ 2021-06-01 21:48 ` Yonghong Song
2021-06-26 1:22 ` Eugene Loh
0 siblings, 1 reply; 7+ messages in thread
From: Yonghong Song @ 2021-06-01 21:48 UTC (permalink / raw)
To: Eugene Loh, bpf
On 5/28/21 3:16 PM, Eugene Loh wrote:
> I have a question about bpf_get_stack(). I'm interested in the case
> skip > 0
> user_build_id == 0
> num_elem < sysctl_perf_event_max_stack
>
> The function sets
> init_nr = sysctl_perf_event_max_stack - num_elem;
> which means that get_perf_callchain() will return "num_elem" stack
> frames. Then, since we skip "skip" frames, we'll fill the user buffer
> with only "num_elem - skip" frames, the remaining frames being filled zero.
>
> For example, let's say the call stack is
> leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>
> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask
> skip==2. I would expect to skip 2 frames then get 4 frames, getting back:
> foo1 foo2 foo3 foo4
>
> Instead, I get
> foo1 foo2 0 0
> skipping 2 frames but also leaving frames zeroed out.
Thanks for reporting. I looked at the code and it does seem that we may
have a kernel bug when skip != 0. Basically, as you described, we
initially collect num_elem stack frames and later reduce that by skip,
so we end up with num_elem - skip, as you calculated above.
>
> I think the init_nr computation should be:
>
> - if (sysctl_perf_event_max_stack < num_elem)
> + if (sysctl_perf_event_max_stack <= num_elem + skip)
> init_nr = 0;
> else
> - init_nr = sysctl_perf_event_max_stack - num_elem;
> + init_nr = sysctl_perf_event_max_stack - num_elem - skip;
A rough check suggests that this is one correct way to fix the issue.
> Incidentally, the return value of the function is presumably the size of
> the returned data. Would it make sense to say so in
> include/uapi/linux/bpf.h?
The current documentation says:
 * Return
 *		A non-negative value equal to or less than *size* on success,
 *		or a negative error in case of failure.
I think you can improve it with a more precise description, such as:
the non-negative length of the data copied to **buf** on success.
Could you submit a patch for this? Thanks!
* Re: using skip>0 with bpf_get_stack()
2021-06-01 21:48 ` Yonghong Song
@ 2021-06-26 1:22 ` Eugene Loh
2021-06-29 3:33 ` Yonghong Song
0 siblings, 1 reply; 7+ messages in thread
From: Eugene Loh @ 2021-06-26 1:22 UTC (permalink / raw)
To: bpf
[-- Attachment #1: Type: text/plain, Size: 2708 bytes --]
On 6/1/21 5:48 PM, Yonghong Song wrote:
>
>
> On 5/28/21 3:16 PM, Eugene Loh wrote:
>> I have a question about bpf_get_stack(). I'm interested in the case
>> skip > 0
>> user_build_id == 0
>> num_elem < sysctl_perf_event_max_stack
>>
>> The function sets
>> init_nr = sysctl_perf_event_max_stack - num_elem;
>> which means that get_perf_callchain() will return "num_elem" stack
>> frames. Then, since we skip "skip" frames, we'll fill the user
>> buffer with only "num_elem - skip" frames, the remaining frames being
>> filled zero.
>>
>> For example, let's say the call stack is
>> leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>
>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask
>> skip==2. I would expect to skip 2 frames then get 4 frames, getting
>> back:
>> foo1 foo2 foo3 foo4
>>
>> Instead, I get
>> foo1 foo2 0 0
>> skipping 2 frames but also leaving frames zeroed out.
>
> Thanks for reporting. I looked at the code and it does seem that we may
> have a kernel bug when skip != 0. Basically, as you described, we
> initially collect num_elem stack frames and later reduce that by skip,
> so we end up with num_elem - skip, as you calculated above.
>
>>
>> I think the init_nr computation should be:
>>
>> - if (sysctl_perf_event_max_stack < num_elem)
>> + if (sysctl_perf_event_max_stack <= num_elem + skip)
>> init_nr = 0;
>> else
>> - init_nr = sysctl_perf_event_max_stack - num_elem;
>> + init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>
> A rough check suggests that this is one correct way to fix the issue.
>
>> Incidentally, the return value of the function is presumably the size
>> of the returned data. Would it make sense to say so in
>> include/uapi/linux/bpf.h?
>
> The current documentation says:
>  * Return
>  *		A non-negative value equal to or less than *size* on success,
>  *		or a negative error in case of failure.
>
> I think you can improve it with a more precise description, such as:
> the non-negative length of the data copied to **buf** on success.
>
> Could you submit a patch for this? Thanks!
Sure. Thanks for looking at this, and sorry about the long delay getting
back to you.
Could you take a look at the attached proposed patch? As you can see in
the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
They might use an earlier-constructed callchain, and I do not know how
init_nr was set for them.
[-- Attachment #2: 0001-bpf-Adjust-BPF-stack-helper-functions-to-accommodate.patch --]
[-- Type: text/plain, Size: 4186 bytes --]
From 70da9057d7fa7dda76e1b0861b8a0174078434ea Mon Sep 17 00:00:00 2001
From: Eugene Loh <eugene.loh@oracle.com>
Date: Fri, 25 Jun 2021 15:18:46 -0700
Subject: [PATCH] bpf: Adjust BPF stack helper functions to accommodate skip>0
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip>0, one keeps only num_elem-skip frames.
Change the computation of init_nr so that num_elem+skip stack frames are
walked (and hence num_elem frames are kept).
[NB: I am unsure of the bpf_get_stack*_pe() variants, which in the case of
__PERF_SAMPLE_CALLCHAIN_EARLY use ctx->data->callchain, which was walked
earlier. I am unclear how init_nr was chosen for it.]
Change the comment on bpf_get_stack() in the header file to be more
explicit what the return value means.
Signed-off-by: Eugene Loh <eugene.loh@oracle.com>
---
include/uapi/linux/bpf.h | 4 ++--
kernel/bpf/stackmap.c | 26 +++++++++++++++-----------
2 files changed, 17 insertions(+), 13 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 79c893310492..7c7b93e1db90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2183,8 +2183,8 @@ union bpf_attr {
*
* # sysctl kernel.perf_event_max_stack=<new value>
* Return
- * A non-negative value equal to or less than *size* on success,
- * or a negative error in case of failure.
+ * The non-negative copied *buf* length equal to or less than
+ * *size* on success, or a negative error in case of failure.
*
* long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
* Description
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index be35bfb7fb13..e2a193581550 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -249,23 +249,30 @@ get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
#endif
}
+static u32 get_init_nr(u32 nelem, u64 flags)
+{
+ u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+
+ if (sysctl_perf_event_max_stack <= nelem + skip)
+ return 0;
+ else
+ return sysctl_perf_event_max_stack - nelem - skip;
+}
+
static long __bpf_get_stackid(struct bpf_map *map,
struct perf_callchain_entry *trace, u64 flags)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
u32 max_depth = map->value_size / stack_map_data_size(map);
- /* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
- u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+ u32 init_nr;
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
u32 hash, id, trace_nr, trace_len;
bool user = flags & BPF_F_USER_STACK;
u64 *ips;
bool hash_matches;
- /* get_perf_callchain() guarantees that trace->nr >= init_nr
- * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
- */
+ init_nr = get_init_nr(max_depth, flags);
trace_nr = trace->nr - init_nr;
if (trace_nr <= skip)
@@ -331,8 +338,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
u64, flags)
{
u32 max_depth = map->value_size / stack_map_data_size(map);
- /* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
- u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+ u32 init_nr;
bool user = flags & BPF_F_USER_STACK;
struct perf_callchain_entry *trace;
bool kernel = !user;
@@ -341,6 +347,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
return -EINVAL;
+ init_nr = get_init_nr(max_depth, flags);
trace = get_perf_callchain(regs, init_nr, kernel, user,
sysctl_perf_event_max_stack, false, false);
@@ -458,10 +465,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
goto err_fault;
num_elem = size / elem_size;
- if (sysctl_perf_event_max_stack < num_elem)
- init_nr = 0;
- else
- init_nr = sysctl_perf_event_max_stack - num_elem;
+ init_nr = get_init_nr(num_elem, flags);
if (trace_in)
trace = trace_in;
--
2.27.0
* Re: using skip>0 with bpf_get_stack()
2021-06-26 1:22 ` Eugene Loh
@ 2021-06-29 3:33 ` Yonghong Song
2022-03-04 20:37 ` Namhyung Kim
0 siblings, 1 reply; 7+ messages in thread
From: Yonghong Song @ 2021-06-29 3:33 UTC (permalink / raw)
To: Eugene Loh, bpf
On 6/25/21 6:22 PM, Eugene Loh wrote:
>
> On 6/1/21 5:48 PM, Yonghong Song wrote:
>>
>>
>> On 5/28/21 3:16 PM, Eugene Loh wrote:
>>> I have a question about bpf_get_stack(). I'm interested in the case
>>> skip > 0
>>> user_build_id == 0
>>> num_elem < sysctl_perf_event_max_stack
>>>
>>> The function sets
>>> init_nr = sysctl_perf_event_max_stack - num_elem;
>>> which means that get_perf_callchain() will return "num_elem" stack
>>> frames. Then, since we skip "skip" frames, we'll fill the user
>>> buffer with only "num_elem - skip" frames, the remaining frames being
>>> filled zero.
>>>
>>> For example, let's say the call stack is
>>> leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>>
>>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask
>>> skip==2. I would expect to skip 2 frames then get 4 frames, getting
>>> back:
>>> foo1 foo2 foo3 foo4
>>>
>>> Instead, I get
>>> foo1 foo2 0 0
>>> skipping 2 frames but also leaving frames zeroed out.
>>
>> Thanks for reporting. I looked at the code and it does seem that we may
>> have a kernel bug when skip != 0. Basically, as you described, we
>> initially collect num_elem stack frames and later reduce that by skip,
>> so we end up with num_elem - skip, as you calculated above.
>>
>>>
>>> I think the init_nr computation should be:
>>>
>>> - if (sysctl_perf_event_max_stack < num_elem)
>>> + if (sysctl_perf_event_max_stack <= num_elem + skip)
>>> init_nr = 0;
>>> else
>>> - init_nr = sysctl_perf_event_max_stack - num_elem;
>>> + init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>>
>> A rough check suggests that this is one correct way to fix the issue.
>>
>>> Incidentally, the return value of the function is presumably the size
>>> of the returned data. Would it make sense to say so in
>>> include/uapi/linux/bpf.h?
>>
>> The current documentation says:
>>  * Return
>>  *		A non-negative value equal to or less than *size* on success,
>>  *		or a negative error in case of failure.
>>
>> I think you can improve it with a more precise description, such as:
>> the non-negative length of the data copied to **buf** on success.
>>
>> Could you submit a patch for this? Thanks!
>
> Sure. Thanks for looking at this, and sorry about the long delay getting
> back to you.
>
> Could you take a look at the attached proposed patch? As you can see in
> the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
> They might use an earlier-constructed callchain, and I do not know how
> init_nr was set for them.
I think the bpf_get_stackid() and __bpf_get_stackid() implementations
are correct. Did you find any issues?
For bpf_get_stack_pe(), see:
https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
I think you should not change the bpf_get_stack() function.
__bpf_get_stack() is used by both bpf_get_stack() and bpf_get_stack_pe().
In bpf_get_stack_pe(), the callchain is fetched by the perf event
infrastructure if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY
is true.
Just focus on __bpf_get_stack(). We could factor out __bpf_get_stackid(),
but unless we have a bug, I don't see that it is necessary.
It would be good if you could add a test for the change. There is a
stacktrace test, prog_tests/stacktrace_map.c; you can take a look and
add a subtest there.
Next time, please submit a formal patch with `git send-email ...` to
this list. That way it is easier to review than an attachment.
* Re: using skip>0 with bpf_get_stack()
2021-06-29 3:33 ` Yonghong Song
@ 2022-03-04 20:37 ` Namhyung Kim
2022-03-04 20:50 ` Eugene Loh
0 siblings, 1 reply; 7+ messages in thread
From: Namhyung Kim @ 2022-03-04 20:37 UTC (permalink / raw)
To: Eugene Loh; +Cc: Yonghong Song, bpf
Hello,
On Mon, Jun 28, 2021 at 08:33:11PM -0700, Yonghong Song wrote:
>
>
> On 6/25/21 6:22 PM, Eugene Loh wrote:
> >
> > On 6/1/21 5:48 PM, Yonghong Song wrote:
> > > Could you submit a patch for this? Thanks!
> >
> > Sure. Thanks for looking at this and sorry about the long delay getting
> > back to you.
> >
> > Could you take a look at the attached, proposed patch? As you see in
> > the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
> > They might use an earlier construct callchain, and I do not know ho
> > init_nr was set for them.
>
> I think the bpf_get_stackid() and __bpf_get_stackid() implementations
> are correct. Did you find any issues?
>
> For bpf_get_stack_pe(), see:
>
> https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
> I think you should not change the bpf_get_stack() function.
> __bpf_get_stack() is used by both bpf_get_stack() and bpf_get_stack_pe().
> In bpf_get_stack_pe(), the callchain is fetched by the perf event
> infrastructure if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY
> is true.
>
> Just focus on __bpf_get_stack(). We could factor out __bpf_get_stackid(),
> but unless we have a bug, I don't see that it is necessary.
>
> It would be good if you could add a test for the change. There is a
> stacktrace test, prog_tests/stacktrace_map.c; you can take a look and
> add a subtest there.
>
> Next time, please submit a formal patch with `git send-email ...` to
> this list. That way it is easier to review than an attachment.
Any updates on this? I'm hitting the same issue and found this before
sending a fix.
Thanks,
Namhyung
* Re: using skip>0 with bpf_get_stack()
2022-03-04 20:37 ` Namhyung Kim
@ 2022-03-04 20:50 ` Eugene Loh
2022-03-04 21:16 ` Namhyung Kim
0 siblings, 1 reply; 7+ messages in thread
From: Eugene Loh @ 2022-03-04 20:50 UTC (permalink / raw)
To: Namhyung Kim; +Cc: Yonghong Song, bpf
No update from me. A fix would be great.
On 3/4/22 3:37 PM, Namhyung Kim wrote:
> Hello,
>
> On Mon, Jun 28, 2021 at 08:33:11PM -0700, Yonghong Song wrote:
>>
>> On 6/25/21 6:22 PM, Eugene Loh wrote:
>>> On 6/1/21 5:48 PM, Yonghong Song wrote:
>>>> Could you submit a patch for this? Thanks!
>>> Sure. Thanks for looking at this, and sorry about the long delay getting
>>> back to you.
>>>
>>> Could you take a look at the attached proposed patch? As you can see in
>>> the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
>>> They might use an earlier-constructed callchain, and I do not know how
>>> init_nr was set for them.
>> I think the bpf_get_stackid() and __bpf_get_stackid() implementations
>> are correct. Did you find any issues?
>>
>> For bpf_get_stack_pe(), see:
>>
>> https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
>> I think you should not change the bpf_get_stack() function.
>> __bpf_get_stack() is used by both bpf_get_stack() and bpf_get_stack_pe().
>> In bpf_get_stack_pe(), the callchain is fetched by the perf event
>> infrastructure if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY
>> is true.
>>
>> Just focus on __bpf_get_stack(). We could factor out __bpf_get_stackid(),
>> but unless we have a bug, I don't see that it is necessary.
>>
>> It would be good if you could add a test for the change. There is a
>> stacktrace test, prog_tests/stacktrace_map.c; you can take a look and
>> add a subtest there.
>>
>> Next time, please submit a formal patch with `git send-email ...` to
>> this list. That way it is easier to review than an attachment.
> Any updates on this? I'm hitting the same issue and found this before
> sending a fix.
>
> Thanks,
> Namhyung
* Re: using skip>0 with bpf_get_stack()
2022-03-04 20:50 ` Eugene Loh
@ 2022-03-04 21:16 ` Namhyung Kim
0 siblings, 0 replies; 7+ messages in thread
From: Namhyung Kim @ 2022-03-04 21:16 UTC (permalink / raw)
To: Eugene Loh; +Cc: Yonghong Song, bpf
On Fri, Mar 4, 2022 at 12:50 PM Eugene Loh <eugene.loh@oracle.com> wrote:
>
> No update from me. A fix would be great.
Ok, I will think about the _pe variants more.
Thanks,
Namhyung
Thread overview: 7+ messages
2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
2021-06-01 21:48 ` Yonghong Song
2021-06-26 1:22 ` Eugene Loh
2021-06-29 3:33 ` Yonghong Song
2022-03-04 20:37 ` Namhyung Kim
2022-03-04 20:50 ` Eugene Loh
2022-03-04 21:16 ` Namhyung Kim