* using skip>0 with bpf_get_stack()
@ 2021-05-28 22:16 Eugene Loh
  2021-06-01 21:48 ` Yonghong Song
  0 siblings, 1 reply; 7+ messages in thread

From: Eugene Loh @ 2021-05-28 22:16 UTC (permalink / raw)
To: bpf

I have a question about bpf_get_stack().  I'm interested in the case

        skip > 0
        user_build_id == 0
        num_elem < sysctl_perf_event_max_stack

The function sets

        init_nr = sysctl_perf_event_max_stack - num_elem;

which means that get_perf_callchain() will return "num_elem" stack
frames.  Then, since we skip "skip" frames, we fill the user buffer
with only "num_elem - skip" frames, the remaining frames being
zero-filled.

For example, let's say the call stack is

        leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6

Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask for
skip==2.  I would expect to skip 2 frames and then get 4 frames back:

        foo1 foo2 foo3 foo4

Instead, I get

        foo1 foo2 0 0

skipping 2 frames but also leaving frames zeroed out.

I think the init_nr computation should be:

-       if (sysctl_perf_event_max_stack < num_elem)
+       if (sysctl_perf_event_max_stack <= num_elem + skip)
                init_nr = 0;
        else
-               init_nr = sysctl_perf_event_max_stack - num_elem;
+               init_nr = sysctl_perf_event_max_stack - num_elem - skip;

Incidentally, the return value of the function is presumably the size
of the returned data.  Would it make sense to say so in
include/uapi/linux/bpf.h?

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: using skip>0 with bpf_get_stack()
  2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
@ 2021-06-01 21:48 ` Yonghong Song
  2021-06-26  1:22   ` Eugene Loh
  0 siblings, 1 reply; 7+ messages in thread

From: Yonghong Song @ 2021-06-01 21:48 UTC (permalink / raw)
To: Eugene Loh, bpf

On 5/28/21 3:16 PM, Eugene Loh wrote:
> I have a question about bpf_get_stack().  I'm interested in the case
>         skip > 0
>         user_build_id == 0
>         num_elem < sysctl_perf_event_max_stack
>
> The function sets
>         init_nr = sysctl_perf_event_max_stack - num_elem;
> which means that get_perf_callchain() will return "num_elem" stack
> frames.  Then, since we skip "skip" frames, we fill the user buffer
> with only "num_elem - skip" frames, the remaining frames being
> zero-filled.
>
> For example, let's say the call stack is
>         leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>
> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask for
> skip==2.  I would expect to skip 2 frames and then get 4 frames back:
>         foo1 foo2 foo3 foo4
>
> Instead, I get
>         foo1 foo2 0 0
> skipping 2 frames but also leaving frames zeroed out.

Thanks for reporting. I looked at the code and it does seem that we may
have a kernel bug when skip != 0. Basically, as you described, we
initially collect num_elem frames and later reduce that by skip, so we
end up with num_elem - skip frames, as you calculated above.

> I think the init_nr computation should be:
>
> -       if (sysctl_perf_event_max_stack < num_elem)
> +       if (sysctl_perf_event_max_stack <= num_elem + skip)
>                 init_nr = 0;
>         else
> -               init_nr = sysctl_perf_event_max_stack - num_elem;
> +               init_nr = sysctl_perf_event_max_stack - num_elem - skip;

A rough check suggests this is one correct way to fix the issue.

> Incidentally, the return value of the function is presumably the size
> of the returned data.  Would it make sense to say so in
> include/uapi/linux/bpf.h?

The current documentation says:
 *	Return
 *		A non-negative value equal to or less than *size* on success,
 *		or a negative error in case of failure.

I think you can improve it with a more precise description, such as:
a non-negative value for the copied **buf** length.

Could you submit a patch for this? Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: using skip>0 with bpf_get_stack()
  2021-06-01 21:48 ` Yonghong Song
@ 2021-06-26  1:22   ` Eugene Loh
  2021-06-29  3:33     ` Yonghong Song
  0 siblings, 1 reply; 7+ messages in thread

From: Eugene Loh @ 2021-06-26  1:22 UTC (permalink / raw)
To: bpf

[-- Attachment #1: Type: text/plain, Size: 2708 bytes --]

On 6/1/21 5:48 PM, Yonghong Song wrote:
>
> On 5/28/21 3:16 PM, Eugene Loh wrote:
>> I have a question about bpf_get_stack().  I'm interested in the case
>>         skip > 0
>>         user_build_id == 0
>>         num_elem < sysctl_perf_event_max_stack
>>
>> The function sets
>>         init_nr = sysctl_perf_event_max_stack - num_elem;
>> which means that get_perf_callchain() will return "num_elem" stack
>> frames.  Then, since we skip "skip" frames, we fill the user buffer
>> with only "num_elem - skip" frames, the remaining frames being
>> zero-filled.
>>
>> For example, let's say the call stack is
>>         leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>
>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask for
>> skip==2.  I would expect to skip 2 frames and then get 4 frames back:
>>         foo1 foo2 foo3 foo4
>>
>> Instead, I get
>>         foo1 foo2 0 0
>> skipping 2 frames but also leaving frames zeroed out.
>
> Thanks for reporting. I looked at the code and it does seem that we may
> have a kernel bug when skip != 0. Basically, as you described, we
> initially collect num_elem frames and later reduce that by skip, so we
> end up with num_elem - skip frames, as you calculated above.
>
>> I think the init_nr computation should be:
>>
>> -       if (sysctl_perf_event_max_stack < num_elem)
>> +       if (sysctl_perf_event_max_stack <= num_elem + skip)
>>                 init_nr = 0;
>>         else
>> -               init_nr = sysctl_perf_event_max_stack - num_elem;
>> +               init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>
> A rough check suggests this is one correct way to fix the issue.
>
>> Incidentally, the return value of the function is presumably the size
>> of the returned data.

>> Would it make sense to say so in
>> include/uapi/linux/bpf.h?
>
> The current documentation says:
>  *	Return
>  *		A non-negative value equal to or less than *size* on success,
>  *		or a negative error in case of failure.
>
> I think you can improve it with a more precise description, such as:
> a non-negative value for the copied **buf** length.
>
> Could you submit a patch for this? Thanks!

Sure.  Thanks for looking at this, and sorry about the long delay
getting back to you.

Could you take a look at the attached proposed patch?  As you see in
the commit message, I'm unclear about the bpf_get_stack*_pe()
variants.  They might use an earlier-constructed callchain, and I do
not know how init_nr was set for them.

[-- Attachment #2: 0001-bpf-Adjust-BPF-stack-helper-functions-to-accommodate.patch --]
[-- Type: text/plain, Size: 4186 bytes --]

From 70da9057d7fa7dda76e1b0861b8a0174078434ea Mon Sep 17 00:00:00 2001
From: Eugene Loh <eugene.loh@oracle.com>
Date: Fri, 25 Jun 2021 15:18:46 -0700
Subject: [PATCH] bpf: Adjust BPF stack helper functions to accommodate skip>0

Let's say that the caller has storage for num_elem stack frames.  Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip > 0, one keeps only num_elem - skip frames.

Change the computation of init_nr so that num_elem + skip stack frames
are walked (and hence num_elem frames are kept).

[NB: I am unsure of the bpf_get_stack*_pe() variants, which in the case
of __PERF_SAMPLE_CALLCHAIN_EARLY use ctx->data->callchain, which was
walked earlier.  I am unclear how init_nr was chosen for it.]

Change the comment on bpf_get_stack() in the header file to be more
explicit about what the return value means.
Signed-off-by: Eugene Loh <eugene.loh@oracle.com>
---
 include/uapi/linux/bpf.h |  4 ++--
 kernel/bpf/stackmap.c    | 26 +++++++++++++++-----------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 79c893310492..7c7b93e1db90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2183,8 +2183,8 @@ union bpf_attr {
 *
 *			# sysctl kernel.perf_event_max_stack=<new value>
 *	Return
-*		A non-negative value equal to or less than *size* on success,
-*		or a negative error in case of failure.
+*		The non-negative copied *buf* length equal to or less than
+*		*size* on success, or a negative error in case of failure.
 *
 * long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
 *	Description
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index be35bfb7fb13..e2a193581550 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -249,23 +249,30 @@ get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
 #endif
 }
 
+static u32 get_init_nr(u32 nelem, u64 flags)
+{
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+
+	if (sysctl_perf_event_max_stack <= nelem + skip)
+		return 0;
+	else
+		return sysctl_perf_event_max_stack - nelem - skip;
+}
+
 static long __bpf_get_stackid(struct bpf_map *map,
			      struct perf_callchain_entry *trace, u64 flags)
 {
	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
	u32 max_depth = map->value_size / stack_map_data_size(map);
-	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
-	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+	u32 init_nr;
	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
	u32 hash, id, trace_nr, trace_len;
	bool user = flags & BPF_F_USER_STACK;
	u64 *ips;
	bool hash_matches;
 
-	/* get_perf_callchain() guarantees that trace->nr >= init_nr
-	 * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
-	 */
+	init_nr = get_init_nr(max_depth, flags);
	trace_nr = trace->nr - init_nr;
 
	if (trace_nr <= skip)
@@ -331,8 +338,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
	   u64, flags)
 {
	u32 max_depth = map->value_size / stack_map_data_size(map);
-	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
-	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+	u32 init_nr;
	bool user = flags & BPF_F_USER_STACK;
	struct perf_callchain_entry *trace;
	bool kernel = !user;
@@ -341,6 +347,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
		       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
		return -EINVAL;
 
+	init_nr = get_init_nr(max_depth, flags);
	trace = get_perf_callchain(regs, init_nr, kernel, user,
				   sysctl_perf_event_max_stack, false, false);
 
@@ -458,10 +465,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
		goto err_fault;
 
	num_elem = size / elem_size;
-	if (sysctl_perf_event_max_stack < num_elem)
-		init_nr = 0;
-	else
-		init_nr = sysctl_perf_event_max_stack - num_elem;
+	init_nr = get_init_nr(num_elem, flags);
 
	if (trace_in)
		trace = trace_in;
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: using skip>0 with bpf_get_stack()
  2021-06-26  1:22 ` Eugene Loh
@ 2021-06-29  3:33   ` Yonghong Song
  2022-03-04 20:37     ` Namhyung Kim
  0 siblings, 1 reply; 7+ messages in thread

From: Yonghong Song @ 2021-06-29  3:33 UTC (permalink / raw)
To: Eugene Loh, bpf

On 6/25/21 6:22 PM, Eugene Loh wrote:
>
> On 6/1/21 5:48 PM, Yonghong Song wrote:
>>
>> On 5/28/21 3:16 PM, Eugene Loh wrote:
>>> I have a question about bpf_get_stack().  I'm interested in the case
>>>         skip > 0
>>>         user_build_id == 0
>>>         num_elem < sysctl_perf_event_max_stack
>>>
>>> The function sets
>>>         init_nr = sysctl_perf_event_max_stack - num_elem;
>>> which means that get_perf_callchain() will return "num_elem" stack
>>> frames.  Then, since we skip "skip" frames, we fill the user buffer
>>> with only "num_elem - skip" frames, the remaining frames being
>>> zero-filled.
>>>
>>> For example, let's say the call stack is
>>>         leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>>
>>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask for
>>> skip==2.  I would expect to skip 2 frames and then get 4 frames back:
>>>         foo1 foo2 foo3 foo4
>>>
>>> Instead, I get
>>>         foo1 foo2 0 0
>>> skipping 2 frames but also leaving frames zeroed out.
>>
>> Thanks for reporting. I looked at the code and it does seem that we may
>> have a kernel bug when skip != 0. Basically, as you described, we
>> initially collect num_elem frames and later reduce that by skip, so we
>> end up with num_elem - skip frames, as you calculated above.
>>
>>> I think the init_nr computation should be:
>>>
>>> -       if (sysctl_perf_event_max_stack < num_elem)
>>> +       if (sysctl_perf_event_max_stack <= num_elem + skip)
>>>                 init_nr = 0;
>>>         else
>>> -               init_nr = sysctl_perf_event_max_stack - num_elem;
>>> +               init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>>
>> A rough check suggests this is one correct way to fix the issue.
>>
>>> Incidentally, the return value of the function is presumably the size
>>> of the returned data.  Would it make sense to say so in
>>> include/uapi/linux/bpf.h?
>>
>> The current documentation says:
>>  *	Return
>>  *		A non-negative value equal to or less than *size* on success,
>>  *		or a negative error in case of failure.
>>
>> I think you can improve it with a more precise description, such as:
>> a non-negative value for the copied **buf** length.
>>
>> Could you submit a patch for this? Thanks!
>
> Sure.  Thanks for looking at this, and sorry about the long delay
> getting back to you.
>
> Could you take a look at the attached proposed patch?  As you see in
> the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
> They might use an earlier-constructed callchain, and I do not know how
> init_nr was set for them.

I think the bpf_get_stackid() and __bpf_get_stackid() implementations
are correct. Did you find any issues?

For bpf_get_stack_pe(), see:

  https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/

I think you should not change the bpf_get_stack() function.
__bpf_get_stack() is used by bpf_get_stack() and bpf_get_stack_pe().
In bpf_get_stack_pe(), the callchain is fetched by the perf event
infrastructure if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY
is true.

Just focus on __bpf_get_stack(). We could factor __bpf_get_stackid(),
but unless we have a bug, I don't see that it is necessary.

It would be good if you can add a test for the change. There is a
stacktrace test, prog_tests/stacktrace_map.c; you can take a look and
add a subtest there.

Next time, you can submit a formal patch with `git send-email ...` to
this alias. That way it is easier to review than an attachment.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: using skip>0 with bpf_get_stack()
  2021-06-29  3:33 ` Yonghong Song
@ 2022-03-04 20:37   ` Namhyung Kim
  2022-03-04 20:50     ` Eugene Loh
  0 siblings, 1 reply; 7+ messages in thread

From: Namhyung Kim @ 2022-03-04 20:37 UTC (permalink / raw)
To: Eugene Loh; +Cc: Yonghong Song, bpf

Hello,

On Mon, Jun 28, 2021 at 08:33:11PM -0700, Yonghong Song wrote:
>
> On 6/25/21 6:22 PM, Eugene Loh wrote:
>>
>> On 6/1/21 5:48 PM, Yonghong Song wrote:
>>> Could you submit a patch for this? Thanks!
>>
>> Sure.  Thanks for looking at this, and sorry about the long delay
>> getting back to you.
>>
>> Could you take a look at the attached proposed patch?  As you see in
>> the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
>> They might use an earlier-constructed callchain, and I do not know how
>> init_nr was set for them.
>
> I think the bpf_get_stackid() and __bpf_get_stackid() implementations
> are correct. Did you find any issues?
>
> For bpf_get_stack_pe(), see:
>
>   https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
>
> I think you should not change the bpf_get_stack() function.
> __bpf_get_stack() is used by bpf_get_stack() and bpf_get_stack_pe().
> In bpf_get_stack_pe(), the callchain is fetched by the perf event
> infrastructure if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY
> is true.
>
> Just focus on __bpf_get_stack(). We could factor __bpf_get_stackid(),
> but unless we have a bug, I don't see that it is necessary.
>
> It would be good if you can add a test for the change. There is a
> stacktrace test, prog_tests/stacktrace_map.c; you can take a look and
> add a subtest there.
>
> Next time, you can submit a formal patch with `git send-email ...` to
> this alias. That way it is easier to review than an attachment.

Any updates on this?  I'm hitting the same issue and found this thread
before sending a fix.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: using skip>0 with bpf_get_stack()
  2022-03-04 20:37 ` Namhyung Kim
@ 2022-03-04 20:50   ` Eugene Loh
  2022-03-04 21:16     ` Namhyung Kim
  0 siblings, 1 reply; 7+ messages in thread

From: Eugene Loh @ 2022-03-04 20:50 UTC (permalink / raw)
To: Namhyung Kim; +Cc: Yonghong Song, bpf

No update from me.  A fix would be great.

On 3/4/22 3:37 PM, Namhyung Kim wrote:
> Hello,
>
> On Mon, Jun 28, 2021 at 08:33:11PM -0700, Yonghong Song wrote:
>>
>> On 6/25/21 6:22 PM, Eugene Loh wrote:
>>> On 6/1/21 5:48 PM, Yonghong Song wrote:
>>>> Could you submit a patch for this? Thanks!
>>> Sure.  Thanks for looking at this, and sorry about the long delay
>>> getting back to you.
>>>
>>> Could you take a look at the attached proposed patch?  As you see in
>>> the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
>>> They might use an earlier-constructed callchain, and I do not know how
>>> init_nr was set for them.
>> I think the bpf_get_stackid() and __bpf_get_stackid() implementations
>> are correct. Did you find any issues?
>>
>> For bpf_get_stack_pe(), see:
>>
>>   https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com/
>>
>> I think you should not change the bpf_get_stack() function.
>> __bpf_get_stack() is used by bpf_get_stack() and bpf_get_stack_pe().
>> In bpf_get_stack_pe(), the callchain is fetched by the perf event
>> infrastructure if event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY
>> is true.
>>
>> Just focus on __bpf_get_stack(). We could factor __bpf_get_stackid(),
>> but unless we have a bug, I don't see that it is necessary.
>>
>> It would be good if you can add a test for the change. There is a
>> stacktrace test, prog_tests/stacktrace_map.c; you can take a look and
>> add a subtest there.
>>
>> Next time, you can submit a formal patch with `git send-email ...` to
>> this alias. That way it is easier to review than an attachment.
>
> Any updates on this?  I'm hitting the same issue and found this thread
> before sending a fix.
>
> Thanks,
> Namhyung

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: using skip>0 with bpf_get_stack()
  2022-03-04 20:50 ` Eugene Loh
@ 2022-03-04 21:16   ` Namhyung Kim
  0 siblings, 0 replies; 7+ messages in thread

From: Namhyung Kim @ 2022-03-04 21:16 UTC (permalink / raw)
To: Eugene Loh; +Cc: Yonghong Song, bpf

On Fri, Mar 4, 2022 at 12:50 PM Eugene Loh <eugene.loh@oracle.com> wrote:
>
> No update from me.  A fix would be great.

Ok, I will think about the _pe variants more.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-03-04 21:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
2021-06-01 21:48 ` Yonghong Song
2021-06-26  1:22   ` Eugene Loh
2021-06-29  3:33     ` Yonghong Song
2022-03-04 20:37       ` Namhyung Kim
2022-03-04 20:50         ` Eugene Loh
2022-03-04 21:16           ` Namhyung Kim