Re: using skip>0 with bpf_get_stack()

From: Yonghong Song <yhs@fb.com>
To: Eugene Loh <eugene.loh@oracle.com>, <bpf@vger.kernel.org>
Subject: Re: using skip>0 with bpf_get_stack()
Date: Tue, 1 Jun 2021 14:48:50 -0700	[thread overview]
Message-ID: <c65ac449-ec54-3dff-5447-8a318001285b@fb.com> (raw)
In-Reply-To: <30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com>

On 5/28/21 3:16 PM, Eugene Loh wrote:
> I have a question about bpf_get_stack().  I'm interested in the case
>          skip > 0
>          user_build_id == 0
>          num_elem < sysctl_perf_event_max_stack
> 
> The function sets
>          init_nr = sysctl_perf_event_max_stack - num_elem;
> which means that get_perf_callchain() will return "num_elem" stack 
> frames.  Then, since we skip "skip" frames, we'll fill the user buffer 
> with only "num_elem - skip" frames, the remaining frames being filled zero.
> 
> For example, let's say the call stack is
>          leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
> 
> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask 
> skip==2.  I would expect to skip 2 frames then get 4 frames, getting back:
>          foo1  foo2  foo3  foo4
> 
> Instead, I get
>          foo1  foo2  0  0
> skipping 2 frames but also leaving frames zeroed out.

Thanks for reporting. I looked at codes and it does seem that we may
have a kernel bug when skip != 0. Basically as you described,
initially we collected num_elem stacks and later on we reduced by skip
so we got num_elem - skip as you calculated in the above.

> 
> I think the init_nr computation should be:
> 
> -       if (sysctl_perf_event_max_stack < num_elem)
> +       if (sysctl_perf_event_max_stack <= num_elem + skip)
>                  init_nr = 0;
>          else
> -               init_nr = sysctl_perf_event_max_stack - num_elem;
> +               init_nr = sysctl_perf_event_max_stack - num_elem - skip;

A rough check looks like this is one correct way to fix the issue.

> Incidentally, the return value of the function is presumably the size of 
> the returned data.  Would it make sense to say so in 
> include/uapi/linux/bpf.h?

The current documentation says:
  *      Return
  *              A non-negative value equal to or less than *size* on 
success,
  *              or a negative error in case of failure.

I think you can improve with more precise description such that
a non-negative value for the copied **buf** length.

Could you submit a patch for this? Thanks!