All of lore.kernel.org
 help / color / mirror / Atom feed
* perf event and data_sz
@ 2020-08-26 14:11 Borna Cafuk
  2020-08-26 18:45 ` Yonghong Song
  0 siblings, 1 reply; 5+ messages in thread
From: Borna Cafuk @ 2020-08-26 14:11 UTC (permalink / raw)
  To: bpf; +Cc: Luka Perkov, Juraj Vijtiuk, Jakov Petrina

Hi everyone,

When examining BPF programs that use perf buffers, I have noticed that
the `data_sz` parameter in the sample callback has a different value than
the size passed to `bpf_perf_event_output`.

Raw samples are padded in `perf_prepare_sample` so their size is a multiple of
`sizeof(u64)` (see [0]). The sample includes the size header, a `u32`.
The size stored in `raw->size` is then size of the sample, minus the 4 bytes for
the size header. This size, however, includes both the data and the padding.

What I find confusing is that this size including the padding is eventually
passed as `data_sz` to the callback in the userspace, instead of
the size of the data that was passed as an argument to `bpf_perf_event_output`.

Is this intended behaviour?

I have a use-case for getting only the size of the data in the
userspace, could this be done?

To demonstrate, I have prepared a minimal example by modifying
BCC's filelife example. It uses a kprobe on vfs_unlink to print some sizes
every time a file is unlinked. The sizes are:
 * the `sizeof(struct event)` measured in the userspace program,
 * the `sizeof(struct event)` measured in the BPF program, and
 * the `data_sz` parameter.

The first two are 62, as expected, but `data_sz` is 68.
The 62 bytes of the struct and the 4 bytes of the sample header make 66 bytes.
This is rounded up to the first multiple of 8, which is 72.
The 4 bytes for the size header are then subtracted,
and 68 is written as the data size.

Any input is much appreciated,

Best regards,
Borna Cafuk


[0] https://github.com/torvalds/linux/blob/6a9dc5fd6170d0a41c8a14eb19e63d94bea5705a/kernel/events/core.c#L7035


example.h
--------------------------------
#ifndef __EXAMPLE_H
#define __EXAMPLE_H

struct __attribute__((__packed__)) event {
    __u16 size;
    char filler[60];
};

#endif /* __EXAMPLE_H */


example.bpf.c
--------------------------------
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "example.h"

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} events SEC(".maps");

SEC("kprobe/vfs_unlink")
int BPF_KPROBE(kprobe__vfs_unlink, struct inode *dir, struct dentry *dentry)
{
    struct event event = {};
    event.size = sizeof(struct event);

    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                          &event, sizeof(struct event));
    return 0;
}

char LICENSE[] SEC("license") = "GPL";


example.c
--------------------------------
#include <stdio.h>
#include <bpf/libbpf.h>
#include <sys/resource.h>
#include "example.h"
#include "example.skel.h"

#define PERF_BUFFER_PAGES    16
#define PERF_POLL_TIMEOUT_MS    100

void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
{
    const struct event *e = data;

    printf("Userspace: %u | BPF: %zu | data_sz: %u \n",
           e->size, sizeof(struct event), data_sz);
}

void handle_lost_events(void *ctx, int cpu, __u64 lost_cnt)
{
    fprintf(stderr, "lost %llu events on CPU #%d\n", lost_cnt, cpu);
}

int main(int argc, char **argv)
{
    struct perf_buffer_opts pb_opts;
    struct perf_buffer *pb = NULL;
    struct example_bpf *obj;
    int err;

    struct rlimit rlim_new = {
        .rlim_cur    = RLIM_INFINITY,
        .rlim_max    = RLIM_INFINITY,
    };
    err = setrlimit(RLIMIT_MEMLOCK, &rlim_new);
    if (err) {
        fprintf(stderr, "failed to increase rlimit: %d\n", err);
        return 1;
    }

    obj = example_bpf__open();
    if (!obj) {
        fprintf(stderr, "failed to open and/or load BPF object\n");
        return 1;
    }

    err = example_bpf__load(obj);
    if (err) {
        fprintf(stderr, "failed to load BPF object: %d\n", err);
        goto cleanup;
    }

    err = example_bpf__attach(obj);
    if (err) {
        fprintf(stderr, "failed to attach BPF programs\n");
        goto cleanup;
    }

    pb_opts.sample_cb = handle_event;
    pb_opts.lost_cb = handle_lost_events;
    pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
                          &pb_opts);
    err = libbpf_get_error(pb);
    if (err) {
        pb = NULL;
        fprintf(stderr, "failed to open perf buffer: %d\n", err);
        goto cleanup;
    }

    while ((err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS)) >= 0)
        ;
    fprintf(stderr, "error polling perf buffer: %d\n", err);

    cleanup:
    perf_buffer__free(pb);
    example_bpf__destroy(obj);

    return err != 0;
}

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: perf event and data_sz
  2020-08-26 14:11 perf event and data_sz Borna Cafuk
@ 2020-08-26 18:45 ` Yonghong Song
  2020-08-27 10:01   ` Borna Cafuk
  0 siblings, 1 reply; 5+ messages in thread
From: Yonghong Song @ 2020-08-26 18:45 UTC (permalink / raw)
  To: Borna Cafuk, bpf; +Cc: Luka Perkov, Juraj Vijtiuk, Jakov Petrina



On 8/26/20 7:11 AM, Borna Cafuk wrote:
> Hi everyone,
> 
> When examining BPF programs that use perf buffers, I have noticed that
> the `data_sz` parameter in the sample callback has a different value than
> the size passed to `bpf_perf_event_output`.
> 
> Raw samples are padded in `perf_prepare_sample` so their size is a multiple of
> `sizeof(u64)` (see [0]). The sample includes the size header, a `u32`.
> The size stored in `raw->size` is then size of the sample, minus the 4 bytes for
> the size header. This size, however, includes both the data and the padding.
> 
> What I find confusing is that this size including the padding is eventually
> passed as `data_sz` to the callback in the userspace, instead of
> the size of the data that was passed as an argument to `bpf_perf_event_output`.
> 
> Is this intended behaviour?

 From the kernel source code, yes, this is expected behavior. What you 
described below matches what the kernel did. So raw->size = 68 is expected.

> 
> I have a use-case for getting only the size of the data in the
> userspace, could this be done?

In this case, since we know the kernel writes one record at a time,
you check the size, it is 68 more than 62, you just read 62 bytes
as your real data, ignore the rest as the padding. Does this work?

bcc callback passed the the buffer with raw->size to application.
But applications are expected to know what the record layout is...

> 
> To demonstrate, I have prepared a minimal example by modifying
> BCC's filelife example. It uses a kprobe on vfs_unlink to print some sizes
> every time a file is unlinked. The sizes are:
>   * the `sizeof(struct event)` measured in the userspace program,
>   * the `sizeof(struct event)` measured in the BPF program, and
>   * the `data_sz` parameter.
> 
> The first two are 62, as expected, but `data_sz` is 68.
> The 62 bytes of the struct and the 4 bytes of the sample header make 66 bytes.
> This is rounded up to the first multiple of 8, which is 72.
> The 4 bytes for the size header are then subtracted,
> and 68 is written as the data size.
> 
> Any input is much appreciated,
> 
> Best regards,
> Borna Cafuk
> 
> 
> [0] https://github.com/torvalds/linux/blob/6a9dc5fd6170d0a41c8a14eb19e63d94bea5705a/kernel/events/core.c#L7035
> 
> 
> example.h
> --------------------------------
> #ifndef __EXAMPLE_H
> #define __EXAMPLE_H
> 
> struct __attribute__((__packed__)) event {
>      __u16 size;
>      char filler[60];
> };
> 
> #endif /* __EXAMPLE_H */
> 
> 
> example.bpf.c
> --------------------------------
> #include "vmlinux.h"
> #include <bpf/bpf_helpers.h>
> #include <bpf/bpf_tracing.h>
> #include "example.h"
> 
> struct {
>      __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
>      __uint(key_size, sizeof(u32));
>      __uint(value_size, sizeof(u32));
> } events SEC(".maps");
> 
> SEC("kprobe/vfs_unlink")
> int BPF_KPROBE(kprobe__vfs_unlink, struct inode *dir, struct dentry *dentry)
> {
>      struct event event = {};
>      event.size = sizeof(struct event);
> 
>      bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
>                            &event, sizeof(struct event));
>      return 0;
> }
> 
> char LICENSE[] SEC("license") = "GPL";
> 
> 
> example.c
> --------------------------------
> #include <stdio.h>
> #include <bpf/libbpf.h>
> #include <sys/resource.h>
> #include "example.h"
> #include "example.skel.h"
> 
> #define PERF_BUFFER_PAGES    16
> #define PERF_POLL_TIMEOUT_MS    100
> 
> void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
> {
>      const struct event *e = data;
> 
>      printf("Userspace: %u | BPF: %zu | data_sz: %u \n",
>             e->size, sizeof(struct event), data_sz);
> }
> 
> void handle_lost_events(void *ctx, int cpu, __u64 lost_cnt)
> {
>      fprintf(stderr, "lost %llu events on CPU #%d\n", lost_cnt, cpu);
> }
> 
> int main(int argc, char **argv)
> {
>      struct perf_buffer_opts pb_opts;
>      struct perf_buffer *pb = NULL;
>      struct example_bpf *obj;
>      int err;
> 
>      struct rlimit rlim_new = {
>          .rlim_cur    = RLIM_INFINITY,
>          .rlim_max    = RLIM_INFINITY,
>      };
>      err = setrlimit(RLIMIT_MEMLOCK, &rlim_new);
>      if (err) {
>          fprintf(stderr, "failed to increase rlimit: %d\n", err);
>          return 1;
>      }
> 
>      obj = example_bpf__open();
>      if (!obj) {
>          fprintf(stderr, "failed to open and/or load BPF object\n");
>          return 1;
>      }
> 
>      err = example_bpf__load(obj);
>      if (err) {
>          fprintf(stderr, "failed to load BPF object: %d\n", err);
>          goto cleanup;
>      }
> 
>      err = example_bpf__attach(obj);
>      if (err) {
>          fprintf(stderr, "failed to attach BPF programs\n");
>          goto cleanup;
>      }
> 
>      pb_opts.sample_cb = handle_event;
>      pb_opts.lost_cb = handle_lost_events;
>      pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
>                            &pb_opts);
>      err = libbpf_get_error(pb);
>      if (err) {
>          pb = NULL;
>          fprintf(stderr, "failed to open perf buffer: %d\n", err);
>          goto cleanup;
>      }
> 
>      while ((err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS)) >= 0)
>          ;
>      fprintf(stderr, "error polling perf buffer: %d\n", err);
> 
>      cleanup:
>      perf_buffer__free(pb);
>      example_bpf__destroy(obj);
> 
>      return err != 0;
> }
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: perf event and data_sz
  2020-08-26 18:45 ` Yonghong Song
@ 2020-08-27 10:01   ` Borna Cafuk
  2020-08-27 16:04     ` Yonghong Song
  2020-09-02  0:29     ` Andrii Nakryiko
  0 siblings, 2 replies; 5+ messages in thread
From: Borna Cafuk @ 2020-08-27 10:01 UTC (permalink / raw)
  To: Yonghong Song; +Cc: bpf, Luka Perkov, Juraj Vijtiuk, Jakov Petrina

On Wed, Aug 26, 2020 at 8:45 PM Yonghong Song <yhs@fb.com> wrote:
>
> On 8/26/20 7:11 AM, Borna Cafuk wrote:
> > Hi everyone,
> >
> > When examining BPF programs that use perf buffers, I have noticed that
> > the `data_sz` parameter in the sample callback has a different value than
> > the size passed to `bpf_perf_event_output`.
> >
> > Raw samples are padded in `perf_prepare_sample` so their size is a multiple of
> > `sizeof(u64)` (see [0]). The sample includes the size header, a `u32`.
> > The size stored in `raw->size` is then size of the sample, minus the 4 bytes for
> > the size header. This size, however, includes both the data and the padding.
> >
> > What I find confusing is that this size including the padding is eventually
> > passed as `data_sz` to the callback in the userspace, instead of
> > the size of the data that was passed as an argument to `bpf_perf_event_output`.
> >
> > Is this intended behaviour?
>
>  From the kernel source code, yes, this is expected behavior. What you
> described below matches what the kernel did. So raw->size = 68 is expected.

I understand why this size that is stored in `raw->size` is needed.
What I don't see is how the value is of any use in the callback.

>
> >
> > I have a use-case for getting only the size of the data in the
> > userspace, could this be done?
>
> In this case, since we know the kernel writes one record at a time,
> you check the size, it is 68 more than 62, you just read 62 bytes
> as your real data, ignore the rest as the padding. Does this work?

The `data_sz` parameter seems a little pointless, then. What is its purpose
in the callback if it doesn't accurately represent the size of the data?

>
> bcc callback passed the the buffer with raw->size to application.
> But applications are expected to know what the record layout is...

I'm afraid that wouldn't work for the use-case, our application should be able
to read the raw data without having to know the record layout. It has to be
generic, we handle interpreting the records elsewhere and at another time.

>
> >
> > To demonstrate, I have prepared a minimal example by modifying
> > BCC's filelife example. It uses a kprobe on vfs_unlink to print some sizes
> > every time a file is unlinked. The sizes are:
> >   * the `sizeof(struct event)` measured in the userspace program,
> >   * the `sizeof(struct event)` measured in the BPF program, and
> >   * the `data_sz` parameter.
> >
> > The first two are 62, as expected, but `data_sz` is 68.
> > The 62 bytes of the struct and the 4 bytes of the sample header make 66 bytes.
> > This is rounded up to the first multiple of 8, which is 72.
> > The 4 bytes for the size header are then subtracted,
> > and 68 is written as the data size.
> >
> > Any input is much appreciated,
> >
> > Best regards,
> > Borna Cafuk
> >
> >
> > [0] https://github.com/torvalds/linux/blob/6a9dc5fd6170d0a41c8a14eb19e63d94bea5705a/kernel/events/core.c#L7035
> >
> >
> > example.h
> > --------------------------------
> > #ifndef __EXAMPLE_H
> > #define __EXAMPLE_H
> >
> > struct __attribute__((__packed__)) event {
> >      __u16 size;
> >      char filler[60];
> > };
> >
> > #endif /* __EXAMPLE_H */
> >
> >
> > example.bpf.c
> > --------------------------------
> > #include "vmlinux.h"
> > #include <bpf/bpf_helpers.h>
> > #include <bpf/bpf_tracing.h>
> > #include "example.h"
> >
> > struct {
> >      __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> >      __uint(key_size, sizeof(u32));
> >      __uint(value_size, sizeof(u32));
> > } events SEC(".maps");
> >
> > SEC("kprobe/vfs_unlink")
> > int BPF_KPROBE(kprobe__vfs_unlink, struct inode *dir, struct dentry *dentry)
> > {
> >      struct event event = {};
> >      event.size = sizeof(struct event);
> >
> >      bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
> >                            &event, sizeof(struct event));
> >      return 0;
> > }
> >
> > char LICENSE[] SEC("license") = "GPL";
> >
> >
> > example.c
> > --------------------------------
> > #include <stdio.h>
> > #include <bpf/libbpf.h>
> > #include <sys/resource.h>
> > #include "example.h"
> > #include "example.skel.h"
> >
> > #define PERF_BUFFER_PAGES    16
> > #define PERF_POLL_TIMEOUT_MS    100
> >
> > void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
> > {
> >      const struct event *e = data;
> >
> >      printf("Userspace: %u | BPF: %zu | data_sz: %u \n",
> >             e->size, sizeof(struct event), data_sz);
> > }
> >
> > void handle_lost_events(void *ctx, int cpu, __u64 lost_cnt)
> > {
> >      fprintf(stderr, "lost %llu events on CPU #%d\n", lost_cnt, cpu);
> > }
> >
> > int main(int argc, char **argv)
> > {
> >      struct perf_buffer_opts pb_opts;
> >      struct perf_buffer *pb = NULL;
> >      struct example_bpf *obj;
> >      int err;
> >
> >      struct rlimit rlim_new = {
> >          .rlim_cur    = RLIM_INFINITY,
> >          .rlim_max    = RLIM_INFINITY,
> >      };
> >      err = setrlimit(RLIMIT_MEMLOCK, &rlim_new);
> >      if (err) {
> >          fprintf(stderr, "failed to increase rlimit: %d\n", err);
> >          return 1;
> >      }
> >
> >      obj = example_bpf__open();
> >      if (!obj) {
> >          fprintf(stderr, "failed to open and/or load BPF object\n");
> >          return 1;
> >      }
> >
> >      err = example_bpf__load(obj);
> >      if (err) {
> >          fprintf(stderr, "failed to load BPF object: %d\n", err);
> >          goto cleanup;
> >      }
> >
> >      err = example_bpf__attach(obj);
> >      if (err) {
> >          fprintf(stderr, "failed to attach BPF programs\n");
> >          goto cleanup;
> >      }
> >
> >      pb_opts.sample_cb = handle_event;
> >      pb_opts.lost_cb = handle_lost_events;
> >      pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
> >                            &pb_opts);
> >      err = libbpf_get_error(pb);
> >      if (err) {
> >          pb = NULL;
> >          fprintf(stderr, "failed to open perf buffer: %d\n", err);
> >          goto cleanup;
> >      }
> >
> >      while ((err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS)) >= 0)
> >          ;
> >      fprintf(stderr, "error polling perf buffer: %d\n", err);
> >
> >      cleanup:
> >      perf_buffer__free(pb);
> >      example_bpf__destroy(obj);
> >
> >      return err != 0;
> > }
> >

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: perf event and data_sz
  2020-08-27 10:01   ` Borna Cafuk
@ 2020-08-27 16:04     ` Yonghong Song
  2020-09-02  0:29     ` Andrii Nakryiko
  1 sibling, 0 replies; 5+ messages in thread
From: Yonghong Song @ 2020-08-27 16:04 UTC (permalink / raw)
  To: Borna Cafuk
  Cc: bpf, Luka Perkov, Juraj Vijtiuk, Jakov Petrina, Peter Zijlstra



On 8/27/20 3:01 AM, Borna Cafuk wrote:
> On Wed, Aug 26, 2020 at 8:45 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> On 8/26/20 7:11 AM, Borna Cafuk wrote:
>>> Hi everyone,
>>>
>>> When examining BPF programs that use perf buffers, I have noticed that
>>> the `data_sz` parameter in the sample callback has a different value than
>>> the size passed to `bpf_perf_event_output`.
>>>
>>> Raw samples are padded in `perf_prepare_sample` so their size is a multiple of
>>> `sizeof(u64)` (see [0]). The sample includes the size header, a `u32`.
>>> The size stored in `raw->size` is then size of the sample, minus the 4 bytes for
>>> the size header. This size, however, includes both the data and the padding.
>>>
>>> What I find confusing is that this size including the padding is eventually
>>> passed as `data_sz` to the callback in the userspace, instead of
>>> the size of the data that was passed as an argument to `bpf_perf_event_output`.
>>>
>>> Is this intended behaviour?
>>
>>   From the kernel source code, yes, this is expected behavior. What you
>> described below matches what the kernel did. So raw->size = 68 is expected.
> 
> I understand why this size that is stored in `raw->size` is needed.
> What I don't see is how the value is of any use in the callback.
> 
>>
>>>
>>> I have a use-case for getting only the size of the data in the
>>> userspace, could this be done?
>>
>> In this case, since we know the kernel writes one record at a time,
>> you check the size, it is 68 more than 62, you just read 62 bytes
>> as your real data, ignore the rest as the padding. Does this work?
> 
> The `data_sz` parameter seems a little pointless, then. What is its purpose
> in the callback if it doesn't accurately represent the size of the data?

I agree since this is not exactly what user expected, user space will
need to calculate the "expected" data_sz and based on that to get to
what kind of record it is. Otherwise, the user needs to put some
info in the record itself to differentiate what it is and your 
intermediate layer just pass all data from the kernel to the application.

> 
>>
>> bcc callback passed the the buffer with raw->size to application.
>> But applications are expected to know what the record layout is...
> 
> I'm afraid that wouldn't work for the use-case, our application should be able
> to read the raw data without having to know the record layout. It has to be
> generic, we handle interpreting the records elsewhere and at another time.

I am added Peter Zijlstra in the cc list and hope he may have some
insight on how to tackle this.

> 
>>
>>>
>>> To demonstrate, I have prepared a minimal example by modifying
>>> BCC's filelife example. It uses a kprobe on vfs_unlink to print some sizes
>>> every time a file is unlinked. The sizes are:
>>>    * the `sizeof(struct event)` measured in the userspace program,
>>>    * the `sizeof(struct event)` measured in the BPF program, and
>>>    * the `data_sz` parameter.
>>>
>>> The first two are 62, as expected, but `data_sz` is 68.
>>> The 62 bytes of the struct and the 4 bytes of the sample header make 66 bytes.
>>> This is rounded up to the first multiple of 8, which is 72.
>>> The 4 bytes for the size header are then subtracted,
>>> and 68 is written as the data size.
>>>
>>> Any input is much appreciated,
>>>
>>> Best regards,
>>> Borna Cafuk
>>>
>>>
>>> [0] https://github.com/torvalds/linux/blob/6a9dc5fd6170d0a41c8a14eb19e63d94bea5705a/kernel/events/core.c#L7035
>>>
>>>
>>> example.h
>>> --------------------------------
>>> #ifndef __EXAMPLE_H
>>> #define __EXAMPLE_H
>>>
>>> struct __attribute__((__packed__)) event {
>>>       __u16 size;
>>>       char filler[60];
>>> };
>>>
>>> #endif /* __EXAMPLE_H */
>>>
>>>
>>> example.bpf.c
>>> --------------------------------
>>> #include "vmlinux.h"
>>> #include <bpf/bpf_helpers.h>
>>> #include <bpf/bpf_tracing.h>
>>> #include "example.h"
>>>
>>> struct {
>>>       __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
>>>       __uint(key_size, sizeof(u32));
>>>       __uint(value_size, sizeof(u32));
>>> } events SEC(".maps");
>>>
>>> SEC("kprobe/vfs_unlink")
>>> int BPF_KPROBE(kprobe__vfs_unlink, struct inode *dir, struct dentry *dentry)
>>> {
>>>       struct event event = {};
>>>       event.size = sizeof(struct event);
>>>
>>>       bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
>>>                             &event, sizeof(struct event));
>>>       return 0;
>>> }
>>>
>>> char LICENSE[] SEC("license") = "GPL";
>>>
>>>
>>> example.c
>>> --------------------------------
>>> #include <stdio.h>
>>> #include <bpf/libbpf.h>
>>> #include <sys/resource.h>
>>> #include "example.h"
>>> #include "example.skel.h"
>>>
>>> #define PERF_BUFFER_PAGES    16
>>> #define PERF_POLL_TIMEOUT_MS    100
>>>
>>> void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
>>> {
>>>       const struct event *e = data;
>>>
>>>       printf("Userspace: %u | BPF: %zu | data_sz: %u \n",
>>>              e->size, sizeof(struct event), data_sz);
>>> }
>>>
>>> void handle_lost_events(void *ctx, int cpu, __u64 lost_cnt)
>>> {
>>>       fprintf(stderr, "lost %llu events on CPU #%d\n", lost_cnt, cpu);
>>> }
>>>
>>> int main(int argc, char **argv)
>>> {
>>>       struct perf_buffer_opts pb_opts;
>>>       struct perf_buffer *pb = NULL;
>>>       struct example_bpf *obj;
>>>       int err;
>>>
>>>       struct rlimit rlim_new = {
>>>           .rlim_cur    = RLIM_INFINITY,
>>>           .rlim_max    = RLIM_INFINITY,
>>>       };
>>>       err = setrlimit(RLIMIT_MEMLOCK, &rlim_new);
>>>       if (err) {
>>>           fprintf(stderr, "failed to increase rlimit: %d\n", err);
>>>           return 1;
>>>       }
>>>
>>>       obj = example_bpf__open();
>>>       if (!obj) {
>>>           fprintf(stderr, "failed to open and/or load BPF object\n");
>>>           return 1;
>>>       }
>>>
>>>       err = example_bpf__load(obj);
>>>       if (err) {
>>>           fprintf(stderr, "failed to load BPF object: %d\n", err);
>>>           goto cleanup;
>>>       }
>>>
>>>       err = example_bpf__attach(obj);
>>>       if (err) {
>>>           fprintf(stderr, "failed to attach BPF programs\n");
>>>           goto cleanup;
>>>       }
>>>
>>>       pb_opts.sample_cb = handle_event;
>>>       pb_opts.lost_cb = handle_lost_events;
>>>       pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
>>>                             &pb_opts);
>>>       err = libbpf_get_error(pb);
>>>       if (err) {
>>>           pb = NULL;
>>>           fprintf(stderr, "failed to open perf buffer: %d\n", err);
>>>           goto cleanup;
>>>       }
>>>
>>>       while ((err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS)) >= 0)
>>>           ;
>>>       fprintf(stderr, "error polling perf buffer: %d\n", err);
>>>
>>>       cleanup:
>>>       perf_buffer__free(pb);
>>>       example_bpf__destroy(obj);
>>>
>>>       return err != 0;
>>> }
>>>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: perf event and data_sz
  2020-08-27 10:01   ` Borna Cafuk
  2020-08-27 16:04     ` Yonghong Song
@ 2020-09-02  0:29     ` Andrii Nakryiko
  1 sibling, 0 replies; 5+ messages in thread
From: Andrii Nakryiko @ 2020-09-02  0:29 UTC (permalink / raw)
  To: Borna Cafuk; +Cc: Yonghong Song, bpf, Luka Perkov, Juraj Vijtiuk, Jakov Petrina

On Thu, Aug 27, 2020 at 3:02 AM Borna Cafuk <borna.cafuk@sartura.hr> wrote:
>
> On Wed, Aug 26, 2020 at 8:45 PM Yonghong Song <yhs@fb.com> wrote:
> >
> > On 8/26/20 7:11 AM, Borna Cafuk wrote:
> > > Hi everyone,
> > >
> > > When examining BPF programs that use perf buffers, I have noticed that
> > > the `data_sz` parameter in the sample callback has a different value than
> > > the size passed to `bpf_perf_event_output`.
> > >
> > > Raw samples are padded in `perf_prepare_sample` so their size is a multiple of
> > > `sizeof(u64)` (see [0]). The sample includes the size header, a `u32`.
> > > The size stored in `raw->size` is then size of the sample, minus the 4 bytes for
> > > the size header. This size, however, includes both the data and the padding.
> > >
> > > What I find confusing is that this size including the padding is eventually
> > > passed as `data_sz` to the callback in the userspace, instead of
> > > the size of the data that was passed as an argument to `bpf_perf_event_output`.
> > >
> > > Is this intended behaviour?
> >
> >  From the kernel source code, yes, this is expected behavior. What you
> > described below matches what the kernel did. So raw->size = 68 is expected.
>
> I understand why this size that is stored in `raw->size` is needed.
> What I don't see is how the value is of any use in the callback.
>
> >
> > >
> > > I have a use-case for getting only the size of the data in the
> > > userspace, could this be done?
> >
> > In this case, since we know the kernel writes one record at a time,
> > you check the size, it is 68 more than 62, you just read 62 bytes
> > as your real data, ignore the rest as the padding. Does this work?
>
> The `data_sz` parameter seems a little pointless, then. What is its purpose
> in the callback if it doesn't accurately represent the size of the data?
>
> >
> > bcc callback passed the the buffer with raw->size to application.
> > But applications are expected to know what the record layout is...
>
> I'm afraid that wouldn't work for the use-case, our application should be able
> to read the raw data without having to know the record layout. It has to be
> generic, we handle interpreting the records elsewhere and at another time.

I agree it's confusing and less useful (not exactly useless, though),
but seems like PERF_SAMPLE_RAW doesn't store neither original size nor
(equivalently) the added padding size, so libbpf has nothing to go off
of, unfortunately.

I can only offer two suggestions:
1. Make sure samples you are emitting are 8 byte aligned, in which
case data_sz will always match actual data size. If you need different
sizes for different structs, you'd have to artificially add extra
bytes, though.
2. Consider using BPF ringbuf, it's data_sz is exact and doesn't
include padding (you need 5.8+ kernel).

>
> >
> > >
> > > To demonstrate, I have prepared a minimal example by modifying
> > > BCC's filelife example. It uses a kprobe on vfs_unlink to print some sizes
> > > every time a file is unlinked. The sizes are:
> > >   * the `sizeof(struct event)` measured in the userspace program,
> > >   * the `sizeof(struct event)` measured in the BPF program, and
> > >   * the `data_sz` parameter.
> > >
> > > The first two are 62, as expected, but `data_sz` is 68.
> > > The 62 bytes of the struct and the 4 bytes of the sample header make 66 bytes.
> > > This is rounded up to the first multiple of 8, which is 72.
> > > The 4 bytes for the size header are then subtracted,
> > > and 68 is written as the data size.
> > >
> > > Any input is much appreciated,
> > >
> > > Best regards,
> > > Borna Cafuk
> > >
> > >
> > > [0] https://github.com/torvalds/linux/blob/6a9dc5fd6170d0a41c8a14eb19e63d94bea5705a/kernel/events/core.c#L7035
> > >
> > >
> > > example.h

[...]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-09-02  0:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-26 14:11 perf event and data_sz Borna Cafuk
2020-08-26 18:45 ` Yonghong Song
2020-08-27 10:01   ` Borna Cafuk
2020-08-27 16:04     ` Yonghong Song
2020-09-02  0:29     ` Andrii Nakryiko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.