bpf.vger.kernel.org archive mirror
* Maximum size of record over perf ring buffer?
@ 2020-07-17 14:23 Kevin Sheldrake
  2020-07-20  4:35 ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Kevin Sheldrake @ 2020-07-17 14:23 UTC (permalink / raw)
  To: bpf

Hello

I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd like to ask about.  I haven't managed to find documentation for the maximum size of a record that can be sent over the perf ring buffer, but experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests it is just short of 64KB.  Please could someone confirm if that's the case or not?  My experiments suggest that sending a record that is greater than 64KB results in the size reported in the callback being correct but the records overlapping, causing corruption if they are not serviced as quickly as they arrive.  Setting the record to exactly 64KB results in no records being received at all.

For reference, I'm using perf_buffer__new() and perf_buffer__poll() on the userland side; and bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
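
In sketch form (libbpf as of mid-2020; handle_event/handle_lost and run() are my placeholder names, and map_fd is the fd of the BPF_MAP_TYPE_PERF_EVENT_ARRAY event_map):

#include <bpf/libbpf.h>

static void handle_event(void *ctx, int cpu, void *data, __u32 size)
{
    /* size is reported correctly here even when samples overlap */
}

static void handle_lost(void *ctx, int cpu, __u64 cnt)
{
    /* lost-sample notification */
}

int run(int map_fd)
{
    struct perf_buffer_opts pb_opts = {
        .sample_cb = handle_event,
        .lost_cb = handle_lost,
    };
    struct perf_buffer *pb;

    pb = perf_buffer__new(map_fd, 64 /* pages per CPU */, &pb_opts);
    if (libbpf_get_error(pb))
        return -1;
    while (perf_buffer__poll(pb, 100 /* timeout, ms */) >= 0)
        ; /* handle_event() fires per record */
    perf_buffer__free(pb);
    return 0;
}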

Additionally, is there a better architecture for sending large volumes of data (>64KB) back from the EBPF program to userland, such as a different ring buffer, a map, some kind of shared mmaped segment, etc, other than simply fragmenting the data?  Please excuse my naivety as I'm relatively new to the world of EBPF.

Thank you in anticipation

Kevin Sheldrake


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Maximum size of record over perf ring buffer?
  2020-07-17 14:23 Maximum size of record over perf ring buffer? Kevin Sheldrake
@ 2020-07-20  4:35 ` Andrii Nakryiko
  2020-07-20 11:39   ` [EXTERNAL] " Kevin Sheldrake
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-07-20  4:35 UTC (permalink / raw)
  To: Kevin Sheldrake; +Cc: bpf

On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
<Kevin.Sheldrake@microsoft.com> wrote:
>
> Hello
>
> I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd like to ask about.  I haven't managed to find documentation for the maximum size of a record that can be sent over the perf ring buffer, but experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests it is just short of 64KB.  Please could someone confirm if that's the case or not?  My experiments suggest that sending a record that is greater than 64KB results in the size reported in the callback being correct but the records overlapping, causing corruption if they are not serviced as quickly as they arrive.  Setting the record to exactly 64KB results in no records being received at all.
>
> For reference, I'm using perf_buffer__new() and perf_buffer__poll() on the userland side; and bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
>
> Additionally, is there a better architecture for sending large volumes of data (>64KB) back from the EBPF program to userland, such as a different ring buffer, a map, some kind of shared mmaped segment, etc, other than simply fragmenting the data?  Please excuse my naivety as I'm relatively new to the world of EBPF.
>

I'm not aware of any such limitation for the perf ring buffer, and I
haven't had a chance to validate this. It would be great if you could
provide a small repro so that someone can take a deeper look; it does
sound like a bug if you really are getting clobbered data. It might
actually be down to how you set up the perfbuf: AFAIK, it has a mode
in which it will overwrite data if it's not consumed quickly enough,
but you need to consciously enable that mode.

But apart from that, shameless plug here: you can try the new BPF
ring buffer ([0]), available in 5.8+ kernels. It will allow you to
avoid the extra copy of data you get with bpf_perf_event_output(), if
you use BPF ringbuf's bpf_ringbuf_reserve() + bpf_ringbuf_submit()
API. It also has a bpf_ringbuf_output() API, which is logically
equivalent to bpf_perf_event_output(). And it has a very high limit
on sample size, up to 512MB per sample.
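
E.g., the reserve/submit pattern looks roughly like this (a minimal
sketch, assuming a ringbuf map named rb and your event_s type):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); /* power of 2, multiple of page size */
} rb SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int probe(void *ctx)
{
    event_s *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);

    if (!e)
        return 0; /* ringbuf full, sample dropped */
    /* fill *e in place -- no extra copy as with bpf_perf_event_output() */
    bpf_ringbuf_submit(e, 0);
    return 0;
}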

Keep in mind, BPF ringbuf is an MPSC design, so if you use just one
BPF ringbuf across all CPUs, you might run into some contention
across multiple CPUs. That is acceptable in a lot of applications I
was targeting, but if you have a high frequency of events (keep in
mind, throughput doesn't matter, only contention on sample
reservation matters), you might want to use an array of BPF ringbufs
to scale throughput. You can do one ringbuf per CPU for ultimate
performance at the expense of memory usage (that's the perf ring
buffer setup), but BPF ringbuf is flexible enough to allow any
topology that makes sense for your use case, from one ringbuf shared
across all CPUs to one per CPU, and anything in between.


  [0] https://patchwork.ozlabs.org/project/netdev/list/?series=180119&state=*

> Thank you in anticipation
>
> Kevin Sheldrake
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-07-20  4:35 ` Andrii Nakryiko
@ 2020-07-20 11:39   ` Kevin Sheldrake
  2020-07-23 19:05     ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Kevin Sheldrake @ 2020-07-20 11:39 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf

Hello

Thank you for your response; I hope you don't mind me top-posting.  I've put together a POC that demonstrates my results.  Edit the size of the data char array in event_defs.h to change the behaviour.

https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
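
The event type is essentially a small header plus a large char array,
something like this (a sketch; see event_defs.h in the repo for the
real definition):

typedef struct {
    unsigned long pid;
    char data[65536]; /* edit this size to change the behaviour */
} event_s;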

Unfortunately, our project aims to run on kernels older than 5.8, so the BPF ring buffer won't work for us.

Thanks again

Kevin Sheldrake


-----Original Message-----
From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf Of Andrii Nakryiko
Sent: 20 July 2020 05:35
To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
Cc: bpf@vger.kernel.org
Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?

On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake <Kevin.Sheldrake@microsoft.com> wrote:
>
> Hello
>
> I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd like to ask about.  I haven't managed to find documentation for the maximum size of a record that can be sent over the perf ring buffer, but experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests it is just short of 64KB.  Please could someone confirm if that's the case or not?  My experiments suggest that sending a record that is greater than 64KB results in the size reported in the callback being correct but the records overlapping, causing corruption if they are not serviced as quickly as they arrive.  Setting the record to exactly 64KB results in no records being received at all.
>
> For reference, I'm using perf_buffer__new() and perf_buffer__poll() on the userland side; and bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
>
> Additionally, is there a better architecture for sending large volumes of data (>64KB) back from the EBPF program to userland, such as a different ring buffer, a map, some kind of shared mmaped segment, etc, other than simply fragmenting the data?  Please excuse my naivety as I'm relatively new to the world of EBPF.
>

I'm not aware of any such limitation for the perf ring buffer, and I haven't had a chance to validate this. It would be great if you could provide a small repro so that someone can take a deeper look; it does sound like a bug if you really are getting clobbered data. It might actually be down to how you set up the perfbuf: AFAIK, it has a mode in which it will overwrite data if it's not consumed quickly enough, but you need to consciously enable that mode.

But apart from that, shameless plug here: you can try the new BPF ring buffer ([0]), available in 5.8+ kernels. It will allow you to avoid the extra copy of data you get with bpf_perf_event_output(), if you use BPF ringbuf's bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It also has a bpf_ringbuf_output() API, which is logically equivalent to bpf_perf_event_output(). And it has a very high limit on sample size, up to 512MB per sample.

Keep in mind, BPF ringbuf is an MPSC design, so if you use just one BPF ringbuf across all CPUs, you might run into some contention across multiple CPUs. That is acceptable in a lot of applications I was targeting, but if you have a high frequency of events (keep in mind, throughput doesn't matter, only contention on sample reservation matters), you might want to use an array of BPF ringbufs to scale throughput. You can do one ringbuf per CPU for ultimate performance at the expense of memory usage (that's the perf ring buffer setup), but BPF ringbuf is flexible enough to allow any topology that makes sense for your use case, from one ringbuf shared across all CPUs to one per CPU, and anything in between.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-07-20 11:39   ` [EXTERNAL] " Kevin Sheldrake
@ 2020-07-23 19:05     ` Andrii Nakryiko
  2020-07-24  9:40       ` Kevin Sheldrake
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-07-23 19:05 UTC (permalink / raw)
  To: Kevin Sheldrake; +Cc: bpf

On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
<Kevin.Sheldrake@microsoft.com> wrote:
>
> Hello
>
> Thank you for your response; I hope you don't mind me top-posting.  I've put together a POC that demonstrates my results.  Edit the size of the data char array in event_defs.h to change the behaviour.
>
> https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc

I haven't run your program, but I can certainly reproduce this using
bench_perfbuf in selftests. It does seem like something is silently
corrupted, because the size reported by perf is correct (plus or
minus a few bytes, probably rounding up to 8 bytes), but the contents
are not correct. I have no idea why that's happening; maybe someone
more familiar with the perf subsystem can take a look.

>
> Unfortunately, our project aims to run on older kernels than 5.8 so the bpf ring buffer won't work for us.
>
> Thanks again
>
> Kevin Sheldrake
>
>
> -----Original Message-----
> From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf Of Andrii Nakryiko
> Sent: 20 July 2020 05:35
> To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> Cc: bpf@vger.kernel.org
> Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
>
> On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake <Kevin.Sheldrake@microsoft.com> wrote:
> >
> > Hello
> >
> > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd like to ask about.  I haven't managed to find documentation for the maximum size of a record that can be sent over the perf ring buffer, but experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests it is just short of 64KB.  Please could someone confirm if that's the case or not?  My experiments suggest that sending a record that is greater than 64KB results in the size reported in the callback being correct but the records overlapping, causing corruption if they are not serviced as quickly as they arrive.  Setting the record to exactly 64KB results in no records being received at all.
> >
> > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on the userland side; and bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> >
> > Additionally, is there a better architecture for sending large volumes of data (>64KB) back from the EBPF program to userland, such as a different ring buffer, a map, some kind of shared mmaped segment, etc, other than simply fragmenting the data?  Please excuse my naivety as I'm relatively new to the world of EBPF.
> >
>
> I'm not aware of any such limitation for the perf ring buffer, and I haven't had a chance to validate this. It would be great if you could provide a small repro so that someone can take a deeper look; it does sound like a bug if you really are getting clobbered data. It might actually be down to how you set up the perfbuf: AFAIK, it has a mode in which it will overwrite data if it's not consumed quickly enough, but you need to consciously enable that mode.
>
> But apart from that, shameless plug here: you can try the new BPF ring buffer ([0]), available in 5.8+ kernels. It will allow you to avoid the extra copy of data you get with bpf_perf_event_output(), if you use BPF ringbuf's bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It also has a bpf_ringbuf_output() API, which is logically equivalent to bpf_perf_event_output(). And it has a very high limit on sample size, up to 512MB per sample.
>
> Keep in mind, BPF ringbuf is an MPSC design, so if you use just one BPF ringbuf across all CPUs, you might run into some contention across multiple CPUs. That is acceptable in a lot of applications I was targeting, but if you have a high frequency of events (keep in mind, throughput doesn't matter, only contention on sample reservation matters), you might want to use an array of BPF ringbufs to scale throughput. You can do one ringbuf per CPU for ultimate performance at the expense of memory usage (that's the perf ring buffer setup), but BPF ringbuf is flexible enough to allow any topology that makes sense for your use case, from one ringbuf shared across all CPUs to one per CPU, and anything in between.
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-07-23 19:05     ` Andrii Nakryiko
@ 2020-07-24  9:40       ` Kevin Sheldrake
  2020-10-26 17:10         ` Kevin Sheldrake
  0 siblings, 1 reply; 11+ messages in thread
From: Kevin Sheldrake @ 2020-07-24  9:40 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf

Hello Andrii

Thank you for taking a look at this.  While the size is reported correctly to the consumer (bar padding, etc), the actual offsets between adjacent pointers appear to have either been cast to a u16 or otherwise masked with 0xFFFF, causing what I believe to be overlapping samples and the opportunity for sample corruption in the overlapped regions.
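
A trivial userland illustration of the truncation I mean (not kernel code; just the u16 wrap-around, with a made-up sample size):

#include <stdio.h>
#include <linux/perf_event.h>

int main(void)
{
    unsigned long long requested = 70000; /* sample just over 64KB */
    struct perf_event_header hdr = { 0 };

    hdr.size = requested + sizeof(hdr); /* __u16: silently wraps */
    printf("requested %llu, header says %u\n", requested, (unsigned)hdr.size);
    /* prints: requested 70000, header says 4472 ((70000 + 8) & 0xffff) */
    return 0;
}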

Thanks again

Kev


-----Original Message-----
From: Andrii Nakryiko <andrii.nakryiko@gmail.com> 
Sent: 23 July 2020 20:05
To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
Cc: bpf@vger.kernel.org
Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?

On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake <Kevin.Sheldrake@microsoft.com> wrote:
>
> Hello
>
> Thank you for your response; I hope you don't mind me top-posting.  I've put together a POC that demonstrates my results.  Edit the size of the data char array in event_defs.h to change the behaviour.
>
> https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc

I haven't run your program, but I can certainly reproduce this using bench_perfbuf in selftests. It does seem like something is silently corrupted, because the size reported by perf is correct (plus or minus a few bytes, probably rounding up to 8 bytes), but the contents are not correct. I have no idea why that's happening; maybe someone more familiar with the perf subsystem can take a look.

>
> Unfortunately, our project aims to run on older kernels than 5.8 so the bpf ring buffer won't work for us.
>
> Thanks again
>
> Kevin Sheldrake
>
>
> -----Original Message-----
> From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf 
> Of Andrii Nakryiko
> Sent: 20 July 2020 05:35
> To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> Cc: bpf@vger.kernel.org
> Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
>
> On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake <Kevin.Sheldrake@microsoft.com> wrote:
> >
> > Hello
> >
> > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd like to ask about.  I haven't managed to find documentation for the maximum size of a record that can be sent over the perf ring buffer, but experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests it is just short of 64KB.  Please could someone confirm if that's the case or not?  My experiments suggest that sending a record that is greater than 64KB results in the size reported in the callback being correct but the records overlapping, causing corruption if they are not serviced as quickly as they arrive.  Setting the record to exactly 64KB results in no records being received at all.
> >
> > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on the userland side; and bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> >
> > Additionally, is there a better architecture for sending large volumes of data (>64KB) back from the EBPF program to userland, such as a different ring buffer, a map, some kind of shared mmaped segment, etc, other than simply fragmenting the data?  Please excuse my naivety as I'm relatively new to the world of EBPF.
> >
>
> I'm not aware of any such limitation for the perf ring buffer, and I haven't had a chance to validate this. It would be great if you could provide a small repro so that someone can take a deeper look; it does sound like a bug if you really are getting clobbered data. It might actually be down to how you set up the perfbuf: AFAIK, it has a mode in which it will overwrite data if it's not consumed quickly enough, but you need to consciously enable that mode.
>
> But apart from that, shameless plug here: you can try the new BPF ring buffer ([0]), available in 5.8+ kernels. It will allow you to avoid the extra copy of data you get with bpf_perf_event_output(), if you use BPF ringbuf's bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It also has a bpf_ringbuf_output() API, which is logically equivalent to bpf_perf_event_output(). And it has a very high limit on sample size, up to 512MB per sample.
>
> Keep in mind, BPF ringbuf is an MPSC design, so if you use just one BPF ringbuf across all CPUs, you might run into some contention across multiple CPUs. That is acceptable in a lot of applications I was targeting, but if you have a high frequency of events (keep in mind, throughput doesn't matter, only contention on sample reservation matters), you might want to use an array of BPF ringbufs to scale throughput. You can do one ringbuf per CPU for ultimate performance at the expense of memory usage (that's the perf ring buffer setup), but BPF ringbuf is flexible enough to allow any topology that makes sense for your use case, from one ringbuf shared across all CPUs to one per CPU, and anything in between.
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-07-24  9:40       ` Kevin Sheldrake
@ 2020-10-26 17:10         ` Kevin Sheldrake
  2020-10-26 22:01           ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Kevin Sheldrake @ 2020-10-26 17:10 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf

Hello Andrii and list

I've now had the chance to properly investigate the perf ring buffer corruption bug.  Essentially (as suspected), the size parameter, which is specified as a __u64 in bpf_helper_defs.h, is truncated into the __u16 size member of struct perf_event_header ($KERNELSRC/include/uapi/linux/perf_event.h and /usr/include/linux/perf_event.h).

Changing the size parameter in perf_event_header (both locations) to a __u32 (or __u64 if you prefer) fixes my issue of sending more than 64KB of data in a single perf sample, but I'm not convinced that this is a good or workable solution.

As I, and probably others, are more likely to tend towards much smaller, fragmented packets, I suggest (having spoken to KP Singh) that the fix should probably be in the verifier - to ensure the size is < 0xffff - 8 (8 being sizeof(struct perf_event_header), I guess) - and also in bpf_helper_defs.h to raise a clang warning/error, as well as in the bpf_helpers man page.

The bpf_helper_defs.h and man page updates are trivial, but I can't work out where in verifier.c the check should go.  It feels like it should be in check_helper_call() but I can't see any other similar checks in there.  I suspect that the better fix would be to create another ARG_CONST_SIZE type, such as ARG_CONST_SIZE_PERF_SAMPLE, that can be explicitly checked rather than adding ad hoc size checks.

Since this causes corruption inside the perf ring buffer as samples overlap, and the reported sample size is questionable, please could I ask for some help in fixing this?

Thanks

Kevin Sheldrake

PS I will get around to the clang/LLVM jump offset warning soon, I promise.



> -----Original Message-----
> From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf
> Of Kevin Sheldrake
> Sent: 24 July 2020 10:40
> To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> 
> Hello Andrii
> 
> Thank you for taking a look at this.  While the size is reported correctly to the
> consumer (bar padding, etc), the actual offsets between adjacent pointers
> appears to either have been cast to a u16 or otherwise masked with 0xFFFF,
> causing what I believe to be overlapping samples and the opportunity for
> sample corruption in the overlapped regions.
> 
> Thanks again
> 
> Kev
> 
> 
> -----Original Message-----
> From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Sent: 23 July 2020 20:05
> To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> Cc: bpf@vger.kernel.org
> Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> 
> On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> <Kevin.Sheldrake@microsoft.com> wrote:
> >
> > Hello
> >
> > Thank you for your response; I hope you don't mind me top-posting.  I've
> put together a POC that demonstrates my results.  Edit the size of the data
> char array in event_defs.h to change the behaviour.
> >
> >
> > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> 
> I haven't run your program, but I can certainly reproduce this using
> bench_perfbuf in selftests. It does seem like something is silently corrupted,
> because the size reported by perf is correct (plus/minus few bytes, probably
> rounding up to 8 bytes), but the contents is not correct. I have no idea why
> that's happening, maybe someone more familiar with the perf subsystem
> can take a look.
> 
> >
> > Unfortunately, our project aims to run on older kernels than 5.8 so the bpf
> ring buffer won't work for us.
> >
> > Thanks again
> >
> > Kevin Sheldrake
> >
> >
> > -----Original Message-----
> > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On
> Behalf
> > Of Andrii Nakryiko
> > Sent: 20 July 2020 05:35
> > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > Cc: bpf@vger.kernel.org
> > Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> >
> > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> <Kevin.Sheldrake@microsoft.com> wrote:
> > >
> > > Hello
> > >
> > > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd
> like to ask about.  I haven't managed to find documentation for the
> maximum size of a record that can be sent over the perf ring buffer, but
> experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests
> it is just short of 64KB.  Please could someone confirm if that's the case or
> not?  My experiments suggest that sending a record that is greater than 64KB
> results in the size reported in the callback being correct but the records
> overlapping, causing corruption if they are not serviced as quickly as they
> arrive.  Setting the record to exactly 64KB results in no records being received
> at all.
> > >
> > > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on
> the userland side; and bpf_perf_event_output(ctx, &event_map,
> BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> > >
> > > Additionally, is there a better architecture for sending large volumes of
> data (>64KB) back from the EBPF program to userland, such as a different
> ring buffer, a map, some kind of shared mmaped segment, etc, other than
> simply fragmenting the data?  Please excuse my naivety as I'm relatively new
> to the world of EBPF.
> > >
> >
> > I'm not aware of any such limitations for perf ring buffer and I haven't had a
> chance to validate this. It would be great if you can provide a small repro so
> that someone can take a deeper look, it does sound like a bug, if you really
> get clobbered data. It might be actually how you set up perfbuf, AFAIK, it has
> a mode where it will override the data, if it's not consumed quickly enough,
> but you need to consciously enable that mode.
> >
> > But apart from that, shameless plug here, you can try the new BPF ring
> buffer ([0]), available in 5.8+ kernels. It will allow you to avoid extra copy of
> data you get with bpf_perf_event_output(), if you use BPF ringbuf's
> bpf_ringbuf_reserve() + bpf_ringbuf_commit() API. It also has
> bpf_ringbuf_output() API, which is logically  equivalent to
> bpf_perf_event_output(). And it has a very high limit on sample size, up to
> 512MB per sample.
> >
> > Keep in mind, BPF ringbuf is MPSC design and if you use just one BPF
> ringbuf across all CPUs, you might run into some contention across multiple
> CPU. It is acceptable in a lot of applications I was targeting, but if you have a
> high frequency of events (keep in mind, throughput doesn't matter, only
> contention on sample reservation matters), you might want to use an array
> of BPF ringbufs to scale throughput. You can do 1 ringbuf per each CPU for
> ultimate performance at the expense of memory usage (that's perf ring
> buffer setup), but BPF ringbuf is flexible enough to allow any topology that
> makes sense for you use case, from 1 shared ringbuf across all CPUs, to
> anything in between.
> >
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-10-26 17:10         ` Kevin Sheldrake
@ 2020-10-26 22:01           ` Andrii Nakryiko
  2020-10-27  1:07             ` KP Singh
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-10-26 22:01 UTC (permalink / raw)
  To: Kevin Sheldrake; +Cc: bpf

On Mon, Oct 26, 2020 at 10:10 AM Kevin Sheldrake
<Kevin.Sheldrake@microsoft.com> wrote:
>
> Hello Andrii and list
>
> I've now had chance to properly investigate the perf ring buffer corruption bug.  Essentially (as suspected), the size parameter that is specified as a __u64 in bpf_helper_defs.h, is truncated into a __u16 inside the struct perf_event_header size parameter ($KERNELSRC/include/uapi/linux/perf_event.h and /usr/include/linux/perf_event.h).
>
> Changing the size parameter in perf_event_header (both locations) to a __u32 (or __u64 if you prefer) fixes my issue of sending more than 64KB of data in a single perf sample, but I'm not convinced that this is a good or workable solution.

Right, this will break ABI compatibility with all the applications,
unfortunately.

>
> As I, and probably others, are more likely to tend towards much smaller, fragmented packets, I suggest (having spoken to KP Singh) that the fix should probably be in the verifier - to ensure the size is <0xffff - 8 (sizeof(struct perf_event_header) I guess) - and also in bpf_helper_defs.h to raise a clang warning/error, as well as in the bpf_helpers man page.
>
> The bpf_helper_defs.h and man page updates are trivial, but I can't work out where in verifier.c the check should go.  It feels like it should be in check_helper_call() but I can't see any other similar checks in there.  I suspect that the better fix would be to create another ARG_CONST_SIZE type, such as ARG_CONST_SIZE_PERF_SAMPLE, that can be explicitly checked rather than adding ad hoc size checks.
>
> As this causes corruption inside the perf ring buffer as samples overlap, and the reported sample size is questionable, please can I ask for some help in fixing this?

I don't think it's possible to enforce this statically in the
verifier in all cases. The size of the sample can be dynamically
determined, so the BPF verifier can't do much about that. It seems
like the proper solution is to do the check in the
bpf_perf_event_output() BPF helper itself. Returning -E2BIG is an
appropriate behavior here, rather than corrupting data. __u64 in the
helper definition is fine as well, because that's the size of the BPF
registers that are used to pass arguments to helpers.
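
Something like this at the top of the helper, say (a sketch against
kernel/trace/bpf_trace.c; the exact bound and placement are
illustrative, not a tested patch):

BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs,
           struct bpf_map *, map, u64, flags, void *, data, u64, size)
{
    /* reject anything that can't be represented in the __u16
     * perf_event_header::size once header overhead is added */
    if (unlikely(size > USHRT_MAX - sizeof(struct perf_event_header)))
        return -E2BIG;
    ...
}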

>
> Thanks
>
> Kevin Sheldrake
>
> PS I will get around to the clang/LLVM jump offset warning soon I promise.
>
>
>
> > -----Original Message-----
> > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf
> > Of Kevin Sheldrake
> > Sent: 24 July 2020 10:40
> > To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > Cc: bpf@vger.kernel.org
> > Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> >
> > Hello Andrii
> >
> > Thank you for taking a look at this.  While the size is reported correctly to the
> > consumer (bar padding, etc), the actual offsets between adjacent pointers
> > appears to either have been cast to a u16 or otherwise masked with 0xFFFF,
> > causing what I believe to be overlapping samples and the opportunity for
> > sample corruption in the overlapped regions.
> >
> > Thanks again
> >
> > Kev
> >
> >
> > -----Original Message-----
> > From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > Sent: 23 July 2020 20:05
> > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > Cc: bpf@vger.kernel.org
> > Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> >
> > On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> > <Kevin.Sheldrake@microsoft.com> wrote:
> > >
> > > Hello
> > >
> > > Thank you for your response; I hope you don't mind me top-posting.  I've
> > put together a POC that demonstrates my results.  Edit the size of the data
> > char array in event_defs.h to change the behaviour.
> > >
> > >
> > > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> >
> > I haven't run your program, but I can certainly reproduce this using
> > bench_perfbuf in selftests. It does seem like something is silently corrupted,
> > because the size reported by perf is correct (plus/minus few bytes, probably
> > rounding up to 8 bytes), but the contents is not correct. I have no idea why
> > that's happening, maybe someone more familiar with the perf subsystem
> > can take a look.
> >
> > >
> > > Unfortunately, our project aims to run on older kernels than 5.8 so the bpf
> > ring buffer won't work for us.
> > >
> > > Thanks again
> > >
> > > Kevin Sheldrake
> > >
> > >
> > > -----Original Message-----
> > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On
> > Behalf
> > > Of Andrii Nakryiko
> > > Sent: 20 July 2020 05:35
> > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > Cc: bpf@vger.kernel.org
> > > Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > >
> > > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> > <Kevin.Sheldrake@microsoft.com> wrote:
> > > >
> > > > Hello
> > > >
> > > > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd
> > like to ask about.  I haven't managed to find documentation for the
> > maximum size of a record that can be sent over the perf ring buffer, but
> > experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests
> > it is just short of 64KB.  Please could someone confirm if that's the case or
> > not?  My experiments suggest that sending a record that is greater than 64KB
> > results in the size reported in the callback being correct but the records
> > overlapping, causing corruption if they are not serviced as quickly as they
> > arrive.  Setting the record to exactly 64KB results in no records being received
> > at all.
> > > >
> > > > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on
> > the userland side; and bpf_perf_event_output(ctx, &event_map,
> > BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> > > >
> > > > Additionally, is there a better architecture for sending large volumes of
> > data (>64KB) back from the EBPF program to userland, such as a different
> > ring buffer, a map, some kind of shared mmaped segment, etc, other than
> > simply fragmenting the data?  Please excuse my naivety as I'm relatively new
> > to the world of EBPF.
> > > >
> > >
> > > I'm not aware of any such limitations for perf ring buffer and I haven't had a
> > chance to validate this. It would be great if you can provide a small repro so
> > that someone can take a deeper look, it does sound like a bug, if you really
> > get clobbered data. It might be actually how you set up perfbuf, AFAIK, it has
> > a mode where it will override the data, if it's not consumed quickly enough,
> > but you need to consciously enable that mode.
> > >
> > > But apart from that, shameless plug here, you can try the new BPF ring
> > buffer ([0]), available in 5.8+ kernels. It will allow you to avoid extra copy of
> > data you get with bpf_perf_event_output(), if you use BPF ringbuf's
> > bpf_ringbuf_reserve() + bpf_ringbuf_commit() API. It also has
> > bpf_ringbuf_output() API, which is logically  equivalent to
> > bpf_perf_event_output(). And it has a very high limit on sample size, up to
> > 512MB per sample.
> > >
> > > Keep in mind, BPF ringbuf is MPSC design and if you use just one BPF
> > ringbuf across all CPUs, you might run into some contention across multiple
> > CPU. It is acceptable in a lot of applications I was targeting, but if you have a
> > high frequency of events (keep in mind, throughput doesn't matter, only
> > contention on sample reservation matters), you might want to use an array
> > of BPF ringbufs to scale throughput. You can do 1 ringbuf per each CPU for
> > ultimate performance at the expense of memory usage (that's perf ring
> > buffer setup), but BPF ringbuf is flexible enough to allow any topology that
> > makes sense for you use case, from 1 shared ringbuf across all CPUs, to
> > anything in between.
> > >
> > >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-10-26 22:01           ` Andrii Nakryiko
@ 2020-10-27  1:07             ` KP Singh
  2020-10-27  3:43               ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: KP Singh @ 2020-10-27  1:07 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Kevin Sheldrake, bpf

On Mon, Oct 26, 2020 at 11:01 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Oct 26, 2020 at 10:10 AM Kevin Sheldrake
> <Kevin.Sheldrake@microsoft.com> wrote:
> >
> > Hello Andrii and list
> >
> > I've now had chance to properly investigate the perf ring buffer corruption bug.  Essentially (as suspected), the size parameter that is specified as a __u64 in bpf_helper_defs.h, is truncated into a __u16 inside the struct perf_event_header size parameter ($KERNELSRC/include/uapi/linux/perf_event.h and /usr/include/linux/perf_event.h).
> >
> > Changing the size parameter in perf_event_header (both locations) to a __u32 (or __u64 if you prefer) fixes my issue of sending more than 64KB of data in a single perf sample, but I'm not convinced that this is a good or workable solution.
>
> Right, this will break ABI compatibility with all the applications,
> unfortunately.
>
> >
> > As I, and probably others, are more likely to tend towards much smaller, fragmented packets, I suggest (having spoken to KP Singh) that the fix should probably be in the verifier - to ensure the size is <0xffff - 8 (sizeof(struct perf_event_header) I guess) - and also in bpf_helper_defs.h to raise a clang warning/error, as well as in the bpf_helpers man page.
> >
> > The bpf_helper_defs.h and man page updates are trivial, but I can't work out where in verifier.c the check should go.  It feels like it should be in check_helper_call() but I can't see any other similar checks in there.  I suspect that the better fix would be to create another ARG_CONST_SIZE type, such as ARG_CONST_SIZE_PERF_SAMPLE, that can be explicitly checked rather than adding ad hoc size checks.
> >
> > As this causes corruption inside the perf ring buffer as samples overlap, and the reported sample size is questionable, please can I ask for some help in fixing this?
>
> I don't think it's possible to enforce this statically in verifier in
> all cases. The size of the sample can be dynamically determined, so

Are you sure it cannot be done in the verifier, and that the size of
the sample can really be dynamic here?

The size argument is of the type ARG_CONST_SIZE_OR_ZERO:

static const struct bpf_func_proto bpf_perf_event_output_proto = {
   .func = bpf_perf_event_output,
   .gpl_only = true,
   .ret_type = RET_INTEGER,
   .arg1_type = ARG_PTR_TO_CTX,
   .arg2_type = ARG_CONST_MAP_PTR,
   .arg3_type = ARG_ANYTHING,
   .arg4_type = ARG_PTR_TO_MEM,
   .arg5_type = ARG_CONST_SIZE_OR_ZERO,
};

and we do similar checks in the verifier with the BPF_MAX_VAR_SIZ:

if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
    verbose(env,
            "R%d unbounded memory access, use 'var &= const' or 'if (var < const)'\n",
            regno);
    return -EACCES;
}

it's just that bpf_perf_event_output expects the size to be even
smaller than 32 bits (i.e. 16 bits).

> BPF verifier can't do much about that. It seems like the proper
> solution is to do the check in bpf_perf_event_output() BPF helper
> itself. Returning -E2BIG is an appropriate behavior here, rather than

This could be a solution (and maybe better than the verifier check).

But I do think perf needs to have the check instead of bpf_perf_event_output:

The size in perf always seems to be a u32, except in
perf_event_header (I assume this is to save some space on the ring
buffer):

struct perf_raw_frag {
    union {
        struct perf_raw_frag *next;
        unsigned long pad;
    };
    perf_copy_f copy;
    void *data;
    u32 size;
} __packed;

struct perf_raw_record {
    struct perf_raw_frag frag;
    u32 size;
};

Maybe we can just add the check to perf_event_output instead?
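
i.e. something along these lines (a sketch against
kernel/events/core.c; the exact bound/overhead is an assumption):

int perf_event_output(struct perf_event *event,
                      struct perf_sample_data *data,
                      struct pt_regs *regs)
{
    /* a raw sample whose on-ring size can't be represented in the
     * __u16 perf_event_header::size must be rejected, not truncated */
    if (data->raw &&
        data->raw->size > USHRT_MAX - sizeof(struct perf_event_header))
        return -E2BIG;
    ...
}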

- KP

> corrupting data. __u64 if helper definition is fine as well, because
> that's the size of BPF registers that are used to pass arguments to
> helpers.
>
> >
> > Thanks
> >
> > Kevin Sheldrake
> >
> > PS I will get around to the clang/LLVM jump offset warning soon I promise.
> >
> >
> >
> > > -----Original Message-----
> > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf
> > > Of Kevin Sheldrake
> > > Sent: 24 July 2020 10:40
> > > To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > Cc: bpf@vger.kernel.org
> > > Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > >
> > > Hello Andrii
> > >
> > > Thank you for taking a look at this.  While the size is reported correctly to the
> > > consumer (bar padding, etc), the actual offsets between adjacent pointers
> > > appears to either have been cast to a u16 or otherwise masked with 0xFFFF,
> > > causing what I believe to be overlapping samples and the opportunity for
> > > sample corruption in the overlapped regions.
> > >
> > > Thanks again
> > >
> > > Kev
> > >
> > >
> > > -----Original Message-----
> > > From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > Sent: 23 July 2020 20:05
> > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > Cc: bpf@vger.kernel.org
> > > Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > >
> > > On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > >
> > > > Hello
> > > >
> > > > Thank you for your response; I hope you don't mind me top-posting.  I've
> > > put together a POC that demonstrates my results.  Edit the size of the data
> > > char array in event_defs.h to change the behaviour.
> > > >
> > > >
> > > > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> > >
> > > I haven't run your program, but I can certainly reproduce this using
> > > bench_perfbuf in selftests. It does seem like something is silently corrupted,
> > > because the size reported by perf is correct (plus/minus few bytes, probably
> > > rounding up to 8 bytes), but the contents is not correct. I have no idea why
> > > that's happening, maybe someone more familiar with the perf subsystem
> > > can take a look.
> > >
> > > >
> > > > Unfortunately, our project aims to run on older kernels than 5.8 so the bpf
> > > ring buffer won't work for us.
> > > >
> > > > Thanks again
> > > >
> > > > Kevin Sheldrake
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On
> > > Behalf
> > > > Of Andrii Nakryiko
> > > > Sent: 20 July 2020 05:35
> > > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > > Cc: bpf@vger.kernel.org
> > > > Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > >
> > > > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > >
> > > > > Hello
> > > > >
> > > > > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd
> > > like to ask about.  I haven't managed to find documentation for the
> > > maximum size of a record that can be sent over the perf ring buffer, but
> > > experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests
> > > it is just short of 64KB.  Please could someone confirm if that's the case or
> > > not?  My experiments suggest that sending a record that is greater than 64KB
> > > results in the size reported in the callback being correct but the records
> > > overlapping, causing corruption if they are not serviced as quickly as they
> > > arrive.  Setting the record to exactly 64KB results in no records being received
> > > at all.
> > > > >
> > > > > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on
> > > the userland side; and bpf_perf_event_output(ctx, &event_map,
> > > BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> > > > >
> > > > > Additionally, is there a better architecture for sending large volumes of
> > > data (>64KB) back from the EBPF program to userland, such as a different
> > > ring buffer, a map, some kind of shared mmaped segment, etc, other than
> > > simply fragmenting the data?  Please excuse my naivety as I'm relatively new
> > > to the world of EBPF.
> > > > >
> > > >
> > > > I'm not aware of any such limitations for perf ring buffer and I haven't had a
> > > chance to validate this. It would be great if you can provide a small repro so
> > > that someone can take a deeper look, it does sound like a bug, if you really
> > > get clobbered data. It might be actually how you set up perfbuf, AFAIK, it has
> > > a mode where it will override the data, if it's not consumed quickly enough,
> > > but you need to consciously enable that mode.
> > > >
> > > > But apart from that, shameless plug here, you can try the new BPF ring
> > > buffer ([0]), available in 5.8+ kernels. It will allow you to avoid extra copy of
> > > data you get with bpf_perf_event_output(), if you use BPF ringbuf's
> > > bpf_ringbuf_reserve() + bpf_ringbuf_commit() API. It also has
> > > bpf_ringbuf_output() API, which is logically  equivalent to
> > > bpf_perf_event_output(). And it has a very high limit on sample size, up to
> > > 512MB per sample.
> > > >
> > > > Keep in mind, BPF ringbuf is MPSC design and if you use just one BPF
> > > ringbuf across all CPUs, you might run into some contention across multiple
> > > CPU. It is acceptable in a lot of applications I was targeting, but if you have a
> > > high frequency of events (keep in mind, throughput doesn't matter, only
> > > contention on sample reservation matters), you might want to use an array
> > > of BPF ringbufs to scale throughput. You can do 1 ringbuf per each CPU for
> > > ultimate performance at the expense of memory usage (that's perf ring
> > > buffer setup), but BPF ringbuf is flexible enough to allow any topology that
> > > makes sense for you use case, from 1 shared ringbuf across all CPUs, to
> > > anything in between.
> > > >
> > > >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-10-27  1:07             ` KP Singh
@ 2020-10-27  3:43               ` Andrii Nakryiko
  2020-10-28 19:03                 ` Kevin Sheldrake
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-10-27  3:43 UTC (permalink / raw)
  To: KP Singh; +Cc: Kevin Sheldrake, bpf

On Mon, Oct 26, 2020 at 6:07 PM KP Singh <kpsingh@chromium.org> wrote:
>
> On Mon, Oct 26, 2020 at 11:01 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Oct 26, 2020 at 10:10 AM Kevin Sheldrake
> > <Kevin.Sheldrake@microsoft.com> wrote:
> > >
> > > Hello Andrii and list
> > >
> > > I've now had chance to properly investigate the perf ring buffer corruption bug.  Essentially (as suspected), the size parameter that is specified as a __u64 in bpf_helper_defs.h, is truncated into a __u16 inside the struct perf_event_header size parameter ($KERNELSRC/include/uapi/linux/perf_event.h and /usr/include/linux/perf_event.h).
> > >
> > > Changing the size parameter in perf_event_header (both locations) to a __u32 (or __u64 if you prefer) fixes my issue of sending more than 64KB of data in a single perf sample, but I'm not convinced that this is a good or workable solution.
> >
> > Right, this will break ABI compatibility with all the applications,
> > unfortunately.
> >
> > >
> > > As I, and probably others, are more likely to tend towards much smaller, fragmented packets, I suggest (having spoken to KP Singh) that the fix should probably be in the verifier - to ensure the size is <0xffff - 8 (sizeof(struct perf_event_header) I guess) - and also in bpf_helper_defs.h to raise a clang warning/error, as well as in the bpf_helpers man page.
> > >
> > > The bpf_helper_defs.h and man page updates are trivial, but I can't work out where in verifier.c the check should go.  It feels like it should be in check_helper_call() but I can't see any other similar checks in there.  I suspect that the better fix would be to create another ARG_CONST_SIZE type, such as ARG_CONST_SIZE_PERF_SAMPLE, that can be explicitly checked rather than adding ad hoc size checks.
> > >
> > > As this causes corruption inside the perf ring buffer as samples overlap, and the reported sample size is questionable, please can I ask for some help in fixing this?
> >
> > I don't think it's possible to enforce this statically in verifier in
> > all cases. The size of the sample can be dynamically determined, so
>
> Are you sure it cannot be done in the verifier and that we can set the
> size of the sample dynamically?
>
> The size argument is of the type ARG_CONST_SIZE_OR_ZERO:
>
> static const struct bpf_func_proto bpf_perf_event_output_proto = {
>    .func = bpf_perf_event_output,
>    .gpl_only = true,
>    .ret_type = RET_INTEGER,
>    .arg1_type = ARG_PTR_TO_CTX,
>    .arg2_type = ARG_CONST_MAP_PTR,
>    .arg3_type = ARG_ANYTHING,
>    .arg4_type = ARG_PTR_TO_MEM,
>    .arg5_type = ARG_CONST_SIZE_OR_ZERO,
> };
>
> and we do similar checks in the verifier with the BPF_MAX_VAR_SIZ:
>
> if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
>    verbose(env, "R%d unbounded memory access, use 'var &= const' or
> 'if (var < const)'\n",
>       regno);
>    return -EACCES;
> }

You are right, of course, my bad. The verifier might not know the
exact value, but it enforces an upper bound. So in this case we can
additionally enforce an extra upper bound for
bpf_perf_event_output(). Though, given this will require a kernel
upgrade, I'd just stick to BPF ringbuf at that point ;)

>
> it's just that bpf_perf_event_output expects the size to be even
> smaller than 32 bits (i.e. 16 bits).
>
> > BPF verifier can't do much about that. It seems like the proper
> > solution is to do the check in bpf_perf_event_output() BPF helper
> > itself. Returning -E2BIG is an appropriate behavior here, rather than
>
> This could be a solution (and maybe better than the verifier check).
>
> But I do think perf needs to have the check instead of bpf_perf_event_output:

Yeah, of course, the perf subsystem itself shouldn't allow data
corruption. My point was to do it as a runtime check rather than
enforce it at verification time. But both would work fine.

>
> The size in the perf always seems to be u32 except the perf_event_header
> (I assume this is to save some space on the ring buffer)

Probably saving space, yeah. Though it's wasting the 3 lower bits,
because all sizes seem to always be a multiple of 8. So the real
limit could easily have been 8 * 64K = 512KB. But it's a bit late
now.

>
> struct perf_raw_frag {
>      union {
>          struct perf_raw_frag *next;
>          unsigned long pad;
>      };
>     perf_copy_f copy;
>     void *data;
>     u32 size;
> } __packed;
>
> struct perf_raw_record {
>    struct perf_raw_frag frag;
>     u32 size;
> };
>
> Maybe we can just add the check to perf_event_output instead?
>
> - KP
>
> > corrupting data. __u64 if helper definition is fine as well, because
> > that's the size of BPF registers that are used to pass arguments to
> > helpers.
> >
> > >
> > > Thanks
> > >
> > > Kevin Sheldrake
> > >
> > > PS I will get around to the clang/LLVM jump offset warning soon I promise.
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf
> > > > Of Kevin Sheldrake
> > > > Sent: 24 July 2020 10:40
> > > > To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > Cc: bpf@vger.kernel.org
> > > > Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > >
> > > > Hello Andrii
> > > >
> > > > Thank you for taking a look at this.  While the size is reported correctly to the
> > > > consumer (bar padding, etc), the actual offsets between adjacent pointers
> > > > appears to either have been cast to a u16 or otherwise masked with 0xFFFF,
> > > > causing what I believe to be overlapping samples and the opportunity for
> > > > sample corruption in the overlapped regions.
> > > >
> > > > Thanks again
> > > >
> > > > Kev
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > Sent: 23 July 2020 20:05
> > > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > > Cc: bpf@vger.kernel.org
> > > > Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > >
> > > > On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> > > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > >
> > > > > Hello
> > > > >
> > > > > Thank you for your response; I hope you don't mind me top-posting.  I've
> > > > put together a POC that demonstrates my results.  Edit the size of the data
> > > > char array in event_defs.h to change the behaviour.
> > > > >
> > > > >
> > > > > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> > > >
> > > > I haven't run your program, but I can certainly reproduce this using
> > > > bench_perfbuf in selftests. It does seem like something is silently corrupted,
> > > > because the size reported by perf is correct (plus/minus few bytes, probably
> > > > rounding up to 8 bytes), but the contents is not correct. I have no idea why
> > > > that's happening, maybe someone more familiar with the perf subsystem
> > > > can take a look.
> > > >
> > > > >
> > > > > Unfortunately, our project aims to run on older kernels than 5.8 so the bpf
> > > > ring buffer won't work for us.
> > > > >
> > > > > Thanks again
> > > > >
> > > > > Kevin Sheldrake
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On
> > > > Behalf
> > > > > Of Andrii Nakryiko
> > > > > Sent: 20 July 2020 05:35
> > > > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > > > Cc: bpf@vger.kernel.org
> > > > > Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > > >
> > > > > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> > > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > > >
> > > > > > Hello
> > > > > >
> > > > > > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd
> > > > like to ask about.  I haven't managed to find documentation for the
> > > > maximum size of a record that can be sent over the perf ring buffer, but
> > > > experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests
> > > > it is just short of 64KB.  Please could someone confirm if that's the case or
> > > > not?  My experiments suggest that sending a record that is greater than 64KB
> > > > results in the size reported in the callback being correct but the records
> > > > overlapping, causing corruption if they are not serviced as quickly as they
> > > > arrive.  Setting the record to exactly 64KB results in no records being received
> > > > at all.
> > > > > >
> > > > > > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on
> > > > the userland side; and bpf_perf_event_output(ctx, &event_map,
> > > > BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> > > > > >
> > > > > > Additionally, is there a better architecture for sending large volumes of
> > > > data (>64KB) back from the EBPF program to userland, such as a different
> > > > ring buffer, a map, some kind of shared mmaped segment, etc, other than
> > > > simply fragmenting the data?  Please excuse my naivety as I'm relatively new
> > > > to the world of EBPF.
> > > > > >
> > > > >
> > > > > I'm not aware of any such limitations for perf ring buffer and I haven't had a
> > > > chance to validate this. It would be great if you can provide a small repro so
> > > > that someone can take a deeper look, it does sound like a bug, if you really
> > > > get clobbered data. It might be actually how you set up perfbuf, AFAIK, it has
> > > > a mode where it will override the data, if it's not consumed quickly enough,
> > > > but you need to consciously enable that mode.
> > > > >
> > > > > But apart from that, shameless plug here, you can try the new BPF ring
> > > > buffer ([0]), available in 5.8+ kernels. It will allow you to avoid extra copy of
> > > > data you get with bpf_perf_event_output(), if you use BPF ringbuf's
> > > > bpf_ringbuf_reserve() + bpf_ringbuf_commit() API. It also has
> > > > bpf_ringbuf_output() API, which is logically  equivalent to
> > > > bpf_perf_event_output(). And it has a very high limit on sample size, up to
> > > > 512MB per sample.
> > > > >
> > > > > Keep in mind, BPF ringbuf is MPSC design and if you use just one BPF
> > > > ringbuf across all CPUs, you might run into some contention across multiple
> > > > CPUs. It is acceptable in a lot of applications I was targeting, but if you have a
> > > > high frequency of events (keep in mind, throughput doesn't matter, only
> > > > contention on sample reservation matters), you might want to use an array
> > > > of BPF ringbufs to scale throughput. You can do one ringbuf per CPU for
> > > > ultimate performance at the expense of memory usage (that's perf ring
> > > > buffer setup), but BPF ringbuf is flexible enough to allow any topology that
> > > > makes sense for your use case, from one shared ringbuf across all CPUs, to
> > > > anything in between.
> > > > >
> > > > >
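
For readers arriving via the archive, a minimal sketch of the reserve/submit pattern described above (the event layout, map size and attach point are illustrative, not taken from this thread):

    /* BPF side; build with clang -target bpf against libbpf headers */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    struct event_s {
            __u32 pid;
            char comm[16];
    };

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024); /* power of 2, multiple of page size */
    } events SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_execve")
    int handle_exec(void *ctx)
    {
            struct event_s *e;

            e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
            if (!e)
                    return 0;         /* reservation failed: buffer full */
            e->pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(e->comm, sizeof(e->comm));
            bpf_ringbuf_submit(e, 0); /* no extra copy, unlike bpf_perf_event_output() */
            return 0;
    }

The userland side consumes this with libbpf's ring_buffer__new() and ring_buffer__poll(), analogous to the perf_buffer__new()/perf_buffer__poll() pair used earlier in the thread.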

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-10-27  3:43               ` Andrii Nakryiko
@ 2020-10-28 19:03                 ` Kevin Sheldrake
  2020-10-29 22:48                   ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Kevin Sheldrake @ 2020-10-28 19:03 UTC (permalink / raw)
  To: Andrii Nakryiko, KP Singh; +Cc: bpf

> -----Original Message-----
> From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Sent: 27 October 2020 03:44
> 
> On Mon, Oct 26, 2020 at 6:07 PM KP Singh <kpsingh@chromium.org> wrote:
> >
> > On Mon, Oct 26, 2020 at 11:01 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, Oct 26, 2020 at 10:10 AM Kevin Sheldrake
> > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > >
> > > > Hello Andrii and list
> > > >
> > > > I've now had a chance to properly investigate the perf ring buffer
> corruption bug.  Essentially (as suspected), the size parameter that is
> specified as a __u64 in bpf_helper_defs.h is truncated into a __u16 inside
> the struct perf_event_header size parameter
> ($KERNELSRC/include/uapi/linux/perf_event.h and
> /usr/include/linux/perf_event.h).
> > > >
<SNIP>

> > The size argument is of the type ARG_CONST_SIZE_OR_ZERO:
> >
> > static const struct bpf_func_proto bpf_perf_event_output_proto = {
> >    .func = bpf_perf_event_output,
> >    .gpl_only = true,
> >    .ret_type = RET_INTEGER,
> >    .arg1_type = ARG_PTR_TO_CTX,
> >    .arg2_type = ARG_CONST_MAP_PTR,
> >    .arg3_type = ARG_ANYTHING,
> >    .arg4_type = ARG_PTR_TO_MEM,
> >    .arg5_type = ARG_CONST_SIZE_OR_ZERO, };
> >
> > and we do similar checks in the verifier with the BPF_MAX_VAR_SIZ:
> >
> > if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
> >    verbose(env, "R%d unbounded memory access, use 'var &= const' or
> > 'if (var < const)'\n",
> >       regno);
> >    return -EACCES;
> > }
> 
> You are right, of course, my bad. Verifier might not know the exact value, but
> it enforces the upper bound. So in this case we can additionally enforce an extra
> upper bound for bpf_perf_event_output().
> Though, given this will require kernel upgrade, I'd just stick to BPF ringbuf at
> that point ;)
> 
> >
> > it's just that bpf_perf_event_output expects the size to be even
> > smaller than 32 bits (i.e. 16 bits).
> >
> > > BPF verifier can't do much about that. It seems like the proper
> > > solution is to do the check in bpf_perf_event_output() BPF helper
> > > itself. Returning -E2BIG is an appropriate behavior here, rather
> > > than
> >
> > This could be a solution (and maybe better than the verifier check).
> >
> > But I do think perf needs to have the check instead of
> bpf_perf_event_output:
> 
> Yeah, of course, perf subsystem itself shouldn't allow data corruption. My
> point was to do it as a runtime check, rather than enforce at verification time.
> But both would work fine.

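As a rough illustration of the runtime-check idea, a sketch against the helper in kernel/trace/bpf_trace.c (the constant name is hypothetical, the 24-byte overhead is the empirical figure discussed below, and this is not a reviewed patch):

    /* Refuse oversized raw samples in the helper itself, before
     * perf_event_header.size (a __u16) can wrap.  PERF_RAW_SAMPLE_MAX
     * is an assumed name, not an existing kernel macro.
     */
    #define PERF_RAW_SAMPLE_MAX ((1 << 16) - 24)

    BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
               u64, flags, void *, data, u64, size)
    {
            if (unlikely(size > PERF_RAW_SAMPLE_MAX))
                    return -E2BIG;  /* refuse rather than truncate */

            /* ... existing body unchanged ... */
    }
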
Appreciate this is a fix in the verifier and not the perf subsystem, but would something like this be acceptable?

/usr/src/linux$ diff include/linux/bpf_verifier.h.orig include/linux/bpf_verifier.h
19a20,24
> /* Maximum variable size permitted for size param to bpf_perf_event_output().
>  * This ensures the samples sent into the perf ring buffer do not overflow the
>  * size parameter in the perf event header.
>  */
> #define BPF_MAX_PERF_SAMP_SIZ ((1 << (sizeof(((struct perf_event_header *)0)->size) * 8)) - 24)
/usr/src/linux$ diff kernel/bpf/verifier.c.orig kernel/bpf/verifier.c
4599c4599
<       struct bpf_reg_state *regs;
---
>       struct bpf_reg_state *regs, *reg;
4653a4654,4662
>       /* special check for bpf_perf_event_output() size */
>       regs = cur_regs(env);
>       reg = &regs[BPF_REG_5];
>       if (func_id == BPF_FUNC_perf_event_output && reg->umax_value >= BPF_MAX_PERF_SAMP_SIZ) {
>               verbose(env, "bpf_perf_event_output()#%d size parameter must be less than %ld\n",
>                       BPF_FUNC_perf_event_output, BPF_MAX_PERF_SAMP_SIZ);
>               return -E2BIG;
>       }
>
4686,4687d4694
<
<       regs = cur_regs(env);

I couldn't find the details on the size of the header/padding for the sample.  The struct perf_event_header is 16 bytes, whereas the struct perf_raw_record mentioned below is 40 bytes, but the actual value determined by experimentation is 24.
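
For anyone wanting to sanity-check the wrap-around from userland, a stand-alone snippet (the 24-byte overhead is the empirically observed figure above, not derived from a header file):

    /* gcc -o wrap wrap.c && ./wrap */
    #include <stdio.h>
    #include <stdint.h>

    #define OVERHEAD 24 /* observed per-record overhead, per above */
    #define BPF_MAX_PERF_SAMP_SIZ ((1 << 16) - OVERHEAD)

    int main(void)
    {
            uint64_t requested = 70000; /* sample larger than 64KB */
            /* perf_event_header.size is a __u16, so the on-ring size wraps */
            uint16_t stored = (uint16_t)(requested + OVERHEAD);

            printf("largest safe sample: %d bytes\n", BPF_MAX_PERF_SAMP_SIZ);
            printf("requested %lu+%d bytes, header records %u: the next record\n"
                   "starts %lu bytes early and overlaps this one\n",
                   (unsigned long)requested, OVERHEAD, (unsigned)stored,
                   (unsigned long)(requested + OVERHEAD - stored));
            return 0;
    }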

If acceptable, and given that it protects the perf ring buffer from corruption, could it be a candidate for back-porting?

Thanks

Kev





> 
> >
> > The size in perf always seems to be u32 except the
> > perf_event_header (I assume this is to save some space on the ring
> > buffer)
> 
> Probably saving space, yeah. Though it's wasting 3 lower bits because all sizes
> > seem to be a multiple of 8 always. So the real limit could be 8 * 64K =
> 512KB, easily. But it's a bit late now.
> 
> >
> > struct perf_raw_frag {
> >      union {
> >          struct perf_raw_frag *next;
> >          unsigned long pad;
> >      };
> >     perf_copy_f copy;
> >     void *data;
> >     u32 size;
> > } __packed;
> >
> > struct perf_raw_record {
> >    struct perf_raw_frag frag;
> >     u32 size;
> > };
> >
> > Maybe we can just add the check to perf_event_output instead?
> >
> > - KP
> >
> > > corrupting data. __u64 if helper definition is fine as well, because
> > > that's the size of BPF registers that are used to pass arguments to
> > > helpers.
> > >
> > > >
> > > > Thanks
> > > >
> > > > Kevin Sheldrake
> > > >
> > > > PS I will get around to the clang/LLVM jump offset warning soon, I
> promise.
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On
> > > > > Behalf Of Kevin Sheldrake
> > > > > Sent: 24 July 2020 10:40
> > > > > To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > Cc: bpf@vger.kernel.org
> > > > > Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring
> buffer?
> > > > >
> > > > > Hello Andrii
> > > > >
> > > > > Thank you for taking a look at this.  While the size is reported
> > > > > correctly to the consumer (bar padding, etc), the actual offsets
> > > > > between adjacent pointers appear either to have been cast to a
> > > > > u16 or otherwise masked with 0xFFFF, causing what I believe to
> > > > > be overlapping samples and the opportunity for sample corruption in
> the overlapped regions.
> > > > >
> > > > > Thanks again
> > > > >
> > > > > Kev
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > Sent: 23 July 2020 20:05
> > > > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > > > Cc: bpf@vger.kernel.org
> > > > > Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring
> buffer?
> > > > >
> > > > > On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> > > > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > > >
> > > > > > Hello
> > > > > >
> > > > > > Thank you for your response; I hope you don't mind me
> > > > > > top-posting.  I've
> > > > > put together a POC that demonstrates my results.  Edit the size
> > > > > of the data char array in event_defs.h to change the behaviour.
> > > > > >
> > > > > > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> > > > >
> > > > > I haven't run your program, but I can certainly reproduce this
> > > > > using bench_perfbuf in selftests. It does seem like something is
> > > > > silently corrupted, because the size reported by perf is correct
> > > > > (plus/minus a few bytes, probably rounding up to 8 bytes), but the
> > > > > contents are not correct. I have no idea why that's happening,
> > > > > maybe someone more familiar with the perf subsystem can take a
> look.
> > > > >
> > > > > >
> > > > > > Unfortunately, our project aims to run on older kernels than
> > > > > > 5.8 so the bpf
> > > > > ring buffer won't work for us.
> > > > > >
> > > > > > Thanks again
> > > > > >
> > > > > > Kevin Sheldrake
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org>
> On
> > > > > Behalf
> > > > > > Of Andrii Nakryiko
> > > > > > Sent: 20 July 2020 05:35
> > > > > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > > > > Cc: bpf@vger.kernel.org
> > > > > > Subject: [EXTERNAL] Re: Maximum size of record over perf ring
> buffer?
> > > > > >
> > > > > > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> > > > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > > > >
> > > > > > > Hello
> > > > > > >
> > > > > > > I'm building a tool using EBPF/libbpf/C and I've run into an
> > > > > > > issue that I'd
> > > > > like to ask about.  I haven't managed to find documentation for
> > > > > the maximum size of a record that can be sent over the perf ring
> > > > > buffer, but experimentation (on kernel 5.3 (x64) with latest
> > > > > libbpf from github) suggests it is just short of 64KB.  Please
> > > > > could someone confirm if that's the case or not?  My experiments
> > > > > suggest that sending a record that is greater than 64KB results
> > > > > in the size reported in the callback being correct but the
> > > > > records overlapping, causing corruption if they are not serviced
> > > > > as quickly as they arrive.  Setting the record to exactly 64KB results in
> no records being received at all.
> > > > > > >
> > > > > > > For reference, I'm using perf_buffer__new() and
> > > > > > > perf_buffer__poll() on
> > > > > the userland side; and bpf_perf_event_output(ctx, &event_map,
> > > > > BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> > > > > > >
> > > > > > > Additionally, is there a better architecture for sending
> > > > > > > large volumes of
> > > > > data (>64KB) back from the EBPF program to userland, such as a
> > > > > different ring buffer, a map, some kind of shared mmaped
> > > > > segment, etc, other than simply fragmenting the data?  Please
> > > > > excuse my naivety as I'm relatively new to the world of EBPF.
> > > > > > >
> > > > > >
> > > > > > I'm not aware of any such limitations for perf ring buffer and
> > > > > > I haven't had a
> > > > > chance to validate this. It would be great if you can provide a
> > > > > small repro so that someone can take a deeper look, it does
> > > > > sound like a bug, if you really get clobbered data. It might be
> > > > > actually how you set up perfbuf, AFAIK, it has a mode where it
> > > > > will overwrite the data, if it's not consumed quickly enough, but you
> need to consciously enable that mode.
> > > > > >
> > > > > > But apart from that, shameless plug here, you can try the new
> > > > > > BPF ring
> > > > > buffer ([0]), available in 5.8+ kernels. It will allow you to
> > > > > avoid extra copy of data you get with bpf_perf_event_output(),
> > > > > if you use BPF ringbuf's
> > > > > bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It also has
> > > > > bpf_ringbuf_output() API, which is logically equivalent to
> > > > > bpf_perf_event_output(). And it has a very high limit on sample
> > > > > size, up to 512MB per sample.
> > > > > >
> > > > > > Keep in mind, BPF ringbuf is MPSC design and if you use just
> > > > > > one BPF
> > > > > ringbuf across all CPUs, you might run into some contention
> > > > > across multiple CPUs. It is acceptable in a lot of applications I
> > > > > was targeting, but if you have a high frequency of events (keep
> > > > > in mind, throughput doesn't matter, only contention on sample
> > > > > reservation matters), you might want to use an array of BPF
> > > > > ringbufs to scale throughput. You can do one ringbuf per CPU
> > > > > for ultimate performance at the expense of memory usage (that's
> > > > > perf ring buffer setup), but BPF ringbuf is flexible enough to
> > > > > allow any topology that makes sense for your use case, from one shared
> ringbuf across all CPUs, to anything in between.
> > > > > >
> > > > > >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
  2020-10-28 19:03                 ` Kevin Sheldrake
@ 2020-10-29 22:48                   ` Andrii Nakryiko
  0 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2020-10-29 22:48 UTC (permalink / raw)
  To: Kevin Sheldrake; +Cc: KP Singh, bpf

On Wed, Oct 28, 2020 at 12:03 PM Kevin Sheldrake
<Kevin.Sheldrake@microsoft.com> wrote:
>
> > -----Original Message-----
> > From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > Sent: 27 October 2020 03:44
> >
> > On Mon, Oct 26, 2020 at 6:07 PM KP Singh <kpsingh@chromium.org> wrote:
> > >
> > > On Mon, Oct 26, 2020 at 11:01 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Mon, Oct 26, 2020 at 10:10 AM Kevin Sheldrake
> > > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > >
> > > > > Hello Andrii and list
> > > > >
> > > > > I've now had a chance to properly investigate the perf ring buffer
> > corruption bug.  Essentially (as suspected), the size parameter that is
> > specified as a __u64 in bpf_helper_defs.h is truncated into a __u16 inside
> > the struct perf_event_header size parameter
> > ($KERNELSRC/include/uapi/linux/perf_event.h and
> > /usr/include/linux/perf_event.h).
> > > > >
> <SNIP>
>
> > > The size argument is of the type ARG_CONST_SIZE_OR_ZERO:
> > >
> > > static const struct bpf_func_proto bpf_perf_event_output_proto = {
> > >    .func = bpf_perf_event_output,
> > >    .gpl_only = true,
> > >    .ret_type = RET_INTEGER,
> > >    .arg1_type = ARG_PTR_TO_CTX,
> > >    .arg2_type = ARG_CONST_MAP_PTR,
> > >    .arg3_type = ARG_ANYTHING,
> > >    .arg4_type = ARG_PTR_TO_MEM,
> > >    .arg5_type = ARG_CONST_SIZE_OR_ZERO, };
> > >
> > > and we do similar checks in the verifier with the BPF_MAX_VAR_SIZ:
> > >
> > > if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
> > >    verbose(env, "R%d unbounded memory access, use 'var &= const' or
> > > 'if (var < const)'\n",
> > >       regno);
> > >    return -EACCES;
> > > }
> >
> > You are right, of course, my bad. Verifier might not know the exact value, but
> > it enforces the upper bound. So in this case we can additionally enforce an extra
> > upper bound for bpf_perf_event_output().
> > Though, given this will require kernel upgrade, I'd just stick to BPF ringbuf at
> > that point ;)
> >
> > >
> > > it's just that bpf_perf_event_output expects the size to be even
> > > smaller than 32 bits (i.e. 16 bits).
> > >
> > > > BPF verifier can't do much about that. It seems like the proper
> > > > solution is to do the check in bpf_perf_event_output() BPF helper
> > > > itself. Returning -E2BIG is an appropriate behavior here, rather
> > > > than
> > >
> > > This could be a solution (and maybe better than the verifier check).
> > >
> > > But I do think perf needs to have the check instead of
> > bpf_perf_event_output:
> >
> > Yeah, of course, perf subsystem itself shouldn't allow data corruption. My
> > point was to do it as a runtime check, rather than enforce at verification time.
> > But both would work fine.
>
> Appreciate this is a fix in the verifier and not the perf subsystem, but would something like this be acceptable?
>
> /usr/src/linux$ diff include/linux/bpf_verifier.h.orig include/linux/bpf_verifier.h
> 19a20,24
> > /* Maximum variable size permitted for size param to bpf_perf_event_output().
> >  * This ensures the samples sent into the perf ring buffer do not overflow the
> >  * size parameter in the perf event header.
> >  */
> > #define BPF_MAX_PERF_SAMP_SIZ ((1 << (sizeof(((struct perf_event_header *)0)->size) * 8)) - 24)
> /usr/src/linux$ diff kernel/bpf/verifier.c.orig kernel/bpf/verifier.c
> 4599c4599
> <       struct bpf_reg_state *regs;
> ---
> >       struct bpf_reg_state *regs, *reg;
> 4653a4654,4662

you didn't use unified diff format, so it's hard to tell where exactly
you made this change. But please post a proper patch and let's review
it properly.
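
For reference, the same change rendered as a unified diff (hunk offsets elided; this only re-formats the patch quoted above and below, it is not a separately tested version):

    --- a/include/linux/bpf_verifier.h
    +++ b/include/linux/bpf_verifier.h
    @@ ... @@
    +/* Maximum variable size permitted for size param to bpf_perf_event_output().
    + * This ensures the samples sent into the perf ring buffer do not overflow the
    + * size parameter in the perf event header.
    + */
    +#define BPF_MAX_PERF_SAMP_SIZ ((1 << (sizeof(((struct perf_event_header *)0)->size) * 8)) - 24)
    --- a/kernel/bpf/verifier.c
    +++ b/kernel/bpf/verifier.c
    @@ ... @@
    -	struct bpf_reg_state *regs;
    +	struct bpf_reg_state *regs, *reg;
    @@ ... @@
    +	/* special check for bpf_perf_event_output() size */
    +	regs = cur_regs(env);
    +	reg = &regs[BPF_REG_5];
    +	if (func_id == BPF_FUNC_perf_event_output && reg->umax_value >= BPF_MAX_PERF_SAMP_SIZ) {
    +		verbose(env, "bpf_perf_event_output()#%d size parameter must be less than %ld\n",
    +			BPF_FUNC_perf_event_output, BPF_MAX_PERF_SAMP_SIZ);
    +		return -E2BIG;
    +	}
    @@ ... @@
    -
    -	regs = cur_regs(env);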

> >       /* special check for bpf_perf_event_output() size */
> >       regs = cur_regs(env);
> >       reg = &regs[BPF_REG_5];
> >       if (func_id == BPF_FUNC_perf_event_output && reg->umax_value >= BPF_MAX_PERF_SAMP_SIZ) {
> >               verbose(env, "bpf_perf_event_output()#%d size parameter must be less than %ld\n",
> >                       BPF_FUNC_perf_event_output, BPF_MAX_PERF_SAMP_SIZ);
> >               return -E2BIG;
> >       }
> >
> 4686,4687d4694
> <
> <       regs = cur_regs(env);
>
> I couldn't find the details on the size of the header/padding for the sample.  The struct perf_event_header is 16 bytes, whereas the struct perf_raw_record mentioned below is 40 bytes, but the actual value determined by experimentation is 24.
>
> If acceptable, and given that it protects the perf ring buffer from corruption, could it be a candidate for back-porting?
>
> Thanks
>
> Kev
>
>
>
>
>
> >
> > >
> > > The size in perf always seems to be u32 except the
> > > perf_event_header (I assume this is to save some space on the ring
> > > buffer)
> >
> > Probably saving space, yeah. Though it's wasting 3 lower bits because all sizes
> > seem to be a multiple of 8 always. So the real limit could be 8 * 64K =
> > 512KB, easily. But it's a bit late now.
> >
> > >
> > > struct perf_raw_frag {
> > >      union {
> > >          struct perf_raw_frag *next;
> > >          unsigned long pad;
> > >      };
> > >     perf_copy_f copy;
> > >     void *data;
> > >     u32 size;
> > > } __packed;
> > >
> > > struct perf_raw_record {
> > >    struct perf_raw_frag frag;
> > >     u32 size;
> > > };
> > >
> > > Maybe we can just add the check to perf_event_output instead?
> > >
> > > - KP
> > >
> > > > corrupting data. __u64 if helper definition is fine as well, because
> > > > that's the size of BPF registers that are used to pass arguments to
> > > > helpers.
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > Kevin Sheldrake
> > > > >
> > > > > PS I will get around to the clang/LLVM jump offset warning soon, I
> > promise.
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On
> > > > > > Behalf Of Kevin Sheldrake
> > > > > > Sent: 24 July 2020 10:40
> > > > > > To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > Cc: bpf@vger.kernel.org
> > > > > > Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring
> > buffer?
> > > > > >
> > > > > > Hello Andrii
> > > > > >
> > > > > > Thank you for taking a look at this.  While the size is reported
> > > > > > correctly to the consumer (bar padding, etc), the actual offsets
> > > > > > between adjacent pointers appear either to have been cast to a
> > > > > > u16 or otherwise masked with 0xFFFF, causing what I believe to
> > > > > > be overlapping samples and the opportunity for sample corruption in
> > the overlapped regions.
> > > > > >
> > > > > > Thanks again
> > > > > >
> > > > > > Kev
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > > > > Sent: 23 July 2020 20:05
> > > > > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > > > > Cc: bpf@vger.kernel.org
> > > > > > Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring
> > buffer?
> > > > > >
> > > > > > On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> > > > > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > > > >
> > > > > > > Hello
> > > > > > >
> > > > > > > Thank you for your response; I hope you don't mind me
> > > > > > > top-posting.  I've
> > > > > > put together a POC that demonstrates my results.  Edit the size
> > > > > > of the data char array in event_defs.h to change the behaviour.
> > > > > > >
> > > > > > > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> > > > > >
> > > > > > I haven't run your program, but I can certainly reproduce this
> > > > > > using bench_perfbuf in selftests. It does seem like something is
> > > > > > silently corrupted, because the size reported by perf is correct
> > > > > > (plus/minus a few bytes, probably rounding up to 8 bytes), but the
> > > > > > contents are not correct. I have no idea why that's happening,
> > > > > > maybe someone more familiar with the perf subsystem can take a
> > look.
> > > > > >
> > > > > > >
> > > > > > > Unfortunately, our project aims to run on older kernels than
> > > > > > > 5.8 so the bpf
> > > > > > ring buffer won't work for us.
> > > > > > >
> > > > > > > Thanks again
> > > > > > >
> > > > > > > Kevin Sheldrake
> > > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org>
> > On
> > > > > > Behalf
> > > > > > > Of Andrii Nakryiko
> > > > > > > Sent: 20 July 2020 05:35
> > > > > > > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > > > > > > Cc: bpf@vger.kernel.org
> > > > > > > Subject: [EXTERNAL] Re: Maximum size of record over perf ring
> > buffer?
> > > > > > >
> > > > > > > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> > > > > > <Kevin.Sheldrake@microsoft.com> wrote:
> > > > > > > >
> > > > > > > > Hello
> > > > > > > >
> > > > > > > > I'm building a tool using EBPF/libbpf/C and I've run into an
> > > > > > > > issue that I'd
> > > > > > like to ask about.  I haven't managed to find documentation for
> > > > > > the maximum size of a record that can be sent over the perf ring
> > > > > > buffer, but experimentation (on kernel 5.3 (x64) with latest
> > > > > > libbpf from github) suggests it is just short of 64KB.  Please
> > > > > > could someone confirm if that's the case or not?  My experiments
> > > > > > suggest that sending a record that is greater than 64KB results
> > > > > > in the size reported in the callback being correct but the
> > > > > > records overlapping, causing corruption if they are not serviced
> > > > > > as quickly as they arrive.  Setting the record to exactly 64KB results in
> > no records being received at all.
> > > > > > > >
> > > > > > > > For reference, I'm using perf_buffer__new() and
> > > > > > > > perf_buffer__poll() on
> > > > > > the userland side; and bpf_perf_event_output(ctx, &event_map,
> > > > > > BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> > > > > > > >
> > > > > > > > Additionally, is there a better architecture for sending
> > > > > > > > large volumes of
> > > > > > data (>64KB) back from the EBPF program to userland, such as a
> > > > > > different ring buffer, a map, some kind of shared mmaped
> > > > > > segment, etc, other than simply fragmenting the data?  Please
> > > > > > excuse my naivety as I'm relatively new to the world of EBPF.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not aware of any such limitations for perf ring buffer and
> > > > > > > I haven't had a
> > > > > > chance to validate this. It would be great if you can provide a
> > > > > > small repro so that someone can take a deeper look, it does
> > > > > > sound like a bug, if you really get clobbered data. It might be
> > > > > > actually how you set up perfbuf, AFAIK, it has a mode where it
> > > > > > will overwrite the data, if it's not consumed quickly enough, but you
> > need to consciously enable that mode.
> > > > > > >
> > > > > > > But apart from that, shameless plug here, you can try the new
> > > > > > > BPF ring
> > > > > > buffer ([0]), available in 5.8+ kernels. It will allow you to
> > > > > > avoid extra copy of data you get with bpf_perf_event_output(),
> > > > > > if you use BPF ringbuf's
> > > > > > bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It also has
> > > > > > bpf_ringbuf_output() API, which is logically equivalent to
> > > > > > bpf_perf_event_output(). And it has a very high limit on sample
> > > > > > size, up to 512MB per sample.
> > > > > > >
> > > > > > > Keep in mind, BPF ringbuf is MPSC design and if you use just
> > > > > > > one BPF
> > > > > > ringbuf across all CPUs, you might run into some contention
> > > > > > across multiple CPUs. It is acceptable in a lot of applications I
> > > > > > was targeting, but if you have a high frequency of events (keep
> > > > > > in mind, throughput doesn't matter, only contention on sample
> > > > > > reservation matters), you might want to use an array of BPF
> > > > > > ringbufs to scale throughput. You can do one ringbuf per CPU
> > > > > > for ultimate performance at the expense of memory usage (that's
> > > > > > perf ring buffer setup), but BPF ringbuf is flexible enough to
> > > > > > allow any topology that makes sense for your use case, from one shared
> > ringbuf across all CPUs, to anything in between.
> > > > > > >
> > > > > > >

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-10-29 22:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-17 14:23 Maximum size of record over perf ring buffer? Kevin Sheldrake
2020-07-20  4:35 ` Andrii Nakryiko
2020-07-20 11:39   ` [EXTERNAL] " Kevin Sheldrake
2020-07-23 19:05     ` Andrii Nakryiko
2020-07-24  9:40       ` Kevin Sheldrake
2020-10-26 17:10         ` Kevin Sheldrake
2020-10-26 22:01           ` Andrii Nakryiko
2020-10-27  1:07             ` KP Singh
2020-10-27  3:43               ` Andrii Nakryiko
2020-10-28 19:03                 ` Kevin Sheldrake
2020-10-29 22:48                   ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).