RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?

From: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: "bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
Date: Fri, 24 Jul 2020 09:40:16 +0000
Message-ID: <HE1PR83MB0220B3D0413E997D1A33FA52FB770@HE1PR83MB0220.EURPRD83.prod.outlook.com>
In-Reply-To: <CAEf4BzZj8z5YWHQkYBjBuQ2LUwvodt7tz_9=GZzZ6hcW3zkj5g@mail.gmail.com>

Hello Andrii

Thank you for taking a look at this.  While the size is reported correctly to the consumer (bar padding, etc), the actual offsets between adjacent records appear to have been either cast to a u16 or otherwise masked with 0xFFFF.  I believe this causes samples to overlap, with the opportunity for corruption in the overlapped regions.
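
For illustration, here is a minimal userland sketch of the arithmetic I have in mind.  It assumes the record length ends up in a 16-bit field, as the size member of struct perf_event_header in <linux/perf_event.h> is; whether the kernel uses that truncated value to advance between records is my assumption, not something I have confirmed.  The names record_header, round_up8() and stored_size() are hypothetical and only mirror the layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct perf_event_header: the size field is 16 bits. */
struct record_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;
};

/* Hypothetical helper: round a length up to the next multiple of 8 bytes. */
static size_t round_up8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/* Hypothetical helper: the value a 16-bit size field holds after a record
 * of total_len bytes is assigned to it (equivalent to total_len & 0xFFFF). */
static uint16_t stored_size(size_t total_len)
{
    struct record_header hdr = { 0 };

    hdr.size = (uint16_t)total_len;
    return hdr.size;
}
```

Under that assumption a 70000-byte record would be stored with a size of 4464, so the next record would start 4464 bytes later and overlap it, and a record of exactly 65536 bytes would be stored with a size of 0.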

Thanks again

Kev

-----Original Message-----
From: Andrii Nakryiko <andrii.nakryiko@gmail.com> 
Sent: 23 July 2020 20:05
To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
Cc: bpf@vger.kernel.org
Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?

On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake <Kevin.Sheldrake@microsoft.com> wrote:
>
> Hello
>
> Thank you for your response; I hope you don't mind me top-posting.  I've put together a POC that demonstrates my results.  Edit the size of the data char array in event_defs.h to change the behaviour.
>
> https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc

I haven't run your program, but I can certainly reproduce this using bench_perfbuf in selftests. It does seem like something is silently corrupted, because the size reported by perf is correct (plus or minus a few bytes, probably rounding up to 8 bytes), but the contents are not. I have no idea why that's happening; maybe someone more familiar with the perf subsystem can take a look.

>
> Unfortunately, our project aims to run on older kernels than 5.8 so the bpf ring buffer won't work for us.
>
> Thanks again
>
> Kevin Sheldrake
>
>
> -----Original Message-----
> From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf Of Andrii Nakryiko
> Sent: 20 July 2020 05:35
> To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> Cc: bpf@vger.kernel.org
> Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
>
> On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake <Kevin.Sheldrake@microsoft.com> wrote:
> >
> > Hello
> >
> > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd like to ask about.  I haven't managed to find documentation for the maximum size of a record that can be sent over the perf ring buffer, but experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests it is just short of 64KB.  Please could someone confirm if that's the case or not?  My experiments suggest that sending a record that is greater than 64KB results in the size reported in the callback being correct but the records overlapping, causing corruption if they are not serviced as quickly as they arrive.  Setting the record to exactly 64KB results in no records being received at all.
> >
> > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on the userland side; and bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> >
> > Additionally, is there a better architecture for sending large volumes of data (>64KB) back from the EBPF program to userland, such as a different ring buffer, a map, some kind of shared mmaped segment, etc, other than simply fragmenting the data?  Please excuse my naivety as I'm relatively new to the world of EBPF.
> >
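
For reference, the fragmentation fallback mentioned above could be sketched roughly as follows.  This is plain userland C showing only the bookkeeping; the frag_hdr layout, the 32KB payload limit, and the helper names are all hypothetical choices made for illustration, not part of libbpf or the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed per-record payload limit, chosen to stay well under 64KB. */
#define FRAG_PAYLOAD_MAX 32768u

/* Hypothetical header sent in front of each fragment so the consumer
 * can reassemble the original message. */
struct frag_hdr {
    uint32_t msg_id;  /* identifies the message the fragment belongs to */
    uint16_t index;   /* position of this fragment, starting at 0 */
    uint16_t count;   /* total number of fragments in the message */
    uint32_t len;     /* payload bytes carried by this fragment */
};

/* Number of fragments needed for a message of total bytes. */
static size_t frag_count(size_t total)
{
    return (total + FRAG_PAYLOAD_MAX - 1) / FRAG_PAYLOAD_MAX;
}

/* Fill hdr and offset for fragment index of a message of total bytes.
 * Returns 0 on success, -1 if index is out of range or the message
 * needs more fragments than the 16-bit count field can express. */
static int frag_describe(uint32_t msg_id, size_t total, size_t index,
                         struct frag_hdr *hdr, size_t *offset)
{
    size_t count = frag_count(total);
    size_t remaining;

    if (index >= count || count > UINT16_MAX)
        return -1;

    *offset = index * FRAG_PAYLOAD_MAX;
    remaining = total - *offset;

    hdr->msg_id = msg_id;
    hdr->index = (uint16_t)index;
    hdr->count = (uint16_t)count;
    hdr->len = (uint32_t)(remaining < FRAG_PAYLOAD_MAX ? remaining
                                                       : FRAG_PAYLOAD_MAX);
    return 0;
}
```

With these assumptions a 100000-byte message splits into four fragments, the last one carrying 1696 bytes from offset 98304.  On the BPF side each fragment would be a separate output call, and the loop over fragments would need to be bounded to satisfy the verifier.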
>
> I'm not aware of any such limitation for the perf ring buffer, and I haven't had a chance to validate this. It would be great if you could provide a small repro so that someone can take a deeper look; it does sound like a bug if you really get clobbered data. It might actually be how you set up perfbuf: AFAIK, it has a mode where it will overwrite the data if it's not consumed quickly enough, but you need to consciously enable that mode.
>
> But apart from that, shameless plug here, you can try the new BPF ring buffer ([0]), available in 5.8+ kernels. It will allow you to avoid the extra copy of data you get with bpf_perf_event_output(), if you use BPF ringbuf's bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It also has the bpf_ringbuf_output() API, which is logically equivalent to bpf_perf_event_output(). And it has a very high limit on sample size, up to 512MB per sample.
>
> Keep in mind, BPF ringbuf is an MPSC design, and if you use just one BPF ringbuf across all CPUs, you might run into some contention across multiple CPUs. It is acceptable in a lot of the applications I was targeting, but if you have a high frequency of events (keep in mind, throughput doesn't matter, only contention on sample reservation matters), you might want to use an array of BPF ringbufs to scale throughput. You can do 1 ringbuf per CPU for ultimate performance at the expense of memory usage (that's the perf ring buffer setup), but BPF ringbuf is flexible enough to allow any topology that makes sense for your use case, from 1 shared ringbuf across all CPUs, to anything in between.
>
>