From: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: "bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
Date: Mon, 26 Oct 2020 17:10:07 +0000	[thread overview]
Message-ID: <VI1PR8303MB00808A980F003A9F403E413EFB190@VI1PR8303MB0080.EURPRD83.prod.outlook.com> (raw)
In-Reply-To: <HE1PR83MB0220B3D0413E997D1A33FA52FB770@HE1PR83MB0220.EURPRD83.prod.outlook.com>

Hello Andrii and list

I've now had the chance to properly investigate the perf ring buffer corruption bug.  Essentially (as suspected), the size parameter, which is specified as a __u64 in bpf_helper_defs.h, is truncated to a __u16 when it is stored in the size member of struct perf_event_header ($KERNELSRC/include/uapi/linux/perf_event.h and /usr/include/linux/perf_event.h).

Changing the size member of perf_event_header (in both locations) to a __u32 (or __u64 if you prefer) lets me send more than 64KB of data in a single perf sample, but I'm not convinced that this is a good or workable solution.

As I, and probably others, will more likely tend towards much smaller, fragmented packets, I suggest (having spoken to KP Singh) that the fix should probably be in the verifier - ensuring the size is < 0xffff - 8 (sizeof(struct perf_event_header), I guess) - with corresponding updates to bpf_helper_defs.h to raise a clang warning/error, and to the bpf_helpers man page.

The bpf_helper_defs.h and man page updates are trivial, but I can't work out where in verifier.c the check should go.  It feels like it should be in check_helper_call() but I can't see any other similar checks in there.  I suspect that the better fix would be to create another ARG_CONST_SIZE type, such as ARG_CONST_SIZE_PERF_SAMPLE, that can be explicitly checked rather than adding ad hoc size checks.

As this bug causes corruption inside the perf ring buffer when samples overlap, and makes the reported sample size questionable, please can I ask for some help in fixing it?

Thanks

Kevin Sheldrake

PS I will get around to the clang/LLVM jump offset warning soon, I promise.



> -----Original Message-----
> From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On Behalf
> Of Kevin Sheldrake
> Sent: 24 July 2020 10:40
> To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> 
> Hello Andrii
> 
> Thank you for taking a look at this.  While the size is reported correctly to the
> consumer (bar padding, etc), the actual offsets between adjacent pointers
> appear to have been either cast to a u16 or otherwise masked with 0xFFFF,
> causing what I believe to be overlapping samples and the opportunity for
> sample corruption in the overlapped regions.
> 
> Thanks again
> 
> Kev
> 
> 
> -----Original Message-----
> From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Sent: 23 July 2020 20:05
> To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> Cc: bpf@vger.kernel.org
> Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> 
> On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> <Kevin.Sheldrake@microsoft.com> wrote:
> >
> > Hello
> >
> > Thank you for your response; I hope you don't mind me top-posting.  I've
> put together a POC that demonstrates my results.  Edit the size of the data
> char array in event_defs.h to change the behaviour.
> >
> >
> > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> 
> I haven't run your program, but I can certainly reproduce this using
> bench_perfbuf in selftests. It does seem like something is silently corrupted,
> because the size reported by perf is correct (plus/minus a few bytes, probably
> rounding up to 8 bytes), but the contents are not correct. I have no idea why
> that's happening, maybe someone more familiar with the perf subsystem
> can take a look.
> 
> >
> > Unfortunately, our project aims to run on older kernels than 5.8 so the bpf
> ring buffer won't work for us.
> >
> > Thanks again
> >
> > Kevin Sheldrake
> >
> >
> > -----Original Message-----
> > From: bpf-owner@vger.kernel.org <bpf-owner@vger.kernel.org> On
> Behalf
> > Of Andrii Nakryiko
> > Sent: 20 July 2020 05:35
> > To: Kevin Sheldrake <Kevin.Sheldrake@microsoft.com>
> > Cc: bpf@vger.kernel.org
> > Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> >
> > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> <Kevin.Sheldrake@microsoft.com> wrote:
> > >
> > > Hello
> > >
> > > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd
> like to ask about.  I haven't managed to find documentation for the
> maximum size of a record that can be sent over the perf ring buffer, but
> experimentation (on kernel 5.3 (x64) with latest libbpf from github) suggests
> it is just short of 64KB.  Please could someone confirm if that's the case or
> not?  My experiments suggest that sending a record that is greater than 64KB
> results in the size reported in the callback being correct but the records
> overlapping, causing corruption if they are not serviced as quickly as they
> arrive.  Setting the record to exactly 64KB results in no records being received
> at all.
> > >
> > > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on
> the userland side; and bpf_perf_event_output(ctx, &event_map,
> BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
> > >
> > > Additionally, is there a better architecture for sending large volumes of
> data (>64KB) back from the EBPF program to userland, such as a different
> ring buffer, a map, some kind of shared mmaped segment, etc, other than
> simply fragmenting the data?  Please excuse my naivety as I'm relatively new
> to the world of EBPF.
> > >
> >
> > I'm not aware of any such limitations for perf ring buffer and I haven't had a
> chance to validate this. It would be great if you can provide a small repro so
> that someone can take a deeper look, it does sound like a bug, if you really
> get clobbered data. It might be actually how you set up perfbuf, AFAIK, it has
> a mode where it will override the data, if it's not consumed quickly enough,
> but you need to consciously enable that mode.
> >
> > But apart from that, shameless plug here, you can try the new BPF ring
> buffer ([0]), available in 5.8+ kernels. It will allow you to avoid extra copy of
> data you get with bpf_perf_event_output(), if you use BPF ringbuf's
> bpf_ringbuf_reserve() + bpf_ringbuf_commit() API. It also has
> bpf_ringbuf_output() API, which is logically equivalent to
> bpf_perf_event_output(). And it has a very high limit on sample size, up to
> 512MB per sample.
> >
> > Keep in mind, BPF ringbuf is MPSC design and if you use just one BPF
> ringbuf across all CPUs, you might run into some contention across multiple
> CPUs. It is acceptable in a lot of applications I was targeting, but if you have a
> high frequency of events (keep in mind, throughput doesn't matter, only
> contention on sample reservation matters), you might want to use an array
> of BPF ringbufs to scale throughput. You can do 1 ringbuf per each CPU for
> ultimate performance at the expense of memory usage (that's perf ring
> buffer setup), but BPF ringbuf is flexible enough to allow any topology that
> makes sense for your use case, from 1 shared ringbuf across all CPUs, to
> anything in between.
> >
> >


Thread overview: 11+ messages
2020-07-17 14:23 Maximum size of record over perf ring buffer? Kevin Sheldrake
2020-07-20  4:35 ` Andrii Nakryiko
2020-07-20 11:39   ` [EXTERNAL] " Kevin Sheldrake
2020-07-23 19:05     ` Andrii Nakryiko
2020-07-24  9:40       ` Kevin Sheldrake
2020-10-26 17:10         ` Kevin Sheldrake [this message]
2020-10-26 22:01           ` Andrii Nakryiko
2020-10-27  1:07             ` KP Singh
2020-10-27  3:43               ` Andrii Nakryiko
2020-10-28 19:03                 ` Kevin Sheldrake
2020-10-29 22:48                   ` Andrii Nakryiko
