Re: [PATCH v2 bpf-next 7/7] docs/bpf: add BPF ring buffer design notes

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Alban Crequy <alban.crequy@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Jonathan Lemon <jonathan.lemon@gmail.com>,
	Stanislav Fomichev <sdf@google.com>,
	Alban Crequy <alban@kinvolk.io>,
	mauricio@kinvolk.io, kai@kinvolk.io
Subject: Re: [PATCH v2 bpf-next 7/7] docs/bpf: add BPF ring buffer design notes
Date: Mon, 25 May 2020 12:12:31 -0700	[thread overview]
Message-ID: <CAEf4BzY3mt8puNgOwi5ZWnVbXksnsXK_beG+HhhZutyBG-BO7A@mail.gmail.com> (raw)
In-Reply-To: <CAMXgnP424S5s-mrwFB_nuZNSuqLyi1K8r519WKVkyMBPtv1PMQ@mail.gmail.com>

On Mon, May 25, 2020 at 3:00 AM Alban Crequy <alban.crequy@gmail.com> wrote:
>
> Hi,
>
> Thanks. Both motivators look very interesting to me:
>
> On Sun, 17 May 2020 at 21:58, Andrii Nakryiko <andriin@fb.com> wrote:
> [...]
> > +Motivation
> > +----------
> > +There are two distinctive motivators for this work, which are not satisfied by
> > +existing perf buffer, which prompted creation of a new ring buffer
> > +implementation.
> > +  - more efficient memory utilization by sharing ring buffer across CPUs;
>
> I have a use case with traceloop
> (https://github.com/kinvolk/traceloop) where I use one
> BPF_MAP_TYPE_PERF_EVENT_ARRAY per container, so when the number of
> containers times the number of CPU is high, it can use a lot of
> memory.
>
> > +  - preserving ordering of events that happen sequentially in time, even
> > +  across multiple CPUs (e.g., fork/exec/exit events for a task).
>
> I had the problem to keep track of TCP connections and when
> tcp-connect and tcp-close events can be on different CPUs, it makes it
> difficult to get the correct order.

Yep, in one of BPF applications I've written, handling out-of-order
events was major complication to the design of data structures, as
well as user-space implementation logic.

>
> [...]
> > +There are a bunch of similarities between perf buffer
> > +(BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
> > +  - variable-length records;
> > +  - if there is no more space left in ring buffer, reservation fails, no
> > +    blocking;
> [...]
>
> BPF_MAP_TYPE_PERF_EVENT_ARRAY can be set as both 'overwriteable' and
> 'backward': if there is no more space left in ring buffer, it would
> then overwrite the old events. For that, the buffer needs to be
> prepared with mmap(...PROT_READ) instead of mmap(...PROT_READ |
> PROT_WRITE), and set the write_backward flag. See details in commit
> 9ecda41acb97 ("perf/core: Add ::write_backward attribute to perf
> event"):
>
> struct perf_event_attr attr = {0,};
> attr.write_backward = 1; /* backward */
> fd = perf_event_open_map(&attr, ...);
> base = mmap(fd, 0, size, PROT_READ /* overwriteable */, MAP_SHARED);
>
> I use overwriteable and backward ring buffers in traceloop: buffers
> are continuously overwritten and are usually not read, except when a
> user explicitly asks for it (e.g. to inspect the last few events of an
> application after a crash). If BPF_MAP_TYPE_RINGBUF implements the
> same features, then I would be able to switch and use less memory.
>
> Do you think it will be possible to implement that in BPF_MAP_TYPE_RINGBUF?
>

I think it could be implemented similarly. Consumer_pos would be
ignored, producer_pos would point to the beginning of record and
decremented on new reservation. All the implementation and semantics
would stay. Extending ringbuf itself to enable this is also trivial,
it could be just extra map_flag passed when map is created,
consumer_pos page would become mmap()'able as R/O, of course.

But I fail to see how consumer can be 100% certain it's not reading
garbage data, especially on 32-bit architectures, where wrapping over
32-bit producer position is actually quite easy. Just checking
producer position before/after read isn't completely correct. Ignoring
that problem, the only sane way (IMO) to do this would mean copying
each record into a "stable" memory, before actually doing anything
with it, which is a pretty bad performance hit as well.

So all in all, such mode could be added, but certainly in a separate
patch set and after some good discussion :).

> Cheers,
> Alban