BPF Archive on lore.kernel.org
 help / color / Atom feed
From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Alban Crequy <alban.crequy@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Jonathan Lemon <jonathan.lemon@gmail.com>,
	Stanislav Fomichev <sdf@google.com>,
	Alban Crequy <alban@kinvolk.io>,
	mauricio@kinvolk.io, kai@kinvolk.io
Subject: Re: [PATCH v2 bpf-next 7/7] docs/bpf: add BPF ring buffer design notes
Date: Mon, 25 May 2020 12:12:31 -0700
Message-ID: <CAEf4BzY3mt8puNgOwi5ZWnVbXksnsXK_beG+HhhZutyBG-BO7A@mail.gmail.com> (raw)
In-Reply-To: <CAMXgnP424S5s-mrwFB_nuZNSuqLyi1K8r519WKVkyMBPtv1PMQ@mail.gmail.com>

On Mon, May 25, 2020 at 3:00 AM Alban Crequy <alban.crequy@gmail.com> wrote:
>
> Hi,
>
> Thanks. Both motivators look very interesting to me:
>
> On Sun, 17 May 2020 at 21:58, Andrii Nakryiko <andriin@fb.com> wrote:
> [...]
> > +Motivation
> > +----------
> > +There are two distinctive motivators for this work, which are not satisfied by
> > +existing perf buffer, which prompted creation of a new ring buffer
> > +implementation.
> > +  - more efficient memory utilization by sharing ring buffer across CPUs;
>
> I have a use case with traceloop
> (https://github.com/kinvolk/traceloop) where I use one
> BPF_MAP_TYPE_PERF_EVENT_ARRAY per container, so when the number of
> containers times the number of CPU is high, it can use a lot of
> memory.
>
> > +  - preserving ordering of events that happen sequentially in time, even
> > +  across multiple CPUs (e.g., fork/exec/exit events for a task).
>
> I had the problem to keep track of TCP connections and when
> tcp-connect and tcp-close events can be on different CPUs, it makes it
> difficult to get the correct order.

Yep, in one of BPF applications I've written, handling out-of-order
events was major complication to the design of data structures, as
well as user-space implementation logic.

>
> [...]
> > +There are a bunch of similarities between perf buffer
> > +(BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
> > +  - variable-length records;
> > +  - if there is no more space left in ring buffer, reservation fails, no
> > +    blocking;
> [...]
>
> BPF_MAP_TYPE_PERF_EVENT_ARRAY can be set as both 'overwriteable' and
> 'backward': if there is no more space left in ring buffer, it would
> then overwrite the old events. For that, the buffer needs to be
> prepared with mmap(...PROT_READ) instead of mmap(...PROT_READ |
> PROT_WRITE), and set the write_backward flag. See details in commit
> 9ecda41acb97 ("perf/core: Add ::write_backward attribute to perf
> event"):
>
> struct perf_event_attr attr = {0,};
> attr.write_backward = 1; /* backward */
> fd = perf_event_open_map(&attr, ...);
> base = mmap(fd, 0, size, PROT_READ /* overwriteable */, MAP_SHARED);
>
> I use overwriteable and backward ring buffers in traceloop: buffers
> are continuously overwritten and are usually not read, except when a
> user explicitly asks for it (e.g. to inspect the last few events of an
> application after a crash). If BPF_MAP_TYPE_RINGBUF implements the
> same features, then I would be able to switch and use less memory.
>
> Do you think it will be possible to implement that in BPF_MAP_TYPE_RINGBUF?
>

I think it could be implemented similarly. Consumer_pos would be
ignored, producer_pos would point to the beginning of record and
decremented on new reservation. All the implementation and semantics
would stay. Extending ringbuf itself to enable this is also trivial,
it could be just extra map_flag passed when map is created,
consumer_pos page would become mmap()'able as R/O, of course.

But I fail to see how consumer can be 100% certain it's not reading
garbage data, especially on 32-bit architectures, where wrapping over
32-bit producer position is actually quite easy. Just checking
producer position before/after read isn't completely correct. Ignoring
that problem, the only sane way (IMO) to do this would mean copying
each record into a "stable" memory, before actually doing anything
with it, which is a pretty bad performance hit as well.

So all in all, such mode could be added, but certainly in a separate
patch set and after some good discussion :).

> Cheers,
> Alban

      reply index

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-17 19:57 [PATCH v2 bpf-next 0/7] BPF ring buffer Andrii Nakryiko
2020-05-17 19:57 ` [PATCH v2 bpf-next 1/7] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
2020-05-19 12:57   ` kbuild test robot
2020-05-19 23:53   ` kbuild test robot
2020-05-22  0:25   ` Paul E. McKenney
2020-05-22 18:46     ` Andrii Nakryiko
2020-05-25 16:01       ` Paul E. McKenney
2020-05-25 18:45         ` Andrii Nakryiko
2020-05-22  1:07   ` Alexei Starovoitov
2020-05-22 18:48     ` Andrii Nakryiko
2020-05-25 20:34     ` Andrii Nakryiko
2020-05-17 19:57 ` [PATCH v2 bpf-next 2/7] tools/memory-model: add BPF ringbuf MPSC litmus tests Andrii Nakryiko
2020-05-22  0:34   ` Paul E. McKenney
2020-05-22 18:51     ` Andrii Nakryiko
2020-05-25 23:33       ` Andrii Nakryiko
2020-05-26  3:05         ` Paul E. McKenney
2020-05-17 19:57 ` [PATCH v2 bpf-next 3/7] bpf: track reference type in verifier Andrii Nakryiko
2020-05-22  1:13   ` Alexei Starovoitov
2020-05-22 18:53     ` Andrii Nakryiko
2020-05-17 19:57 ` [PATCH v2 bpf-next 4/7] libbpf: add BPF ring buffer support Andrii Nakryiko
2020-05-22  1:15   ` Alexei Starovoitov
2020-05-22 18:56     ` Andrii Nakryiko
2020-05-17 19:57 ` [PATCH v2 bpf-next 5/7] selftests/bpf: add BPF ringbuf selftests Andrii Nakryiko
2020-05-22  1:20   ` Alexei Starovoitov
2020-05-22 18:58     ` Andrii Nakryiko
2020-05-17 19:57 ` [PATCH v2 bpf-next 6/7] bpf: add BPF ringbuf and perf buffer benchmarks Andrii Nakryiko
2020-05-22  1:21   ` Alexei Starovoitov
2020-05-22 19:07     ` Andrii Nakryiko
2020-05-17 19:57 ` [PATCH v2 bpf-next 7/7] docs/bpf: add BPF ring buffer design notes Andrii Nakryiko
2020-05-22  1:23   ` Alexei Starovoitov
2020-05-22 19:08     ` Andrii Nakryiko
2020-05-25  9:59   ` Alban Crequy
2020-05-25 19:12     ` Andrii Nakryiko [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAEf4BzY3mt8puNgOwi5ZWnVbXksnsXK_beG+HhhZutyBG-BO7A@mail.gmail.com \
    --to=andrii.nakryiko@gmail.com \
    --cc=alban.crequy@gmail.com \
    --cc=alban@kinvolk.io \
    --cc=andriin@fb.com \
    --cc=ast@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=jonathan.lemon@gmail.com \
    --cc=kai@kinvolk.io \
    --cc=kernel-team@fb.com \
    --cc=mauricio@kinvolk.io \
    --cc=netdev@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=sdf@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git