* [PATCH bpf-next 0/6] BPF ring buffer
@ 2020-05-13 19:25 Andrii Nakryiko
From: Andrii Nakryiko @ 2020-05-13 19:25 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Paul E . McKenney,
	Jonathan Lemon

Implement a new BPF ring buffer, as presented at the BPF virtual conference
([0]). It is an alternative to perf buffer that follows its semantics closely,
but allows the same ring buffer instance to be shared efficiently across
multiple CPUs.

Most patches have extensive commentary explaining various aspects, so I'll
keep the cover letter short. Overall structure of the patch set:
- patch #1 adds the BPF ring buffer implementation to the kernel, along with
  the necessary verifier support;
- patch #2 adds litmus tests validating that the memory ordering and locking
  are correct;
- patch #3 is an optional patch that generalizes the verifier's reference
  tracking machinery to capture the type of reference;
- patch #4 adds a libbpf consumer implementation for BPF ringbuf;
- patch #5 adds selftests, both for the single BPF ring buffer use case and
  for using it with array/hash of maps;
- patch #6 adds extensive benchmarks and provides some analysis in the commit
  message; it builds upon selftests/bpf's bench runner.

  [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w

Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>

Andrii Nakryiko (6):
  bpf: implement BPF ring buffer and verifier support for it
  tools/memory-model: add BPF ringbuf MPSC litmus tests
  bpf: track reference type in verifier
  libbpf: add BPF ring buffer support
  selftests/bpf: add BPF ringbuf selftests
  bpf: add BPF ringbuf and perf buffer benchmarks

 include/linux/bpf.h                           |  12 +
 include/linux/bpf_types.h                     |   1 +
 include/linux/bpf_verifier.h                  |  12 +
 include/uapi/linux/bpf.h                      |  33 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/helpers.c                          |   8 +
 kernel/bpf/ringbuf.c                          | 409 ++++++++++++
 kernel/bpf/syscall.c                          |  12 +
 kernel/bpf/verifier.c                         | 252 ++++++--
 kernel/trace/bpf_trace.c                      |   8 +
 tools/include/uapi/linux/bpf.h                |  33 +-
 tools/lib/bpf/Build                           |   2 +-
 tools/lib/bpf/libbpf.h                        |  20 +
 tools/lib/bpf/libbpf.map                      |   4 +
 tools/lib/bpf/libbpf_probes.c                 |   5 +
 tools/lib/bpf/ringbuf.c                       | 264 ++++++++
 .../litmus-tests/mpsc-rb+1p1c+bounded.litmus  |  92 +++
 .../litmus-tests/mpsc-rb+1p1c+unbound.litmus  |  83 +++
 .../litmus-tests/mpsc-rb+2p1c+bounded.litmus  | 152 +++++
 .../litmus-tests/mpsc-rb+2p1c+unbound.litmus  | 137 ++++
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |  18 +
 .../selftests/bpf/benchs/bench_ringbufs.c     | 593 ++++++++++++++++++
 .../bpf/benchs/run_bench_ringbufs.sh          |  61 ++
 .../selftests/bpf/prog_tests/ringbuf.c        | 101 +++
 .../selftests/bpf/prog_tests/ringbuf_multi.c  | 102 +++
 .../selftests/bpf/progs/perfbuf_bench.c       |  33 +
 .../selftests/bpf/progs/ringbuf_bench.c       |  45 ++
 .../selftests/bpf/progs/test_ringbuf.c        |  63 ++
 .../selftests/bpf/progs/test_ringbuf_multi.c  |  77 +++
 30 files changed, 2584 insertions(+), 55 deletions(-)
 create mode 100644 kernel/bpf/ringbuf.c
 create mode 100644 tools/lib/bpf/ringbuf.c
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+1p1c+bounded.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+1p1c+unbound.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+2p1c+bounded.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+2p1c+unbound.litmus
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_ringbufs.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c
 create mode 100644 tools/testing/selftests/bpf/progs/perfbuf_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/ringbuf_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_multi.c

-- 
2.24.1



* [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
@ 2020-05-13 19:25 ` Andrii Nakryiko
From: Andrii Nakryiko @ 2020-05-13 19:25 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Paul E . McKenney,
	Jonathan Lemon

This commit adds a new MPSC ring buffer implementation to the BPF ecosystem,
which allows multiple CPUs to submit data to a single shared ring buffer. On
the consumption side, only a single consumer is assumed.

Motivation
----------
There are two distinct motivators for this work, neither of which is satisfied
by the existing perf buffer, which prompted the creation of a new ring buffer
implementation:
  - more efficient memory utilization by sharing a ring buffer across CPUs;
  - preserving the ordering of events that happen sequentially in time, even
    across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer solves neither. Both
result from the choice to have a per-CPU perf ring buffer, and both can be
solved by an MPSC implementation of the ring buffer. The ordering problem
could technically be solved for perf buffer with some in-kernel counting, but
given that the first problem requires an MPSC buffer anyway, the same solution
solves the second problem automatically.

Semantics and APIs
------------------
A single ring buffer is presented to BPF programs as an instance of a BPF map
of type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but
ultimately rejected.

One way would be to make BPF_MAP_TYPE_RINGBUF represent an array of ring
buffers, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, but without enforcing the
"same CPU only" rule. This would be a more familiar interface, compatible with
existing perf buffer use in BPF, but it would fail if an application needed
more advanced logic to look up a ring buffer by an arbitrary key. HASH_OF_MAPS
addresses this with the current approach. Additionally, given the performance
of BPF ringbuf, many use cases would just opt into a simple single ring buffer
shared among all CPUs, for which an array of ring buffers would be overkill.

Another approach could introduce a new concept, alongside BPF map, to
represent a generic "container" object, which doesn't necessarily have
a key/value interface with lookup/update/delete operations. This approach
would add a lot of extra infrastructure that has to be built for observability
and verifier support. It would also add another concept that BPF developers
would have to familiarize themselves with, new syntax in libbpf, etc., yet it
would provide no additional benefits over the approach of using a map.
BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but
neither do a few other map types (e.g., queue and stack; array doesn't support
delete, etc).

The approach chosen has the advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc), being
a familiar concept (no need to teach users a new type of object in a BPF
program), and utilizing existing tooling (bpftool). For the common scenario of
using a single ring buffer for all CPUs, it's as simple and straightforward as
it would be with a dedicated "container" object. On the other hand, by being
a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
implement a wide variety of topologies, from one ring buffer for each CPU
(e.g., as a replacement for perf buffer use cases), to complicated
application-specific hashing/sharding of ring buffers (e.g., having a small
pool of ring buffers with the hashed task's tgid being the lookup key to
preserve order, but reduce contention).

Key and value sizes are enforced to be zero. max_entries is used to specify
the size of ring buffer and has to be a power of 2 value.
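
As a rough illustration of the size constraint above, the check can be
sketched in plain user-space C as follows. ringbuf_size_ok() and its
page_size parameter are hypothetical names used only for this sketch; the
function combines the power-of-2 rule stated here with the page alignment
that the data area needs, and it is not the kernel's actual validation code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of the patch): returns true if max_entries
 * would be an acceptable ring buffer data size for the given page size.
 */
static bool ringbuf_size_ok(uint32_t max_entries, uint32_t page_size)
{
	/* an empty ring buffer is never valid */
	if (max_entries == 0)
		return false;
	/* a power of 2 has exactly one bit set */
	if (max_entries & (max_entries - 1))
		return false;
	/* the data area is allocated and mapped in whole pages */
	return (max_entries % page_size) == 0;
}
```

With 4096-byte pages, 4096 and 1 << 20 pass this check, while 0, 2048 (smaller
than a page) and 12288 (page-aligned, but not a power of 2) do not.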

There are a bunch of similarities between perf buffer
(BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
  - variable-length records;
  - if there is no more space left in ring buffer, reservation fails, no
    blocking;
  - memory-mappable data area for user-space applications for ease of
    consumption and high performance;
  - epoll notifications for new incoming data;
  - but still the ability to do busy polling for new data to achieve the
    lowest latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:
  - bpf_ringbuf_output() allows *copying* data from one place into a ring
    buffer, similarly to bpf_perf_event_output();
  - bpf_ringbuf_reserve()/bpf_ringbuf_submit()/bpf_ringbuf_discard() APIs
    split the whole process into two steps. First, a fixed amount of space is
    reserved. If successful, a pointer to the data inside the ring buffer data
    area is returned, which BPF programs can use similarly to data inside
    array/hash maps. Once ready, this piece of memory is either committed
    (submitted) or discarded. Discard is similar to commit, but makes the
    consumer ignore the record.

bpf_ringbuf_output() has the disadvantage of incurring an extra memory copy,
because the record has to be prepared in some other place first. But it allows
submitting records of a length that's not known to the verifier beforehand. It
also closely matches bpf_perf_event_output(), so it will simplify migration
significantly.

bpf_ringbuf_reserve() avoids the extra memory copy by providing a pointer
directly into ring buffer memory. In a lot of cases records are larger than
BPF stack space allows, so many programs have to use an extra per-CPU array as
a temporary heap for preparing a sample. bpf_ringbuf_reserve() avoids this
need completely. But in exchange, it only allows a known constant size of
memory to be reserved, so that the verifier can check that the BPF program
can't access memory outside its reserved record space. bpf_ringbuf_output(),
while slightly slower due to the extra memory copy, covers some use cases that
are not suitable for bpf_ringbuf_reserve().

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary malloc()/free()
within single BPF program invocation.

Each reserved record is tracked by verifier through existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record, but forget to submit (or discard) it.

Design and implementation
-------------------------
This reserve/commit scheme gives multiple producers, whether on different
CPUs or even on the same CPU/in the same BPF program, a natural way to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that, because a spinlock is used during
reservation, bpf_ringbuf_reserve() in NMI context might fail to get the lock,
in which case reservation will fail even if the ring buffer is not full.

The ring buffer itself internally is implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures, that's not a problem):
  - consumer counter shows up to which logical position consumer consumed the
    data;
  - producer counter denotes amount of data reserved by all producers.
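
To make the counter arithmetic above more concrete, here is a minimal
single-threaded sketch in user-space C. struct rb_pos, rb_record_len() and
rb_try_reserve() are hypothetical names for this illustration; locking and
memory barriers are deliberately left out, and the counters are fixed at 64
bits for simplicity. Because both counters only ever increase and the data
size is a power of 2, unsigned subtraction gives the amount of outstanding
data even after a counter wraps around.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two logical positions (illustration only). */
struct rb_pos {
	uint64_t mask;     /* data size - 1; data size is a power of 2 */
	uint64_t cons_pos; /* how far the consumer has consumed */
	uint64_t prod_pos; /* how much all producers have reserved */
};

/* Payload plus 8-byte header, rounded up to a multiple of 8 bytes. */
static uint64_t rb_record_len(uint64_t payload)
{
	return (payload + 8 + 7) & ~7ULL;
}

/* Try to reserve a record. On success, store the offset of its header
 * within the data area in *off and advance the producer position.
 */
static bool rb_try_reserve(struct rb_pos *rb, uint64_t payload, uint64_t *off)
{
	uint64_t new_prod = rb->prod_pos + rb_record_len(payload);

	/* unsigned subtraction stays correct across counter wrap-around */
	if (new_prod - rb->cons_pos > rb->mask)
		return false;

	*off = rb->prod_pos & rb->mask;
	rb->prod_pos = new_prod;
	return true;
}
```

In this sketch a failed reservation leaves both positions untouched, so the
producer can simply retry once the consumer has advanced.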

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point, data is not yet
ready to be consumed, though. Each record has an 8-byte header, which contains
the length of the reserved record, as well as two extra bits: a busy bit to
denote that the record is still being worked on, and a discard bit, which
might be set at commit time if the record is discarded. In the latter case,
the consumer is supposed to skip the record and move on to the next one. The
record header also encodes the record's relative offset from the beginning of
the ring buffer data area (in pages). This allows
bpf_ringbuf_submit()/bpf_ringbuf_discard() to accept only the pointer to the
record itself, without also requiring the pointer to the ring buffer. The ring
buffer memory location is restored from the record metadata header. This
significantly simplifies the verifier and improves API usability.
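
The header layout described above can be sketched in user-space C roughly as
follows. The RB_* constants, struct rb_hdr and the rb_hdr_*() helpers are
hypothetical names for this illustration, 4 KiB pages are assumed, and the
ring buffer base address is assumed to be page-aligned; this models the idea
and is not the kernel code itself.

```c
#include <stdint.h>

#define RB_BUSY_BIT    (1u << 31)
#define RB_DISCARD_BIT (1u << 30)
#define RB_PAGE_SHIFT  12 /* assumption: 4 KiB pages */

/* Hypothetical view of the 8-byte record header. */
struct rb_hdr {
	uint32_t len;    /* payload length plus busy/discard bits */
	uint32_t pg_off; /* distance back to the ring buffer, in pages */
};

/* Header contents at reservation time: busy, with the page offset stored. */
static struct rb_hdr rb_hdr_reserve(uint32_t size, uintptr_t rb_base,
				    uintptr_t hdr_addr)
{
	struct rb_hdr h = {
		RB_BUSY_BIT | size,
		(uint32_t)((hdr_addr - rb_base) >> RB_PAGE_SHIFT),
	};
	return h;
}

/* Length field at commit time: busy cleared, discard optionally set. */
static uint32_t rb_hdr_commit(uint32_t len_field, int discard)
{
	uint32_t v = len_field & ~RB_BUSY_BIT;

	return discard ? (v | RB_DISCARD_BIT) : v;
}

/* Recover the page-aligned ring buffer base from a header address. */
static uintptr_t rb_base_from_hdr(uintptr_t hdr_addr, uint32_t pg_off)
{
	uintptr_t page_mask = ~(((uintptr_t)1 << RB_PAGE_SHIFT) - 1);

	return (hdr_addr & page_mask) - ((uintptr_t)pg_off << RB_PAGE_SHIFT);
}
```

Storing the offset in pages rather than bytes is what lets it fit in 32 bits
regardless of where inside a page the header happens to sit.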

Producer counter increments are serialized under a spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records
were already committed. It is thus possible for slow producers to temporarily
hold off submitted records that were reserved later.
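
The in-order visibility rule above can be illustrated with a small,
single-threaded consumer loop in user-space C. rb_put_len() and rb_consume()
are hypothetical helpers for this sketch; a real consumer would also need
acquire/release ordering when reading the header and the positions, which is
omitted here.

```c
#include <stdint.h>
#include <string.h>

#define RB_BUSY_BIT    (1u << 31)
#define RB_DISCARD_BIT (1u << 30)
#define RB_HDR_SZ      8u

/* Store a record's 32-bit length field at logical position pos. */
static void rb_put_len(uint8_t *data, uint64_t mask, uint64_t pos,
		       uint32_t len_field)
{
	memcpy(data + (pos & mask), &len_field, sizeof(len_field));
}

/* Consume every ready record in [*cons_pos, prod_pos). Stops at the first
 * record that is still busy, even if later records are already committed.
 * Returns the number of non-discarded records delivered.
 */
static int rb_consume(const uint8_t *data, uint64_t mask,
		      uint64_t *cons_pos, uint64_t prod_pos)
{
	int delivered = 0;

	while (*cons_pos != prod_pos) {
		uint32_t len_field, payload;

		memcpy(&len_field, data + (*cons_pos & mask), sizeof(len_field));
		if (len_field & RB_BUSY_BIT)
			break; /* earlier record not committed yet */
		if (!(len_field & RB_DISCARD_BIT))
			delivered++;
		payload = len_field & ~(RB_BUSY_BIT | RB_DISCARD_BIT);
		*cons_pos += (payload + RB_HDR_SZ + 7u) & ~7u;
	}
	return delivered;
}
```

With three 8-byte records where only the middle one is still busy, this loop
delivers the first record and then stops; the third record stays invisible
until the middle one is committed or discarded.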

The reservation/commit/consumer protocol is verified by litmus tests in
a later patch in this series.

One interesting implementation detail, which significantly simplifies (and
thus also speeds up) the implementation of both producers and consumers, is
that the data area is mapped twice contiguously, back-to-back, in virtual
memory. This makes it unnecessary to take any special measures for samples
that wrap around at the end of the circular buffer data area, because the next
page after the last data page is the first data page again, and thus the
sample still appears completely contiguous in virtual memory. See the comment
and a simple ASCII diagram showing this visually in bpf_ringbuf_area_alloc().

Another feature that distinguishes BPF ringbuf from perf ring buffer is
self-pacing notification of new data availability. The bpf_ringbuf_commit()
implementation sends a notification of a new record being available after
commit only if the consumer has already caught up right up to the record being
committed. If not, the consumer still has to catch up and thus will see the
new data anyway without needing an extra poll notification. As will be shown
in the benchmarks in a later patch in the series, this achieves very high
throughput without having to resort to tricks like "notify only every Nth
sample", as with perf buffer.
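
The self-pacing rule above boils down to a single comparison at commit time.
A minimal sketch, assuming positions are compared as offsets within the data
area, could look like this; rb_should_notify() is a hypothetical name and not
a function from this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decision helper: wake the consumer only if it has already
 * caught up to the record that is being committed.
 */
static bool rb_should_notify(uint64_t cons_pos, uint64_t rec_pos,
			     uint64_t mask)
{
	return (cons_pos & mask) == (rec_pos & mask);
}
```

If the consumer is still behind, it will reach the newly committed record on
its own while draining the buffer, so no wakeup is needed in that case.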

For performance evaluation against perf buffer and scalability limits, see the
patch later in the series that adds the ring buffer benchmarks.

Comparison to alternatives
--------------------------
Before considering implementing BPF ring buffer from scratch, existing
alternatives in the kernel were evaluated, but didn't seem to meet the needs.
They largely fell into a few categories:
  - per-CPU buffers (perf, ftrace, etc), which don't satisfy the two
    motivations outlined above (ordering and memory consumption);
  - linked list-based implementations; while some were multi-producer designs,
    consuming these from user-space would be very complicated and most
    probably not performant; memory-mapping a contiguous piece of memory is
    simpler and more performant for user-space consumers;
  - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
    an SPSC queue into MPSC with a lock would have subpar performance compared
    to locked reserve + lockless commit, as with BPF ring buffer. Fixed-sized
    elements would be too limiting for BPF programs, given that existing BPF
    programs already rely heavily on variable-sized perf buffer records;
  - specialized implementations (like a new printk ring buffer, [0]) with lots
    of printk-specific limitations and implications that didn't seem to fit
    well with the intended use in BPF programs.

  [0] https://lwn.net/Articles/779550/

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf.h            |  12 +
 include/linux/bpf_types.h      |   1 +
 include/linux/bpf_verifier.h   |   4 +
 include/uapi/linux/bpf.h       |  33 ++-
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/helpers.c           |   8 +
 kernel/bpf/ringbuf.c           | 409 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  12 +
 kernel/bpf/verifier.c          | 156 ++++++++++---
 kernel/trace/bpf_trace.c       |   8 +
 tools/include/uapi/linux/bpf.h |  33 ++-
 11 files changed, 643 insertions(+), 35 deletions(-)
 create mode 100644 kernel/bpf/ringbuf.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cf4b6e44f2bc..9e3da01f3e9b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -89,6 +89,8 @@ struct bpf_map_ops {
 	int (*map_direct_value_meta)(const struct bpf_map *map,
 				     u64 imm, u32 *off);
 	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
+	__poll_t (*map_poll)(struct bpf_map *map, struct file *filp,
+			     struct poll_table_struct *pts);
 };
 
 struct bpf_map_memory {
@@ -243,6 +245,9 @@ enum bpf_arg_type {
 	ARG_PTR_TO_LONG,	/* pointer to long */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
 	ARG_PTR_TO_BTF_ID,	/* pointer to in-kernel struct */
+	ARG_PTR_TO_ALLOC_MEM,	/* pointer to dynamically allocated memory */
+	ARG_PTR_TO_ALLOC_MEM_OR_NULL,	/* pointer to dynamically allocated memory or NULL */
+	ARG_CONST_ALLOC_SIZE_OR_ZERO,	/* number of allocated bytes requested */
 };
 
 /* type of values returned from helper functions */
@@ -254,6 +259,7 @@ enum bpf_return_type {
 	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
 	RET_PTR_TO_TCP_SOCK_OR_NULL,	/* returns a pointer to a tcp_sock or NULL */
 	RET_PTR_TO_SOCK_COMMON_OR_NULL,	/* returns a pointer to a sock_common or NULL */
+	RET_PTR_TO_ALLOC_MEM_OR_NULL,	/* returns a pointer to dynamically allocated memory or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -321,6 +327,8 @@ enum bpf_reg_type {
 	PTR_TO_XDP_SOCK,	 /* reg points to struct xdp_sock */
 	PTR_TO_BTF_ID,		 /* reg points to kernel struct */
 	PTR_TO_BTF_ID_OR_NULL,	 /* reg points to kernel struct or NULL */
+	PTR_TO_MEM,		 /* reg points to valid memory region */
+	PTR_TO_MEM_OR_NULL,	 /* reg points to valid memory region or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -1585,6 +1593,10 @@ extern const struct bpf_func_proto bpf_tcp_sock_proto;
 extern const struct bpf_func_proto bpf_jiffies64_proto;
 extern const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto;
 extern const struct bpf_func_proto bpf_event_output_data_proto;
+extern const struct bpf_func_proto bpf_ringbuf_output_proto;
+extern const struct bpf_func_proto bpf_ringbuf_reserve_proto;
+extern const struct bpf_func_proto bpf_ringbuf_submit_proto;
+extern const struct bpf_func_proto bpf_ringbuf_discard_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 29d22752fc87..fa8e1b552acd 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -118,6 +118,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 #if defined(CONFIG_BPF_JIT)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 6abd5a778fcd..c94a736e53cd 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -54,6 +54,8 @@ struct bpf_reg_state {
 
 		u32 btf_id; /* for PTR_TO_BTF_ID */
 
+		u32 mem_size; /* for PTR_TO_MEM | PTR_TO_MEM_OR_NULL */
+
 		/* Max size from any of the above. */
 		unsigned long raw;
 	};
@@ -63,6 +65,8 @@ struct bpf_reg_state {
 	 * offset, so they can share range knowledge.
 	 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
 	 * came from, when one is tested for != NULL.
+	 * For PTR_TO_MEM_OR_NULL this is used to identify memory allocation
+	 * for the purpose of tracking that it's freed.
 	 * For PTR_TO_SOCKET this is used to share which pointers retain the
 	 * same reference to the socket, to determine proper reference freeing.
 	 */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bfb31c1be219..ae2deb6a8afc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -147,6 +147,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
 	BPF_MAP_TYPE_STRUCT_OPS,
+	BPF_MAP_TYPE_RINGBUF,
 };
 
 /* Note that tracing related programs such as
@@ -3121,6 +3122,32 @@ union bpf_attr {
  * 		0 on success, or a negative error in case of failure:
  *
  *		**-EOVERFLOW** if an overflow happened: The same object will be tried again.
+ *
+ * int bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
+ * 	Description
+ * 		Copy *size* bytes from *data* into a ring buffer *ringbuf*.
+ * 	Return
+ * 		0, on success;
+ * 		< 0, on error.
+ *
+ * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
+ * 	Description
+ * 		Reserve *size* bytes of payload in a ring buffer *ringbuf*.
+ * 	Return
+ * 		Valid pointer with *size* bytes of memory available; NULL,
+ * 		otherwise.
+ *
+ * void bpf_ringbuf_submit(void *data)
+ * 	Description
+ * 		Submit reserved ring buffer sample, pointed to by *data*.
+ * 	Return
+ * 		Nothing.
+ *
+ * void bpf_ringbuf_discard(void *data)
+ * 	Description
+ * 		Discard reserved ring buffer sample, pointed to by *data*.
+ * 	Return
+ * 		Nothing.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3250,7 +3277,11 @@ union bpf_attr {
 	FN(sk_assign),			\
 	FN(ktime_get_boot_ns),		\
 	FN(seq_printf),			\
-	FN(seq_write),
+	FN(seq_write),			\
+	FN(ringbuf_output),		\
+	FN(ringbuf_reserve),		\
+	FN(ringbuf_submit),		\
+	FN(ringbuf_discard),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 37b2d8620153..c9aada6c1806 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -4,7 +4,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5c0290e0696e..27321ca8803f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -629,6 +629,14 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_ktime_get_ns_proto;
 	case BPF_FUNC_ktime_get_boot_ns:
 		return &bpf_ktime_get_boot_ns_proto;
+	case BPF_FUNC_ringbuf_output:
+		return &bpf_ringbuf_output_proto;
+	case BPF_FUNC_ringbuf_reserve:
+		return &bpf_ringbuf_reserve_proto;
+	case BPF_FUNC_ringbuf_submit:
+		return &bpf_ringbuf_submit_proto;
+	case BPF_FUNC_ringbuf_discard:
+		return &bpf_ringbuf_discard_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
new file mode 100644
index 000000000000..f2ae441a1695
--- /dev/null
+++ b/kernel/bpf/ringbuf.c
@@ -0,0 +1,409 @@
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/filter.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <uapi/linux/btf.h>
+
+#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
+
+#define RINGBUF_BUSY_BIT BIT(31)
+#define RINGBUF_DISCARD_BIT BIT(30)
+#define RINGBUF_META_SZ 8
+
+/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */
+#define BPF_RINGBUF_PGOFF \
+	(offsetof(struct bpf_ringbuf, consumer_pos) >> PAGE_SHIFT)
+
+struct bpf_ringbuf {
+	wait_queue_head_t waitq;
+	u64 mask;
+	spinlock_t spinlock ____cacheline_aligned_in_smp;
+	u64 consumer_pos __aligned(PAGE_SIZE);
+	u64 producer_pos __aligned(PAGE_SIZE);
+	char data[] __aligned(PAGE_SIZE);
+};
+
+struct bpf_ringbuf_map {
+	struct bpf_map map;
+	struct bpf_map_memory memory;
+	struct bpf_ringbuf *rb;
+};
+
+static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
+{
+	const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN |
+			    __GFP_ZERO;
+	int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF;
+	int nr_data_pages = data_sz >> PAGE_SHIFT;
+	int nr_pages = nr_meta_pages + nr_data_pages;
+	struct page **pages, *page;
+	size_t array_size;
+	void *addr;
+	int i;
+
+	/* Each data page is mapped twice to allow "virtual"
+	 * continuous read of samples wrapping around the end of ring
+	 * buffer area:
+	 * ------------------------------------------------------
+	 * | meta pages |  real data pages  |  same data pages  |
+	 * ------------------------------------------------------
+	 * |            | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 |
+	 * ------------------------------------------------------
+	 * |            | TA             DA | TA             DA |
+	 * ------------------------------------------------------
+	 *                               ^^^^^^^
+	 *                                  |
+	 * Here, no need to worry about special handling of wrapped-around
+	 * data due to double-mapped data pages. This works both in kernel and
+	 * when mmap()'ed in user-space, simplifying both kernel and
+	 * user-space implementations significantly.
+	 */
+	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
+	if (array_size > PAGE_SIZE)
+		pages = vmalloc_node(array_size, numa_node);
+	else
+		pages = kmalloc_node(array_size, flags, numa_node);
+	if (!pages)
+		return NULL;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_pages_node(numa_node, flags, 0);
+		if (!page) {
+			nr_pages = i;
+			goto err_free_pages;
+		}
+		pages[i] = page;
+		if (i >= nr_meta_pages)
+			pages[nr_data_pages + i] = page;
+	}
+
+	addr = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
+		    VM_ALLOC | VM_USERMAP, PAGE_KERNEL);
+	if (addr)
+		return addr;
+
+err_free_pages:
+	for (i = 0; i < nr_pages; i++)
+		free_page((unsigned long)pages[i]);
+	kvfree(pages);
+	return NULL;
+}
+
+static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
+{
+	struct bpf_ringbuf *rb;
+
+	if (!data_sz || !PAGE_ALIGNED(data_sz))
+		return ERR_PTR(-EINVAL);
+
+	rb = bpf_ringbuf_area_alloc(data_sz, numa_node);
+	if (!rb)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&rb->spinlock);
+	init_waitqueue_head(&rb->waitq);
+
+	rb->mask = data_sz - 1;
+	rb->consumer_pos = 0;
+	rb->producer_pos = 0;
+
+	return rb;
+}
+
+static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_ringbuf_map *rb_map;
+	u64 cost;
+	int err;
+
+	if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
+		return ERR_PTR(-EINVAL);
+
+	if (attr->key_size || attr->value_size ||
+	    attr->max_entries == 0 || !PAGE_ALIGNED(attr->max_entries))
+		return ERR_PTR(-EINVAL);
+
+	rb_map = kzalloc(sizeof(*rb_map), GFP_USER);
+	if (!rb_map)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&rb_map->map, attr);
+
+	cost = sizeof(struct bpf_ringbuf_map) +
+	       sizeof(struct bpf_ringbuf) +
+	       attr->max_entries;
+	err = bpf_map_charge_init(&rb_map->map.memory, cost);
+	if (err)
+		goto err_free_map;
+
+	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
+	if (IS_ERR(rb_map->rb)) {
+		err = PTR_ERR(rb_map->rb);
+		goto err_uncharge;
+	}
+
+	return &rb_map->map;
+
+err_uncharge:
+	bpf_map_charge_finish(&rb_map->map.memory);
+err_free_map:
+	kfree(rb_map);
+	return ERR_PTR(err);
+}
+
+static void bpf_ringbuf_free(struct bpf_ringbuf *ringbuf)
+{
+	kvfree(ringbuf);
+}
+
+static void ringbuf_map_free(struct bpf_map *map)
+{
+	struct bpf_ringbuf_map *rb_map;
+
+	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. Wait for outstanding critical sections in
+	 * these programs to complete
+	 */
+	synchronize_rcu();
+
+	rb_map = container_of(map, struct bpf_ringbuf_map, map);
+	bpf_ringbuf_free(rb_map->rb);
+	kfree(rb_map);
+}
+
+static void *ringbuf_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+
+static int ringbuf_map_update_elem(struct bpf_map *map, void *key, void *value,
+				   u64 flags)
+{
+	return -ENOTSUPP;
+}
+
+static int ringbuf_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -ENOTSUPP;
+}
+
+static int ringbuf_map_get_next_key(struct bpf_map *map, void *key,
+				    void *next_key)
+{
+	return -ENOTSUPP;
+}
+
+static size_t bpf_ringbuf_mmap_page_cnt(const struct bpf_ringbuf *rb)
+{
+	size_t data_pages = (rb->mask + 1) >> PAGE_SHIFT;
+
+	/* consumer page + producer page + 2 x data pages */
+	return 2 + 2 * data_pages;
+}
+
+static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
+{
+	struct bpf_ringbuf_map *rb_map;
+	size_t mmap_sz;
+
+	rb_map = container_of(map, struct bpf_ringbuf_map, map);
+	mmap_sz = bpf_ringbuf_mmap_page_cnt(rb_map->rb) << PAGE_SHIFT;
+
+	if (vma->vm_pgoff * PAGE_SIZE + (vma->vm_end - vma->vm_start) > mmap_sz)
+		return -EINVAL;
+
+	return remap_vmalloc_range(vma, rb_map->rb,
+				   vma->vm_pgoff + BPF_RINGBUF_PGOFF);
+}
+
+static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp,
+				  struct poll_table_struct *pts)
+{
+	struct bpf_ringbuf_map *rb_map;
+
+	rb_map = container_of(map, struct bpf_ringbuf_map, map);
+	poll_wait(filp, &rb_map->rb->waitq, pts);
+
+	return EPOLLIN | EPOLLRDNORM;
+}
+
+const struct bpf_map_ops ringbuf_map_ops = {
+	.map_alloc = ringbuf_map_alloc,
+	.map_free = ringbuf_map_free,
+	.map_mmap = ringbuf_map_mmap,
+	.map_poll = ringbuf_map_poll,
+	.map_lookup_elem = ringbuf_map_lookup_elem,
+	.map_update_elem = ringbuf_map_update_elem,
+	.map_delete_elem = ringbuf_map_delete_elem,
+	.map_get_next_key = ringbuf_map_get_next_key,
+};
+
+/* Given pointer to ring buffer record metadata and struct bpf_ringbuf itself,
+ * calculate offset from record metadata to ring buffer in pages, rounded
+ * down. This page offset is stored as part of record metadata and allows to
+ * restore struct bpf_ringbuf * from record pointer. This page offset is
+ * stored at offset 4 of record metadata header.
+ */
+static size_t bpf_ringbuf_rec_pg_off(struct bpf_ringbuf *rb, void *meta_ptr)
+{
+	return (meta_ptr - (void *)rb) >> PAGE_SHIFT;
+}
+
+/* Given pointer to ring buffer record metadata, restore pointer to struct
+ * bpf_ringbuf itself by using page offset stored at offset 4
+ */
+static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
+{
+	unsigned long addr = (unsigned long)meta_ptr;
+	unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;
+
+	return (void *)((addr & PAGE_MASK) - off);
+}
+
+static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
+{
+	unsigned long cons_pos, prod_pos, new_prod_pos, flags;
+	u32 len, pg_off;
+	void *meta_ptr;
+
+	if (unlikely(size > UINT_MAX))
+		return NULL;
+
+	len = round_up(size + RINGBUF_META_SZ, 8);
+	cons_pos = READ_ONCE(rb->consumer_pos);
+
+	if (in_nmi()) {
+		if (!spin_trylock_irqsave(&rb->spinlock, flags))
+			return NULL;
+	} else {
+		spin_lock_irqsave(&rb->spinlock, flags);
+	}
+
+	prod_pos = rb->producer_pos;
+	new_prod_pos = prod_pos + len;
+
+	/* check for out of ringbuf space by ensuring producer position
+	 * doesn't advance more than (ringbuf_size - 1) ahead
+	 */
+	if (new_prod_pos - cons_pos > rb->mask) {
+		spin_unlock_irqrestore(&rb->spinlock, flags);
+		return NULL;
+	}
+
+	meta_ptr = rb->data + (prod_pos & rb->mask);
+	pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
+
+	WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
+	WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);
+
+	/* ensure length prefix is written before updating producer positions */
+	smp_wmb();
+	WRITE_ONCE(rb->producer_pos, new_prod_pos);
+
+	spin_unlock_irqrestore(&rb->spinlock, flags);
+
+	return meta_ptr + RINGBUF_META_SZ;
+}
+
+BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags)
+{
+	struct bpf_ringbuf_map *rb_map;
+
+	if (unlikely(flags))
+		return -EINVAL;
+
+	rb_map = container_of(map, struct bpf_ringbuf_map, map);
+	return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
+}
+
+const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
+	.func		= bpf_ringbuf_reserve,
+	.ret_type	= RET_PTR_TO_ALLOC_MEM_OR_NULL,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_CONST_ALLOC_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static void bpf_ringbuf_commit(void *sample, bool discard)
+{
+	unsigned long rec_pos, cons_pos;
+	u32 new_meta, old_meta;
+	void *meta_ptr;
+	struct bpf_ringbuf *rb;
+
+	meta_ptr = sample - RINGBUF_META_SZ;
+	rb = bpf_ringbuf_restore_from_rec(meta_ptr);
+	old_meta = *(u32 *)meta_ptr;
+	new_meta = old_meta ^ RINGBUF_BUSY_BIT;
+	if (discard)
+		new_meta |= RINGBUF_DISCARD_BIT;
+
+	/* update metadata header with correct final size prefix */
+	xchg((u32 *)meta_ptr, new_meta);
+
+	/* if consumer caught up and is waiting for our record, notify about
+	 * new data availability
+	 */
+	rec_pos = (void *)meta_ptr - (void *)rb->data;
+	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
+	if (cons_pos == rec_pos)
+		wake_up_all(&rb->waitq);
+}
+
+BPF_CALL_1(bpf_ringbuf_submit, void *, sample)
+{
+	bpf_ringbuf_commit(sample, false /* discard */);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_ringbuf_submit_proto = {
+	.func		= bpf_ringbuf_submit,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
+};
+
+BPF_CALL_1(bpf_ringbuf_discard, void *, sample)
+{
+	bpf_ringbuf_commit(sample, true /* discard */);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_ringbuf_discard_proto = {
+	.func		= bpf_ringbuf_discard,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
+};
+
+BPF_CALL_4(bpf_ringbuf_output, struct bpf_map *, map, void *, data, u64, size,
+	   u64, flags)
+{
+	struct bpf_ringbuf_map *rb_map;
+	void *rec;
+
+	if (unlikely(flags))
+		return -EINVAL;
+
+	rb_map = container_of(map, struct bpf_ringbuf_map, map);
+	rec = __bpf_ringbuf_reserve(rb_map->rb, size);
+	if (!rec)
+		return -EAGAIN;
+
+	memcpy(rec, data, size);
+	bpf_ringbuf_commit(rec, false /* discard */);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_ringbuf_output_proto = {
+	.func		= bpf_ringbuf_output,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index de2a75500233..462db8595e9f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -26,6 +26,7 @@
 #include <linux/audit.h>
 #include <uapi/linux/btf.h>
 #include <linux/bpf_lsm.h>
+#include <linux/poll.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -651,6 +652,16 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 	return err;
 }
 
+static __poll_t bpf_map_poll(struct file *filp, struct poll_table_struct *pts)
+{
+	struct bpf_map *map = filp->private_data;
+
+	if (map->ops->map_poll)
+		return map->ops->map_poll(map, filp, pts);
+
+	return EPOLLERR;
+}
+
 const struct file_operations bpf_map_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= bpf_map_show_fdinfo,
@@ -659,6 +670,7 @@ const struct file_operations bpf_map_fops = {
 	.read		= bpf_dummy_read,
 	.write		= bpf_dummy_write,
 	.mmap		= bpf_map_mmap,
+	.poll		= bpf_map_poll,
 };
 
 int bpf_map_new_fd(struct bpf_map *map, int flags)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2a1826c76bb6..b8f0158d2327 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -233,6 +233,7 @@ struct bpf_call_arg_meta {
 	bool pkt_access;
 	int regno;
 	int access_size;
+	int mem_size;
 	u64 msize_max_value;
 	int ref_obj_id;
 	int func_id;
@@ -399,7 +400,8 @@ static bool reg_type_may_be_null(enum bpf_reg_type type)
 	       type == PTR_TO_SOCKET_OR_NULL ||
 	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
 	       type == PTR_TO_TCP_SOCK_OR_NULL ||
-	       type == PTR_TO_BTF_ID_OR_NULL;
+	       type == PTR_TO_BTF_ID_OR_NULL ||
+	       type == PTR_TO_MEM_OR_NULL;
 }
 
 static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
@@ -413,7 +415,9 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
 	return type == PTR_TO_SOCKET ||
 		type == PTR_TO_SOCKET_OR_NULL ||
 		type == PTR_TO_TCP_SOCK ||
-		type == PTR_TO_TCP_SOCK_OR_NULL;
+		type == PTR_TO_TCP_SOCK_OR_NULL ||
+		type == PTR_TO_MEM ||
+		type == PTR_TO_MEM_OR_NULL;
 }
 
 static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
@@ -427,7 +431,9 @@ static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
  */
 static bool is_release_function(enum bpf_func_id func_id)
 {
-	return func_id == BPF_FUNC_sk_release;
+	return func_id == BPF_FUNC_sk_release ||
+	       func_id == BPF_FUNC_ringbuf_submit ||
+	       func_id == BPF_FUNC_ringbuf_discard;
 }
 
 static bool may_be_acquire_function(enum bpf_func_id func_id)
@@ -435,7 +441,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
 	return func_id == BPF_FUNC_sk_lookup_tcp ||
 		func_id == BPF_FUNC_sk_lookup_udp ||
 		func_id == BPF_FUNC_skc_lookup_tcp ||
-		func_id == BPF_FUNC_map_lookup_elem;
+		func_id == BPF_FUNC_map_lookup_elem ||
+		func_id == BPF_FUNC_ringbuf_reserve;
 }
 
 static bool is_acquire_function(enum bpf_func_id func_id,
@@ -445,7 +452,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
 
 	if (func_id == BPF_FUNC_sk_lookup_tcp ||
 	    func_id == BPF_FUNC_sk_lookup_udp ||
-	    func_id == BPF_FUNC_skc_lookup_tcp)
+	    func_id == BPF_FUNC_skc_lookup_tcp ||
+	    func_id == BPF_FUNC_ringbuf_reserve)
 		return true;
 
 	if (func_id == BPF_FUNC_map_lookup_elem &&
@@ -485,6 +493,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_XDP_SOCK]	= "xdp_sock",
 	[PTR_TO_BTF_ID]		= "ptr_",
 	[PTR_TO_BTF_ID_OR_NULL]	= "ptr_or_null_",
+	[PTR_TO_MEM]		= "mem",
+	[PTR_TO_MEM_OR_NULL]	= "mem_or_null",
 };
 
 static char slot_type_char[] = {
@@ -2459,32 +2469,31 @@ static int check_map_access_type(struct bpf_verifier_env *env, u32 regno,
 	return 0;
 }
 
-/* check read/write into map element returned by bpf_map_lookup_elem() */
-static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
-			      int size, bool zero_size_allowed)
+/* check read/write into memory region (e.g., map value, ringbuf sample, etc) */
+static int __check_mem_access(struct bpf_verifier_env *env, int off,
+			      int size, u32 mem_size, bool zero_size_allowed)
 {
-	struct bpf_reg_state *regs = cur_regs(env);
-	struct bpf_map *map = regs[regno].map_ptr;
+	bool size_ok = size > 0 || (size == 0 && zero_size_allowed);
 
-	if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
-	    off + size > map->value_size) {
-		verbose(env, "invalid access to map value, value_size=%d off=%d size=%d\n",
-			map->value_size, off, size);
-		return -EACCES;
-	}
-	return 0;
+	if (off >= 0 && size_ok && off + size <= mem_size)
+		return 0;
+
+	verbose(env, "invalid access to memory, mem_size=%u off=%d size=%d\n",
+		mem_size, off, size);
+	return -EACCES;
 }
 
-/* check read/write into a map element with possible variable offset */
-static int check_map_access(struct bpf_verifier_env *env, u32 regno,
-			    int off, int size, bool zero_size_allowed)
+/* check read/write into a memory region with possible variable offset */
+static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
+				   int off, int size, u32 mem_size,
+				   bool zero_size_allowed)
 {
 	struct bpf_verifier_state *vstate = env->cur_state;
 	struct bpf_func_state *state = vstate->frame[vstate->curframe];
 	struct bpf_reg_state *reg = &state->regs[regno];
 	int err;
 
-	/* We may have adjusted the register to this map value, so we
+	/* We may have adjusted the register pointing to memory region, so we
 	 * need to try adding each of min_value and max_value to off
 	 * to make sure our theoretical access will be safe.
 	 */
@@ -2501,14 +2510,14 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 	    (reg->smin_value == S64_MIN ||
 	     (off + reg->smin_value != (s64)(s32)(off + reg->smin_value)) ||
 	      reg->smin_value + off < 0)) {
-		verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
+		verbose(env, "R%d min value is negative, either use unsigned index or do an if (index >=0) check.\n",
 			regno);
 		return -EACCES;
 	}
-	err = __check_map_access(env, regno, reg->smin_value + off, size,
+	err = __check_mem_access(env, reg->smin_value + off, size, mem_size,
 				 zero_size_allowed);
 	if (err) {
-		verbose(env, "R%d min value is outside of the array range\n",
+		verbose(env, "R%d min value is outside of the memory region\n",
 			regno);
 		return err;
 	}
@@ -2518,18 +2527,38 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 	 * If reg->umax_value + off could overflow, treat that as unbounded too.
 	 */
 	if (reg->umax_value >= BPF_MAX_VAR_OFF) {
-		verbose(env, "R%d unbounded memory access, make sure to bounds check any array access into a map\n",
+		verbose(env, "R%d unbounded memory access, make sure to bounds check any memory region access\n",
 			regno);
 		return -EACCES;
 	}
-	err = __check_map_access(env, regno, reg->umax_value + off, size,
+	err = __check_mem_access(env, reg->umax_value + off, size, mem_size,
 				 zero_size_allowed);
-	if (err)
-		verbose(env, "R%d max value is outside of the array range\n",
+	if (err) {
+		verbose(env, "R%d max value is outside of the memory region\n",
 			regno);
+		return err;
+	}
+
+	return 0;
+}
+
+/* check read/write into a map element with possible variable offset */
+static int check_map_access(struct bpf_verifier_env *env, u32 regno,
+			    int off, int size, bool zero_size_allowed)
+{
+	struct bpf_verifier_state *vstate = env->cur_state;
+	struct bpf_func_state *state = vstate->frame[vstate->curframe];
+	struct bpf_reg_state *reg = &state->regs[regno];
+	struct bpf_map *map = reg->map_ptr;
+	int err;
+
+	err = check_mem_region_access(env, regno, off, size, map->value_size,
+				      zero_size_allowed);
+	if (err)
+		return err;
 
-	if (map_value_has_spin_lock(reg->map_ptr)) {
-		u32 lock = reg->map_ptr->spin_lock_off;
+	if (map_value_has_spin_lock(map)) {
+		u32 lock = map->spin_lock_off;
 
 		/* if any part of struct bpf_spin_lock can be touched by
 		 * load/store reject this program.
@@ -3211,6 +3240,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 				mark_reg_unknown(env, regs, value_regno);
 			}
 		}
+	} else if (reg->type == PTR_TO_MEM) {
+		if (t == BPF_WRITE && value_regno >= 0 &&
+		    is_pointer_value(env, value_regno)) {
+			verbose(env, "R%d leaks addr into mem\n", value_regno);
+			return -EACCES;
+		}
+		err = check_mem_region_access(env, regno, off, size,
+					      reg->mem_size, false);
+		if (!err && t == BPF_READ && value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
 	} else if (reg->type == PTR_TO_CTX) {
 		enum bpf_reg_type reg_type = SCALAR_VALUE;
 		u32 btf_id = 0;
@@ -3548,6 +3587,10 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 			return -EACCES;
 		return check_map_access(env, regno, reg->off, access_size,
 					zero_size_allowed);
+	case PTR_TO_MEM:
+		return check_mem_region_access(env, regno, reg->off,
+					       access_size, reg->mem_size,
+					       zero_size_allowed);
 	default: /* scalar_value|ptr_to_stack or invalid ptr */
 		return check_stack_boundary(env, regno, access_size,
 					    zero_size_allowed, meta);
@@ -3652,6 +3695,17 @@ static bool arg_type_is_mem_size(enum bpf_arg_type type)
 	       type == ARG_CONST_SIZE_OR_ZERO;
 }
 
+static bool arg_type_is_alloc_mem_ptr(enum bpf_arg_type type)
+{
+	return type == ARG_PTR_TO_ALLOC_MEM ||
+	       type == ARG_PTR_TO_ALLOC_MEM_OR_NULL;
+}
+
+static bool arg_type_is_alloc_size(enum bpf_arg_type type)
+{
+	return type == ARG_CONST_ALLOC_SIZE_OR_ZERO;
+}
+
 static bool arg_type_is_int_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_INT ||
@@ -3711,7 +3765,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			 type != expected_type)
 			goto err_type;
 	} else if (arg_type == ARG_CONST_SIZE ||
-		   arg_type == ARG_CONST_SIZE_OR_ZERO) {
+		   arg_type == ARG_CONST_SIZE_OR_ZERO ||
+		   arg_type == ARG_CONST_ALLOC_SIZE_OR_ZERO) {
 		expected_type = SCALAR_VALUE;
 		if (type != expected_type)
 			goto err_type;
@@ -3782,13 +3837,29 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		 * happens during stack boundary checking.
 		 */
 		if (register_is_null(reg) &&
-		    arg_type == ARG_PTR_TO_MEM_OR_NULL)
+		    (arg_type == ARG_PTR_TO_MEM_OR_NULL ||
+		     arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL))
 			/* final test in check_stack_boundary() */;
 		else if (!type_is_pkt_pointer(type) &&
 			 type != PTR_TO_MAP_VALUE &&
+			 type != PTR_TO_MEM &&
 			 type != expected_type)
 			goto err_type;
 		meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM;
+	} else if (arg_type_is_alloc_mem_ptr(arg_type)) {
+		expected_type = PTR_TO_MEM;
+		if (register_is_null(reg) &&
+		    arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL)
+			/* final test in check_stack_boundary() */;
+		else if (type != expected_type)
+			goto err_type;
+		if (meta->ref_obj_id) {
+			verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n",
+				regno, reg->ref_obj_id,
+				meta->ref_obj_id);
+			return -EFAULT;
+		}
+		meta->ref_obj_id = reg->ref_obj_id;
 	} else if (arg_type_is_int_ptr(arg_type)) {
 		expected_type = PTR_TO_STACK;
 		if (!type_is_pkt_pointer(type) &&
@@ -3884,6 +3955,13 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 					      zero_size_allowed, meta);
 		if (!err)
 			err = mark_chain_precision(env, regno);
+	} else if (arg_type_is_alloc_size(arg_type)) {
+		if (!tnum_is_const(reg->var_off)) {
+			verbose(env, "R%d unbounded size, use 'var &= const' or 'if (var < const)'\n",
+				regno);
+			return -EACCES;
+		}
+		meta->mem_size = reg->var_off.value;
 	} else if (arg_type_is_int_ptr(arg_type)) {
 		int size = int_ptr_type_to_size(arg_type);
 
@@ -3920,6 +3998,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_xdp_output)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_RINGBUF:
+		if (func_id != BPF_FUNC_ringbuf_output &&
+		    func_id != BPF_FUNC_ringbuf_reserve &&
+		    func_id != BPF_FUNC_ringbuf_submit &&
+		    func_id != BPF_FUNC_ringbuf_discard)
+			goto error;
+		break;
 	case BPF_MAP_TYPE_STACK_TRACE:
 		if (func_id != BPF_FUNC_get_stackid)
 			goto error;
@@ -4644,6 +4729,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
 		regs[BPF_REG_0].id = ++env->id_gen;
+	} else if (fn->ret_type == RET_PTR_TO_ALLOC_MEM_OR_NULL) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_MEM_OR_NULL;
+		regs[BPF_REG_0].id = ++env->id_gen;
+		regs[BPF_REG_0].mem_size = meta.mem_size;
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -6583,6 +6673,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			reg->type = PTR_TO_TCP_SOCK;
 		} else if (reg->type == PTR_TO_BTF_ID_OR_NULL) {
 			reg->type = PTR_TO_BTF_ID;
+		} else if (reg->type == PTR_TO_MEM_OR_NULL) {
+			reg->type = PTR_TO_MEM;
 		}
 		if (is_null) {
 			/* We don't need id and ref_obj_id from this point
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d961428fb5b6..6e6b3f8f77c1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1053,6 +1053,14 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_read_value_proto;
 	case BPF_FUNC_get_ns_current_pid_tgid:
 		return &bpf_get_ns_current_pid_tgid_proto;
+	case BPF_FUNC_ringbuf_output:
+		return &bpf_ringbuf_output_proto;
+	case BPF_FUNC_ringbuf_reserve:
+		return &bpf_ringbuf_reserve_proto;
+	case BPF_FUNC_ringbuf_submit:
+		return &bpf_ringbuf_submit_proto;
+	case BPF_FUNC_ringbuf_discard:
+		return &bpf_ringbuf_discard_proto;
 	default:
 		return NULL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bfb31c1be219..ae2deb6a8afc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -147,6 +147,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
 	BPF_MAP_TYPE_STRUCT_OPS,
+	BPF_MAP_TYPE_RINGBUF,
 };
 
 /* Note that tracing related programs such as
@@ -3121,6 +3122,32 @@ union bpf_attr {
  * 		0 on success, or a negative error in case of failure:
  *
  *		**-EOVERFLOW** if an overflow happened: The same object will be tried again.
+ *
+ * int bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
+ * 	Description
+ * 		Copy *size* bytes from *data* into a ring buffer *ringbuf*.
+ * 	Return
+ * 		0, on success;
+ * 		< 0, on error.
+ *
+ * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
+ * 	Description
+ * 		Reserve *size* bytes of payload in a ring buffer *ringbuf*.
+ * 	Return
+ * 		Valid pointer with *size* bytes of memory available; NULL,
+ * 		otherwise.
+ *
+ * void bpf_ringbuf_submit(void *data)
+ * 	Description
+ * 		Submit reserved ring buffer sample, pointed to by *data*.
+ * 	Return
+ * 		Nothing.
+ *
+ * void bpf_ringbuf_discard(void *data)
+ * 	Description
+ * 		Discard reserved ring buffer sample, pointed to by *data*.
+ * 	Return
+ * 		Nothing.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3250,7 +3277,11 @@ union bpf_attr {
 	FN(sk_assign),			\
 	FN(ktime_get_boot_ns),		\
 	FN(seq_printf),			\
-	FN(seq_write),
+	FN(seq_write),			\
+	FN(ringbuf_output),		\
+	FN(ringbuf_reserve),		\
+	FN(ringbuf_submit),		\
+	FN(ringbuf_discard),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.24.1



* [PATCH bpf-next 2/6] tools/memory-model: add BPF ringbuf MPSC litmus tests
  2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
@ 2020-05-13 19:25 ` Andrii Nakryiko
  2020-05-13 19:25 ` [PATCH bpf-next 3/6] bpf: track reference type in verifier Andrii Nakryiko
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-13 19:25 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Paul E . McKenney,
	Jonathan Lemon

Add 4 litmus tests for BPF ringbuf implementation, divided into two different
use cases.

First, two unbounded cases, one with 1 producer and another with
2 producers, single consumer. All reservations are supposed to succeed.

Second, bounded case with only 1 record allowed in ring buffer at any given
time. Here failures to reserve space are expected. Again, 1- and 2-producer
cases, single consumer, are validated.
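
For reference, the bounded check modeled here corresponds to the out-of-space
test in __bpf_ringbuf_reserve() from patch #1. A minimal userspace sketch of
that test follows; rb_has_space() is a hypothetical name used only for
illustration, and fixed-width integers stand in for the kernel's unsigned long
positions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: mirrors the space check done
 * in __bpf_ringbuf_reserve(). Positions are free-running byte counters and
 * mask is (ring size - 1), with ring size being a power of two.
 */
static bool rb_has_space(uint64_t prod_pos, uint64_t cons_pos,
			 uint64_t len, uint64_t mask)
{
	uint64_t new_prod_pos = prod_pos + len;

	/* ring size must be a power of two for the mask to be meaningful */
	assert(((mask + 1) & mask) == 0);

	/* unsigned subtraction stays correct when the counters wrap around;
	 * producer may be at most (ring size - 1) bytes ahead of consumer
	 */
	return new_prod_pos - cons_pos <= mask;
}
```

With a 4096-byte ring (mask 4095) and an idle consumer at position 0,
a 16-byte reservation at producer position 4080 is rejected, because it would
put the producer a full ring size ahead of the consumer.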

Just for the fun of it, I also wrote a 3-producer case; it took *16 hours* to
validate, but came back successful as well. I'm not including it in this
patch, because it's not practical to run it. See the output below for all
4 included cases and for the 3-producer bounded case.

Each litmus test implements producer/consumer protocol for BPF ring buffer
implementation found in kernel/bpf/ringbuf.c. Due to limitations, all records
are assumed equal-sized and producer/consumer counters are incremented by 1.
This doesn't change the correctness of the algorithm, though.
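
To make the modeled protocol easier to follow, here is a rough userspace
sketch of it, under the same simplifications (equal-sized records, counters
incremented by 1). All demo_* names and DEMO_SLOTS are hypothetical. C11
acquire/release atomics are used as a stand-in for the kernel primitives
(smp_wmb(), smp_load_acquire(), xchg()), so this is an approximation and not
the code in kernel/bpf/ringbuf.c. Producers are assumed to be serialized by
the caller, which is what the spinlock does in the kernel:

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_SLOTS 2	/* hypothetical ring capacity, in records */

struct demo_rb {
	atomic_int len[DEMO_SLOTS];	/* 0: unused, -1: busy, 1: committed */
	atomic_ulong px;		/* producer position, in records */
	atomic_ulong cx;		/* consumer position, in records */
};

static void demo_init(struct demo_rb *rb)
{
	int i;

	for (i = 0; i < DEMO_SLOTS; i++)
		atomic_init(&rb->len[i], 0);
	atomic_init(&rb->px, 0);
	atomic_init(&rb->cx, 0);
}

/* Reserve next record; returns slot index, or -1 if the ring is full.
 * Caller must serialize producers (spinlock in the kernel).
 */
static int demo_reserve(struct demo_rb *rb)
{
	unsigned long cx = atomic_load_explicit(&rb->cx, memory_order_acquire);
	unsigned long px = atomic_load_explicit(&rb->px, memory_order_relaxed);
	int slot;

	if (px - cx >= DEMO_SLOTS)
		return -1;

	slot = px % DEMO_SLOTS;
	/* mark record busy first, then publish new producer position */
	atomic_store_explicit(&rb->len[slot], -1, memory_order_relaxed);
	atomic_store_explicit(&rb->px, px + 1, memory_order_release);
	return slot;
}

/* Commit a reserved record; happens outside of the producer lock. */
static void demo_commit(struct demo_rb *rb, int slot)
{
	atomic_store_explicit(&rb->len[slot], 1, memory_order_release);
}

/* Try to consume one record; returns 1 if consumed, 0 if nothing is ready. */
static int demo_consume(struct demo_rb *rb)
{
	unsigned long cx = atomic_load_explicit(&rb->cx, memory_order_relaxed);
	unsigned long px = atomic_load_explicit(&rb->px, memory_order_acquire);
	int len;

	if (cx >= px)
		return 0;

	len = atomic_load_explicit(&rb->len[cx % DEMO_SLOTS],
				   memory_order_acquire);
	/* px was observed past this record, so its busy marker must be
	 * visible; len == 0 is the rFail condition in the litmus tests
	 */
	assert(len != 0);
	if (len != 1)
		return 0;	/* record is still busy */

	atomic_store_explicit(&rb->cx, cx + 1, memory_order_release);
	return 1;
}
```

The consumer stops at the first record that is still busy, even if later
records are already committed; that is the ordering property the litmus tests
check through rFail.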

Verification results:

$ herd7 -unroll 0 -conf linux-kernel.cfg mpsc-rb+1p1c+bounded.litmus
Test mpsc-rb+1p1c+bounded Allowed
States 2
0:rFail=0; 1:rFail=0; cx=0; dropped=0; len1=1; px=1;
0:rFail=0; 1:rFail=0; cx=1; dropped=0; len1=1; px=1;
Ok
Witnesses
Positive: 3 Negative: 0
Condition exists (0:rFail=0 /\ 1:rFail=0 /\ dropped=0 /\ px=1 /\ len1=1 /\ (cx=0 \/ cx=1))
Observation mpsc-rb+1p1c+bounded Always 3 0
Time mpsc-rb+1p1c+bounded 0.03
Hash=554e6af9bfde3d1bfbb2c07bb0e283ad

$ herd7 -unroll 0 -conf linux-kernel.cfg mpsc-rb+2p1c+bounded.litmus
Test mpsc-rb+2p1c+bounded Allowed
States 4
0:rFail=0; 1:rFail=0; 2:rFail=0; cx=0; dropped=1; len1=1; px=1;
0:rFail=0; 1:rFail=0; 2:rFail=0; cx=1; dropped=0; len1=1; px=2;
0:rFail=0; 1:rFail=0; 2:rFail=0; cx=1; dropped=1; len1=1; px=1;
0:rFail=0; 1:rFail=0; 2:rFail=0; cx=2; dropped=0; len1=1; px=2;
Ok
Witnesses
Positive: 40 Negative: 0
Condition exists (0:rFail=0 /\ 1:rFail=0 /\ 2:rFail=0 /\ len1=1 /\ (dropped=0 /\ px=2 /\ (cx=1 \/ cx=2) \/ dropped=1 /\ px=1 /\ (cx=0 \/ cx=1)))
Observation mpsc-rb+2p1c+bounded Always 40 0
Time mpsc-rb+2p1c+bounded 114.32
Hash=fa7c38edbf482a6605d6b2031c310bdc

$ herd7 -unroll 0 -conf linux-kernel.cfg mpsc-rb+1p1c+unbound.litmus
Test mpsc-rb+1p1c+unbound Allowed
States 2
0:rFail=0; 1:rFail=0; cx=0; len1=1; px=1;
0:rFail=0; 1:rFail=0; cx=1; len1=1; px=1;
Ok
Witnesses
Positive: 4 Negative: 0
Condition exists (0:rFail=0 /\ 1:rFail=0 /\ px=1 /\ len1=1 /\ (cx=0 \/ cx=1))
Observation mpsc-rb+1p1c+unbound Always 4 0
Time mpsc-rb+1p1c+unbound 0.02
Hash=0798c82c96356e6eb25edbcd8561dfcf

$ herd7 -unroll 0 -conf linux-kernel.cfg mpsc-rb+2p1c+unbound.litmus
Test mpsc-rb+2p1c+unbound Allowed
States 3
0:rFail=0; 1:rFail=0; 2:rFail=0; cx=0; len1=1; len2=1; px=2;
0:rFail=0; 1:rFail=0; 2:rFail=0; cx=1; len1=1; len2=1; px=2;
0:rFail=0; 1:rFail=0; 2:rFail=0; cx=2; len1=1; len2=1; px=2;
Ok
Witnesses
Positive: 78 Negative: 0
Condition exists (0:rFail=0 /\ 1:rFail=0 /\ 2:rFail=0 /\ px=2 /\ len1=1 /\ len2=1 /\ (cx=0 \/ cx=1 \/ cx=2))
Observation mpsc-rb+2p1c+unbound Always 78 0
Time mpsc-rb+2p1c+unbound 37.85
Hash=a1a73c1810ff3eb91f0d054f232399a1

$ herd7 -unroll 0 -conf linux-kernel.cfg mpsc-rb+3p1c+bounded.litmus
Test mpsc+ringbuf-spinlock Allowed
States 5
0:rFail=0; 1:rFail=0; 2:rFail=0; 3:rFail=0; cx=0; len1=1; len2=1; px=2;
0:rFail=0; 1:rFail=0; 2:rFail=0; 3:rFail=0; cx=1; len1=1; len2=1; px=2;
0:rFail=0; 1:rFail=0; 2:rFail=0; 3:rFail=0; cx=1; len1=1; len2=1; px=3;
0:rFail=0; 1:rFail=0; 2:rFail=0; 3:rFail=0; cx=2; len1=1; len2=1; px=2;
0:rFail=0; 1:rFail=0; 2:rFail=0; 3:rFail=0; cx=2; len1=1; len2=1; px=3;
Ok
Witnesses
Positive: 558 Negative: 0
Condition exists (0:rFail=0 /\ 1:rFail=0 /\ 2:rFail=0 /\ 3:rFail=0 /\ len1=1 /\ len2=1 /\ (px=2 /\ (cx=0 \/ cx=1 \/ cx=2) \/ px=3 /\ (cx=1 \/ cx=2)))
Observation mpsc+ringbuf-spinlock Always 558 0
Time mpsc+ringbuf-spinlock 57487.24
Hash=133977dba930d167b4e1b4a6923d5687

Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 .../litmus-tests/mpsc-rb+1p1c+bounded.litmus  |  92 +++++++++++
 .../litmus-tests/mpsc-rb+1p1c+unbound.litmus  |  83 ++++++++++
 .../litmus-tests/mpsc-rb+2p1c+bounded.litmus  | 152 ++++++++++++++++++
 .../litmus-tests/mpsc-rb+2p1c+unbound.litmus  | 137 ++++++++++++++++
 4 files changed, 464 insertions(+)
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+1p1c+bounded.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+1p1c+unbound.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+2p1c+bounded.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+2p1c+unbound.litmus

diff --git a/tools/memory-model/litmus-tests/mpsc-rb+1p1c+bounded.litmus b/tools/memory-model/litmus-tests/mpsc-rb+1p1c+bounded.litmus
new file mode 100644
index 000000000000..c97b2e1b56f6
--- /dev/null
+++ b/tools/memory-model/litmus-tests/mpsc-rb+1p1c+bounded.litmus
@@ -0,0 +1,92 @@
+C mpsc-rb+1p1c+bounded
+
+(*
+ * Result: Always
+ *
+ * This litmus test validates BPF ring buffer implementation under the
+ * following assumptions:
+ * - 1 producer;
+ * - 1 consumer;
+ * - ring buffer has capacity for only 1 record.
+ *
+ * Expectations:
+ * - 1 record pushed into ring buffer;
+ * - 0 or 1 element is consumed;
+ * - no failures.
+ *)
+
+{
+	max_len = 1;
+	len1 = 0;
+	px = 0;
+	cx = 0;
+	dropped = 0;
+}
+
+P0(int *len1, int *cx, int *px)
+{
+	int *rLenPtr;
+	int rLen;
+	int rPx;
+	int rCx;
+	int rFail;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	rPx = smp_load_acquire(px);
+	if (rCx < rPx) {
+		if (rCx == 0)
+			rLenPtr = len1;
+		else
+			rFail = 1;
+
+		rLen = READ_ONCE(*rLenPtr);
+		if (rLen == 0) {
+			rFail = 1;
+		} else if (rLen == 1) {
+			rCx = rCx + 1;
+			WRITE_ONCE(*cx, rCx);
+		}
+	}
+}
+
+P1(int *len1, spinlock_t *rb_lock, int *px, int *cx, int *dropped, int *max_len)
+{
+	int rPx;
+	int rCx;
+	int rFail;
+	int *rLenPtr;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	spin_lock(rb_lock);
+
+	rPx = *px;
+	if (rPx - rCx >= *max_len) {
+		atomic_inc(dropped);
+		spin_unlock(rb_lock);
+	} else {
+		if (rPx == 0)
+			rLenPtr = len1;
+		else
+			rFail = 1;
+
+		WRITE_ONCE(*rLenPtr, -1);
+		smp_wmb();
+		WRITE_ONCE(*px, rPx + 1);
+
+		spin_unlock(rb_lock);
+
+		WRITE_ONCE(*rLenPtr, 1);
+	}
+}
+
+exists (
+	0:rFail=0 /\ 1:rFail=0
+	/\
+	(
+		(dropped=0 /\ px=1 /\ len1=1 /\ (cx=0 \/ cx=1))
+	)
+)
diff --git a/tools/memory-model/litmus-tests/mpsc-rb+1p1c+unbound.litmus b/tools/memory-model/litmus-tests/mpsc-rb+1p1c+unbound.litmus
new file mode 100644
index 000000000000..b1e25ec6275e
--- /dev/null
+++ b/tools/memory-model/litmus-tests/mpsc-rb+1p1c+unbound.litmus
@@ -0,0 +1,83 @@
+C mpsc-rb+1p1c+unbound
+
+(*
+ * Result: Always
+ *
+ * This litmus test validates BPF ring buffer implementation under the
+ * following assumptions:
+ * - 1 producer;
+ * - 1 consumer;
+ * - ring buffer capacity is unbounded.
+ *
+ * Expectations:
+ * - 1 record pushed into ring buffer;
+ * - 0 or 1 element is consumed;
+ * - no failures.
+ *)
+
+{
+	len1 = 0;
+	px = 0;
+	cx = 0;
+}
+
+P0(int *len1, int *cx, int *px)
+{
+	int *rLenPtr;
+	int rLen;
+	int rPx;
+	int rCx;
+	int rFail;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	rPx = smp_load_acquire(px);
+	if (rCx < rPx) {
+		if (rCx == 0)
+			rLenPtr = len1;
+		else
+			rFail = 1;
+
+		rLen = READ_ONCE(*rLenPtr);
+		if (rLen == 0) {
+			rFail = 1;
+		} else if (rLen == 1) {
+			rCx = rCx + 1;
+			WRITE_ONCE(*cx, rCx);
+		}
+	}
+}
+
+P1(int *len1, spinlock_t *rb_lock, int *px, int *cx)
+{
+	int rPx;
+	int rCx;
+	int rFail;
+	int *rLenPtr;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	spin_lock(rb_lock);
+
+	rPx = *px;
+	if (rPx == 0)
+		rLenPtr = len1;
+	else
+		rFail = 1;
+
+	WRITE_ONCE(*rLenPtr, -1);
+	smp_wmb();
+	WRITE_ONCE(*px, rPx + 1);
+
+	spin_unlock(rb_lock);
+
+	WRITE_ONCE(*rLenPtr, 1);
+}
+
+exists (
+	0:rFail=0 /\ 1:rFail=0
+	/\ px=1 /\ len1=1
+	/\ (cx=0 \/ cx=1)
+)
diff --git a/tools/memory-model/litmus-tests/mpsc-rb+2p1c+bounded.litmus b/tools/memory-model/litmus-tests/mpsc-rb+2p1c+bounded.litmus
new file mode 100644
index 000000000000..0707aa9ad307
--- /dev/null
+++ b/tools/memory-model/litmus-tests/mpsc-rb+2p1c+bounded.litmus
@@ -0,0 +1,152 @@
+C mpsc-rb+2p1c+bounded
+
+(*
+ * Result: Always
+ *
+ * This litmus test validates BPF ring buffer implementation under the
+ * following assumptions:
+ * - 2 identical producers;
+ * - 1 consumer;
+ * - ring buffer has capacity for only 1 record.
+ *
+ * Expectations:
+ * - either 1 or 2 records are pushed into ring buffer;
+ * - 0, 1, or 2 elements are consumed by consumer;
+ * - appropriate number of dropped records is recorded to satisfy ring buffer
+ *   size bounds;
+ * - no failures.
+ *)
+
+{
+	max_len = 1;
+	len1 = 0;
+	px = 0;
+	cx = 0;
+	dropped = 0;
+}
+
+P0(int *len1, int *cx, int *px)
+{
+	int *rLenPtr;
+	int rLen;
+	int rPx;
+	int rCx;
+	int rFail;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	rPx = smp_load_acquire(px);
+	if (rCx < rPx) {
+		if (rCx == 0)
+			rLenPtr = len1;
+		else if (rCx == 1)
+			rLenPtr = len1;
+		else
+			rFail = 1;
+
+		rLen = READ_ONCE(*rLenPtr);
+		if (rLen == 0) {
+			rFail = 1;
+		} else if (rLen == 1) {
+			rCx = rCx + 1;
+			WRITE_ONCE(*cx, rCx);
+		}
+	}
+
+	rPx = smp_load_acquire(px);
+	if (rCx < rPx) {
+		if (rCx == 0)
+			rLenPtr = len1;
+		else if (rCx == 1)
+			rLenPtr = len1;
+		else
+			rFail = 1;
+
+		rLen = READ_ONCE(*rLenPtr);
+		if (rLen == 0) {
+			rFail = 1;
+		} else if (rLen == 1) {
+			rCx = rCx + 1;
+			WRITE_ONCE(*cx, rCx);
+		}
+	}
+}
+
+P1(int *len1, spinlock_t *rb_lock, int *px, int *cx, int *dropped, int *max_len)
+{
+	int rPx;
+	int rCx;
+	int rFail;
+	int *rLenPtr;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	spin_lock(rb_lock);
+
+	rPx = *px;
+	if (rPx - rCx >= *max_len) {
+		atomic_inc(dropped);
+		spin_unlock(rb_lock);
+	} else {
+		if (rPx == 0)
+			rLenPtr = len1;
+		else if (rPx == 1)
+			rLenPtr = len1;
+		else
+			rFail = 1;
+
+		WRITE_ONCE(*rLenPtr, -1);
+		smp_wmb();
+		WRITE_ONCE(*px, rPx + 1);
+
+		spin_unlock(rb_lock);
+
+		WRITE_ONCE(*rLenPtr, 1);
+	}
+}
+
+P2(int *len1, spinlock_t *rb_lock, int *px, int *cx, int *dropped, int *max_len)
+{
+	int rPx;
+	int rCx;
+	int rFail;
+	int *rLenPtr;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	spin_lock(rb_lock);
+
+	rPx = *px;
+	if (rPx - rCx >= *max_len) {
+		atomic_inc(dropped);
+		spin_unlock(rb_lock);
+	} else {
+		if (rPx == 0)
+			rLenPtr = len1;
+		else if (rPx == 1)
+			rLenPtr = len1;
+		else
+			rFail = 1;
+
+		WRITE_ONCE(*rLenPtr, -1);
+		smp_wmb();
+		WRITE_ONCE(*px, rPx + 1);
+
+		spin_unlock(rb_lock);
+
+		WRITE_ONCE(*rLenPtr, 1);
+	}
+}
+
+exists (
+	0:rFail=0 /\ 1:rFail=0 /\ 2:rFail=0 /\ len1=1
+	/\
+	(
+		(dropped = 0 /\ px=2 /\ (cx=1 \/ cx=2))
+		\/
+		(dropped = 1 /\ px=1 /\ (cx=0 \/ cx=1))
+	)
+)
diff --git a/tools/memory-model/litmus-tests/mpsc-rb+2p1c+unbound.litmus b/tools/memory-model/litmus-tests/mpsc-rb+2p1c+unbound.litmus
new file mode 100644
index 000000000000..f334ffece324
--- /dev/null
+++ b/tools/memory-model/litmus-tests/mpsc-rb+2p1c+unbound.litmus
@@ -0,0 +1,137 @@
+C mpsc-rb+2p1c+unbound
+
+(*
+ * Result: Always
+ *
+ * This litmus test validates BPF ring buffer implementation under the
+ * following assumptions:
+ * - 2 identical producers;
+ * - 1 consumer;
+ * - ring buffer capacity is unbounded.
+ *
+ * Expectations:
+ * - 2 records pushed into ring buffer;
+ * - 0, 1, or 2 elements are consumed;
+ * - no failures.
+ *)
+
+{
+	len1 = 0;
+	len2 = 0;
+	px = 0;
+	cx = 0;
+}
+
+P0(int *len1, int *len2, int *cx, int *px)
+{
+	int *rLenPtr;
+	int rLen;
+	int rPx;
+	int rCx;
+	int rFail;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	rPx = smp_load_acquire(px);
+	if (rCx < rPx) {
+		if (rCx == 0)
+			rLenPtr = len1;
+		else if (rCx == 1)
+			rLenPtr = len2;
+		else
+			rFail = 1;
+
+		rLen = READ_ONCE(*rLenPtr);
+		if (rLen == 0) {
+			rFail = 1;
+		} else if (rLen == 1) {
+			rCx = rCx + 1;
+			WRITE_ONCE(*cx, rCx);
+		}
+	}
+
+	rPx = smp_load_acquire(px);
+	if (rCx < rPx) {
+		if (rCx == 0)
+			rLenPtr = len1;
+		else if (rCx == 1)
+			rLenPtr = len2;
+		else
+			rFail = 1;
+
+		rLen = READ_ONCE(*rLenPtr);
+		if (rLen == 0) {
+			rFail = 1;
+		} else if (rLen == 1) {
+			rCx = rCx + 1;
+			WRITE_ONCE(*cx, rCx);
+		}
+	}
+}
+
+P1(int *len1, int *len2, spinlock_t *rb_lock, int *px, int *cx)
+{
+	int rPx;
+	int rCx;
+	int rFail;
+	int *rLenPtr;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	spin_lock(rb_lock);
+
+	rPx = *px;
+	if (rPx == 0)
+		rLenPtr = len1;
+	else if (rPx == 1)
+		rLenPtr = len2;
+	else
+		rFail = 1;
+
+	WRITE_ONCE(*rLenPtr, -1);
+	smp_wmb();
+	WRITE_ONCE(*px, rPx + 1);
+
+	spin_unlock(rb_lock);
+
+	WRITE_ONCE(*rLenPtr, 1);
+}
+
+P2(int *len1, int *len2, spinlock_t *rb_lock, int *px, int *cx)
+{
+	int rPx;
+	int rCx;
+	int rFail;
+	int *rLenPtr;
+
+	rFail = 0;
+	rCx = READ_ONCE(*cx);
+
+	spin_lock(rb_lock);
+
+	rPx = *px;
+	if (rPx == 0)
+		rLenPtr = len1;
+	else if (rPx == 1)
+		rLenPtr = len2;
+	else
+		rFail = 1;
+
+	WRITE_ONCE(*rLenPtr, -1);
+	smp_wmb();
+	WRITE_ONCE(*px, rPx + 1);
+
+	spin_unlock(rb_lock);
+
+	WRITE_ONCE(*rLenPtr, 1);
+}
+
+exists (
+	0:rFail=0 /\ 1:rFail=0 /\ 2:rFail=0
+	/\
+	px=2 /\ len1=1 /\ len2=1
+	/\
+	(cx=0 \/ cx=1 \/ cx=2)
+)
-- 
2.24.1



* [PATCH bpf-next 3/6] bpf: track reference type in verifier
  2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
  2020-05-13 19:25 ` [PATCH bpf-next 2/6] tools/memory-model: add BPF ringbuf MPSC litmus tests Andrii Nakryiko
@ 2020-05-13 19:25 ` Andrii Nakryiko
  2020-05-13 19:25 ` [PATCH bpf-next 4/6] libbpf: add BPF ring buffer support Andrii Nakryiko
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-13 19:25 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Paul E . McKenney,
	Jonathan Lemon

This patch extends the verifier's reference tracking logic to also track
a reference type, ensuring that acquire()/release() functions accept only the
right types of references. Currently, ambiguity between socket and ringbuf
record references is resolved through the use of different input argument
types for bpf_sk_release() and bpf_ringbuf_submit()/bpf_ringbuf_discard():
ARG_PTR_TO_SOCK_COMMON and ARG_PTR_TO_ALLOC_MEM, respectively. It is thus
impossible to pass a ringbuf record pointer to bpf_sk_release() (and vice
versa for a socket).

On the other hand, patch #1 added the ARG_PTR_TO_ALLOC_MEM arg type, which,
from the verifier's point of view, is a pointer to a fixed-size allocated
memory region. This is a generic enough concept that it could be used for
other BPF helpers (e.g., a malloc/free pair, if added). Once we have that,
there will be nothing to prevent passing a ringbuf record to a bpf_mem_free()
(or whatever) helper. To that end, this patch adds the capability to specify
and track reference types, which the verifier validates to ensure a correct
match between acquire() and release() helpers.
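
The matching rule described above can be modelled in plain user-space C. The
following is only an illustrative sketch, not verifier code: ref_table,
ref_acquire() and ref_release() are hypothetical names, and the fixed-size
table is an assumption made to keep the example short. As in the patch,
passing the "invalid" type to release skips the type check.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical model of typed reference tracking (not kernel code). */
enum ref_type { REF_INVALID, REF_SOCKET, REF_RINGBUF };

struct ref_state {
	int id;
	enum ref_type type;
};

#define MAX_REFS 8

struct ref_table {
	struct ref_state refs[MAX_REFS];
	int cnt;
	int id_gen;
};

/* Record a new reference; returns its id, or -ENOMEM if the table is full. */
static int ref_acquire(struct ref_table *t, enum ref_type type)
{
	assert(type != REF_INVALID);
	if (t->cnt == MAX_REFS)
		return -ENOMEM;
	t->refs[t->cnt].id = ++t->id_gen;
	t->refs[t->cnt].type = type;
	return t->refs[t->cnt++].id;
}

/* Release reference `id`. A REF_INVALID `type` skips the type check;
 * any other type must match the type recorded at acquire time.
 */
static int ref_release(struct ref_table *t, int id, enum ref_type type)
{
	int i, last = t->cnt - 1;

	for (i = 0; i < t->cnt; i++) {
		if (t->refs[i].id != id)
			continue;
		if (type != REF_INVALID && t->refs[i].type != type)
			return -EINVAL;
		/* keep the array dense: move the last entry into the hole */
		if (i != last)
			t->refs[i] = t->refs[last];
		memset(&t->refs[last], 0, sizeof(t->refs[last]));
		t->cnt--;
		return 0;
	}
	return -EINVAL;
}
```

In this model, releasing a ringbuf record through a socket-typed release
fails with -EINVAL and leaves the reference in place, which is the mismatch
the patch is meant to catch.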

This patch can be postponed, so it is posted separately from the other
verifier changes.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf_verifier.h |  8 +++
 kernel/bpf/verifier.c        | 96 +++++++++++++++++++++++++++++-------
 2 files changed, 86 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c94a736e53cd..2a6d961570cb 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -164,6 +164,12 @@ struct bpf_stack_state {
 	u8 slot_type[BPF_REG_SIZE];
 };
 
+enum bpf_ref_type {
+	BPF_REF_INVALID,
+	BPF_REF_SOCKET,
+	BPF_REF_RINGBUF,
+};
+
 struct bpf_reference_state {
 	/* Track each reference created with a unique id, even if the same
 	 * instruction creates the reference multiple times (eg, via CALL).
@@ -173,6 +179,8 @@ struct bpf_reference_state {
 	 * is used purely to inform the user of a reference leak.
 	 */
 	int insn_idx;
+	/* Type of reference being tracked */
+	enum bpf_ref_type ref_type;
 };
 
 /* state of the program:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b8f0158d2327..dc741a631089 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -436,6 +436,19 @@ static bool is_release_function(enum bpf_func_id func_id)
 	       func_id == BPF_FUNC_ringbuf_discard;
 }
 
+static enum bpf_ref_type get_release_ref_type(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_release:
+		return BPF_REF_SOCKET;
+	case BPF_FUNC_ringbuf_submit:
+	case BPF_FUNC_ringbuf_discard:
+		return BPF_REF_RINGBUF;
+	default:
+		return BPF_REF_INVALID;
+	}
+}
+
 static bool may_be_acquire_function(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_sk_lookup_tcp ||
@@ -464,6 +477,28 @@ static bool is_acquire_function(enum bpf_func_id func_id,
 	return false;
 }
 
+static enum bpf_ref_type get_acquire_ref_type(enum bpf_func_id func_id,
+					      const struct bpf_map *map)
+{
+	enum bpf_map_type map_type = map ? map->map_type : BPF_MAP_TYPE_UNSPEC;
+
+	switch (func_id) {
+	case BPF_FUNC_sk_lookup_tcp:
+	case BPF_FUNC_sk_lookup_udp:
+	case BPF_FUNC_skc_lookup_tcp:
+		return BPF_REF_SOCKET;
+	case BPF_FUNC_map_lookup_elem:
+		if (map_type == BPF_MAP_TYPE_SOCKMAP ||
+		    map_type == BPF_MAP_TYPE_SOCKHASH)
+			return BPF_REF_SOCKET;
+		return BPF_REF_INVALID;
+	case BPF_FUNC_ringbuf_reserve:
+		return BPF_REF_RINGBUF;
+	default:
+		return BPF_REF_INVALID;
+	}
+}
+
 static bool is_ptr_cast_function(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_tcp_sock ||
@@ -736,7 +771,8 @@ static int realloc_func_state(struct bpf_func_state *state, int stack_size,
  * On success, returns a valid pointer id to associate with the register
  * On failure, returns a negative errno.
  */
-static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
+static int acquire_reference_state(struct bpf_verifier_env *env,
+				   int insn_idx, enum bpf_ref_type ref_type)
 {
 	struct bpf_func_state *state = cur_func(env);
 	int new_ofs = state->acquired_refs;
@@ -748,25 +784,32 @@ static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
 	id = ++env->id_gen;
 	state->refs[new_ofs].id = id;
 	state->refs[new_ofs].insn_idx = insn_idx;
+	state->refs[new_ofs].ref_type = ref_type;
 
 	return id;
 }
 
 /* release function corresponding to acquire_reference_state(). Idempotent. */
-static int release_reference_state(struct bpf_func_state *state, int ptr_id)
+static int release_reference_state(struct bpf_func_state *state, int ptr_id,
+				   enum bpf_ref_type ref_type)
 {
+	struct bpf_reference_state *ref;
 	int i, last_idx;
 
 	last_idx = state->acquired_refs - 1;
 	for (i = 0; i < state->acquired_refs; i++) {
-		if (state->refs[i].id == ptr_id) {
-			if (last_idx && i != last_idx)
-				memcpy(&state->refs[i], &state->refs[last_idx],
-				       sizeof(*state->refs));
-			memset(&state->refs[last_idx], 0, sizeof(*state->refs));
-			state->acquired_refs--;
-			return 0;
-		}
+		ref = &state->refs[i];
+		if (ref->id != ptr_id)
+			continue;
+
+		if (ref_type != BPF_REF_INVALID && ref->ref_type != ref_type)
+			return -EINVAL;
+
+		if (i != last_idx)
+			memcpy(ref, &state->refs[last_idx], sizeof(*ref));
+		memset(&state->refs[last_idx], 0, sizeof(*ref));
+		state->acquired_refs--;
+		return 0;
 	}
 	return -EINVAL;
 }
@@ -4295,14 +4338,13 @@ static void release_reg_references(struct bpf_verifier_env *env,
 /* The pointer with the specified id has released its reference to kernel
  * resources. Identify all copies of the same pointer and clear the reference.
  */
-static int release_reference(struct bpf_verifier_env *env,
-			     int ref_obj_id)
+static int release_reference(struct bpf_verifier_env *env, int ref_obj_id,
+			     enum bpf_ref_type ref_type)
 {
 	struct bpf_verifier_state *vstate = env->cur_state;
-	int err;
-	int i;
+	int err, i;
 
-	err = release_reference_state(cur_func(env), ref_obj_id);
+	err = release_reference_state(cur_func(env), ref_obj_id, ref_type);
 	if (err)
 		return err;
 
@@ -4661,7 +4703,16 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			return err;
 		}
 	} else if (is_release_function(func_id)) {
-		err = release_reference(env, meta.ref_obj_id);
+		enum bpf_ref_type ref_type;
+
+		ref_type = get_release_ref_type(func_id);
+		if (ref_type == BPF_REF_INVALID) {
+			verbose(env, "unrecognized reference accepted by func %s#%d\n",
+				func_id_name(func_id), func_id);
+			return -EFAULT;
+		}
+
+		err = release_reference(env, meta.ref_obj_id, ref_type);
 		if (err) {
 			verbose(env, "func %s#%d reference has not been acquired before\n",
 				func_id_name(func_id), func_id);
@@ -4744,8 +4795,17 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		/* For release_reference() */
 		regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id;
 	} else if (is_acquire_function(func_id, meta.map_ptr)) {
-		int id = acquire_reference_state(env, insn_idx);
+		enum bpf_ref_type ref_type;
+		int id;
+
+		ref_type = get_acquire_ref_type(func_id, meta.map_ptr);
+		if (ref_type == BPF_REF_INVALID) {
+			verbose(env, "unrecognized reference returned by func %s#%d\n",
+				func_id_name(func_id), func_id);
+			return -EINVAL;
+		}
 
+		id = acquire_reference_state(env, insn_idx, ref_type);
 		if (id < 0)
 			return id;
 		/* For mark_ptr_or_null_reg() */
@@ -6728,7 +6788,7 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
 		 * No one could have freed the reference state before
 		 * doing the NULL check.
 		 */
-		WARN_ON_ONCE(release_reference_state(state, id));
+		WARN_ON_ONCE(release_reference_state(state, id, BPF_REF_INVALID));
 
 	for (i = 0; i <= vstate->curframe; i++)
 		__mark_ptr_or_null_regs(vstate->frame[i], id, is_null);
-- 
2.24.1



* [PATCH bpf-next 4/6] libbpf: add BPF ring buffer support
  2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2020-05-13 19:25 ` [PATCH bpf-next 3/6] bpf: track reference type in verifier Andrii Nakryiko
@ 2020-05-13 19:25 ` Andrii Nakryiko
  2020-05-13 19:25 ` [PATCH bpf-next 5/6] selftests/bpf: add BPF ringbuf selftests Andrii Nakryiko
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-13 19:25 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Paul E . McKenney,
	Jonathan Lemon

Declaring and instantiating a BPF ring buffer doesn't require any changes to
libbpf, as it's just another map type. Using the existing BTF-defined map
syntax with __uint(type, BPF_MAP_TYPE_RINGBUF) and __uint(max_entries,
<size-of-ring-buf>) is all that's necessary to create and use a BPF ring
buffer.

This patch adds BPF ring buffer consumer to libbpf, similar to perf_buffer
implementation for perf ring buffer, but also attempts to fix some minor
problems and inconveniences with existing perf_buffer API.

ring_buffer supports both the single ring buffer use case (using just
ring_buffer__new()) and adding more ring buffers with ring_buffer__add(),
each with its own callback and context. This makes it possible to
efficiently poll and consume multiple, potentially completely independent,
ring buffers using a single epoll instance.

First, polling multiple buffers is a problem in practice for applications
that use multiple sets of perf buffers. They have to create multiple
instances of struct perf_buffer and poll them independently or in a loop,
each approach having its own problems (e.g., inability to use a common poll
timeout). struct ring_buffer eliminates this problem by aggregating many
independent ring buffer instances under a single "ring buffer manager".
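
The aggregation idea can be shown in isolation with a small user-space
sketch. It is not the libbpf implementation: manager_init(), manager_add(),
manager_poll() and bump() are hypothetical names, eventfd descriptors stand
in for ringbuf map fds, and the fixed MAX_SRCS limit is an assumption to keep
the example short. Each registered fd stores its slot index in the epoll
event data, so one epoll_wait() call with one timeout serves every source.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define MAX_SRCS 4

typedef void (*ready_fn)(void *ctx);

struct source {
	int fd;
	ready_fn cb;
	void *ctx;
};

struct manager {
	int epoll_fd;
	int cnt;
	struct source srcs[MAX_SRCS];
};

static int manager_init(struct manager *m)
{
	m->cnt = 0;
	m->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
	return m->epoll_fd < 0 ? -1 : 0;
}

/* Register one more fd, with its own callback and context. */
static int manager_add(struct manager *m, int fd, ready_fn cb, void *ctx)
{
	struct epoll_event e = { .events = EPOLLIN };

	if (m->cnt == MAX_SRCS)
		return -1;
	e.data.u32 = m->cnt; /* remember which slot this fd belongs to */
	if (epoll_ctl(m->epoll_fd, EPOLL_CTL_ADD, fd, &e) < 0)
		return -1;
	m->srcs[m->cnt++] = (struct source){ fd, cb, ctx };
	return 0;
}

/* Wait once, with a single shared timeout, and dispatch every ready fd. */
static int manager_poll(struct manager *m, int timeout_ms)
{
	struct epoll_event evs[MAX_SRCS];
	int i, n = epoll_wait(m->epoll_fd, evs, MAX_SRCS, timeout_ms);

	for (i = 0; i < n; i++) {
		struct source *s;
		uint64_t v;

		assert(evs[i].data.u32 < (uint32_t)m->cnt);
		s = &m->srcs[evs[i].data.u32];
		/* drain the eventfd counter, then notify the owner */
		if (read(s->fd, &v, sizeof(v)) == (ssize_t)sizeof(v))
			s->cb(s->ctx);
	}
	return n;
}

/* Example callback: count how many times a source became ready. */
static void bump(void *ctx)
{
	(*(int *)ctx)++;
}
```

A caller would create two eventfds, register both with manager_add() and
call manager_poll() in a loop; only the callback of the fd that was written
to fires, and both share the same timeout.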

Second, perf_buffer's callback can't return an error, so applications that
need to stop polling due to an error in the data, or data signalling the end,
have to use extra mechanisms to signal that polling has to stop.
ring_buffer's callback can return an error, which is passed back to user
code and can be acted upon appropriately.
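
How a callback error stops consumption can be sketched over a flat
in-memory buffer. This is a simplified model under stated assumptions, not
the code added by this patch: it has no wrap-around and no memory barriers,
and record_span(), put_record(), consume() and count_cb() are hypothetical
names. The 8-byte header, the busy/discard bits and the 8-byte rounding
follow the layout used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUSY_BIT    (1U << 31)
#define DISCARD_BIT (1U << 30)
#define HDR_LEN     8

typedef int (*sample_fn)(void *ctx, void *data, size_t size);

/* Space one record occupies: header plus payload, rounded up to 8 bytes. */
static uint32_t record_span(uint32_t len)
{
	len &= ~(BUSY_BIT | DISCARD_BIT);
	return (len + HDR_LEN + 7) / 8 * 8;
}

/* Append a record with the given flag bits at *pos and advance *pos. */
static void put_record(uint8_t *buf, size_t *pos, uint32_t flags,
		       const void *data, uint32_t len)
{
	uint32_t hdr = len | flags;

	assert((len & (BUSY_BIT | DISCARD_BIT)) == 0);
	memcpy(buf + *pos, &hdr, sizeof(hdr));
	memcpy(buf + *pos + HDR_LEN, data, len);
	*pos += record_span(len);
}

/* Consume records in [*cons, prod). Returns 0 when it runs out of data or
 * hits a busy record, or the first non-zero value returned by the callback.
 * *cons is advanced past every record that was delivered or discarded.
 */
static int consume(uint8_t *buf, size_t *cons, size_t prod,
		   sample_fn cb, void *ctx)
{
	while (*cons < prod) {
		size_t at = *cons;
		uint32_t hdr;
		int err;

		memcpy(&hdr, buf + at, sizeof(hdr));
		if (hdr & BUSY_BIT)
			return 0; /* not committed yet, retry later */

		*cons += record_span(hdr);
		if (hdr & DISCARD_BIT)
			continue; /* skip silently */

		err = cb(ctx, buf + at + HDR_LEN, hdr);
		if (err)
			return err; /* position already updated */
	}
	return 0;
}

/* Example callback: count samples; a payload starting with 'X' means
 * "end of data" and is reported as an error to stop consumption.
 */
static int count_cb(void *ctx, void *data, size_t size)
{
	int *count = ctx;

	(*count)++;
	return (size > 0 && *(char *)data == 'X') ? -1 : 0;
}
```

Because the consumer position is advanced before the error is returned, a
later consume() call resumes with the record after the one that stopped the
loop.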

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/Build           |   2 +-
 tools/lib/bpf/libbpf.h        |  20 +++
 tools/lib/bpf/libbpf.map      |   4 +
 tools/lib/bpf/libbpf_probes.c |   5 +
 tools/lib/bpf/ringbuf.c       | 264 ++++++++++++++++++++++++++++++++++
 5 files changed, 294 insertions(+), 1 deletion(-)
 create mode 100644 tools/lib/bpf/ringbuf.c

diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build
index e3962cfbc9a6..190366d05588 100644
--- a/tools/lib/bpf/Build
+++ b/tools/lib/bpf/Build
@@ -1,3 +1,3 @@
 libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \
 	    netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o hashmap.o \
-	    btf_dump.o
+	    btf_dump.o ringbuf.o
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 8ea69558f0a8..037b764ca85f 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -478,6 +478,26 @@ LIBBPF_API int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags);
 LIBBPF_API int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
 				     size_t info_size, __u32 flags);
 
+/* Ring buffer APIs */
+struct ring_buffer;
+
+typedef int (*ring_buffer_sample_fn)(void *ctx, void *data, size_t size);
+
+struct ring_buffer_opts {
+	size_t sz; /* size of this struct, for forward/backward compatibility */
+};
+
+#define ring_buffer_opts__last_field sz
+
+LIBBPF_API struct ring_buffer *
+ring_buffer__new(int map_fd, ring_buffer_sample_fn sample_cb, void *ctx,
+		 const struct ring_buffer_opts *opts);
+LIBBPF_API void ring_buffer__free(struct ring_buffer *rb);
+LIBBPF_API int ring_buffer__add(struct ring_buffer *rb, int map_fd,
+				ring_buffer_sample_fn sample_cb, void *ctx);
+LIBBPF_API int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms);
+
+/* Perf buffer APIs */
 struct perf_buffer;
 
 typedef void (*perf_buffer_sample_fn)(void *ctx, int cpu,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 0133d469d30b..0f32a9ba5bb5 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -262,4 +262,8 @@ LIBBPF_0.0.9 {
 		bpf_link_get_fd_by_id;
 		bpf_link_get_next_id;
 		bpf_program__attach_iter;
+		ring_buffer__add;
+		ring_buffer__free;
+		ring_buffer__new;
+		ring_buffer__poll;
 } LIBBPF_0.0.8;
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 2c92059c0c90..10cd8d1891f5 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -238,6 +238,11 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex)
 		if (btf_fd < 0)
 			return false;
 		break;
+	case BPF_MAP_TYPE_RINGBUF:
+		key_size = 0;
+		value_size = 0;
+		max_entries = 4096;
+		break;
 	case BPF_MAP_TYPE_UNSPEC:
 	case BPF_MAP_TYPE_HASH:
 	case BPF_MAP_TYPE_ARRAY:
diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
new file mode 100644
index 000000000000..b55eef8317b2
--- /dev/null
+++ b/tools/lib/bpf/ringbuf.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+/*
+ * Ring buffer operations.
+ *
+ * Copyright (C) 2020 Facebook, Inc.
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <linux/err.h>
+#include <linux/bpf.h>
+#include <asm/barrier.h>
+#include <sys/mman.h>
+#include <sys/epoll.h>
+#include <tools/libc_compat.h>
+
+#include "libbpf.h"
+#include "libbpf_internal.h"
+#include "bpf.h"
+
+/* make sure libbpf doesn't use kernel-only integer typedefs */
+#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64
+
+struct ring {
+	ring_buffer_sample_fn sample_cb;
+	void *ctx;
+	void *data;
+	__u64 *consumer_pos;
+	__u64 *producer_pos;
+	__u64 mask;
+	int map_fd;
+};
+
+struct ring_buffer {
+	struct epoll_event *events;
+	struct ring *rings;
+	size_t page_size;
+	int epoll_fd;
+	int ring_cnt;
+};
+
+static void ringbuf_unmap_ring(struct ring_buffer *rb, struct ring *r)
+{
+	if (r->consumer_pos) {
+		munmap(r->consumer_pos, rb->page_size);
+		r->consumer_pos = NULL;
+	}
+	if (r->producer_pos) {
+		munmap(r->producer_pos, rb->page_size + 2 * (r->mask + 1));
+		r->producer_pos = NULL;
+	}
+}
+
+/* Add extra RINGBUF maps to this ring buffer manager */
+int ring_buffer__add(struct ring_buffer *rb, int map_fd,
+		     ring_buffer_sample_fn sample_cb, void *ctx)
+{
+	struct bpf_map_info info;
+	__u32 len = sizeof(info);
+	struct epoll_event *e;
+	struct ring *r;
+	void *tmp;
+	int err;
+
+	memset(&info, 0, sizeof(info));
+
+	err = bpf_obj_get_info_by_fd(map_fd, &info, &len);
+	if (err) {
+		err = -errno;
+		pr_warn("ringbuf: failed to get map info for fd=%d: %d\n",
+			map_fd, err);
+		return err;
+	}
+
+	if (info.type != BPF_MAP_TYPE_RINGBUF) {
+		pr_warn("ringbuf: map fd=%d is not BPF_MAP_TYPE_RINGBUF\n",
+			map_fd);
+		return -EINVAL;
+	}
+
+	tmp = reallocarray(rb->rings, rb->ring_cnt + 1, sizeof(*rb->rings));
+	if (!tmp)
+		return -ENOMEM;
+	rb->rings = tmp;
+
+	tmp = reallocarray(rb->events, rb->ring_cnt + 1, sizeof(*rb->events));
+	if (!tmp)
+		return -ENOMEM;
+	rb->events = tmp;
+
+	r = &rb->rings[rb->ring_cnt];
+	memset(r, 0, sizeof(*r));
+
+	r->map_fd = map_fd;
+	r->sample_cb = sample_cb;
+	r->ctx = ctx;
+	r->mask = info.max_entries - 1;
+
+	/* Map writable consumer page */
+	tmp = mmap(NULL, rb->page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		   map_fd, 0);
+	if (tmp == MAP_FAILED) {
+		err = -errno;
+		pr_warn("ringbuf: failed to mmap consumer page for map fd=%d: %d\n",
+			map_fd, err);
+		return err;
+	}
+	r->consumer_pos = tmp;
+
+	/* Map read-only producer page and data pages. We map twice as big
+	 * data size to allow simple reading of samples that wrap around the
+	 * end of a ring buffer. See kernel implementation for details.
+	 */
+	tmp = mmap(NULL, rb->page_size + 2 * info.max_entries, PROT_READ,
+		   MAP_SHARED, map_fd, rb->page_size);
+	if (tmp == MAP_FAILED) {
+		err = -errno;
+		ringbuf_unmap_ring(rb, r);
+		pr_warn("ringbuf: failed to mmap data pages for map fd=%d: %d\n",
+			map_fd, err);
+		return err;
+	}
+	r->producer_pos = tmp;
+	r->data = tmp + rb->page_size;
+
+	e = &rb->events[rb->ring_cnt];
+	memset(e, 0, sizeof(*e));
+
+	e->events = EPOLLIN;
+	e->data.fd = rb->ring_cnt;
+	if (epoll_ctl(rb->epoll_fd, EPOLL_CTL_ADD, map_fd, e) < 0) {
+		err = -errno;
+		ringbuf_unmap_ring(rb, r);
+		pr_warn("ringbuf: failed to epoll add map fd=%d: %d\n",
+			map_fd, err);
+		return err;
+	}
+
+	rb->ring_cnt++;
+	return 0;
+}
+
+void ring_buffer__free(struct ring_buffer *rb)
+{
+	int i;
+
+	if (!rb)
+		return;
+
+	for (i = 0; i < rb->ring_cnt; ++i)
+		ringbuf_unmap_ring(rb, &rb->rings[i]);
+	if (rb->epoll_fd >= 0)
+		close(rb->epoll_fd);
+
+	free(rb->events);
+	free(rb->rings);
+	free(rb);
+}
+
+struct ring_buffer *
+ring_buffer__new(int map_fd, ring_buffer_sample_fn sample_cb, void *ctx,
+		 const struct ring_buffer_opts *opts)
+{
+	struct ring_buffer *rb;
+	int err;
+
+	if (!OPTS_VALID(opts, ring_buffer_opts))
+		return NULL;
+
+	rb = calloc(1, sizeof(*rb));
+	if (!rb)
+		return NULL;
+
+	rb->page_size = getpagesize();
+
+	rb->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+	if (rb->epoll_fd < 0) {
+		err = -errno;
+		pr_warn("ringbuf: failed to create epoll instance: %d\n", err);
+		goto err_out;
+	}
+
+	err = ring_buffer__add(rb, map_fd, sample_cb, ctx);
+	if (err)
+		goto err_out;
+
+	return rb;
+
+err_out:
+	ring_buffer__free(rb);
+	return NULL;
+}
+
+#define RINGBUF_BUSY_BIT (1U << 31)
+#define RINGBUF_DISCARD_BIT (1U << 30)
+#define RINGBUF_META_LEN 8
+
+static inline int roundup_len(__u32 len)
+{
+	/* clear out top 2 bits */
+	len <<= 2;
+	len >>= 2;
+	/* add length prefix */
+	len += RINGBUF_META_LEN;
+	/* round up to 8 byte alignment */
+	return (len + 7) / 8 * 8;
+}
+
+static int ringbuf_process_ring(struct ring *r)
+{
+	__u64 cons_pos, prod_pos;
+	int *len_ptr, len, err;
+	bool got_new_data;
+	void *sample;
+
+	cons_pos = *r->consumer_pos;
+	do {
+		got_new_data = false;
+		prod_pos = smp_load_acquire(r->producer_pos);
+		while (cons_pos < prod_pos) {
+			len_ptr = r->data + (cons_pos & r->mask);
+			len = smp_load_acquire(len_ptr);
+
+			/* sample not committed yet, bail out for now */
+			if (len & RINGBUF_BUSY_BIT)
+				goto done;
+
+			got_new_data = true;
+			cons_pos += roundup_len(len);
+
+			if ((len & RINGBUF_DISCARD_BIT) == 0) {
+				sample = (void *)len_ptr + RINGBUF_META_LEN;
+				err = r->sample_cb(r->ctx, sample, len);
+				if (err) {
+					/* update consumer pos and bail out */
+					smp_store_release(r->consumer_pos,
+							  cons_pos);
+					return err;
+				}
+			}
+
+			smp_store_release(r->consumer_pos, cons_pos);
+		}
+	} while (got_new_data);
+done:
+	return 0;
+}
+
+int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms)
+{
+	int i, cnt, err;
+
+	cnt = epoll_wait(rb->epoll_fd, rb->events, rb->ring_cnt, timeout_ms);
+	for (i = 0; i < cnt; i++) {
+		__u32 ring_id = rb->events[i].data.fd;
+		struct ring *ring = &rb->rings[ring_id];
+
+		err = ringbuf_process_ring(ring);
+		if (err)
+			return err;
+	}
+	return cnt < 0 ? -errno : cnt;
+}
-- 
2.24.1



* [PATCH bpf-next 5/6] selftests/bpf: add BPF ringbuf selftests
  2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2020-05-13 19:25 ` [PATCH bpf-next 4/6] libbpf: add BPF ring buffer support Andrii Nakryiko
@ 2020-05-13 19:25 ` Andrii Nakryiko
  2020-05-13 19:25 ` [PATCH bpf-next 6/6] bpf: add BPF ringbuf and perf buffer benchmarks Andrii Nakryiko
  2020-05-13 22:49 ` [PATCH bpf-next 0/6] BPF ring buffer Jonathan Lemon
  6 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-13 19:25 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Paul E . McKenney,
	Jonathan Lemon

Both the singleton BPF ringbuf and the BPF ringbuf with map-in-map use cases
are tested. The reserve+submit/discard and output variants of the API are
validated as well.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 .../selftests/bpf/prog_tests/ringbuf.c        | 101 +++++++++++++++++
 .../selftests/bpf/prog_tests/ringbuf_multi.c  | 102 ++++++++++++++++++
 .../selftests/bpf/progs/test_ringbuf.c        |  63 +++++++++++
 .../selftests/bpf/progs/test_ringbuf_multi.c  |  77 +++++++++++++
 4 files changed, 343 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_multi.c

diff --git a/tools/testing/selftests/bpf/prog_tests/ringbuf.c b/tools/testing/selftests/bpf/prog_tests/ringbuf.c
new file mode 100644
index 000000000000..2708cc791f1a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/ringbuf.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <linux/compiler.h>
+#include <asm/barrier.h>
+#include <test_progs.h>
+#include <sys/mman.h>
+#include <sys/epoll.h>
+#include <time.h>
+#include <sched.h>
+#include <pthread.h>
+#include <sys/sysinfo.h>
+#include <linux/perf_event.h>
+#include <linux/ring_buffer.h>
+#include "test_ringbuf.skel.h"
+
+#define EDONE 7777
+
+static int duration = 0;
+
+struct sample {
+	int pid;
+	int seq;
+	long value;
+	char comm[16];
+};
+
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	struct sample *s = data;
+
+	switch (s->seq) {
+	case 0:
+		CHECK(s->value != 333, "sample1_value", "exp %ld, got %ld\n",
+		      333L, s->value);
+		break;
+	case 1:
+		CHECK(s->value != 777, "sample2_value", "exp %ld, got %ld\n",
+		      777L, s->value);
+		return -EDONE;
+	default:
+		CHECK(true, "extra_sample", "unexpected sample\n");
+	}
+
+	return 0;
+}
+
+void test_ringbuf(void)
+{
+	struct test_ringbuf *skel;
+	struct ring_buffer *ringbuf;
+	int err;
+
+	skel = test_ringbuf__open_and_load();
+	if (CHECK(!skel, "skel_open_load", "skeleton open&load failed\n"))
+		return;
+
+	/* only trigger BPF program for current process */
+	skel->bss->pid = getpid();
+
+	ringbuf = ring_buffer__new(bpf_map__fd(skel->maps.ringbuf),
+				   process_sample, NULL, NULL);
+	if (CHECK(!ringbuf, "ringbuf_create", "failed to create ringbuf\n"))
+		goto cleanup;
+
+	err = test_ringbuf__attach(skel);
+	if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err))
+		goto cleanup;
+
+	/* trigger exactly two samples */
+	skel->bss->value = 333;
+	syscall(__NR_getpgid);
+	skel->bss->value = 777;
+	syscall(__NR_getpgid);
+
+	/* poll for samples */
+	do {
+		err = ring_buffer__poll(ringbuf, -1);
+	} while (err >= 0);
+
+	/* -EDONE is used as an indicator that we are done */
+	if (CHECK(err != -EDONE, "err_done", "done err: %d\n", err))
+		goto cleanup;
+
+	/* we expect extra polling to return nothing */
+	err = ring_buffer__poll(ringbuf, 0);
+	if (CHECK(err < 0, "extra_samples", "poll result: %d\n", err))
+		goto cleanup;
+
+	CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n",
+	      0L, skel->bss->dropped);
+	CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
+	      2L, skel->bss->total);
+	CHECK(skel->bss->discarded != 1, "err_discarded", "exp %ld, got %ld\n",
+	      1L, skel->bss->discarded);
+
+	test_ringbuf__detach(skel);
+
+cleanup:
+	ring_buffer__free(ringbuf);
+	test_ringbuf__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c b/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c
new file mode 100644
index 000000000000..f352f556bd34
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <test_progs.h>
+#include <sys/epoll.h>
+#include "test_ringbuf_multi.skel.h"
+
+static int duration = 0;
+
+struct sample {
+	int pid;
+	int seq;
+	long value;
+	char comm[16];
+};
+
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	int ring = (unsigned long)ctx;
+	struct sample *s = data;
+
+	switch (s->seq) {
+	case 0:
+		CHECK(ring != 1, "sample1_ring", "exp %d, got %d\n", 1, ring);
+		CHECK(s->value != 333, "sample1_value", "exp %ld, got %ld\n",
+		      333L, s->value);
+		break;
+	case 1:
+		CHECK(ring != 2, "sample2_ring", "exp %d, got %d\n", 2, ring);
+		CHECK(s->value != 777, "sample2_value", "exp %ld, got %ld\n",
+		      777L, s->value);
+		break;
+	default:
+		CHECK(true, "extra_sample", "unexpected sample seq %d, val %ld\n",
+		      s->seq, s->value);
+		return -1;
+	}
+
+	return 0;
+}
+
+void test_ringbuf_multi(void)
+{
+	struct test_ringbuf_multi *skel;
+	struct ring_buffer *ringbuf;
+	int err;
+
+	skel = test_ringbuf_multi__open_and_load();
+	if (CHECK(!skel, "skel_open_load", "skeleton open&load failed\n"))
+		return;
+
+	/* only trigger BPF program for current process */
+	skel->bss->pid = getpid();
+
+	ringbuf = ring_buffer__new(bpf_map__fd(skel->maps.ringbuf1),
+				   process_sample, (void *)(long)1, NULL);
+	if (CHECK(!ringbuf, "ringbuf_create", "failed to create ringbuf\n"))
+		goto cleanup;
+
+	err = ring_buffer__add(ringbuf, bpf_map__fd(skel->maps.ringbuf2),
+			      process_sample, (void *)(long)2);
+	if (CHECK(err, "ringbuf_add", "failed to add another ring\n"))
+		goto cleanup;
+
+	err = test_ringbuf_multi__attach(skel);
+	if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err))
+		goto cleanup;
+
+	/* trigger few samples, some will be skipped */
+	skel->bss->target_ring = 0;
+	skel->bss->value = 333;
+	syscall(__NR_getpgid);
+
+	/* skipped, no ringbuf in slot 1 */
+	skel->bss->target_ring = 1;
+	skel->bss->value = 555;
+	syscall(__NR_getpgid);
+
+	skel->bss->target_ring = 2;
+	skel->bss->value = 777;
+	syscall(__NR_getpgid);
+
+	/* poll for samples, should get 2 ringbufs back */
+	err = ring_buffer__poll(ringbuf, -1);
+	if (CHECK(err != 2, "poll_res", "expected 2 events, got %d\n", err))
+		goto cleanup;
+
+	/* expect extra polling to return nothing */
+	err = ring_buffer__poll(ringbuf, 0);
+	if (CHECK(err < 0, "extra_samples", "poll result: %d\n", err))
+		goto cleanup;
+
+	CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n",
+	      0L, skel->bss->dropped);
+	CHECK(skel->bss->skipped != 1, "err_skipped", "exp %ld, got %ld\n",
+	      1L, skel->bss->skipped);
+	CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
+	      2L, skel->bss->total);
+
+cleanup:
+	ring_buffer__free(ringbuf);
+	test_ringbuf_multi__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_ringbuf.c b/tools/testing/selftests/bpf/progs/test_ringbuf.c
new file mode 100644
index 000000000000..6084f17e17d8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_ringbuf.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct sample {
+	int pid;
+	int seq;
+	long value;
+	char comm[16];
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 1 << 12);
+} ringbuf SEC(".maps");
+
+/* inputs */
+int pid = 0;
+long value = 0;
+
+/* outputs */
+long total = 0;
+long discarded = 0;
+long dropped = 0;
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int test_ringbuf(void *ctx)
+{
+	int cur_pid = bpf_get_current_pid_tgid() >> 32;
+	struct sample *sample;
+	int zero = 0;
+
+	if (cur_pid != pid)
+		return 0;
+
+	sample = bpf_ringbuf_reserve(&ringbuf, sizeof(*sample), 0);
+	if (!sample) {
+		__sync_fetch_and_add(&dropped, 1);
+		return 1;
+	}
+
+	sample->pid = pid;
+	bpf_get_current_comm(sample->comm, sizeof(sample->comm));
+	sample->value = value;
+
+	sample->seq = total;
+	__sync_fetch_and_add(&total, 1);
+
+	if (sample->seq & 1) {
+		/* copy from reserved sample to a new one... */
+		bpf_ringbuf_output(&ringbuf, sample, sizeof(*sample), 0);
+		/* ...and then discard reserved sample */
+		bpf_ringbuf_discard(sample);
+		__sync_fetch_and_add(&discarded, 1);
+	} else {
+		bpf_ringbuf_submit(sample);
+	}
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_ringbuf_multi.c b/tools/testing/selftests/bpf/progs/test_ringbuf_multi.c
new file mode 100644
index 000000000000..b45291e77aa2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_ringbuf_multi.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct sample {
+	int pid;
+	int seq;
+	long value;
+	char comm[16];
+};
+
+struct ringbuf_map {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 1 << 12);
+} ringbuf1 SEC(".maps"),
+  ringbuf2 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__uint(max_entries, 4);
+	__type(key, int);
+	__array(values, struct ringbuf_map);
+} ringbuf_arr SEC(".maps") = {
+	.values = {
+		[0] = &ringbuf1,
+		[2] = &ringbuf2,
+	},
+};
+
+/* inputs */
+int pid = 0;
+int target_ring = 0;
+long value = 0;
+
+/* outputs */
+long total = 0;
+long dropped = 0;
+long skipped = 0;
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int test_ringbuf(void *ctx)
+{
+	int cur_pid = bpf_get_current_pid_tgid() >> 32;
+	struct sample *sample;
+	void *rb;
+	int zero = 0;
+
+	if (cur_pid != pid)
+		return 0;
+
+	rb = bpf_map_lookup_elem(&ringbuf_arr, &target_ring);
+	if (!rb) {
+		skipped += 1;
+		return 1;
+	}
+
+	sample = bpf_ringbuf_reserve(rb, sizeof(*sample), 0);
+	if (!sample) {
+		dropped += 1;
+		return 1;
+	}
+
+	sample->pid = pid;
+	bpf_get_current_comm(sample->comm, sizeof(sample->comm));
+	sample->value = value;
+
+	sample->seq = total;
+	total += 1;
+
+	bpf_ringbuf_submit(sample);
+
+	return 0;
+}
-- 
2.24.1



* [PATCH bpf-next 6/6] bpf: add BPF ringbuf and perf buffer benchmarks
  2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2020-05-13 19:25 ` [PATCH bpf-next 5/6] selftests/bpf: add BPF ringbuf selftests Andrii Nakryiko
@ 2020-05-13 19:25 ` Andrii Nakryiko
  2020-05-13 22:49 ` [PATCH bpf-next 0/6] BPF ring buffer Jonathan Lemon
  6 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-13 19:25 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Paul E . McKenney,
	Jonathan Lemon

Extend bench framework with ability to have benchmark-provided child argument
parser for custom benchmark-specific parameters. This makes bench generic code
modular and independent from any specific benchmark.

Also implement a set of benchmarks for new BPF ring buffer and existing perf
buffer. 5 benchmarks were implemented: 2 variations for BPF ringbuf,
3 variations for perfbuf:
  - rb-libbpf utilizes stock libbpf ring_buffer manager for reading data;
  - rb-custom implements custom ring buffer setup and reading code, to
    eliminate overheads inherent in generic libbpf code due to callback
    functions and the need to update consumer position after each consumed
    record, instead of batching updates (due to pessimistic assumption that
    user callback might take long time and thus could unnecessarily hold ring
    buffer space for too long);
  - pb-libbpf uses stock libbpf perf_buffer code with all the default
    settings; default settings are good safe defaults, but are far from
    providing highest throughput -- perf buffer allows specifying
    wakeup_events and sample_period > 1, which causes perf code to trigger
    epoll notifications less frequently, boosting throughput immensely;
  - pb-raw uses perf_buffer__new_raw() API to specify custom wakeup_events and
    sample_period and specifies raw sample callback, which removes some of the
    overhead in generic case;
  - pb-custom does the same setup as pb-raw, but implements custom consumer
    code eliminating all the callback overhead.
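
To make the rb-custom point about batching consumer position updates more
concrete, here is a minimal single-threaded sketch. It is not the benchmark's
consumer: drain(), struct drain_result and the batch flag are names made up
for illustration, and memory barriers are deliberately left out, so it only
models the bookkeeping. The 8-byte header and the busy/discard bits mirror the
constants used in bench_ringbufs.c.

```c
#include <stdint.h>
#include <string.h>

#define BUSY_BIT    (1u << 31)	/* record reserved but not yet committed */
#define DISCARD_BIT (1u << 30)	/* record committed as discarded */
#define HDR_SZ      8u		/* per-record header size */

struct drain_result {
	uint64_t cons_pos;	/* consumer position after draining */
	uint32_t records;	/* committed records walked over */
	uint32_t pos_updates;	/* consumer position stores issued */
};

/*
 * Hypothetical helper: walk records in [cons_pos, prod_pos) and stop at the
 * first record that is still busy. With batch == 0 the position is stored
 * after every record, with batch != 0 only once at the end.
 */
static struct drain_result drain(const uint8_t *data, uint64_t mask,
				 uint64_t cons_pos, uint64_t prod_pos,
				 int batch)
{
	struct drain_result r = { cons_pos, 0, 0 };

	while (r.cons_pos < prod_pos) {
		uint32_t hdr, len;

		memcpy(&hdr, data + (r.cons_pos & mask), sizeof(hdr));
		if (hdr & BUSY_BIT)
			break;

		/* strip flag bits, add header, round up to 8 bytes */
		len = hdr & ~(BUSY_BIT | DISCARD_BIT);
		r.cons_pos += (len + HDR_SZ + 7u) / 8u * 8u;
		r.records++;
		if (!batch)
			r.pos_updates++;
	}
	if (batch && r.records)
		r.pos_updates = 1;
	return r;
}
```

In this model, a batch of N committed records costs N position stores without
batching and a single store with it.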

Otherwise, all benchmarks generate batches of records the same way: an fentry
BPF program attached to sys_getpgid pushes a bunch of records in a tight loop
and records the number of successful and dropped samples. Each record is
a small 8-byte integer, to minimize the effect of memory copying with
bpf_perf_event_output() and bpf_ringbuf_output().
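
As a rough illustration of what an 8-byte record costs in ring buffer space,
the sketch below applies the same header-plus-alignment rule that the custom
consumer in bench_ringbufs.c uses (8-byte header, 8-byte alignment). The
helper names are hypothetical and exist only for this example.

```c
#include <stdint.h>

#define RINGBUF_HDR_SZ 8u	/* per-record header, as RINGBUF_META_LEN */

/* Bytes one record occupies: header plus payload, rounded up to 8 bytes. */
static inline uint32_t ringbuf_record_footprint(uint32_t payload_len)
{
	return (payload_len + RINGBUF_HDR_SZ + 7u) / 8u * 8u;
}

/* How many records of the given payload size fit into ring_sz bytes. */
static inline uint32_t ringbuf_capacity(uint32_t ring_sz, uint32_t payload_len)
{
	return ring_sz / ringbuf_record_footprint(payload_len);
}
```

Under these assumptions an 8-byte sample takes 16 bytes, so the default 512KB
ring buffer used by the benchmark holds 32768 outstanding records.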

Benchmarks that have only one producer implement an optional back-to-back
mode, in which record production and consumption alternate on the same CPU.
This is the highest-throughput happy case, showing the ultimate performance
achievable with either BPF ringbuf or perfbuf.

All the below scenarios are implemented in a script in
benchs/run_bench_ringbufs.sh. Tests were performed on 28-core/56-thread
Intel Xeon CPU E5-2680 v4 @ 2.40GHz CPU.

Single-producer, parallel producer
==================================
rb-libbpf            11.302 ± 0.564M/s (drops 0.000 ± 0.000M/s)
rb-custom            8.895 ± 0.148M/s (drops 0.000 ± 0.000M/s)
pb-libbpf            0.927 ± 0.008M/s (drops 0.000 ± 0.000M/s)
pb-raw               9.884 ± 0.058M/s (drops 0.000 ± 0.000M/s)
pb-custom            9.724 ± 0.068M/s (drops 0.000 ± 0.001M/s)

Single producer on one CPU, consumer on another one, both running at full
speed. Curiously, rb-libbpf has higher throughput than rb-custom, even though
the latter has a more lightweight consumer code path and is objectively
faster. It appears that a faster consumer causes the kernel to send
notifications more frequently, because the consumer appears to be caught up
more often.

Stock perfbuf settings are about 10x slower, compared to pb-raw/pb-custom with
wakeup_events/sample_period set to 500. The trade-off is that with sampling,
application might not get next X events until X+1st arrives, which is not
always acceptable.
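
The trade-off can be illustrated with a deliberately simplified model, which
is an assumption of this example and not how the perf subsystem is
implemented: a wakeup fires on every Nth record, and whatever has been
produced since the last wakeup sits in the buffer unannounced. Both helper
names are hypothetical.

```c
#include <stdint.h>

/* Wakeups delivered after `events` records, one per `period` records. */
static inline uint64_t wakeups_sent(uint64_t events, uint64_t period)
{
	return events / period;
}

/* Records produced since the last wakeup, not yet announced to user space. */
static inline uint64_t records_awaiting_wakeup(uint64_t events, uint64_t period)
{
	return events % period;
}
```

With a period of 500, a burst of 1200 records results in 2 wakeups and leaves
200 records waiting until more data arrives; with a period of 1 nothing is
ever left waiting, at the cost of one wakeup per record.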

Single-producer, back-to-back mode
==================================
rb-libbpf            15.991 ± 0.456M/s (drops 0.000 ± 0.000M/s)
rb-custom            21.646 ± 0.457M/s (drops 0.000 ± 0.000M/s)
pb-libbpf            1.608 ± 0.040M/s (drops 0.000 ± 0.000M/s)
pb-raw               8.866 ± 0.098M/s (drops 0.000 ± 0.000M/s)
pb-custom            8.858 ± 0.137M/s (drops 0.000 ± 0.000M/s)

Here we test a back-to-back mode, which is arguably best-case scenario both
for BPF ringbuf and perfbuf, because there is no contention and for ringbuf
also no excessive notification, because consumer appears to be behind after
the first record. For ringbuf, custom consumer code clearly wins with 22 vs 16
million records per second exchanged between producer and consumer.

Perfbuf with wakeup sampling gets 5.5x throughput increase, compared to
no-sampling version. There also doesn't seem to be noticeable overhead from
generic libbpf handling code.

Perfbuf, effect of sample rate, back-to-back
============================================
sample rate 1        1.634 ± 0.023M/s (drops 0.000 ± 0.000M/s)
sample rate 5        4.510 ± 0.082M/s (drops 0.000 ± 0.000M/s)
sample rate 10       5.875 ± 0.085M/s (drops 0.000 ± 0.000M/s)
sample rate 25       7.361 ± 0.150M/s (drops 0.000 ± 0.000M/s)
sample rate 50       7.574 ± 0.559M/s (drops 0.000 ± 0.000M/s)
sample rate 100      8.486 ± 0.274M/s (drops 0.000 ± 0.000M/s)
sample rate 250      8.787 ± 0.183M/s (drops 0.000 ± 0.000M/s)
sample rate 500      8.494 ± 0.732M/s (drops 0.000 ± 0.000M/s)
sample rate 1000     8.445 ± 0.336M/s (drops 0.000 ± 0.000M/s)
sample rate 2000     9.000 ± 0.187M/s (drops 0.000 ± 0.000M/s)
sample rate 3000     8.684 ± 0.556M/s (drops 0.000 ± 0.000M/s)

This benchmark shows the effect of event sampling for perfbuf, using
back-to-back mode for highest throughput. Notifying on just every 5th record
gives a 2.8x speed up. 250-500 appears to be the point of diminishing returns,
with a 5.2-5.4x speed up. Most benchmarks use 500 as the default sampling for
pb-raw and pb-custom.
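
One way to locate the point of diminishing returns in a table like the one
above is to look for the first setting that gets within some tolerance of the
best observed throughput. The helper below is only a sketch of that idea; the
name knee_index() and the 5% tolerance used in the note after it are
assumptions of this example, not something the benchmark computes.

```c
#include <stddef.h>

/*
 * Index of the first entry whose throughput is within `tolerance`
 * (0.05 means 5%) of the best throughput in the table. n must be > 0.
 */
static size_t knee_index(const double *mps, size_t n, double tolerance)
{
	double best = mps[0];
	size_t i;

	for (i = 1; i < n; i++)
		if (mps[i] > best)
			best = mps[i];

	for (i = 0; i < n; i++)
		if (mps[i] >= best * (1.0 - tolerance))
			return i;

	return n - 1;	/* not reached for tolerance >= 0 */
}
```

Fed with the throughput column above and a 5% tolerance, this picks the
"sample rate 250" row, which agrees with the 250-500 range.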

Ringbuf, reserve+commit vs output, back-to-back
===============================================
reserve              22.570 ± 0.490M/s (drops 0.000 ± 0.000M/s)
output               19.037 ± 0.429M/s (drops 0.000 ± 0.000M/s)

BPF ringbuf supports two sets of APIs with different usability and performance
tradeoffs: bpf_ringbuf_reserve()+bpf_ringbuf_submit() vs bpf_ringbuf_output().
This benchmark clearly shows the superiority of the reserve+submit approach,
despite using a small 8-byte record size.

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            1.485 ± 0.015M/s (drops 1.969 ± 0.058M/s)
rb-custom            1.489 ± 0.017M/s (drops 1.910 ± 0.098M/s)
pb-libbpf            1.242 ± 0.042M/s (drops 0.000 ± 0.000M/s)
pb-raw               1.217 ± 0.043M/s (drops 0.000 ± 0.000M/s)
pb-custom            1.264 ± 0.023M/s (drops 0.000 ± 0.000M/s)

This benchmark shows one of the worst-case scenarios, in which producer and
consumer do not coordinate *and* fight for the same CPU. No batch count or
sampling settings were able to eliminate drops for ringbuf: the producer is
just too fast for the consumer to keep up. Still, both ringbuf and perfbuf are
able to pass through more than a million messages per second, which is more
than enough for a lot of applications.

Ringbuf, multi-producer contention, low batch count
===================================================
rb-libbpf nr_prod 1  9.559 ± 0.134M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  6.542 ± 0.039M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  4.444 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  3.764 ± 0.021M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  4.147 ± 0.013M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 3.652 ± 0.185M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 2.338 ± 0.020M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.055 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 1.961 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 2.136 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 2.069 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 2.177 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 2.323 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 2.274 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 2.069 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 2.041 ± 0.004M/s (drops 0.000 ± 0.000M/s)

Ringbuf uses a very short-duration spinlock during the reservation phase, to
check a few invariants, increment producer count and set record header. This
is the biggest point of contention for the ringbuf implementation. This
benchmark evaluates the effect of multiple competing writers on overall
throughput of a single shared ringbuffer.

Overall throughput drops by about 30% when going from a single producer to two
highly-contended producers, losing another 30% with a third producer added.
Performance drop stabilizes at around 16 producers and hovers around 2 million
records per second even with 50+ fighting producers, which is a 4.75x drop in
throughput and a testament to a good implementation of spinlock in the kernel.

Note that in the intended real-world scenarios, it's not expected to get even
close to such high levels of contention. But if contention becomes a problem,
there is always the option of sharding a few ring buffers across a set of
CPUs.
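
A sharding scheme could look roughly like the sketch below. It is plain C
meant to show only the index arithmetic; the helper names are hypothetical. In
a BPF program the CPU number would presumably come from
bpf_get_smp_processor_id() and the shard would be looked up in an array of
maps similar to ringbuf_arr in test_ringbuf_multi.c, which is not shown here.

```c
/* Pick a ring buffer shard for a CPU; CPUs are spread round-robin. */
static inline unsigned int shard_for_cpu(unsigned int cpu,
					 unsigned int nr_shards)
{
	return cpu % nr_shards;
}

/* Worst-case number of CPUs (producers) contending on a single shard. */
static inline unsigned int max_producers_per_shard(unsigned int nr_cpus,
						   unsigned int nr_shards)
{
	return (nr_cpus + nr_shards - 1) / nr_shards;
}
```

On the 56-thread machine used for these benchmarks, 4 shards would cap
contention at 14 producers per ring buffer under this scheme.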

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |  18 +
 .../selftests/bpf/benchs/bench_ringbufs.c     | 593 ++++++++++++++++++
 .../bpf/benchs/run_bench_ringbufs.sh          |  61 ++
 .../selftests/bpf/progs/perfbuf_bench.c       |  33 +
 .../selftests/bpf/progs/ringbuf_bench.c       |  45 ++
 6 files changed, 754 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_ringbufs.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
 create mode 100644 tools/testing/selftests/bpf/progs/perfbuf_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/ringbuf_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 352d17a16bae..a95ac4d691d2 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -412,12 +412,15 @@ $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
 	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
 $(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
 $(OUTPUT)/bench_trigger.o: $(OUTPUT)/trigger_bench.skel.h
+$(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
+			    $(OUTPUT)/perfbuf_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o $(OUTPUT)/testing_helpers.o \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
-		 $(OUTPUT)/bench_trigger.o
+		 $(OUTPUT)/bench_trigger.o \
+		 $(OUTPUT)/bench_ringbufs.o
 	$(call msg,BINARY,,$@)
 	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 8c0dfbfe6088..a18f93804da7 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -130,6 +130,13 @@ static const struct argp_option opts[] = {
 	{},
 };
 
+extern struct argp bench_ringbufs_argp;
+
+static const struct argp_child bench_parsers[] = {
+	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
+	{},
+};
+
 static error_t parse_arg(int key, char *arg, struct argp_state *state)
 {
 	static int pos_args;
@@ -208,6 +215,7 @@ static void parse_cmdline_args(int argc, char **argv)
 		.options = opts,
 		.parser = parse_arg,
 		.doc = argp_program_doc,
+		.children = bench_parsers,
 	};
 	if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
 		exit(1);
@@ -310,6 +318,11 @@ extern const struct bench bench_trig_rawtp;
 extern const struct bench bench_trig_kprobe;
 extern const struct bench bench_trig_fentry;
 extern const struct bench bench_trig_fmodret;
+extern const struct bench bench_rb_libbpf;
+extern const struct bench bench_rb_custom;
+extern const struct bench bench_pb_libbpf;
+extern const struct bench bench_pb_raw;
+extern const struct bench bench_pb_custom;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -327,6 +340,11 @@ static const struct bench *benchs[] = {
 	&bench_trig_kprobe,
 	&bench_trig_fentry,
 	&bench_trig_fmodret,
+	&bench_rb_libbpf,
+	&bench_rb_custom,
+	&bench_pb_libbpf,
+	&bench_pb_raw,
+	&bench_pb_custom,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
new file mode 100644
index 000000000000..01017105ac08
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
@@ -0,0 +1,593 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include <asm/barrier.h>
+#include <linux/perf_event.h>
+#include <linux/ring_buffer.h>
+#include <sys/epoll.h>
+#include <sys/mman.h>
+#include <argp.h>
+#include <stdlib.h>
+#include "bench.h"
+#include "ringbuf_bench.skel.h"
+#include "perfbuf_bench.skel.h"
+
+static struct {
+	bool back2back;
+	int batch_cnt;
+	int ringbuf_sz; /* per-ringbuf, in bytes */
+	bool ringbuf_use_reserve; /* use reserve/submit or output API */
+	int perfbuf_sz; /* per-CPU size, in pages */
+	int perfbuf_sample_rate;
+} args = {
+	.back2back = false,
+	.batch_cnt = 500,
+	.ringbuf_sz = 512 * 1024,
+	.ringbuf_use_reserve = true,
+	.perfbuf_sz = 128,
+	.perfbuf_sample_rate = 500,
+};
+
+enum {
+	ARG_RB_BACK2BACK = 2000,
+	ARG_RB_USE_OUTPUT = 2001,
+	ARG_RB_BATCH_CNT = 2002,
+	ARG_RB_SAMPLE_RATE = 2003,
+};
+
+static const struct argp_option opts[] = {
+	{ "rb-b2b", ARG_RB_BACK2BACK, NULL, 0, "Back-to-back mode"},
+	{ "rb-use-output", ARG_RB_USE_OUTPUT, NULL, 0, "Use bpf_ringbuf_output() instead of bpf_ringbuf_reserve()"},
+	{ "rb-batch-cnt", ARG_RB_BATCH_CNT, "CNT", 0, "Set BPF-side record batch count"},
+	{ "rb-sample-rate", ARG_RB_SAMPLE_RATE, "RATE", 0, "Perf buf's sample rate"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_RB_BACK2BACK:
+		args.back2back = true;
+		break;
+	case ARG_RB_USE_OUTPUT:
+		args.ringbuf_use_reserve = false;
+		break;
+	case ARG_RB_BATCH_CNT:
+		args.batch_cnt = strtol(arg, NULL, 10);
+		if (args.batch_cnt < 0) {
+			fprintf(stderr, "Invalid batch count.\n");
+			argp_usage(state);
+		}
+		break;
+	case ARG_RB_SAMPLE_RATE:
+		args.perfbuf_sample_rate = strtol(arg, NULL, 10);
+		if (args.perfbuf_sample_rate < 0) {
+			fprintf(stderr, "Invalid perfbuf sample rate.\n");
+			argp_usage(state);
+		}
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+/* exported into benchmark runner */
+const struct argp bench_ringbufs_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+/* RINGBUF-LIBBPF benchmark */
+
+static struct counter buf_hits;
+
+static inline void bufs_trigger_batch()
+{
+	(void)syscall(__NR_getpgid);
+}
+
+static void bufs_validate()
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "rb-libbpf benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (args.back2back && env.producer_cnt > 1) {
+		fprintf(stderr, "back-to-back mode makes sense only for single-producer case!\n");
+		exit(1);
+	}
+}
+
+static void *bufs_sample_producer(void *input)
+{
+	if (args.back2back) {
+		/* initial batch to get everything started */
+		bufs_trigger_batch();
+		return NULL;
+	}
+
+	while (true)
+		bufs_trigger_batch();
+	return NULL;
+}
+
+static struct ringbuf_libbpf_ctx {
+	struct ringbuf_bench *skel;
+	struct ring_buffer *ringbuf;
+} ringbuf_libbpf_ctx;
+
+static void ringbuf_libbpf_measure(struct bench_res *res)
+{
+	struct ringbuf_libbpf_ctx *ctx = &ringbuf_libbpf_ctx;
+
+	res->hits = atomic_swap(&buf_hits.value, 0);
+	res->drops = atomic_swap(&ctx->skel->bss->dropped, 0);
+}
+
+static struct ringbuf_bench *ringbuf_setup_skeleton()
+{
+	struct ringbuf_bench *skel;
+
+	setup_libbpf();
+
+	skel = ringbuf_bench__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	skel->rodata->batch_cnt = args.batch_cnt;
+	skel->rodata->use_reserve = args.ringbuf_use_reserve ? 1 : 0;
+
+	bpf_map__resize(skel->maps.ringbuf, args.ringbuf_sz);
+
+	if (ringbuf_bench__load(skel)) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	return skel;
+}
+
+static int buf_process_sample(void *ctx, void *data, size_t len)
+{
+	atomic_inc(&buf_hits.value);
+	return 0;
+}
+
+static void ringbuf_libbpf_setup()
+{
+	struct ringbuf_libbpf_ctx *ctx = &ringbuf_libbpf_ctx;
+	struct bpf_link *link;
+
+	ctx->skel = ringbuf_setup_skeleton();
+	ctx->ringbuf = ring_buffer__new(bpf_map__fd(ctx->skel->maps.ringbuf),
+					buf_process_sample, NULL, NULL);
+	if (!ctx->ringbuf) {
+		fprintf(stderr, "failed to create ringbuf\n");
+		exit(1);
+	}
+
+	link = bpf_program__attach(ctx->skel->progs.bench_ringbuf);
+	if (IS_ERR(link)) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void *ringbuf_libbpf_consumer(void *input)
+{
+	struct ringbuf_libbpf_ctx *ctx = &ringbuf_libbpf_ctx;
+
+	while (ring_buffer__poll(ctx->ringbuf, -1) >= 0) {
+		if (args.back2back)
+			bufs_trigger_batch();
+	}
+	fprintf(stderr, "ringbuf polling failed!\n");
+	return NULL;
+}
+
+/* RINGBUF-CUSTOM benchmark */
+struct ringbuf_custom {
+	__u64 *consumer_pos;
+	__u64 *producer_pos;
+	__u64 mask;
+	void *data;
+	int map_fd;
+};
+
+static struct ringbuf_custom_ctx {
+	struct ringbuf_bench *skel;
+	struct ringbuf_custom ringbuf;
+	int epoll_fd;
+	struct epoll_event event;
+} ringbuf_custom_ctx;
+
+static void ringbuf_custom_measure(struct bench_res *res)
+{
+	struct ringbuf_custom_ctx *ctx = &ringbuf_custom_ctx;
+
+	res->hits = atomic_swap(&buf_hits.value, 0);
+	res->drops = atomic_swap(&ctx->skel->bss->dropped, 0);
+}
+
+static void ringbuf_custom_setup()
+{
+	struct ringbuf_custom_ctx *ctx = &ringbuf_custom_ctx;
+	const size_t page_size = getpagesize();
+	struct bpf_link *link;
+	struct ringbuf_custom *r;
+	void *tmp;
+	int err;
+
+	ctx->skel = ringbuf_setup_skeleton();
+
+	ctx->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+	if (ctx->epoll_fd < 0) {
+		fprintf(stderr, "failed to create epoll fd: %d\n", -errno);
+		exit(1);
+	}
+
+	r = &ctx->ringbuf;
+	r->map_fd = bpf_map__fd(ctx->skel->maps.ringbuf);
+	r->mask = args.ringbuf_sz - 1;
+
+	/* Map writable consumer page */
+	tmp = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		   r->map_fd, 0);
+	if (tmp == MAP_FAILED) {
+		fprintf(stderr, "failed to mmap consumer page: %d\n", -errno);
+		exit(1);
+	}
+	r->consumer_pos = tmp;
+
+	/* Map read-only producer page and data pages. */
+	tmp = mmap(NULL, page_size + 2 * args.ringbuf_sz, PROT_READ, MAP_SHARED,
+		   r->map_fd, page_size);
+	if (tmp == MAP_FAILED) {
+		fprintf(stderr, "failed to mmap data pages: %d\n", -errno);
+		exit(1);
+	}
+	r->producer_pos = tmp;
+	r->data = tmp + page_size;
+
+	ctx->event.events = EPOLLIN;
+	err = epoll_ctl(ctx->epoll_fd, EPOLL_CTL_ADD, r->map_fd, &ctx->event);
+	if (err < 0) {
+		fprintf(stderr, "failed to epoll add ringbuf: %d\n", -errno);
+		exit(1);
+	}
+
+	link = bpf_program__attach(ctx->skel->progs.bench_ringbuf);
+	if (IS_ERR(link)) {
+		fprintf(stderr, "failed to attach program\n");
+		exit(1);
+	}
+}
+
+#define RINGBUF_BUSY_BIT (1 << 31)
+#define RINGBUF_DISCARD_BIT (1 << 30)
+#define RINGBUF_META_LEN 8
+
+static inline int roundup_len(__u32 len)
+{
+	/* clear out top 2 bits */
+	len <<= 2;
+	len >>= 2;
+	/* add length prefix */
+	len += RINGBUF_META_LEN;
+	/* round up to 8 byte alignment */
+	return (len + 7) / 8 * 8;
+}
+
+static void ringbuf_custom_process_ring(struct ringbuf_custom *r)
+{
+	__u64 cons_pos, prod_pos;
+	int *len_ptr, len;
+	bool got_new_data;
+
+	cons_pos = *r->consumer_pos;
+	while (true) {
+		got_new_data = false;
+		prod_pos = smp_load_acquire(r->producer_pos);
+		while (cons_pos < prod_pos) {
+			len_ptr = r->data + (cons_pos & r->mask);
+			len = smp_load_acquire(len_ptr);
+
+			/* sample not committed yet, bail out for now */
+			if (len & RINGBUF_BUSY_BIT)
+				return;
+
+			got_new_data = true;
+			cons_pos += roundup_len(len);
+
+			atomic_inc(&buf_hits.value);
+		}
+		if (got_new_data)
+			smp_store_release(r->consumer_pos, cons_pos);
+		else
+			break;
+	}
+}
+
+static void *ringbuf_custom_consumer(void *input)
+{
+	struct ringbuf_custom_ctx *ctx = &ringbuf_custom_ctx;
+	int cnt;
+
+	do {
+		if (args.back2back)
+			bufs_trigger_batch();
+		cnt = epoll_wait(ctx->epoll_fd, &ctx->event, 1, -1);
+		if (cnt > 0)
+		{
+			ringbuf_custom_process_ring(&ctx->ringbuf);
+		}
+	} while (cnt >= 0);
+	fprintf(stderr, "ringbuf polling failed!\n");
+	return NULL;
+}
+
+/* PERFBUF-LIBBPF benchmark */
+static struct perfbuf_libbpf_ctx {
+	struct perfbuf_bench *skel;
+	struct perf_buffer *perfbuf;
+} perfbuf_libbpf_ctx;
+
+static void perfbuf_measure(struct bench_res *res)
+{
+	struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx;
+
+	res->hits = atomic_swap(&buf_hits.value, 0);
+	res->drops = atomic_swap(&ctx->skel->bss->dropped, 0);
+}
+
+static struct perfbuf_bench *perfbuf_setup_skeleton()
+{
+	struct perfbuf_bench *skel;
+
+	setup_libbpf();
+
+	skel = perfbuf_bench__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	skel->rodata->batch_cnt = args.batch_cnt;
+
+	if (perfbuf_bench__load(skel)) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	return skel;
+}
+
+static void perfbuf_process_sample(void *input_ctx, int cpu, void *data,
+				   __u32 len)
+{
+	atomic_inc(&buf_hits.value);
+}
+
+static enum bpf_perf_event_ret
+perfbuf_process_sample_raw(void *input_ctx, int cpu,
+			   struct perf_event_header *e)
+{
+	switch (e->type) {
+	case PERF_RECORD_SAMPLE:
+		atomic_inc(&buf_hits.value);
+		break;
+	case PERF_RECORD_LOST:
+		break;
+	default:
+		return LIBBPF_PERF_EVENT_ERROR;
+	}
+	return LIBBPF_PERF_EVENT_CONT;
+}
+
+static void perfbuf_libbpf_setup()
+{
+	struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx;
+	struct perf_buffer_opts pb_opts = {
+		.sample_cb = perfbuf_process_sample,
+		.ctx = (void *)(long)0,
+	};
+	struct bpf_link *link;
+
+	ctx->skel = perfbuf_setup_skeleton();
+	ctx->perfbuf = perf_buffer__new(bpf_map__fd(ctx->skel->maps.perfbuf),
+					args.perfbuf_sz, &pb_opts);
+	if (!ctx->perfbuf) {
+		fprintf(stderr, "failed to create perfbuf\n");
+		exit(1);
+	}
+
+	link = bpf_program__attach(ctx->skel->progs.bench_perfbuf);
+	if (IS_ERR(link)) {
+		fprintf(stderr, "failed to attach program\n");
+		exit(1);
+	}
+}
+
+static void perfbuf_raw_setup()
+{
+	struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx;
+	struct perf_event_attr attr;
+	struct perf_buffer_raw_opts pb_opts = {
+		.event_cb = perfbuf_process_sample_raw,
+		.ctx = (void *)(long)0,
+		.attr = &attr,
+	};
+	struct bpf_link *link;
+
+	ctx->skel = perfbuf_setup_skeleton();
+
+	memset(&attr, 0, sizeof(attr));
+	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
+	attr.type = PERF_TYPE_SOFTWARE;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	/* notify only every Nth sample */
+	attr.sample_period = args.perfbuf_sample_rate;
+	attr.wakeup_events = args.perfbuf_sample_rate;
+
+	if (args.perfbuf_sample_rate > args.batch_cnt) {
+		fprintf(stderr, "sample rate %d is too high for given batch count %d\n",
+			args.perfbuf_sample_rate, args.batch_cnt);
+		exit(1);
+	}
+
+	ctx->perfbuf = perf_buffer__new_raw(bpf_map__fd(ctx->skel->maps.perfbuf),
+					    args.perfbuf_sz, &pb_opts);
+	if (!ctx->perfbuf) {
+		fprintf(stderr, "failed to create perfbuf\n");
+		exit(1);
+	}
+
+	link = bpf_program__attach(ctx->skel->progs.bench_perfbuf);
+	if (IS_ERR(link)) {
+		fprintf(stderr, "failed to attach program\n");
+		exit(1);
+	}
+}
+
+static void *perfbuf_libbpf_consumer(void *input)
+{
+	struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx;
+
+	while (perf_buffer__poll(ctx->perfbuf, -1) >= 0) {
+		if (args.back2back)
+			bufs_trigger_batch();
+	}
+	fprintf(stderr, "perfbuf polling failed!\n");
+	return NULL;
+}
+
+/* PERFBUF-CUSTOM benchmark */
+
+/* copies of internal libbpf definitions */
+struct perf_cpu_buf {
+	struct perf_buffer *pb;
+	void *base; /* mmap()'ed memory */
+	void *buf; /* for reconstructing segmented data */
+	size_t buf_size;
+	int fd;
+	int cpu;
+	int map_key;
+};
+
+struct perf_buffer {
+	perf_buffer_event_fn event_cb;
+	perf_buffer_sample_fn sample_cb;
+	perf_buffer_lost_fn lost_cb;
+	void *ctx; /* passed into callbacks */
+
+	size_t page_size;
+	size_t mmap_size;
+	struct perf_cpu_buf **cpu_bufs;
+	struct epoll_event *events;
+	int cpu_cnt; /* number of allocated CPU buffers */
+	int epoll_fd; /* perf event FD */
+	int map_fd; /* BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map FD */
+};
+
+static void *perfbuf_custom_consumer(void *input)
+{
+	struct perfbuf_libbpf_ctx *ctx = &perfbuf_libbpf_ctx;
+	struct perf_buffer *pb = ctx->perfbuf;
+	struct perf_cpu_buf *cpu_buf;
+	struct perf_event_mmap_page *header;
+	size_t mmap_mask = pb->mmap_size - 1;
+	struct perf_event_header *ehdr;
+	__u64 data_head, data_tail;
+	size_t ehdr_size;
+	void *base;
+	int i, cnt;
+
+	while (true) {
+		if (args.back2back)
+			bufs_trigger_batch();
+		cnt = epoll_wait(pb->epoll_fd, pb->events, pb->cpu_cnt, -1);
+		if (cnt <= 0) {
+			fprintf(stderr, "perf epoll failed: %d\n", -errno);
+			exit(1);
+		}
+
+		for (i = 0; i < cnt; ++i) {
+			cpu_buf = pb->events[i].data.ptr;
+			header = cpu_buf->base;
+			base = ((void *)header) + pb->page_size;
+
+			data_head = ring_buffer_read_head(header);
+			data_tail = header->data_tail;
+			while (data_head != data_tail) {
+				ehdr = base + (data_tail & mmap_mask);
+				ehdr_size = ehdr->size;
+
+				if (ehdr->type == PERF_RECORD_SAMPLE)
+					atomic_inc(&buf_hits.value);
+
+				data_tail += ehdr_size;
+			}
+			ring_buffer_write_tail(header, data_tail);
+		}
+	}
+	return NULL;
+}
+
+const struct bench bench_rb_libbpf = {
+	.name = "rb-libbpf",
+	.validate = bufs_validate,
+	.setup = ringbuf_libbpf_setup,
+	.producer_thread = bufs_sample_producer,
+	.consumer_thread = ringbuf_libbpf_consumer,
+	.measure = ringbuf_libbpf_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rb_custom = {
+	.name = "rb-custom",
+	.validate = bufs_validate,
+	.setup = ringbuf_custom_setup,
+	.producer_thread = bufs_sample_producer,
+	.consumer_thread = ringbuf_custom_consumer,
+	.measure = ringbuf_custom_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_pb_libbpf = {
+	.name = "pb-libbpf",
+	.validate = bufs_validate,
+	.setup = perfbuf_libbpf_setup,
+	.producer_thread = bufs_sample_producer,
+	.consumer_thread = perfbuf_libbpf_consumer,
+	.measure = perfbuf_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_pb_raw = {
+	.name = "pb-raw",
+	.validate = bufs_validate,
+	.setup = perfbuf_raw_setup,
+	.producer_thread = bufs_sample_producer,
+	.consumer_thread = perfbuf_libbpf_consumer,
+	.measure = perfbuf_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_pb_custom = {
+	.name = "pb-custom",
+	.validate = bufs_validate,
+	.setup = perfbuf_raw_setup,
+	.producer_thread = bufs_sample_producer,
+	.consumer_thread = perfbuf_custom_consumer,
+	.measure = perfbuf_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
new file mode 100755
index 000000000000..15b6cd5bd9b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
@@ -0,0 +1,61 @@
+#!/bin/bash
+
+set -eufo pipefail
+
+RUN_BENCH="sudo ./bench -w3 -d10 -a"
+
+function hits()
+{
+	echo "$*" | sed -E "s/.*hits\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
+}
+
+function drops()
+{
+	echo "$*" | sed -E "s/.*drops\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
+}
+
+function header()
+{
+	local len=${#1}
+
+	printf "\n%s\n" "$1"
+	for i in $(seq 1 $len); do printf '='; done
+	printf '\n'
+}
+
+function summarize()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s (drops %s)\n" "$bench" "$(hits $summary)" "$(drops $summary)"
+}
+
+header "Single-producer, parallel producer"
+for b in rb-libbpf rb-custom pb-libbpf pb-raw pb-custom; do
+	summarize $b "$($RUN_BENCH $b)"
+done
+
+header "Single-producer, back-to-back mode"
+for b in rb-libbpf rb-custom pb-libbpf pb-raw pb-custom; do
+	summarize $b "$($RUN_BENCH --rb-b2b $b)"
+done
+
+header "Perfbuf, effect of sample rate, back-to-back"
+for b in 1 5 10 25 50 100 250 500 1000 2000 3000; do
+	summarize "sample rate $b" "$($RUN_BENCH --rb-b2b --rb-batch-cnt 3000 --rb-sample-rate $b pb-raw)"
+done
+
+header "Ringbuf, reserve+commit vs output, back-to-back"
+summarize "reserve" "$($RUN_BENCH --rb-b2b                 rb-custom)"
+summarize "output"  "$($RUN_BENCH --rb-b2b --rb-use-output rb-custom)"
+
+header "Single-producer, consumer/producer competing on the same CPU, low batch count"
+for b in rb-libbpf rb-custom pb-libbpf pb-raw pb-custom; do
+	summarize $b "$($RUN_BENCH --rb-sample-rate 1 --rb-batch-cnt 1 --prod-affinity 0 --cons-affinity 0 $b)"
+done
+
+header "Ringbuf, multi-producer contention, low batch count"
+for b in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do
+	summarize "rb-libbpf nr_prod $b" "$($RUN_BENCH -p$b --rb-batch-cnt 50 rb-libbpf)"
+done
+
diff --git a/tools/testing/selftests/bpf/progs/perfbuf_bench.c b/tools/testing/selftests/bpf/progs/perfbuf_bench.c
new file mode 100644
index 000000000000..e5ab4836a641
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/perfbuf_bench.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Facebook
+
+#include <linux/bpf.h>
+#include <stdint.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(value_size, sizeof(int));
+	__uint(key_size, sizeof(int));
+} perfbuf SEC(".maps");
+
+const volatile int batch_cnt = 0;
+
+long sample_val = 42;
+long dropped __attribute__((aligned(128))) = 0;
+
+SEC("fentry/__x64_sys_getpgid")
+int bench_perfbuf(void *ctx)
+{
+	__u64 *sample;
+	int i;
+
+	for (i = 0; i < batch_cnt; i++) {
+		if (bpf_perf_event_output(ctx, &perfbuf, BPF_F_CURRENT_CPU,
+					  &sample_val, sizeof(sample_val)))
+			__sync_add_and_fetch(&dropped, 1);
+	}
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/ringbuf_bench.c b/tools/testing/selftests/bpf/progs/ringbuf_bench.c
new file mode 100644
index 000000000000..6008ec5d6a22
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/ringbuf_bench.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Facebook
+
+#include <linux/bpf.h>
+#include <stdint.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+} ringbuf SEC(".maps");
+
+const volatile int batch_cnt = 0;
+const volatile long use_reserve = 1;
+
+long sample_val = 42;
+long dropped __attribute__((aligned(128))) = 0;
+
+SEC("fentry/__x64_sys_getpgid")
+int bench_ringbuf(void *ctx)
+{
+	long *sample;
+	int i;
+
+	if (use_reserve) {
+		for (i = 0; i < batch_cnt; i++) {
+			sample = bpf_ringbuf_reserve(&ringbuf,
+						     sizeof(sample_val), 0);
+			if (!sample) {
+				__sync_add_and_fetch(&dropped, 1);
+			} else {
+				*sample = sample_val;
+				bpf_ringbuf_submit(sample);
+			}
+		}
+	} else {
+		for (i = 0; i < batch_cnt; i++) {
+			if (bpf_ringbuf_output(&ringbuf, &sample_val,
+					       sizeof(sample_val), 0))
+				__sync_add_and_fetch(&dropped, 1);
+		}
+	}
+	return 0;
+}
-- 
2.24.1



* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
@ 2020-05-13 20:57   ` kbuild test robot
  2020-05-13 21:58   ` Alan Maguire
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2020-05-13 20:57 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: kbuild-all, andrii.nakryiko, kernel-team, Andrii Nakryiko,
	Paul E . McKenney, Jonathan Lemon

[-- Attachment #1: Type: text/plain, Size: 3592 bytes --]

Hi Andrii,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[also build test ERROR on next-20200512]
[cannot apply to bpf/master rcu/dev v5.7-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Andrii-Nakryiko/BPF-ring-buffer/20200514-032857
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=sh 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

In file included from include/linux/kernel.h:11,
from include/linux/list.h:9,
from include/linux/timer.h:5,
from include/linux/workqueue.h:9,
from include/linux/bpf.h:9,
from kernel/bpf/ringbuf.c:1:
kernel/bpf/ringbuf.c: In function 'bpf_ringbuf_commit':
>> include/linux/compiler.h:350:38: error: call to '__compiletime_assert_134' declared with attribute error: Need native word sized stores/loads for atomicity.
350 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
|                                      ^
include/linux/compiler.h:331:4: note: in definition of macro '__compiletime_assert'
331 |    prefix ## suffix();
|    ^~~~~~
include/linux/compiler.h:350:2: note: in expansion of macro '_compiletime_assert'
350 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
|  ^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:353:2: note: in expansion of macro 'compiletime_assert'
353 |  compiletime_assert(__native_word(t),
|  ^~~~~~~~~~~~~~~~~~
include/asm-generic/barrier.h:187:2: note: in expansion of macro 'compiletime_assert_atomic_type'
187 |  compiletime_assert_atomic_type(*p);
|  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/ringbuf.c:354:13: note: in expansion of macro 'smp_load_acquire'
354 |  cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
|             ^~~~~~~~~~~~~~~~

vim +/smp_load_acquire +354 kernel/bpf/ringbuf.c

   332	
   333	static void bpf_ringbuf_commit(void *sample, bool discard)
   334	{
   335		unsigned long rec_pos, cons_pos;
   336		u32 new_meta, old_meta;
   337		void *meta_ptr;
   338		struct bpf_ringbuf *rb;
   339	
   340		meta_ptr = sample - RINGBUF_META_SZ;
   341		rb = bpf_ringbuf_restore_from_rec(meta_ptr);
   342		old_meta = *(u32 *)meta_ptr;
   343		new_meta = old_meta ^ RINGBUF_BUSY_BIT;
   344		if (discard)
   345			new_meta |= RINGBUF_DISCARD_BIT;
   346	
   347		/* update metadata header with correct final size prefix */
   348		xchg((u32 *)meta_ptr, new_meta);
   349	
   350		/* if consumer caught up and is waiting for our record, notify about
   351		 * new data availability
   352		 */
   353		rec_pos = (void *)meta_ptr - (void *)rb->data;
 > 354		cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
   355		if (cons_pos == rec_pos)
   356			wake_up_all(&rb->waitq);
   357	}
   358	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 54706 bytes --]


* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
  2020-05-13 20:57   ` kbuild test robot
@ 2020-05-13 21:58   ` Alan Maguire
  2020-05-14  5:59     ` Andrii Nakryiko
  2020-05-13 22:16   ` kbuild test robot
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Alan Maguire @ 2020-05-13 21:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Paul E . McKenney, Jonathan Lemon

On Wed, 13 May 2020, Andrii Nakryiko wrote:

> This commit adds a new MPSC ring buffer implementation into BPF ecosystem,
> which allows multiple CPUs to submit data to a single shared ring buffer. On
> the consumption side, only a single consumer is assumed.
> 
> Motivation
> ----------
> There are two distinctive motivators for this work, which are not satisfied by
> existing perf buffer, which prompted creation of a new ring buffer
> implementation.
>   - more efficient memory utilization by sharing ring buffer across CPUs;
>   - preserving ordering of events that happen sequentially in time, even
>   across multiple CPUs (e.g., fork/exec/exit events for a task).
> 
> These two problems are independent, but perf buffer fails to satisfy both.
> Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
> also solved by having an MPSC implementation of ring buffer. The ordering
> problem could technically be solved for perf buffer with some in-kernel
> counting, but given the first one requires an MPSC buffer, the same solution
> would solve the second problem automatically.
> 

This looks great Andrii! One potentially interesting side-effect of
the way this is implemented is that it could (I think) support speculative 
tracing.

Say I want to record some tracing info when I enter function foo(), but 
I only care about cases where that function later returns an error value.
I _think_ your implementation could support that via a scheme like 
this:

- attach a kprobe program to record the data via bpf_ringbuf_reserve(),
  and store the reserved pointer value in a per-task keyed hashmap.
  Then record the values of interest in the reserved space. This is our
  speculative data as we don't know whether we want to commit it yet.

- attach a kretprobe program that picks up our reserved pointer and
  commit()s or discard()s the associated data based on the return value.

- the consumer should (I think) then only read the committed data, so in
  this case just the data of interest associated with the failure case.

I'm curious if that sort of ringbuf access pattern across multiple 
programs would work? Thanks!

Alan

> Semantics and APIs
> ------------------
> A single ring buffer is presented to BPF programs as an instance of BPF map of
> type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but
> ultimately rejected.
> 
> One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, make
> BPF_MAP_TYPE_RINGBUF represent an array of ring buffers, but not enforce
> "same CPU only" rule. This would be more familiar interface compatible with
> existing perf buffer use in BPF, but would fail if application needed more
> advanced logic to lookup ring buffer by arbitrary key. HASH_OF_MAPS addresses
> this with current approach. Additionally, given the performance of BPF
> ringbuf, many use cases would just opt into a simple single ring buffer shared
> among all CPUs, for which current approach would be an overkill.
> 
> Another approach could introduce a new concept, alongside BPF map, to
> represent generic "container" object, which doesn't necessarily have key/value
> interface with lookup/update/delete operations. This approach would add a lot
> of extra infrastructure that has to be built for observability and verifier
> support. It would also add another concept that BPF developers would have to
> familiarize themselves with, new syntax in libbpf, etc. But then would really
> provide no additional benefits over the approach of using a map.
> BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but
> neither do a few other map types (e.g., queue and stack; array doesn't
> support delete, etc).
> 
> The approach chosen has an advantage of re-using existing BPF map
> infrastructure (introspection APIs in kernel, libbpf support, etc), being
> familiar concept (no need to teach users a new type of object in BPF program),
> and utilizing existing tooling (bpftool). For common scenario of using
> a single ring buffer for all CPUs, it's as simple and straightforward, as
> would be with a dedicated "container" object. On the other hand, by being
> a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
> implement a wide variety of topologies, from one ring buffer for each CPU
> (e.g., as a replacement for perf buffer use cases), to a complicated
> application hashing/sharding of ring buffers (e.g., having a small pool of
> ring buffers with hashed task's tgid being a look up key to preserve order,
> but reduce contention).
> 
> Key and value sizes are enforced to be zero. max_entries is used to specify
> the size of ring buffer and has to be a power of 2 value.
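A minimal C sketch of the size constraint stated above, for illustration only.
The helper name ringbuf_size_ok() is hypothetical and not part of the patch.
Note that the quoted ringbuf_map_alloc() only tests that max_entries is
non-zero and page-aligned; the power-of-2 test here follows the commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): true when sz is a non-zero
 * power of two that is also a multiple of the page size.
 */
static bool ringbuf_size_ok(uint32_t sz, uint32_t page_sz)
{
	/* the page size itself is expected to be a power of two */
	assert(page_sz && (page_sz & (page_sz - 1)) == 0);

	if (sz == 0 || (sz & (sz - 1)) != 0)
		return false;
	return (sz & (page_sz - 1)) == 0;
}
```

With a 4096-byte page, 4096 and 1 MiB would be accepted, while 0, 2048 and
12288 (page-aligned, but not a power of 2) would be rejected.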
> 
> There are a bunch of similarities between perf buffer
> (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
>   - variable-length records;
>   - if there is no more space left in ring buffer, reservation fails, no
>     blocking;
>   - memory-mappable data area for user-space applications for ease of
>     consumption and high performance;
>   - epoll notifications for new incoming data;
>   - but still the ability to do busy polling for new data to achieve the
>     lowest latency, if necessary.
> 
> BPF ringbuf provides two sets of APIs to BPF programs:
>   - bpf_ringbuf_output() allows to *copy* data from one place to a ring
>     buffer, similarly to bpf_perf_event_output();
>   - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
>     split the whole process into two steps. First, a fixed amount of space is
>     reserved. If successful, a pointer to a data inside ring buffer data area
>     is returned, which BPF programs can use similarly to a data inside
>     array/hash maps. Once ready, this piece of memory is either committed or
>     discarded. Discard is similar to commit, but makes consumer ignore the
>     record.
> 
> bpf_ringbuf_output() has disadvantage of incurring extra memory copy, because
> record has to be prepared in some other place first. But it allows to submit
> records of the length that's not known to verifier beforehand. It also closely
> matches bpf_perf_event_output(), so will simplify migration significantly.
> 
> bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory
> pointer directly to ring buffer memory. In a lot of cases records are larger
> than BPF stack space allows, so many programs have to use an extra per-CPU
> array as a temporary heap for preparing a sample. bpf_ringbuf_reserve()
> avoids this need completely. But in exchange, it only allows a known constant
> size of memory to be reserved, such that verifier can verify that BPF program
> can't access memory outside its reserved record space. bpf_ringbuf_output(),
> while slightly slower due to extra memory copy, covers some use cases that are
> not suitable for bpf_ringbuf_reserve().
> 
> The difference between commit and discard is very small. Discard just marks
> a record as discarded, and such records are supposed to be ignored by consumer
> code. Discard is useful for some advanced use-cases, such as ensuring
> all-or-nothing multi-record submission, or emulating temporary malloc()/free()
> within single BPF program invocation.
> 
> Each reserved record is tracked by verifier through existing
> reference-tracking logic, similar to socket ref-tracking. It is thus
> impossible to reserve a record, but forget to submit (or discard) it.
> 
> Design and implementation
> -------------------------
> This reserve/commit schema allows a natural way for multiple producers, either
> on different CPUs or even on the same CPU/in the same BPF program, to reserve
> independent records and work with them without blocking other producers. This
> means that if BPF program was interrupted by another BPF program sharing the
> same ring buffer, they will both get a record reserved (provided there is
> enough space left) and can work with it and submit it independently. This
> applies to NMI context as well, except that due to using a spinlock during
> reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock,
> in which case reservation will fail even if ring buffer is not full.
> 
> The ring buffer itself internally is implemented as a power-of-2 sized
> circular buffer, with two logical and ever-increasing counters (which might
> wrap around on 32-bit architectures, that's not a problem):
>   - consumer counter shows up to which logical position consumer consumed the
>     data;
>   - producer counter denotes amount of data reserved by all producers.
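As an illustration of why wrap-around of the two logical counters is harmless,
here is a small C sketch using 32-bit counters. The function names are
hypothetical; only the idea (ever-increasing counters, power-of-2 size) comes
from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes reserved by producers but not yet consumed. Unsigned subtraction
 * gives the right answer even after prod_pos has wrapped past zero while
 * cons_pos has not.
 */
static uint32_t rb_unconsumed(uint32_t prod_pos, uint32_t cons_pos)
{
	return prod_pos - cons_pos;
}

/* Physical offset into the data area for a logical position; size must be
 * a power of two so that masking is equivalent to a modulo.
 */
static uint32_t rb_offset(uint32_t pos, uint32_t size)
{
	assert(size && (size & (size - 1)) == 0);
	return pos & (size - 1);
}
```

For example, with cons_pos = 0xfffffff0 and prod_pos = 0x10 (already wrapped),
rb_unconsumed() still yields 0x20 bytes.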
> 
> Each time a record is reserved, producer that "owns" the record will
> successfully advance producer counter. At that point, data is still not yet
> ready to be consumed, though. Each record has 8 byte header, which contains
> the length of reserved record, as well as two extra bits: busy bit to denote
> that record is still being worked on, and discard bit, which might be set at
> commit time if record is discarded. In the latter case, consumer is supposed
> to skip the record and move on to the next one. Record header also encodes
> record's relative offset from the beginning of ring buffer data area (in
> pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only
> the pointer to the record itself, without requiring also the pointer to ring
> buffer itself. Ring buffer memory location will be restored from record
> metadata header. This significantly simplifies verifier, as well as improving
> API usability.
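The busy/discard bits in the record header can be sketched in plain C as
follows. The bit positions (31 and 30) and the XOR on commit mirror the quoted
bpf_ringbuf_commit(); the helper names and the assumption that a reservation
stores len with the busy bit set are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUSY_BIT    (1u << 31)
#define DISCARD_BIT (1u << 30)

/* Assumed reservation step: record length with the busy bit set. */
static uint32_t hdr_reserve(uint32_t len)
{
	assert((len & (BUSY_BIT | DISCARD_BIT)) == 0);
	return len | BUSY_BIT;
}

/* Commit clears the busy bit; a discard additionally sets the discard bit. */
static uint32_t hdr_commit(uint32_t hdr, bool discard)
{
	uint32_t new_hdr = hdr ^ BUSY_BIT;

	if (discard)
		new_hdr |= DISCARD_BIT;
	return new_hdr;
}

static bool hdr_busy(uint32_t hdr)      { return hdr & BUSY_BIT; }
static bool hdr_discarded(uint32_t hdr) { return hdr & DISCARD_BIT; }
static uint32_t hdr_len(uint32_t hdr)   { return hdr & ~(BUSY_BIT | DISCARD_BIT); }
```

In both the commit and the discard case the length stays readable, so a
consumer can skip over a discarded record.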
> 
> Producer counter increments are serialized under spinlock, so there is
> a strict ordering between reservations. Commits, on the other hand, are
> completely lockless and independent. All records become available to consumer
> in the order of reservations, but only after all previous records were
> already committed. It is thus possible for slow producers to temporarily hold
> off submitted records, that were reserved later.
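The "slow producer holds off later records" behaviour can be illustrated with
a simplified, single-threaded consumer walk in C. This is a sketch under
stated assumptions, not the libbpf consumer: the 8-byte header with len/flags
in its first 4 bytes follows the text, while the 8-byte record alignment, the
absence of wrap-around and memory barriers, and all function names are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUSY_BIT    (1u << 31)
#define DISCARD_BIT (1u << 30)
#define HDR_SZ      8u

/* Assumed layout: header + payload, padded to an 8-byte boundary. */
static size_t rec_total(uint32_t len)
{
	return (HDR_SZ + len + 7u) & ~(size_t)7;
}

/* Example helper: write a record header at off, return next record offset. */
static size_t put_rec(unsigned char *buf, size_t off, uint32_t len, uint32_t flags)
{
	uint32_t hdr = len | flags;

	memcpy(buf + off, &hdr, sizeof(hdr));
	return off + rec_total(len);
}

/* Walk records in [cons, prod). Stop at the first record that is still
 * busy, even if later records are already committed. Returns the new
 * consumer position; *delivered counts committed, non-discarded records
 * seen during this call.
 */
static size_t consume(const unsigned char *buf, size_t cons, size_t prod,
		      size_t *delivered)
{
	assert(cons <= prod);
	*delivered = 0;
	while (cons < prod) {
		uint32_t hdr;

		memcpy(&hdr, buf + cons, sizeof(hdr));
		if (hdr & BUSY_BIT)
			break;
		if (!(hdr & DISCARD_BIT))
			(*delivered)++;
		cons += rec_total(hdr & ~(BUSY_BIT | DISCARD_BIT));
	}
	return cons;
}
```

With four 4-byte records where the third is still busy, the walk stops after
the second record; once the third is committed, a second call picks up both
remaining records.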
> 
> Reservation/commit/consumer protocol is verified by litmus tests in the later
> patch in this series.
> 
> One interesting implementation bit, that significantly simplifies (and thus
> speeds up as well) implementation of both producers and consumers is how data
> area is mapped twice contiguously back-to-back in the virtual memory. This
> allows to not take any special measures for samples that have to wrap around
> at the end of the circular buffer data area, because the next page after the
> last data page would be first data page again, and thus the sample will still
> appear completely contiguous in virtual memory. See comment and a simple ASCII
> diagram showing this visually in bpf_ringbuf_area_alloc().
> 
> Another feature that distinguishes BPF ringbuf from perf ring buffer is
> self-pacing notifications of new data availability.
> bpf_ringbuf_commit() implementation will send a notification of new record
> being available after commit only if consumer has already caught up right up
> to the record being committed. If not, consumer still has to catch up and thus
> will see new data anyways without needing an extra poll notification. As will
> be shown in benchmarks in later patch in the series, this allows to achieve
> a very high throughput without having to resort to tricks like "notify only
> every Nth sample", like with perf buffer, to achieve good throughput
> performance.
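The self-pacing rule boils down to one comparison, shown here as a stand-alone
C sketch. It mirrors the check visible in the quoted bpf_ringbuf_commit()
(consumer position masked and compared to the record's offset in the data
area); the function name need_wakeup() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wake the consumer only if it has caught up exactly to the record that was
 * just committed; otherwise it is still behind and will see the new record
 * while draining older ones.
 */
static bool need_wakeup(uint64_t consumer_pos, uint64_t rec_off, uint64_t mask)
{
	assert(rec_off <= mask);
	return (consumer_pos & mask) == rec_off;
}
```

For a 4096-byte buffer (mask 4095), a consumer at logical position 4096 + 64
is waiting on the record at offset 64 and would be notified, while a consumer
at position 32 would not.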
> 
> For performance evaluation against perf buffer and scalability limits, see
> patch later in the series, adding ring buffers benchmark.
> 
> Comparison to alternatives
> --------------------------
> Before considering implementing BPF ring buffer from scratch, existing
> alternatives in kernel were evaluated, but didn't seem to meet the needs. They
> largely fell into a few categories:
>   - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
>     outlined above (ordering and memory consumption);
>   - linked list-based implementations; while some were multi-producer designs,
>     consuming these from user-space would be very complicated and most
>     probably not performant; memory-mapping contiguous piece of memory is
>     simpler and more performant for user-space consumers;
>   - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
>     SPSC queue into MPSC w/ lock would have subpar performance compared to
>     locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
>     elements would be too limiting for BPF programs, given existing BPF
>     programs heavily rely on variable-sized perf buffer already;
>   - specialized implementations (like a new printk ring buffer, [0]) with lots
>     of printk-specific limitations and implications, that didn't seem to fit
>     well for intended use with BPF programs.
> 
>   [0] https://lwn.net/Articles/779550/
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  include/linux/bpf.h            |  12 +
>  include/linux/bpf_types.h      |   1 +
>  include/linux/bpf_verifier.h   |   4 +
>  include/uapi/linux/bpf.h       |  33 ++-
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/helpers.c           |   8 +
>  kernel/bpf/ringbuf.c           | 409 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  12 +
>  kernel/bpf/verifier.c          | 156 ++++++++++---
>  kernel/trace/bpf_trace.c       |   8 +
>  tools/include/uapi/linux/bpf.h |  33 ++-
>  11 files changed, 643 insertions(+), 35 deletions(-)
>  create mode 100644 kernel/bpf/ringbuf.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index cf4b6e44f2bc..9e3da01f3e9b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -89,6 +89,8 @@ struct bpf_map_ops {
>  	int (*map_direct_value_meta)(const struct bpf_map *map,
>  				     u64 imm, u32 *off);
>  	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
> +	__poll_t (*map_poll)(struct bpf_map *map, struct file *filp,
> +			     struct poll_table_struct *pts);
>  };
>  
>  struct bpf_map_memory {
> @@ -243,6 +245,9 @@ enum bpf_arg_type {
>  	ARG_PTR_TO_LONG,	/* pointer to long */
>  	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
>  	ARG_PTR_TO_BTF_ID,	/* pointer to in-kernel struct */
> +	ARG_PTR_TO_ALLOC_MEM,	/* pointer to dynamically allocated memory */
> +	ARG_PTR_TO_ALLOC_MEM_OR_NULL,	/* pointer to dynamically allocated memory or NULL */
> +	ARG_CONST_ALLOC_SIZE_OR_ZERO,	/* number of allocated bytes requested */
>  };
>  
>  /* type of values returned from helper functions */
> @@ -254,6 +259,7 @@ enum bpf_return_type {
>  	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
>  	RET_PTR_TO_TCP_SOCK_OR_NULL,	/* returns a pointer to a tcp_sock or NULL */
>  	RET_PTR_TO_SOCK_COMMON_OR_NULL,	/* returns a pointer to a sock_common or NULL */
> +	RET_PTR_TO_ALLOC_MEM_OR_NULL,	/* returns a pointer to dynamically allocated memory or NULL */
>  };
>  
>  /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
> @@ -321,6 +327,8 @@ enum bpf_reg_type {
>  	PTR_TO_XDP_SOCK,	 /* reg points to struct xdp_sock */
>  	PTR_TO_BTF_ID,		 /* reg points to kernel struct */
>  	PTR_TO_BTF_ID_OR_NULL,	 /* reg points to kernel struct or NULL */
> +	PTR_TO_MEM,		 /* reg points to valid memory region */
> +	PTR_TO_MEM_OR_NULL,	 /* reg points to valid memory region or NULL */
>  };
>  
>  /* The information passed from prog-specific *_is_valid_access
> @@ -1585,6 +1593,10 @@ extern const struct bpf_func_proto bpf_tcp_sock_proto;
>  extern const struct bpf_func_proto bpf_jiffies64_proto;
>  extern const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto;
>  extern const struct bpf_func_proto bpf_event_output_data_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_output_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_reserve_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_submit_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_discard_proto;
>  
>  const struct bpf_func_proto *bpf_tracing_func_proto(
>  	enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 29d22752fc87..fa8e1b552acd 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -118,6 +118,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>  #if defined(CONFIG_BPF_JIT)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 6abd5a778fcd..c94a736e53cd 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -54,6 +54,8 @@ struct bpf_reg_state {
>  
>  		u32 btf_id; /* for PTR_TO_BTF_ID */
>  
> +		u32 mem_size; /* for PTR_TO_MEM | PTR_TO_MEM_OR_NULL */
> +
>  		/* Max size from any of the above. */
>  		unsigned long raw;
>  	};
> @@ -63,6 +65,8 @@ struct bpf_reg_state {
>  	 * offset, so they can share range knowledge.
>  	 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
>  	 * came from, when one is tested for != NULL.
> +	 * For PTR_TO_MEM_OR_NULL this is used to identify memory allocation
> +	 * for the purpose of tracking that it's freed.
>  	 * For PTR_TO_SOCKET this is used to share which pointers retain the
>  	 * same reference to the socket, to determine proper reference freeing.
>  	 */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index bfb31c1be219..ae2deb6a8afc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -147,6 +147,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_SK_STORAGE,
>  	BPF_MAP_TYPE_DEVMAP_HASH,
>  	BPF_MAP_TYPE_STRUCT_OPS,
> +	BPF_MAP_TYPE_RINGBUF,
>  };
>  
>  /* Note that tracing related programs such as
> @@ -3121,6 +3122,32 @@ union bpf_attr {
>   * 		0 on success, or a negative error in case of failure:
>   *
>   *		**-EOVERFLOW** if an overflow happened: The same object will be tried again.
> + *
> + * int bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
> + * 	Description
> + * 		Copy *size* bytes from *data* into a ring buffer *ringbuf*.
> + * 	Return
> + * 		0, on success;
> + * 		< 0, on error.
> + *
> + * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
> + * 	Description
> + * 		Reserve *size* bytes of payload in a ring buffer *ringbuf*.
> + * 	Return
> + * 		Valid pointer with *size* bytes of memory available; NULL,
> + * 		otherwise.
> + *
> + * void bpf_ringbuf_submit(void *data)
> + * 	Description
> + * 		Submit reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
> + *
> + * void bpf_ringbuf_discard(void *data)
> + * 	Description
> + * 		Discard reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3250,7 +3277,11 @@ union bpf_attr {
>  	FN(sk_assign),			\
>  	FN(ktime_get_boot_ns),		\
>  	FN(seq_printf),			\
> -	FN(seq_write),
> +	FN(seq_write),			\
> +	FN(ringbuf_output),		\
> +	FN(ringbuf_reserve),		\
> +	FN(ringbuf_submit),		\
> +	FN(ringbuf_discard),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 37b2d8620153..c9aada6c1806 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -4,7 +4,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init)
>  
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5c0290e0696e..27321ca8803f 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -629,6 +629,14 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>  		return &bpf_ktime_get_ns_proto;
>  	case BPF_FUNC_ktime_get_boot_ns:
>  		return &bpf_ktime_get_boot_ns_proto;
> +	case BPF_FUNC_ringbuf_output:
> +		return &bpf_ringbuf_output_proto;
> +	case BPF_FUNC_ringbuf_reserve:
> +		return &bpf_ringbuf_reserve_proto;
> +	case BPF_FUNC_ringbuf_submit:
> +		return &bpf_ringbuf_submit_proto;
> +	case BPF_FUNC_ringbuf_discard:
> +		return &bpf_ringbuf_discard_proto;
>  	default:
>  		break;
>  	}
> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> new file mode 100644
> index 000000000000..f2ae441a1695
> --- /dev/null
> +++ b/kernel/bpf/ringbuf.c
> @@ -0,0 +1,409 @@
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/filter.h>
> +#include <linux/mm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <uapi/linux/btf.h>
> +
> +#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
> +
> +#define RINGBUF_BUSY_BIT BIT(31)
> +#define RINGBUF_DISCARD_BIT BIT(30)
> +#define RINGBUF_META_SZ 8
> +
> +/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */
> +#define BPF_RINGBUF_PGOFF \
> +	(offsetof(struct bpf_ringbuf, consumer_pos) >> PAGE_SHIFT)
> +
> +struct bpf_ringbuf {
> +	wait_queue_head_t waitq;
> +	u64 mask;
> +	spinlock_t spinlock ____cacheline_aligned_in_smp;
> +	u64 consumer_pos __aligned(PAGE_SIZE);
> +	u64 producer_pos __aligned(PAGE_SIZE);
> +	char data[] __aligned(PAGE_SIZE);
> +};
> +
> +struct bpf_ringbuf_map {
> +	struct bpf_map map;
> +	struct bpf_map_memory memory;
> +	struct bpf_ringbuf *rb;
> +};
> +
> +static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
> +{
> +	const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN |
> +			    __GFP_ZERO;
> +	int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF;
> +	int nr_data_pages = data_sz >> PAGE_SHIFT;
> +	int nr_pages = nr_meta_pages + nr_data_pages;
> +	struct page **pages, *page;
> +	size_t array_size;
> +	void *addr;
> +	int i;
> +
> +	/* Each data page is mapped twice to allow "virtual"
> +	 * continuous read of samples wrapping around the end of ring
> +	 * buffer area:
> +	 * ------------------------------------------------------
> +	 * | meta pages |  real data pages  |  same data pages  |
> +	 * ------------------------------------------------------
> +	 * |            | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 |
> +	 * ------------------------------------------------------
> +	 * |            | TA             DA | TA             DA |
> +	 * ------------------------------------------------------
> +	 *                               ^^^^^^^
> +	 *                                  |
> +	 * Here, no need to worry about special handling of wrapped-around
> +	 * data due to double-mapped data pages. This works both in kernel and
> +	 * when mmap()'ed in user-space, simplifying both kernel and
> +	 * user-space implementations significantly.
> +	 */
> +	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
> +	if (array_size > PAGE_SIZE)
> +		pages = vmalloc_node(array_size, numa_node);
> +	else
> +		pages = kmalloc_node(array_size, flags, numa_node);
> +	if (!pages)
> +		return NULL;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		page = alloc_pages_node(numa_node, flags, 0);
> +		if (!page) {
> +			nr_pages = i;
> +			goto err_free_pages;
> +		}
> +		pages[i] = page;
> +		if (i >= nr_meta_pages)
> +			pages[nr_data_pages + i] = page;
> +	}
> +
> +	addr = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
> +		    VM_ALLOC | VM_USERMAP, PAGE_KERNEL);
> +	if (addr)
> +		return addr;
> +
> +err_free_pages:
> +	for (i = 0; i < nr_pages; i++)
> +		__free_page(pages[i]);
> +	kvfree(pages);
> +	return NULL;
> +}
> +
> +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
> +{
> +	struct bpf_ringbuf *rb;
> +
> +	if (!data_sz || !PAGE_ALIGNED(data_sz))
> +		return ERR_PTR(-EINVAL);
> +
> +	rb = bpf_ringbuf_area_alloc(data_sz, numa_node);
> +	if (!rb)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock_init(&rb->spinlock);
> +	init_waitqueue_head(&rb->waitq);
> +
> +	rb->mask = data_sz - 1;
> +	rb->consumer_pos = 0;
> +	rb->producer_pos = 0;
> +
> +	return rb;
> +}
> +
> +static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +	u64 cost;
> +	int err;
> +
> +	if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (attr->key_size || attr->value_size ||
> +	    attr->max_entries == 0 || !PAGE_ALIGNED(attr->max_entries))
> +		return ERR_PTR(-EINVAL);
> +
> +	rb_map = kzalloc(sizeof(*rb_map), GFP_USER);
> +	if (!rb_map)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bpf_map_init_from_attr(&rb_map->map, attr);
> +
> +	cost = sizeof(struct bpf_ringbuf_map) +
> +	       sizeof(struct bpf_ringbuf) +
> +	       attr->max_entries;
> +	err = bpf_map_charge_init(&rb_map->map.memory, cost);
> +	if (err)
> +		goto err_free_map;
> +
> +	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
> +	if (IS_ERR(rb_map->rb)) {
> +		err = PTR_ERR(rb_map->rb);
> +		goto err_uncharge;
> +	}
> +
> +	return &rb_map->map;
> +
> +err_uncharge:
> +	bpf_map_charge_finish(&rb_map->map.memory);
> +err_free_map:
> +	kfree(rb_map);
> +	return ERR_PTR(err);
> +}
> +
> +static void bpf_ringbuf_free(struct bpf_ringbuf *ringbuf)
> +{
> +	kvfree(ringbuf);
> +}
> +
> +static void ringbuf_map_free(struct bpf_map *map)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> +	 * so the programs (can be more than one that used this map) were
> +	 * disconnected from events. Wait for outstanding critical sections in
> +	 * these programs to complete
> +	 */
> +	synchronize_rcu();
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	bpf_ringbuf_free(rb_map->rb);
> +	kfree(rb_map);
> +}
> +
> +static void *ringbuf_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return ERR_PTR(-ENOTSUPP);
> +}
> +
> +static int ringbuf_map_update_elem(struct bpf_map *map, void *key, void *value,
> +				   u64 flags)
> +{
> +	return -ENOTSUPP;
> +}
> +
> +static int ringbuf_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return -ENOTSUPP;
> +}
> +
> +static int ringbuf_map_get_next_key(struct bpf_map *map, void *key,
> +				    void *next_key)
> +{
> +	return -ENOTSUPP;
> +}
> +
> +static size_t bpf_ringbuf_mmap_page_cnt(const struct bpf_ringbuf *rb)
> +{
> +	size_t data_pages = (rb->mask + 1) >> PAGE_SHIFT;
> +
> +	/* consumer page + producer page + 2 x data pages */
> +	return 2 + 2 * data_pages;
> +}
> +
> +static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +	size_t mmap_sz;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	mmap_sz = bpf_ringbuf_mmap_page_cnt(rb_map->rb) << PAGE_SHIFT;
> +
> +	if (vma->vm_pgoff * PAGE_SIZE + (vma->vm_end - vma->vm_start) > mmap_sz)
> +		return -EINVAL;
> +
> +	return remap_vmalloc_range(vma, rb_map->rb,
> +				   vma->vm_pgoff + BPF_RINGBUF_PGOFF);
> +}
> +
> +static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp,
> +				  struct poll_table_struct *pts)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	poll_wait(filp, &rb_map->rb->waitq, pts);
> +
> +	return EPOLLIN | EPOLLRDNORM;
> +}
> +
> +const struct bpf_map_ops ringbuf_map_ops = {
> +	.map_alloc = ringbuf_map_alloc,
> +	.map_free = ringbuf_map_free,
> +	.map_mmap = ringbuf_map_mmap,
> +	.map_poll = ringbuf_map_poll,
> +	.map_lookup_elem = ringbuf_map_lookup_elem,
> +	.map_update_elem = ringbuf_map_update_elem,
> +	.map_delete_elem = ringbuf_map_delete_elem,
> +	.map_get_next_key = ringbuf_map_get_next_key,
> +};
> +
> +/* Given pointer to ring buffer record metadata and struct bpf_ringbuf itself,
> + * calculate offset from record metadata to ring buffer in pages, rounded
> + * down. This page offset is stored as part of record metadata and allows to
> + * restore struct bpf_ringbuf * from record pointer. This page offset is
> + * stored at offset 4 of record metadata header.
> + */
> +static size_t bpf_ringbuf_rec_pg_off(struct bpf_ringbuf *rb, void *meta_ptr)
> +{
> +	return (meta_ptr - (void *)rb) >> PAGE_SHIFT;
> +}
> +
> +/* Given pointer to ring buffer record metadata, restore pointer to struct
> + * bpf_ringbuf itself by using page offset stored at offset 4
> + */
> +static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
> +{
> +	unsigned long addr = (unsigned long)meta_ptr;
> +	unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;
> +
> +	return (void *)((addr & PAGE_MASK) - off);
> +}
> +
> +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> +{
> +	unsigned long cons_pos, prod_pos, new_prod_pos, flags;
> +	u32 len, pg_off;
> +	void *meta_ptr;
> +
> +	if (unlikely(size > UINT_MAX))
> +		return NULL;
> +
> +	len = round_up(size + RINGBUF_META_SZ, 8);
> +	cons_pos = READ_ONCE(rb->consumer_pos);
> +
> +	if (in_nmi()) {
> +		if (!spin_trylock_irqsave(&rb->spinlock, flags))
> +			return NULL;
> +	} else {
> +		spin_lock_irqsave(&rb->spinlock, flags);
> +	}
> +
> +	prod_pos = rb->producer_pos;
> +	new_prod_pos = prod_pos + len;
> +
> +	/* check for out of ringbuf space by ensuring producer position
> +	 * doesn't advance more than (ringbuf_size - 1) ahead
> +	 */
> +	if (new_prod_pos - cons_pos > rb->mask) {
> +		spin_unlock_irqrestore(&rb->spinlock, flags);
> +		return NULL;
> +	}
> +
> +	meta_ptr = rb->data + (prod_pos & rb->mask);
> +	pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
> +
> +	WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
> +	WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);
> +
> +	/* ensure length prefix is written before updating producer positions */
> +	smp_wmb();
> +	WRITE_ONCE(rb->producer_pos, new_prod_pos);
> +
> +	spin_unlock_irqrestore(&rb->spinlock, flags);
> +
> +	return meta_ptr + RINGBUF_META_SZ;
> +}
> +
> +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
> +	.func		= bpf_ringbuf_reserve,
> +	.ret_type	= RET_PTR_TO_ALLOC_MEM_OR_NULL,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_CONST_ALLOC_SIZE_OR_ZERO,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +static void bpf_ringbuf_commit(void *sample, bool discard)
> +{
> +	unsigned long rec_pos, cons_pos;
> +	u32 new_meta, old_meta;
> +	void *meta_ptr;
> +	struct bpf_ringbuf *rb;
> +
> +	meta_ptr = sample - RINGBUF_META_SZ;
> +	rb = bpf_ringbuf_restore_from_rec(meta_ptr);
> +	old_meta = *(u32 *)meta_ptr;
> +	new_meta = old_meta ^ RINGBUF_BUSY_BIT;
> +	if (discard)
> +		new_meta |= RINGBUF_DISCARD_BIT;
> +
> +	/* update metadata header with correct final size prefix */
> +	xchg((u32 *)meta_ptr, new_meta);
> +
> +	/* if consumer caught up and is waiting for our record, notify about
> +	 * new data availability
> +	 */
> +	rec_pos = (void *)meta_ptr - (void *)rb->data;
> +	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
> +	if (cons_pos == rec_pos)
> +		wake_up_all(&rb->waitq);
> +}
> +
> +BPF_CALL_1(bpf_ringbuf_submit, void *, sample)
> +{
> +	bpf_ringbuf_commit(sample, false /* discard */);
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_submit_proto = {
> +	.func		= bpf_ringbuf_submit,
> +	.ret_type	= RET_VOID,
> +	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
> +};
> +
> +BPF_CALL_1(bpf_ringbuf_discard, void *, sample)
> +{
> +	bpf_ringbuf_commit(sample, true /* discard */);
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_discard_proto = {
> +	.func		= bpf_ringbuf_discard,
> +	.ret_type	= RET_VOID,
> +	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
> +};
> +
> +BPF_CALL_4(bpf_ringbuf_output, struct bpf_map *, map, void *, data, u64, size,
> +	   u64, flags)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +	void *rec;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	rec = __bpf_ringbuf_reserve(rb_map->rb, size);
> +	if (!rec)
> +		return -EAGAIN;
> +
> +	memcpy(rec, data, size);
> +	bpf_ringbuf_commit(rec, false /* discard */);
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_output_proto = {
> +	.func		= bpf_ringbuf_output,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_MEM,
> +	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
> +	.arg4_type	= ARG_ANYTHING,
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index de2a75500233..462db8595e9f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -26,6 +26,7 @@
>  #include <linux/audit.h>
>  #include <uapi/linux/btf.h>
>  #include <linux/bpf_lsm.h>
> +#include <linux/poll.h>
>  
>  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>  			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
> @@ -651,6 +652,16 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>  	return err;
>  }
>  
> +static __poll_t bpf_map_poll(struct file *filp, struct poll_table_struct *pts)
> +{
> +	struct bpf_map *map = filp->private_data;
> +
> +	if (map->ops->map_poll)
> +		return map->ops->map_poll(map, filp, pts);
> +
> +	return EPOLLERR;
> +}
> +
>  const struct file_operations bpf_map_fops = {
>  #ifdef CONFIG_PROC_FS
>  	.show_fdinfo	= bpf_map_show_fdinfo,
> @@ -659,6 +670,7 @@ const struct file_operations bpf_map_fops = {
>  	.read		= bpf_dummy_read,
>  	.write		= bpf_dummy_write,
>  	.mmap		= bpf_map_mmap,
> +	.poll		= bpf_map_poll,
>  };
>  
>  int bpf_map_new_fd(struct bpf_map *map, int flags)
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2a1826c76bb6..b8f0158d2327 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -233,6 +233,7 @@ struct bpf_call_arg_meta {
>  	bool pkt_access;
>  	int regno;
>  	int access_size;
> +	int mem_size;
>  	u64 msize_max_value;
>  	int ref_obj_id;
>  	int func_id;
> @@ -399,7 +400,8 @@ static bool reg_type_may_be_null(enum bpf_reg_type type)
>  	       type == PTR_TO_SOCKET_OR_NULL ||
>  	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
>  	       type == PTR_TO_TCP_SOCK_OR_NULL ||
> -	       type == PTR_TO_BTF_ID_OR_NULL;
> +	       type == PTR_TO_BTF_ID_OR_NULL ||
> +	       type == PTR_TO_MEM_OR_NULL;
>  }
>  
>  static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
> @@ -413,7 +415,9 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
>  	return type == PTR_TO_SOCKET ||
>  		type == PTR_TO_SOCKET_OR_NULL ||
>  		type == PTR_TO_TCP_SOCK ||
> -		type == PTR_TO_TCP_SOCK_OR_NULL;
> +		type == PTR_TO_TCP_SOCK_OR_NULL ||
> +		type == PTR_TO_MEM ||
> +		type == PTR_TO_MEM_OR_NULL;
>  }
>  
>  static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
> @@ -427,7 +431,9 @@ static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
>   */
>  static bool is_release_function(enum bpf_func_id func_id)
>  {
> -	return func_id == BPF_FUNC_sk_release;
> +	return func_id == BPF_FUNC_sk_release ||
> +	       func_id == BPF_FUNC_ringbuf_submit ||
> +	       func_id == BPF_FUNC_ringbuf_discard;
>  }
>  
>  static bool may_be_acquire_function(enum bpf_func_id func_id)
> @@ -435,7 +441,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
>  	return func_id == BPF_FUNC_sk_lookup_tcp ||
>  		func_id == BPF_FUNC_sk_lookup_udp ||
>  		func_id == BPF_FUNC_skc_lookup_tcp ||
> -		func_id == BPF_FUNC_map_lookup_elem;
> +		func_id == BPF_FUNC_map_lookup_elem ||
> +		func_id == BPF_FUNC_ringbuf_reserve;
>  }
>  
>  static bool is_acquire_function(enum bpf_func_id func_id,
> @@ -445,7 +452,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
>  
>  	if (func_id == BPF_FUNC_sk_lookup_tcp ||
>  	    func_id == BPF_FUNC_sk_lookup_udp ||
> -	    func_id == BPF_FUNC_skc_lookup_tcp)
> +	    func_id == BPF_FUNC_skc_lookup_tcp ||
> +	    func_id == BPF_FUNC_ringbuf_reserve)
>  		return true;
>  
>  	if (func_id == BPF_FUNC_map_lookup_elem &&
> @@ -485,6 +493,8 @@ static const char * const reg_type_str[] = {
>  	[PTR_TO_XDP_SOCK]	= "xdp_sock",
>  	[PTR_TO_BTF_ID]		= "ptr_",
>  	[PTR_TO_BTF_ID_OR_NULL]	= "ptr_or_null_",
> +	[PTR_TO_MEM]		= "mem",
> +	[PTR_TO_MEM_OR_NULL]	= "mem_or_null",
>  };
>  
>  static char slot_type_char[] = {
> @@ -2459,32 +2469,31 @@ static int check_map_access_type(struct bpf_verifier_env *env, u32 regno,
>  	return 0;
>  }
>  
> -/* check read/write into map element returned by bpf_map_lookup_elem() */
> -static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
> -			      int size, bool zero_size_allowed)
> +/* check read/write into memory region (e.g., map value, ringbuf sample, etc) */
> +static int __check_mem_access(struct bpf_verifier_env *env, int off,
> +			      int size, u32 mem_size, bool zero_size_allowed)
>  {
> -	struct bpf_reg_state *regs = cur_regs(env);
> -	struct bpf_map *map = regs[regno].map_ptr;
> +	bool size_ok = size > 0 || (size == 0 && zero_size_allowed);
>  
> -	if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
> -	    off + size > map->value_size) {
> -		verbose(env, "invalid access to map value, value_size=%d off=%d size=%d\n",
> -			map->value_size, off, size);
> -		return -EACCES;
> -	}
> -	return 0;
> +	if (off >= 0 && size_ok && off + size <= mem_size)
> +		return 0;
> +
> +	verbose(env, "invalid access to memory, mem_size=%u off=%d size=%d\n",
> +		mem_size, off, size);
> +	return -EACCES;
>  }
>  
> -/* check read/write into a map element with possible variable offset */
> -static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> -			    int off, int size, bool zero_size_allowed)
> +/* check read/write into a memory region with possible variable offset */
> +static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
> +				   int off, int size, u32 mem_size,
> +				   bool zero_size_allowed)
>  {
>  	struct bpf_verifier_state *vstate = env->cur_state;
>  	struct bpf_func_state *state = vstate->frame[vstate->curframe];
>  	struct bpf_reg_state *reg = &state->regs[regno];
>  	int err;
>  
> -	/* We may have adjusted the register to this map value, so we
> +	/* We may have adjusted the register pointing to memory region, so we
>  	 * need to try adding each of min_value and max_value to off
>  	 * to make sure our theoretical access will be safe.
>  	 */
> @@ -2501,14 +2510,14 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>  	    (reg->smin_value == S64_MIN ||
>  	     (off + reg->smin_value != (s64)(s32)(off + reg->smin_value)) ||
>  	      reg->smin_value + off < 0)) {
> -		verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
> +		verbose(env, "R%d min value is negative, either use unsigned index or do an if (index >=0) check.\n",
>  			regno);
>  		return -EACCES;
>  	}
> -	err = __check_map_access(env, regno, reg->smin_value + off, size,
> +	err = __check_mem_access(env, reg->smin_value + off, size, mem_size,
>  				 zero_size_allowed);
>  	if (err) {
> -		verbose(env, "R%d min value is outside of the array range\n",
> +		verbose(env, "R%d min value is outside of the memory region\n",
>  			regno);
>  		return err;
>  	}
> @@ -2518,18 +2527,38 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>  	 * If reg->umax_value + off could overflow, treat that as unbounded too.
>  	 */
>  	if (reg->umax_value >= BPF_MAX_VAR_OFF) {
> -		verbose(env, "R%d unbounded memory access, make sure to bounds check any array access into a map\n",
> +		verbose(env, "R%d unbounded memory access, make sure to bounds check any memory region access\n",
>  			regno);
>  		return -EACCES;
>  	}
> -	err = __check_map_access(env, regno, reg->umax_value + off, size,
> +	err = __check_mem_access(env, reg->umax_value + off, size, mem_size,
>  				 zero_size_allowed);
> -	if (err)
> -		verbose(env, "R%d max value is outside of the array range\n",
> +	if (err) {
> +		verbose(env, "R%d max value is outside of the memory region\n",
>  			regno);
> +		return err;
> +	}
> +
> +	return 0;
> +}
> +
> +/* check read/write into a map element with possible variable offset */
> +static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> +			    int off, int size, bool zero_size_allowed)
> +{
> +	struct bpf_verifier_state *vstate = env->cur_state;
> +	struct bpf_func_state *state = vstate->frame[vstate->curframe];
> +	struct bpf_reg_state *reg = &state->regs[regno];
> +	struct bpf_map *map = reg->map_ptr;
> +	int err;
> +
> +	err = check_mem_region_access(env, regno, off, size, map->value_size,
> +				      zero_size_allowed);
> +	if (err)
> +		return err;
>  
> -	if (map_value_has_spin_lock(reg->map_ptr)) {
> -		u32 lock = reg->map_ptr->spin_lock_off;
> +	if (map_value_has_spin_lock(map)) {
> +		u32 lock = map->spin_lock_off;
>  
>  		/* if any part of struct bpf_spin_lock can be touched by
>  		 * load/store reject this program.
> @@ -3211,6 +3240,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>  				mark_reg_unknown(env, regs, value_regno);
>  			}
>  		}
> +	} else if (reg->type == PTR_TO_MEM) {
> +		if (t == BPF_WRITE && value_regno >= 0 &&
> +		    is_pointer_value(env, value_regno)) {
> +			verbose(env, "R%d leaks addr into mem\n", value_regno);
> +			return -EACCES;
> +		}
> +		err = check_mem_region_access(env, regno, off, size,
> +					      reg->mem_size, false);
> +		if (!err && t == BPF_READ && value_regno >= 0)
> +			mark_reg_unknown(env, regs, value_regno);
>  	} else if (reg->type == PTR_TO_CTX) {
>  		enum bpf_reg_type reg_type = SCALAR_VALUE;
>  		u32 btf_id = 0;
> @@ -3548,6 +3587,10 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
>  			return -EACCES;
>  		return check_map_access(env, regno, reg->off, access_size,
>  					zero_size_allowed);
> +	case PTR_TO_MEM:
> +		return check_mem_region_access(env, regno, reg->off,
> +					       access_size, reg->mem_size,
> +					       zero_size_allowed);
>  	default: /* scalar_value|ptr_to_stack or invalid ptr */
>  		return check_stack_boundary(env, regno, access_size,
>  					    zero_size_allowed, meta);
> @@ -3652,6 +3695,17 @@ static bool arg_type_is_mem_size(enum bpf_arg_type type)
>  	       type == ARG_CONST_SIZE_OR_ZERO;
>  }
>  
> +static bool arg_type_is_alloc_mem_ptr(enum bpf_arg_type type)
> +{
> +	return type == ARG_PTR_TO_ALLOC_MEM ||
> +	       type == ARG_PTR_TO_ALLOC_MEM_OR_NULL;
> +}
> +
> +static bool arg_type_is_alloc_size(enum bpf_arg_type type)
> +{
> +	return type == ARG_CONST_ALLOC_SIZE_OR_ZERO;
> +}
> +
>  static bool arg_type_is_int_ptr(enum bpf_arg_type type)
>  {
>  	return type == ARG_PTR_TO_INT ||
> @@ -3711,7 +3765,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  			 type != expected_type)
>  			goto err_type;
>  	} else if (arg_type == ARG_CONST_SIZE ||
> -		   arg_type == ARG_CONST_SIZE_OR_ZERO) {
> +		   arg_type == ARG_CONST_SIZE_OR_ZERO ||
> +		   arg_type == ARG_CONST_ALLOC_SIZE_OR_ZERO) {
>  		expected_type = SCALAR_VALUE;
>  		if (type != expected_type)
>  			goto err_type;
> @@ -3782,13 +3837,29 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  		 * happens during stack boundary checking.
>  		 */
>  		if (register_is_null(reg) &&
> -		    arg_type == ARG_PTR_TO_MEM_OR_NULL)
> +		    (arg_type == ARG_PTR_TO_MEM_OR_NULL ||
> +		     arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL))
>  			/* final test in check_stack_boundary() */;
>  		else if (!type_is_pkt_pointer(type) &&
>  			 type != PTR_TO_MAP_VALUE &&
> +			 type != PTR_TO_MEM &&
>  			 type != expected_type)
>  			goto err_type;
>  		meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM;
> +	} else if (arg_type_is_alloc_mem_ptr(arg_type)) {
> +		expected_type = PTR_TO_MEM;
> +		if (register_is_null(reg) &&
> +		    arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL)
> +			/* final test in check_stack_boundary() */;
> +		else if (type != expected_type)
> +			goto err_type;
> +		if (meta->ref_obj_id) {
> +			verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n",
> +				regno, reg->ref_obj_id,
> +				meta->ref_obj_id);
> +			return -EFAULT;
> +		}
> +		meta->ref_obj_id = reg->ref_obj_id;
>  	} else if (arg_type_is_int_ptr(arg_type)) {
>  		expected_type = PTR_TO_STACK;
>  		if (!type_is_pkt_pointer(type) &&
> @@ -3884,6 +3955,13 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  					      zero_size_allowed, meta);
>  		if (!err)
>  			err = mark_chain_precision(env, regno);
> +	} else if (arg_type_is_alloc_size(arg_type)) {
> +		if (!tnum_is_const(reg->var_off)) {
> +			verbose(env, "R%d unbounded size, use 'var &= const' or 'if (var < const)'\n",
> +				regno);
> +			return -EACCES;
> +		}
> +		meta->mem_size = reg->var_off.value;
>  	} else if (arg_type_is_int_ptr(arg_type)) {
>  		int size = int_ptr_type_to_size(arg_type);
>  
> @@ -3920,6 +3998,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		    func_id != BPF_FUNC_xdp_output)
>  			goto error;
>  		break;
> +	case BPF_MAP_TYPE_RINGBUF:
> +		if (func_id != BPF_FUNC_ringbuf_output &&
> +		    func_id != BPF_FUNC_ringbuf_reserve &&
> +		    func_id != BPF_FUNC_ringbuf_submit &&
> +		    func_id != BPF_FUNC_ringbuf_discard)
> +			goto error;
> +		break;
>  	case BPF_MAP_TYPE_STACK_TRACE:
>  		if (func_id != BPF_FUNC_get_stackid)
>  			goto error;
> @@ -4644,6 +4729,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>  		mark_reg_known_zero(env, regs, BPF_REG_0);
>  		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
>  		regs[BPF_REG_0].id = ++env->id_gen;
> +	} else if (fn->ret_type == RET_PTR_TO_ALLOC_MEM_OR_NULL) {
> +		mark_reg_known_zero(env, regs, BPF_REG_0);
> +		regs[BPF_REG_0].type = PTR_TO_MEM_OR_NULL;
> +		regs[BPF_REG_0].id = ++env->id_gen;
> +		regs[BPF_REG_0].mem_size = meta.mem_size;
>  	} else {
>  		verbose(env, "unknown return type %d of func %s#%d\n",
>  			fn->ret_type, func_id_name(func_id), func_id);
> @@ -6583,6 +6673,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
>  			reg->type = PTR_TO_TCP_SOCK;
>  		} else if (reg->type == PTR_TO_BTF_ID_OR_NULL) {
>  			reg->type = PTR_TO_BTF_ID;
> +		} else if (reg->type == PTR_TO_MEM_OR_NULL) {
> +			reg->type = PTR_TO_MEM;
>  		}
>  		if (is_null) {
>  			/* We don't need id and ref_obj_id from this point
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d961428fb5b6..6e6b3f8f77c1 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1053,6 +1053,14 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_perf_event_read_value_proto;
>  	case BPF_FUNC_get_ns_current_pid_tgid:
>  		return &bpf_get_ns_current_pid_tgid_proto;
> +	case BPF_FUNC_ringbuf_output:
> +		return &bpf_ringbuf_output_proto;
> +	case BPF_FUNC_ringbuf_reserve:
> +		return &bpf_ringbuf_reserve_proto;
> +	case BPF_FUNC_ringbuf_submit:
> +		return &bpf_ringbuf_submit_proto;
> +	case BPF_FUNC_ringbuf_discard:
> +		return &bpf_ringbuf_discard_proto;
>  	default:
>  		return NULL;
>  	}
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index bfb31c1be219..ae2deb6a8afc 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -147,6 +147,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_SK_STORAGE,
>  	BPF_MAP_TYPE_DEVMAP_HASH,
>  	BPF_MAP_TYPE_STRUCT_OPS,
> +	BPF_MAP_TYPE_RINGBUF,
>  };
>  
>  /* Note that tracing related programs such as
> @@ -3121,6 +3122,32 @@ union bpf_attr {
>   * 		0 on success, or a negative error in case of failure:
>   *
>   *		**-EOVERFLOW** if an overflow happened: The same object will be tried again.
> + *
> + * int bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
> + * 	Description
> + * 		Copy *size* bytes from *data* into a ring buffer *ringbuf*.
> + * 	Return
> + * 		0, on success;
> + * 		< 0, on error.
> + *
> + * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
> + * 	Description
> + * 		Reserve *size* bytes of payload in a ring buffer *ringbuf*.
> + * 	Return
> + * 		Valid pointer with *size* bytes of memory available; NULL,
> + * 		otherwise.
> + *
> + * void bpf_ringbuf_submit(void *data)
> + * 	Description
> + * 		Submit reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
> + *
> + * void bpf_ringbuf_discard(void *data)
> + * 	Description
> + * 		Discard reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3250,7 +3277,11 @@ union bpf_attr {
>  	FN(sk_assign),			\
>  	FN(ktime_get_boot_ns),		\
>  	FN(seq_printf),			\
> -	FN(seq_write),
> +	FN(seq_write),			\
> +	FN(ringbuf_output),		\
> +	FN(ringbuf_reserve),		\
> +	FN(ringbuf_submit),		\
> +	FN(ringbuf_discard),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> -- 
> 2.24.1
> 
> 
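
The page-offset trick behind bpf_ringbuf_rec_pg_off() and
bpf_ringbuf_restore_from_rec() in the quoted patch can be tried out in plain
user-space C. The sketch below is only an illustration under stated
assumptions: it works on integer addresses instead of kernel pointers, it
assumes 4 KiB pages, and every sketch_* name is hypothetical rather than part
of the patch. It relies on the same precondition the kernel code appears to
rely on, namely that the ring buffer base is page-aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry (4 KiB pages); the kernel uses PAGE_SHIFT/PAGE_MASK. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Number of whole pages between the buffer base and the record header,
 * rounded down. This is the value a producer would store in the header.
 */
static uint32_t sketch_rec_pg_off(uintptr_t base, uintptr_t meta)
{
	return (uint32_t)((meta - base) >> SKETCH_PAGE_SHIFT);
}

/* Recover the buffer base from the record header address and the stored
 * page offset: round the header address down to its page, then step back
 * by the stored number of pages. The offset is widened before shifting so
 * the shift cannot overflow 32 bits.
 */
static uintptr_t sketch_restore_base(uintptr_t meta, uint32_t pg_off)
{
	uintptr_t off = (uintptr_t)pg_off << SKETCH_PAGE_SHIFT;

	return (meta & SKETCH_PAGE_MASK) - off;
}

/* Returns 1 if the base survives a store/restore round trip for a header
 * located byte_off bytes past a page-aligned base.
 */
static int sketch_round_trip(uintptr_t base, uintptr_t byte_off)
{
	uintptr_t meta = base + byte_off;

	assert((base & ~SKETCH_PAGE_MASK) == 0); /* base must be page-aligned */
	return sketch_restore_base(meta, sketch_rec_pg_off(base, meta)) == base;
}
```

For example, with a base of 0x10000 and a header at 0x13018, the stored page
offset is 3, and rounding 0x13018 down to 0x13000 and subtracting three pages
gives back 0x10000.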


* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
  2020-05-13 20:57   ` kbuild test robot
  2020-05-13 21:58   ` Alan Maguire
@ 2020-05-13 22:16   ` kbuild test robot
  2020-05-14 16:50   ` Jonathan Lemon
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2020-05-13 22:16 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: kbuild-all, andrii.nakryiko, kernel-team, Andrii Nakryiko,
	Paul E . McKenney, Jonathan Lemon

[-- Attachment #1: Type: text/plain, Size: 3739 bytes --]

Hi Andrii,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]
[also build test WARNING on next-20200512]
[cannot apply to bpf/master rcu/dev v5.7-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Andrii-Nakryiko/BPF-ring-buffer/20200514-032857
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: riscv-rv32_defconfig (attached as .config)
compiler: riscv32-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

In file included from include/linux/kernel.h:11,
from include/linux/list.h:9,
from include/linux/timer.h:5,
from include/linux/workqueue.h:9,
from include/linux/bpf.h:9,
from kernel/bpf/ringbuf.c:1:
kernel/bpf/ringbuf.c: In function 'bpf_ringbuf_commit':
include/linux/compiler.h:350:38: error: call to '__compiletime_assert_155' declared with attribute error: Need native word sized stores/loads for atomicity.
350 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
|                                      ^
include/linux/compiler.h:331:4: note: in definition of macro '__compiletime_assert'
331 |    prefix ## suffix();
|    ^~~~~~
include/linux/compiler.h:350:2: note: in expansion of macro '_compiletime_assert'
350 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
|  ^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:353:2: note: in expansion of macro 'compiletime_assert'
353 |  compiletime_assert(__native_word(t),
|  ^~~~~~~~~~~~~~~~~~
>> arch/riscv/include/asm/barrier.h:40:2: note: in expansion of macro 'compiletime_assert_atomic_type'
40 |  compiletime_assert_atomic_type(*p);
|  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/barrier.h:157:29: note: in expansion of macro '__smp_load_acquire'
157 | #define smp_load_acquire(p) __smp_load_acquire(p)
|                             ^~~~~~~~~~~~~~~~~~
kernel/bpf/ringbuf.c:354:13: note: in expansion of macro 'smp_load_acquire'
354 |  cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
|             ^~~~~~~~~~~~~~~~

vim +/compiletime_assert_atomic_type +40 arch/riscv/include/asm/barrier.h

8d235b174af5d0 Andrea Parri 2018-02-27  36  
8d235b174af5d0 Andrea Parri 2018-02-27  37  #define __smp_load_acquire(p)						\
8d235b174af5d0 Andrea Parri 2018-02-27  38  ({									\
8d235b174af5d0 Andrea Parri 2018-02-27  39  	typeof(*p) ___p1 = READ_ONCE(*p);				\
8d235b174af5d0 Andrea Parri 2018-02-27 @40  	compiletime_assert_atomic_type(*p);				\
8d235b174af5d0 Andrea Parri 2018-02-27  41  	RISCV_FENCE(r,rw);						\
8d235b174af5d0 Andrea Parri 2018-02-27  42  	___p1;								\
8d235b174af5d0 Andrea Parri 2018-02-27  43  })
8d235b174af5d0 Andrea Parri 2018-02-27  44  

:::::: The code at line 40 was first introduced by commit
:::::: 8d235b174af5d0af35ff206c15041fc2b02a0993 riscv/barrier: Define __smp_{store_release,load_acquire}

:::::: TO: Andrea Parri <parri.andrea@gmail.com>
:::::: CC: Palmer Dabbelt <palmer@sifive.com>

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 18817 bytes --]


* Re: [PATCH bpf-next 0/6] BPF ring buffer
  2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
                   ` (5 preceding siblings ...)
  2020-05-13 19:25 ` [PATCH bpf-next 6/6] bpf: add BPF ringbuf and perf buffer benchmarks Andrii Nakryiko
@ 2020-05-13 22:49 ` Jonathan Lemon
  2020-05-14  6:08   ` Andrii Nakryiko
  6 siblings, 1 reply; 32+ messages in thread
From: Jonathan Lemon @ 2020-05-13 22:49 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Paul E . McKenney, Jonathan Lemon

On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> It presents an alternative to perf buffer, following its semantics closely,
> but allowing sharing same instance of ring buffer across multiple CPUs
> efficiently.
> 
> Most patches have extensive commentary explaining various aspects, so I'll
> keep cover letter short. Overall structure of the patch set:
> - patch #1 adds BPF ring buffer implementation to kernel and necessary
>   verifier support;
> - patch #2 adds litmus tests validating all the memory orderings and locking
>   is correct;
> - patch #3 is an optional patch that generalizes verifier's reference tracking
>   machinery to capture type of reference;
> - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> - patch #5 adds selftest, both for single BPF ring buf use case, as well as
>   using it with array/hash of maps;
> - patch #6 adds extensive benchmarks and provides some analysis in commit
>   message; it builds upon selftests/bpf's bench runner.
> 
>   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> 
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>

Looks very nice!  A few random questions:

1) Why not use a structure for the header, instead of 2 32bit ints?

2) Would it make sense to reserve X bytes, but only commit Y?
   The offset field could be used to write the record length.

   E.g.:
      reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
      commit  400 bytes    [ 512 ] [ 400 ]

3) Why have 2 separate pages for producer/consumer, instead of
   just aligning to an SMP cache line (or even 1/2 page)?

4) The XOR of busybit makes me wonder if there is anything that
   prevents the system from calling commit twice?
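
   To make the concern in 4) concrete, here is a tiny user-space sketch
   (plain C, not kernel code). It assumes the busy bit is bit 31 of the
   32-bit length word, which is a guess; the real value of RINGBUF_BUSY_BIT
   is defined elsewhere in the patch. All sketch_* names are hypothetical.
   It contrasts XOR, which toggles the bit, with AND-NOT, which is
   idempotent.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed position of the busy bit in the 32-bit length word. */
#define SKETCH_BUSY_BIT (1U << 31)

/* Header word as written at reserve time: busy bit plus payload size. */
static uint32_t sketch_reserve_hdr(uint32_t size)
{
	assert(!(size & SKETCH_BUSY_BIT)); /* size must not collide with the flag */
	return SKETCH_BUSY_BIT | size;
}

/* Commit by toggling the busy bit, as the XOR in the patch does. */
static uint32_t sketch_commit_xor(uint32_t hdr)
{
	return hdr ^ SKETCH_BUSY_BIT;
}

/* Alternative commit that clears the bit; repeating it changes nothing. */
static uint32_t sketch_commit_clear(uint32_t hdr)
{
	return hdr & ~SKETCH_BUSY_BIT;
}

static int sketch_is_busy(uint32_t hdr)
{
	return (hdr & SKETCH_BUSY_BIT) != 0;
}
```

   With XOR, a second commit of the same 512-byte record would turn the
   header back into "busy, 512 bytes"; with AND-NOT it would stay committed.
   Whether a second commit can happen at all depends on the verifier, which
   this sketch does not model.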
--
Jonathan


* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 21:58   ` Alan Maguire
@ 2020-05-14  5:59     ` Andrii Nakryiko
  2020-05-14 22:25       ` Alan Maguire
  0 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14  5:59 UTC (permalink / raw)
  To: Alan Maguire
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney, Jonathan Lemon

On Wed, May 13, 2020 at 2:59 PM Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On Wed, 13 May 2020, Andrii Nakryiko wrote:
>
> > > This commit adds a new MPSC ring buffer implementation to the BPF ecosystem,
> > > which allows multiple CPUs to submit data to a single shared ring buffer. On
> > > the consumption side, only a single consumer is assumed.
> >
> > Motivation
> > ----------
> > There are two distinctive motivators for this work, which are not satisfied by
> > existing perf buffer, which prompted creation of a new ring buffer
> > implementation.
> >   - more efficient memory utilization by sharing ring buffer across CPUs;
> >   - preserving ordering of events that happen sequentially in time, even
> >   across multiple CPUs (e.g., fork/exec/exit events for a task).
> >
> > These two problems are independent, but perf buffer fails to satisfy both.
> > Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
> > also solved by having an MPSC implementation of ring buffer. The ordering
> > problem could technically be solved for perf buffer with some in-kernel
> > counting, but given the first one requires an MPSC buffer, the same solution
> > would solve the second problem automatically.
> >
>
> This looks great Andrii! One potentially interesting side-effect of
> the way this is implemented is that it could (I think) support speculative
> tracing.
>
> Say I want to record some tracing info when I enter function foo(), but
> I only care about cases where that function later returns an error value.
> I _think_ your implementation could support that via a scheme like
> this:
>
> - attach a kprobe program to record the data via bpf_ringbuf_reserve(),
>   and store the reserved pointer value in a per-task keyed hashmap.
>   Then record the values of interest in the reserved space. This is our
>   speculative data as we don't know whether we want to commit it yet.
>
> - attach a kretprobe program that picks up our reserved pointer and
>   commit()s or discard()s the associated data based on the return value.
>
> - the consumer should (I think) then only read the committed data, so in
>   this case just the data of interest associated with the failure case.
>
> I'm curious if that sort of ringbuf access pattern across multiple
> programs would work? Thanks!


Right now it's not allowed. Similar to spin locks and socket references,
the verifier enforces that a reserved record is committed or discarded
within the same BPF program invocation. Technically, nothing prevents
us from relaxing this and allowing this pointer to be stored in a map,
but that's probably way too dangerous and not necessary for most common
cases.

But all your troubles here are due to using a kprobe+kretprobe pair.
What I think should solve your problem is a single fexit program. It
can read both the input arguments *and* the return value of the traced
function. So there won't be any need for an additional map to store
speculative data (and no speculation either, because you'll know up
front whether you need to capture data at all). Does this work for
your case?

>
> Alan
>

[...]

no one seems to like trimming emails ;)

* Re: [PATCH bpf-next 0/6] BPF ring buffer
  2020-05-13 22:49 ` [PATCH bpf-next 0/6] BPF ring buffer Jonathan Lemon
@ 2020-05-14  6:08   ` Andrii Nakryiko
  2020-05-14 16:30     ` Jonathan Lemon
  0 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14  6:08 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney

On Wed, May 13, 2020 at 3:49 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> > Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> > It presents an alternative to perf buffer, following its semantics closely,
> > but allowing sharing same instance of ring buffer across multiple CPUs
> > efficiently.
> >
> > Most patches have extensive commentary explaining various aspects, so I'll
> > keep cover letter short. Overall structure of the patch set:
> > - patch #1 adds BPF ring buffer implementation to kernel and necessary
> >   verifier support;
> > - patch #2 adds litmus tests validating all the memory orderings and locking
> >   is correct;
> > - patch #3 is an optional patch that generalizes verifier's reference tracking
> >   machinery to capture type of reference;
> > - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> > - path #5 adds selftest, both for single BPF ring buf use case, as well as
> >   using it with array/hash of maps;
> > - patch #6 adds extensive benchmarks and provide some analysis in commit
> >   message, it build upon selftests/bpf's bench runner.
> >
> >   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> >
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
>
> Looks very nice!  A few random questions:
>
> 1) Why not use a structure for the header, instead of 2 32bit ints?

hm... no reason, just never occurred to me it's necessary :)

>
> 2) Would it make sense to reserve X bytes, but only commit Y?
>    the offset field could be used to write the record length.
>
>    E.g.:
>       reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
>       commit  400 bytes    [ 512 ] [ 400 ]

It could be done, though I had tentative plans to use those second 4
bytes for something useful eventually.

But what's the use case? From ring buffer's perspective, X bytes were
reserved and are gone already and subsequent writers might have
already advanced producer counter with the assumption that all X bytes
are going to be used. So there are no space savings, even if record is
discarded or only portion of it is submitted. I can only see a bit of
added convenience for an application, because it doesn't have to track
amount of actual data in its record. But this doesn't seem to be a
common case either, so not sure how it's worth supporting... Is there
a particular case where this is extremely useful and extra 4 bytes in
record payload is too much?

>
> 3) Why have 2 separate pages for producer/consumer, instead of
>    just aligning to a smp cache line (or even 1/2 page?)

Access rights restrictions. Consumer page is readable/writable,
producer page is read-only for user-space. If user-space had the ability
to write producer position, it could wreak havoc on the ringbuf
algorithm.

>
> 4) The XOR of busybit makes me wonder if there is anything that
>    prevents the system from calling commit twice?

Yes, verifier checks this and will reject such BPF program.

> --
> Jonathan

* Re: [PATCH bpf-next 0/6] BPF ring buffer
  2020-05-14  6:08   ` Andrii Nakryiko
@ 2020-05-14 16:30     ` Jonathan Lemon
  2020-05-14 20:06       ` Andrii Nakryiko
  0 siblings, 1 reply; 32+ messages in thread
From: Jonathan Lemon @ 2020-05-14 16:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jonathan Lemon, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Paul E . McKenney

On Wed, May 13, 2020 at 11:08:46PM -0700, Andrii Nakryiko wrote:
> On Wed, May 13, 2020 at 3:49 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> >
> > On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> > > Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> > > It presents an alternative to perf buffer, following its semantics closely,
> > > but allowing sharing same instance of ring buffer across multiple CPUs
> > > efficiently.
> > >
> > > Most patches have extensive commentary explaining various aspects, so I'll
> > > keep cover letter short. Overall structure of the patch set:
> > > - patch #1 adds BPF ring buffer implementation to kernel and necessary
> > >   verifier support;
> > > - patch #2 adds litmus tests validating all the memory orderings and locking
> > >   is correct;
> > > - patch #3 is an optional patch that generalizes verifier's reference tracking
> > >   machinery to capture type of reference;
> > > - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> > > - path #5 adds selftest, both for single BPF ring buf use case, as well as
> > >   using it with array/hash of maps;
> > > - patch #6 adds extensive benchmarks and provide some analysis in commit
> > >   message, it build upon selftests/bpf's bench runner.
> > >
> > >   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> > >
> > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> >
> > Looks very nice!  A few random questions:
> >
> > 1) Why not use a structure for the header, instead of 2 32bit ints?
> 
> hm... no reason, just never occurred to me it's necessary :)

It might be clearer to do this.  Something like:

struct ringbuf_record {
    union {
        struct {
            u32 size:30;
            bool busy:1;
            bool discard:1;
        };
        u32 word1;
    };
    union {
        u32 pgoff;
        u32 word2;
    };
};

While perhaps a bit overkill, makes it clear what is going on.
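
For comparison, the same first header word can be handled with plain masks
instead of bitfields. This is only a userspace sketch: the names below
(REC_BUSY_BIT, rec_hdr_reserve(), and so on) are made up for illustration, and
the bit positions are an assumption, since the thread only establishes that
two of the 32 bits are flags and the remaining 30 hold the length.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions: the thread only says two bits are reserved. */
#define REC_BUSY_BIT    (1u << 31)
#define REC_DISCARD_BIT (1u << 30)
#define REC_LEN_MASK    (~(REC_BUSY_BIT | REC_DISCARD_BIT))

/* Largest length that fits in the remaining 30 bits. */
#define REC_MAX_LEN     REC_LEN_MASK

/* Header word written at reservation time: busy set, length recorded. */
static inline uint32_t rec_hdr_reserve(uint32_t len)
{
	return REC_BUSY_BIT | (len & REC_LEN_MASK);
}

/* Header word after commit: busy cleared, discard optionally set. */
static inline uint32_t rec_hdr_commit(uint32_t hdr, bool discard)
{
	uint32_t len = hdr & REC_LEN_MASK;

	return discard ? (len | REC_DISCARD_BIT) : len;
}

static inline bool rec_is_busy(uint32_t hdr)
{
	return (hdr & REC_BUSY_BIT) != 0;
}

static inline bool rec_is_discarded(uint32_t hdr)
{
	return (hdr & REC_DISCARD_BIT) != 0;
}

static inline uint32_t rec_len(uint32_t hdr)
{
	return hdr & REC_LEN_MASK;
}
```

Masks keep the in-memory layout independent of compiler bitfield ordering,
which matters here because user-space reads the same word.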


> > 2) Would it make sense to reserve X bytes, but only commit Y?
> >    the offset field could be used to write the record length.
> >
> >    E.g.:
> >       reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
> >       commit  400 bytes    [ 512 ] [ 400 ]
> 
> It could be done, though I had tentative plans to use those second 4
> bytes for something useful eventually.
> 
> But what's the use case? From ring buffer's perspective, X bytes were
> reserved and are gone already and subsequent writers might have
> already advanced producer counter with the assumption that all X bytes
> are going to be used. So there are no space savings, even if record is
> discarded or only portion of it is submitted. I can only see a bit of
> added convenience for an application, because it doesn't have to track
> amount of actual data in its record. But this doesn't seem to be a
> common case either, so not sure how it's worth supporting... Is there
> a particular case where this is extremely useful and extra 4 bytes in
> record payload is too much?

Not off the top of my head - it was just the first thing that came to
mind when reading about the commit/discard paradigm.  I was thinking
about variable records, where the maximum is reserved, but less data
is written.  But there's no particular reason for the ringbuffer to
track this either, it could be part of the application framing.
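
A minimal sketch of what such application framing could look like, assuming
the record is always reserved at its maximum size and carries its own length
field. The struct, the 512-byte limit and the helper names are hypothetical
and not part of the patch set.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EVENT_MAX_PAYLOAD 512	/* hypothetical maximum, reserved in full */

/* Hypothetical record layout: the application frames its own data. */
struct event {
	uint32_t len;				/* payload bytes actually used */
	uint8_t  payload[EVENT_MAX_PAYLOAD];
};

/* Fill a record that was reserved at its maximum size.
 * Returns the number of payload bytes stored; oversized input is truncated.
 */
static inline size_t event_fill(struct event *ev, const void *data, size_t n)
{
	if (n > EVENT_MAX_PAYLOAD)
		n = EVENT_MAX_PAYLOAD;
	memcpy(ev->payload, data, n);
	ev->len = (uint32_t)n;
	return n;
}

/* Consumer side: only the header plus ev->len payload bytes are meaningful. */
static inline size_t event_used_bytes(const struct event *ev)
{
	return offsetof(struct event, payload) + ev->len;
}
```

The ring buffer still accounts for the full reservation; the length field only
tells the consumer how much of it to look at.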


> > 3) Why have 2 separate pages for producer/consumer, instead of
> >    just aligning to a smp cache line (or even 1/2 page?)
> 
> Access rights restrictions. Consumer page is readable/writable,
> producer page is read-only for user-space. If user-space had the ability
> to write producer position, it could wreak havoc on the ringbuf
> algorithm.

Ah, thanks, that makes sense.  Might want to add a comment to
that effect, as it's different from other implementations.
-- 
Jonathan

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
                     ` (2 preceding siblings ...)
  2020-05-13 22:16   ` kbuild test robot
@ 2020-05-14 16:50   ` Jonathan Lemon
  2020-05-14 20:11     ` Andrii Nakryiko
  2020-05-14 17:33   ` sdf
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Jonathan Lemon @ 2020-05-14 16:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Paul E . McKenney, Jonathan Lemon

On Wed, May 13, 2020 at 12:25:27PM -0700, Andrii Nakryiko wrote:
> +static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
> +{
> +	unsigned long addr = (unsigned long)meta_ptr;
> +	unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;
> +
> +	return (void*)((addr & PAGE_MASK) - off);
> +}
> +
> +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> +{
> +	unsigned long cons_pos, prod_pos, new_prod_pos, flags;
> +	u32 len, pg_off;
> +	void *meta_ptr;
> +
> +	if (unlikely(size > UINT_MAX))
> +		return NULL;

Size should be 30 bits, not UINT_MAX, since 2 bits are reserved.

> +
> +	len = round_up(size + RINGBUF_META_SZ, 8);
> +	cons_pos = READ_ONCE(rb->consumer_pos);
> +
> +	if (in_nmi()) {
> +		if (!spin_trylock_irqsave(&rb->spinlock, flags))
> +			return NULL;
> +	} else {
> +		spin_lock_irqsave(&rb->spinlock, flags);
> +	}
> +
> +	prod_pos = rb->producer_pos;
> +	new_prod_pos = prod_pos + len;
> +
> +	/* check for out of ringbuf space by ensuring producer position
> +	 * doesn't advance more than (ringbuf_size - 1) ahead
> +	 */
> +	if (new_prod_pos - cons_pos > rb->mask) {
> +		spin_unlock_irqrestore(&rb->spinlock, flags);
> +		return NULL;
> +	}
> +
> +	meta_ptr = rb->data + (prod_pos & rb->mask);
> +	pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
> +
> +	WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
> +	WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);

Or define a 64bit word in the structure and use:

        WRITE_ONCE(*(u64 *)meta_ptr, rec.header);


> +
> +	/* ensure length prefix is written before updating producer positions */
> +	smp_wmb();
> +	WRITE_ONCE(rb->producer_pos, new_prod_pos);
> +
> +	spin_unlock_irqrestore(&rb->spinlock, flags);
> +
> +	return meta_ptr + RINGBUF_META_SZ;
> +}
> +
> +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
> +}
> +

--
Jonathan

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
                     ` (3 preceding siblings ...)
  2020-05-14 16:50   ` Jonathan Lemon
@ 2020-05-14 17:33   ` sdf
  2020-05-14 20:18     ` Andrii Nakryiko
  2020-05-14 19:06   ` Alexei Starovoitov
  2020-05-14 19:18     ` Jakub Kicinski
  6 siblings, 1 reply; 32+ messages in thread
From: sdf @ 2020-05-14 17:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Paul E . McKenney, Jonathan Lemon

On 05/13, Andrii Nakryiko wrote:
> This commit adds a new MPSC ring buffer implementation into BPF ecosystem,
> which allows multiple CPUs to submit data to a single shared ring buffer. On
> the consumption side, only single consumer is assumed.

> Motivation
> ----------
> There are two distinctive motivators for this work, which are not satisfied by
> existing perf buffer, which prompted creation of a new ring buffer
> implementation.
>    - more efficient memory utilization by sharing ring buffer across CPUs;
>    - preserving ordering of events that happen sequentially in time, even
>    across multiple CPUs (e.g., fork/exec/exit events for a task).

> These two problems are independent, but perf buffer fails to satisfy both.
> Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
> also solved by having an MPSC implementation of ring buffer. The ordering
> problem could technically be solved for perf buffer with some in-kernel
> counting, but given the first one requires an MPSC buffer, the same solution
> would solve the second problem automatically.

> Semantics and APIs
> ------------------
> Single ring buffer is presented to BPF programs as an instance of BPF map of
> type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but
> ultimately rejected.

> One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, make
> BPF_MAP_TYPE_RINGBUF represent an array of ring buffers, but not enforce
> "same CPU only" rule. This would be more familiar interface compatible with
> existing perf buffer use in BPF, but would fail if application needed more
> advanced logic to lookup ring buffer by arbitrary key. HASH_OF_MAPS addresses
> this with current approach. Additionally, given the performance of BPF
> ringbuf, many use cases would just opt into a simple single ring buffer shared
> among all CPUs, for which current approach would be an overkill.

> Another approach could introduce a new concept, alongside BPF map, to
> represent generic "container" object, which doesn't necessarily have key/value
> interface with lookup/update/delete operations. This approach would add a lot
> of extra infrastructure that has to be built for observability and verifier
> support. It would also add another concept that BPF developers would have to
> familiarize themselves with, new syntax in libbpf, etc. But then it would
> really provide no additional benefits over the approach of using a map.
> BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but
> neither do a few other map types (e.g., queue and stack; array doesn't support
> delete, etc).

> The approach chosen has an advantage of re-using existing BPF map
> infrastructure (introspection APIs in kernel, libbpf support, etc), being
> familiar concept (no need to teach users a new type of object in BPF program),
> and utilizing existing tooling (bpftool). For common scenario of using
> a single ring buffer for all CPUs, it's as simple and straightforward, as
> would be with a dedicated "container" object. On the other hand, by being
> a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
> implement a wide variety of topologies, from one ring buffer for each CPU
> (e.g., as a replacement for perf buffer use cases), to a complicated
> application hashing/sharding of ring buffers (e.g., having a small pool of
> ring buffers with hashed task's tgid being a look up key to preserve order,
> but reduce contention).

> Key and value sizes are enforced to be zero. max_entries is used to specify
> the size of ring buffer and has to be a power of 2 value.

> There are a bunch of similarities between perf buffer
> (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
>    - variable-length records;
>    - if there is no more space left in ring buffer, reservation fails, no
>      blocking;
>    - memory-mappable data area for user-space applications for ease of
>      consumption and high performance;
>    - epoll notifications for new incoming data;
>    - but still the ability to do busy polling for new data to achieve the
>      lowest latency, if necessary.

> BPF ringbuf provides two sets of APIs to BPF programs:
>    - bpf_ringbuf_output() allows to *copy* data from one place to a ring
>      buffer, similarly to bpf_perf_event_output();
>    - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
>      split the whole process into two steps. First, a fixed amount of space is
>      reserved. If successful, a pointer to a data inside ring buffer data area
>      is returned, which BPF programs can use similarly to a data inside
>      array/hash maps. Once ready, this piece of memory is either committed or
>      discarded. Discard is similar to commit, but makes consumer ignore the
>      record.

> bpf_ringbuf_output() has disadvantage of incurring extra memory copy, because
> record has to be prepared in some other place first. But it allows to submit
> records of the length that's not known to verifier beforehand. It also closely
> matches bpf_perf_event_output(), so will simplify migration significantly.

> bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory
> pointer directly to ring buffer memory. In a lot of cases records are larger
> than BPF stack space allows, so many programs have to use an extra per-CPU
> array as a temporary heap for preparing sample. bpf_ringbuf_reserve() avoids
> this need completely. But in exchange, it only allows a known constant size of
> memory to be reserved, such that verifier can verify that BPF program can't
> access memory outside its reserved record space. bpf_ringbuf_output(), while
> slightly slower due to extra memory copy, covers some use cases that are not
> suitable for bpf_ringbuf_reserve().

> The difference between commit and discard is very small. Discard just marks
> a record as discarded, and such records are supposed to be ignored by consumer
> code. Discard is useful for some advanced use-cases, such as ensuring
> all-or-nothing multi-record submission, or emulating temporary malloc()/free()
> within single BPF program invocation.

> Each reserved record is tracked by verifier through existing
> reference-tracking logic, similar to socket ref-tracking. It is thus
> impossible to reserve a record, but forget to submit (or discard) it.

> Design and implementation
> -------------------------
> This reserve/commit schema allows a natural way for multiple producers, either
> on different CPUs or even on the same CPU/in the same BPF program, to reserve
> independent records and work with them without blocking other producers. This
> means that if BPF program was interrupted by another BPF program sharing the
> same ring buffer, they will both get a record reserved (provided there is
> enough space left) and can work with it and submit it independently. This
> applies to NMI context as well, except that due to using a spinlock during
> reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock,
> in which case reservation will fail even if ring buffer is not full.

> The ring buffer itself internally is implemented as a power-of-2 sized
> circular buffer, with two logical and ever-increasing counters (which might
> wrap around on 32-bit architectures, that's not a problem):
>    - consumer counter shows up to which logical position consumer consumed the
>      data;
>    - producer counter denotes amount of data reserved by all producers.
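
Why wrap-around of the counters is harmless can be illustrated with a small
userspace model. It mirrors the space check quoted from
__bpf_ringbuf_reserve() earlier in this thread (new_prod_pos - cons_pos >
rb->mask); the struct and function names are invented for illustration, and
32-bit counters are used on purpose so that the wrap is easy to exercise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only: free-running positions plus a power-of-2 mask. */
struct rb_model {
	uint32_t consumer_pos;	/* how far the consumer has read */
	uint32_t producer_pos;	/* how much has been reserved in total */
	uint32_t mask;		/* ring size - 1, ring size is a power of 2 */
};

/* Try to reserve len bytes; on success store the data-area offset in *off. */
static inline bool rb_model_reserve(struct rb_model *rb, uint32_t len,
				    uint32_t *off)
{
	uint32_t new_prod_pos = rb->producer_pos + len;

	/* Unsigned subtraction stays correct after the counters wrap. */
	if (new_prod_pos - rb->consumer_pos > rb->mask)
		return false;

	*off = rb->producer_pos & rb->mask;
	rb->producer_pos = new_prod_pos;
	return true;
}
```

Only the difference between the two counters is ever compared against the
ring size, so their absolute values do not matter.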

> Each time a record is reserved, producer that "owns" the record will
> successfully advance producer counter. At that point, data is still not yet
> ready to be consumed, though. Each record has 8 byte header, which contains
> the length of reserved record, as well as two extra bits: busy bit to denote
> that record is still being worked on, and discard bit, which might be set at
> commit time if record is discarded. In the latter case, consumer is supposed
> to skip the record and move on to the next one. Record header also encodes
> record's relative offset from the beginning of ring buffer data area (in
> pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only
> the pointer to the record itself, without requiring also the pointer to ring
> buffer itself. Ring buffer memory location will be restored from record
> metadata header. This significantly simplifies verifier, as well as improving
> API usability.

> Producer counter increments are serialized under spinlock, so there is
> a strict ordering between reservations. Commits, on the other hand, are
> completely lockless and independent. All records become available to consumer
> in the order of reservations, but only after all previous records were
> already committed. It is thus possible for slow producers to temporarily hold
> off submitted records, that were reserved later.

> Reservation/commit/consumer protocol is verified by litmus tests in the later
> patch in this series.

> One interesting implementation bit, that significantly simplifies (and thus
> speeds up as well) implementation of both producers and consumers is how data
> area is mapped twice contiguously back-to-back in the virtual memory. This
> allows to not take any special measures for samples that have to wrap around
> at the end of the circular buffer data area, because the next page after the
> last data page would be first data page again, and thus the sample will still
> appear completely contiguous in virtual memory. See comment and a simple ASCII
> diagram showing this visually in bpf_ringbuf_area_alloc().

> Another feature that distinguishes BPF ringbuf from perf ring buffer is
> self-pacing notification of new data availability.
> bpf_ringbuf_commit() implementation will send a notification of new record
> being available after commit only if consumer has already caught up right up
> to the record being committed. If not, consumer still has to catch up and thus
> will see new data anyways without needing an extra poll notification. As will
> be shown in benchmarks in later patch in the series, this allows to achieve
> a very high throughput without having to resort to tricks like "notify only
> every Nth sample", like with perf buffer, to achieve good throughput
> performance.
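
A minimal sketch of the self-pacing decision described above, under the
assumption that "caught up" means the consumer position equals the start
position of the record being committed. Both function names are hypothetical;
the actual check lives in the kernel-side commit path and may differ in
detail.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wake the consumer only if it has consumed everything before this record. */
static inline bool should_notify(uint64_t consumer_pos, uint64_t rec_pos)
{
	return consumer_pos == rec_pos;
}

/* Count wake-ups for n back-to-back commits of rec_len bytes each, starting
 * at position 0, while the consumer position stays fixed at consumer_pos.
 */
static inline unsigned int count_wakeups(uint64_t consumer_pos, unsigned int n,
					 uint64_t rec_len)
{
	unsigned int wakeups = 0;

	for (unsigned int i = 0; i < n; i++)
		if (should_notify(consumer_pos, (uint64_t)i * rec_len))
			wakeups++;
	return wakeups;
}
```

In this model a consumer that has not moved from position 0 is woken once for
a burst of commits rather than once per record, which is the effect the
paragraph above describes.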

> For performance evaluation against perf buffer and scalability limits, see
> patch later in the series, adding ring buffers benchmark.

> Comparison to alternatives
> --------------------------
> Before considering implementing BPF ring buffer from scratch existing
> alternatives in kernel were evaluated, but didn't seem to meet the needs. They
> largely fell into few categories:
>    - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
>      outlined above (ordering and memory consumption);
>    - linked list-based implementations; while some were multi-producer designs,
>      consuming these from user-space would be very complicated and most
>      probably not performant; memory-mapping contiguous piece of memory is
>      simpler and more performant for user-space consumers;
>    - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
>      SPSC queue into MPSC w/ lock would have subpar performance compared to
>      locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
>      elements would be too limiting for BPF programs, given existing BPF
>      programs heavily rely on variable-sized perf buffer already;
>    - specialized implementations (like a new printk ring buffer, [0]) with lots
>      of printk-specific limitations and implications, that didn't seem to fit
>      well for intended use with BPF programs.
That's a very nice write up! Does it make sense to put most of it
under Documentation/bpf? We were discussing socket storage with KP
recently and I mentioned that commit 6ac99e8f23d4 has a really nice
description of the architecture with ascii diagrams/etc. Sometimes
it's really hard to chase down the commit history to find out
these sorts of details.

>    [0] https://lwn.net/Articles/779550/

> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>   include/linux/bpf.h            |  12 +
>   include/linux/bpf_types.h      |   1 +
>   include/linux/bpf_verifier.h   |   4 +
>   include/uapi/linux/bpf.h       |  33 ++-
>   kernel/bpf/Makefile            |   2 +-
>   kernel/bpf/helpers.c           |   8 +
>   kernel/bpf/ringbuf.c           | 409 +++++++++++++++++++++++++++++++++
>   kernel/bpf/syscall.c           |  12 +
>   kernel/bpf/verifier.c          | 156 ++++++++++---
>   kernel/trace/bpf_trace.c       |   8 +
>   tools/include/uapi/linux/bpf.h |  33 ++-
>   11 files changed, 643 insertions(+), 35 deletions(-)
>   create mode 100644 kernel/bpf/ringbuf.c

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index cf4b6e44f2bc..9e3da01f3e9b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -89,6 +89,8 @@ struct bpf_map_ops {
>   	int (*map_direct_value_meta)(const struct bpf_map *map,
>   				     u64 imm, u32 *off);
>   	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
> +	__poll_t (*map_poll)(struct bpf_map *map, struct file *filp,
> +			     struct poll_table_struct *pts);
>   };

>   struct bpf_map_memory {
> @@ -243,6 +245,9 @@ enum bpf_arg_type {
>   	ARG_PTR_TO_LONG,	/* pointer to long */
>   	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
>   	ARG_PTR_TO_BTF_ID,	/* pointer to in-kernel struct */
> +	ARG_PTR_TO_ALLOC_MEM,	/* pointer to dynamically allocated memory */
> +	ARG_PTR_TO_ALLOC_MEM_OR_NULL,	/* pointer to dynamically allocated memory or NULL */
> +	ARG_CONST_ALLOC_SIZE_OR_ZERO,	/* number of allocated bytes requested */
>   };

>   /* type of values returned from helper functions */
> @@ -254,6 +259,7 @@ enum bpf_return_type {
>   	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
>   	RET_PTR_TO_TCP_SOCK_OR_NULL,	/* returns a pointer to a tcp_sock or NULL */
>   	RET_PTR_TO_SOCK_COMMON_OR_NULL,	/* returns a pointer to a sock_common or NULL */
> +	RET_PTR_TO_ALLOC_MEM_OR_NULL,	/* returns a pointer to dynamically allocated memory or NULL */
>   };

>   /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
> @@ -321,6 +327,8 @@ enum bpf_reg_type {
>   	PTR_TO_XDP_SOCK,	 /* reg points to struct xdp_sock */
>   	PTR_TO_BTF_ID,		 /* reg points to kernel struct */
>   	PTR_TO_BTF_ID_OR_NULL,	 /* reg points to kernel struct or NULL */
> +	PTR_TO_MEM,		 /* reg points to valid memory region */
> +	PTR_TO_MEM_OR_NULL,	 /* reg points to valid memory region or NULL */
>   };

>   /* The information passed from prog-specific *_is_valid_access
> @@ -1585,6 +1593,10 @@ extern const struct bpf_func_proto bpf_tcp_sock_proto;
>   extern const struct bpf_func_proto bpf_jiffies64_proto;
>   extern const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto;
>   extern const struct bpf_func_proto bpf_event_output_data_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_output_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_reserve_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_submit_proto;
> +extern const struct bpf_func_proto bpf_ringbuf_discard_proto;

>   const struct bpf_func_proto *bpf_tracing_func_proto(
>   	enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 29d22752fc87..fa8e1b552acd 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -118,6 +118,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>   #if defined(CONFIG_BPF_JIT)
>   BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>   #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)

>   BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>   BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 6abd5a778fcd..c94a736e53cd 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -54,6 +54,8 @@ struct bpf_reg_state {

>   		u32 btf_id; /* for PTR_TO_BTF_ID */

> +		u32 mem_size; /* for PTR_TO_MEM | PTR_TO_MEM_OR_NULL */
> +
>   		/* Max size from any of the above. */
>   		unsigned long raw;
>   	};
> @@ -63,6 +65,8 @@ struct bpf_reg_state {
>   	 * offset, so they can share range knowledge.
>   	 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
>   	 * came from, when one is tested for != NULL.
> +	 * For PTR_TO_MEM_OR_NULL this is used to identify memory allocation
> +	 * for the purpose of tracking that it's freed.
>   	 * For PTR_TO_SOCKET this is used to share which pointers retain the
>   	 * same reference to the socket, to determine proper reference freeing.
>   	 */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index bfb31c1be219..ae2deb6a8afc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -147,6 +147,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_SK_STORAGE,
>   	BPF_MAP_TYPE_DEVMAP_HASH,
>   	BPF_MAP_TYPE_STRUCT_OPS,
> +	BPF_MAP_TYPE_RINGBUF,
>   };

>   /* Note that tracing related programs such as
> @@ -3121,6 +3122,32 @@ union bpf_attr {
>    * 		0 on success, or a negative error in case of failure:
>    *
>    *		**-EOVERFLOW** if an overflow happened: The same object will be tried again.
> + *
> + * void *bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
> + * 	Description
> + * 		Copy *size* bytes from *data* into a ring buffer *ringbuf*.
> + * 	Return
> + * 		0, on success;
> + * 		< 0, on error.
> + *
> + * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
> + * 	Description
> + * 		Reserve *size* bytes of payload in a ring buffer *ringbuf*.
> + * 	Return
> + * 		Valid pointer with *size* bytes of memory available; NULL,
> + * 		otherwise.
> + *
> + * void bpf_ringbuf_submit(void *data)
> + * 	Description
> + * 		Submit reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
Even though you mention self-pacing properties, would it still
make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output
to indicate whether to wake up userspace or not? Maybe something like
a threshold on the number of outstanding events in the ringbuf, after
which we do the wakeup? A default of 0/1 would preserve the existing
behavior.

The example I can give is a control plane userspace thread that
aggregates the events once a second; it doesn't care about millisecond
resolution. With the current scheme, I suppose, if BPF generates events
every 1ms, userspace will be woken up 1000 times a second (if it can
keep up). Most of the time we don't really care, and some buffering
is desired.
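
For illustration only, the threshold idea could be sketched roughly as
below. The helper name and the threshold parameter are hypothetical and
not part of the patch; the sketch assumes producer/consumer positions are
monotonically increasing byte counters as in the posted code, and it
counts outstanding bytes rather than events to keep things simple. A
threshold of 0 or 1 wakes whenever anything is outstanding, which only
approximates the current "consumer caught up" check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: decide whether a commit
 * should wake the consumer. prod_pos and cons_pos are monotonically
 * increasing byte positions; threshold is the number of unconsumed bytes
 * at which a wakeup is issued.
 */
static bool ringbuf_should_wake(uint64_t prod_pos, uint64_t cons_pos,
				uint64_t threshold)
{
	/* unsigned subtraction stays correct across counter wraparound */
	uint64_t outstanding = prod_pos - cons_pos;

	if (threshold <= 1)
		return outstanding > 0;
	return outstanding >= threshold;
}
```

A once-a-second aggregator could then pick a threshold close to the
buffer size and rely on its own timer for the remaining data.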

> + *
> + * void bpf_ringbuf_discard(void *data)
> + * 	Description
> + * 		Discard reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -3250,7 +3277,11 @@ union bpf_attr {
>   	FN(sk_assign),			\
>   	FN(ktime_get_boot_ns),		\
>   	FN(seq_printf),			\
> -	FN(seq_write),
> +	FN(seq_write),			\
> +	FN(ringbuf_output),		\
> +	FN(ringbuf_reserve),		\
> +	FN(ringbuf_submit),		\
> +	FN(ringbuf_discard),

>   /* integer value in 'imm' field of BPF_CALL instruction selects which  
> helper
>    * function eBPF program intends to call
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 37b2d8620153..c9aada6c1806 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -4,7 +4,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init)

>   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o  
> tnum.o bpf_iter.o map_iter.o task_iter.o
>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o  
> bpf_lru_list.o lpm_trie.o map_in_map.o
> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>   obj-$(CONFIG_BPF_JIT) += trampoline.o
>   obj-$(CONFIG_BPF_SYSCALL) += btf.o
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5c0290e0696e..27321ca8803f 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -629,6 +629,14 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>   		return &bpf_ktime_get_ns_proto;
>   	case BPF_FUNC_ktime_get_boot_ns:
>   		return &bpf_ktime_get_boot_ns_proto;
> +	case BPF_FUNC_ringbuf_output:
> +		return &bpf_ringbuf_output_proto;
> +	case BPF_FUNC_ringbuf_reserve:
> +		return &bpf_ringbuf_reserve_proto;
> +	case BPF_FUNC_ringbuf_submit:
> +		return &bpf_ringbuf_submit_proto;
> +	case BPF_FUNC_ringbuf_discard:
> +		return &bpf_ringbuf_discard_proto;
>   	default:
>   		break;
>   	}
> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> new file mode 100644
> index 000000000000..f2ae441a1695
> --- /dev/null
> +++ b/kernel/bpf/ringbuf.c
> @@ -0,0 +1,409 @@
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/filter.h>
> +#include <linux/mm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <uapi/linux/btf.h>
> +
> +#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
> +
> +#define RINGBUF_BUSY_BIT BIT(31)
> +#define RINGBUF_DISCARD_BIT BIT(30)
> +#define RINGBUF_META_SZ 8
> +
> +/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page)  
> */
> +#define BPF_RINGBUF_PGOFF \
> +	(offsetof(struct bpf_ringbuf, consumer_pos) >> PAGE_SHIFT)
> +
> +struct bpf_ringbuf {
> +	wait_queue_head_t waitq;
> +	u64 mask;
> +	spinlock_t spinlock ____cacheline_aligned_in_smp;
> +	u64 consumer_pos __aligned(PAGE_SIZE);
> +	u64 producer_pos __aligned(PAGE_SIZE);
> +	char data[] __aligned(PAGE_SIZE);
> +};
> +
> +struct bpf_ringbuf_map {
> +	struct bpf_map map;
> +	struct bpf_map_memory memory;
> +	struct bpf_ringbuf *rb;
> +};
> +
> +static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int  
> numa_node)
> +{
> +	const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN |
> +			    __GFP_ZERO;
> +	int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF;
There are a bunch of magic '2's scattered around. Would it make sense
to use a proper define (with a comment) instead?
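
As a rough sketch of what such defines might look like (the macro and
function names here are made up for illustration, and pgoff_pages stands
in for BPF_RINGBUF_PGOFF):

```c
#include <stddef.h>

/* Hypothetical names, not from the patch. */

/* consumer_pos and producer_pos each occupy their own page */
#define RINGBUF_NR_POS_PAGES 2
/* every data page is mapped twice so wrapped samples read contiguously */
#define RINGBUF_DATA_MAP_CNT 2

/* Number of page slots handed to vmap(): meta pages plus the
 * double-mapped data pages.
 */
static size_t ringbuf_nr_vmap_pages(size_t pgoff_pages, size_t nr_data_pages)
{
	size_t nr_meta_pages = RINGBUF_NR_POS_PAGES + pgoff_pages;

	return nr_meta_pages + RINGBUF_DATA_MAP_CNT * nr_data_pages;
}
```

Keeping the two constants separate makes it clear that the '2' in
nr_meta_pages and the '2' in front of nr_data_pages mean different things.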

> +	int nr_data_pages = data_sz >> PAGE_SHIFT;
> +	int nr_pages = nr_meta_pages + nr_data_pages;
> +	struct page **pages, *page;
> +	size_t array_size;
> +	void *addr;
> +	int i;
> +
> +	/* Each data page is mapped twice to allow "virtual"
> +	 * continuous read of samples wrapping around the end of ring
> +	 * buffer area:
> +	 * ------------------------------------------------------
> +	 * | meta pages |  real data pages  |  same data pages  |
> +	 * ------------------------------------------------------
> +	 * |            | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 |
> +	 * ------------------------------------------------------
> +	 * |            | TA             DA | TA             DA |
> +	 * ------------------------------------------------------
> +	 *                               ^^^^^^^
> +	 *                                  |
> +	 * Here, no need to worry about special handling of wrapped-around
> +	 * data due to double-mapped data pages. This works both in kernel and
> +	 * when mmap()'ed in user-space, simplifying both kernel and
> +	 * user-space implementations significantly.
> +	 */
> +	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
> +	if (array_size > PAGE_SIZE)
> +		pages = vmalloc_node(array_size, numa_node);
> +	else
> +		pages = kmalloc_node(array_size, flags, numa_node);
> +	if (!pages)
> +		return NULL;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		page = alloc_pages_node(numa_node, flags, 0);
> +		if (!page) {
> +			nr_pages = i;
> +			goto err_free_pages;
> +		}
> +		pages[i] = page;
> +		if (i >= nr_meta_pages)
> +			pages[nr_data_pages + i] = page;
> +	}
> +
> +	addr = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
> +		    VM_ALLOC | VM_USERMAP, PAGE_KERNEL);
> +	if (addr)
> +		return addr;
> +
> +err_free_pages:
> +	for (i = 0; i < nr_pages; i++)
> +		free_page((unsigned long)pages[i]);
> +	kvfree(pages);
> +	return NULL;
> +}
> +
> +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int  
> numa_node)
> +{
> +	struct bpf_ringbuf *rb;
> +
> +	if (!data_sz || !PAGE_ALIGNED(data_sz))
> +		return ERR_PTR(-EINVAL);
> +
> +	rb = bpf_ringbuf_area_alloc(data_sz, numa_node);
> +	if (!rb)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock_init(&rb->spinlock);
> +	init_waitqueue_head(&rb->waitq);
> +
> +	rb->mask = data_sz - 1;
> +	rb->consumer_pos = 0;
> +	rb->producer_pos = 0;
> +
> +	return rb;
> +}
> +
> +static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +	u64 cost;
> +	int err;
> +
> +	if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (attr->key_size || attr->value_size ||
> +	    attr->max_entries == 0 || !PAGE_ALIGNED(attr->max_entries))
> +		return ERR_PTR(-EINVAL);
> +
> +	rb_map = kzalloc(sizeof(*rb_map), GFP_USER);
> +	if (!rb_map)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bpf_map_init_from_attr(&rb_map->map, attr);
> +
> +	cost = sizeof(struct bpf_ringbuf_map) +
> +	       sizeof(struct bpf_ringbuf) +
> +	       attr->max_entries;
> +	err = bpf_map_charge_init(&rb_map->map.memory, cost);
> +	if (err)
> +		goto err_free_map;
> +
> +	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries,  
> rb_map->map.numa_node);
> +	if (IS_ERR(rb_map->rb)) {
> +		err = PTR_ERR(rb_map->rb);
> +		goto err_uncharge;
> +	}
> +
> +	return &rb_map->map;
> +
> +err_uncharge:
> +	bpf_map_charge_finish(&rb_map->map.memory);
> +err_free_map:
> +	kfree(rb_map);
> +	return ERR_PTR(err);
> +}
> +
> +static void bpf_ringbuf_free(struct bpf_ringbuf *ringbuf)
> +{
> +	kvfree(ringbuf);
> +}
> +
> +static void ringbuf_map_free(struct bpf_map *map)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> +	 * so the programs (can be more than one that used this map) were
> +	 * disconnected from events. Wait for outstanding critical sections in
> +	 * these programs to complete
> +	 */
> +	synchronize_rcu();
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	bpf_ringbuf_free(rb_map->rb);
> +	kfree(rb_map);
> +}
> +
> +static void *ringbuf_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return ERR_PTR(-ENOTSUPP);
> +}
> +
> +static int ringbuf_map_update_elem(struct bpf_map *map, void *key, void  
> *value,
> +				   u64 flags)
> +{
> +	return -ENOTSUPP;
> +}
> +
> +static int ringbuf_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return -ENOTSUPP;
> +}
> +
> +static int ringbuf_map_get_next_key(struct bpf_map *map, void *key,
> +				    void *next_key)
> +{
> +	return -ENOTSUPP;
> +}
> +
> +static size_t bpf_ringbuf_mmap_page_cnt(const struct bpf_ringbuf *rb)
> +{
> +	size_t data_pages = (rb->mask + 1) >> PAGE_SHIFT;
> +
> +	/* consumer page + producer page + 2 x data pages */
> +	return 2 + 2 * data_pages;
> +}
> +
> +static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct  
> *vma)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +	size_t mmap_sz;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	mmap_sz = bpf_ringbuf_mmap_page_cnt(rb_map->rb) << PAGE_SHIFT;
> +
> +	if (vma->vm_pgoff * PAGE_SIZE + (vma->vm_end - vma->vm_start) > mmap_sz)
> +		return -EINVAL;
> +
> +	return remap_vmalloc_range(vma, rb_map->rb,
> +				   vma->vm_pgoff + BPF_RINGBUF_PGOFF);
> +}
> +
> +static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp,
> +				  struct poll_table_struct *pts)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	poll_wait(filp, &rb_map->rb->waitq, pts);
> +
> +	return EPOLLIN | EPOLLRDNORM;
> +}
> +
> +const struct bpf_map_ops ringbuf_map_ops = {
> +	.map_alloc = ringbuf_map_alloc,
> +	.map_free = ringbuf_map_free,
> +	.map_mmap = ringbuf_map_mmap,
> +	.map_poll = ringbuf_map_poll,
> +	.map_lookup_elem = ringbuf_map_lookup_elem,
> +	.map_update_elem = ringbuf_map_update_elem,
> +	.map_delete_elem = ringbuf_map_delete_elem,
> +	.map_get_next_key = ringbuf_map_get_next_key,
> +};
> +
> +/* Given pointer to ring buffer record metadata and struct bpf_ringbuf  
> itself,
> + * calculate offset from record metadata to ring buffer in pages, rounded
> + * down. This page offset is stored as part of record metadata and  
> allows to
> + * restore struct bpf_ringbuf * from record pointer. This page offset is
> + * stored at offset 4 of record metadata header.
> + */
> +static size_t bpf_ringbuf_rec_pg_off(struct bpf_ringbuf *rb, void  
> *meta_ptr)
> +{
> +	return (meta_ptr - (void *)rb) >> PAGE_SHIFT;
> +}
> +
> +/* Given pointer to ring buffer record metadata, restore pointer to  
> struct
> + * bpf_ringbuf itself by using page offset stored at offset 4
> + */
> +static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
> +{
> +	unsigned long addr = (unsigned long)meta_ptr;
> +	unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;
> +
> +	return (void*)((addr & PAGE_MASK) - off);
> +}
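
For illustration, the page-offset round trip in the two helpers above can
be modelled in userspace with plain integers. This is a sketch under
assumptions: the SK_* names are invented, a 4 KiB page is assumed, and
the ring buffer base is assumed to be page-aligned (which vmap() would
give).

```c
#include <stdint.h>

/* Invented stand-ins for PAGE_SHIFT/PAGE_SIZE/PAGE_MASK (4 KiB pages). */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE ((uintptr_t)1 << SK_PAGE_SHIFT)
#define SK_PAGE_MASK (~(SK_PAGE_SIZE - 1))

/* Page distance from the (page-aligned) base to the record header. */
static uint32_t rec_pg_off(uintptr_t rb_base, uintptr_t meta_addr)
{
	return (uint32_t)((meta_addr - rb_base) >> SK_PAGE_SHIFT);
}

/* Recover the base from the record header address and the stored page
 * offset. The offset is widened before shifting so the shift is done at
 * pointer width.
 */
static uintptr_t restore_base(uintptr_t meta_addr, uint32_t pg_off)
{
	return (meta_addr & SK_PAGE_MASK) -
	       ((uintptr_t)pg_off << SK_PAGE_SHIFT);
}
```

The round trip only works because the base is page-aligned: rounding the
record address down to its page and stepping back pg_off pages lands
exactly on the base.
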
> +
> +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> +{
> +	unsigned long cons_pos, prod_pos, new_prod_pos, flags;
> +	u32 len, pg_off;
> +	void *meta_ptr;
> +
> +	if (unlikely(size > UINT_MAX))
> +		return NULL;
> +
> +	len = round_up(size + RINGBUF_META_SZ, 8);
> +	cons_pos = READ_ONCE(rb->consumer_pos);
> +
> +	if (in_nmi()) {
> +		if (!spin_trylock_irqsave(&rb->spinlock, flags))
> +			return NULL;
> +	} else {
> +		spin_lock_irqsave(&rb->spinlock, flags);
> +	}
> +
> +	prod_pos = rb->producer_pos;
> +	new_prod_pos = prod_pos + len;
> +
> +	/* check for out of ringbuf space by ensuring producer position
> +	 * doesn't advance more than (ringbuf_size - 1) ahead
> +	 */
> +	if (new_prod_pos - cons_pos > rb->mask) {
> +		spin_unlock_irqrestore(&rb->spinlock, flags);
> +		return NULL;
> +	}
> +
> +	meta_ptr = rb->data + (prod_pos & rb->mask);
> +	pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
> +
> +	WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
> +	WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);
> +
> +	/* ensure length prefix is written before updating producer positions */
> +	smp_wmb();
> +	WRITE_ONCE(rb->producer_pos, new_prod_pos);
> +
> +	spin_unlock_irqrestore(&rb->spinlock, flags);
> +
> +	return meta_ptr + RINGBUF_META_SZ;
> +}
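
For illustration, the out-of-space check in the function above can be
exercised in isolation. This is a userspace sketch with an invented
helper name; it assumes, as in the posted code, that positions are
unsigned monotonically increasing counters and that mask is the data
size minus one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented helper: true if a record of len bytes (header included) can
 * be reserved. Mirrors "new_prod_pos - cons_pos > rb->mask" from the
 * patch, inverted. Unsigned arithmetic keeps the difference correct
 * even when the counters wrap.
 */
static bool ringbuf_has_space(uint64_t prod_pos, uint64_t cons_pos,
			      uint64_t len, uint64_t mask)
{
	uint64_t new_prod_pos = prod_pos + len;

	return new_prod_pos - cons_pos <= mask;
}
```

Note that with mask equal to the data size minus one, at most size - 1
bytes can be outstanding, so a reservation that would fill the buffer
exactly is rejected.
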
> +
> +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64,  
> flags)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
> +	.func		= bpf_ringbuf_reserve,
> +	.ret_type	= RET_PTR_TO_ALLOC_MEM_OR_NULL,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_CONST_ALLOC_SIZE_OR_ZERO,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +static void bpf_ringbuf_commit(void *sample, bool discard)
> +{
> +	unsigned long rec_pos, cons_pos;
> +	u32 new_meta, old_meta;
> +	void *meta_ptr;
> +	struct bpf_ringbuf *rb;
> +
> +	meta_ptr = sample - RINGBUF_META_SZ;
> +	rb = bpf_ringbuf_restore_from_rec(meta_ptr);
> +	old_meta = *(u32 *)meta_ptr;
> +	new_meta = old_meta ^ RINGBUF_BUSY_BIT;
> +	if (discard)
> +		new_meta |= RINGBUF_DISCARD_BIT;
> +
> +	/* update metadata header with correct final size prefix */
> +	xchg((u32 *)meta_ptr, new_meta);
> +
> +	/* if consumer caught up and is waiting for our record, notify about
> +	 * new data availability
> +	 */
> +	rec_pos = (void *)meta_ptr - (void *)rb->data;
> +	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
> +	if (cons_pos == rec_pos)
> +		wake_up_all(&rb->waitq);
> +}
> +
> +BPF_CALL_1(bpf_ringbuf_submit, void *, sample)
> +{
> +	bpf_ringbuf_commit(sample, false /* discard */);
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_submit_proto = {
> +	.func		= bpf_ringbuf_submit,
> +	.ret_type	= RET_VOID,
> +	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
> +};
> +
> +BPF_CALL_1(bpf_ringbuf_discard, void *, sample)
> +{
> +	bpf_ringbuf_commit(sample, true /* discard */);
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_discard_proto = {
> +	.func		= bpf_ringbuf_discard,
> +	.ret_type	= RET_VOID,
> +	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
> +};
> +
> +BPF_CALL_4(bpf_ringbuf_output, struct bpf_map *, map, void *, data, u64,  
> size,
> +	   u64, flags)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +	void *rec;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	rec = __bpf_ringbuf_reserve(rb_map->rb, size);
> +	if (!rec)
> +		return -EAGAIN;
> +
> +	memcpy(rec, data, size);
> +	bpf_ringbuf_commit(rec, false /* discard */);
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_output_proto = {
> +	.func		= bpf_ringbuf_output,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_MEM,
> +	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
> +	.arg4_type	= ARG_ANYTHING,
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index de2a75500233..462db8595e9f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -26,6 +26,7 @@
>   #include <linux/audit.h>
>   #include <uapi/linux/btf.h>
>   #include <linux/bpf_lsm.h>
> +#include <linux/poll.h>

>   #define IS_FD_ARRAY(map) ((map)->map_type ==  
> BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>   			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
> @@ -651,6 +652,16 @@ static int bpf_map_mmap(struct file *filp, struct  
> vm_area_struct *vma)
>   	return err;
>   }

> +static __poll_t bpf_map_poll(struct file *filp, struct poll_table_struct  
> *pts)
> +{
> +	struct bpf_map *map = filp->private_data;
> +
> +	if (map->ops->map_poll)
> +		return map->ops->map_poll(map, filp, pts);
> +
> +	return EPOLLERR;
> +}
> +
>   const struct file_operations bpf_map_fops = {
>   #ifdef CONFIG_PROC_FS
>   	.show_fdinfo	= bpf_map_show_fdinfo,
> @@ -659,6 +670,7 @@ const struct file_operations bpf_map_fops = {
>   	.read		= bpf_dummy_read,
>   	.write		= bpf_dummy_write,
>   	.mmap		= bpf_map_mmap,
> +	.poll		= bpf_map_poll,
>   };

>   int bpf_map_new_fd(struct bpf_map *map, int flags)
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2a1826c76bb6..b8f0158d2327 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -233,6 +233,7 @@ struct bpf_call_arg_meta {
>   	bool pkt_access;
>   	int regno;
>   	int access_size;
> +	int mem_size;
>   	u64 msize_max_value;
>   	int ref_obj_id;
>   	int func_id;
> @@ -399,7 +400,8 @@ static bool reg_type_may_be_null(enum bpf_reg_type  
> type)
>   	       type == PTR_TO_SOCKET_OR_NULL ||
>   	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
>   	       type == PTR_TO_TCP_SOCK_OR_NULL ||
> -	       type == PTR_TO_BTF_ID_OR_NULL;
> +	       type == PTR_TO_BTF_ID_OR_NULL ||
> +	       type == PTR_TO_MEM_OR_NULL;
>   }

>   static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
> @@ -413,7 +415,9 @@ static bool reg_type_may_be_refcounted_or_null(enum  
> bpf_reg_type type)
>   	return type == PTR_TO_SOCKET ||
>   		type == PTR_TO_SOCKET_OR_NULL ||
>   		type == PTR_TO_TCP_SOCK ||
> -		type == PTR_TO_TCP_SOCK_OR_NULL;
> +		type == PTR_TO_TCP_SOCK_OR_NULL ||
> +		type == PTR_TO_MEM ||
> +		type == PTR_TO_MEM_OR_NULL;
>   }

>   static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
> @@ -427,7 +431,9 @@ static bool arg_type_may_be_refcounted(enum  
> bpf_arg_type type)
>    */
>   static bool is_release_function(enum bpf_func_id func_id)
>   {
> -	return func_id == BPF_FUNC_sk_release;
> +	return func_id == BPF_FUNC_sk_release ||
> +	       func_id == BPF_FUNC_ringbuf_submit ||
> +	       func_id == BPF_FUNC_ringbuf_discard;
>   }

>   static bool may_be_acquire_function(enum bpf_func_id func_id)
> @@ -435,7 +441,8 @@ static bool may_be_acquire_function(enum bpf_func_id  
> func_id)
>   	return func_id == BPF_FUNC_sk_lookup_tcp ||
>   		func_id == BPF_FUNC_sk_lookup_udp ||
>   		func_id == BPF_FUNC_skc_lookup_tcp ||
> -		func_id == BPF_FUNC_map_lookup_elem;
> +		func_id == BPF_FUNC_map_lookup_elem ||
> +	        func_id == BPF_FUNC_ringbuf_reserve;
>   }

>   static bool is_acquire_function(enum bpf_func_id func_id,
> @@ -445,7 +452,8 @@ static bool is_acquire_function(enum bpf_func_id  
> func_id,

>   	if (func_id == BPF_FUNC_sk_lookup_tcp ||
>   	    func_id == BPF_FUNC_sk_lookup_udp ||
> -	    func_id == BPF_FUNC_skc_lookup_tcp)
> +	    func_id == BPF_FUNC_skc_lookup_tcp ||
> +	    func_id == BPF_FUNC_ringbuf_reserve)
>   		return true;

>   	if (func_id == BPF_FUNC_map_lookup_elem &&
> @@ -485,6 +493,8 @@ static const char * const reg_type_str[] = {
>   	[PTR_TO_XDP_SOCK]	= "xdp_sock",
>   	[PTR_TO_BTF_ID]		= "ptr_",
>   	[PTR_TO_BTF_ID_OR_NULL]	= "ptr_or_null_",
> +	[PTR_TO_MEM]		= "mem",
> +	[PTR_TO_MEM_OR_NULL]	= "mem_or_null",
>   };

>   static char slot_type_char[] = {
> @@ -2459,32 +2469,31 @@ static int check_map_access_type(struct  
> bpf_verifier_env *env, u32 regno,
>   	return 0;
>   }

> -/* check read/write into map element returned by bpf_map_lookup_elem() */
> -static int __check_map_access(struct bpf_verifier_env *env, u32 regno,  
> int off,
> -			      int size, bool zero_size_allowed)
> +/* check read/write into memory region (e.g., map value, ringbuf sample,  
> etc) */
> +static int __check_mem_access(struct bpf_verifier_env *env, int off,
> +			      int size, u32 mem_size, bool zero_size_allowed)
>   {
> -	struct bpf_reg_state *regs = cur_regs(env);
> -	struct bpf_map *map = regs[regno].map_ptr;
> +	bool size_ok = size > 0 || (size == 0 && zero_size_allowed);

> -	if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
> -	    off + size > map->value_size) {
> -		verbose(env, "invalid access to map value, value_size=%d off=%d  
> size=%d\n",
> -			map->value_size, off, size);
> -		return -EACCES;
> -	}
> -	return 0;
> +	if (off >= 0 && size_ok && off + size <= mem_size)
> +		return 0;
> +
> +	verbose(env, "invalid access to memory, mem_size=%u off=%d size=%d\n",
> +		mem_size, off, size);
> +	return -EACCES;
>   }

> -/* check read/write into a map element with possible variable offset */
> -static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> -			    int off, int size, bool zero_size_allowed)
> +/* check read/write into a memory region with possible variable offset */
> +static int check_mem_region_access(struct bpf_verifier_env *env, u32  
> regno,
> +				   int off, int size, u32 mem_size,
> +				   bool zero_size_allowed)
>   {
>   	struct bpf_verifier_state *vstate = env->cur_state;
>   	struct bpf_func_state *state = vstate->frame[vstate->curframe];
>   	struct bpf_reg_state *reg = &state->regs[regno];
>   	int err;

> -	/* We may have adjusted the register to this map value, so we
> +	/* We may have adjusted the register pointing to memory region, so we
>   	 * need to try adding each of min_value and max_value to off
>   	 * to make sure our theoretical access will be safe.
>   	 */
> @@ -2501,14 +2510,14 @@ static int check_map_access(struct  
> bpf_verifier_env *env, u32 regno,
>   	    (reg->smin_value == S64_MIN ||
>   	     (off + reg->smin_value != (s64)(s32)(off + reg->smin_value)) ||
>   	      reg->smin_value + off < 0)) {
> -		verbose(env, "R%d min value is negative, either use unsigned index or  
> do a if (index >=0) check.\n",
> +		verbose(env, "R%d min value is negative, either use unsigned index or  
> do an if (index >=0) check.\n",
>   			regno);
>   		return -EACCES;
>   	}
> -	err = __check_map_access(env, regno, reg->smin_value + off, size,
> +	err = __check_mem_access(env, reg->smin_value + off, size, mem_size,
>   				 zero_size_allowed);
>   	if (err) {
> -		verbose(env, "R%d min value is outside of the array range\n",
> +		verbose(env, "R%d min value is outside of the memory region\n",
>   			regno);
>   		return err;
>   	}
> @@ -2518,18 +2527,38 @@ static int check_map_access(struct  
> bpf_verifier_env *env, u32 regno,
>   	 * If reg->umax_value + off could overflow, treat that as unbounded too.
>   	 */
>   	if (reg->umax_value >= BPF_MAX_VAR_OFF) {
> -		verbose(env, "R%d unbounded memory access, make sure to bounds check  
> any array access into a map\n",
> +		verbose(env, "R%d unbounded memory access, make sure to bounds check  
> any memory region access\n",
>   			regno);
>   		return -EACCES;
>   	}
> -	err = __check_map_access(env, regno, reg->umax_value + off, size,
> +	err = __check_mem_access(env, reg->umax_value + off, size, mem_size,
>   				 zero_size_allowed);
> -	if (err)
> -		verbose(env, "R%d max value is outside of the array range\n",
> +	if (err) {
> +		verbose(env, "R%d max value is outside of the memory region\n",
>   			regno);
> +		return err;
> +	}
> +
> +	return 0;
> +}
> +
> +/* check read/write into a map element with possible variable offset */
> +static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> +			    int off, int size, bool zero_size_allowed)
> +{
> +	struct bpf_verifier_state *vstate = env->cur_state;
> +	struct bpf_func_state *state = vstate->frame[vstate->curframe];
> +	struct bpf_reg_state *reg = &state->regs[regno];
> +	struct bpf_map *map = reg->map_ptr;
> +	int err;
> +
> +	err = check_mem_region_access(env, regno, off, size, map->value_size,
> +				      zero_size_allowed);
> +	if (err)
> +		return err;

> -	if (map_value_has_spin_lock(reg->map_ptr)) {
> -		u32 lock = reg->map_ptr->spin_lock_off;
> +	if (map_value_has_spin_lock(map)) {
> +		u32 lock = map->spin_lock_off;

>   		/* if any part of struct bpf_spin_lock can be touched by
>   		 * load/store reject this program.
> @@ -3211,6 +3240,16 @@ static int check_mem_access(struct  
> bpf_verifier_env *env, int insn_idx, u32 regn
>   				mark_reg_unknown(env, regs, value_regno);
>   			}
>   		}
> +	} else if (reg->type == PTR_TO_MEM) {
> +		if (t == BPF_WRITE && value_regno >= 0 &&
> +		    is_pointer_value(env, value_regno)) {
> +			verbose(env, "R%d leaks addr into mem\n", value_regno);
> +			return -EACCES;
> +		}
> +		err = check_mem_region_access(env, regno, off, size,
> +					      reg->mem_size, false);
> +		if (!err && t == BPF_READ && value_regno >= 0)
> +			mark_reg_unknown(env, regs, value_regno);
>   	} else if (reg->type == PTR_TO_CTX) {
>   		enum bpf_reg_type reg_type = SCALAR_VALUE;
>   		u32 btf_id = 0;
> @@ -3548,6 +3587,10 @@ static int check_helper_mem_access(struct  
> bpf_verifier_env *env, int regno,
>   			return -EACCES;
>   		return check_map_access(env, regno, reg->off, access_size,
>   					zero_size_allowed);
> +	case PTR_TO_MEM:
> +		return check_mem_region_access(env, regno, reg->off,
> +					       access_size, reg->mem_size,
> +					       zero_size_allowed);
>   	default: /* scalar_value|ptr_to_stack or invalid ptr */
>   		return check_stack_boundary(env, regno, access_size,
>   					    zero_size_allowed, meta);
> @@ -3652,6 +3695,17 @@ static bool arg_type_is_mem_size(enum bpf_arg_type  
> type)
>   	       type == ARG_CONST_SIZE_OR_ZERO;
>   }

> +static bool arg_type_is_alloc_mem_ptr(enum bpf_arg_type type)
> +{
> +	return type == ARG_PTR_TO_ALLOC_MEM ||
> +	       type == ARG_PTR_TO_ALLOC_MEM_OR_NULL;
> +}
> +
> +static bool arg_type_is_alloc_size(enum bpf_arg_type type)
> +{
> +	return type == ARG_CONST_ALLOC_SIZE_OR_ZERO;
> +}
> +
>   static bool arg_type_is_int_ptr(enum bpf_arg_type type)
>   {
>   	return type == ARG_PTR_TO_INT ||
> @@ -3711,7 +3765,8 @@ static int check_func_arg(struct bpf_verifier_env  
> *env, u32 regno,
>   			 type != expected_type)
>   			goto err_type;
>   	} else if (arg_type == ARG_CONST_SIZE ||
> -		   arg_type == ARG_CONST_SIZE_OR_ZERO) {
> +		   arg_type == ARG_CONST_SIZE_OR_ZERO ||
> +		   arg_type == ARG_CONST_ALLOC_SIZE_OR_ZERO) {
>   		expected_type = SCALAR_VALUE;
>   		if (type != expected_type)
>   			goto err_type;
> @@ -3782,13 +3837,29 @@ static int check_func_arg(struct bpf_verifier_env  
> *env, u32 regno,
>   		 * happens during stack boundary checking.
>   		 */
>   		if (register_is_null(reg) &&
> -		    arg_type == ARG_PTR_TO_MEM_OR_NULL)
> +		    (arg_type == ARG_PTR_TO_MEM_OR_NULL ||
> +		     arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL))
>   			/* final test in check_stack_boundary() */;
>   		else if (!type_is_pkt_pointer(type) &&
>   			 type != PTR_TO_MAP_VALUE &&
> +			 type != PTR_TO_MEM &&
>   			 type != expected_type)
>   			goto err_type;
>   		meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM;
> +	} else if (arg_type_is_alloc_mem_ptr(arg_type)) {
> +		expected_type = PTR_TO_MEM;
> +		if (register_is_null(reg) &&
> +		    arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL)
> +			/* final test in check_stack_boundary() */;
> +		else if (type != expected_type)
> +			goto err_type;
> +		if (meta->ref_obj_id) {
> +			verbose(env, "verifier internal error: more than one arg with  
> ref_obj_id R%d %u %u\n",
> +				regno, reg->ref_obj_id,
> +				meta->ref_obj_id);
> +			return -EFAULT;
> +		}
> +		meta->ref_obj_id = reg->ref_obj_id;
>   	} else if (arg_type_is_int_ptr(arg_type)) {
>   		expected_type = PTR_TO_STACK;
>   		if (!type_is_pkt_pointer(type) &&
> @@ -3884,6 +3955,13 @@ static int check_func_arg(struct bpf_verifier_env  
> *env, u32 regno,
>   					      zero_size_allowed, meta);
>   		if (!err)
>   			err = mark_chain_precision(env, regno);
> +	} else if (arg_type_is_alloc_size(arg_type)) {
> +		if (!tnum_is_const(reg->var_off)) {
> +			verbose(env, "R%d unbounded size, use 'var &= const' or 'if (var <  
> const)'\n",
> +				regno);
> +			return -EACCES;
> +		}
> +		meta->mem_size = reg->var_off.value;
>   	} else if (arg_type_is_int_ptr(arg_type)) {
>   		int size = int_ptr_type_to_size(arg_type);

> @@ -3920,6 +3998,13 @@ static int check_map_func_compatibility(struct  
> bpf_verifier_env *env,
>   		    func_id != BPF_FUNC_xdp_output)
>   			goto error;
>   		break;
> +	case BPF_MAP_TYPE_RINGBUF:
> +		if (func_id != BPF_FUNC_ringbuf_output &&
> +		    func_id != BPF_FUNC_ringbuf_reserve &&
> +		    func_id != BPF_FUNC_ringbuf_submit &&
> +		    func_id != BPF_FUNC_ringbuf_discard)
> +			goto error;
> +		break;
>   	case BPF_MAP_TYPE_STACK_TRACE:
>   		if (func_id != BPF_FUNC_get_stackid)
>   			goto error;
> @@ -4644,6 +4729,11 @@ static int check_helper_call(struct  
> bpf_verifier_env *env, int func_id, int insn
>   		mark_reg_known_zero(env, regs, BPF_REG_0);
>   		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
>   		regs[BPF_REG_0].id = ++env->id_gen;
> +	} else if (fn->ret_type == RET_PTR_TO_ALLOC_MEM_OR_NULL) {
> +		mark_reg_known_zero(env, regs, BPF_REG_0);
> +		regs[BPF_REG_0].type = PTR_TO_MEM_OR_NULL;
> +		regs[BPF_REG_0].id = ++env->id_gen;
> +		regs[BPF_REG_0].mem_size = meta.mem_size;
>   	} else {
>   		verbose(env, "unknown return type %d of func %s#%d\n",
>   			fn->ret_type, func_id_name(func_id), func_id);
> @@ -6583,6 +6673,8 @@ static void mark_ptr_or_null_reg(struct  
> bpf_func_state *state,
>   			reg->type = PTR_TO_TCP_SOCK;
>   		} else if (reg->type == PTR_TO_BTF_ID_OR_NULL) {
>   			reg->type = PTR_TO_BTF_ID;
> +		} else if (reg->type == PTR_TO_MEM_OR_NULL) {
> +			reg->type = PTR_TO_MEM;
>   		}
>   		if (is_null) {
>   			/* We don't need id and ref_obj_id from this point
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d961428fb5b6..6e6b3f8f77c1 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1053,6 +1053,14 @@ bpf_tracing_func_proto(enum bpf_func_id func_id,  
> const struct bpf_prog *prog)
>   		return &bpf_perf_event_read_value_proto;
>   	case BPF_FUNC_get_ns_current_pid_tgid:
>   		return &bpf_get_ns_current_pid_tgid_proto;
> +	case BPF_FUNC_ringbuf_output:
> +		return &bpf_ringbuf_output_proto;
> +	case BPF_FUNC_ringbuf_reserve:
> +		return &bpf_ringbuf_reserve_proto;
> +	case BPF_FUNC_ringbuf_submit:
> +		return &bpf_ringbuf_submit_proto;
> +	case BPF_FUNC_ringbuf_discard:
> +		return &bpf_ringbuf_discard_proto;
>   	default:
>   		return NULL;
>   	}
> diff --git a/tools/include/uapi/linux/bpf.h  
> b/tools/include/uapi/linux/bpf.h
> index bfb31c1be219..ae2deb6a8afc 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -147,6 +147,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_SK_STORAGE,
>   	BPF_MAP_TYPE_DEVMAP_HASH,
>   	BPF_MAP_TYPE_STRUCT_OPS,
> +	BPF_MAP_TYPE_RINGBUF,
>   };

>   /* Note that tracing related programs such as
> @@ -3121,6 +3122,32 @@ union bpf_attr {
>    * 		0 on success, or a negative error in case of failure:
>    *
>    *		**-EOVERFLOW** if an overflow happened: The same object will be  
> tried again.
> + *
> + * void *bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64  
> flags)
> + * 	Description
> + * 		Copy *size* bytes from *data* into a ring buffer *ringbuf*.
> + * 	Return
> + * 		0, on success;
> + * 		< 0, on error.
> + *
> + * void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
> + * 	Description
> + * 		Reserve *size* bytes of payload in a ring buffer *ringbuf*.
> + * 	Return
> + * 		Valid pointer with *size* bytes of memory available; NULL,
> + * 		otherwise.
> + *
> + * void bpf_ringbuf_submit(void *data)
> + * 	Description
> + * 		Submit reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
> + *
> + * void bpf_ringbuf_discard(void *data)
> + * 	Description
> + * 		Discard reserved ring buffer sample, pointed to by *data*.
> + * 	Return
> + * 		Nothing.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -3250,7 +3277,11 @@ union bpf_attr {
>   	FN(sk_assign),			\
>   	FN(ktime_get_boot_ns),		\
>   	FN(seq_printf),			\
> -	FN(seq_write),
> +	FN(seq_write),			\
> +	FN(ringbuf_output),		\
> +	FN(ringbuf_reserve),		\
> +	FN(ringbuf_submit),		\
> +	FN(ringbuf_discard),

>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> --
> 2.24.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
                     ` (4 preceding siblings ...)
  2020-05-14 17:33   ` sdf
@ 2020-05-14 19:06   ` Alexei Starovoitov
  2020-05-14 20:49     ` Andrii Nakryiko
  2020-05-14 19:18     ` Jakub Kicinski
  6 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2020-05-14 19:06 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Paul E . McKenney, Jonathan Lemon

On Wed, May 13, 2020 at 12:25:27PM -0700, Andrii Nakryiko wrote:
> +
> +/* Given pointer to ring buffer record metadata, restore pointer to struct
> + * bpf_ringbuf itself by using page offset stored at offset 4
> + */
> +static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
> +{
> +	unsigned long addr = (unsigned long)meta_ptr;
> +	unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;

Looking at the further code it seems this one should be READ_ONCE, but...

> +
> +	return (void*)((addr & PAGE_MASK) - off);
> +}
> +
> +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> +{
> +	unsigned long cons_pos, prod_pos, new_prod_pos, flags;
> +	u32 len, pg_off;
> +	void *meta_ptr;
> +
> +	if (unlikely(size > UINT_MAX))
> +		return NULL;
> +
> +	len = round_up(size + RINGBUF_META_SZ, 8);

it may overflow despite the check above.

> +	cons_pos = READ_ONCE(rb->consumer_pos);
> +
> +	if (in_nmi()) {
> +		if (!spin_trylock_irqsave(&rb->spinlock, flags))
> +			return NULL;
> +	} else {
> +		spin_lock_irqsave(&rb->spinlock, flags);
> +	}
> +
> +	prod_pos = rb->producer_pos;
> +	new_prod_pos = prod_pos + len;
> +
> +	/* check for out of ringbuf space by ensuring producer position
> +	 * doesn't advance more than (ringbuf_size - 1) ahead
> +	 */
> +	if (new_prod_pos - cons_pos > rb->mask) {
> +		spin_unlock_irqrestore(&rb->spinlock, flags);
> +		return NULL;
> +	}
> +
> +	meta_ptr = rb->data + (prod_pos & rb->mask);
> +	pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
> +
> +	WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
> +	WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);

it doesn't match a few other places where a normal read is done.
But why WRITE_ONCE here?
How does it race with anything?
producer_pos is updated later.

> +
> +	/* ensure length prefix is written before updating producer positions */
> +	smp_wmb();

this barrier is enough to make sure meta_ptr and meta_ptr+4 init
is visible before producer_pos is updated below.

> +	WRITE_ONCE(rb->producer_pos, new_prod_pos);
> +
> +	spin_unlock_irqrestore(&rb->spinlock, flags);
> +
> +	return meta_ptr + RINGBUF_META_SZ;
> +}
> +
> +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags)
> +{
> +	struct bpf_ringbuf_map *rb_map;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +
> +	rb_map = container_of(map, struct bpf_ringbuf_map, map);
> +	return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
> +}
> +
> +const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
> +	.func		= bpf_ringbuf_reserve,
> +	.ret_type	= RET_PTR_TO_ALLOC_MEM_OR_NULL,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_CONST_ALLOC_SIZE_OR_ZERO,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +static void bpf_ringbuf_commit(void *sample, bool discard)
> +{
> +	unsigned long rec_pos, cons_pos;
> +	u32 new_meta, old_meta;
> +	void *meta_ptr;
> +	struct bpf_ringbuf *rb;
> +
> +	meta_ptr = sample - RINGBUF_META_SZ;
> +	rb = bpf_ringbuf_restore_from_rec(meta_ptr);
> +	old_meta = *(u32 *)meta_ptr;

I think this one will race with user space and should be READ_ONCE.

> +	new_meta = old_meta ^ RINGBUF_BUSY_BIT;
> +	if (discard)
> +		new_meta |= RINGBUF_DISCARD_BIT;
> +
> +	/* update metadata header with correct final size prefix */
> +	xchg((u32 *)meta_ptr, new_meta);
> +
> +	/* if consumer caught up and is waiting for our record, notify about
> +	 * new data availability
> +	 */
> +	rec_pos = (void *)meta_ptr - (void *)rb->data;
> +	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;

hmm. Earlier WRITE_ONCE(rb->producer_pos) is used, but here it's load_acquire.
Please be consistent with pairing.

> +	if (cons_pos == rec_pos)
> +		wake_up_all(&rb->waitq);

Is it legal to do from preempt_disabled region?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
@ 2020-05-14 19:18     ` Jakub Kicinski
  2020-05-13 21:58   ` Alan Maguire
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Jakub Kicinski @ 2020-05-14 19:18 UTC (permalink / raw)
  To: Andrii Nakryiko, linux-arch
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Paul E . McKenney, Jonathan Lemon

On Wed, 13 May 2020 12:25:27 -0700 Andrii Nakryiko wrote:
> One interesting implementation bit, that significantly simplifies (and thus
> speeds up as well) implementation of both producers and consumers is how data
> area is mapped twice contiguously back-to-back in the virtual memory. This
> allows to not take any special measures for samples that have to wrap around
> at the end of the circular buffer data area, because the next page after the
> last data page would be first data page again, and thus the sample will still
> appear completely contiguous in virtual memory. See comment and a simple ASCII
> diagram showing this visually in bpf_ringbuf_area_alloc().

Out of curiosity - is this 100% okay to do in the kernel and user space
these days? Is this bit part of the uAPI in case we need to back out of
it? 

In the olden days virtually mapped/tagged caches could get confused
seeing the same physical memory have two active virtual mappings, or 
at least that's what I've been told in school :)

Checking with Paul - he says that could have been the case for Itanium
and PA-RISC CPUs.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 0/6] BPF ring buffer
  2020-05-14 16:30     ` Jonathan Lemon
@ 2020-05-14 20:06       ` Andrii Nakryiko
  0 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14 20:06 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Jonathan Lemon, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Paul E . McKenney

On Thu, May 14, 2020 at 9:30 AM Jonathan Lemon <bsd@fb.com> wrote:
>
> On Wed, May 13, 2020 at 11:08:46PM -0700, Andrii Nakryiko wrote:
> > On Wed, May 13, 2020 at 3:49 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > >
> > > On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> > > > Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> > > > It presents an alternative to perf buffer, following its semantics closely,
> > > > but allowing sharing same instance of ring buffer across multiple CPUs
> > > > efficiently.
> > > >
> > > > Most patches have extensive commentary explaining various aspects, so I'll
> > > > keep cover letter short. Overall structure of the patch set:
> > > > - patch #1 adds BPF ring buffer implementation to kernel and necessary
> > > >   verifier support;
> > > > - patch #2 adds litmus tests validating all the memory orderings and locking
> > > >   is correct;
> > > > - patch #3 is an optional patch that generalizes verifier's reference tracking
> > > >   machinery to capture type of reference;
> > > > - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> > > > - path #5 adds selftest, both for single BPF ring buf use case, as well as
> > > >   using it with array/hash of maps;
> > > > - patch #6 adds extensive benchmarks and provide some analysis in commit
> > > >   message, it build upon selftests/bpf's bench runner.
> > > >
> > > >   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> > > >
> > > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> > >
> > > Looks very nice!  A few random questions:
> > >
> > > 1) Why not use a structure for the header, instead of 2 32bit ints?
> >
> > hm... no reason, just never occurred to me it's necessary :)
>
> It might be clearer to do this.  Something like:
>
> struct ringbuf_record {
>     union {
>         struct {
>             u32 size:30;
>             bool busy:1;
>             bool discard:1;
>         };
>         u32 word1;
>     };
>     union {
>         u32 pgoff;
>         u32 word2;
>     };
> };
>
> While perhaps a bit overkill, makes it clear what is going on.

I really want to avoid specifying bitfields, because that gets into
endianness territory, and I don't want to do different order of
bitfields depending on endianness of the platform. I can do


struct ringbuf_record_header {
    u32 len;
    u32 pgoff;
};

But that will be useful for kernel only and shouldn't be part of UAPI,
because pgoff makes sense only inside the kernel. For user-space,
first 4 bytes is length + busy&discard bits, second 4 bytes are
reserved and shouldn't be used (at least yet).

I guess I should put RINGBUF_META_SZ, RINGBUF_BUSY_BIT,
RINGBUF_DISCARD_BIT from patch #1 into include/uapi/linux/bpf.h to
make it a stable API, I suppose?
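
As a rough illustration of the mask-based alternative to bitfields described above, the following sketch decodes the first header word without depending on bitfield layout. The DEMO_* names and the exact bit positions (31 for busy, 30 for discard) are assumptions made for this example; the patch's own RINGBUF_BUSY_BIT and RINGBUF_DISCARD_BIT definitions are authoritative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define DEMO_BUSY_BIT    (1u << 31)
#define DEMO_DISCARD_BIT (1u << 30)
#define DEMO_LEN_MASK    (~(DEMO_BUSY_BIT | DEMO_DISCARD_BIT))

/* Length of the sample payload: the low 30 bits of the header word. */
static inline uint32_t demo_hdr_len(uint32_t hdr)
{
	return hdr & DEMO_LEN_MASK;
}

/* True while the producer has reserved but not yet committed the record. */
static inline bool demo_hdr_busy(uint32_t hdr)
{
	return (hdr & DEMO_BUSY_BIT) != 0;
}

/* True if the record was discarded and the consumer should skip it. */
static inline bool demo_hdr_discarded(uint32_t hdr)
{
	return (hdr & DEMO_DISCARD_BIT) != 0;
}

/* Mirrors the commit step quoted in this thread: the header is known to
 * have the busy bit set, so XOR clears it; discard optionally sets its bit.
 */
static inline uint32_t demo_hdr_commit(uint32_t hdr, bool discard)
{
	hdr ^= DEMO_BUSY_BIT;
	if (discard)
		hdr |= DEMO_DISCARD_BIT;
	return hdr;
}
```

Because the masks operate on the value of a u32 rather than on its in-memory layout, the same code works on little- and big-endian machines.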

>
>
> > > 2) Would it make sense to reserve X bytes, but only commit Y?
> > >    the offset field could be used to write the record length.
> > >
> > >    E.g.:
> > >       reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
> > >       commit  400 bytes    [ 512 ] [ 400 ]
> >
> > It could be done, though I had tentative plans to use those second 4
> > bytes for something useful eventually.
> >
> > But what's the use case? From ring buffer's perspective, X bytes were
> > reserved and are gone already and subsequent writers might have
> > already advanced producer counter with the assumption that all X bytes
> > are going to be used. So there are no space savings, even if record is
> > discarded or only portion of it is submitted. I can only see a bit of
> > added convenience for an application, because it doesn't have to track
> > amount of actual data in its record. But this doesn't seem to be a
> > common case either, so not sure how it's worth supporting... Is there
> > a particular case where this is extremely useful and extra 4 bytes in
> > record payload is too much?
>
> Not off the top of my head - it was just the first thing that came to
> mind when reading about the commit/discard paradigm.  I was thinking
> about variable records, where the maximum is reserved, but less data
> is written.  But there's no particular reason for the ringbuffer to
> track this either, it could be part of the application framing.

Yeah, I'd defer to the application doing that. People were asking about
using reserve with variable-sized records, but I don't think it's
possible to do. That's what the bpf_ringbuf_output() helper was added for:
prepare variable-sized data outside of ringbuf, then reserve the exact
amount and copy over. Less performant, but it uses ring buffer
space more efficiently.

>
>
> > > 3) Why have 2 separate pages for producer/consumer, instead of
> > >    just aligning to a smp cache line (or even 1/2 page?)
> >
> > Access rights restrictions. Consumer page is readable/writable,
> > producer page is read-only for user-space. If user-space had ability
> > to write producer position, it could wreak havoc on the
> > ringbuf algorithm.
>
> Ah, thanks, that makes sense.  Might want to add a comment to
> that effect, as it's different from other implementations.

Yep, definitely, I knew I forgot to document something :)

> --
> Jonathan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 16:50   ` Jonathan Lemon
@ 2020-05-14 20:11     ` Andrii Nakryiko
  0 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14 20:11 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 9:50 AM Jonathan Lemon <bsd@fb.com> wrote:
>
> On Wed, May 13, 2020 at 12:25:27PM -0700, Andrii Nakryiko wrote:
> > +static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
> > +{
> > +     unsigned long addr = (unsigned long)meta_ptr;
> > +     unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;
> > +
> > +     return (void*)((addr & PAGE_MASK) - off);
> > +}
> > +
> > +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> > +{
> > +     unsigned long cons_pos, prod_pos, new_prod_pos, flags;
> > +     u32 len, pg_off;
> > +     void *meta_ptr;
> > +
> > +     if (unlikely(size > UINT_MAX))
> > +             return NULL;
>
> Size should be 30 bits, not UINT_MAX, since 2 bits are reserved.

Oh, good catch, thanks. Yep, should be (UINT_MAX >> 2), will add a
constant for this.


>
> > +
> > +     len = round_up(size + RINGBUF_META_SZ, 8);
> > +     cons_pos = READ_ONCE(rb->consumer_pos);
> > +
> > +     if (in_nmi()) {
> > +             if (!spin_trylock_irqsave(&rb->spinlock, flags))
> > +                     return NULL;
> > +     } else {
> > +             spin_lock_irqsave(&rb->spinlock, flags);
> > +     }
> > +
> > +     prod_pos = rb->producer_pos;
> > +     new_prod_pos = prod_pos + len;
> > +
> > +     /* check for out of ringbuf space by ensuring producer position
> > +      * doesn't advance more than (ringbuf_size - 1) ahead
> > +      */
> > +     if (new_prod_pos - cons_pos > rb->mask) {
> > +             spin_unlock_irqrestore(&rb->spinlock, flags);
> > +             return NULL;
> > +     }
> > +
> > +     meta_ptr = rb->data + (prod_pos & rb->mask);
> > +     pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
> > +
> > +     WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
> > +     WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);
>
> Or define a 64bit word in the structure and use:
>
>         WRITE_ONCE(*(u64 *)meta_ptr, rec.header);

yep, can do that

>
>
> > +
> > +     /* ensure length prefix is written before updating producer positions */
> > +     smp_wmb();
> > +     WRITE_ONCE(rb->producer_pos, new_prod_pos);
> > +
> > +     spin_unlock_irqrestore(&rb->spinlock, flags);
> > +
> > +     return meta_ptr + RINGBUF_META_SZ;
> > +}
> > +
> > +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags)
> > +{
> > +     struct bpf_ringbuf_map *rb_map;
> > +
> > +     if (unlikely(flags))
> > +             return -EINVAL;
> > +
> > +     rb_map = container_of(map, struct bpf_ringbuf_map, map);
> > +     return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
> > +}
> > +
>
> --
> Jonathan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 17:33   ` sdf
@ 2020-05-14 20:18     ` Andrii Nakryiko
  2020-05-14 20:53       ` sdf
  0 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 10:33 AM <sdf@google.com> wrote:
>
> On 05/13, Andrii Nakryiko wrote:
> > This commit adds a new MPSC ring buffer implementation into BPF ecosystem,
> > which allows multiple CPUs to submit data to a single shared ring buffer. On
> > the consumption side, only single consumer is assumed.
>

[...]

> > Comparison to alternatives
> > --------------------------
> > Before considering implementing BPF ring buffer from scratch existing
> > alternatives in kernel were evaluated, but didn't seem to meet the needs. They
> > largely fell into few categories:
> >    - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
> >      outlined above (ordering and memory consumption);
> >    - linked list-based implementations; while some were multi-producer designs,
> >      consuming these from user-space would be very complicated and most
> >      probably not performant; memory-mapping contiguous piece of memory is
> >      simpler and more performant for user-space consumers;
> >    - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
> >      SPSC queue into MPSC w/ lock would have subpar performance compared to
> >      locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
> >      elements would be too limiting for BPF programs, given existing BPF
> >      programs heavily rely on variable-sized perf buffer already;
> >    - specialized implementations (like a new printk ring buffer, [0]) with lots
> >      of printk-specific limitations and implications, that didn't seem to fit
> >      well for intended use with BPF programs.
> That's a very nice write up! Does it make sense to put most of it
> under Documentation/bpf? We were discussing socket storage with KP
> recently and I mentioned that commit 6ac99e8f23d4 has a really nice
> description of the architecture with ascii diagrams/etc. Sometimes
> it's really hard to chase down the commit history to find out
> these sorts of details.

Sure, can do that. And thanks :)

>
> >    [0] https://lwn.net/Articles/779550/
>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >   include/linux/bpf.h            |  12 +
> >   include/linux/bpf_types.h      |   1 +
> >   include/linux/bpf_verifier.h   |   4 +
> >   include/uapi/linux/bpf.h       |  33 ++-
> >   kernel/bpf/Makefile            |   2 +-
> >   kernel/bpf/helpers.c           |   8 +
> >   kernel/bpf/ringbuf.c           | 409 +++++++++++++++++++++++++++++++++
> >   kernel/bpf/syscall.c           |  12 +
> >   kernel/bpf/verifier.c          | 156 ++++++++++---
> >   kernel/trace/bpf_trace.c       |   8 +
> >   tools/include/uapi/linux/bpf.h |  33 ++-
> >   11 files changed, 643 insertions(+), 35 deletions(-)
> >   create mode 100644 kernel/bpf/ringbuf.c
>

[...]

> > + * void bpf_ringbuf_submit(void *data)
> > + *   Description
> > + *           Submit reserved ring buffer sample, pointed to by *data*.
> > + *   Return
> > + *           Nothing.
> Even though you mention self-pacing properties, would it still
> make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output
> to indicate whether to wake up userspace or not? Maybe something like
> a threshold of number of outstanding events in the ringbuf after which
> we do the wakeup? The default 0/1 preserve the existing behavior.
>
> The example I can give is a control plane userspace thread that
> once a second aggregates the events, it doesn't care about millisecond
> resolution. With the current scheme, I suppose, if BPF generates events
> every 1ms, the userspace will be woken up 1000 times (if it can keep
> up). Most of the time, we don't really care and some buffering
> properties are desired.

perf buffer has a setting like this, and believe me, it's so confusing
and dangerous that I wouldn't want this to be exposed. Even though I
was aware of this behavior, I still had to debug and work around this
lack of wakeup a few times; it's a really, really confusing feature.

In your case, though, why wouldn't user-space poll data just once a
second, if it's not interested in getting data as fast as possible?

[...]

> > +struct bpf_ringbuf {
> > +     wait_queue_head_t waitq;
> > +     u64 mask;
> > +     spinlock_t spinlock ____cacheline_aligned_in_smp;
> > +     u64 consumer_pos __aligned(PAGE_SIZE);
> > +     u64 producer_pos __aligned(PAGE_SIZE);
> > +     char data[] __aligned(PAGE_SIZE);
> > +};
> > +
> > +struct bpf_ringbuf_map {
> > +     struct bpf_map map;
> > +     struct bpf_map_memory memory;
> > +     struct bpf_ringbuf *rb;
> > +};
> > +
> > +static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
> > +{
> > +     const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN |
> > +                         __GFP_ZERO;
> > +     int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF;
> There is a bunch of magic '2's scattered around. Would it make sense
> to use a proper define (with a comment) instead?

Yep, it's consumer + producer counter pages, I'll add a constant.

>
> > +     int nr_data_pages = data_sz >> PAGE_SHIFT;
> > +     int nr_pages = nr_meta_pages + nr_data_pages;
> > +     struct page **pages, *page;
> > +     size_t array_size;
> > +     void *addr;
> > +     int i;
> > +

[...]

Please trim. I do love my code of course, but scrolling through it so
many times gets old still ;)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 19:18     ` Jakub Kicinski
  (?)
@ 2020-05-14 20:39     ` Thomas Gleixner
  2020-05-14 21:30       ` Andrii Nakryiko
  -1 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2020-05-14 20:39 UTC (permalink / raw)
  To: Jakub Kicinski, Andrii Nakryiko, linux-arch
  Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team,
	Paul E . McKenney, Jonathan Lemon

Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 13 May 2020 12:25:27 -0700 Andrii Nakryiko wrote:
>> One interesting implementation bit, that significantly simplifies (and thus
>> speeds up as well) implementation of both producers and consumers is how data
>> area is mapped twice contiguously back-to-back in the virtual memory. This
>> allows to not take any special measures for samples that have to wrap around
>> at the end of the circular buffer data area, because the next page after the
>> last data page would be first data page again, and thus the sample will still
>> appear completely contiguous in virtual memory. See comment and a simple ASCII
>> diagram showing this visually in bpf_ringbuf_area_alloc().
>
> Out of curiosity - is this 100% okay to do in the kernel and user space
> these days? Is this bit part of the uAPI in case we need to back out of
> it? 
>
> In the olden days virtually mapped/tagged caches could get confused
> seeing the same physical memory have two active virtual mappings, or 
> at least that's what I've been told in school :)

Yes, caching the same thing twice causes coherency problems.

VIVT can be found in ARMv5, MIPS, NDS32 and Unicore32.

> Checking with Paul - he says that could have been the case for Itanium
> and PA-RISC CPUs.

Itanium: PIPT L1/L2.
PA-RISC: VIPT L1 and PIPT L2

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 19:06   ` Alexei Starovoitov
@ 2020-05-14 20:49     ` Andrii Nakryiko
  0 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14 20:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 12:06 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 12:25:27PM -0700, Andrii Nakryiko wrote:
> > +
> > +/* Given pointer to ring buffer record metadata, restore pointer to struct
> > + * bpf_ringbuf itself by using page offset stored at offset 4
> > + */
> > +static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
> > +{
> > +     unsigned long addr = (unsigned long)meta_ptr;
> > +     unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;
>
> Looking at the further code it seems this one should be READ_ONCE, but...
>

This will be called from commit(), which will be called after a
successful reserve(), which does spin_unlock at the end. So in terms
of memory barriers, everything should be fine? Or am I missing some
trickier aspect?

> > +
> > +     return (void*)((addr & PAGE_MASK) - off);
> > +}
> > +
> > +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> > +{
> > +     unsigned long cons_pos, prod_pos, new_prod_pos, flags;
> > +     u32 len, pg_off;
> > +     void *meta_ptr;
> > +
> > +     if (unlikely(size > UINT_MAX))
> > +             return NULL;
> > +
> > +     len = round_up(size + RINGBUF_META_SZ, 8);
>
> it may overflow despite the check above.

As Jonathan pointed out, max size should be enforced to be at
most (UINT_MAX/4), due to 2 bits reserved for BUSY and DISCARD. So it
shouldn't overflow.
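
The arithmetic behind this exchange can be checked with a small stand-alone sketch. It assumes an 8-byte record header (the two 32-bit words discussed in this thread); demo_record_len() and the DEMO_* constants are hypothetical names, not the patch's.

```c
#include <stdint.h>

#define DEMO_META_SZ  8u			/* assumed: two 32-bit header words */
#define DEMO_MAX_SIZE (UINT32_MAX >> 2)		/* 30 bits left for the length */

/* Round x up to the next multiple of 8. */
static inline uint64_t demo_round_up8(uint64_t x)
{
	return (x + 7) & ~(uint64_t)7;
}

/* Total record length for a payload of `size` bytes, stored in 32 bits as
 * in the quoted code. Returns 0 if `size` exceeds `limit`.
 */
static inline uint32_t demo_record_len(uint64_t size, uint64_t limit)
{
	if (size > limit)
		return 0;
	return (uint32_t)demo_round_up8(size + DEMO_META_SZ);
}
```

With the original UINT_MAX limit, a payload of UINT_MAX bytes gives a 64-bit sum of 0x100000008 after rounding, which truncates to 8 when stored in a u32. With the limit lowered to UINT_MAX >> 2, the largest possible result is 0x40000008, which still fits.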

>
> > +     cons_pos = READ_ONCE(rb->consumer_pos);
> > +
> > +     if (in_nmi()) {
> > +             if (!spin_trylock_irqsave(&rb->spinlock, flags))
> > +                     return NULL;
> > +     } else {
> > +             spin_lock_irqsave(&rb->spinlock, flags);
> > +     }
> > +
> > +     prod_pos = rb->producer_pos;
> > +     new_prod_pos = prod_pos + len;
> > +
> > +     /* check for out of ringbuf space by ensuring producer position
> > +      * doesn't advance more than (ringbuf_size - 1) ahead
> > +      */
> > +     if (new_prod_pos - cons_pos > rb->mask) {
> > +             spin_unlock_irqrestore(&rb->spinlock, flags);
> > +             return NULL;
> > +     }
> > +
> > +     meta_ptr = rb->data + (prod_pos & rb->mask);
> > +     pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
> > +
> > +     WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
> > +     WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);
>
> it doesn't match a few other places where a normal read is done.
> But why WRITE_ONCE here?
> How does it race with anything?
> producer_pos is updated later.

Yeah, I think you are right. I left it as WRITE_ONCE from my initial
lockless variant, but this could be just a normal store, will fix.

>
> > +
> > +     /* ensure length prefix is written before updating producer positions */
> > +     smp_wmb();
>
> this barrier is enough to make sure meta_ptr and meta_ptr+4 init
> is visible before producer_pos is updated below.

yep, 100% agree


>
> > +     WRITE_ONCE(rb->producer_pos, new_prod_pos);

consumer reads this with smp_load_acquire, I guess for consistency
I'll switch this to smp_store_release() then, right?
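
The pairing under discussion can be sketched in user space with C11 atomics, where memory_order_release and memory_order_acquire play roles analogous to smp_store_release() and smp_load_acquire(). This is a simplified model under stated assumptions: positions count whole records rather than bytes, there is a single producer, and struct demo_ring, demo_publish() and demo_peek() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_SLOTS 16

struct demo_ring {
	_Atomic uint64_t producer_pos;	/* written by producer, read by consumer */
	uint32_t hdr[DEMO_SLOTS];	/* one header word per slot */
};

/* Producer: fill in the header, then publish the new position. The release
 * store orders the header write before the position update.
 */
static void demo_publish(struct demo_ring *r, uint32_t hdr_val)
{
	uint64_t pos = atomic_load_explicit(&r->producer_pos,
					    memory_order_relaxed);

	r->hdr[pos % DEMO_SLOTS] = hdr_val;
	atomic_store_explicit(&r->producer_pos, pos + 1, memory_order_release);
}

/* Consumer: the acquire load pairs with the release store above, so a
 * consumer that observes the new position also observes the header.
 * Returns 1 and fills *out if a record is available, 0 otherwise.
 */
static int demo_peek(struct demo_ring *r, uint64_t cons_pos, uint32_t *out)
{
	uint64_t prod = atomic_load_explicit(&r->producer_pos,
					     memory_order_acquire);

	if (cons_pos == prod)
		return 0;
	*out = r->hdr[cons_pos % DEMO_SLOTS];
	return 1;
}
```

Using release on the store and acquire on the load makes the pairing explicit at both ends, which is the consistency being asked for here.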

> > +
> > +     spin_unlock_irqrestore(&rb->spinlock, flags);
> > +
> > +     return meta_ptr + RINGBUF_META_SZ;
> > +}
> > +
> > +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags)
> > +{
> > +     struct bpf_ringbuf_map *rb_map;
> > +
> > +     if (unlikely(flags))
> > +             return -EINVAL;
> > +
> > +     rb_map = container_of(map, struct bpf_ringbuf_map, map);
> > +     return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
> > +}
> > +
> > +const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
> > +     .func           = bpf_ringbuf_reserve,
> > +     .ret_type       = RET_PTR_TO_ALLOC_MEM_OR_NULL,
> > +     .arg1_type      = ARG_CONST_MAP_PTR,
> > +     .arg2_type      = ARG_CONST_ALLOC_SIZE_OR_ZERO,
> > +     .arg3_type      = ARG_ANYTHING,
> > +};
> > +
> > +static void bpf_ringbuf_commit(void *sample, bool discard)
> > +{
> > +     unsigned long rec_pos, cons_pos;
> > +     u32 new_meta, old_meta;
> > +     void *meta_ptr;
> > +     struct bpf_ringbuf *rb;
> > +
> > +     meta_ptr = sample - RINGBUF_META_SZ;
> > +     rb = bpf_ringbuf_restore_from_rec(meta_ptr);
> > +     old_meta = *(u32 *)meta_ptr;
>
> I think this one will race with user space and should be READ_ONCE.

This is never modified by user-space, only by previous reserve(). Is
that still considered a race by some tooling?

>
> > +     new_meta = old_meta ^ RINGBUF_BUSY_BIT;
> > +     if (discard)
> > +             new_meta |= RINGBUF_DISCARD_BIT;
> > +
> > +     /* update metadata header with correct final size prefix */
> > +     xchg((u32 *)meta_ptr, new_meta);
> > +
> > +     /* if consumer caught up and is waiting for our record, notify about
> > +      * new data availability
> > +      */
> > +     rec_pos = (void *)meta_ptr - (void *)rb->data;
> > +     cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
>
> hmm. Earlier WRITE_ONCE(rb->producer_pos) is used, but here it's load_acquire.
> Please be consistent with pairing.

wait, it's *consumer* vs *producer* positions. consumer_pos is updated
by consumer using smp_store_release(), so it's paired. But I mentioned
above that producer_pos has to be written with smp_store_release().
Does that address your concern?


>
> > +     if (cons_pos == rec_pos)
> > +             wake_up_all(&rb->waitq);
>
> Is it legal to do from preempt_disabled region?

Don't know, couldn't find anything about this aspect. But looking at
perf buffer code, all the wakeups are done after irq_work_queue(), so
I guess it's not safe then? I'll add irq_work_queue() then.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 20:18     ` Andrii Nakryiko
@ 2020-05-14 20:53       ` sdf
  2020-05-14 21:13         ` Andrii Nakryiko
  0 siblings, 1 reply; 32+ messages in thread
From: sdf @ 2020-05-14 20:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney, Jonathan Lemon

On 05/14, Andrii Nakryiko wrote:
> On Thu, May 14, 2020 at 10:33 AM <sdf@google.com> wrote:
> >
> > On 05/13, Andrii Nakryiko wrote:

[...]

> > > + * void bpf_ringbuf_submit(void *data)
> > > + *   Description
> > > + *           Submit reserved ring buffer sample, pointed to by *data*.
> > > + *   Return
> > > + *           Nothing.
> > Even though you mention self-pacing properties, would it still
> > make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output
> > to indicate whether to wake up userspace or not? Maybe something like
> > a threshold of number of outstanding events in the ringbuf after which
> > we do the wakeup? The default 0/1 preserve the existing behavior.
> >
> > The example I can give is a control plane userspace thread that
> > once a second aggregates the events, it doesn't care about millisecond
> > resolution. With the current scheme, I suppose, if BPF generates events
> > every 1ms, the userspace will be woken up 1000 times (if it can keep
> > up). Most of the time, we don't really care and some buffering
> > properties are desired.

> perf buffer has setting like this, and believe me, it's so confusing
> and dangerous, that I wouldn't want this to be exposed. Even though I
> was aware of this behavior, I still had to debug and work-around this
> lack on wakeup few times, it's really-really confusing feature.

> In your case, though, why wouldn't user-space poll data just once a
> second, if it's not interested in getting data as fast as possible?
If I poll once per second I might lose the events if, for some reason,
there is a spike. I really want to have something like: "wakeup
userspace if the ringbuffer fill is over some threshold or
the last wakeup was too long ago". We currently do this via a percpu
cache map. IIRC, you've shared on lsfmmbpf that you do something like
that as well.

So I was thinking about how I can use the new ringbuf to remove the
unneeded copies and help with the reordering, but I'm a bit concerned
about regressing on the number of wakeups.

Maybe having a flag like RINGBUF_NO_WAKEUP in bpf_ringbuf_submit()
will suffice? And if there is a helper or some way to obtain a
number of unconsumed items, I can implement my own flushing policy.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 20:53       ` sdf
@ 2020-05-14 21:13         ` Andrii Nakryiko
  2020-05-14 21:56           ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14 21:13 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 1:53 PM <sdf@google.com> wrote:
>
> On 05/14, Andrii Nakryiko wrote:
> > On Thu, May 14, 2020 at 10:33 AM <sdf@google.com> wrote:
> > >
> > > On 05/13, Andrii Nakryiko wrote:
>
> [...]
>
> > > > + * void bpf_ringbuf_submit(void *data)
> > > > + *   Description
> > > > + *           Submit reserved ring buffer sample, pointed to by *data*.
> > > > + *   Return
> > > > + *           Nothing.
> > > Even though you mention self-pacing properties, would it still
> > > make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output
> > > to indicate whether to wake up userspace or not? Maybe something like
> > > a threshold of number of outstanding events in the ringbuf after which
> > > we do the wakeup? The default 0/1 preserve the existing behavior.
> > >
> > > The example I can give is a control plane userspace thread that
> > > once a second aggregates the events, it doesn't care about millisecond
> > > resolution. With the current scheme, I suppose, if BPF generates events
> > > every 1ms, the userspace will be woken up 1000 times (if it can keep
> > > up). Most of the time, we don't really care and some buffering
> > > properties are desired.
>
> > perf buffer has setting like this, and believe me, it's so confusing
> > and dangerous, that I wouldn't want this to be exposed. Even though I
> > was aware of this behavior, I still had to debug and work-around this
> > lack on wakeup few times, it's really-really confusing feature.
>
> > In your case, though, why wouldn't user-space poll data just once a
> > second, if it's not interested in getting data as fast as possible?
> If I poll once per second I might lose the events if, for some reason,
> there is a spike. I really want to have something like: "wakeup
> userspace if the ringbuffer fill is over some threshold or
> the last wakeup was too long ago". We currently do this via a percpu
> cache map. IIRC, you've shared on lsfmmbpf that you do something like
> that as well.

Hm... I don't remember such a use case on our side. All applications I
know of use the default perf_buffer settings with no sampling.

>
> So I was thinking how I can use new ringbuff to remove the unneeded
> copies and help with the reordering, but I'm a bit concerned about
> regressing on the number of wakeups.
>
> Maybe having a flag like RINGBUF_NO_WAKEUP in bpf_ringbuf_submit()
> will suffice? And if there is a helper or some way to obtain a
> number of unconsumed items, I can implement my own flushing policy.

Ok, I guess giving the application control at each discard/commit makes
for ultimate flexibility. Let me add a flags argument to commit/discard
and allow specifying a NO_WAKEUP flag. As for a count of unconsumed
events -- that would be a bit expensive to maintain. How about the
amount of data that's not consumed? It's obviously going to be racy, but
returning (producer_pos - consumer_pos) should be sufficient for such
smart and best-effort heuristics? WDYT?
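
As a rough sketch of the kind of policy discussed above ("wake up if the
fill is over some threshold or the last wakeup was too long ago"), the
decision could look like the plain C below. Everything here is
hypothetical: struct flush_policy, should_wakeup() and the way the
positions and the timestamp are obtained are assumptions for
illustration, not part of the patch. A BPF program would pass the
proposed NO_WAKEUP flag whenever should_wakeup() returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ringbuf flushing policy state. */
struct flush_policy {
	uint64_t fill_threshold;	/* wake up at or above this many bytes */
	uint64_t max_delay_ns;		/* ...or if the last wakeup is this old */
	uint64_t last_wakeup_ns;	/* timestamp of the last wakeup */
};

/* Racy, best-effort amount of data not yet consumed. */
static uint64_t unconsumed_bytes(uint64_t producer_pos, uint64_t consumer_pos)
{
	return producer_pos - consumer_pos;
}

/* Returns true if this submit should wake up the consumer. */
static bool should_wakeup(struct flush_policy *p, uint64_t producer_pos,
			  uint64_t consumer_pos, uint64_t now_ns)
{
	bool over_threshold;
	bool too_old;

	assert(now_ns >= p->last_wakeup_ns);
	over_threshold = unconsumed_bytes(producer_pos, consumer_pos) >=
			 p->fill_threshold;
	too_old = now_ns - p->last_wakeup_ns >= p->max_delay_ns;

	if (over_threshold || too_old) {
		p->last_wakeup_ns = now_ns;
		return true;
	}
	return false;
}
```

Since the positions are sampled without synchronization, the result can
be slightly stale, which should be acceptable for a heuristic like this.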

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 20:39     ` Thomas Gleixner
@ 2020-05-14 21:30       ` Andrii Nakryiko
  2020-05-14 22:13         ` Paul E. McKenney
  2020-05-14 22:56         ` Alexei Starovoitov
  0 siblings, 2 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14 21:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jakub Kicinski, Andrii Nakryiko, linux-arch, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 1:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Jakub Kicinski <kuba@kernel.org> writes:
>
> > On Wed, 13 May 2020 12:25:27 -0700 Andrii Nakryiko wrote:
> >> One interesting implementation bit, that significantly simplifies (and thus
> >> speeds up as well) implementation of both producers and consumers is how data
> >> area is mapped twice contiguously back-to-back in the virtual memory. This
> >> allows to not take any special measures for samples that have to wrap around
> >> at the end of the circular buffer data area, because the next page after the
> >> last data page would be first data page again, and thus the sample will still
> >> appear completely contiguous in virtual memory. See comment and a simple ASCII
> >> diagram showing this visually in bpf_ringbuf_area_alloc().
> >
> > Out of curiosity - is this 100% okay to do in the kernel and user space
> > these days? Is this bit part of the uAPI in case we need to back out of
> > it?
> >
> > In the olden days virtually mapped/tagged caches could get confused
> > seeing the same physical memory have two active virtual mappings, or
> > at least that's what I've been told in school :)
>
> Yes, caching the same thing twice causes coherency problems.
>
> VIVT can be found in ARMv5, MIPS, NDS32 and Unicore32.
>
> > Checking with Paul - he says that could have been the case for Itanium
> > and PA-RISC CPUs.
>
> Itanium: PIPT L1/L2.
> PA-RISC: VIPT L1 and PIPT L2
>
> Thanks,
>

Jakub, thanks for bringing this up.

Thomas, Paul, what kind of problems are we talking about here? What
are the possible problems in practice?

So just for the context, all the metadata (record header) that is
written/read under lock and with smp_store_release/smp_load_acquire is
written through one set of page mappings (the first one). Only
some of the sample payload might go into the second set of mapped pages.
Does this mean that user-space might read some old payloads in such a
case?

I could work-around that in user-space, by mmaping twice the same
range, one after the other (second mmap would use MAP_FIXED flag, of
course). So that's not a big deal.
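
For reference, a minimal user-space sketch of mapping the same range
twice back-to-back: reserve twice the size of address space, then place
two MAP_FIXED shared mappings of the same file offset into it. The
helpers map_twice() and make_backing_fd() are made up for this example,
and a temporary file stands in for the ring buffer map fd, so this shows
only the mmap() mechanics and not the actual libbpf code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Demo only: a temporary file of `size` bytes to act as the shared area. */
static int make_backing_fd(size_t size)
{
	FILE *f = tmpfile();
	int fd;

	if (!f)
		return -1;
	fd = dup(fileno(f));	/* keep the file alive after fclose() */
	fclose(f);
	if (fd >= 0 && ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Map the first `size` bytes of `fd` twice, one copy right after the
 * other. `size` must be a multiple of the page size. Returns NULL on error.
 */
static void *map_twice(int fd, size_t size)
{
	unsigned char *base;

	assert(size > 0);
	/* Reserve 2 * size bytes of contiguous virtual address space. */
	base = mmap(NULL, 2 * size, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;
	/* Overlay both halves with the same file range. */
	if (mmap(base, size, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
	    mmap(base + size, size, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
		munmap(base, 2 * size);
		return NULL;
	}
	return base;
}
```

With this layout a write at base + size lands in the same physical page
as base, so a record that crosses the end of the first copy reads as
contiguous memory.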

But on the kernel side it's a crucial property, because it allows BPF
programs to work with data with the assumption that all data is
linearly mapped. If we can't do that, the reserve() API is impossible to
implement. So in that case, I'd rather enable BPF ring buffer only on
platforms that won't have these problems, instead of removing
reserve/commit API altogether.

Well, another way is to just "discard" the remaining space at the end,
if it's not sufficient for the entire record. That's doable, there will
always be at least 8 bytes available for the record header, so not a
problem in that regard. But I would appreciate it if you could help me
understand the full implications of caching physical memory twice.
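
To illustrate the "discard the tail" alternative, here is a simplified
single-mapping sketch. The names (struct ring, reserve_no_wrap(),
DISCARD_BIT) and the header layout are assumptions for this example
only, and the free-space check against the consumer position is left
out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HDR_SZ		8u
#define DISCARD_BIT	(1u << 30)	/* hypothetical "skip this record" flag */

struct rec_hdr {
	uint32_t len;	/* payload length, possibly with DISCARD_BIT set */
	uint32_t pad;
};

struct ring {
	unsigned char *data;
	uint64_t size;		/* power of two, at least HDR_SZ */
	uint64_t prod_pos;	/* always a multiple of 8 */
};

/*
 * Reserve a record that never crosses the end of the data area. If the
 * record does not fit in the tail, the tail is turned into a discarded
 * record and the real one starts at offset 0.
 * NOTE: no check against the consumer position is done in this sketch.
 */
static void *reserve_no_wrap(struct ring *rb, uint32_t payload_len)
{
	uint64_t off = rb->prod_pos & (rb->size - 1);
	uint64_t tail = rb->size - off;
	uint64_t need = HDR_SZ + (((uint64_t)payload_len + 7) & ~(uint64_t)7);
	struct rec_hdr hdr = { 0, 0 };

	assert(rb->size >= HDR_SZ && (rb->size & (rb->size - 1)) == 0);
	if (payload_len >= DISCARD_BIT || need > rb->size)
		return NULL;

	if (need > tail) {
		/* tail is a non-zero multiple of 8, so a header always fits */
		hdr.len = (uint32_t)(tail - HDR_SZ) | DISCARD_BIT;
		memcpy(rb->data + off, &hdr, HDR_SZ);
		rb->prod_pos += tail;
		off = 0;
	}

	hdr.len = payload_len;
	memcpy(rb->data + off, &hdr, HDR_SZ);
	rb->prod_pos += need;
	return rb->data + off + HDR_SZ;
}
```

The cost of this variant is the wasted tail space, bounded by one
maximum-sized record per trip around the buffer.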

Also just for my education, with VIVT caches, if a user-space
application mmap()'s the same region of memory twice (without
MAP_FIXED), wouldn't that cause similar problems? Can't this happen
today with mmap() API? Why is that not a problem?


>         tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 21:13         ` Andrii Nakryiko
@ 2020-05-14 21:56           ` Stanislav Fomichev
  0 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2020-05-14 21:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 2:13 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 1:53 PM <sdf@google.com> wrote:
> >
> > On 05/14, Andrii Nakryiko wrote:
> > > On Thu, May 14, 2020 at 10:33 AM <sdf@google.com> wrote:
> > > >
> > > > On 05/13, Andrii Nakryiko wrote:
> >
> > [...]
> >
> > > > > + * void bpf_ringbuf_submit(void *data)
> > > > > + *   Description
> > > > > + *           Submit reserved ring buffer sample, pointed to by *data*.
> > > > > + *   Return
> > > > > + *           Nothing.
> > > > Even though you mention self-pacing properties, would it still
> > > > make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output
> > > > to indicate whether to wake up userspace or not? Maybe something like
> > > > a threshold of number of outstanding events in the ringbuf after which
> > > > we do the wakeup? The default 0/1 preserve the existing behavior.
> > > >
> > > > The example I can give is a control plane userspace thread that
> > > > once a second aggregates the events, it doesn't care about millisecond
> > > > resolution. With the current scheme, I suppose, if BPF generates events
> > > > every 1ms, the userspace will be woken up 1000 times (if it can keep
> > > > up). Most of the time, we don't really care and some buffering
> > > > properties are desired.
> >
> > > perf buffer has setting like this, and believe me, it's so confusing
> > > and dangerous, that I wouldn't want this to be exposed. Even though I
> > > was aware of this behavior, I still had to debug and work-around this
> > > lack on wakeup few times, it's really-really confusing feature.
> >
> > > In your case, though, why wouldn't user-space poll data just once a
> > > second, if it's not interested in getting data as fast as possible?
> > If I poll once per second I might lose the events if, for some reason,
> > there is a spike. I really want to have something like: "wakeup
> > userspace if the ringbuffer fill is over some threshold or
> > the last wakeup was too long ago". We currently do this via a percpu
> > cache map. IIRC, you've shared on lsfmmbpf that you do something like
> > that as well.
>
> Hm... don't remember such use case on our side. All applications I
> know of use default perf_buffer settings with no sampling.
Never mind, I might have misunderstood :-)

> > So I was thinking how I can use new ringbuff to remove the unneeded
> > copies and help with the reordering, but I'm a bit concerned about
> > regressing on the number of wakeups.
> >
> > Maybe having a flag like RINGBUF_NO_WAKEUP in bpf_ringbuf_submit()
> > will suffice? And if there is a helper or some way to obtain a
> > number of unconsumed items, I can implement my own flushing policy.
>
> Ok, I guess giving application control at each discard/commit makes
> for ultimate flexibility. Let me add flags argument to commit/discard
> and allow to specify NO_WAKEUP flag. As for count of unconsumed events
> -- that would be a bit expensive to maintain. How about amount of data
> that's not consumed? It's obviously going to be racy, but returning
> (producer_pos - consumer_pos) should be sufficient enough for such
> smart and best-effort heuristics? WDYT?
Awesome, SGTM! Racy is fine (I don't see how we can make it non-racy
as well). The amount of data instead of the number of items is also fine
since I know the size of the buffer.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 21:30       ` Andrii Nakryiko
@ 2020-05-14 22:13         ` Paul E. McKenney
  2020-05-14 22:56         ` Alexei Starovoitov
  1 sibling, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2020-05-14 22:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Thomas Gleixner, Jakub Kicinski, Andrii Nakryiko, linux-arch,
	bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Jonathan Lemon

On Thu, May 14, 2020 at 02:30:11PM -0700, Andrii Nakryiko wrote:
> On Thu, May 14, 2020 at 1:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Jakub Kicinski <kuba@kernel.org> writes:
> >
> > > On Wed, 13 May 2020 12:25:27 -0700 Andrii Nakryiko wrote:
> > >> One interesting implementation bit, that significantly simplifies (and thus
> > >> speeds up as well) implementation of both producers and consumers is how data
> > >> area is mapped twice contiguously back-to-back in the virtual memory. This
> > >> allows to not take any special measures for samples that have to wrap around
> > >> at the end of the circular buffer data area, because the next page after the
> > >> last data page would be first data page again, and thus the sample will still
> > >> appear completely contiguous in virtual memory. See comment and a simple ASCII
> > >> diagram showing this visually in bpf_ringbuf_area_alloc().
> > >
> > > Out of curiosity - is this 100% okay to do in the kernel and user space
> > > these days? Is this bit part of the uAPI in case we need to back out of
> > > it?
> > >
> > > In the olden days virtually mapped/tagged caches could get confused
> > > seeing the same physical memory have two active virtual mappings, or
> > > at least that's what I've been told in school :)
> >
> > Yes, caching the same thing twice causes coherency problems.
> >
> > VIVT can be found in ARMv5, MIPS, NDS32 and Unicore32.
> >
> > > Checking with Paul - he says that could have been the case for Itanium
> > > and PA-RISC CPUs.
> >
> > Itanium: PIPT L1/L2.
> > PA-RISC: VIPT L1 and PIPT L2

Thank you, Thomas!

> > Thanks,
> 
> Jakub, thanks for bringing this up.

Indeed!  I had completely forgotten about it.

> Thomas, Paul, what kind of problems are we talking about here? What
> are the possible problems in practice?

One CPU stores into one of the mappings, and then it (or some other CPU)
subsequently sees the old value via the other mapping, maybe for a short
time, or maybe indefinitely, depending.  This sort of thing can happen
when the same location in the two mappings maps to different locations
in the cache.  The store via one virtual address then is placed into one
location in the cache, but the reads from the other virtual address are
referring to some other location in the cache.

In the past, some systems have documented virtual address offsets that
are guaranteed to work, presumably because those offsets force the two
views of the same physical memory to share the same location in the cache.
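
A toy model of the indexing may help here. In a virtually indexed
cache the set is chosen from virtual address bits, so two aliases of
the same physical memory land in the same set only when their virtual
addresses agree in those bits. The geometry struct and function names
below are made up for the example, and the model covers indexing only
(it says nothing about tags or about what real hardware does within a
set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache geometry: one "way" spans line_size * num_sets bytes. */
struct cache_geom {
	uint64_t line_size;	/* bytes per cache line */
	uint64_t num_sets;	/* number of sets */
};

/* Set selected by a virtual address in a virtually indexed cache. */
static uint64_t cache_set_index(const struct cache_geom *g, uint64_t vaddr)
{
	assert(g->line_size > 0 && g->num_sets > 0);
	return (vaddr / g->line_size) % g->num_sets;
}

/* True if two virtual aliases of one location pick the same set. */
static bool aliases_share_set(const struct cache_geom *g,
			      uint64_t va1, uint64_t va2)
{
	return cache_set_index(g, va1) == cache_set_index(g, va2);
}
```

With 32-byte lines and 512 sets (16 KiB per way), aliases that are
4 KiB apart pick different sets, while aliases that are a multiple of
16 KiB apart pick the same one.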

> So just for the context, all the metadata (record header) that is
> written/read under lock and with smp_store_release/smp_load_acquire is
> written through the one set of page mappings (the first one). Only
> some of sample payload might go into the second set of mapped pages.
> Does this mean that user-space might read some old payloads in such
> case?

That could happen, depending on which CPU accessed what physical
memory using which virtual address.

> I could work-around that in user-space, by mmaping twice the same
> range, one after the other (second mmap would use MAP_FIXED flag, of
> course). So that's not a big deal.

That would work, assuming you mean to map double the size of memory
and then handle the wraparound case very very carefully.  ;-)

But you need only do that on VI*T systems, if that helps.

> But on the kernel side it's crucial property, because it allows BPF
> programs to work with data with the assumption that all data is
> linearly mapped. If we can't do that, reserve() API is impossible to
> implement. So in that case, I'd rather enable BPF ring buffer only on
> platforms that won't have these problems, instead of removing
> reserve/commit API altogether.

You could flush the local CPU's cache before reading past the end,
but only if it is guaranteed that no other CPU is accessing that same
memory using the other mapping.  (Not convinced that this is feasible,
but who knows?)

I see that linux-arch is copied, so do any of the affected architectures
object to being left out?

> Well, another way is to just "discard" remaining space at the end, if
> it's not sufficient for entire record. That's doable, there will
> always be at least 8 bytes available for record header, so not a
> problem in that regard. But I would appreciate if you can help me
> understand full implications of caching physical memory twice.
> 
> Also just for my education, with VIVT caches, if user-space
> application mmap()'s same region of memory twice (without MAP_FIXED),
> wouldn't that cause similar problems? Can't this happen today with
> mmap() API? Why is that not a problem?

It does indeed affect userspace applications as well.  And I haven't
heard about this being a problem for a very long time, which might be
why I had forgotten about it.

But the underlying problem is that on VIVT and VIPT platforms, mapping
the same physical memory to two different virtual addresses can cause
that same memory to appear twice in the cache, and the resulting pair
of cachelines will not be guaranteed to be in sync with each other.
So CPUs accessing this memory through the two virtual addresses might
see different values.  Which can come as a bit of a surprise to many
algorithms.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14  5:59     ` Andrii Nakryiko
@ 2020-05-14 22:25       ` Alan Maguire
  0 siblings, 0 replies; 32+ messages in thread
From: Alan Maguire @ 2020-05-14 22:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alan Maguire, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Paul E . McKenney, Jonathan Lemon

On Wed, 13 May 2020, Andrii Nakryiko wrote:

> On Wed, May 13, 2020 at 2:59 PM Alan Maguire <alan.maguire@oracle.com> wrote:
> >
> >
> > - attach a kprobe program to record the data via bpf_ringbuf_reserve(),
> >   and store the reserved pointer value in a per-task keyed hashmap.
> >   Then record the values of interest in the reserved space. This is our
> >   speculative data as we don't know whether we want to commit it yet.
> >
> > - attach a kretprobe program that picks up our reserved pointer and
> >   commit()s or discard()s the associated data based on the return value.
> >
> > - the consumer should (I think) then only read the committed data, so in
> >   this case just the data of interest associated with the failure case.
> >
> > I'm curious if that sort of ringbuf access pattern across multiple
> > programs would work? Thanks!
> 
> 
> Right now it's not allowed. Similar to spin lock and socket reference,
> verifier will enforce that reserved record is committed or discarded
> within the same BPF program invocation. Technically, nothing prevents
> us from relaxing this and allowing to store this pointer in a map, but
> that's probably way too dangerous and not necessary for most common
> cases.
> 

Understood.

> But all your troubles with this is due to using a pair of
> kprobe+kretprobe. What I think should solve your problem is a single
> fexit program. It can read input arguments *and* return value of
> traced function. So there won't be any need for additional map and
> storing speculative data (and no speculation as well, because you'll
> just know beforehand if you even need to capture data). Does this work
> for your case?
> 

That would work for that case, absolutely! Thanks!

Alan




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 21:30       ` Andrii Nakryiko
  2020-05-14 22:13         ` Paul E. McKenney
@ 2020-05-14 22:56         ` Alexei Starovoitov
  2020-05-14 23:06           ` Andrii Nakryiko
  1 sibling, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2020-05-14 22:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Thomas Gleixner, Jakub Kicinski, Andrii Nakryiko, linux-arch,
	bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 02:30:11PM -0700, Andrii Nakryiko wrote:
> On Thu, May 14, 2020 at 1:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Jakub Kicinski <kuba@kernel.org> writes:
> >
> > > On Wed, 13 May 2020 12:25:27 -0700 Andrii Nakryiko wrote:
> > >> One interesting implementation bit, that significantly simplifies (and thus
> > >> speeds up as well) implementation of both producers and consumers is how data
> > >> area is mapped twice contiguously back-to-back in the virtual memory. This
> > >> allows to not take any special measures for samples that have to wrap around
> > >> at the end of the circular buffer data area, because the next page after the
> > >> last data page would be first data page again, and thus the sample will still
> > >> appear completely contiguous in virtual memory. See comment and a simple ASCII
> > >> diagram showing this visually in bpf_ringbuf_area_alloc().
> > >
> > > Out of curiosity - is this 100% okay to do in the kernel and user space
> > > these days? Is this bit part of the uAPI in case we need to back out of
> > > it?
> > >
> > > In the olden days virtually mapped/tagged caches could get confused
> > > seeing the same physical memory have two active virtual mappings, or
> > > at least that's what I've been told in school :)
> >
> > Yes, caching the same thing twice causes coherency problems.
> >
> > VIVT can be found in ARMv5, MIPS, NDS32 and Unicore32.
> >
> > > Checking with Paul - he says that could have been the case for Itanium
> > > and PA-RISC CPUs.
> >
> > Itanium: PIPT L1/L2.
> > PA-RISC: VIPT L1 and PIPT L2
> >
> > Thanks,
> >
> 
> Jakub, thanks for bringing this up.
> 
> Thomas, Paul, what kind of problems are we talking about here? What
> are the possible problems in practice?

VIVT cpus will have issues with coherency protocol between cpus.
I don't think it applies to this case.
Here all cpus have the same phys page seen in two virtual pages.
That mapping is the same across all cpus.
But any given range of virtual addresses in these two pages will
be accessed by only one cpu at a time.
At least that's my understanding of Andrii's algorithm.
We probably need to white board the overlapping case a bit more.
Worst case I think it's fine to disallow this new ring buffer
on such architectures. The usability from bpf program side
is too great to give up.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
  2020-05-14 22:56         ` Alexei Starovoitov
@ 2020-05-14 23:06           ` Andrii Nakryiko
  0 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2020-05-14 23:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Jakub Kicinski, Andrii Nakryiko, linux-arch,
	bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Paul E . McKenney, Jonathan Lemon

On Thu, May 14, 2020 at 3:56 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 02:30:11PM -0700, Andrii Nakryiko wrote:
> > On Thu, May 14, 2020 at 1:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Jakub Kicinski <kuba@kernel.org> writes:
> > >
> > > > On Wed, 13 May 2020 12:25:27 -0700 Andrii Nakryiko wrote:
> > > >> One interesting implementation bit, that significantly simplifies (and thus
> > > >> speeds up as well) implementation of both producers and consumers is how data
> > > >> area is mapped twice contiguously back-to-back in the virtual memory. This
> > > >> allows to not take any special measures for samples that have to wrap around
> > > >> at the end of the circular buffer data area, because the next page after the
> > > >> last data page would be first data page again, and thus the sample will still
> > > >> appear completely contiguous in virtual memory. See comment and a simple ASCII
> > > >> diagram showing this visually in bpf_ringbuf_area_alloc().
> > > >
> > > > Out of curiosity - is this 100% okay to do in the kernel and user space
> > > > these days? Is this bit part of the uAPI in case we need to back out of
> > > > it?
> > > >
> > > > In the olden days virtually mapped/tagged caches could get confused
> > > > seeing the same physical memory have two active virtual mappings, or
> > > > at least that's what I've been told in school :)
> > >
> > > Yes, caching the same thing twice causes coherency problems.
> > >
> > > VIVT can be found in ARMv5, MIPS, NDS32 and Unicore32.
> > >
> > > > Checking with Paul - he says that could have been the case for Itanium
> > > > and PA-RISC CPUs.
> > >
> > > Itanium: PIPT L1/L2.
> > > PA-RISC: VIPT L1 and PIPT L2
> > >
> > > Thanks,
> > >
> >
> > Jakub, thanks for bringing this up.
> >
> > Thomas, Paul, what kind of problems are we talking about here? What
> > are the possible problems in practice?
>
> VIVT cpus will have issues with coherency protocol between cpus.
> I don't think it applies to this case.
> Here all cpus we have the same phys page seen in two virtual pages.
> That mapping is the same across all cpus.
> But any given range of virtual addresses in these two pages will
> be accessed by only one cpu at a time.
> At least that's my understanding of Andrii's algorithm.
> We probably need to white board the overlapping case a bit more.
> Worst case I think it's fine to disallow this new ring buffer
> on such architectures. The usability from bpf program side
> is too great to give up.

From what Paul described, I think this will work in any case. Each
byte of a reserved/committed record is going to be both written and
consumed using exactly the same virtual mapping and only that one.
E.g., take a sample starting at the end of the ringbuf and ending at
the beginning: the header and first part will be written and read using
the first set of mapped pages, while the second part will be written and
read using the second set of pages (never the first set of pages). So it
seems like everything should be fine even on VIVT architectures?

More visually, copying diagram from the code:

------------------------------------------------------
| meta pages |     mapping 1     |     mapping 2     |
------------------------------------------------------
|            | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 |
------------------------------------------------------
|            | TA             DA | TA             DA |
------------------------------------------------------
                              ^^^^^^^

DA is always written/read using "mapping 1", while TA is always
written/read through "mapping 2". DA is never accessed through
"mapping 2", nor is TA accessed through "mapping 1".
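
The same split can be written down as arithmetic. This is a small
sketch for a single record only; struct rec_split and split_record()
are made-up names, `off` is the record's start offset within the data
area and `size` is the data area size.

```c
#include <assert.h>
#include <stdint.h>

/* How many bytes of one record are touched through each mapping. */
struct rec_split {
	uint64_t via_mapping1;	/* bytes before the end of the data area */
	uint64_t via_mapping2;	/* bytes that spill into the second mapping */
};

static struct rec_split split_record(uint64_t off, uint64_t len, uint64_t size)
{
	struct rec_split s;

	assert(off < size && len <= size);
	/* The part up to the end of the data area goes through mapping 1... */
	s.via_mapping1 = len < size - off ? len : size - off;
	/* ...and whatever is left continues linearly into mapping 2. */
	s.via_mapping2 = len - s.via_mapping1;
	return s;
}
```

For a 200-byte record at offset 4000 in a 4096-byte data area this
gives 96 bytes through mapping 1 and 104 bytes through mapping 2; a
record that does not reach the end never touches mapping 2.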

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2020-05-14 23:06 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
2020-05-13 20:57   ` kbuild test robot
2020-05-13 21:58   ` Alan Maguire
2020-05-14  5:59     ` Andrii Nakryiko
2020-05-14 22:25       ` Alan Maguire
2020-05-13 22:16   ` kbuild test robot
2020-05-14 16:50   ` Jonathan Lemon
2020-05-14 20:11     ` Andrii Nakryiko
2020-05-14 17:33   ` sdf
2020-05-14 20:18     ` Andrii Nakryiko
2020-05-14 20:53       ` sdf
2020-05-14 21:13         ` Andrii Nakryiko
2020-05-14 21:56           ` Stanislav Fomichev
2020-05-14 19:06   ` Alexei Starovoitov
2020-05-14 20:49     ` Andrii Nakryiko
2020-05-14 19:18   ` Jakub Kicinski
2020-05-14 19:18     ` Jakub Kicinski
2020-05-14 20:39     ` Thomas Gleixner
2020-05-14 21:30       ` Andrii Nakryiko
2020-05-14 22:13         ` Paul E. McKenney
2020-05-14 22:56         ` Alexei Starovoitov
2020-05-14 23:06           ` Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 2/6] tools/memory-model: add BPF ringbuf MPSC litmus tests Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 3/6] bpf: track reference type in verifier Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 4/6] libbpf: add BPF ring buffer support Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 5/6] selftests/bpf: add BPF ringbuf selftests Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 6/6] bpf: add BPF ringbuf and perf buffer benchmarks Andrii Nakryiko
2020-05-13 22:49 ` [PATCH bpf-next 0/6] BPF ring buffer Jonathan Lemon
2020-05-14  6:08   ` Andrii Nakryiko
2020-05-14 16:30     ` Jonathan Lemon
2020-05-14 20:06       ` Andrii Nakryiko
