* BPF ring buffer variable-length data appending
@ 2021-01-07 19:48 Andrii Nakryiko
  2021-01-07 22:55 ` Alan Maguire
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Andrii Nakryiko @ 2021-01-07 19:48 UTC (permalink / raw)
  To: Brendan Jackman, KP Singh, bpf; +Cc: Alexei Starovoitov, Daniel Borkmann

We discussed this topic today at office hour. As I mentioned, I don't
know the ideal solution, but here is something that has enough
flexibility for real-world uses, while giving the performance and
convenience of reserve/commit API. Ignore naming, we can bikeshed that
later.

So what we can do is introduce a new bpf_ringbuf_reserve() variant:

bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
*extra, __u64 extra_sz);

The idea is that we reserve a fixed size amount of data that can be
used like it is today for filling a fixed-sized metadata/sample
directly. But the real size of the reserved sample is (size +
extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
(kernel or user, depending on flags) data from extra and put it right
after the fixed-size part.

So the use would be something like:

struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
BPF_RB_PROBE_USER, env_vars, 1024);

if (!m)
    /* too bad, either probe_read_user failed or ringbuf is full */
    return 1;

m->my_field1 = 123;
m->my_field2 = 321;


So the main problem with this is that when probe_read fails, we fail
reservation completely (internally we'd just discard ringbuf sample).
Is that OK? Or is it better to still reserve fixed-sized part and
zero-out the variable-length part? We are combining two separate
operations into a single API, so error handling is more convoluted.


Now, the main use case requested was to be able to fetch an array of
zero-terminated strings. I honestly don't think it's possible to
implement this efficiently without two copies of string data. Mostly
because to just determine the size of the string you have to read it
one extra time. And you'd probably want to copy string data into some
controlled storage first, so that you don't end up reading it once
successfully and then failing to read it on the second try. Next, when
you have multiple strings, how do you deal with partial failures? It's
even worse in terms of error handling and error propagation than the
fixed extra size variant described above.

Ignoring all that, let's say we'd implement
bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
multiple zero-terminated strings after the fixed-size prefix. Think
about implementation. Just to determine the total size of the ringbuf
sample, you'd need to read all strings once, and probably also copy
them locally.  Then you'd reserve a ringbuf sample and copy all that
for the second time. So it's as inefficient as a BPF program
constructing a single block of memory by reading all such strings
manually into a per-CPU array and then using the above
bpf_ringbuf_reserve_extra().

But offloading that preparation to a BPF program bypasses all these
error handling and memory layout questions. It will be up to a BPF
program itself. From a kernel perspective, we just append a block of
memory with known (at runtime) size.
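
E.g., the whole flow could look roughly like this (a sketch only:
bpf_ringbuf_reserve_extra() is just the proposal above, and the scratch
map layout, sizes, and str1/str2 source pointers are made up for
illustration, with rb and struct my_meta as in the earlier example):

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, char[1024]);
} scratch SEC(".maps");

static __always_inline int emit_two_strs(const void *str1, const void *str2)
{
    __u32 zero = 0;
    char *buf = bpf_map_lookup_elem(&scratch, &zero);
    __u64 off;
    long len;

    if (!buf)
        return 0;

    /* BPF program reads strings itself and handles partial failures */
    len = bpf_probe_read_user_str(buf, 512, str1);
    if (len <= 0)
        return 0;
    off = len;
    /* off is in [1, 512], so verifier can prove buf + off + 512 <= 1024 */
    len = bpf_probe_read_user_str(buf + off, 512, str2);
    if (len <= 0)
        return 0;
    off += len;

    /* kernel just appends one contiguous block of runtime-known size */
    struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m), 0, buf, off);
    if (!m)
        return 1;
    bpf_ringbuf_submit(m, 0);
    return 0;
}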

As a more restricted version of bpf_ringbuf_reserve_extra(), instead
of allowing reading arbitrary kernel or user-space memory in
bpf_ringbuf_reserve_extra() we can say that it has to be known and
initialized memory (like MAP_VALUE pointer), so helper knows that it
can just copy data directly.

Thoughts?

-- Andrii


* Re: BPF ring buffer variable-length data appending
  2021-01-07 19:48 BPF ring buffer variable-length data appending Andrii Nakryiko
@ 2021-01-07 22:55 ` Alan Maguire
  2021-01-07 23:26   ` Andrii Nakryiko
  2021-01-08  5:43 ` John Fastabend
  2021-01-08 13:05 ` Brendan Jackman
  2 siblings, 1 reply; 8+ messages in thread
From: Alan Maguire @ 2021-01-07 22:55 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Brendan Jackman, KP Singh, bpf, Alexei Starovoitov, Daniel Borkmann

On Thu, 7 Jan 2021, Andrii Nakryiko wrote:

> We discussed this topic today at office hour. As I mentioned, I don't
> know the ideal solution, but here is something that has enough
> flexibility for real-world uses, while giving the performance and
> convenience of reserve/commit API. Ignore naming, we can bikeshed that
> later.
> 
> So what we can do is introduce a new bpf_ringbuf_reserve() variant:
> 
> bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> *extra, __u64 extra_sz);
> 
> The idea is that we reserve a fixed size amount of data that can be
> used like it is today for filling a fixed-sized metadata/sample
> directly. But the real size of the reserved sample is (size +
> extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> (kernel or user, depending on flags) data from extra and put it right
> after the fixed-size part.
> 
> So the use would be something like:
> 
> struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> BPF_RB_PROBE_USER, env_vars, 1024);
> 
> if (!m)
>     /* too bad, either probe_read_user failed or ringbuf is full */
>     return 1;
> 
> m->my_field1 = 123;
> m->my_field2 = 321;
> 
> 
> So the main problem with this is that when probe_read fails, we fail
> reservation completely (internally we'd just discard ringbuf sample).
> Is that OK? Or is it better to still reserve fixed-sized part and
> zero-out the variable-length part? We are combining two separate
> operations into a single API, so error handling is more convoluted.
> 
> 
> Now, the main use case requested was to be able to fetch an array of
> zero-terminated strings. I honestly don't think it's possible to
> implement this efficiently without two copies of string data. Mostly
> because to just determine the size of the string you have to read it
> one extra time. And you'd probably want to copy string data into some
> controlled storage first, so that you don't end up reading it once
> successfully and then failing to read it on the second try. Next, when
> you have multiple strings, how do you deal with partial failures? It's
> even worse in terms of error handling and error propagation than the
> fixed extra size variant described above.
> 
> Ignoring all that, let's say we'd implement
> bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
> multiple zero-terminated strings after the fixed-size prefix. Think
> about implementation. Just to determine the total size of the ringbuf
> sample, you'd need to read all strings once, and probably also copy
> them locally.  Then you'd reserve a ringbuf sample and copy all that
> for the second time. So it's as inefficient as a BPF program
> constructing a single block of memory by reading all such strings
> manually into a per-CPU array and then using the above
> bpf_ringbuf_reserve_extra().
> 

I ran into a variation of this problem with the ksnoop tool [1]. I'd hoped
to use ringbuf, but the problem I had was I needed to store a series of N
strings into a buffer, and for each I needed to specify a maximum size (for
bpf_snprintf_btf()).  However, it was entirely possible that the
strings would be a lot smaller, and they'd be copied one after the other.
So while I needed to reserve a buffer of (N * MAX_STRINGSIZE) bytes for
those N strings as the worst-case scenario, the actual data would likely
be a lot smaller (the sum of the lengths of the N strings plus null
termination), so there was no need to commit the unused space.  I ended up
using a BPF map-derived string buffer and perf events to send the events
instead (the code for this is ksnoop.bpf.c in [1]).
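
Roughly the shape of it (a simplified sketch, not the actual ksnoop
code; the str_buf/events map names and the btf_ptr setup are made up
or elided):

char *buf = bpf_map_lookup_elem(&str_buf, &zero); /* per-CPU scratch */
long len, used = 0;

if (!buf)
    return 0;

/* each string may end up far shorter than MAX_STRINGSIZE... */
len = bpf_snprintf_btf(buf, MAX_STRINGSIZE, &btf_ptr, sizeof(btf_ptr), 0);
if (len > 0 && len <= MAX_STRINGSIZE)
    used += len;
/* ...repeat for the remaining strings, then send only 'used' bytes,
 * not the worst-case N * MAX_STRINGSIZE */
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, used);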

So all of this is to say that I'm assuming along with the reserve_extra() 
API, there'd need to be some sort of bpf_ringbuf_submit_extra(ringbuf, 
flags, extra_size) which specifies how much of the extra space was used? 
If that's the case I think this approach makes ringbuf usable for my 
scenario; the string buffer would effectively all be considered extra 
space, and we'd just submit what was used.

However I _think_ you were suggesting above combining the probe read and 
the extra reservation as one operation; in my case that wouldn't work 
because the strings were written directly by a helper (bpf_snprintf_btf())
into the target buffer. It's probably an oddball situation of course, but 
thought I'd better mention it just in case. Thanks!

Alan

[1] https://lore.kernel.org/bpf/1609773991-10509-1-git-send-email-alan.maguire@oracle.com/


* Re: BPF ring buffer variable-length data appending
  2021-01-07 22:55 ` Alan Maguire
@ 2021-01-07 23:26   ` Andrii Nakryiko
  0 siblings, 0 replies; 8+ messages in thread
From: Andrii Nakryiko @ 2021-01-07 23:26 UTC (permalink / raw)
  To: Alan Maguire
  Cc: Brendan Jackman, KP Singh, bpf, Alexei Starovoitov, Daniel Borkmann

On Thu, Jan 7, 2021 at 2:55 PM Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On Thu, 7 Jan 2021, Andrii Nakryiko wrote:
>
> > We discussed this topic today at office hour. As I mentioned, I don't
> > know the ideal solution, but here is something that has enough
> > flexibility for real-world uses, while giving the performance and
> > convenience of reserve/commit API. Ignore naming, we can bikeshed that
> > later.
> >
> > So what we can do is introduce a new bpf_ringbuf_reserve() variant:
> >
> > bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> > *extra, __u64 extra_sz);
> >
> > The idea is that we reserve a fixed size amount of data that can be
> > used like it is today for filling a fixed-sized metadata/sample
> > directly. But the real size of the reserved sample is (size +
> > extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> > (kernel or user, depending on flags) data from extra and put it right
> > after the fixed-size part.
> >
> > So the use would be something like:
> >
> > struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> > BPF_RB_PROBE_USER, env_vars, 1024);
> >
> > if (!m)
> >     /* too bad, either probe_read_user failed or ringbuf is full */
> >     return 1;
> >
> > m->my_field1 = 123;
> > m->my_field2 = 321;
> >
> >
> > So the main problem with this is that when probe_read fails, we fail
> > reservation completely (internally we'd just discard ringbuf sample).
> > Is that OK? Or is it better to still reserve fixed-sized part and
> > zero-out the variable-length part? We are combining two separate
> > operations into a single API, so error handling is more convoluted.
> >
> >
> > Now, the main use case requested was to be able to fetch an array of
> > zero-terminated strings. I honestly don't think it's possible to
> > implement this efficiently without two copies of string data. Mostly
> > because to just determine the size of the string you have to read it
> > one extra time. And you'd probably want to copy string data into some
> > controlled storage first, so that you don't end up reading it once
> > successfully and then failing to read it on the second try. Next, when
> > you have multiple strings, how do you deal with partial failures? It's
> > even worse in terms of error handling and error propagation than the
> > fixed extra size variant described above.
> >
> > Ignoring all that, let's say we'd implement
> > bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
> > multiple zero-terminated strings after the fixed-size prefix. Think
> > about implementation. Just to determine the total size of the ringbuf
> > sample, you'd need to read all strings once, and probably also copy
> > them locally.  Then you'd reserve a ringbuf sample and copy all that
> > for the second time. So it's as inefficient as a BPF program
> > constructing a single block of memory by reading all such strings
> > manually into a per-CPU array and then using the above
> > bpf_ringbuf_reserve_extra().
> >
>
> I ran into a variation of this problem with the ksnoop tool [1]. I'd hoped
> to use ringbuf, but the problem I had was I needed to store a series of N
> strings into a buffer, and for each I needed to specify a maximum size (for
> bpf_snprintf_btf()).  However, it was entirely possible that the
> strings would be a lot smaller, and they'd be copied one after the other.
> So while I needed to reserve a buffer of (N * MAX_STRINGSIZE) bytes for
> those N strings as the worst-case scenario, the actual data would likely
> be a lot smaller (the sum of the lengths of the N strings plus null
> termination), so there was no need to commit the unused space.  I ended up
> using a BPF map-derived string buffer and perf events to send the events
> instead (the code for this is ksnoop.bpf.c in [1]).

Yes, you could have used ringbuf as well, just replace
bpf_perf_event_output() with bpf_ringbuf_output().
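
I.e., keeping the string accumulation exactly the same and just
swapping the output call at the end. A sketch, where buf is the
accumulated string buffer and used is its filled length (names made
up):

bpf_ringbuf_output(&rb, buf, used, 0);

instead of:

bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, used);

Like bpf_perf_event_output(), bpf_ringbuf_output() copies the data, so
the size can be computed at runtime (as long as the verifier can bound
it).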

>
> So all of this is to say that I'm assuming along with the reserve_extra()
> API, there'd need to be some sort of bpf_ringbuf_submit_extra(ringbuf,
> flags, extra_size) which specifies how much of the extra space was used?
> If that's the case I think this approach makes ringbuf usable for my
> scenario; the string buffer would effectively all be considered extra
> space, and we'd just submit what was used.

Not really, there is no need for bpf_ringbuf_submit_extra(), normal
submit will do. See the definition of bpf_ringbuf_reserve_extra(): it
already expects the caller to specify how many extra (extra_sz) bytes
to append after the fixed-size part. Why would it return this again to
you?

>
> However I _think_ you were suggesting above combining the probe read and
> the extra reservation as one operation; in my case that wouldn't work
> because the strings were written directly by a helper (bpf_snprintf_btf())
> into the target buffer. It's probably an oddball situation of course, but
> thought I'd better mention it just in case. Thanks!

Not sure why that wouldn't work. You'd use your struct trace's buffer
and the correct size of the accumulated strings as the last two params to
bpf_ringbuf_reserve_extra(). I have a feeling that we are not on the
same page yet :)

>
> Alan
>
> [1] https://lore.kernel.org/bpf/1609773991-10509-1-git-send-email-alan.maguire@oracle.com/


* RE: BPF ring buffer variable-length data appending
  2021-01-07 19:48 BPF ring buffer variable-length data appending Andrii Nakryiko
  2021-01-07 22:55 ` Alan Maguire
@ 2021-01-08  5:43 ` John Fastabend
  2021-01-08 20:02   ` Andrii Nakryiko
  2021-01-08 13:05 ` Brendan Jackman
  2 siblings, 1 reply; 8+ messages in thread
From: John Fastabend @ 2021-01-08  5:43 UTC (permalink / raw)
  To: Andrii Nakryiko, Brendan Jackman, KP Singh, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann

Andrii Nakryiko wrote:
> We discussed this topic today at office hour. As I mentioned, I don't
> know the ideal solution, but here is something that has enough
> flexibility for real-world uses, while giving the performance and
> convenience of reserve/commit API. Ignore naming, we can bikeshed that
> later.

Missed office hours today, dang, sounds interesting. Apologies if I'm
missing some context.

> 
> So what we can do is introduce a new bpf_ringbuf_reserve() variant:
> 
> bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> *extra, __u64 extra_sz);
> 
> The idea is that we reserve a fixed size amount of data that can be
> used like it is today for filling a fixed-sized metadata/sample
> directly. But the real size of the reserved sample is (size +
> extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> (kernel or user, depending on flags) data from extra and put it right
> after the fixed-size part.
> 
> So the use would be something like:
> 
> struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> BPF_RB_PROBE_USER, env_vars, 1024);
> 
> if (!m)
>     /* too bad, either probe_read_user failed or ringbuf is full */
>     return 1;
> 
> m->my_field1 = 123;
> m->my_field2 = 321;
> 

Is this a way to increase the reserved space? I'm a bit fuzzy on the
details here. But what happens if something else has done another
reserve? Then it fails I guess?

So consider,

CPU0                   CPU1

bpf_ringbuf_reserve()
                       bpf_ringbuf_reserve()
bpf_ringbuf_reserve_extra()

Does that *_reserve_extra() fail then? If so it seems very limited
from a use perspective. Most of the systems we work with will fail more
often than not I think.

If the above doesn't fail, I'm missing something about how userspace
can know where that buffer is without a scatter-gather list.

> 
> So the main problem with this is that when probe_read fails, we fail
> reservation completely(internally we'd just discard ringbuf sample).
> Is that OK? Or is it better to still reserve fixed-sized part and
> zero-out the variable-length part? We are combining two separate
> operations into a single API, so error handling is more convoluted.

My $.02 here. Failing is going to be ugly and a real pain to deal
with. I think the best approach is to reserve a fixed-size buffer that
meets your 99% case or whatever. Then reserve some overflow buffers
you can point to for the oddball java application with a million
strings. Yes you need to get more complicated in userspace to manage
the thing, but once that code's written everything works out.

Also I think we keep the kernel simpler if the BPF program just
does another reserve() if it needs more space so,

 bpf_ringbuf_reserve()
 copy
 copy
 ENOMEM <- buff is full,
 bpf_ringbuf_reserve()
 copy
 copy
 ....

Again userspace needs some logic to join the two buffers but we
could come up with some user side convention to do this. libbpf
for example could have a small buffer header to do this if folks
wanted.  Smart BPF programs can even reserve a couple buffers
up front for the worst case and recycle them back into their
next invocation, I think.
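
Strictly for illustration, the convention could be as simple as this
(a sketch; the header layout is made up, not a real API, and CHUNK_SZ,
id and n_chunks are placeholders):

struct chunk_hdr {
    __u32 msg_id; /* e.g. cpu_id plus a per-CPU sequence number */
    __u16 idx;    /* which chunk this is */
    __u16 total;  /* how many chunks make up the full message */
};

struct chunk_hdr *c = bpf_ringbuf_reserve(&rb, CHUNK_SZ, 0);
if (!c)
    return 0; /* drop or fall back */
c->msg_id = id;
c->idx = 0;
c->total = n_chunks;
/* fill up to CHUNK_SZ - sizeof(*c) bytes of payload after the header */
bpf_ringbuf_submit(c, 0);
/* ...reserve and fill the remaining chunks the same way; userspace
 * joins chunks sharing msg_id using idx/total */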

Conceptually, what is the difference between a second reserve
as compared to reserve_extra()?

> 
> 
> Now, the main use case requested was to be able to fetch an array of
> zero-terminated strings. I honestly don't think it's possible to
> implement this efficiently without two copies of string data. Mostly
> because to just determine the size of the string you have to read it
> one extra time. And you'd probably want to copy string data into some
> controlled storage first, so that you don't end up reading it once
> successfully and then failing to read it on the second try. Next, when
> you have multiple strings, how do you deal with partial failures? It's
> even worse in terms of error handling and error propagation than the
> fixed extra size variant described above.

+1 agree

> 
> Ignoring all that, let's say we'd implement
> bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
> multiple zero-terminated strings after the fixed-size prefix. Think
> about implementation. Just to determine the total size of the ringbuf
> sample, you'd need to read all strings once, and probably also copy
> them locally.  Then you'd reserve a ringbuf sample and copy all that
> for the second time. So it's as inefficient as a BPF program
> constructing a single block of memory by reading all such strings
> manually into a per-CPU array and then using the above
> bpf_ringbuf_reserve_extra().

+1

> 
> But offloading that preparation to a BPF program bypasses all these
> error handling and memory layout questions. It will be up to a BPF
> program itself. From a kernel perspective, we just append a block of
> memory with known (at runtime) size.

Still missing how this would be different from multiple reserve()
calls. It's not too hard to join user space buffers, I promise ;)

> 
> As a more restricted version of bpf_ringbuf_reserve_extra(), instead
> of allowing reading arbitrary kernel or user-space memory in
> bpf_ringbuf_reserve_extra() we can say that it has to be known and
> initialized memory (like MAP_VALUE pointer), so helper knows that it
> can just copy data directly.

This is a fairly common operation for us, but also just chunks of a map
value pointer. So we would want start/end byte offsets. Often our
map values have extra data that user space doesn't need or care about.

> 
> Thoughts?

I think I missed the point.

> 
> -- Andrii


* Re: BPF ring buffer variable-length data appending
  2021-01-07 19:48 BPF ring buffer variable-length data appending Andrii Nakryiko
  2021-01-07 22:55 ` Alan Maguire
  2021-01-08  5:43 ` John Fastabend
@ 2021-01-08 13:05 ` Brendan Jackman
  2021-01-08 20:14   ` Andrii Nakryiko
  2 siblings, 1 reply; 8+ messages in thread
From: Brendan Jackman @ 2021-01-08 13:05 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: KP Singh, bpf, Alexei Starovoitov, Daniel Borkmann

Hi Andrii,

I should preface by saying I don't yet truly understand why
variable-length reservations are difficult in the first place. With
that caveat in place...

On Thu, 7 Jan 2021 at 20:48, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> We discussed this topic today at office hour. As I mentioned, I don't
> know the ideal solution, but here is something that has enough
> flexibility for real-world uses, while giving the performance and
> convenience of reserve/commit API. Ignore naming, we can bikeshed that
> later.
>
> So what we can do is introduce a new bpf_ringbuf_reserve() variant:
>
> bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> *extra, __u64 extra_sz);
>
> The idea is that we reserve a fixed size amount of data that can be
> used like it is today for filling a fixed-sized metadata/sample
> directly. But the real size of the reserved sample is (size +
> extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> (kernel or user, depending on flags) data from extra and put it right
> after the fixed-size part.
>
> So the use would be something like:
>
> struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> BPF_RB_PROBE_USER, env_vars, 1024);
>
> if (!m)
>     /* too bad, either probe_read_user failed or ringbuf is full */
>     return 1;
>
> m->my_field1 = 123;
> m->my_field2 = 321;

This seems useful, although it seems we would then also want a version
that did probe_read_{user,kernel}_str as well...

> So the main problem with this is that when probe_read fails, we fail
> reservation completely (internally we'd just discard ringbuf sample).
> Is that OK? Or is it better to still reserve fixed-sized part and
> zero-out the variable-length part? We are combining two separate
> operations into a single API, so error handling is more convoluted.

I think the correct answer here is "we don't know", and the natural
response is to then let the user decide. However then we already have
at least two or three dimensions (user/kernel, error behaviour, _str
or runtime-fixed size...) which feels like a "design smell" to me.

> Now, the main use case requested was to be able to fetch an array of
> zero-terminated strings. I honestly don't think it's possible to
> implement this efficiently without two copies of string data. Mostly
> because to just determine the size of the string you have to read it
> one extra time. And you'd probably want to copy string data into some
> controlled storage first, so that you don't end up reading it once
> successfully and then failing to read it on the second try. Next, when
> you have multiple strings, how do you deal with partial failures? It's
> even worse in terms of error handling and error propagation than the
> fixed extra size variant described above.
>
> Ignoring all that, let's say we'd implement
> bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
> multiple zero-terminated strings after the fixed-size prefix. Think
> about implementation. Just to determine the total size of the ringbuf
> sample, you'd need to read all strings once, and probably also copy
> them locally.  Then you'd reserve a ringbuf sample and copy all that
> for the second time. So it's as inefficient as a BPF program
> constructing a single block of memory by reading all such strings
> manually into a per-CPU array and then using the above
> bpf_ringbuf_reserve_extra().
>
> But offloading that preparation to a BPF program bypasses all these
> error handling and memory layout questions. It will be up to a BPF
> program itself. From a kernel perspective, we just append a block of
> memory with known (at runtime) size.

I agree, I think bpf_ringbuf_reserve_extra_strs would be unnecessarily
complex, especially if we have what I suggested above which would
probably be called bpf_ringbuf_reserve_extra_str.

> As a more restricted version of bpf_ringbuf_reserve_extra(), instead
> of allowing reading arbitrary kernel or user-space memory in
> bpf_ringbuf_reserve_extra() we can say that it has to be known and
> initialized memory (like MAP_VALUE pointer), so helper knows that it
> can just copy data directly.

This is just some preliminary input but I need to do some reading and
think more deeply about this.

Another dimension to think about is that it would be great to be able
to go directly from a helper like bpf_get_cwd into the ringbuffer with
neither an intermediate copy nor a reservation of PATH_MAX.

On the other hand, as you pointed out in the call, reserving PATH_MAX
may not be as bad as it sounds since you still only copy the actual
string length. I'm planning to do some benchmarking next week to
investigate that.

Thanks,
Brendan


* Re: BPF ring buffer variable-length data appending
  2021-01-08  5:43 ` John Fastabend
@ 2021-01-08 20:02   ` Andrii Nakryiko
  2021-01-10  0:57     ` John Fastabend
  0 siblings, 1 reply; 8+ messages in thread
From: Andrii Nakryiko @ 2021-01-08 20:02 UTC (permalink / raw)
  To: John Fastabend
  Cc: Brendan Jackman, KP Singh, bpf, Alexei Starovoitov, Daniel Borkmann

On Thu, Jan 7, 2021 at 9:44 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > We discussed this topic today at office hour. As I mentioned, I don't
> > know the ideal solution, but here is something that has enough
> > flexibility for real-world uses, while giving the performance and
> > convenience of reserve/commit API. Ignore naming, we can bikeshed that
> > later.
>
> Missed office hours today, dang, sounds interesting. Apologies if I'm
> missing some context.
>
> >
> > So what we can do is introduce a new bpf_ringbuf_reserve() variant:
> >
> > bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> > *extra, __u64 extra_sz);
> >
> > The idea is that we reserve a fixed size amount of data that can be
> > used like it is today for filling a fixed-sized metadata/sample
> > directly. But the real size of the reserved sample is (size +
> > extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> > (kernel or user, depending on flags) data from extra and put it right
> > after the fixed-size part.
> >
> > So the use would be something like:
> >
> > struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> > BPF_RB_PROBE_USER, env_vars, 1024);
> >
> > if (!m)
> >     /* too bad, either probe_read_user failed or ringbuf is full */
> >     return 1;
> >
> > m->my_field1 = 123;
> > m->my_field2 = 321;
> >
>
> Is this a way to increase the reserved space? I'm a bit fuzzy on the
> details here. But what happens if something else has done another
> reserve? Then it fails I guess?

No, you misunderstood. There is no way to safely increase the size of
already reserved ringbuf record. Naming is hard.
bpf_ringbuf_reserve_extra() reserves the correct size of the record
from the very beginning. It's just that size is (size + extra_sz). The
difference from non-extra() variant is that size is still restricted
to be a known constant, so that verifier can verify safe direct memory
reads and writes within first *size* bytes. But extra_sz can be
unknown at the verification time (so it's ARG_ANYTHING), so it can be
calculated dynamically. That's most of the difference. The other part
is that because BPF program can't do much with this dynamic part of
ringbuf record, we also immediately copy provided memory for later
submission to the user space.
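
In kernel terms, the helper proto would look roughly like this (a
hypothetical sketch that just mirrors the existing
bpf_ringbuf_reserve_proto with two extra args):

static const struct bpf_func_proto bpf_ringbuf_reserve_extra_proto = {
    .func       = bpf_ringbuf_reserve_extra,
    .ret_type   = RET_PTR_TO_ALLOC_MEM_OR_NULL,
    .arg1_type  = ARG_CONST_MAP_PTR,
    .arg2_type  = ARG_CONST_ALLOC_SIZE_OR_ZERO, /* size: verifier-known */
    .arg3_type  = ARG_ANYTHING,                 /* flags */
    .arg4_type  = ARG_ANYTHING,                 /* extra */
    .arg5_type  = ARG_ANYTHING,                 /* extra_sz: runtime value */
};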

>
> So consider,
>
> CPU0                   CPU1
>
> bpf_ringbuf_reserve()
>                        bpf_ringbuf_reserve()
> bpf_ringbuf_reserve_extra()
>
> Does that *_reserve_extra() fail then? If so it seems very limited
> from a use perspective. Most of the systems we work with will fail more
> often than not I think.
>
> If the above doesn't fail, I'm missing something about how userspace
> can know where that buffer is without a scatter-gather list.
>
> >
> > So the main problem with this is that when probe_read fails, we fail
> > reservation completely (internally we'd just discard ringbuf sample).
> > Is that OK? Or is it better to still reserve fixed-sized part and
> > zero-out the variable-length part? We are combining two separate
> > operations into a single API, so error handling is more convoluted.
>
> My $.02 here. Failing is going to be ugly and a real pain to deal

If data copying failed, you are not getting the data you expected. So
failing seems reasonable in such a case. The convoluted and unfortunate
part is that you don't know whether the ringbuf ran out of free
space or the memory copy failed. If we restrict the extra_data pointer to
be known good memory (like sk_buff, map_value, etc), then this
concern goes away. But you also won't be able to use this directly to
read some piece of generic kernel or user memory without an extra
(explicit) bpf_probe_read(). Which might be acceptable, I think.

> with. I think the best approach is to reserve a fixed-size buffer that
> meets your 99% case or whatever. Then reserve some overflow buffers
> you can point to for the oddball java application with a million
> strings. Yes you need to get more complicated in userspace to manage
> the thing, but once that code's written everything works out.
>
> Also I think we keep the kernel simpler if the BPF program just
> does another reserve() if it needs more space so,
>
>  bpf_ringbuf_reserve()
>  copy
>  copy
>  ENOMEM <- buff is full,
>  bpf_ringbuf_reserve()
>  copy
>  copy
>  ....
>
> Again userspace needs some logic to join the two buffers but we
> could come up with some user side convention to do this. libbpf
> for example could have a small buffer header to do this if folks
> wanted.  Smart BPF programs can even reserve a couple buffers
> up front for the worst case and recycle them back into their
> next invocation, I think.

First, I don't think libbpf should do anything extra here. It's just
bound to be suboptimal and cumbersome.

But yes, this is another approach, though arguably quite complicated.
Chaining buffers can be done simply by using cpu_id and recording the
total number of expected chunks in the very first chunk. You can
submit the first chunk last, if you use reserve/commit API.
Reconstruction in user-space is going to require memory allocations
and copying, though.
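
On the user-space side, that reconstruction would look something like
this (a sketch using libbpf's ring buffer consumer; the chunk header
layout is assumed to come from a header shared with the BPF side):

static int handle_chunk(void *ctx, void *data, size_t size)
{
    struct chunk_hdr *c = data;

    /* stash the payload keyed by c->msg_id; once all c->total chunks
     * have arrived, allocate one buffer and memcpy them in idx order,
     * which is exactly the extra allocation and copying mentioned
     * above */
    return 0;
}

struct ring_buffer *rb = ring_buffer__new(map_fd, handle_chunk, NULL, NULL);
while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
    ;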

>
> Conceptually, what is the difference between a second reserve
> as compared to reserve_extra()?
>
> >
> >
> > Now, the main use case requested was to be able to fetch an array of
> > zero-terminated strings. I honestly don't think it's possible to
> > implement this efficiently without two copies of string data. Mostly
> > because to just determine the size of the string you have to read it
> > one extra time. And you'd probably want to copy string data into some
> > controlled storage first, so that you don't end up reading it once
> > successfully and then failing to read it on the second try. Next, when
> > you have multiple strings, how do you deal with partial failures? It's
> > even worse in terms of error handling and error propagation than the
> > fixed extra size variant described above.
>
> +1 agree
>
> >
> > Ignoring all that, let's say we'd implement
> > bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
> > multiple zero-terminated strings after the fixed-size prefix. Think
> > about implementation. Just to determine the total size of the ringbuf
> > sample, you'd need to read all strings once, and probably also copy
> > them locally.  Then you'd reserve a ringbuf sample and copy all that
> > for the second time. So it's as inefficient as a BPF program
> > constructing a single block of memory by reading all such strings
> > manually into a per-CPU array and then using the above
> > bpf_ringbuf_reserve_extra().
>
> +1
>
> >
> > But offloading that preparation to a BPF program bypasses all these
> > error handling and memory layout questions. It will be up to a BPF
> > program itself. From a kernel perspective, we just append a block of
> > memory with known (at runtime) size.
>
> Still missing how this would be different from multiple reserve()
> calls. It's not too hard to join user space buffers, I promise ;)

It's not trivial, and it's also probably less efficient, as now you need
to dynamically allocate memory and still do an extra copy. So I can
certainly see the appeal of being able to submit one whole record,
instead of trying to re-construct intermingled chunks across multiple
CPUs.

Anyways, I'm not advocating one or the other approach, both have the
right to exist, IMO. There is no free lunch, unfortunately. Either way
complexity and extra overhead creeps in.

>
> >
> > As a more restricted version of bpf_ringbuf_reserve_extra(), instead
> > of allowing reading arbitrary kernel or user-space memory in
> > bpf_ringbuf_reserve_extra() we can say that it has to be known and
> > initialized memory (like MAP_VALUE pointer), so helper knows that it
> > can just copy data directly.
>
> This is a fairly common operation for us, but also just chunks of a map
> value pointer. So we would want start/end byte offsets. Often our
> map values have extra data that user space doesn't need or care about.

Huh? It's just a pointer, you can point inside the MAP_VALUE today,
no? And you already have extra_sz. So this should work:

struct my_data *d = bpf_map_lookup_elem(&my_map, &my_key);
if (!d) return 0;

bpf_ringbuf_reserve_extra(&rb, 100, 0, (void *)d + 100, 100);

And you'll be copying the [100, 200) byte range into the ringbuf. No?

>
> >
> > Thoughts?
>
> I think I missed the point.

I think you misunderstood what bpf_ringbuf_reserve_extra() is going to
do. And we might differ in evaluating how easy it is to handle
chunking, both in kernel and user-space :)

>
> >
> > -- Andrii


* Re: BPF ring buffer variable-length data appending
  2021-01-08 13:05 ` Brendan Jackman
@ 2021-01-08 20:14   ` Andrii Nakryiko
  0 siblings, 0 replies; 8+ messages in thread
From: Andrii Nakryiko @ 2021-01-08 20:14 UTC (permalink / raw)
  To: Brendan Jackman; +Cc: KP Singh, bpf, Alexei Starovoitov, Daniel Borkmann

On Fri, Jan 8, 2021 at 5:05 AM Brendan Jackman <jackmanb@google.com> wrote:
>
> Hi Andrii,
>
> I should preface by saying I don't yet truly understand why
> variable-length reservations are difficult in the first place. With
> that caveat in place...

Well, there are two aspects. BPF ringbuf doesn't allow re-reserving
already reserved space, because in general that's impossible. So you
need to provide the desired length upfront, which means you need to
calculate all the strlen()s first to know the exact size. But even
assuming that's done, it's still not possible because of the BPF
verifier.

It's impossible for the BPF verifier to guarantee safe memory accesses
if the size of the ringbuf record is not known statically at
verification time. Imagine something like this:

int sz = bpf_get_prandom_u32() % 100;
void *data = bpf_ringbuf_reserve(&rb, sz, 0);

The verifier is smart enough to deduce that sz is in [0, 100), right?
And so what? We don't know whether it's actually 0, 50, or 99. The
value is known only at runtime. So allowing access to even the very
first byte of the record is unsafe, because if sz turns out to be 0,
you are already reading/writing the wrong memory.

The verifier would need to completely change how it does the register
state tracking to allow something like this. It would need to properly
track relationships between registers, not just known scalar ranges.
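
For contrast, the pattern that does verify today is a constant-size
reservation with a runtime-bounded copy into it. A sketch (names made
up):

struct sample {
    __u32 len;
    char data[256];
};

static __always_inline int emit_sample(const void *src, __u32 len)
{
    struct sample *s = bpf_ringbuf_reserve(&rb, sizeof(*s), 0); /* const */
    if (!s)
        return 0;

    len &= 255; /* clamp, so the verifier sees len in [0, 255] */
    if (bpf_probe_read_user(s->data, len, src) < 0) {
        bpf_ringbuf_discard(s, 0);
        return 0;
    }
    s->len = len;
    bpf_ringbuf_submit(s, 0);
    return 0;
}

You still commit the full sizeof(struct sample), but you only copy the
bytes you actually read.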

>
> On Thu, 7 Jan 2021 at 20:48, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > We discussed this topic today at office hour. As I mentioned, I don't
> > know the ideal solution, but here is something that has enough
> > flexibility for real-world uses, while giving the performance and
> > convenience of reserve/commit API. Ignore naming, we can bikeshed that
> > later.
> >
> > So what we can do is introduce a new bpf_ringbuf_reserve() variant:
> >
> > bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> > *extra, __u64 extra_sz);
> >
> > The idea is that we reserve a fixed size amount of data that can be
> > used like it is today for filling a fixed-sized metadata/sample
> > directly. But the real size of the reserved sample is (size +
> > extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> > (kernel or user, depending on flags) data from extra and put it right
> > after the fixed-size part.
> >
> > So the use would be something like:
> >
> > struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> > BPF_RB_PROBE_USER, env_vars, 1024);
> >
> > if (!m)
> >     /* too bad, either probe_read_user failed or ringbuf is full */
> >     return 1;
> >
> > m->my_field1 = 123;
> > m->my_field2 = 321;
>
> This seems useful, although it seems we would then also want a version
> that did probe_read_{user,kernel}_str as well...

It will be as inefficient as a BPF program doing probe_read_str() on
its own into some per-CPU temporary buffer and then using
bpf_ringbuf_reserve_extra(). See below: knowing the size of a
zero-terminated string requires an extra pass over the data. And to be
on the safe side, you need to copy it locally, so it's not modified
between the strlen() and the strcpy(). Which is why I didn't propose
such a string-reading variant. It's inherently inefficient, so better
to just let the BPF program do that and handle all the error conditions
explicitly.

>
> > So the main problem with this is that when probe_read fails, we fail
> > reservation completely (internally we'd just discard ringbuf sample).
> > Is that OK? Or is it better to still reserve fixed-sized part and
> > zero-out the variable-length part? We are combining two separate
> > operations into a single API, so error handling is more convoluted.
>
> I think the correct answer here is "we don't know", and the natural
> response is to then let the user decide. However then we already have
> at least two or three dimensions (user/kernel, error behaviour, _str
> or runtime-fixed size...) which feels like a "design smell" to me.

Exactly. Which is why I'm leaning towards restricting extra_data to be
a known good and initialized memory location (map_value, sk_buff,
another ringbuf sample, etc).

>
> > Now, the main use case requested was to be able to fetch an array of
> > zero-terminated strings. I honestly don't think it's possible to
> > implement this efficiently without two copies of string data. Mostly
> > because to just determine the size of the string you have to read it
> > one extra time. And you'd probably want to copy string data into some
> > controlled storage first, so that you don't end up reading it once
> > successfully and then failing to read it on the second try. Next, when
> > you have multiple strings, how do you deal with partial failures? It's
> > even worse in terms of error handling and error propagation than the
> > fixed extra size variant described above.
> >
> > Ignoring all that, let's say we'd implement
> > bpf_ringbuf_reserve_extra_strs() helper, that would somehow be copying
> > multiple zero-terminated strings after the fixed-size prefix. Think
> > about implementation. Just to determine the total size of the ringbuf
> > sample, you'd need to read all strings once, and probably also copy
> > them locally.  Then you'd reserve a ringbuf sample and copy all that
> > for the second time. So it's as inefficient as a BPF program
> > constructing a single block of memory by reading all such strings
> > manually into a per-CPU array and then using the above
> > bpf_ringbuf_reserve_extra().
> >
> > But offloading that preparation to a BPF program bypasses all these
> > error handling and memory layout questions. It will be up to a BPF
> > program itself. From a kernel perspective, we just append a block of
> > memory with known (at runtime) size.
>
> I agree, I think bpf_ringbuf_reserve_extra_strs would be unnecessarily
> complex, especially if we have what I suggested above which would
> probably be called bpf_ringbuf_reserve_extra_str.
>

Ok, cool.

> > As a more restricted version of bpf_ringbuf_reserve_extra(), instead
> > of allowing reading arbitrary kernel or user-space memory in
> > bpf_ringbuf_reserve_extra() we can say that it has to be known and
> > initialized memory (like MAP_VALUE pointer), so helper knows that it
> > can just copy data directly.
>
> This is just some preliminary input but I need to do some reading and
> think more deeply about this.
>
> Another dimension to think about is that it would be great to be able
> to go directly from a helper like bpf_get_cwd into the ringbuffer with
> neither an intermediate copy nor a reservation of PATH_MAX.

I don't know how you imagine this would even look, honestly. I
certainly can't see it, given the pretty fundamental (but reasonable,
honestly) limitations we have with the ringbuf and the verifier.

>
> On the other hand, as you pointed out in the call, reserving PATH_MAX
> may not be as bad as it sounds since you still only copy the actual
> string length. I'm planning to do some benchmarking next week to
> investigate that.

Sounds good.

>
> Thanks,
> Brendan


* Re: BPF ring buffer variable-length data appending
  2021-01-08 20:02   ` Andrii Nakryiko
@ 2021-01-10  0:57     ` John Fastabend
  0 siblings, 0 replies; 8+ messages in thread
From: John Fastabend @ 2021-01-10  0:57 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Brendan Jackman, KP Singh, bpf, Alexei Starovoitov, Daniel Borkmann

Andrii Nakryiko wrote:
> On Thu, Jan 7, 2021 at 9:44 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > We discussed this topic today at office hour. As I mentioned, I don't
> > > know the ideal solution, but here is something that has enough
> > > flexibility for real-world uses, while giving the performance and
> > > convenience of reserve/commit API. Ignore naming, we can bikeshed that
> > > later.
> >
> > Missed office hours today, dang, sounds interesting. Apologies if I'm
> > missing some context.
> >
> > >
> > > So what we can do is introduce a new bpf_ringbuf_reserve() variant:
> > >
> > > bpf_ringbuf_reserve_extra(void *ringbuf, __u64 size, __u64 flags, void
> > > *extra, __u64 extra_sz);
> > >
> > > The idea is that we reserve a fixed size amount of data that can be
> > > used like it is today for filling a fixed-sized metadata/sample
> > > directly. But the real size of the reserved sample is (size +
> > > extra_sz), and bpf_ringbuf_reserve_extra() helper will bpf_probe_read
> > > (kernel or user, depending on flags) data from extra and put it right
> > > after the fixed-size part.
> > >
> > > So the use would be something like:
> > >
> > > struct my_meta *m = bpf_ringbuf_reserve_extra(&rb, sizeof(*m),
> > > BPF_RB_PROBE_USER, env_vars, 1024);
> > >
> > > if (!m)
> > >     /* too bad, either probe_read_user failed or ringbuf is full */
> > >     return 1;
> > >
> > > m->my_field1 = 123;
> > > m->my_field2 = 321;
> > >
> >
> > Is this a way to increase the reserved space? I'm a bit fuzzy on the
> > details here. But what happens if something else has done another
> > reserve? Then it fails I guess?
> 
> No, you misunderstood. There is no way to safely increase the size of
> already reserved ringbuf record. Naming is hard.

Great, without a scatter-gather list I couldn't see any way increasing
the size would work ;)

> bpf_ringbuf_reserve_extra() reserves the correct size of the record
> from the very beginning. It's just that size is (size + extra_sz). The
> difference from non-extra() variant is that size is still restricted
> to be a known constant, so that verifier can verify safe direct memory
> reads and writes within first *size* bytes. But extra_sz can be
> unknown at the verification time (so it's ARG_ANYTHING), so it can be
> calculated dynamically. That's most of the difference. The other part
> is that because BPF program can't do much with this dynamic part of
> ringbuf record, we also immediately copy provided memory for later
> submission to the user space.

OK. Thanks for the extra details.

> 
> >
> > So consider,
> >
> > CPU0                   CPU1
> >
> > bpf_ringbuf_reserve()
> >                        bpf_ringbuf_reserve()
> > bpf_ringbuf_reserve_extra()
> >
> > Does that *_reserve_extra() fail then? If so it seems very limited
> > from a use perspective. Most of the systems we work with will fail more
> > often than not I think.
> >
> > If the above doesn't fail, I'm missing something about how userspace
> > can know where that buffer is without a scatter-gather list.
> >
> > >
> > > So the main problem with this is that when probe_read fails, we fail
> > > reservation completely (internally we'd just discard ringbuf sample).
> > > Is that OK? Or is it better to still reserve fixed-sized part and
> > > zero-out the variable-length part? We are combining two separate
> > > operations into a single API, so error handling is more convoluted.
> >
> > My $.02 here. Failing is going to be ugly and a real pain to deal
> 
> If data copying failed, you are not getting the data you expected. So
> failing seems reasonable in such a case. The convoluted and unfortunate
> part is that you don't know whether the ringbuf ran out of free
> space or the memory copy failed. If we restrict the extra_data pointer to
> be known good memory (like sk_buff, map_value, etc), then this
> concern goes away. But you also won't be able to use this directly to
> read some piece of generic kernel or user memory without an extra
> (explicit) bpf_probe_read(). Which might be acceptable, I think.

Seems reasonable. Not knowing what failed seems like a horrible-to-debug
error case, so better to fix that.

> 
> > with. I think the best approach is to reserve a fixed-size buffer that
> > meets your 99% case or whatever. Then reserve some overflow buffers
> > you can point to for the oddball java application with a million
> > strings. Yes you need to get more complicated in userspace to manage
> > the thing, but once that code's written everything works out.
> >
> > Also I think we keep the kernel simpler if the BPF program just
> > does another reserve() if it needs more space so,
> >
> >  bpf_ringbuf_reserve()
> >  copy
> >  copy
> >  ENOMEM <- buff is full,
> >  bpf_ringbuf_reserve()
> >  copy
> >  copy
> >  ....
> >
> > Again userspace needs some logic to join the two buffers but we
> > could come up with some user side convention to do this. libbpf
> > for example could have a small buffer header to do this if folks
> > wanted.  Smart BPF programs can even reserve a couple buffers
> > up front for the worst case and recycle them back into their
> > next invocation, I think.
> 
> First, I don't think libbpf should do anything extra here. It's just
> bound to be suboptimal and cumbersome.

Agree, not a great idea. Users will want to do this using the specifics
of their use case.

> 
> But yes, this is another approach, though arguably quite complicated.
> Chaining buffers can be done simply by using cpu_id and recording the
> total number of expected chunks in the very first chunk. You can
> submit the first chunk last, if you use reserve/commit API.
> Reconstruction in user-space is going to require memory allocations
> and copying, though.

Sure, although maybe no copying if you have a scatter-gather list
somewhere. I guess it all depends on your use case.

[...]

> > >
> > > But offloading that preparation to a BPF program bypasses all these
> > > error handling and memory layout questions. It will be up to a BPF
> > > program itself. From a kernel perspective, we just append a block of
> > > memory with known (at runtime) size.
> >
> > Still missing how this would be different from multiple reserve()
> > calls. It's not too hard to join user space buffers, I promise ;)
> 
> It's not trivial, and it's also probably less efficient, as now you need
> to dynamically allocate memory and still do an extra copy. So I can
> certainly see the appeal of being able to submit one whole record,
> instead of trying to re-construct intermingled chunks across multiple
> CPUs.
> 
> Anyways, I'm not advocating one or the other approach, both have the
> right to exist, IMO. There is no free lunch, unfortunately. Either way
> complexity and extra overhead creeps in.

Sure.

> 
> >
> > >
> > > As a more restricted version of bpf_ringbuf_reserve_extra(), instead
> > > of allowing reading arbitrary kernel or user-space memory in
> > > bpf_ringbuf_reserve_extra() we can say that it has to be known and
> > > initialized memory (like MAP_VALUE pointer), so helper knows that it
> > > can just copy data directly.
> >
> > This is a fairly common operation for us, but also just chunks of a map
> > value pointer. So we would want start/end byte offsets. Often our
> > map values have extra data that user space doesn't need or care about.
> 
> Huh? It's just a pointer, you can point inside the MAP_VALUE today,
> no? And you already have extra_sz. So this should work:
> 
> struct my_data *d = bpf_map_lookup_elem(&my_map, &my_key);
> if (!d) return 0;
> 
> bpf_ringbuf_reserve_extra(&rb, 100, 0, (void *)d + 100, 100);
> 
> And you'll be copying the [100, 200) byte range into the ringbuf. No?

Yep that works fine.

> 
> >
> > >
> > > Thoughts?
> >
> > I think I missed the point.
> 
> I think you misunderstood what bpf_ringbuf_reserve_extra() is going to
> do. And we might differ in evaluating how easy it is to handle
> chunking, both in kernel and user-space :)

Yep, the above clarifies the ringbuf_reserve_extra() proposal, thanks.
And having done the work to handle chunking, I don't think it's
terribly difficult, but agree it's non-trivial. Once done, though, it's
actually fairly flexible and handles use cases like arrays of strings
and variable-size structures (TLVs ;) well and seemingly efficiently.

> 
> >
> > >
> > > -- Andrii

