bpf.vger.kernel.org archive mirror
* [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
@ 2023-01-10 13:49 andrea terzolo
  2023-01-11  8:27 ` Jiri Olsa
  0 siblings, 1 reply; 8+ messages in thread
From: andrea terzolo @ 2023-01-10 13:49 UTC (permalink / raw)
  To: bpf

Hello!

I would like to ask a question about the BPF_MAP_TYPE_RINGBUF map.
Looking at the kernel implementation [0], it seems that the data pages
are mapped twice to make the implementation simpler and more
efficient. This seems to be a peculiarity of the ring buffer; the perf
buffer has no such mechanism. In the Falco project [1] we use large
per-CPU buffers to collect almost all the syscalls issued on the
system, and the default size of each buffer is 8 MB. This means that
with the ring buffer approach on a system with 128 CPUs we would have
(128*8*2) MB, while with the perf buffer only (128*8) MB. The issue is
that this memory requirement could be too much for some systems, and
also for Kubernetes environments with strict resource limits. Our
current workaround is to share each ring buffer between more than one
CPU through a BPF_MAP_TYPE_ARRAY_OF_MAPS, allocating, for example, one
ring buffer per CPU pair (see the sketch below). Unfortunately, this
solution has a price: it increases contention on the ring buffers and,
as highlighted here [2], multiple competing writers on the same buffer
could become a real bottleneck. Sorry for the long introduction; my
questions are: are there other approaches to handle such a scenario?
Is there any chance the ring buffer could be used without the kernel
double mapping in the near future? The ring buffer has great
advantages over the perf buffer, but in a scenario like Falco's, with
multiple aggressive producers, this double mapping could become a
limitation.
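
For reference, the sketch below shows a minimal BPF-side layout for
the workaround described above (one ring buffer per CPU pair behind a
BPF_MAP_TYPE_ARRAY_OF_MAPS). Map names, sizes and the event layout are
illustrative assumptions, not the actual Falco code:

/* probe.bpf.c - illustrative sketch; assumes clang + libbpf, with
 * vmlinux.h generated via
 * "bpftool btf dump file /sys/kernel/btf/vmlinux format c". */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct event {
	u32 pid;
	long syscall_id;
};

/* Inner map template: one 8 MB ring buffer. */
struct ringbuf_map {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 8 * 1024 * 1024);
};

/* Outer array: one slot per CPU pair (sized here for up to 128 CPUs).
 * Userspace creates the inner ring buffers and fills the slots before
 * attaching the programs. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 64);
	__type(key, u32);
	__array(values, struct ringbuf_map);
} ringbuf_maps SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_exit")
int handle_sys_exit(struct trace_event_raw_sys_exit *ctx)
{
	u32 slot = bpf_get_smp_processor_id() / 2;	/* CPU pair -> slot */
	void *rb = bpf_map_lookup_elem(&ringbuf_maps, &slot);
	struct event *e;

	if (!rb)
		return 0;
	e = bpf_ringbuf_reserve(rb, sizeof(*e), 0);
	if (!e)
		return 0;	/* buffer full: a real probe would count a drop */
	e->pid = bpf_get_current_pid_tgid() >> 32;
	e->syscall_id = ctx->id;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";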

Thank you in advance for your time,
Andrea

0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
1: https://github.com/falcosecurity/falco
2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


* Re: [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
  2023-01-10 13:49 [QUESTION] usage of BPF_MAP_TYPE_RINGBUF andrea terzolo
@ 2023-01-11  8:27 ` Jiri Olsa
  2023-01-13 22:56   ` Andrii Nakryiko
  0 siblings, 1 reply; 8+ messages in thread
From: Jiri Olsa @ 2023-01-11  8:27 UTC (permalink / raw)
  To: andrea terzolo; +Cc: bpf, Andrii Nakryiko

On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> Hello!
> 
> If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> map. Looking at the kernel implementation [0] it seems that data pages
> are mapped 2 times to have a more efficient and simpler
> implementation. This seems to be a ring buffer peculiarity, the perf
> buffer didn't have such an implementation. In the Falco project [1] we
> use huge per-CPU buffers to collect almost all the syscalls that the
> system throws and the default size of each buffer is 8 MB. This means
> that using the ring buffer approach on a system with 128 CPUs, we will
> have (128*8*2) MB, while with the perf buffer only (128*8) MB. The

Hmm, IIUC it's not allocated twice, the pages are just mapped twice
to cope with wrap-around samples, as described in the git log:

    One interesting implementation bit, that significantly simplifies (and thus
    speeds up as well) implementation of both producers and consumers is how data
    area is mapped twice contiguously back-to-back in the virtual memory. This
    allows to not take any special measures for samples that have to wrap around
    at the end of the circular buffer data area, because the next page after the
    last data page would be first data page again, and thus the sample will still
    appear completely contiguous in virtual memory. See comment and a simple ASCII
    diagram showing this visually in bpf_ringbuf_area_alloc().

> issue is that this memory requirement could be too much for some
> systems and also in Kubernetes environments where there are strict
> resource limits... Our actual workaround is to use ring buffers shared
> between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> example we allocate a ring buffer for each CPU pair. Unfortunately,
> this solution has a price since we increase the contention on the ring
> buffers and as highlighted here [2], the presence of multiple
> competing writers on the same buffer could become a real bottleneck...
> Sorry for the long introduction, my question here is, are there any
> other approaches to manage such a scenario? Will there be a
> possibility to use the ring buffer without the kernel double mapping
> in the near future? The ring buffer has such amazing features with
> respect to the perf buffer, but in a scenario like the Falco one,
> where we have aggressive multiple producers, this double mapping could
> become a limitation.

AFAIK the bpf ring buffer can be shared across CPUs, so you don't
need to have an extra copy for each CPU if you don't really want to.

jirka

> 
> Thank you in advance for your time,
> Andrea
> 
> 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> 1: https://github.com/falcosecurity/falco
> 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


* Re: [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
  2023-01-11  8:27 ` Jiri Olsa
@ 2023-01-13 22:56   ` Andrii Nakryiko
  2023-01-15 17:09     ` andrea terzolo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrii Nakryiko @ 2023-01-13 22:56 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: andrea terzolo, bpf

On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > Hello!
> >
> > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > map. Looking at the kernel implementation [0] it seems that data pages
> > are mapped 2 times to have a more efficient and simpler
> > implementation. This seems to be a ring buffer peculiarity, the perf
> > buffer didn't have such an implementation. In the Falco project [1] we
> > use huge per-CPU buffers to collect almost all the syscalls that the
> > system throws and the default size of each buffer is 8 MB. This means
> > that using the ring buffer approach on a system with 128 CPUs, we will
> > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
>
> hum IIUC it's not allocated twice but pages are just mapped twice,
> to cope with wrap around samples, described in git log:
>
>     One interesting implementation bit, that significantly simplifies (and thus
>     speeds up as well) implementation of both producers and consumers is how data
>     area is mapped twice contiguously back-to-back in the virtual memory. This
>     allows to not take any special measures for samples that have to wrap around
>     at the end of the circular buffer data area, because the next page after the
>     last data page would be first data page again, and thus the sample will still
>     appear completely contiguous in virtual memory. See comment and a simple ASCII
>     diagram showing this visually in bpf_ringbuf_area_alloc().

Yes, exactly: there is no duplication of memory, it's just mapped
twice to make working with records that wrap around simple and
efficient.
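
For anyone curious how the trick works, below is a minimal userspace
sketch of the same idea (illustrative only, not the kernel code): the
same pages are mapped twice back to back with memfd_create() plus two
MAP_FIXED mmap() calls, so a record that wraps past the end of the
buffer can still be read contiguously while physical memory stays at
one buffer's worth:

/* double_map_demo.c - illustrative only, error handling trimmed. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 4096;	/* data area, a multiple of the page size */
	int fd = memfd_create("ring_demo", 0);
	char *base;

	if (fd < 0 || ftruncate(fd, size))
		return 1;

	/* Reserve a 2*size virtual window, then map the same pages twice,
	 * back to back; physical memory used is still just `size`. */
	base = mmap(NULL, 2 * size, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return 1;
	mmap(base, size, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, fd, 0);
	mmap(base + size, size, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, fd, 0);

	/* A write that crosses the end of the buffer lands back at the
	 * start through the second mapping of the same pages. */
	memcpy(base + size - 2, "wrap", 5);
	printf("start of buffer: \"%s\"\n", base);	/* prints "ap" */
	return 0;
}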

>
> > issue is that this memory requirement could be too much for some
> > systems and also in Kubernetes environments where there are strict
> > resource limits... Our actual workaround is to use ring buffers shared
> > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > this solution has a price since we increase the contention on the ring
> > buffers and as highlighted here [2], the presence of multiple
> > competing writers on the same buffer could become a real bottleneck...
> > Sorry for the long introduction, my question here is, are there any
> > other approaches to manage such a scenario? Will there be a
> > possibility to use the ring buffer without the kernel double mapping
> > in the near future? The ring buffer has such amazing features with
> > respect to the perf buffer, but in a scenario like the Falco one,
> > where we have aggressive multiple producers, this double mapping could
> > become a limitation.
>
> AFAIK the bpf ring buffer can be used across cpus, so you don't need
> to have extra copy for each cpu if you don't really want to
>

It seems they do share, but only between pairs of CPUs. But nothing
prevents you from sharing between more than 2 CPUs, right? It's a
tradeoff between contention and overall memory usage (though, as
pointed out, ringbuf doesn't use 2x more memory). Do you actually see
a lot of contention when sharing a ringbuf between 2 CPUs? There are
multiple applications that share a single ringbuf between all CPUs,
and no one has really complained about high contention so far. You'd
probably need to push tons of data non-stop, at which point I'd worry
about consumers not being able to keep up (and definitely not doing
much useful with all this data). But YMMV, of course.
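
On the BPF side a fully shared setup is just one BPF_MAP_TYPE_RINGBUF
map that every CPU reserves from; for reference, here is a minimal
userspace consumer sketch using libbpf's ring_buffer API (the object
file name, the "events" map name and the event layout are assumptions
for illustration, and attach logic is omitted):

/* consumer.c - sketch only; build with -lbpf. */
#include <stdio.h>
#include <bpf/libbpf.h>

struct event {			/* must match the BPF-side layout */
	unsigned int pid;
	long syscall_id;
};

static int handle_event(void *ctx, void *data, size_t size)
{
	const struct event *e = data;

	printf("pid=%u syscall=%ld\n", e->pid, e->syscall_id);
	return 0;		/* returning < 0 stops consumption */
}

int main(void)
{
	struct bpf_object *obj = bpf_object__open_file("probe.bpf.o", NULL);
	struct bpf_map *map;
	struct ring_buffer *rb;

	if (!obj || bpf_object__load(obj))
		return 1;
	map = bpf_object__find_map_by_name(obj, "events");
	if (!map)
		return 1;

	/* A single epoll-driven consumer drains records submitted from
	 * every producer CPU. */
	rb = ring_buffer__new(bpf_map__fd(map), handle_event, NULL, NULL);
	if (!rb)
		return 1;
	while (ring_buffer__poll(rb, -1 /* block until data */) >= 0)
		;

	ring_buffer__free(rb);
	bpf_object__close(obj);
	return 0;
}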

> jirka
>
> >
> > Thank you in advance for your time,
> > Andrea
> >
> > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > 1: https://github.com/falcosecurity/falco
> > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


* Re: [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
  2023-01-13 22:56   ` Andrii Nakryiko
@ 2023-01-15 17:09     ` andrea terzolo
  2023-01-27 18:54       ` Andrii Nakryiko
  0 siblings, 1 reply; 8+ messages in thread
From: andrea terzolo @ 2023-01-15 17:09 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Jiri Olsa, bpf

On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > Hello!
> > >
> > > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > > map. Looking at the kernel implementation [0] it seems that data pages
> > > are mapped 2 times to have a more efficient and simpler
> > > implementation. This seems to be a ring buffer peculiarity, the perf
> > > buffer didn't have such an implementation. In the Falco project [1] we
> > > use huge per-CPU buffers to collect almost all the syscalls that the
> > > system throws and the default size of each buffer is 8 MB. This means
> > > that using the ring buffer approach on a system with 128 CPUs, we will
> > > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> >
> > hum IIUC it's not allocated twice but pages are just mapped twice,
> > to cope with wrap around samples, described in git log:
> >
> >     One interesting implementation bit, that significantly simplifies (and thus
> >     speeds up as well) implementation of both producers and consumers is how data
> >     area is mapped twice contiguously back-to-back in the virtual memory. This
> >     allows to not take any special measures for samples that have to wrap around
> >     at the end of the circular buffer data area, because the next page after the
> >     last data page would be first data page again, and thus the sample will still
> >     appear completely contiguous in virtual memory. See comment and a simple ASCII
> >     diagram showing this visually in bpf_ringbuf_area_alloc().
>
> yes, exactly, there is no duplication of memory, it's just mapped
> twice to make working with records that wrap around simple and
> efficient
>

Thank you very much for the quick response. My previous question was
quite unclear, sorry for that; I will try to explain myself better
with some data. I've collected some thoughts in this document [3]
about two simple examples using the perf buffer and the ring buffer.
Without going into too many details about the document, I've noticed a
strange value of "Resident set size" (RSS) in the ring buffer example.
It's probably perfectly fine, but I really don't understand why the
RSS for each ring buffer is the same as the virtual memory size, and I
wonder whether this could impact the OOM score computation, making a
program that uses ring buffers more vulnerable to the OOM killer.

[3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
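
For context, this is roughly the kind of check behind those numbers,
as a simplified sketch (not the exact code from the document): it
creates one 8 MB BPF_MAP_TYPE_RINGBUF, mmaps it through libbpf's
ring_buffer consumer and prints VmSize/VmRSS from /proc/self/status
before and after (needs CAP_BPF or root; the map name is arbitrary):

/* rss_check.c - simplified sketch, most error handling omitted;
 * build with -lbpf. */
#include <stdio.h>
#include <string.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static void print_vm(const char *tag)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
			printf("%s %s", tag, line);
	fclose(f);
}

static int drop(void *ctx, void *data, size_t size) { return 0; }

int main(void)
{
	struct ring_buffer *rb;
	int map_fd;

	print_vm("before:");
	/* 8 MB data area; the kernel maps it twice back to back. */
	map_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, "rb_demo", 0, 0,
				8 * 1024 * 1024, NULL);
	if (map_fd < 0)
		return 1;
	rb = ring_buffer__new(map_fd, drop, NULL, NULL);  /* mmaps the buffer */
	print_vm("after: ");

	ring_buffer__free(rb);
	return 0;
}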

> >
> > > issue is that this memory requirement could be too much for some
> > > systems and also in Kubernetes environments where there are strict
> > > resource limits... Our actual workaround is to use ring buffers shared
> > > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > > this solution has a price since we increase the contention on the ring
> > > buffers and as highlighted here [2], the presence of multiple
> > > competing writers on the same buffer could become a real bottleneck...
> > > Sorry for the long introduction, my question here is, are there any
> > > other approaches to manage such a scenario? Will there be a
> > > possibility to use the ring buffer without the kernel double mapping
> > > in the near future? The ring buffer has such amazing features with
> > > respect to the perf buffer, but in a scenario like the Falco one,
> > > where we have aggressive multiple producers, this double mapping could
> > > become a limitation.
> >
> > AFAIK the bpf ring buffer can be used across cpus, so you don't need
> > to have extra copy for each cpu if you don't really want to
> >
>
> seems like they do share, but only between CPUs. But nothing prevents
> you from sharing between more than 2 CPUs, right? It's a tradeoff

Yes, exactly: we can, and we will do it.

> between contention and overall memory usage (but as pointed out,
> ringbuf doesn't use 2x more memory). Do you actually see a lot of
> contention when sharing ringbuf between 2 CPUs? There are multiple

Actually no, I haven't seen a lot of contention with this
configuration; it seems to handle the throughput quite well. That
said, it's still an experimental solution, so it hasn't been tested
much against real-world workloads.

> applications that share a single ringbuf between all CPUs, and no one
> really complained about high contention so far. You'd need to push
> tons of data non-stop, probably, at which point I'd worry about
> consumers not being able to keep up (and definitely not doing much
> useful with all this data). But YMMV, of course.
>

We are a little worried about the single-ring-buffer scenario, mainly
when we have something like 64 CPUs and all syscalls enabled, but as
you correctly highlighted, in that case we would also have issues on
the userspace side, since we wouldn't be able to handle all that
traffic, causing tons of event drops. Thank you for the feedback!

> > jirka
> >
> > >
> > > Thank you in advance for your time,
> > > Andrea
> > >
> > > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > 1: https://github.com/falcosecurity/falco
> > > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


* Re: [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
  2023-01-15 17:09     ` andrea terzolo
@ 2023-01-27 18:54       ` Andrii Nakryiko
  2023-02-05 15:28         ` andrea terzolo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrii Nakryiko @ 2023-01-27 18:54 UTC (permalink / raw)
  To: andrea terzolo; +Cc: Jiri Olsa, bpf

On Sun, Jan 15, 2023 at 9:10 AM andrea terzolo <andreaterzolo3@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > > Hello!
> > > >
> > > > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > > > map. Looking at the kernel implementation [0] it seems that data pages
> > > > are mapped 2 times to have a more efficient and simpler
> > > > implementation. This seems to be a ring buffer peculiarity, the perf
> > > > buffer didn't have such an implementation. In the Falco project [1] we
> > > > use huge per-CPU buffers to collect almost all the syscalls that the
> > > > system throws and the default size of each buffer is 8 MB. This means
> > > > that using the ring buffer approach on a system with 128 CPUs, we will
> > > > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> > >
> > > hum IIUC it's not allocated twice but pages are just mapped twice,
> > > to cope with wrap around samples, described in git log:
> > >
> > >     One interesting implementation bit, that significantly simplifies (and thus
> > >     speeds up as well) implementation of both producers and consumers is how data
> > >     area is mapped twice contiguously back-to-back in the virtual memory. This
> > >     allows to not take any special measures for samples that have to wrap around
> > >     at the end of the circular buffer data area, because the next page after the
> > >     last data page would be first data page again, and thus the sample will still
> > >     appear completely contiguous in virtual memory. See comment and a simple ASCII
> > >     diagram showing this visually in bpf_ringbuf_area_alloc().
> >
> > yes, exactly, there is no duplication of memory, it's just mapped
> > twice to make working with records that wrap around simple and
> > efficient
> >
>
> Thank you very much for the quick response, my previous question was
> quite unclear, sorry for that, I will try to explain me better with
> some data. I've collected in this document [3] some thoughts regarding
> 2 simple examples with perf buffer and ring buffer. Without going into
> too many details about the document, I've noticed a strange value of
> "Resident set size" (RSS) in the ring buffer example. Probably is
> perfectly fine but I really don't understand why the "RSS" for each
> ring buffer assumes the same value of the Virtual memory size and I'm
> just asking myself if this fact could impact the OOM score computation
> making the program that uses ring buffers more vulnerable to the OOM
> killer.
>
> [3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
>

I'm not an mm expert, unfortunately. Perhaps because we have twice as
many pages mapped (even though they use only 8MB of physical memory),
it is treated as if the process's RSS usage were 2x that. I can see
how that might be a concern for the OOM score, but I'm not sure what
can be done about this...

> > >
> > > > issue is that this memory requirement could be too much for some
> > > > systems and also in Kubernetes environments where there are strict
> > > > resource limits... Our actual workaround is to use ring buffers shared
> > > > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > > > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > > > this solution has a price since we increase the contention on the ring
> > > > buffers and as highlighted here [2], the presence of multiple
> > > > competing writers on the same buffer could become a real bottleneck...
> > > > Sorry for the long introduction, my question here is, are there any
> > > > other approaches to manage such a scenario? Will there be a
> > > > possibility to use the ring buffer without the kernel double mapping
> > > > in the near future? The ring buffer has such amazing features with
> > > > respect to the perf buffer, but in a scenario like the Falco one,
> > > > where we have aggressive multiple producers, this double mapping could
> > > > become a limitation.
> > >
> > > AFAIK the bpf ring buffer can be used across cpus, so you don't need
> > > to have extra copy for each cpu if you don't really want to
> > >
> >
> > seems like they do share, but only between CPUs. But nothing prevents
> > you from sharing between more than 2 CPUs, right? It's a tradeoff
>
> Yes exactly, we can and we will do it
>
> > between contention and overall memory usage (but as pointed out,
> > ringbuf doesn't use 2x more memory). Do you actually see a lot of
> > contention when sharing ringbuf between 2 CPUs? There are multiple
>
> Actually no, I've not seen a lot of contention with this
> configuration, it seems to handle throughput quite well. BTW it's
> still an experimental solution so it is not much tested against
> real-world workloads.
>
> > applications that share a single ringbuf between all CPUs, and no one
> > really complained about high contention so far. You'd need to push
> > tons of data non-stop, probably, at which point I'd worry about
> > consumers not being able to keep up (and definitely not doing much
> > useful with all this data). But YMMV, of course.
> >
>
> We are a little bit worried about the single ring buffer scenario,
> mainly when we have something like 64 CPUs and all syscalls enabled,
> but as you correctly highlighted in this case we would have also some
> issues userspace side because we wouldn't be able to handle all this
> traffic, causing tons of event drops. BTW thank you for the feedback!
>

If you decide to use ringbuf, I'd leverage its ability to be shared
across multiple CPUs and thus reduce the OOM score concern. This is
what we see in practice here at Meta: with the same or even a smaller
total amount of memory used for ringbuf(s), compared to perfbuf, we
see fewer (or no) event drops, thanks to the bigger shared buffer that
can absorb temporary spikes in the number of events produced.

> > > jirka
> > >
> > > >
> > > > Thank you in advance for your time,
> > > > Andrea
> > > >
> > > > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > > 1: https://github.com/falcosecurity/falco
> > > > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


* Re: [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
  2023-01-27 18:54       ` Andrii Nakryiko
@ 2023-02-05 15:28         ` andrea terzolo
  2023-02-15  1:35           ` Andrii Nakryiko
  0 siblings, 1 reply; 8+ messages in thread
From: andrea terzolo @ 2023-02-05 15:28 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Jiri Olsa, bpf

On Fri, Jan 27, 2023 at 7:54 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Jan 15, 2023 at 9:10 AM andrea terzolo <andreaterzolo3@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > > > Hello!
> > > > >
> > > > > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > > > > map. Looking at the kernel implementation [0] it seems that data pages
> > > > > are mapped 2 times to have a more efficient and simpler
> > > > > implementation. This seems to be a ring buffer peculiarity, the perf
> > > > > buffer didn't have such an implementation. In the Falco project [1] we
> > > > > use huge per-CPU buffers to collect almost all the syscalls that the
> > > > > system throws and the default size of each buffer is 8 MB. This means
> > > > > that using the ring buffer approach on a system with 128 CPUs, we will
> > > > > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> > > >
> > > > hum IIUC it's not allocated twice but pages are just mapped twice,
> > > > to cope with wrap around samples, described in git log:
> > > >
> > > >     One interesting implementation bit, that significantly simplifies (and thus
> > > >     speeds up as well) implementation of both producers and consumers is how data
> > > >     area is mapped twice contiguously back-to-back in the virtual memory. This
> > > >     allows to not take any special measures for samples that have to wrap around
> > > >     at the end of the circular buffer data area, because the next page after the
> > > >     last data page would be first data page again, and thus the sample will still
> > > >     appear completely contiguous in virtual memory. See comment and a simple ASCII
> > > >     diagram showing this visually in bpf_ringbuf_area_alloc().
> > >
> > > yes, exactly, there is no duplication of memory, it's just mapped
> > > twice to make working with records that wrap around simple and
> > > efficient
> > >
> >
> > Thank you very much for the quick response, my previous question was
> > quite unclear, sorry for that, I will try to explain me better with
> > some data. I've collected in this document [3] some thoughts regarding
> > 2 simple examples with perf buffer and ring buffer. Without going into
> > too many details about the document, I've noticed a strange value of
> > "Resident set size" (RSS) in the ring buffer example. Probably is
> > perfectly fine but I really don't understand why the "RSS" for each
> > ring buffer assumes the same value of the Virtual memory size and I'm
> > just asking myself if this fact could impact the OOM score computation
> > making the program that uses ring buffers more vulnerable to the OOM
> > killer.
> >
> > [3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
> >
>
> I'm not an mm expert, unfortunately. Perhaps because we have twice as
> many pages mapped (even though they are using only 8MB of physical
> memory), it is treated as if process' RSS usage is 2x of that. I can
> see how that might be a concern for OOM score, but I'm not sure what
> can be done about this...
>

Yes, this is weird behavior. Unfortunately, a process that uses a
ring buffer for each CPU is penalized from this point of view compared
to one that uses a perf buffer. Do you by chance know someone who
could help us with this strange memory accounting?

> > > >
> > > > > issue is that this memory requirement could be too much for some
> > > > > systems and also in Kubernetes environments where there are strict
> > > > > resource limits... Our actual workaround is to use ring buffers shared
> > > > > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > > > > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > > > > this solution has a price since we increase the contention on the ring
> > > > > buffers and as highlighted here [2], the presence of multiple
> > > > > competing writers on the same buffer could become a real bottleneck...
> > > > > Sorry for the long introduction, my question here is, are there any
> > > > > other approaches to manage such a scenario? Will there be a
> > > > > possibility to use the ring buffer without the kernel double mapping
> > > > > in the near future? The ring buffer has such amazing features with
> > > > > respect to the perf buffer, but in a scenario like the Falco one,
> > > > > where we have aggressive multiple producers, this double mapping could
> > > > > become a limitation.
> > > >
> > > > AFAIK the bpf ring buffer can be used across cpus, so you don't need
> > > > to have extra copy for each cpu if you don't really want to
> > > >
> > >
> > > seems like they do share, but only between CPUs. But nothing prevents
> > > you from sharing between more than 2 CPUs, right? It's a tradeoff
> >
> > Yes exactly, we can and we will do it
> >
> > > between contention and overall memory usage (but as pointed out,
> > > ringbuf doesn't use 2x more memory). Do you actually see a lot of
> > > contention when sharing ringbuf between 2 CPUs? There are multiple
> >
> > Actually no, I've not seen a lot of contention with this
> > configuration, it seems to handle throughput quite well. BTW it's
> > still an experimental solution so it is not much tested against
> > real-world workloads.
> >
> > > applications that share a single ringbuf between all CPUs, and no one
> > > really complained about high contention so far. You'd need to push
> > > tons of data non-stop, probably, at which point I'd worry about
> > > consumers not being able to keep up (and definitely not doing much
> > > useful with all this data). But YMMV, of course.
> > >
> >
> > We are a little bit worried about the single ring buffer scenario,
> > mainly when we have something like 64 CPUs and all syscalls enabled,
> > but as you correctly highlighted in this case we would have also some
> > issues userspace side because we wouldn't be able to handle all this
> > traffic, causing tons of event drops. BTW thank you for the feedback!
> >
>
> If you decide to use ringbuf, I'd leverage its ability to be used
> across multiple CPUs and thus reduce the OOM score concern. This is
> what we see in practice here at Meta: at the same or even smaller
> total amount of memory used for ringbuf(s), compared to perfbuf, we
> see less (or no) event drops due to bigger shared buffer that can
> absorb temporary spikes in the amount of events produced.
>

Thank you for the valuable feedback about shared ring buffers. We are
already experimenting with similar solutions to mitigate the OOM score
issue; maybe this could be the right way to go for our use case as
well!

> > > > jirka
> > > >
> > > > >
> > > > > Thank you in advance for your time,
> > > > > Andrea
> > > > >
> > > > > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > > > 1: https://github.com/falcosecurity/falco
> > > > > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


* Re: [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
  2023-02-05 15:28         ` andrea terzolo
@ 2023-02-15  1:35           ` Andrii Nakryiko
  2023-02-15 22:00             ` andrea terzolo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrii Nakryiko @ 2023-02-15  1:35 UTC (permalink / raw)
  To: andrea terzolo; +Cc: Jiri Olsa, bpf

On Sun, Feb 5, 2023 at 7:28 AM andrea terzolo <andreaterzolo3@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 7:54 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sun, Jan 15, 2023 at 9:10 AM andrea terzolo <andreaterzolo3@gmail.com> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > > >
> > > > > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > > > > Hello!
> > > > > >
> > > > > > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > > > > > map. Looking at the kernel implementation [0] it seems that data pages
> > > > > > are mapped 2 times to have a more efficient and simpler
> > > > > > implementation. This seems to be a ring buffer peculiarity, the perf
> > > > > > buffer didn't have such an implementation. In the Falco project [1] we
> > > > > > use huge per-CPU buffers to collect almost all the syscalls that the
> > > > > > system throws and the default size of each buffer is 8 MB. This means
> > > > > > that using the ring buffer approach on a system with 128 CPUs, we will
> > > > > > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> > > > >
> > > > > hum IIUC it's not allocated twice but pages are just mapped twice,
> > > > > to cope with wrap around samples, described in git log:
> > > > >
> > > > >     One interesting implementation bit, that significantly simplifies (and thus
> > > > >     speeds up as well) implementation of both producers and consumers is how data
> > > > >     area is mapped twice contiguously back-to-back in the virtual memory. This
> > > > >     allows to not take any special measures for samples that have to wrap around
> > > > >     at the end of the circular buffer data area, because the next page after the
> > > > >     last data page would be first data page again, and thus the sample will still
> > > > >     appear completely contiguous in virtual memory. See comment and a simple ASCII
> > > > >     diagram showing this visually in bpf_ringbuf_area_alloc().
> > > >
> > > > yes, exactly, there is no duplication of memory, it's just mapped
> > > > twice to make working with records that wrap around simple and
> > > > efficient
> > > >
> > >
> > > Thank you very much for the quick response, my previous question was
> > > quite unclear, sorry for that, I will try to explain me better with
> > > some data. I've collected in this document [3] some thoughts regarding
> > > 2 simple examples with perf buffer and ring buffer. Without going into
> > > too many details about the document, I've noticed a strange value of
> > > "Resident set size" (RSS) in the ring buffer example. Probably is
> > > perfectly fine but I really don't understand why the "RSS" for each
> > > ring buffer assumes the same value of the Virtual memory size and I'm
> > > just asking myself if this fact could impact the OOM score computation
> > > making the program that uses ring buffers more vulnerable to the OOM
> > > killer.
> > >
> > > [3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
> > >
> >
> > I'm not an mm expert, unfortunately. Perhaps because we have twice as
> > many pages mapped (even though they are using only 8MB of physical
> > memory), it is treated as if process' RSS usage is 2x of that. I can
> > see how that might be a concern for OOM score, but I'm not sure what
> > can be done about this...
> >
>
> Yes, this is weird behavior. Unfortunately, a process that uses a ring
> buffer for each CPU is penalized from this point of view with respect
> to one that uses a perf buffer. Do you know by chance someone who can
> help us with this strange memory reservation?

So I checked with an MM expert, and he confirmed that currently there
is no way to avoid this double accounting of the memory reserved by
the BPF ringbuf. But this doesn't seem to be a problem unique to the
BPF ringbuf; RSS accounting is generally known to have problems with
double-counting memory in some situations.

One relatively clean way that was suggested to solve this problem
would be to add a new memory counter (in addition to the existing
MM_SHMEMPAGES, MM_SWAPENTS, MM_ANONPAGES, MM_FILEPAGES) to compensate
for cases like this.

But that does look like pretty big overkill here, tbh. Sorry, I don't
have a good solution for you here.

>
> > > > >
> > > > > > issue is that this memory requirement could be too much for some
> > > > > > systems and also in Kubernetes environments where there are strict
> > > > > > resource limits... Our actual workaround is to use ring buffers shared
> > > > > > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > > > > > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > > > > > this solution has a price since we increase the contention on the ring
> > > > > > buffers and as highlighted here [2], the presence of multiple
> > > > > > competing writers on the same buffer could become a real bottleneck...
> > > > > > Sorry for the long introduction, my question here is, are there any
> > > > > > other approaches to manage such a scenario? Will there be a
> > > > > > possibility to use the ring buffer without the kernel double mapping
> > > > > > in the near future? The ring buffer has such amazing features with
> > > > > > respect to the perf buffer, but in a scenario like the Falco one,
> > > > > > where we have aggressive multiple producers, this double mapping could
> > > > > > become a limitation.
> > > > >
> > > > > AFAIK the bpf ring buffer can be used across cpus, so you don't need
> > > > > to have extra copy for each cpu if you don't really want to
> > > > >
> > > >
> > > > seems like they do share, but only between CPUs. But nothing prevents
> > > > you from sharing between more than 2 CPUs, right? It's a tradeoff
> > >
> > > Yes exactly, we can and we will do it
> > >
> > > > between contention and overall memory usage (but as pointed out,
> > > > ringbuf doesn't use 2x more memory). Do you actually see a lot of
> > > > contention when sharing ringbuf between 2 CPUs? There are multiple
> > >
> > > Actually no, I've not seen a lot of contention with this
> > > configuration, it seems to handle throughput quite well. BTW it's
> > > still an experimental solution so it is not much tested against
> > > real-world workloads.
> > >
> > > > applications that share a single ringbuf between all CPUs, and no one
> > > > really complained about high contention so far. You'd need to push
> > > > tons of data non-stop, probably, at which point I'd worry about
> > > > consumers not being able to keep up (and definitely not doing much
> > > > useful with all this data). But YMMV, of course.
> > > >
> > >
> > > We are a little bit worried about the single ring buffer scenario,
> > > mainly when we have something like 64 CPUs and all syscalls enabled,
> > > but as you correctly highlighted in this case we would have also some
> > > issues userspace side because we wouldn't be able to handle all this
> > > traffic, causing tons of event drops. BTW thank you for the feedback!
> > >
> >
> > If you decide to use ringbuf, I'd leverage its ability to be used
> > across multiple CPUs and thus reduce the OOM score concern. This is
> > what we see in practice here at Meta: at the same or even smaller
> > total amount of memory used for ringbuf(s), compared to perfbuf, we
> > see less (or no) event drops due to bigger shared buffer that can
> > absorb temporary spikes in the amount of events produced.
> >
>
> Thank you for the precious feedback about shared ring buffers, we are
> already experimenting with similar solutions to mitigate the OOM score
> issue, maybe this could be the right way to go also for our use case!

Hopefully this will work for you.

>
> > > > > jirka
> > > > >
> > > > > >
> > > > > > Thank you in advance for your time,
> > > > > > Andrea
> > > > > >
> > > > > > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > > > > 1: https://github.com/falcosecurity/falco
> > > > > > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


* Re: [QUESTION] usage of BPF_MAP_TYPE_RINGBUF
  2023-02-15  1:35           ` Andrii Nakryiko
@ 2023-02-15 22:00             ` andrea terzolo
  0 siblings, 0 replies; 8+ messages in thread
From: andrea terzolo @ 2023-02-15 22:00 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Jiri Olsa, bpf

On Wed, Feb 15, 2023 at 2:35 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Feb 5, 2023 at 7:28 AM andrea terzolo <andreaterzolo3@gmail.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 7:54 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Sun, Jan 15, 2023 at 9:10 AM andrea terzolo <andreaterzolo3@gmail.com> wrote:
> > > >
> > > > On Fri, Jan 13, 2023 at 11:57 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jan 11, 2023 at 12:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Jan 10, 2023 at 02:49:59PM +0100, andrea terzolo wrote:
> > > > > > > Hello!
> > > > > > >
> > > > > > > If I can I would ask a question regarding the BPF_MAP_TYPE_RINGBUF
> > > > > > > map. Looking at the kernel implementation [0] it seems that data pages
> > > > > > > are mapped 2 times to have a more efficient and simpler
> > > > > > > implementation. This seems to be a ring buffer peculiarity, the perf
> > > > > > > buffer didn't have such an implementation. In the Falco project [1] we
> > > > > > > use huge per-CPU buffers to collect almost all the syscalls that the
> > > > > > > system throws and the default size of each buffer is 8 MB. This means
> > > > > > > that using the ring buffer approach on a system with 128 CPUs, we will
> > > > > > > have (128*8*2) MB, while with the perf buffer only (128*8) MB. The
> > > > > >
> > > > > > hum IIUC it's not allocated twice but pages are just mapped twice,
> > > > > > to cope with wrap around samples, described in git log:
> > > > > >
> > > > > >     One interesting implementation bit, that significantly simplifies (and thus
> > > > > >     speeds up as well) implementation of both producers and consumers is how data
> > > > > >     area is mapped twice contiguously back-to-back in the virtual memory. This
> > > > > >     allows to not take any special measures for samples that have to wrap around
> > > > > >     at the end of the circular buffer data area, because the next page after the
> > > > > >     last data page would be first data page again, and thus the sample will still
> > > > > >     appear completely contiguous in virtual memory. See comment and a simple ASCII
> > > > > >     diagram showing this visually in bpf_ringbuf_area_alloc().
> > > > >
> > > > > yes, exactly, there is no duplication of memory, it's just mapped
> > > > > twice to make working with records that wrap around simple and
> > > > > efficient
> > > > >
> > > >
> > > > Thank you very much for the quick response, my previous question was
> > > > quite unclear, sorry for that, I will try to explain me better with
> > > > some data. I've collected in this document [3] some thoughts regarding
> > > > 2 simple examples with perf buffer and ring buffer. Without going into
> > > > too many details about the document, I've noticed a strange value of
> > > > "Resident set size" (RSS) in the ring buffer example. Probably is
> > > > perfectly fine but I really don't understand why the "RSS" for each
> > > > ring buffer assumes the same value of the Virtual memory size and I'm
> > > > just asking myself if this fact could impact the OOM score computation
> > > > making the program that uses ring buffers more vulnerable to the OOM
> > > > killer.
> > > >
> > > > [3]: https://hackmd.io/@l56JYH1SS9-QXhSNXKanMw/r1Z8APWso
> > > >
> > >
> > > I'm not an mm expert, unfortunately. Perhaps because we have twice as
> > > many pages mapped (even though they are using only 8MB of physical
> > > memory), it is treated as if process' RSS usage is 2x of that. I can
> > > see how that might be a concern for OOM score, but I'm not sure what
> > > can be done about this...
> > >
> >
> > Yes, this is weird behavior. Unfortunately, a process that uses a ring
> > buffer for each CPU is penalized from this point of view with respect
> > to one that uses a perf buffer. Do you know by chance someone who can
> > help us with this strange memory reservation?
>
> So I checked with MM expert, and he confirmed that currently there is
> no way to avoid this double-accounting of memory reserved by BPF
> ringbuf. But this doesn't seem to be a problem unique to BPF ringbuf,
> generally RSS accounting is known to have problems with double
> counting memory in some situations.
>
Thank you for reporting this, and for all the help in this thread;
it's really appreciated!

> One relatively clean suggested way to solve this problem would be to
> add a new memory counter (in addition to existing MM_SHMEMPAGES,
> MM_SWAPENTS, MM_ANONPAGES, MM_FILEPAGES) to compensate for cases like
> this.
>
> But it does look like a pretty big overkill here, tbh. Sorry, I don't
> have a good solution for you here.
>
> >
> > > > > >
> > > > > > > issue is that this memory requirement could be too much for some
> > > > > > > systems and also in Kubernetes environments where there are strict
> > > > > > > resource limits... Our actual workaround is to use ring buffers shared
> > > > > > > between more than one CPU with a BPF_MAP_TYPE_ARRAY_OF_MAPS, so for
> > > > > > > example we allocate a ring buffer for each CPU pair. Unfortunately,
> > > > > > > this solution has a price since we increase the contention on the ring
> > > > > > > buffers and as highlighted here [2], the presence of multiple
> > > > > > > competing writers on the same buffer could become a real bottleneck...
> > > > > > > Sorry for the long introduction, my question here is, are there any
> > > > > > > other approaches to manage such a scenario? Will there be a
> > > > > > > possibility to use the ring buffer without the kernel double mapping
> > > > > > > in the near future? The ring buffer has such amazing features with
> > > > > > > respect to the perf buffer, but in a scenario like the Falco one,
> > > > > > > where we have aggressive multiple producers, this double mapping could
> > > > > > > become a limitation.
> > > > > >
> > > > > > AFAIK the bpf ring buffer can be used across cpus, so you don't need
> > > > > > to have extra copy for each cpu if you don't really want to
> > > > > >
> > > > >
> > > > > seems like they do share, but only between CPUs. But nothing prevents
> > > > > you from sharing between more than 2 CPUs, right? It's a tradeoff
> > > >
> > > > Yes exactly, we can and we will do it
> > > >
> > > > > between contention and overall memory usage (but as pointed out,
> > > > > ringbuf doesn't use 2x more memory). Do you actually see a lot of
> > > > > contention when sharing ringbuf between 2 CPUs? There are multiple
> > > >
> > > > Actually no, I've not seen a lot of contention with this
> > > > configuration, it seems to handle throughput quite well. BTW it's
> > > > still an experimental solution so it is not much tested against
> > > > real-world workloads.
> > > >
> > > > > applications that share a single ringbuf between all CPUs, and no one
> > > > > really complained about high contention so far. You'd need to push
> > > > > tons of data non-stop, probably, at which point I'd worry about
> > > > > consumers not being able to keep up (and definitely not doing much
> > > > > useful with all this data). But YMMV, of course.
> > > > >
> > > >
> > > > We are a little bit worried about the single ring buffer scenario,
> > > > mainly when we have something like 64 CPUs and all syscalls enabled,
> > > > but as you correctly highlighted in this case we would have also some
> > > > issues userspace side because we wouldn't be able to handle all this
> > > > traffic, causing tons of event drops. BTW thank you for the feedback!
> > > >
> > >
> > > If you decide to use ringbuf, I'd leverage its ability to be used
> > > across multiple CPUs and thus reduce the OOM score concern. This is
> > > what we see in practice here at Meta: at the same or even smaller
> > > total amount of memory used for ringbuf(s), compared to perfbuf, we
> > > see less (or no) event drops due to bigger shared buffer that can
> > > absorb temporary spikes in the amount of events produced.
> > >
> >
> > Thank you for the precious feedback about shared ring buffers, we are
> > already experimenting with similar solutions to mitigate the OOM score
> > issue, maybe this could be the right way to go also for our use case!
>
> Hopefully this will work for you.
>
> >
> > > > > > jirka
> > > > > >
> > > > > > >
> > > > > > > Thank you in advance for your time,
> > > > > > > Andrea
> > > > > > >
> > > > > > > 0: https://github.com/torvalds/linux/blob/master/kernel/bpf/ringbuf.c#L107
> > > > > > > 1: https://github.com/falcosecurity/falco
> > > > > > > 2: https://patchwork.ozlabs.org/project/netdev/patch/20200529075424.3139988-5-andriin@fb.com/


end of thread

Thread overview: 8 messages
2023-01-10 13:49 [QUESTION] usage of BPF_MAP_TYPE_RINGBUF andrea terzolo
2023-01-11  8:27 ` Jiri Olsa
2023-01-13 22:56   ` Andrii Nakryiko
2023-01-15 17:09     ` andrea terzolo
2023-01-27 18:54       ` Andrii Nakryiko
2023-02-05 15:28         ` andrea terzolo
2023-02-15  1:35           ` Andrii Nakryiko
2023-02-15 22:00             ` andrea terzolo
