bpf.vger.kernel.org archive mirror
* Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
@ 2023-12-07 13:15 Philo Lu
  2023-12-07 14:48 ` Alan Maguire
  0 siblings, 1 reply; 17+ messages in thread
From: Philo Lu @ 2023-12-07 13:15 UTC (permalink / raw)
  To: bpf
  Cc: song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li, guwen,
	alibuda, hengqi

Hi all. I have a question about using perfbuf/ringbuf in bpf,
and I would appreciate any advice.

Imagine a simple case: the bpf program outputs a log (some tcp
statistics) to userspace every time a packet is received, and
the user actively reads the logs when they want. I do not want
to keep a user process alive waiting for outputs from the
buffer; the user should be able to read it as needed. BTW, the
order does not matter.

In short, I would like the buffer to behave like relayfs: (1)
no user process needs to stay around to receive logs, and the
user may read at any time (no wakeup would be even better);
(2) old data can be overwritten by new data.

Currently, it seems that neither perfbuf nor ringbuf satisfies
both: (i) ringbuf satisfies only (1): if data arrive while the
buffer is full, the new data are lost until the buffer is
consumed. (ii) perfbuf satisfies only (2): the user cannot
access the buffer after the process that created it (including
perf_event->rb via mmap) exits. Specifically, I can use the
BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
not know how to get at the buffer again from a new process.

In my opinion, this could be solved by either of the
following: (a) add overwrite support to ringbuf (maybe a new
flag for reserve), but we would have to address
synchronization between kernel and user, especially with
variable-sized records, because when overwriting occurs the
kernel has to update the consumer position too; (b) implement
map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
via the map_lookup_elem syscall, plus a mechanism to preserve
perf_event->rb when the process exits (otherwise the buffer
is freed by perf_mmap_close). I am not sure whether these are
feasible, or which is better. If neither is, perhaps we can
develop another mechanism to achieve this?
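
For concreteness, the ringbuf side of what I am doing today looks
roughly like this (a sketch; the event layout and attach point are
made up for illustration), and the NULL return from
bpf_ringbuf_reserve() is exactly where new data get lost:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	char LICENSE[] SEC("license") = "GPL";

	struct tcp_log {                /* illustrative event layout */
		__u64 ts;
		__u32 len;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_RINGBUF);
		__uint(max_entries, 256 * 1024);
	} rb SEC(".maps");

	SEC("kprobe/tcp_rcv_established") /* illustrative attach point */
	int log_packet(void *ctx)
	{
		struct tcp_log *e;

		e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
		if (!e)
			return 0;  /* buffer full: this record is lost */
		e->ts = bpf_ktime_get_ns();
		e->len = 0;
		/* BPF_RB_NO_WAKEUP: don't wake a (possibly absent) consumer */
		bpf_ringbuf_submit(e, BPF_RB_NO_WAKEUP);
		return 0;
	}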

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-07 13:15 Question about bpf perfbuf/ringbuf: pinned in backend with overwriting Philo Lu
@ 2023-12-07 14:48 ` Alan Maguire
  2023-12-08 22:32   ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Maguire @ 2023-12-07 14:48 UTC (permalink / raw)
  To: Philo Lu, bpf
  Cc: song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li, guwen,
	alibuda, hengqi

On 07/12/2023 13:15, Philo Lu wrote:
> Hi all. I have a question about using perfbuf/ringbuf in bpf,
> and I would appreciate any advice.
>
> Imagine a simple case: the bpf program outputs a log (some tcp
> statistics) to userspace every time a packet is received, and
> the user actively reads the logs when they want. I do not want
> to keep a user process alive waiting for outputs from the
> buffer; the user should be able to read it as needed. BTW, the
> order does not matter.
>
> In short, I would like the buffer to behave like relayfs: (1)
> no user process needs to stay around to receive logs, and the
> user may read at any time (no wakeup would be even better);
> (2) old data can be overwritten by new data.
>
> Currently, it seems that neither perfbuf nor ringbuf satisfies
> both: (i) ringbuf satisfies only (1): if data arrive while the
> buffer is full, the new data are lost until the buffer is
> consumed. (ii) perfbuf satisfies only (2): the user cannot
> access the buffer after the process that created it (including
> perf_event->rb via mmap) exits. Specifically, I can use the
> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
> not know how to get at the buffer again from a new process.
>
> In my opinion, this could be solved by either of the
> following: (a) add overwrite support to ringbuf (maybe a new
> flag for reserve), but we would have to address
> synchronization between kernel and user, especially with
> variable-sized records, because when overwriting occurs the
> kernel has to update the consumer position too; (b) implement
> map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
> via the map_lookup_elem syscall, plus a mechanism to preserve
> perf_event->rb when the process exits (otherwise the buffer
> is freed by perf_mmap_close). I am not sure whether these are
> feasible, or which is better. If neither is, perhaps we can
> develop another mechanism to achieve this?
> 

There was an RFC a while back focused on supporting BPF ringbuf
over-writing [1]; at the time, Andrii noted some potential issues that
might be exposed by doing multiple ringbuf reserves to overfill the
buffer within the same program.

Alan

[1]
https://lore.kernel.org/lkml/20220906195656.33021-2-flaniel@linux.microsoft.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-07 14:48 ` Alan Maguire
@ 2023-12-08 22:32   ` Andrii Nakryiko
  2023-12-11 12:39     ` Philo Lu
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2023-12-08 22:32 UTC (permalink / raw)
  To: Alan Maguire
  Cc: Philo Lu, bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo,
	dust.li, guwen, alibuda, hengqi, Nathan Slingerland, rihams

On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On 07/12/2023 13:15, Philo Lu wrote:
> > Hi all. I have a question about using perfbuf/ringbuf in bpf,
> > and I would appreciate any advice.
> >
> > Imagine a simple case: the bpf program outputs a log (some tcp
> > statistics) to userspace every time a packet is received, and
> > the user actively reads the logs when they want. I do not want
> > to keep a user process alive waiting for outputs from the
> > buffer; the user should be able to read it as needed. BTW, the
> > order does not matter.
> >
> > In short, I would like the buffer to behave like relayfs: (1)
> > no user process needs to stay around to receive logs, and the
> > user may read at any time (no wakeup would be even better);
> > (2) old data can be overwritten by new data.
> >
> > Currently, it seems that neither perfbuf nor ringbuf satisfies
> > both: (i) ringbuf satisfies only (1): if data arrive while the
> > buffer is full, the new data are lost until the buffer is
> > consumed. (ii) perfbuf satisfies only (2): the user cannot
> > access the buffer after the process that created it (including
> > perf_event->rb via mmap) exits. Specifically, I can use the
> > BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
> > not know how to get at the buffer again from a new process.
> >
> > In my opinion, this could be solved by either of the
> > following: (a) add overwrite support to ringbuf (maybe a new
> > flag for reserve), but we would have to address
> > synchronization between kernel and user, especially with
> > variable-sized records, because when overwriting occurs the
> > kernel has to update the consumer position too; (b) implement
> > map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
> > via the map_lookup_elem syscall, plus a mechanism to preserve
> > perf_event->rb when the process exits (otherwise the buffer
> > is freed by perf_mmap_close). I am not sure whether these are
> > feasible, or which is better. If neither is, perhaps we can
> > develop another mechanism to achieve this?
> >
>
> There was an RFC a while back focused on supporting BPF ringbuf
> over-writing [1]; at the time, Andrii noted some potential issues that
> might be exposed by doing multiple ringbuf reserves to overfill the
> buffer within the same program.
>

Correct. I don't think it's possible to correctly and safely support
overwriting with BPF ringbuf that has variable-sized elements.

We'll need to implement an MPMC ringbuf (probably with fixed-size
elements) to be able to support this.

> Alan
>
> [1]
> https://lore.kernel.org/lkml/20220906195656.33021-2-flaniel@linux.microsoft.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-08 22:32   ` Andrii Nakryiko
@ 2023-12-11 12:39     ` Philo Lu
  2023-12-13 23:35       ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Philo Lu @ 2023-12-11 12:39 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li,
	guwen, alibuda, hengqi, Nathan Slingerland, rihams, Alan Maguire



On 2023/12/9 06:32, Andrii Nakryiko wrote:
> On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>>
>> On 07/12/2023 13:15, Philo Lu wrote:
>>> Hi all. I have a question about using perfbuf/ringbuf in bpf,
>>> and I would appreciate any advice.
>>>
>>> Imagine a simple case: the bpf program outputs a log (some tcp
>>> statistics) to userspace every time a packet is received, and
>>> the user actively reads the logs when they want. I do not want
>>> to keep a user process alive waiting for outputs from the
>>> buffer; the user should be able to read it as needed. BTW, the
>>> order does not matter.
>>>
>>> In short, I would like the buffer to behave like relayfs: (1)
>>> no user process needs to stay around to receive logs, and the
>>> user may read at any time (no wakeup would be even better);
>>> (2) old data can be overwritten by new data.
>>>
>>> Currently, it seems that neither perfbuf nor ringbuf satisfies
>>> both: (i) ringbuf satisfies only (1): if data arrive while the
>>> buffer is full, the new data are lost until the buffer is
>>> consumed. (ii) perfbuf satisfies only (2): the user cannot
>>> access the buffer after the process that created it (including
>>> perf_event->rb via mmap) exits. Specifically, I can use the
>>> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
>>> not know how to get at the buffer again from a new process.
>>>
>>> In my opinion, this could be solved by either of the
>>> following: (a) add overwrite support to ringbuf (maybe a new
>>> flag for reserve), but we would have to address
>>> synchronization between kernel and user, especially with
>>> variable-sized records, because when overwriting occurs the
>>> kernel has to update the consumer position too; (b) implement
>>> map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
>>> via the map_lookup_elem syscall, plus a mechanism to preserve
>>> perf_event->rb when the process exits (otherwise the buffer
>>> is freed by perf_mmap_close). I am not sure whether these are
>>> feasible, or which is better. If neither is, perhaps we can
>>> develop another mechanism to achieve this?
>>>
>>
>> There was an RFC a while back focused on supporting BPF ringbuf
>> over-writing [1]; at the time, Andrii noted some potential issues that
>> might be exposed by doing multiple ringbuf reserves to overfill the
>> buffer within the same program.
>>
> 
> Correct. I don't think it's possible to correctly and safely support
> overwriting with BPF ringbuf that has variable-sized elements.
> 
> We'll need to implement an MPMC ringbuf (probably with
> fixed-size elements) to be able to support this.
> 

Thank you very much!

If it is indeed difficult with ringbuf, maybe I can implement
a new type of bpf map based on the relay interface [1]? e.g.,
initialize relay during map creation, write into it with a
bpf helper, and then let the user access it through the
filesystem. I think it would be a simple but useful map for
overwritable data transfer.

[1]
https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
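
To sketch what I mean (based on the relay API described in [1]; the
map integration itself is hypothetical), the kernel-side plumbing
would be roughly:

	#include <linux/relay.h>
	#include <linux/debugfs.h>
	#include <linux/errno.h>

	/* Returning 1 unconditionally switches to the next sub-buffer
	 * even if nothing was consumed, i.e. flight-recorder
	 * (overwrite) mode, per Documentation/filesystems/relay.rst. */
	static int subbuf_start_cb(struct rchan_buf *buf, void *subbuf,
				   void *prev_subbuf, size_t prev_padding)
	{
		return 1;
	}

	static struct dentry *create_buf_file_cb(const char *filename,
						 struct dentry *parent,
						 umode_t mode,
						 struct rchan_buf *buf,
						 int *is_global)
	{
		return debugfs_create_file(filename, mode, parent, buf,
					   &relay_file_operations);
	}

	static int remove_buf_file_cb(struct dentry *dentry)
	{
		debugfs_remove(dentry);
		return 0;
	}

	static const struct rchan_callbacks relay_map_cbs = {
		.subbuf_start    = subbuf_start_cb,
		.create_buf_file = create_buf_file_cb,
		.remove_buf_file = remove_buf_file_cb,
	};

	static struct rchan *chan;

	/* map creation path: e.g. 8 sub-buffers of 4 KiB per cpu */
	static int relay_map_alloc(void)
	{
		chan = relay_open("bpf_relay", NULL, 4096, 8,
				  &relay_map_cbs, NULL);
		return chan ? 0 : -ENOMEM;
	}

	/* bpf helper's write path: copy one record into the buffer */
	static void relay_map_write(const void *data, size_t len)
	{
		relay_write(chan, data, len);
	}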

>> Alan
>>
>> [1]
>> https://lore.kernel.org/lkml/20220906195656.33021-2-flaniel@linux.microsoft.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-11 12:39     ` Philo Lu
@ 2023-12-13 23:35       ` Andrii Nakryiko
  2023-12-15 10:10         ` Philo Lu
  2023-12-19  6:23         ` Shung-Hsi Yu
  0 siblings, 2 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2023-12-13 23:35 UTC (permalink / raw)
  To: Philo Lu
  Cc: bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li,
	guwen, alibuda, hengqi, Nathan Slingerland, rihams, Alan Maguire

On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>
>
>
> On 2023/12/9 06:32, Andrii Nakryiko wrote:
> > On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
> >>
> >> On 07/12/2023 13:15, Philo Lu wrote:
> >>> Hi all. I have a question about using perfbuf/ringbuf in bpf,
> >>> and I would appreciate any advice.
> >>>
> >>> Imagine a simple case: the bpf program outputs a log (some tcp
> >>> statistics) to userspace every time a packet is received, and
> >>> the user actively reads the logs when they want. I do not want
> >>> to keep a user process alive waiting for outputs from the
> >>> buffer; the user should be able to read it as needed. BTW, the
> >>> order does not matter.
> >>>
> >>> In short, I would like the buffer to behave like relayfs: (1)
> >>> no user process needs to stay around to receive logs, and the
> >>> user may read at any time (no wakeup would be even better);
> >>> (2) old data can be overwritten by new data.
> >>>
> >>> Currently, it seems that neither perfbuf nor ringbuf satisfies
> >>> both: (i) ringbuf satisfies only (1): if data arrive while the
> >>> buffer is full, the new data are lost until the buffer is
> >>> consumed. (ii) perfbuf satisfies only (2): the user cannot
> >>> access the buffer after the process that created it (including
> >>> perf_event->rb via mmap) exits. Specifically, I can use the
> >>> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
> >>> not know how to get at the buffer again from a new process.
> >>>
> >>> In my opinion, this could be solved by either of the
> >>> following: (a) add overwrite support to ringbuf (maybe a new
> >>> flag for reserve), but we would have to address
> >>> synchronization between kernel and user, especially with
> >>> variable-sized records, because when overwriting occurs the
> >>> kernel has to update the consumer position too; (b) implement
> >>> map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
> >>> via the map_lookup_elem syscall, plus a mechanism to preserve
> >>> perf_event->rb when the process exits (otherwise the buffer
> >>> is freed by perf_mmap_close). I am not sure whether these are
> >>> feasible, or which is better. If neither is, perhaps we can
> >>> develop another mechanism to achieve this?
> >>>
> >>
> >> There was an RFC a while back focused on supporting BPF ringbuf
> >> over-writing [1]; at the time, Andrii noted some potential issues that
> >> might be exposed by doing multiple ringbuf reserves to overfill the
> >> buffer within the same program.
> >>
> >
> > Correct. I don't think it's possible to correctly and safely support
> > overwriting with BPF ringbuf that has variable-sized elements.
> >
> > We'll need to implement an MPMC ringbuf (probably with
> > fixed-size elements) to be able to support this.
> >
>
> Thank you very much!
>
> If it is indeed difficult with ringbuf, maybe I can implement
> a new type of bpf map based on the relay interface [1]? e.g.,
> initialize relay during map creation, write into it with a
> bpf helper, and then let the user access it through the
> filesystem. I think it would be a simple but useful map for
> overwritable data transfer.

I don't know much about relay, tbh. Give it a try, I guess.
Alternatively, we need a better and faster implementation of
BPF_MAP_TYPE_QUEUE, which seems like the data structure that
can support overwriting and generally be a fixed-element-size
alternative/complement to BPF ringbuf.
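
For what it's worth, today's queue map already has one piece of
this: bpf_map_push_elem() with the BPF_EXIST flag drops the oldest
element when the queue is full, though the map's internal spinlock
still serializes all producers and consumers. A minimal sketch
(value type and attach point are illustrative):

	struct {
		__uint(type, BPF_MAP_TYPE_QUEUE);
		__uint(max_entries, 4096);
		__uint(value_size, sizeof(__u64));
	} logq SEC(".maps");

	SEC("kprobe/tcp_rcv_established")
	int push_log(void *ctx)
	{
		__u64 v = bpf_ktime_get_ns();

		/* BPF_EXIST: if the queue is full, pop the oldest
		 * element to make room, giving overwrite semantics */
		bpf_map_push_elem(&logq, &v, BPF_EXIST);
		return 0;
	}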

>
> [1]
> https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
>
> >> Alan
> >>
> >> [1]
> >> https://lore.kernel.org/lkml/20220906195656.33021-2-flaniel@linux.microsoft.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-13 23:35       ` Andrii Nakryiko
@ 2023-12-15 10:10         ` Philo Lu
  2023-12-15 22:39           ` Andrii Nakryiko
  2023-12-19  6:23         ` Shung-Hsi Yu
  1 sibling, 1 reply; 17+ messages in thread
From: Philo Lu @ 2023-12-15 10:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li,
	guwen, alibuda, hengqi, Nathan Slingerland, rihams, Alan Maguire



On 2023/12/14 07:35, Andrii Nakryiko wrote:
> On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2023/12/9 06:32, Andrii Nakryiko wrote:
>>> On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>>>>
>>>> On 07/12/2023 13:15, Philo Lu wrote:
>>>>> Hi all. I have a question about using perfbuf/ringbuf in bpf,
>>>>> and I would appreciate any advice.
>>>>>
>>>>> Imagine a simple case: the bpf program outputs a log (some tcp
>>>>> statistics) to userspace every time a packet is received, and
>>>>> the user actively reads the logs when they want. I do not want
>>>>> to keep a user process alive waiting for outputs from the
>>>>> buffer; the user should be able to read it as needed. BTW, the
>>>>> order does not matter.
>>>>>
>>>>> In short, I would like the buffer to behave like relayfs: (1)
>>>>> no user process needs to stay around to receive logs, and the
>>>>> user may read at any time (no wakeup would be even better);
>>>>> (2) old data can be overwritten by new data.
>>>>>
>>>>> Currently, it seems that neither perfbuf nor ringbuf satisfies
>>>>> both: (i) ringbuf satisfies only (1): if data arrive while the
>>>>> buffer is full, the new data are lost until the buffer is
>>>>> consumed. (ii) perfbuf satisfies only (2): the user cannot
>>>>> access the buffer after the process that created it (including
>>>>> perf_event->rb via mmap) exits. Specifically, I can use the
>>>>> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
>>>>> not know how to get at the buffer again from a new process.
>>>>>
>>>>> In my opinion, this could be solved by either of the
>>>>> following: (a) add overwrite support to ringbuf (maybe a new
>>>>> flag for reserve), but we would have to address
>>>>> synchronization between kernel and user, especially with
>>>>> variable-sized records, because when overwriting occurs the
>>>>> kernel has to update the consumer position too; (b) implement
>>>>> map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
>>>>> via the map_lookup_elem syscall, plus a mechanism to preserve
>>>>> perf_event->rb when the process exits (otherwise the buffer
>>>>> is freed by perf_mmap_close). I am not sure whether these are
>>>>> feasible, or which is better. If neither is, perhaps we can
>>>>> develop another mechanism to achieve this?
>>>>>
>>>>
>>>> There was an RFC a while back focused on supporting BPF ringbuf
>>>> over-writing [1]; at the time, Andrii noted some potential issues that
>>>> might be exposed by doing multiple ringbuf reserves to overfill the
>>>> buffer within the same program.
>>>>
>>>
>>> Correct. I don't think it's possible to correctly and safely support
>>> overwriting with BPF ringbuf that has variable-sized elements.
>>>
>>> We'll need to implement an MPMC ringbuf (probably with
>>> fixed-size elements) to be able to support this.
>>>
>>
>> Thank you very much!
>>
>> If it is indeed difficult with ringbuf, maybe I can implement
>> a new type of bpf map based on the relay interface [1]? e.g.,
>> initialize relay during map creation, write into it with a
>> bpf helper, and then let the user access it through the
>> filesystem. I think it would be a simple but useful map for
>> overwritable data transfer.
> 
> I don't know much about relay, tbh. Give it a try, I guess.
> Alternatively, we need a better and faster implementation of
> BPF_MAP_TYPE_QUEUE, which seems like the data structure that
> can support overwriting and generally be a fixed-element-size
> alternative/complement to BPF ringbuf.
> 

Thank you for your reply. I am afraid BPF_MAP_TYPE_QUEUE cannot get rid
of locking overheads with concurrent reading and writing by design, and
a lockless buffer like relay fits our case better. So I will try it :)

>>
>> [1]
>> https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
>>
>>>> Alan
>>>>
>>>> [1]
>>>> https://lore.kernel.org/lkml/20220906195656.33021-2-flaniel@linux.microsoft.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-15 10:10         ` Philo Lu
@ 2023-12-15 22:39           ` Andrii Nakryiko
  2023-12-16  8:50             ` Dmitry Vyukov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2023-12-15 22:39 UTC (permalink / raw)
  To: Philo Lu
  Cc: bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li,
	guwen, alibuda, hengqi, Nathan Slingerland, rihams, Alan Maguire,
	Dmitry Vyukov

On Fri, Dec 15, 2023 at 2:10 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>
>
>
> On 2023/12/14 07:35, Andrii Nakryiko wrote:
> > On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2023/12/9 06:32, Andrii Nakryiko wrote:
> >>> On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
> >>>>
> >>>> On 07/12/2023 13:15, Philo Lu wrote:
> >>>>> Hi all. I have a question about using perfbuf/ringbuf in bpf,
> >>>>> and I would appreciate any advice.
> >>>>>
> >>>>> Imagine a simple case: the bpf program outputs a log (some tcp
> >>>>> statistics) to userspace every time a packet is received, and
> >>>>> the user actively reads the logs when they want. I do not want
> >>>>> to keep a user process alive waiting for outputs from the
> >>>>> buffer; the user should be able to read it as needed. BTW, the
> >>>>> order does not matter.
> >>>>>
> >>>>> In short, I would like the buffer to behave like relayfs: (1)
> >>>>> no user process needs to stay around to receive logs, and the
> >>>>> user may read at any time (no wakeup would be even better);
> >>>>> (2) old data can be overwritten by new data.
> >>>>>
> >>>>> Currently, it seems that neither perfbuf nor ringbuf satisfies
> >>>>> both: (i) ringbuf satisfies only (1): if data arrive while the
> >>>>> buffer is full, the new data are lost until the buffer is
> >>>>> consumed. (ii) perfbuf satisfies only (2): the user cannot
> >>>>> access the buffer after the process that created it (including
> >>>>> perf_event->rb via mmap) exits. Specifically, I can use the
> >>>>> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
> >>>>> not know how to get at the buffer again from a new process.
> >>>>>
> >>>>> In my opinion, this could be solved by either of the
> >>>>> following: (a) add overwrite support to ringbuf (maybe a new
> >>>>> flag for reserve), but we would have to address
> >>>>> synchronization between kernel and user, especially with
> >>>>> variable-sized records, because when overwriting occurs the
> >>>>> kernel has to update the consumer position too; (b) implement
> >>>>> map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
> >>>>> via the map_lookup_elem syscall, plus a mechanism to preserve
> >>>>> perf_event->rb when the process exits (otherwise the buffer
> >>>>> is freed by perf_mmap_close). I am not sure whether these are
> >>>>> feasible, or which is better. If neither is, perhaps we can
> >>>>> develop another mechanism to achieve this?
> >>>>>
> >>>>
> >>>> There was an RFC a while back focused on supporting BPF ringbuf
> >>>> over-writing [1]; at the time, Andrii noted some potential issues that
> >>>> might be exposed by doing multiple ringbuf reserves to overfill the
> >>>> buffer within the same program.
> >>>>
> >>>
> >>> Correct. I don't think it's possible to correctly and safely support
> >>> overwriting with BPF ringbuf that has variable-sized elements.
> >>>
> >>> We'll need to implement an MPMC ringbuf (probably with
> >>> fixed-size elements) to be able to support this.
> >>>
> >>
> >> Thank you very much!
> >>
> >> If it is indeed difficult with ringbuf, maybe I can implement
> >> a new type of bpf map based on the relay interface [1]? e.g.,
> >> initialize relay during map creation, write into it with a
> >> bpf helper, and then let the user access it through the
> >> filesystem. I think it would be a simple but useful map for
> >> overwritable data transfer.
> >
> > I don't know much about relay, tbh. Give it a try, I guess.
> > Alternatively, we need a better and faster implementation of
> > BPF_MAP_TYPE_QUEUE, which seems like the data structure that
> > can support overwriting and generally be a fixed-element-size
> > alternative/complement to BPF ringbuf.
> >
>
> Thank you for your reply. I am afraid BPF_MAP_TYPE_QUEUE cannot get rid
> of locking overheads with concurrent reading and writing by design, and

I disagree, I think [0] from Dmitry Vyukov is one way to implement
lock-free BPF_MAP_TYPE_QUEUE. I don't know how easy it would be to
implement overwriting support, but it would be worth considering.

  [0] https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
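
For reference, the core of that algorithm transcribed to C11 atomics
(a sketch of the published design, not existing kernel code): each
cell carries a sequence number that tells a producer or consumer
whether the cell is currently theirs to use.

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdint.h>
	#include <stddef.h>

	#define QCAP 1024                /* must be a power of two */

	struct cell { _Atomic size_t seq; uint64_t data; };

	struct mpmc {
		struct cell buf[QCAP];
		_Atomic size_t enq, deq;
	};

	static void mpmc_init(struct mpmc *q)
	{
		for (size_t i = 0; i < QCAP; i++)
			atomic_store_explicit(&q->buf[i].seq, i,
					      memory_order_relaxed);
		atomic_store_explicit(&q->enq, 0, memory_order_relaxed);
		atomic_store_explicit(&q->deq, 0, memory_order_relaxed);
	}

	static bool mpmc_push(struct mpmc *q, uint64_t v)
	{
		size_t pos = atomic_load_explicit(&q->enq, memory_order_relaxed);
		for (;;) {
			struct cell *c = &q->buf[pos & (QCAP - 1)];
			size_t seq = atomic_load_explicit(&c->seq,
							  memory_order_acquire);
			intptr_t dif = (intptr_t)seq - (intptr_t)pos;
			if (dif == 0 &&
			    atomic_compare_exchange_weak_explicit(&q->enq,
					&pos, pos + 1,
					memory_order_relaxed, memory_order_relaxed)) {
				c->data = v;            /* the cell is ours */
				atomic_store_explicit(&c->seq, pos + 1,
						      memory_order_release);
				return true;
			}
			if (dif < 0)
				return false;   /* full: where overwriting
						 * support would hook in */
			if (dif > 0)
				pos = atomic_load_explicit(&q->enq,
							   memory_order_relaxed);
			/* dif == 0 but CAS failed: pos was refreshed by CAS */
		}
	}

	static bool mpmc_pop(struct mpmc *q, uint64_t *v)
	{
		size_t pos = atomic_load_explicit(&q->deq, memory_order_relaxed);
		for (;;) {
			struct cell *c = &q->buf[pos & (QCAP - 1)];
			size_t seq = atomic_load_explicit(&c->seq,
							  memory_order_acquire);
			intptr_t dif = (intptr_t)seq - (intptr_t)(pos + 1);
			if (dif == 0 &&
			    atomic_compare_exchange_weak_explicit(&q->deq,
					&pos, pos + 1,
					memory_order_relaxed, memory_order_relaxed)) {
				*v = c->data;
				/* free the cell for the producer one lap ahead */
				atomic_store_explicit(&c->seq, pos + QCAP,
						      memory_order_release);
				return true;
			}
			if (dif < 0)
				return false;   /* empty */
			if (dif > 0)
				pos = atomic_load_explicit(&q->deq,
							   memory_order_relaxed);
		}
	}

The "dif < 0" branch in mpmc_push() is the "queue full" case where
overwrite support would have to be added.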



> a lockless buffer like relay fits our case better. So I will try it :)
>
> >>
> >> [1]
> >> https://github.com/torvalds/linux/blob/master/Documentation/filesystems/relay.rst
> >>
> >>>> Alan
> >>>>
> >>>> [1]
> >>>> https://lore.kernel.org/lkml/20220906195656.33021-2-flaniel@linux.microsoft.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-15 22:39           ` Andrii Nakryiko
@ 2023-12-16  8:50             ` Dmitry Vyukov
  2023-12-18 12:58               ` Philo Lu
  2023-12-19 19:25               ` Andrii Nakryiko
  0 siblings, 2 replies; 17+ messages in thread
From: Dmitry Vyukov @ 2023-12-16  8:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Philo Lu, bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo,
	dust.li, guwen, alibuda, hengqi, Nathan Slingerland, rihams,
	Alan Maguire

On Fri, 15 Dec 2023 at 23:39, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > On 2023/12/14 07:35, Andrii Nakryiko wrote:
> > > On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2023/12/9 06:32, Andrii Nakryiko wrote:
> > >>> On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
> > >>>>
> > >>>> On 07/12/2023 13:15, Philo Lu wrote:
> > >>>>> Hi all. I have a question about using perfbuf/ringbuf in bpf,
> > >>>>> and I would appreciate any advice.
> > >>>>>
> > >>>>> Imagine a simple case: the bpf program outputs a log (some tcp
> > >>>>> statistics) to userspace every time a packet is received, and
> > >>>>> the user actively reads the logs when they want. I do not want
> > >>>>> to keep a user process alive waiting for outputs from the
> > >>>>> buffer; the user should be able to read it as needed. BTW, the
> > >>>>> order does not matter.
> > >>>>>
> > >>>>> In short, I would like the buffer to behave like relayfs: (1)
> > >>>>> no user process needs to stay around to receive logs, and the
> > >>>>> user may read at any time (no wakeup would be even better);
> > >>>>> (2) old data can be overwritten by new data.
> > >>>>>
> > >>>>> Currently, it seems that neither perfbuf nor ringbuf satisfies
> > >>>>> both: (i) ringbuf satisfies only (1): if data arrive while the
> > >>>>> buffer is full, the new data are lost until the buffer is
> > >>>>> consumed. (ii) perfbuf satisfies only (2): the user cannot
> > >>>>> access the buffer after the process that created it (including
> > >>>>> perf_event->rb via mmap) exits. Specifically, I can use the
> > >>>>> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
> > >>>>> not know how to get at the buffer again from a new process.
> > >>>>>
> > >>>>> In my opinion, this could be solved by either of the
> > >>>>> following: (a) add overwrite support to ringbuf (maybe a new
> > >>>>> flag for reserve), but we would have to address
> > >>>>> synchronization between kernel and user, especially with
> > >>>>> variable-sized records, because when overwriting occurs the
> > >>>>> kernel has to update the consumer position too; (b) implement
> > >>>>> map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
> > >>>>> via the map_lookup_elem syscall, plus a mechanism to preserve
> > >>>>> perf_event->rb when the process exits (otherwise the buffer
> > >>>>> is freed by perf_mmap_close). I am not sure whether these are
> > >>>>> feasible, or which is better. If neither is, perhaps we can
> > >>>>> develop another mechanism to achieve this?
> > >>>>>
> > >>>>
> > >>>> There was an RFC a while back focused on supporting BPF ringbuf
> > >>>> over-writing [1]; at the time, Andrii noted some potential issues that
> > >>>> might be exposed by doing multiple ringbuf reserves to overfill the
> > >>>> buffer within the same program.
> > >>>>
> > >>>
> > >>> Correct. I don't think it's possible to correctly and safely support
> > >>> overwriting with BPF ringbuf that has variable-sized elements.
> > >>>
> > >>> We'll need to implement an MPMC ringbuf (probably with
> > >>> fixed-size elements) to be able to support this.
> > >>>
> > >>
> > >> Thank you very much!
> > >>
> > >> If it is indeed difficult with ringbuf, maybe I can implement
> > >> a new type of bpf map based on the relay interface [1]? e.g.,
> > >> initialize relay during map creation, write into it with a
> > >> bpf helper, and then let the user access it through the
> > >> filesystem. I think it would be a simple but useful map for
> > >> overwritable data transfer.
> > >
> > > I don't know much about relay, tbh. Give it a try, I guess.
> > > Alternatively, we need a better and faster implementation of
> > > BPF_MAP_TYPE_QUEUE, which seems like the data structure that
> > > can support overwriting and generally be a fixed-element-size
> > > alternative/complement to BPF ringbuf.
> > >
> >
> > Thank you for your reply. I am afraid BPF_MAP_TYPE_QUEUE cannot get rid
> > of locking overheads with concurrent reading and writing by design, and
>
> I disagree, I think [0] from Dmitry Vyukov is one way to implement
> lock-free BPF_MAP_TYPE_QUEUE. I don't know how easy it would be to
> implement overwriting support, but it would be worth considering.
>
>   [0] https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue


I am missing some context here. But note that this queue is not
formally lock-free. While it's usually faster and more scalable than
mutex-protected queues, stuck readers and writers will eventually
block each other. Being stuck for a short time is not a problem
because the queue allows parallelism for both readers and writers.
But if threads get stuck for a long time and the queue wraps around
so that writers try to write to elements being read/written by slow
threads, they block. Similarly, readers get blocked by slow writers
even if there are other fully written elements in the queue already.
The queue is not serializable either, which may be surprising in
some cases.

Adding overwriting support may be an interesting exercise.
I guess readers could use some variation of a seqlock to deal with
elements that are being overwritten.
Writers can already skip over other slow writers. Normally this is
used w/o wrap-around, but I suspect it can just work with wrap-around
as well (a writer can skip over a writer stuck on the previous lap).
Since we overwrite elements, the queue provides only a very weak
notion of FIFO anyway, so skipping over very old writers may be fine.
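
To make the seqlock idea concrete, a reader of an overwritable
element might do something like this (a generic sketch of the
pattern, not an existing interface), where an odd sequence number
marks a write in progress:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <string.h>

	struct owcell {
		_Atomic unsigned int seq;   /* odd = write in progress */
		char payload[64];
	};

	/* Returns false if the element was (re)written under us; the
	 * caller can retry or just skip it, accepting the lost record. */
	static bool read_owcell(struct owcell *c, char out[64])
	{
		unsigned int s1, s2;

		s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
		if (s1 & 1)
			return false;
		memcpy(out, c->payload, sizeof(c->payload));
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&c->seq, memory_order_relaxed);
		return s1 == s2;
	}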

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-16  8:50             ` Dmitry Vyukov
@ 2023-12-18 12:58               ` Philo Lu
  2023-12-19 19:25               ` Andrii Nakryiko
  1 sibling, 0 replies; 17+ messages in thread
From: Philo Lu @ 2023-12-18 12:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li,
	guwen, alibuda, hengqi, Nathan Slingerland, rihams, Alan Maguire,
	Dmitry Vyukov



On 2023/12/16 16:50, Dmitry Vyukov wrote:
> On Fri, 15 Dec 2023 at 23:39, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>> On 2023/12/14 07:35, Andrii Nakryiko wrote:
>>>> On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2023/12/9 06:32, Andrii Nakryiko wrote:
>>>>>> On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>>>>>>>
>>>>>>> On 07/12/2023 13:15, Philo Lu wrote:
>>>>>>>> Hi all. I have a question about using perfbuf/ringbuf in bpf,
>>>>>>>> and I would appreciate any advice.
>>>>>>>>
>>>>>>>> Imagine a simple case: the bpf program outputs a log (some tcp
>>>>>>>> statistics) to userspace every time a packet is received, and
>>>>>>>> the user actively reads the logs when they want. I do not want
>>>>>>>> to keep a user process alive waiting for outputs from the
>>>>>>>> buffer; the user should be able to read it as needed. BTW, the
>>>>>>>> order does not matter.
>>>>>>>>
>>>>>>>> In short, I would like the buffer to behave like relayfs: (1)
>>>>>>>> no user process needs to stay around to receive logs, and the
>>>>>>>> user may read at any time (no wakeup would be even better);
>>>>>>>> (2) old data can be overwritten by new data.
>>>>>>>>
>>>>>>>> Currently, it seems that neither perfbuf nor ringbuf satisfies
>>>>>>>> both: (i) ringbuf satisfies only (1): if data arrive while the
>>>>>>>> buffer is full, the new data are lost until the buffer is
>>>>>>>> consumed. (ii) perfbuf satisfies only (2): the user cannot
>>>>>>>> access the buffer after the process that created it (including
>>>>>>>> perf_event->rb via mmap) exits. Specifically, I can use the
>>>>>>>> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
>>>>>>>> not know how to get at the buffer again from a new process.
>>>>>>>>
>>>>>>>> In my opinion, this could be solved by either of the
>>>>>>>> following: (a) add overwrite support to ringbuf (maybe a new
>>>>>>>> flag for reserve), but we would have to address
>>>>>>>> synchronization between kernel and user, especially with
>>>>>>>> variable-sized records, because when overwriting occurs the
>>>>>>>> kernel has to update the consumer position too; (b) implement
>>>>>>>> map_fd_sys_lookup_elem for perfbuf to expose fds to userspace
>>>>>>>> via the map_lookup_elem syscall, plus a mechanism to preserve
>>>>>>>> perf_event->rb when the process exits (otherwise the buffer
>>>>>>>> is freed by perf_mmap_close). I am not sure whether these are
>>>>>>>> feasible, or which is better. If neither is, perhaps we can
>>>>>>>> develop another mechanism to achieve this?
>>>>>>>>
>>>>>>>
>>>>>>> There was an RFC a while back focused on supporting BPF ringbuf
>>>>>>> over-writing [1]; at the time, Andrii noted some potential issues that
>>>>>>> might be exposed by doing multiple ringbuf reserves to overfill the
>>>>>>> buffer within the same program.
>>>>>>>
>>>>>>
>>>>>> Correct. I don't think it's possible to correctly and safely support
>>>>>> overwriting with BPF ringbuf that has variable-sized elements.
>>>>>>
>>>>>> We'll need to implement an MPMC ringbuf (probably with
>>>>>> fixed-size elements) to be able to support this.
>>>>>>
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> If it is indeed difficult with ringbuf, maybe I can implement
>>>>> a new type of bpf map based on the relay interface [1]? e.g.,
>>>>> initialize relay during map creation, write into it with a
>>>>> bpf helper, and then let the user access it through the
>>>>> filesystem. I think it would be a simple but useful map for
>>>>> overwritable data transfer.
>>>>
>>>> I don't know much about relay, tbh. Give it a try, I guess.
>>>> Alternatively, we need a better and faster implementation of
>>>> BPF_MAP_TYPE_QUEUE, which seems like the data structure that
>>>> can support overwriting and generally be a fixed-element-size
>>>> alternative/complement to BPF ringbuf.
>>>>
>>>
>>> Thank you for your reply. I am afraid BPF_MAP_TYPE_QUEUE cannot get rid
>>> of locking overheads with concurrent reading and writing by design, and
>>
>> I disagree, I think [0] from Dmitry Vyukov is one way to implement
>> lock-free BPF_MAP_TYPE_QUEUE. I don't know how easy it would be to
>> implement overwriting support, but it would be worth considering.
>>
>>    [0] https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
> 
> 
> I am missing some context here. But note that this queue is not
> formally lock-free. While it's usually faster and more scalable than
> mutex-protected queues, stuck readers and writers will eventually
> block each other. Being stuck for a short time is not a problem because
> the queue allows parallelism for both readers and writers. But if
> threads get stuck for a long time and the queue wraps around so that
> writers try to write to elements being read/written by slow threads,
> they block. Similarly, readers get blocked by slow writers even if
> there are other fully written elements in the queue already.
> The queue is not serializable either, which may be surprising in some cases.
> 
> Adding overwriting support may be an interesting exercise.
> I guess readers could use some variation of a seqlock to deal with
> elements that are being overwritten.
> Writers can already skip over other slow writers. Normally this is
> used w/o wrap-around, but I suspect it can just work with wrap-around
> as well (a writer can skip over a writer stuck on the previous lap).
> Since we overwrite elements, the queue provides only a very weak
> notion of FIFO anyway, so skipping over very old writers may be fine.

Thanks for these hints. An MPMC queue with a seqlock could be an
effective way to improve BPF_MAP_TYPE_QUEUE. But I don't think it
will work well in our case.

In my opinion, under very frequent writing, it will be hard for the
reader to get all elements in one shot (e.g., bpf_map_lookup_batch),
because we use a seqlock and the whole buffer could be large. What's
worse, with overwriting, many elements will be dropped silently
before readers get access to them.

Basically, I think BPF_MAP_TYPE_QUEUE assumes reliable results by
design, and so does ringbuf. But in our case, we would rather catch
logs in time, even at the cost of a few wrong records, and that is
how relay behaves.

Anyway, the MPMC queue optimization for BPF_MAP_TYPE_QUEUE is an
interesting topic. I'd like to try it besides relay if possible.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-13 23:35       ` Andrii Nakryiko
  2023-12-15 10:10         ` Philo Lu
@ 2023-12-19  6:23         ` Shung-Hsi Yu
  2023-12-19 13:38           ` Steven Rostedt
  1 sibling, 1 reply; 17+ messages in thread
From: Shung-Hsi Yu @ 2023-12-19  6:23 UTC (permalink / raw)
  To: Andrii Nakryiko, Philo Lu
  Cc: bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo, dust.li,
	guwen, alibuda, hengqi, Nathan Slingerland, rihams, Alan Maguire,
	Masami Hiramatsu, Steven Rostedt

On Wed, Dec 13, 2023 at 03:35:19PM -0800, Andrii Nakryiko wrote:
> On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
> [...]
> > >>> Imagine a simple case: the bpf program outputs a log (some tcp
> > >>> statistics) to userspace every time a packet is received, and
> > >>> the user actively reads the logs when they want. I do not want
> > >>> to keep a user process alive waiting for outputs from the
> > >>> buffer; the user should be able to read it as needed. BTW, the
> > >>> order does not matter.

Not sure if it's the same use case, but I'd imagine this would be
quite useful for debugging a hard-to-reproduce issue where little is
known (thus minimal filtering is applied and the volume of events is
large). You just want to gather as much detail as possible about the
events that happen just before the issue occurs, and you don't care
about events that happened much earlier.

> > >>> In short, I would like the buffer to behave like relayfs: (1)
> > >>> no user process needs to stay around to receive logs, and the
> > >>> user may read at any time (no wakeup would be even better);
> > >>> (2) old data can be overwritten by new data.
> > >>>
> > >>> Currently, it seems that neither perfbuf nor ringbuf satisfies
> > >>> both: (i) ringbuf satisfies only (1): if data arrive while the
> > >>> buffer is full, the new data are lost until the buffer is
> > >>> consumed. (ii) perfbuf satisfies only (2): the user cannot
> > >>> access the buffer after the process that created it (including
> > >>> perf_event->rb via mmap) exits. Specifically, I can use the
> > >>> BPF_F_PRESERVE_ELEMS flag to keep the perf_events, but I do
> > >>> not know how to get at the buffer again from a new process.
> > 
> > [...]
> > 
> > If it is indeed difficult with ringbuf, maybe I can implement a new type
> > of bpf map based on relay interface [1]? e.g., init relay during map
> > creating, write into it with bpf helper, and then user can access to it
> > in filesystem. I think it will be a simple but useful map for
> > overwritable data transfer.
> 
> I don't know much about relay, tbh. Give it a try, I guess.
> Alternatively, we need better and faster implementation of
> BPF_MAP_TYPE_QUEUE, which seems like the data structure that can
> support overwriting and generally be a fixed elementa size
> alternative/complement to BPF ringbuf.

Curious whether it is possible to reuse ftrace's trace buffer instead
(or its underlying ring buffer implementation at
kernel/trace/ring_buffer.c). AFAICT it satisfies both requirements
that Philo stated: (1) no need for a user process, as the buffer is
accessible through tracefs, and (2) it has an overwrite mode.

Furthermore, a natural feature request that would come after
overwriting support would be snapshotting, and that has already been
covered in ftrace.

Note: technically a BPF program can already write to ftrace's trace
buffer with the bpf_trace_vprintk() helper, but that goes through
string formatting and only allows writing to the global buffer.
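
For example, a minimal sketch of that existing path (the attach
point is illustrative); the output lands, already string-formatted,
in the single global buffer under /sys/kernel/tracing/trace:

	SEC("kprobe/tcp_rcv_established")
	int log_via_ftrace(void *ctx)
	{
		static const char fmt[] = "pkt on cpu=%u ts=%llu\n";
		__u64 args[] = { bpf_get_smp_processor_id(),
				 bpf_ktime_get_ns() };

		/* string-formats and writes into the ftrace trace buffer */
		bpf_trace_vprintk(fmt, sizeof(fmt), args, sizeof(args));
		return 0;
	}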

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-19  6:23         ` Shung-Hsi Yu
@ 2023-12-19 13:38           ` Steven Rostedt
  2023-12-19 17:01             ` Alexei Starovoitov
                               ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Steven Rostedt @ 2023-12-19 13:38 UTC (permalink / raw)
  To: Shung-Hsi Yu
  Cc: Andrii Nakryiko, Philo Lu, bpf, song, andrii, ast,
	Daniel Borkmann, xuanzhuo, dust.li, guwen, alibuda, hengqi,
	Nathan Slingerland, rihams, Alan Maguire, Masami Hiramatsu

On Tue, 19 Dec 2023 14:23:59 +0800
Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:

> Curious whether it is possible to reuse ftrace's trace buffer instead
> (or its underlying ring buffer implementation at
> kernel/trace/ring_buffer.c). AFAICT it satisfies both requirements
> that Philo stated: (1) no need for a user process, as the buffer is
> accessible through tracefs, and (2) it has an overwrite mode.

Yes, the ftrace ring-buffer was in fact designed for the above use case.

> 
> Furthermore, a natural feature request that would come after
> overwriting support would be snapshotting, and that has already been
> covered in ftrace.

Yes, it has that too.

> 
> Note: technically a BPF program can already write to ftrace's trace
> buffer with the bpf_trace_vprintk() helper, but that goes through
> string formatting and only allows writing to the global buffer.

When eBPF was first being developed, Alexei told me he tried the ftrace
ring buffer, and he said the filtering was too slow. That's because it
would always write into the ring buffer and then try to discard it after
the fact, which required a few cmpxchg to synchronize. He decided that the
perf ring buffer was a better fit for this.

That was solved with commit 0fc1b09ff1ff4 ("tracing: Use temp buffer
when filtering events"), which makes the filtering similar to perf,
as perf always copies events to a temporary buffer first.

It still falls back to writing directly into the ring buffer if the temp
buffer is currently being used by another event on the same CPU.

Note that the perf ring buffer was designed for profiling (taking
intermediate traces) and is tightly coupled to having a reader,
whereas the ftrace ring buffer was designed for high-speed constant
tracing, with or without a reader.

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-19 13:38           ` Steven Rostedt
@ 2023-12-19 17:01             ` Alexei Starovoitov
  2023-12-19 17:28             ` Steven Rostedt
  2023-12-21 13:00             ` Philo Lu
  2 siblings, 0 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2023-12-19 17:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Shung-Hsi Yu, Andrii Nakryiko, Philo Lu, bpf, Song Liu,
	Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Xuan Zhuo,
	Dust Li, guwen, D. Wythe, hengqi, Nathan Slingerland, rihams,
	Alan Maguire, Masami Hiramatsu

On Tue, Dec 19, 2023 at 5:37 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 19 Dec 2023 14:23:59 +0800
> Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> > Curious whether it is possible to reuse ftrace's trace buffer instead
> > (or its underlying ring buffer implementation at
> > kernel/trace/ring_buffer.c). AFAICT it satisfies both requirements
> > that Philo stated: (1) no need for a user process, as the buffer is
> > accessible through tracefs, and (2) it has an overwrite mode.
>
> Yes, the ftrace ring-buffer was in fact designed for the above use case.
>
> >
> > Furthermore, a natural feature request that would come after
> > overwriting support would be snapshotting, and that has already been
> > covered in ftrace.
>
> Yes, it has that too.
>
> >
> > Note: technically a BPF program can already write to ftrace's trace
> > buffer with the bpf_trace_vprintk() helper, but that goes through
> > string formatting and only allows writing to the global buffer.
>
> When eBPF was first being developed, Alexei told me he tried the ftrace
> ring buffer, and he said the filtering was too slow. That's because it
> would always write into the ring buffer and then try to discard it after
> the fact, which required a few cmpxchg to synchronize. He decided that the
> perf ring buffer was a better fit for this.

Well. A lot of things have changed since then :)
It might be a good idea to teach bpf to interface with ftrace ring buffers.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-19 13:38           ` Steven Rostedt
  2023-12-19 17:01             ` Alexei Starovoitov
@ 2023-12-19 17:28             ` Steven Rostedt
  2023-12-21 13:00             ` Philo Lu
  2 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2023-12-19 17:28 UTC (permalink / raw)
  To: Shung-Hsi Yu
  Cc: Andrii Nakryiko, Philo Lu, bpf, song, andrii, ast,
	Daniel Borkmann, xuanzhuo, dust.li, guwen, alibuda, hengqi,
	Nathan Slingerland, rihams, Alan Maguire, Masami Hiramatsu


BTW, if anyone's interested, there's a "benchmark" trace event when you
enable:

  CONFIG_TRACEPOINT_BENCHMARK=y

You can see the code in: kernel/trace/trace_benchmark.c:

Which does a loop of:

	local_irq_disable();
	start = trace_clock_local();
	trace_benchmark_event(bm_str, bm_last);
	stop = trace_clock_local();
	local_irq_enable();

Where it writes the result of the previous timings into the current trace
event via the bm_str:


	delta = stop - start;

	[..]

	bm_last = delta;

	[..]

	scnprintf(bm_str, BENCHMARK_EVENT_STRLEN,
		  "last=%llu first=%llu max=%llu min=%llu avg=%u std=%d std^2=%lld",
		  bm_last, bm_first, bm_max, bm_min, avg, std, stddev);




I ran: perf record -a -e benchmark:benchmark_event sleep 20

and perf script produces (I scrolled down to get to hot cache):

 event_benchmark    2289 [001]   672.581425: benchmark:benchmark_event: last=247 first=5693 max=8969 min=204 avg=240 std=234 std^2=55157 delta=247
 event_benchmark    2289 [001]   672.581426: benchmark:benchmark_event: last=222 first=5693 max=8969 min=204 avg=240 std=234 std^2=55151 delta=222
 event_benchmark    2289 [001]   672.581427: benchmark:benchmark_event: last=229 first=5693 max=8969 min=204 avg=240 std=234 std^2=55144 delta=229
 event_benchmark    2289 [001]   672.581427: benchmark:benchmark_event: last=221 first=5693 max=8969 min=204 avg=240 std=234 std^2=55138 delta=221
 event_benchmark    2289 [001]   672.581428: benchmark:benchmark_event: last=223 first=5693 max=8969 min=204 avg=240 std=234 std^2=55131 delta=223
 event_benchmark    2289 [001]   672.581428: benchmark:benchmark_event: last=220 first=5693 max=8969 min=204 avg=240 std=234 std^2=55125 delta=220
 event_benchmark    2289 [001]   672.581429: benchmark:benchmark_event: last=215 first=5693 max=8969 min=204 avg=240 std=234 std^2=55118 delta=215
 event_benchmark    2289 [001]   672.581430: benchmark:benchmark_event: last=221 first=5693 max=8969 min=204 avg=240 std=234 std^2=55112 delta=221
 event_benchmark    2289 [001]   672.581430: benchmark:benchmark_event: last=240 first=5693 max=8969 min=204 avg=240 std=234 std^2=55105 delta=240
 event_benchmark    2289 [001]   672.581431: benchmark:benchmark_event: last=225 first=5693 max=8969 min=204 avg=240 std=234 std^2=55099 delta=225
 event_benchmark    2289 [001]   672.581432: benchmark:benchmark_event: last=235 first=5693 max=8969 min=204 avg=240 std=234 std^2=55092 delta=235
 event_benchmark    2289 [001]   672.581432: benchmark:benchmark_event: last=220 first=5693 max=8969 min=204 avg=240 std=234 std^2=55086 delta=220
 event_benchmark    2289 [001]   672.581433: benchmark:benchmark_event: last=245 first=5693 max=8969 min=204 avg=240 std=234 std^2=55079 delta=245
 event_benchmark    2289 [001]   672.581433: benchmark:benchmark_event: last=215 first=5693 max=8969 min=204 avg=240 std=234 std^2=55073 delta=215
 event_benchmark    2289 [001]   672.581434: benchmark:benchmark_event: last=216 first=5693 max=8969 min=204 avg=240 std=234 std^2=55066 delta=216


For ftrace: trace-cmd record -e benchmark_event sleep 20

trace-cmd report:

 event_benchmark-2253  [000]   549.747068: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=78
 event_benchmark-2253  [000]   549.747069: benchmark_event:      last=79 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=79
 event_benchmark-2253  [000]   549.747069: benchmark_event:      last=72 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=72
 event_benchmark-2253  [000]   549.747069: benchmark_event:      last=79 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=79
 event_benchmark-2253  [000]   549.747070: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=78
 event_benchmark-2253  [000]   549.747070: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=78
 event_benchmark-2253  [000]   549.747071: benchmark_event:      last=79 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=79
 event_benchmark-2253  [000]   549.747071: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=78
 event_benchmark-2253  [000]   549.747072: benchmark_event:      last=80 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=80
 event_benchmark-2253  [000]   549.747072: benchmark_event:      last=79 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=79
 event_benchmark-2253  [000]   549.747073: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=933 delta=78
 event_benchmark-2253  [000]   549.747073: benchmark_event:      last=165 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=165
 event_benchmark-2253  [000]   549.747074: benchmark_event:      last=79 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=79
 event_benchmark-2253  [000]   549.747074: benchmark_event:      last=153 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=153
 event_benchmark-2253  [000]   549.747075: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=78
 event_benchmark-2253  [000]   549.747075: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=78
 event_benchmark-2253  [000]   549.747076: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=935 delta=78
 event_benchmark-2253  [000]   549.747076: benchmark_event:      last=73 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=73
 event_benchmark-2253  [000]   549.747077: benchmark_event:      last=79 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=79
 event_benchmark-2253  [000]   549.747077: benchmark_event:      last=78 first=2674 max=1185 min=71 avg=84 std=30 std^2=934 delta=78


For normal tracing, the average perf event takes 240ns per event
(the avg= column above), and the average ftrace event takes 84ns per
event. The "first" in the output above is how long the first event
took (cold cache).

I added filtering to trace-cmd with:

 trace-cmd record -o trace-filter.dat -e benchmark_event -f 'delta & 1' sleep 20

I should modify the event to have a counter so that I can filter every
other event with that, but for now I just print out anything that has an
odd delta.

 event_benchmark-2548  [000]  1558.776493: benchmark_event:       str=last=199 first=1964 max=2215 min=40 avg=78 std=44 std^2=2022 delta=199
 event_benchmark-2548  [000]  1558.776498: benchmark_event:       str=last=43 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=43
 event_benchmark-2548  [000]  1558.776498: benchmark_event:       str=last=191 first=1964 max=2215 min=40 avg=78 std=44 std^2=2022 delta=191
 event_benchmark-2548  [000]  1558.776500: benchmark_event:       str=last=41 first=1964 max=2215 min=40 avg=78 std=44 std^2=2022 delta=41
 event_benchmark-2548  [000]  1558.776500: benchmark_event:       str=last=119 first=1964 max=2215 min=40 avg=78 std=44 std^2=2022 delta=119
 event_benchmark-2548  [000]  1558.776501: benchmark_event:       str=last=41 first=1964 max=2215 min=40 avg=78 std=44 std^2=2022 delta=41
 event_benchmark-2548  [000]  1558.776502: benchmark_event:       str=last=105 first=1964 max=2215 min=40 avg=78 std=44 std^2=2022 delta=105
 event_benchmark-2548  [000]  1558.776503: benchmark_event:       str=last=41 first=1964 max=2215 min=40 avg=78 std=44 std^2=2022 delta=41
 event_benchmark-2548  [000]  1558.776505: benchmark_event:       str=last=41 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=41
 event_benchmark-2548  [000]  1558.776505: benchmark_event:       str=last=111 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=111
 event_benchmark-2548  [000]  1558.776506: benchmark_event:       str=last=109 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=109
 event_benchmark-2548  [000]  1558.776506: benchmark_event:       str=last=109 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=109
 event_benchmark-2548  [000]  1558.776508: benchmark_event:       str=last=41 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=41
 event_benchmark-2548  [000]  1558.776508: benchmark_event:       str=last=109 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=109
 event_benchmark-2548  [000]  1558.776509: benchmark_event:       str=last=117 first=1964 max=2215 min=40 avg=78 std=44 std^2=2021 delta=117
 event_benchmark-2548  [000]  1558.776510: benchmark_event:       str=last=51 first=1964 max=2215 min=40 avg=78 std=44 std^2=2020 delta=51
 event_benchmark-2548  [000]  1558.776510: benchmark_event:       str=last=103 first=1964 max=2215 min=40 avg=78 std=44 std^2=2020 delta=103
 event_benchmark-2548  [000]  1558.776511: benchmark_event:       str=last=109 first=1964 max=2215 min=40 avg=78 std=44 std^2=2020 delta=109
 event_benchmark-2548  [000]  1558.776512: benchmark_event:       str=last=51 first=1964 max=2215 min=40 avg=78 std=44 std^2=2020 delta=51
 event_benchmark-2548  [000]  1558.776512: benchmark_event:       str=last=103 first=1964 max=2215 min=40 avg=78 std=44 std^2=2020 delta=103
 event_benchmark-2548  [000]  1558.776513: benchmark_event:       str=last=95 first=1964 max=2215 min=40 avg=78 std=44 std^2=2020 delta=95
 event_benchmark-2548  [000]  1558.776513: benchmark_event:       str=last=101 first=1964 max=2215 min=40 avg=78 std=44 std^2=2020 delta=101
 event_benchmark-2548  [000]  1558.776514: benchmark_event:       str=last=109 first=1964 max=2215 min=40 avg=78 std=44 std^2=2019 delta=109

It looks like throwing away an event is around 40-50ns, whereas the
copying of the event into a temp buffer before writing it into the
ring buffer increased the time from 84ns to around 100-110ns. Still
half the time it takes for the perf event.

The above trace event benchmark has been part of the Linux kernel since
3.16, so everyone should have it if you want to run your own tests.

-- Steve


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-16  8:50             ` Dmitry Vyukov
  2023-12-18 12:58               ` Philo Lu
@ 2023-12-19 19:25               ` Andrii Nakryiko
  1 sibling, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2023-12-19 19:25 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Philo Lu, bpf, song, andrii, ast, Daniel Borkmann, xuanzhuo,
	dust.li, guwen, alibuda, hengqi, Nathan Slingerland, rihams,
	Alan Maguire

On Sat, Dec 16, 2023 at 12:50 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Fri, 15 Dec 2023 at 23:39, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > > On 2023/12/14 07:35, Andrii Nakryiko wrote:
> > > > On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2023/12/9 06:32, Andrii Nakryiko wrote:
> > > >>> On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
> > > >>>>
> > > >>>> On 07/12/2023 13:15, Philo Lu wrote:
> > > >>>>> Hi all. I have a question when using perfbuf/ringbuf in bpf. I will
> > > >>>>> appreciate it if you give me any advice.
> > > >>>>>
> > > >>>>> Imagine a simple case: the bpf program output a log (some tcp
> > > >>>>> statistics) to user every time a packet is received, and the user
> > > >>>>> actively read the logs if he wants. I do not want to keep a user process
> > > >>>>> alive, waiting for outputs of the buffer. User can read the buffer as
> > > >>>>> need. BTW, the order does not matter.
> > > >>>>>
> > > >>>>> To conclude, I hope the buffer performs like relayfs: (1) no need for
> > > >>>>> user process to receive logs, and the user may read at any time (and no
> > > >>>>> wakeup would be better); (2) old data can be overwritten by new ones.
> > > >>>>>
> > > >>>>> Currently, it seems that perfbuf and ringbuf cannot satisfy both: (i)
> > > >>>>> ringbuf: only satisfies (1). However, if data arrive when the buffer is
> > > >>>>> full, the new data will be lost, until the buffer is consumed. (ii)
> > > >>>>> perfbuf: only satisfies (2). But user cannot access the buffer after the
> > > >>>>> process who creates it (including perf_event.rb via mmap) exits.
> > > >>>>> Specifically, I can use BPF_F_PRESERVE_ELEMS flag to keep the
> > > >>>>> perf_events, but I do not know how to get the buffer again in a new
> > > >>>>> process.
> > > >>>>>
> > > >>>>> In my opinion, this can be solved by either of the following: (a) add
> > > >>>>> overwrite support in ringbuf (maybe a new flag for reserve), but we have
> > > >>>>> to address synchronization between kernel and user, especially under
> > > >>>>> variable data size, because when overwriting occurs, kernel has to
> > > >>>>> update the consumer posi too; (b) implement map_fd_sys_lookup_elem for
> > > >>>>> perfbuf to expose fds to user via map_lookup_elem syscall, and a
> > > >>>>> mechanism is need to preserve perf_event->rb when process exits
> > > >>>>> (otherwise the buffer will be freed by perf_mmap_close). I am not sure
> > > >>>>> if they are feasible, and which is better. If not, perhaps we can
> > > >>>>> develop another mechanism to achieve this?
> > > >>>>>
> > > >>>>
> > > >>>> There was an RFC a while back focused on supporting BPF ringbuf
> > > >>>> over-writing [1]; at the time, Andrii noted some potential issues that
> > > >>>> might be exposed by doing multiple ringbuf reserves to overfill the
> > > >>>> buffer within the same program.
> > > >>>>
> > > >>>
> > > >>> Correct. I don't think it's possible to correctly and safely support
> > > >>> overwriting with BPF ringbuf that has variable-sized elements.
> > > >>>
> > > >>> We'll need to implement an MPMC ringbuf (probably with a fixed
> > > >>> element size) to be able to support this.
> > > >>>
> > > >>
> > > >> Thank you very much!
> > > >>
> > > >> If it is indeed difficult with ringbuf, maybe I can implement a new type
> > > >> of bpf map based on the relay interface [1]? e.g., init relay during map
> > > >> creation, write into it with a bpf helper, and then the user can access
> > > >> it in the filesystem. I think it will be a simple but useful map for
> > > >> overwritable data transfer.
> > > >
> > > > I don't know much about relay, tbh. Give it a try, I guess.
> > > > Alternatively, we need a better and faster implementation of
> > > > BPF_MAP_TYPE_QUEUE, which seems like the data structure that can
> > > > support overwriting and generally be a fixed-element-size
> > > > alternative/complement to BPF ringbuf.
> > > >
> > >
> > > Thank you for your reply. I am afraid BPF_MAP_TYPE_QUEUE cannot get rid
> > > of locking overheads with concurrent reading and writing by design, and
> >
> > I disagree, I think [0] from Dmitry Vyukov is one way to implement
> > lock-free BPF_MAP_TYPE_QUEUE. I don't know how easy it would be to
> > implement overwriting support, but it would be worth considering.
> >
> >   [0] https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>
>
> I am missing some context here. But note that this queue is not
> formally lock-free. While it's usually faster and more scalable than
> mutex-protected queues, stuck readers and writers will eventually
> block each other. Getting stuck for a short time is not a problem,
> because the queue allows parallelism for both readers and writers. But
> if threads get stuck for a long time and the queue wraps around, so
> that writers try to write to elements still being read/written by slow
> threads, then they block. Similarly, readers get blocked by slow
> writers even if there are other fully written elements in the queue
> already. The queue is not serializable either, which may be surprising
> in some cases.

Thanks for additional insights, Dmitry!

In our case producers will be either BPF programs or the bpf() syscall
in the kernel, so the expectation is that they will be fast and
guaranteed to run to completion. (We can decide whether
sleepable/faultable BPF programs should be allowed to work with this
QUEUE or not.) For consuming, the main target is probably user space,
and we'd probably want to be able to do this without a syscall, through
mmaping. If the user is slow, on the producer side we can perhaps just
fail to enqueue a new element (though I'm not sure how easy it is to
tell "slow consumer" apart from "no consumer, we are full").

Anyways, I think it's an interesting algorithm, I stumbled upon it a
while ago and was always curious how it would fit BPF use cases :)
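
For anyone who wants to play with it, here is a minimal user-space C11
sketch of the enqueue side of that bounded MPMC queue from [0]. The
struct layout and names are my own illustration, not existing BPF or
kernel code, and cells must be initialized with seq = index before use:

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct cell {
          atomic_size_t seq;              /* per-cell sequence number */
          void *data;
  };

  struct mpmc_queue {
          struct cell *buf;               /* capacity is a power of two */
          size_t mask;                    /* capacity - 1 */
          atomic_size_t enqueue_pos;
          atomic_size_t dequeue_pos;
  };

  static bool mpmc_enqueue(struct mpmc_queue *q, void *data)
  {
          size_t pos = atomic_load_explicit(&q->enqueue_pos,
                                            memory_order_relaxed);
          for (;;) {
                  struct cell *c = &q->buf[pos & q->mask];
                  size_t seq = atomic_load_explicit(&c->seq,
                                                    memory_order_acquire);
                  intptr_t dif = (intptr_t)seq - (intptr_t)pos;

                  if (dif == 0) {
                          /* Cell is free: try to claim this slot. On
                           * failure the CAS reloads pos for us. */
                          if (atomic_compare_exchange_weak_explicit(
                                      &q->enqueue_pos, &pos, pos + 1,
                                      memory_order_relaxed,
                                      memory_order_relaxed)) {
                                  c->data = data;
                                  /* Publish: consumers wait for pos + 1. */
                                  atomic_store_explicit(&c->seq, pos + 1,
                                                        memory_order_release);
                                  return true;
                          }
                  } else if (dif < 0) {
                          return false;   /* queue is full */
                  } else {
                          /* Another producer claimed the slot; retry. */
                          pos = atomic_load_explicit(&q->enqueue_pos,
                                                     memory_order_relaxed);
                  }
          }
  }

Overwriting would change the "queue is full" branch: instead of failing,
the producer would have to reclaim the slot from the consumer side,
which is where the seqlock idea below comes in.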

>
> Adding overwriting support may be an interesting exercise.
> I guess readers could use some variation of a seqlock to deal with
> elements that are being overwritten.

One way I was thinking of would be to remember the sequence number
before reading the data, read the data, and then re-read the sequence
number. If it changed, the user can discard the data because it was
modified. If not, then we have a guarantee that the data was intact for
the entire duration of the read operation.
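
Roughly, in C11 terms, the reader side could look like this (the element
layout, the fixed size, and the odd/even convention for in-progress
writes are assumptions for illustration, not an existing API):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <string.h>

  #define ELEM_SIZE 64                    /* assumed fixed element size */

  struct elem {
          atomic_uint seq;                /* odd while being overwritten */
          char data[ELEM_SIZE];
  };

  static bool read_elem(struct elem *e, char *out)
  {
          unsigned int s1, s2;

          s1 = atomic_load_explicit(&e->seq, memory_order_acquire);
          if (s1 & 1)
                  return false;           /* writer is mid-overwrite */
          memcpy(out, e->data, ELEM_SIZE);
          atomic_thread_fence(memory_order_acquire);
          s2 = atomic_load_explicit(&e->seq, memory_order_relaxed);
          return s1 == s2;                /* data intact for whole read */
  }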

> Writers can already skip over other slow writers. Normally this is
> used w/o wrap-around, but I suspect it can just work with wrap-around
> as well (a writer can skip over a writer stuck on the previous lap).
> Since we overwrite elements, the queue provides only a very weak
> notion of FIFO anyway, so skipping over very old writers may be fine.

Exactly, it's not really FIFO (so perhaps literally retrofitting it
into BPF_MAP_TYPE_QUEUE might not be the best idea; maybe it would be
a new map type), so overwriting is as if some consumer quickly consumed
(and discarded) an element, and then wrote some new information over
it. That was how my thinking went.

The devil is in the details and in fitting all this end-to-end into the
BPF subsystem, of course.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-19 13:38           ` Steven Rostedt
  2023-12-19 17:01             ` Alexei Starovoitov
  2023-12-19 17:28             ` Steven Rostedt
@ 2023-12-21 13:00             ` Philo Lu
  2023-12-21 14:49               ` Steven Rostedt
  2 siblings, 1 reply; 17+ messages in thread
From: Philo Lu @ 2023-12-21 13:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrii Nakryiko, Shung-Hsi Yu, bpf, song, andrii, ast,
	Daniel Borkmann, xuanzhuo, dust.li, guwen, alibuda, hengqi,
	Nathan Slingerland, rihams, Alan Maguire, Masami Hiramatsu

Hi Steven,

Thanks for your explanation of the ftrace ring buffer. Also thanks to
Shung-Hsi for the discussion.

Here are some features of the ftrace buffer that I am not sure about.
Could you please tell me if my understanding is correct?

(1) When reading and writing occur concurrently:
   (a) If the reader is faster than the writer, the reader cannot get the
page that is still being written, which means that, in the worst case, the
reader may not see up to one page's worth of data immediately.
   (b) If the writer is faster than the reader, the only race between them
is when the reader is swapping pages while the writer wraps in overwrite
mode. But if the reader has finished swapping, the writer can wrap safely,
because the reader page is already out of the buffer page list.

(2) As the per-cpu buffer list changes dynamically when the reader page
moves, we cannot mmap the buffer to expose it to user space. Users can
consume at most one page at a time.

(3) The wake-up behavior is controllable. If there is no waiter at all,
no wake-up overhead is incurred.

Thanks.

On 2023/12/19 21:38, Steven Rostedt wrote:
> On Tue, 19 Dec 2023 14:23:59 +0800
> Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
> 
>> Curious whether it is possible to reuse ftrace's trace buffer instead
>> (or its underlying ring buffer implementation at
>> kernel/trace/ring_buffer.c). AFAICT it satisfies both requirements that
>> Philo stated: (1) no need for user process as the buffer is accessible
>> through tracefs, and (2) has an overwrite mode.
> 
> Yes, the ftrace ring-buffer was in fact designed for the above use case.
> 
>>
>> Furthermore, a natural feature request that would come after
>> overwriting support would be snapshotting, and that has already been
>> covered in ftrace.
> 
> Yes, it has that too.
> 
>>
>> Note: technically a BPF program could already write to ftrace's trace
>> buffer with the bpf_trace_vprintk() helper, but that goes through string
>> formatting and only allows writing to the global buffer.
> 
> When eBPF was first being developed, Alexei told me he tried the ftrace
> ring buffer, and he said the filtering was too slow. That's because it
> would always write into the ring buffer and then try to discard it after
> the fact, which required a few cmpxchg operations to synchronize. He
> decided that the perf ring buffer was a better fit for this.
> 
> That was solved with this: 0fc1b09ff1ff4 ("tracing: Use temp buffer when
> filtering events"), which makes the filtering similar to perf, as perf
> always copies events to a temporary buffer first.
> 
> It still falls back to writing directly into the ring buffer if the temp
> buffer is currently being used by another event on the same CPU.
> 
> Note that the perf ring buffer was designed for profiling (taking
> intermediate traces) and is tightly coupled to having a reader, whereas
> the ftrace ring buffer was designed for high-speed constant tracing,
> with or without a reader.
> 
> -- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-21 13:00             ` Philo Lu
@ 2023-12-21 14:49               ` Steven Rostedt
  2023-12-22 12:25                 ` Philo Lu
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2023-12-21 14:49 UTC (permalink / raw)
  To: Philo Lu
  Cc: Andrii Nakryiko, Shung-Hsi Yu, bpf, song, andrii, ast,
	Daniel Borkmann, xuanzhuo, dust.li, guwen, alibuda, hengqi,
	Nathan Slingerland, rihams, Alan Maguire, Masami Hiramatsu

On Thu, 21 Dec 2023 21:00:39 +0800
Philo Lu <lulie@linux.alibaba.com> wrote:

> Hi Steven,
> 
> Thanks for your explanation of the ftrace ring buffer. Also thanks to
> Shung-Hsi for the discussion.
> 
> Here are some features of the ftrace buffer that I am not sure about.
> Could you please tell me if my understanding is correct?
> 
> (1) When reading and writing occur concurrently:
>    (a) If the reader is faster than the writer, the reader cannot get
> the page that is still being written, which means that, in the worst
> case, the reader may not see up to one page's worth of data immediately.

Nope, that's not the case. Otherwise you couldn't do this!

 ~# cd /sys/kernel/tracing
 ~# echo hello world > trace_marker
 ~# cat trace_pipe
           <...>-861     [001] ..... 76124.880943: tracing_mark_write: hello world

Yes, the reader swaps out an active sub-buffer to read it. But it's fine if
the writer is still on that sub-buffer. That's because the sub-buffers are
a linked list and the writer will simply walk off the end of the sub-buffer
and back into the sub-buffers in the active ring buffer.

Note, in this case, the ring buffer cannot hand the sub-buffer to the
reader to pass to splice, as the sub-buffer could then be freed while the
writer is still on it; instead, it copies the data for the reader. It
also keeps track of what it copied so it doesn't copy it again the next
time.

>    (b) If the writer is faster than the reader, the only race between
> them is when the reader is swapping pages while the writer wraps in
> overwrite mode. But if the reader has finished swapping, the writer can
> wrap safely, because the reader page is already out of the buffer page
> list.

Yes, that is the point of contention. But the writer doesn't wait for the
reader. The reader does a cmpxchg loop to make sure it's not conflicting
with the writer. The writer has priority and doesn't loop in this case.
That is, a reader will not slow down the writer except for whatever
contention the hardware itself causes.
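
A grossly simplified sketch of that reader-side retry, in user-space C11
terms (the real code in kernel/trace/ring_buffer.c encodes flag bits in
the list pointers to coordinate with the writer; the names here are
illustrative, not the kernel's):

  #include <stdatomic.h>

  struct buffer_page {
          struct buffer_page *next;
          /* ... event data ... */
  };

  struct cpu_buffer {
          _Atomic(struct buffer_page *) head_page;
  };

  /* Swap a spare page in for the current head; only the reader loops. */
  static struct buffer_page *swap_reader_page(struct cpu_buffer *cb,
                                              struct buffer_page *spare)
  {
          for (;;) {
                  struct buffer_page *head =
                          atomic_load_explicit(&cb->head_page,
                                               memory_order_acquire);
                  spare->next = head->next;   /* splice the spare in */
                  if (atomic_compare_exchange_weak_explicit(
                              &cb->head_page, &head, spare,
                              memory_order_acq_rel, memory_order_acquire))
                          return head;        /* swapped-out page is ours */
                  /* The writer moved underneath us; retry. */
          }
  }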

> 
> (2) As the per-cpu buffer list changes dynamically when the reader page
> moves, we cannot mmap the buffer to expose it to user space. Users can
> consume at most one page at a time.

The code works with splice, and the way trace-cmd does it is to use the
max pipe size, reading 64kb at a time by default. The internals swap out
one sub-buffer at a time, but then move them into the pipe with zero copy
(if the sub-buffers are full and the writer is not still on them). The
user can see all these sub-buffers in the pipe at once.
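
A stripped-down version of that read loop might look like this (default
tracefs mount path assumed, error handling elided):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          int tfd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
                         O_RDONLY);
          int ofd = open("trace.raw", O_WRONLY | O_CREAT | O_TRUNC, 0644);
          int p[2];

          pipe(p);
          for (;;) {
                  /* Move whole sub-buffers into the pipe, zero copy... */
                  ssize_t n = splice(tfd, NULL, p[1], NULL, 65536, 0);
                  if (n <= 0)
                          break;
                  /* ...then drain the pipe into the output file. */
                  splice(p[0], NULL, ofd, NULL, (size_t)n, 0);
          }
          return 0;
  }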

I'm working to have 6.8 remove the limit of "one page" and allow the
sub-buffers to be any order of pages (1,2,4,8,...). I'm hoping to have that
work pushed to linux-next by end of today.

 https://lore.kernel.org/linux-trace-kernel/20231219185414.474197117@goodmis.org/

and we are also working on mmapping the ring buffer to user space:

 https://lore.kernel.org/linux-trace-kernel/20231219184556.1552951-1-vdonnefort@google.com/

That may not make 6.8 but will likely make 6.9 at the latest.

It still requires user space to make an ioctl() system call between
sub-buffers, as the swap logic is still used.

The way it will work is that all the sub-buffers will be mmapped to user
space, including the reader page. A meta-data page will indicate which
sub-buffer is which. When user space calls the ioctl(), it will update
which one of the mapped sub-buffers is the "reader-page" (really
"reader-subbuf"), and the writers will not write on it. When user space
is finished reading the data on the reader-page, it will call the ioctl()
again, and the meta data will be updated to point to the sub-buffer that
is now the new "reader-page" for user space to read.

There are no new allocations needed for the swap. The old reader-subbuf gets
swapped with one of the active sub-buffers and becomes an active sub-buffer
itself. The swapped out sub-buffer becomes the new "reader-page/subbuf".
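
From user space the loop would then look roughly like this. To be clear,
the ioctl number, the meta-data layout, and the mapping size below are
hypothetical, since the UAPI in that series is still under review:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Hypothetical UAPI, for illustration only. */
  #define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1)

  struct ring_buffer_meta {               /* hypothetical layout */
          unsigned long reader_subbuf;    /* offset of reader sub-buffer */
          unsigned long data_len;         /* valid bytes in it */
  };

  int main(void)
  {
          int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
                        O_RDONLY);
          struct ring_buffer_meta *meta =
                  mmap(NULL, 1 << 20, PROT_READ, MAP_SHARED, fd, 0);
          char *base = (char *)meta;

          for (;;) {
                  /* Ask the kernel to designate a new reader sub-buffer. */
                  if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
                          break;
                  /* Events are consumed straight out of the mapping. */
                  fwrite(base + meta->reader_subbuf, 1, meta->data_len,
                         stdout);
          }
          return 0;
  }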

> 
> (3) The wake-up behavior is controllable. If there is no waiter at all, 
> no overhead will be induced because of waking up.

Correct. When there's a waiter, a bit is set and an irq_work is called to
wake up the waiter (this is basically the same as what perf does).

You can also set when you want to wake up via the buffer_percent file in
tracefs. If the buffer is not filled to the percentage specified, it will
not wake up the waiters.
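
For example, with the default tracefs mount:

 ~# cd /sys/kernel/tracing
 ~# echo 50 > buffer_percent
 ~# cat trace_pipe

Now readers sleep until the buffer is at least half full.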

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
  2023-12-21 14:49               ` Steven Rostedt
@ 2023-12-22 12:25                 ` Philo Lu
  0 siblings, 0 replies; 17+ messages in thread
From: Philo Lu @ 2023-12-22 12:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrii Nakryiko, Shung-Hsi Yu, bpf, song, andrii, ast,
	Daniel Borkmann, xuanzhuo, dust.li, guwen, alibuda, hengqi,
	Nathan Slingerland, rihams, Alan Maguire, Masami Hiramatsu

Thank you very much for your reply; it helped me understand the ftrace
buffer better.

I think it is feasible to implement a new type of bpf map based on the
ftrace buffer. As for the user interface, perhaps representing it as files
is still a good choice (like tracefs for ftrace)? But we should make sure
that each map uses an exclusive directory.

Also, I have implemented a relay map and submitted the patches [0]; any
comments are welcome. Its behavior is exactly what I described above. The
buffer is represented as files in debugfs (`/sys/kernel/debug/`), one
directory per map. Users can get the data through read or mmap interfaces.
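
For example, draining one per-cpu file could be as simple as this (the
path and per-cpu file name here are made up for illustration; see the
patch set for the actual layout):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* One debugfs directory per relay map, one file per CPU. */
          int fd = open("/sys/kernel/debug/my_relay_map/cpu0", O_RDONLY);
          char buf[4096];
          ssize_t n;

          while ((n = read(fd, buf, sizeof(buf))) > 0)
                  fwrite(buf, 1, (size_t)n, stdout);
          close(fd);
          return 0;
  }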

The relay interface is also designed around a sub-buffer structure. It is
lightweight and gives users much flexibility in formatting and processing
the data. Meanwhile, the ftrace buffer gives thorough consideration to
various use cases, so that users only have to care about the data entry by
entry. It seems that the ftrace buffer could be a better alternative to
perfbuf. Therefore, I think it is possible for relay and the ftrace buffer
to coexist as bpf maps.

Wish you all happy holidays :)

[0]
https://lore.kernel.org/all/20231222122146.65519-1-lulie@linux.alibaba.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-12-22 12:25 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-07 13:15 Question about bpf perfbuf/ringbuf: pinned in backend with overwriting Philo Lu
2023-12-07 14:48 ` Alan Maguire
2023-12-08 22:32   ` Andrii Nakryiko
2023-12-11 12:39     ` Philo Lu
2023-12-13 23:35       ` Andrii Nakryiko
2023-12-15 10:10         ` Philo Lu
2023-12-15 22:39           ` Andrii Nakryiko
2023-12-16  8:50             ` Dmitry Vyukov
2023-12-18 12:58               ` Philo Lu
2023-12-19 19:25               ` Andrii Nakryiko
2023-12-19  6:23         ` Shung-Hsi Yu
2023-12-19 13:38           ` Steven Rostedt
2023-12-19 17:01             ` Alexei Starovoitov
2023-12-19 17:28             ` Steven Rostedt
2023-12-21 13:00             ` Philo Lu
2023-12-21 14:49               ` Steven Rostedt
2023-12-22 12:25                 ` Philo Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).