bpf.vger.kernel.org archive mirror
* Behavior of pinned perf event array
@ 2020-08-24 23:57 Song Liu
       [not found] ` <47929B19-E739-4E74-BBB7-B2C0DCC7A7F8@fb.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2020-08-24 23:57 UTC (permalink / raw)
  To: Daniel Borkmann, bpf, Networking
  Cc: Andrii Nakryiko, Alexei Starovoitov, Kernel Team

Hi Daniel, 

We are looking at sharing perf events among multiple processes via a
pinned perf event array. However, we found this doesn't really work,
as the perf event is removed from the map when the struct file is
released from user space (in perf_event_fd_array_release). This
means the pinned perf event array can be shared among multiple
processes, but each perf event still has a single owner. Once the owner
process closes the fd (or terminates), the perf event is removed
from the array. I went through the history of the code and found
this behavior is actually expected (commit 3b1efb196eee).

In our use case, however, we want to share the perf event among 
different processes. I think we have a few options to achieve this:

1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN.
   Once this flag is set, we should not remove the fd on struct file 
   release. Instead, we remove the fd in map_release_uref.

2. Allow a different process to hold a reference to the perf_event
   via the pinned perf event array. I guess we can achieve this by
   enabling BPF_MAP_UPDATE_ELEM for the perf event array.

3. Other ideas?

Could you please let us know your thoughts on this?

Thanks,
Song
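
For reference, a minimal user-space sketch of the sharing pattern described
above, using libbpf's low-level helpers from tools/lib/bpf/bpf.h (as available
at the time). The pin path, array size, and perf event attributes are
illustrative only and not taken from this thread; error handling is omitted.

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>
#include <linux/perf_event.h>
#include <bpf/bpf.h>

int main(void)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_SOFTWARE,
		.size = sizeof(attr),
		.config = PERF_COUNT_SW_BPF_OUTPUT,
		.sample_type = PERF_SAMPLE_RAW,
	};
	int map_fd, event_fd, key = 0;

	/* Create the perf event array and pin it so other processes can find it. */
	map_fd = bpf_create_map(BPF_MAP_TYPE_PERF_EVENT_ARRAY,
				sizeof(int), sizeof(int), 64, 0);
	bpf_obj_pin(map_fd, "/sys/fs/bpf/shared_events");

	/* Open a perf event and install it into slot 0 of the array. */
	event_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
			   0 /* cpu */, -1 /* group_fd */, 0);
	bpf_map_update_elem(map_fd, &key, &event_fd, BPF_ANY);

	/* When this process exits (or closes map_fd), the kernel runs
	 * perf_event_fd_array_release() and flushes the entries installed
	 * through this map file, even though the map itself stays pinned
	 * in bpffs. That is the auto-removal behavior discussed here.
	 */
	return 0;
}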


* Re: Behavior of pinned perf event array
       [not found]           ` <DEBBD27D-188D-4EFD-8C04-838F54689587@fb.com>
@ 2020-09-11 20:36             ` Song Liu
  2020-09-14 22:59               ` Andrii Nakryiko
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2020-09-11 20:36 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Andrii Nakryiko, bpf



> On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> 
>>> On 9/3/20 1:05 AM, Song Liu wrote:
>>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>> 
>>>>> Hi Song,
>>>>> 
>>>>> Sorry indeed missed it.
>>>>> 
>>>>> On 9/2/20 11:33 PM, Song Liu wrote:
>>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@fb.com> wrote:
>>>>>>> 
>>>>>>> We are looking at sharing perf events among multiple processes via a
>>>>>>> pinned perf event array. However, we found this doesn't really work,
>>>>>>> as the perf event is removed from the map when the struct file is
>>>>>>> released from user space (in perf_event_fd_array_release). This
>>>>>>> means the pinned perf event array can be shared among multiple
>>>>>>> processes, but each perf event still has a single owner. Once the owner
>>>>>>> process closes the fd (or terminates), the perf event is removed
>>>>>>> from the array. I went through the history of the code and found
>>>>>>> this behavior is actually expected (commit 3b1efb196eee).
>>>>> 
>>>>> Right, that auto-cleanup is expected.
>>>>> 
>>>>>>> In our use case, however, we want to share the perf event among
>>>>>>> different processes. I think we have a few options to achieve this:
>>>>>>> 
>>>>>>> 1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN.
>>>>>>> Once this flag is set, we should not remove the fd on struct file
>>>>>>> release. Instead, we remove the fd in map_release_uref.
>>>>>>> 
>>>>>>> 2. Allow a different process to hold a reference to the perf_event
>>>>>>> via the pinned perf event array. I guess we can achieve this by
>>>>>>> enabling BPF_MAP_UPDATE_ELEM for the perf event array.
>>>>>>> 
>>>>>>> 3. Other ideas?
>>>>>>> 
>>>>>>> Could you please let us know your thoughts on this?
>>>>> 
>>>>> One option that would work for sure is to transfer the fd to the other
>>>>> processes via fd passing e.g. through pipe or unix domain socket.
>>>> I haven't tried to transfer the fd, but it might be tricky. We need to
>>>> plan for more than 2 processes sharing the events, and these processes
>>>> will start and terminate in any order.
>>>>> I guess my question would be that it would be hard to debug if we keep
>>>>> dangling perf event entries in there by accident that no one is cleaning
>>>>> up. Some sort of flag is probably okay, but it needs proper introspection
>>>>> facilities from bpftool side so that it could be detected that it's just
>>>>> dangling around waiting for cleanup.
>>>> With my latest design, we don't need to pin the perf_event map (nor
>>>> the prog accessing the map). I guess this can make the cleanup problem
>>>> easier? So we will add a new flag for map_create. With the flag, we
>>> 
>>> I mean pinning the map itself or the prog accessing the map is not
>>> the issue. Afaik, it's more the perf RB that is consuming memory and
>>> can be dangling, so it's the presence of the /entry/ in the map itself
>>> which would then not be cleaned up by accident; I think this was the
>>> motivation back then at least.
>>> 
>>>> will not close the perf_event during process termination, and we block
>>>> pinning for this map, and any program accessing this map. Does this
>>>> sound like a good plan?
>>> 
>>> Could you elaborate why blocking pinning of map/prog is useful in this context?
>> 
>> I was thinking, we are more likely to forget cleaning up a pinned map. If the
>> map is not pinned, it will at least be cleaned up when all processes accessing
>> it terminate. On the other hand, a pinned map would stay until someone removes
>> it from bpffs. So the idea is to avoid the pinning scenario.
>>
>> But I agree this won't solve all the problems. Say process A added a few
>> perf events (A_events) to the map. And process B added some other events
>> (B_events) to the same map. Blocking pinning makes sure we clean up everything
>> when both A and B terminate. But if A terminates soon, while B runs for a
>> long time, A_events will stay open unnecessarily.
>> 
>> Alternatively, we can implement map_fd_sys_lookup_elem for the perf event map,
>> which returns an fd to the perf event. With this solution, if process A added
>> some events to the map and process B wants to use them after process A
>> terminates, we need to explicitly do the lookup in process B and hold the fd
>> to the perf event. Maybe this is a better solution?
> 
> Actually, this doesn't work. :( With map_fd_sys_lookup_elem(), we can get an fd
> on the perf_event, but we still remove the event from the map in
> perf_event_fd_array_release(). Let me see what the best next step is...

CC Andrii and bpf@

Andrii and I had some discussion on this. 

Currently, I am working on something with a new flag BPF_F_SHARE_PE. I attached 
the diff below. 

On the other hand, we found the current behavior of perf_event_array puzzling,
especially for pinned perf_event_array (as pinning doesn't really pin the content).
Therefore, maybe we should consider changing the behavior without adding a flag?

Thanks,
Song


========================================================================

From f78720bbaec9aae3a44bd21e0902d5a215dfe0b0 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Tue, 11 Aug 2020 13:38:55 -0700
Subject: [PATCH 1/3] bpf: preserve fd in pinned perf_event_map

---
 include/uapi/linux/bpf.h       |  3 +++
 kernel/bpf/arraymap.c          | 31 +++++++++++++++++++++++++++++--
 tools/include/uapi/linux/bpf.h |  3 +++
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 90359cab501d1..b527a28a4ba5c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -413,6 +413,9 @@ enum {

 /* Enable memory-mapping BPF map */
        BPF_F_MMAPABLE          = (1U << 10),
+
+/* Share perf_event among processes */
+       BPF_F_SHARE_PE          = (1U << 11),
 };

 /* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e046fb7d17cd0..3bb58fe3d444c 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -15,7 +15,7 @@
 #include "map_in_map.h"

 #define ARRAY_CREATE_FLAG_MASK \
-       (BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK)
+       (BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | BPF_F_SHARE_PE)

 static void bpf_array_free_percpu(struct bpf_array *array)
 {
@@ -64,6 +64,10 @@ int array_map_alloc_check(union bpf_attr *attr)
            attr->map_flags & BPF_F_MMAPABLE)
                return -EINVAL;

+       if (attr->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
+           attr->map_flags & BPF_F_SHARE_PE)
+               return -EINVAL;
+
        if (attr->value_size > KMALLOC_MAX_SIZE)
                /* if value_size is bigger, the user space won't be able to
                 * access the elements.
@@ -778,6 +782,26 @@ static int fd_array_map_delete_elem(struct bpf_map *map, void *key)
        }
 }

+static void perf_event_fd_array_map_free(struct bpf_map *map)
+{
+       struct bpf_array *array = container_of(map, struct bpf_array, map);
+       struct bpf_event_entry *ee;
+       int i;
+
+       if (map->map_flags & BPF_F_SHARE_PE) {
+               for (i = 0; i < array->map.max_entries; i++) {
+                       ee = READ_ONCE(array->ptrs[i]);
+                       if (ee)
+                               fd_array_map_delete_elem(map, &i);
+               }
+       } else {
+               for (i = 0; i < array->map.max_entries; i++)
+                       BUG_ON(array->ptrs[i] != NULL);
+       }
+
+       bpf_map_area_free(array);
+}
+
 static void *prog_fd_array_get_ptr(struct bpf_map *map,
                                   struct file *map_file, int fd)
 {
@@ -1105,6 +1129,9 @@ static void perf_event_fd_array_release(struct bpf_map *map,
        struct bpf_event_entry *ee;
        int i;

+       if (map->map_flags & BPF_F_SHARE_PE)
+               return;
+
        rcu_read_lock();
        for (i = 0; i < array->map.max_entries; i++) {
                ee = READ_ONCE(array->ptrs[i]);
@@ -1119,7 +1146,7 @@ const struct bpf_map_ops perf_event_array_map_ops = {
        .map_meta_equal = bpf_map_meta_equal,
        .map_alloc_check = fd_array_map_alloc_check,
        .map_alloc = array_map_alloc,
-       .map_free = fd_array_map_free,
+       .map_free = perf_event_fd_array_map_free,
        .map_get_next_key = array_map_get_next_key,
        .map_lookup_elem = fd_array_map_lookup_elem,
        .map_delete_elem = fd_array_map_delete_elem,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 90359cab501d1..b527a28a4ba5c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -413,6 +413,9 @@ enum {

 /* Enable memory-mapping BPF map */
        BPF_F_MMAPABLE          = (1U << 10),
+
+/* Share perf_event among processes */
+       BPF_F_SHARE_PE          = (1U << 11),
 };

 /* Flags for BPF_PROG_QUERY. */
--
2.24.1
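
For completeness, a rough sketch of how the BPF_F_SHARE_PE flag proposed in the
diff above might be used from user space, assuming the patched uapi header is
installed. The pin path and array size are illustrative, and the flag's final
name and semantics are still under discussion in this thread.

#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Process A: create the shared perf event array with the proposed flag and
 * pin it in bpffs. Events installed via bpf_map_update_elem() would no
 * longer be flushed when A closes its map fd.
 */
static int create_shared_map(void)
{
	int map_fd;

	map_fd = bpf_create_map(BPF_MAP_TYPE_PERF_EVENT_ARRAY,
				sizeof(int), sizeof(int), 64,
				BPF_F_SHARE_PE);
	if (map_fd < 0)
		return map_fd;
	if (bpf_obj_pin(map_fd, "/sys/fs/bpf/shared_events"))
		return -1;
	return map_fd;
}

/* Process B (possibly started after A has exited): reuse the pinned map.
 * With BPF_F_SHARE_PE, the entries A installed are only torn down in
 * perf_event_fd_array_map_free() once the last reference to the map goes
 * away, per the diff above.
 */
static int reuse_shared_map(void)
{
	return bpf_obj_get("/sys/fs/bpf/shared_events");
}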




* Re: Behavior of pinned perf event array
  2020-09-11 20:36             ` Song Liu
@ 2020-09-14 22:59               ` Andrii Nakryiko
  2020-09-23 16:21                 ` Song Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Andrii Nakryiko @ 2020-09-14 22:59 UTC (permalink / raw)
  To: Song Liu; +Cc: Daniel Borkmann, bpf

On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@fb.com> wrote:
> >
> >
> >
> >> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@fb.com> wrote:
> >>
> >>
> >>
> >>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>
> >>> On 9/3/20 1:05 AM, Song Liu wrote:
> >>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>>
> >>>>> Hi Song,
> >>>>>
> >>>>> Sorry indeed missed it.
> >>>>>
> >>>>> On 9/2/20 11:33 PM, Song Liu wrote:
> >>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@fb.com> wrote:
> >>>>>>>
> >>>>>>> We are looking at sharing perf events among multiple processes via a
> >>>>>>> pinned perf event array. However, we found this doesn't really work,
> >>>>>>> as the perf event is removed from the map when the struct file is
> >>>>>>> released from user space (in perf_event_fd_array_release). This
> >>>>>>> means the pinned perf event array can be shared among multiple
> >>>>>>> processes, but each perf event still has a single owner. Once the owner
> >>>>>>> process closes the fd (or terminates), the perf event is removed
> >>>>>>> from the array. I went through the history of the code and found
> >>>>>>> this behavior is actually expected (commit 3b1efb196eee).
> >>>>>
> >>>>> Right, that auto-cleanup is expected.
> >>>>>
> >>>>>>> In our use case, however, we want to share the perf event among
> >>>>>>> different processes. I think we have a few options to achieve this:
> >>>>>>>
> >>>>>>> 1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN.
> >>>>>>> Once this flag is set, we should not remove the fd on struct file
> >>>>>>> release. Instead, we remove the fd in map_release_uref.
> >>>>>>>
> >>>>>>> 2. Allow a different process to hold a reference to the perf_event
> >>>>>>> via the pinned perf event array. I guess we can achieve this by
> >>>>>>> enabling BPF_MAP_UPDATE_ELEM for the perf event array.
> >>>>>>>
> >>>>>>> 3. Other ideas?
> >>>>>>>
> >>>>>>> Could you please let us know your thoughts on this?
> >>>>>
> >>>>> One option that would work for sure is to transfer the fd to the other
> >>>>> processes via fd passing e.g. through pipe or unix domain socket.
> >>>> I haven't tried to transfer the fd, but it might be tricky. We need to
> >>>> plan for more than 2 processes sharing the events, and these processes
> >>>> will start and terminate in any order.
> >>>>> I guess my question would be that it would be hard to debug if we keep
> >>>>> dangling perf event entries in there by accident that no one is cleaning
> >>>>> up. Some sort of flag is probably okay, but it needs proper introspection
> >>>>> facilities from bpftool side so that it could be detected that it's just
> >>>>> dangling around waiting for cleanup.
> >>>> With my latest design, we don't need to pin the perf_event map (nor
> >>>> the prog accessing the map). I guess this can make the cleanup problem
> >>>> easier? So we will add a new flag for map_create. With the flag, we
> >>>
> >>> I mean pinning the map itself or the prog accessing the map is not
> >>> the issue. Afaik, it's more the perf RB that is consuming memory and
> >>> can be dangling, so it's the presence of the /entry/ in the map itself
> >>> which would then not be cleaned up by accident; I think this was the
> >>> motivation back then at least.

Daniel, are you aware of any use cases that do rely on such a behavior
of PERF_EVENT_ARRAY?

For me this auto-removal of elements on closing *one of a few*
PERF_EVENT_ARRAY FDs (original one, but still just one of a few active
ones) was extremely surprising. It doesn't follow what we do for any
other BPF map, as far as I can tell. E.g., think about
BPF_MAP_TYPE_PROG_ARRAY. If we pin it in BPF FS and close FD, it won't
auto-remove all the tail call programs, right? There is exactly the
same concern with not auto-releasing bpf_progs, just like with
perf_event. But it's not accidental, if you are pinning a BPF map, you
know what you are doing (at least we have to assume so :).

So instead of adding an extra option, shouldn't we just fix this
behavior instead and make it the same across all BPF maps that hold
kernel resources?

> >>>
> >>>> will not close the perf_event during process termination, and we block
> >>>> pinning for this map, and any program accessing this map. Does this
> >>>> sound like a good plan?
> >>>
> >>> Could you elaborate why blocking pinning of map/prog is useful in this context?
> >>
> >> I was thinking, we are more likely to forget cleaning up a pinned map. If the
> >> map is not pinned, it will at least be cleaned up when all processes accessing
> >> it terminate. On the other hand, a pinned map would stay until someone removes
> >> it from bpffs. So the idea is to avoid the pinning scenario.
> >>
> >> But I agree this won't solve all the problems. Say process A added a few
> >> perf events (A_events) to the map. And process B added some other events
> >> (B_events) to the same map. Blocking pinning makes sure we clean up everything
> >> when both A and B terminate. But if A terminates soon, while B runs for a
> >> long time, A_events will stay open unnecessarily.
> >>
> >> Alternatively, we can implement map_fd_sys_lookup_elem for the perf event map,
> >> which returns an fd to the perf event. With this solution, if process A added
> >> some events to the map and process B wants to use them after process A
> >> terminates, we need to explicitly do the lookup in process B and hold the fd
> >> to the perf event. Maybe this is a better solution?
> >
> > Actually, this doesn't work. :( With map_fd_sys_lookup_elem(), we can get an fd
> > on the perf_event, but we still remove the event from the map in
> > perf_event_fd_array_release(). Let me see what the best next step is...
>
> CC Andrii and bpf@

thanks, Song!

>
> Andrii and I had some discussion on this.
>
> Currently, I am working on something with a new flag BPF_F_SHARE_PE. I attached
> the diff below.
>
> On the other hand, we found the current behavior of perf_event_array puzzling,
> especially for pinned perf_event_array (as pinning doesn't really pin the content).
> Therefore, maybe we should consider changing the behavior without adding a flag?
>
> Thanks,
> Song
>
>
> ========================================================================
>

[...]


* Re: Behavior of pinned perf event array
  2020-09-14 22:59               ` Andrii Nakryiko
@ 2020-09-23 16:21                 ` Song Liu
  2020-09-24  8:43                   ` Daniel Borkmann
  0 siblings, 1 reply; 7+ messages in thread
From: Song Liu @ 2020-09-23 16:21 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: bpf, Andrii Nakryiko

Hi Daniel,

> On Sep 14, 2020, at 3:59 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> 
>>> 
>>>> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>> 
>>>>> On 9/3/20 1:05 AM, Song Liu wrote:
>>>>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>>> 
>>>>>>> Hi Song,
>>>>>>> 
>>>>>>> Sorry indeed missed it.
>>>>>>> 
>>>>>>> On 9/2/20 11:33 PM, Song Liu wrote:
>>>>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@fb.com> wrote:
>>>>>>>>> 
>>>>>>>>> We are looking at sharing perf events among multiple processes via a
>>>>>>>>> pinned perf event array. However, we found this doesn't really work,
>>>>>>>>> as the perf event is removed from the map when the struct file is
>>>>>>>>> released from user space (in perf_event_fd_array_release). This
>>>>>>>>> means the pinned perf event array can be shared among multiple
>>>>>>>>> processes, but each perf event still has a single owner. Once the owner
>>>>>>>>> process closes the fd (or terminates), the perf event is removed
>>>>>>>>> from the array. I went through the history of the code and found
>>>>>>>>> this behavior is actually expected (commit 3b1efb196eee).
>>>>>>> 
>>>>>>> Right, that auto-cleanup is expected.
>>>>>>> 
>>>>>>>>> In our use case, however, we want to share the perf event among
>>>>>>>>> different processes. I think we have a few options to achieve this:
>>>>>>>>> 
>>>>>>>>> 1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN.
>>>>>>>>> Once this flag is set, we should not remove the fd on struct file
>>>>>>>>> release. Instead, we remove the fd in map_release_uref.
>>>>>>>>> 
>>>>>>>>> 2. Allow a different process to hold a reference to the perf_event
>>>>>>>>> via the pinned perf event array. I guess we can achieve this by
>>>>>>>>> enabling BPF_MAP_UPDATE_ELEM for the perf event array.
>>>>>>>>> 
>>>>>>>>> 3. Other ideas?
>>>>>>>>> 
>>>>>>>>> Could you please let us know your thoughts on this?
>>>>>>> 
>>>>>>> One option that would work for sure is to transfer the fd to the other
>>>>>>> processes via fd passing e.g. through pipe or unix domain socket.
>>>>>> I haven't tried to transfer the fd, but it might be tricky. We need to
>>>>>> plan for more than 2 processes sharing the events, and these processes
>>>>>> will start and terminate in any order.
>>>>>>> I guess my question would be that it would be hard to debug if we keep
>>>>>>> dangling perf event entries in there by accident that no one is cleaning
>>>>>>> up. Some sort of flag is probably okay, but it needs proper introspection
>>>>>>> facilities from bpftool side so that it could be detected that it's just
>>>>>>> dangling around waiting for cleanup.
>>>>>> With my latest design, we don't need to pin the perf_event map (nor
>>>>>> the prog accessing the map). I guess this can make the cleanup problem
>>>>>> easier? So we will add a new flag for map_create. With the flag, we
>>>>> 
>>>>> I mean pinning the map itself or the prog accessing the map is not
>>>>> the issue. Afaik, it's more the perf RB that is consuming memory and
>>>>> can be dangling, so it's the presence of the /entry/ in the map itself
>>>>> which would then not be cleaned up by accident; I think this was the
>>>>> motivation back then at least.
> 
> Daniel, are you aware of any use cases that do rely on such a behavior
> of PERF_EVENT_ARRAY?
> 
> For me this auto-removal of elements on closing *one of a few*
> PERF_EVENT_ARRAY FDs (original one, but still just one of a few active
> ones) was extremely surprising. It doesn't follow what we do for any
> other BPF map, as far as I can tell. E.g., think about
> BPF_MAP_TYPE_PROG_ARRAY. If we pin it in BPF FS and close FD, it won't
> auto-remove all the tail call programs, right? There is exactly the
> same concern with not auto-releasing bpf_progs, just like with
> perf_event. But it's not accidental, if you are pinning a BPF map, you
> know what you are doing (at least we have to assume so :).
> 
> So instead of adding an extra option, shouldn't we just fix this
> behavior instead and make it the same across all BPF maps that hold
> kernel resources?

Could you please share your thoughts on this? I personally don't have a
strong preference one way (add a flag) or the other (change the default
behavior). But I think we need to agree on which direction to go.

Thanks,
Song

[...]


* Re: Behavior of pinned perf event array
  2020-09-23 16:21                 ` Song Liu
@ 2020-09-24  8:43                   ` Daniel Borkmann
  2020-09-24 20:14                     ` Andrii Nakryiko
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2020-09-24  8:43 UTC (permalink / raw)
  To: Song Liu; +Cc: bpf, Andrii Nakryiko

On 9/23/20 6:21 PM, Song Liu wrote:
>> On Sep 14, 2020, at 3:59 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>> On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@fb.com> wrote:
[...]
>> Daniel, are you aware of any use cases that do rely on such a behavior
>> of PERF_EVENT_ARRAY?
>>
>> For me this auto-removal of elements on closing *one of a few*
>> PERF_EVENT_ARRAY FDs (original one, but still just one of a few active
>> ones) was extremely surprising. It doesn't follow what we do for any
>> other BPF map, as far as I can tell. E.g., think about
>> BPF_MAP_TYPE_PROG_ARRAY. If we pin it in BPF FS and close FD, it won't
>> auto-remove all the tail call programs, right? There is exactly the
>> same concern with not auto-releasing bpf_progs, just like with
>> perf_event. But it's not accidental, if you are pinning a BPF map, you
>> know what you are doing (at least we have to assume so :).
>>
>> So instead of adding an extra option, shouldn't we just fix this
>> behavior instead and make it the same across all BPF maps that hold
>> kernel resources?
> 
> Could you please share your thoughts on this? I personally don't have a
> strong preference one way (add a flag) or the other (change the default
> behavior). But I think we need to agree on which direction to go.

My preference would be to have an opt-in flag; we do rely on the auto-removal
of the perf event map entry on client close in Cilium at least. E.g. a monitor
application can insert itself into the map to start receiving events from
the BPF datapath, and upon exit (no matter whether graceful or not) we don't
consume any more cycles in the data path than necessary for events, and
in __bpf_perf_event_output() we bail out right after checking the
READ_ONCE(array->ptrs[index]) instead of pushing data that later on no one
picks up.
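
The bail-out referenced above is the early return in __bpf_perf_event_output()
in kernel/trace/bpf_trace.c; paraphrased here (around v5.9, surrounding details
may differ slightly):

	struct bpf_array *array = container_of(map, struct bpf_array, map);
	u64 index = flags & BPF_F_INDEX_MASK;
	struct bpf_event_entry *ee;

	if (index == BPF_F_CURRENT_CPU)
		index = smp_processor_id();
	if (unlikely(index >= array->map.max_entries))
		return -E2BIG;

	/* If the consumer went away and its entry was auto-removed, the
	 * datapath pays only for this lookup and returns immediately.
	 */
	ee = READ_ONCE(array->ptrs[index]);
	if (!ee)
		return -ENOENT;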


* Re: Behavior of pinned perf event array
  2020-09-24  8:43                   ` Daniel Borkmann
@ 2020-09-24 20:14                     ` Andrii Nakryiko
  2020-09-24 21:01                       ` Song Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Andrii Nakryiko @ 2020-09-24 20:14 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Song Liu, bpf

On Thu, Sep 24, 2020 at 1:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/23/20 6:21 PM, Song Liu wrote:
> >> On Sep 14, 2020, at 3:59 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >> On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@fb.com> wrote:
> [...]
> >> Daniel, are you aware of any use cases that do rely on such a behavior
> >> of PERF_EVENT_ARRAY?
> >>
> >> For me this auto-removal of elements on closing *one of a few*
> >> PERF_EVENT_ARRAY FDs (original one, but still just one of a few active
> >> ones) was extremely surprising. It doesn't follow what we do for any
> >> other BPF map, as far as I can tell. E.g., think about
> >> BPF_MAP_TYPE_PROG_ARRAY. If we pin it in BPF FS and close FD, it won't
> >> auto-remove all the tail call programs, right? There is exactly the
> >> same concern with not auto-releasing bpf_progs, just like with
> >> perf_event. But it's not accidental, if you are pinning a BPF map, you
> >> know what you are doing (at least we have to assume so :).
> >>
> >> So instead of adding an extra option, shouldn't we just fix this
> >> behavior instead and make it the same across all BPF maps that hold
> >> kernel resources?
> >
> > Could you please share your thoughts on this? I personally don't have a
> > strong preference one way (add a flag) or the other (change the default
> > behavior). But I think we need to agree on which direction to go.
>
> My preference would be to have an opt-in flag; we do rely on the auto-removal
> of the perf event map entry on client close in Cilium at least. E.g. a monitor
> application can insert itself into the map to start receiving events from
> the BPF datapath, and upon exit (no matter whether graceful or not) we don't
> consume any more cycles in the data path than necessary for events, and
> in __bpf_perf_event_output() we bail out right after checking the
> READ_ONCE(array->ptrs[index]) instead of pushing data that later on no one
> picks up.

Well, then that's settled. We can't break existing use cases.


* Re: Behavior of pinned perf event array
  2020-09-24 20:14                     ` Andrii Nakryiko
@ 2020-09-24 21:01                       ` Song Liu
  0 siblings, 0 replies; 7+ messages in thread
From: Song Liu @ 2020-09-24 21:01 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Daniel Borkmann, bpf



> On Sep 24, 2020, at 1:14 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Thu, Sep 24, 2020 at 1:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> 
>> On 9/23/20 6:21 PM, Song Liu wrote:
>>>> On Sep 14, 2020, at 3:59 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>>> On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@fb.com> wrote:
>> [...]
>>>> Daniel, are you aware of any use cases that do rely on such a behavior
>>>> of PERF_EVENT_ARRAY?
>>>> 
>>>> For me this auto-removal of elements on closing *one of a few*
>>>> PERF_EVENT_ARRAY FDs (original one, but still just one of a few active
>>>> ones) was extremely surprising. It doesn't follow what we do for any
>>>> other BPF map, as far as I can tell. E.g., think about
>>>> BPF_MAP_TYPE_PROG_ARRAY. If we pin it in BPF FS and close FD, it won't
>>>> auto-remove all the tail call programs, right? There is exactly the
>>>> same concern with not auto-releasing bpf_progs, just like with
>>>> perf_event. But it's not accidental, if you are pinning a BPF map, you
>>>> know what you are doing (at least we have to assume so :).
>>>> 
>>>> So instead of adding an extra option, shouldn't we just fix this
>>>> behavior instead and make it the same across all BPF maps that hold
>>>> kernel resources?
>>> 
>>> Could you please share your thoughts on this? I personally don't have a
>>> strong preference one way (add a flag) or the other (change the default
>>> behavior). But I think we need to agree on which direction to go.
>> 
>> My preference would be to have an opt-in flag; we do rely on the auto-removal
>> of the perf event map entry on client close in Cilium at least. E.g. a monitor
>> application can insert itself into the map to start receiving events from
>> the BPF datapath, and upon exit (no matter whether graceful or not) we don't
>> consume any more cycles in the data path than necessary for events, and
>> in __bpf_perf_event_output() we bail out right after checking the
>> READ_ONCE(array->ptrs[index]) instead of pushing data that later on no one
>> picks up.
> 
> Well, then that's settled. We can't break existing use cases.

Thanks Daniel and Andrii! I will draft the patch with a new flag. 

Song



