Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call

From: Brian Vazquez <brianvv.kernel@gmail.com>
To: Yonghong Song <yhs@fb.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Song Liu <liu.song.a23@gmail.com>,
	Brian Vazquez <brianvv@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S . Miller" <davem@davemloft.net>,
	Stanislav Fomichev <sdf@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Petar Penkov <ppenkov@google.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
Date: Fri, 26 Jul 2019 16:36:19 -0700	[thread overview]
Message-ID: <CABCgpaVB+iDGO132d9CTtC_GYiKJuuL6pe5_Krm3-THgvfMO=A@mail.gmail.com> (raw)
In-Reply-To: <beb513cb-2d76-30d4-6500-2892c6566a7e@fb.com>

On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
> >>
> >> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> >>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
> >>>>>>> again. This is pretty weird.
> >>>>>>
> >>>>>> Yes, I know. But note that the current scenario happens even for the
> >>>>>> old interface (imagine you are walking a map from userspace and you
> >>>>>> tried get_next_key the prev_key was removed, you will start again from
> >>>>>> the beginning without noticing it).
> >>>>>> I tried to sent a patch in the past but I was missing some context:
> >>>>>> before NULL was used to get the very first_key the interface relied in
> >>>>>> a random (non existent) key to retrieve the first_key in the map, and
> >>>>>> I was told what we still have to support that scenario.
> >>>>>
> >>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
> >>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> >>>>> don't have to support legacy scenarios.
> >>>>>
> >>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> >>>>> to look up previous keys. Would something down this direction work?
> >>>>
> >>>> I've been thinking about it and I think first we need a way to detect
> >>>> that since key was not present we got the first_key instead:
> >>>>
> >>>> - One solution I had in mind was to explicitly asked for the first key
> >>>> with map_get_next_key(map, NULL, first_key) and while walking the map
> >>>> check that map_get_next_key(map, prev_key, key) doesn't return the
> >>>> same key. This could be done using memcmp.
> >>>> - Discussing with Stan, he mentioned that another option is to support
> >>>> a flag in map_get_next_key to let it know that we want an error
> >>>> instead of the first_key.
> >>>>
> >>>> After detecting the problem we also need to define what we want to do,
> >>>> here some options:
> >>>>
> >>>> a) Return the error to the caller
> >>>> b) Try with previous keys if any (which be limited to the keys that we
> >>>> have traversed so far in this dump call)
> >>>> c) continue with next entries in the map. array is easy just get the
> >>>> next valid key (starting on i+1), but hmap might be difficult since
> >>>> starting on the next bucket could potentially skip some keys that were
> >>>> concurrently added to the same bucket where key used to be, and
> >>>> starting on the same bucket could lead us to return repeated elements.
> >>>>
> >>>> Or maybe we could support those 3 cases via flags and let the caller
> >>>> decide which one to use?
> >>>
> >>> this type of indecision is the reason why I wasn't excited about
> >>> batch dumping in the first place and gave 'soft yes' when Stan
> >>> mentioned it during lsf/mm/bpf uconf.
> >>> We probably shouldn't do it.
> >>> It feels this map_dump makes api more complex and doesn't really
> >>> give much benefit to the user other than large map dump becomes faster.
> >>> I think we gotta solve this problem differently.
> >>
> >> Some users are working around the dumping problems with the existing
> >> api by creating a bpf_map_get_next_key_and_delete userspace function
> >> (see https://urldefense.proofpoint.com/v2/url?u=https-3A__www.bouncybouncy.net_blog_bpf-5Fmap-5Fget-5Fnext-5Fkey-2Dpitfalls_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=DA8e1B5r073vIqRrFz7MRA&m=XvNxqsDhRi62gzZ04HbLRTOFJX8X6mTuK7PZGn80akY&s=7q7beZxOJJ3Q0el8L0r-xDctedSpnEejJ6PVX1XYotQ&e= )
> >> which in my opinion is actually a good idea. The only problem with
> >> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
> >> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code
> >> and it might lose some information when deleting.
> >> We could then do map_dump_and_delete using that idea but in the kernel
> >> where we could better handle the racing condition. In that scenario
> >> even if we retrieve the same key it will contain different info ( the
> >> delta between old and new value). Would that work?
> >
> > you mean get_next+lookup+delete at once?
> > Sounds useful.
> > Yonghong has been thinking about batching api as well.
>
> In bcc, we have many instances like this:
>     getting all (key value) pairs, do some analysis and output,
>     delete all keys
>
> The implementation typically like
>     /* to get all (key, value) pairs */
>     while(bpf_get_next_key() == 0)
>       bpf_map_lookup()
>     /* do analysis and output */
>     for (all keys)
>       bpf_map_delete()

If you do that in a map that is being modified while you are doing the
analysis and output, you will lose some new data by deleting the keys,
right?

> get_next+lookup+delete will be definitely useful.
> batching will be even better to save the number of syscalls.
>
> An alternative is to do batch get_next+lookup and batch delete
> to achieve similar goal as the above code.

What I mentioned above is what it makes me think that with the
deletion it'd be better if we perform these 3 operations at once:
get_next+lookup+delete in a jumbo/atomic command and batch them later?

>
> There is a minor difference between this approach
> and the above get_next+lookup+delete.
> During scanning the hash map, get_next+lookup may get less number
> of elements compared to get_next+lookup+delete as the latter
> may have more later-inserted hash elements after the operation
> start. But both are inaccurate, so probably the difference
> is minor.
>
> >
> > I think if we cannot figure out how to make a batch of two commands
> > get_next + lookup to work correctly then we need to identify/invent one
> > command and make batching more generic.
>
> not 100% sure. It will be hard to define what is "correctly".

I agree, it'll be hard to define what is the right behavior.

> For not changing map, looping of (get_next, lookup) and batch
> get_next+lookup should have the same results.

This is true for the api I'm presenting the only think that I was
missing was what to do for changing maps to avoid the weird scenario
(getting the first key due a concurrent deletion). And, in my opinion
the way to go should be what also Willem supported: return the err to
the caller and restart the dumping. I could do this with existing code
just by detecting that we do provide a prev_key and got the first_key
instead of the next_key or even implement a new function if you want
to.

> For constant changing loops, not sure how to define which one
> is correct. If users have concerns, they may need to just pick one
> which gives them more comfort.
>
> > Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
> > Define the semantics of this single compound command.
> > And then let batching to be a multiplier of such command.
> > In a sense that multiplier 1 or N should be have the same way.
> > No extra flags to alter the batching.
> > The high level description of the batch would be:
> > pls execute get_next,lookup,delete and repeat it N times.
> > or
> > pls execute get_next,lookup and repeat N times.

But any attempt to do get_next+lookup will have same problem with
deletions right?

I don't see how we could do it more consistent than what I'm
proposing. Let's just support one case: report an error if the
prev_key was not found instead of retrieving the first_key. Would that
work?

> > where each command action is defined to be composable.
> >
> > Just a rough idea.
> >