From: Brian Vazquez <firstname.lastname@example.org>
To: Yonghong Song <email@example.com>
Cc: Alexei Starovoitov <firstname.lastname@example.org>,
Song Liu <email@example.com>,
Brian Vazquez <firstname.lastname@example.org>,
Daniel Borkmann <email@example.com>,
"David S . Miller" <firstname.lastname@example.org>,
Stanislav Fomichev <email@example.com>,
Willem de Bruijn <firstname.lastname@example.org>,
Petar Penkov <email@example.com>,
Networking <firstname.lastname@example.org>, bpf <email@example.com>
Subject: Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
Date: Fri, 26 Jul 2019 16:36:19 -0700 [thread overview]
Message-ID: <CABCgpaVB+iDGO132d9CTtC_GYiKJuuL6pe5_Krm3-THgvfMO=A@mail.gmail.com> (raw)
On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <firstname.lastname@example.org> wrote:
> On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <email@example.com> wrote:
> >> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
> >> <firstname.lastname@example.org> wrote:
> >>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> >>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
> >>>>>>> again. This is pretty weird.
> >>>>>> Yes, I know. But note that the current scenario happens even for the
> >>>>>> old interface (imagine you are walking a map from userspace and you
> >>>>>> tried get_next_key the prev_key was removed, you will start again from
> >>>>>> the beginning without noticing it).
> >>>>>> I tried to sent a patch in the past but I was missing some context:
> >>>>>> before NULL was used to get the very first_key the interface relied in
> >>>>>> a random (non existent) key to retrieve the first_key in the map, and
> >>>>>> I was told what we still have to support that scenario.
> >>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
> >>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> >>>>> don't have to support legacy scenarios.
> >>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> >>>>> to look up previous keys. Would something down this direction work?
> >>>> I've been thinking about it and I think first we need a way to detect
> >>>> that since key was not present we got the first_key instead:
> >>>> - One solution I had in mind was to explicitly asked for the first key
> >>>> with map_get_next_key(map, NULL, first_key) and while walking the map
> >>>> check that map_get_next_key(map, prev_key, key) doesn't return the
> >>>> same key. This could be done using memcmp.
> >>>> - Discussing with Stan, he mentioned that another option is to support
> >>>> a flag in map_get_next_key to let it know that we want an error
> >>>> instead of the first_key.
> >>>> After detecting the problem we also need to define what we want to do,
> >>>> here some options:
> >>>> a) Return the error to the caller
> >>>> b) Try with previous keys if any (which be limited to the keys that we
> >>>> have traversed so far in this dump call)
> >>>> c) continue with next entries in the map. array is easy just get the
> >>>> next valid key (starting on i+1), but hmap might be difficult since
> >>>> starting on the next bucket could potentially skip some keys that were
> >>>> concurrently added to the same bucket where key used to be, and
> >>>> starting on the same bucket could lead us to return repeated elements.
> >>>> Or maybe we could support those 3 cases via flags and let the caller
> >>>> decide which one to use?
> >>> this type of indecision is the reason why I wasn't excited about
> >>> batch dumping in the first place and gave 'soft yes' when Stan
> >>> mentioned it during lsf/mm/bpf uconf.
> >>> We probably shouldn't do it.
> >>> It feels this map_dump makes api more complex and doesn't really
> >>> give much benefit to the user other than large map dump becomes faster.
> >>> I think we gotta solve this problem differently.
> >> Some users are working around the dumping problems with the existing
> >> api by creating a bpf_map_get_next_key_and_delete userspace function
> >> (see https://urldefense.proofpoint.com/v2/url?u=https-3A__www.bouncybouncy.net_blog_bpf-5Fmap-5Fget-5Fnext-5Fkey-2Dpitfalls_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=DA8e1B5r073vIqRrFz7MRA&m=XvNxqsDhRi62gzZ04HbLRTOFJX8X6mTuK7PZGn80akY&s=7q7beZxOJJ3Q0el8L0r-xDctedSpnEejJ6PVX1XYotQ&e= )
> >> which in my opinion is actually a good idea. The only problem with
> >> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
> >> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code
> >> and it might lose some information when deleting.
> >> We could then do map_dump_and_delete using that idea but in the kernel
> >> where we could better handle the racing condition. In that scenario
> >> even if we retrieve the same key it will contain different info ( the
> >> delta between old and new value). Would that work?
> > you mean get_next+lookup+delete at once?
> > Sounds useful.
> > Yonghong has been thinking about batching api as well.
> In bcc, we have many instances like this:
> getting all (key value) pairs, do some analysis and output,
> delete all keys
> The implementation typically like
> /* to get all (key, value) pairs */
> while(bpf_get_next_key() == 0)
> /* do analysis and output */
> for (all keys)
If you do that in a map that is being modified while you are doing the
analysis and output, you will lose some new data by deleting the keys,
> get_next+lookup+delete will be definitely useful.
> batching will be even better to save the number of syscalls.
> An alternative is to do batch get_next+lookup and batch delete
> to achieve similar goal as the above code.
What I mentioned above is what it makes me think that with the
deletion it'd be better if we perform these 3 operations at once:
get_next+lookup+delete in a jumbo/atomic command and batch them later?
> There is a minor difference between this approach
> and the above get_next+lookup+delete.
> During scanning the hash map, get_next+lookup may get less number
> of elements compared to get_next+lookup+delete as the latter
> may have more later-inserted hash elements after the operation
> start. But both are inaccurate, so probably the difference
> is minor.
> > I think if we cannot figure out how to make a batch of two commands
> > get_next + lookup to work correctly then we need to identify/invent one
> > command and make batching more generic.
> not 100% sure. It will be hard to define what is "correctly".
I agree, it'll be hard to define what is the right behavior.
> For not changing map, looping of (get_next, lookup) and batch
> get_next+lookup should have the same results.
This is true for the api I'm presenting the only think that I was
missing was what to do for changing maps to avoid the weird scenario
(getting the first key due a concurrent deletion). And, in my opinion
the way to go should be what also Willem supported: return the err to
the caller and restart the dumping. I could do this with existing code
just by detecting that we do provide a prev_key and got the first_key
instead of the next_key or even implement a new function if you want
> For constant changing loops, not sure how to define which one
> is correct. If users have concerns, they may need to just pick one
> which gives them more comfort.
> > Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
> > Define the semantics of this single compound command.
> > And then let batching to be a multiplier of such command.
> > In a sense that multiplier 1 or N should be have the same way.
> > No extra flags to alter the batching.
> > The high level description of the batch would be:
> > pls execute get_next,lookup,delete and repeat it N times.
> > or
> > pls execute get_next,lookup and repeat N times.
But any attempt to do get_next+lookup will have same problem with
I don't see how we could do it more consistent than what I'm
proposing. Let's just support one case: report an error if the
prev_key was not found instead of retrieving the first_key. Would that
> > where each command action is defined to be composable.
> > Just a rough idea.
next prev parent reply other threads:[~2019-07-26 23:36 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
2019-07-24 20:53 ` Song Liu
2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 19:54 ` Willem de Bruijn
2019-07-24 22:26 ` Brian Vazquez
2019-07-24 22:33 ` Willem de Bruijn
2019-07-24 21:40 ` Song Liu
2019-07-24 22:44 ` Brian Vazquez
2019-07-24 23:04 ` Song Liu
2019-07-25 23:25 ` Brian Vazquez
2019-07-25 23:54 ` Alexei Starovoitov
2019-07-26 1:02 ` Willem de Bruijn
2019-07-26 1:24 ` Brian Vazquez
2019-07-26 1:47 ` Alexei Starovoitov
2019-07-26 6:10 ` Yonghong Song
2019-07-26 23:36 ` Brian Vazquez [this message]
2019-07-27 0:02 ` Jakub Kicinski
2019-07-27 17:54 ` Yonghong Song
2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
2019-07-24 21:41 ` Song Liu
2019-07-24 23:10 ` Andrii Nakryiko
2019-07-25 23:27 ` Brian Vazquez
2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
2019-07-24 19:51 ` Willem de Bruijn
2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
2019-07-24 21:58 ` Song Liu
2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
2019-07-24 22:01 ` Song Liu
2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
2019-07-24 22:15 ` Brian Vazquez
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).