bpf.vger.kernel.org archive mirror
From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Brian Vazquez <brianvv.kernel@gmail.com>
Cc: Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Song Liu <liu.song.a23@gmail.com>,
	Brian Vazquez <brianvv@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S . Miller" <davem@davemloft.net>,
	Stanislav Fomichev <sdf@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Petar Penkov <ppenkov@google.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
Date: Fri, 26 Jul 2019 17:02:34 -0700	[thread overview]
Message-ID: <20190726170234.1b925a97@cakuba.netronome.com> (raw)
In-Reply-To: <CABCgpaVB+iDGO132d9CTtC_GYiKJuuL6pe5_Krm3-THgvfMO=A@mail.gmail.com>

On Fri, 26 Jul 2019 16:36:19 -0700, Brian Vazquez wrote:
> > In bcc, we have many instances like this:
> >     getting all (key value) pairs, do some analysis and output,
> >     delete all keys
> >
> > The implementation is typically like:
> >     /* to get all (key, value) pairs */
> >     while(bpf_get_next_key() == 0)
> >       bpf_map_lookup()
> >     /* do analysis and output */
> >     for (all keys)
> >       bpf_map_delete()  
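
For concreteness, the bcc-style pass above can be sketched as the loop below. This is an illustrative userspace sketch against a toy in-memory map (struct toy_map and the toy_* helpers are invented for the example); a real tool would instead call libbpf's bpf_map_get_next_key(), bpf_map_lookup_elem() and bpf_map_delete_elem() on a map fd:

```c
#include <stddef.h>

/* Toy stand-in for a BPF hash map: fixed slots of (key, value). */
#define SLOTS 8

struct toy_map {
	int used[SLOTS];
	unsigned int key[SLOTS];
	long value[SLOTS];
};

/* Mimics bpf_map_get_next_key(): prev == NULL yields the first key,
 * otherwise the key stored after *prev. Returns 0 on success, -1 at end. */
static int toy_get_next_key(struct toy_map *m, const unsigned int *prev,
			    unsigned int *next)
{
	int i = 0;

	if (prev) {
		for (i = 0; i < SLOTS; i++)
			if (m->used[i] && m->key[i] == *prev)
				break;
		i++;
	}
	for (; i < SLOTS; i++) {
		if (m->used[i]) {
			*next = m->key[i];
			return 0;
		}
	}
	return -1;
}

static int toy_lookup(struct toy_map *m, unsigned int key, long *value)
{
	int i;

	for (i = 0; i < SLOTS; i++)
		if (m->used[i] && m->key[i] == key) {
			*value = m->value[i];
			return 0;
		}
	return -1;
}

static int toy_delete(struct toy_map *m, unsigned int key)
{
	int i;

	for (i = 0; i < SLOTS; i++)
		if (m->used[i] && m->key[i] == key) {
			m->used[i] = 0;
			return 0;
		}
	return -1;
}

/* The bcc-style pass: walk all (key, value) pairs, "analyze" (here: sum
 * the values), then delete every key that was seen. */
static int scan_analyze_delete(struct toy_map *m, long *sum)
{
	unsigned int cur, next, keys[SLOTS];
	unsigned int *prev = NULL;
	int n = 0, i;
	long val;

	*sum = 0;
	while (toy_get_next_key(m, prev, &next) == 0) {
		if (toy_lookup(m, next, &val) == 0) {
			*sum += val;
			keys[n++] = next;
		}
		cur = next;
		prev = &cur;
	}
	for (i = 0; i < n; i++)
		toy_delete(m, keys[i]);
	return n;
}
```

Note the gap this models: keys inserted between the scan and the delete pass are simply never seen, while values updated between lookup and delete are lost.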
> 
> If you do that in a map that is being modified while you are doing the
> analysis and output, you will lose some new data by deleting the keys,
> right?
> 
> > get_next+lookup+delete will be definitely useful.
> > batching will be even better to save the number of syscalls.
> >
> > An alternative is to do batch get_next+lookup and batch delete
> > to achieve similar goal as the above code.  
> 
> What I mentioned above is what makes me think that, with the
> deletion, it'd be better to perform these 3 operations at once:
> get_next+lookup+delete as a jumbo/atomic command, and batch them later?

Hm. The lookup+delete is only "atomic" if we insert an RCU sync in
between, right? The elements' lifetime is protected by RCU.

	CPU 1				CPU 2
					val = lookup(map, key)				
	val = lookup(map, key)
	delete(map, key)
	dump(val)
					val->counter++

So we'd need to walk the hash table, disconnect all lists from the
buckets and splice them onto one combined list, then synchronize_rcu()
and only after that we can dump the elements and free them.
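
A minimal userspace model of that sequence might look like the sketch below; fake_synchronize_rcu() is a no-op stand-in, since the real synchronize_rcu() grace-period primitive exists only in the kernel, and the bucket array is a plain singly linked table rather than the kernel's hlist/RCU machinery:

```c
#include <stdlib.h>
#include <stddef.h>

struct elem {
	unsigned int key;
	long value;
	struct elem *next;
};

#define NBUCKETS 4

/* Stub: in the kernel this would be synchronize_rcu(), which waits for
 * all pre-existing RCU read-side critical sections to finish. */
static void fake_synchronize_rcu(void) { }

/* Disconnect every bucket and splice the chains onto one combined list.
 * After this, no new reader can find the elements via the table. */
static struct elem *detach_all(struct elem *buckets[NBUCKETS])
{
	struct elem *all = NULL;
	int b;

	for (b = 0; b < NBUCKETS; b++) {
		struct elem *e = buckets[b];

		buckets[b] = NULL;	/* bucket now empty for new lookups */
		while (e) {
			struct elem *n = e->next;

			e->next = all;
			all = e;
			e = n;
		}
	}
	return all;
}

/* Dump (here: sum the values) and free, but only after the grace period,
 * so no reader can still be incrementing a value we're dumping. */
static long dump_and_free(struct elem *buckets[NBUCKETS], int *count)
{
	struct elem *all = detach_all(buckets);
	long sum = 0;

	fake_synchronize_rcu();
	*count = 0;
	while (all) {
		struct elem *n = all->next;

		sum += all->value;
		(*count)++;
		free(all);
		all = n;
	}
	return sum;
}
```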

> > There is a minor difference between this approach
> > and the above get_next+lookup+delete.
> > While scanning the hash map, get_next+lookup may return fewer
> > elements than get_next+lookup+delete, as the latter may pick up
> > more hash elements inserted after the operation starts. But both
> > are inaccurate, so the difference is probably minor.

> > For not changing map, looping of (get_next, lookup) and batch
> > get_next+lookup should have the same results.  
> 
> This is true for the API I'm presenting; the only thing I was
> missing was what to do for changing maps to avoid the weird scenario
> (getting the first key due to a concurrent deletion). And, in my
> opinion, the way to go should be what Willem also supported: return
> the error to the caller and restart the dumping. I could do this with
> existing code just by detecting that we did provide a prev_key and
> got the first_key instead of the next_key, or even implement a new
> function if you want to.
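
The detection Brian describes can be sketched roughly as below, again against a toy map (toy_map, toy_get_next_key() and dump_next() are invented names for illustration): BPF hashtab's get_next_key falls back to the first key when the passed key is gone, so seeing the first key again while holding a prev_key signals that the walk should restart:

```c
#include <stddef.h>
#include <errno.h>

#define SLOTS 8

struct toy_map {
	int used[SLOTS];
	unsigned int key[SLOTS];
};

/* Models BPF hashtab get_next_key semantics: if prev is NULL, or no
 * longer present (e.g. deleted concurrently), fall back to the first key. */
static int toy_get_next_key(struct toy_map *m, const unsigned int *prev,
			    unsigned int *next)
{
	int i, start = 0;

	if (prev) {
		for (i = 0; i < SLOTS; i++)
			if (m->used[i] && m->key[i] == *prev)
				break;
		if (i < SLOTS)
			start = i + 1;	/* prev found: continue after it */
		/* prev not found: start stays 0 -> first key */
	}
	for (i = start; i < SLOTS; i++)
		if (m->used[i]) {
			*next = m->key[i];
			return 0;
		}
	return -ENOENT;
}

/* Dump step: if we passed a prev_key but got the map's first key back,
 * prev_key must have been deleted under us -- tell the caller to restart. */
static int dump_next(struct toy_map *m, const unsigned int *prev,
		     unsigned int *out)
{
	unsigned int first;

	if (toy_get_next_key(m, prev, out))
		return -ENOENT;		/* end of map */
	if (prev && toy_get_next_key(m, NULL, &first) == 0 && *out == first)
		return -EAGAIN;		/* wrapped to first key: restart */
	return 0;
}
```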

My knee-jerk reaction to Willem's proposal was to nitpick that sizing
the dump space to the map's max entries doesn't guarantee all entries
will fit if an entry can be removed and re-added in a different place
while the dumper proceeds.

If we keep elements disconnected from the hash array _without_
decrementing element count until synchronize_rcu() completes we have
that guarantee, but OTOH it may not be possible to add any new entries
from the datapath, so that doesn't sound great either :/


Thread overview: 31+ messages
2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
2019-07-24 20:53   ` Song Liu
2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 19:54   ` Willem de Bruijn
2019-07-24 22:26     ` Brian Vazquez
2019-07-24 22:33       ` Willem de Bruijn
2019-07-24 21:40   ` Song Liu
2019-07-24 22:44     ` Brian Vazquez
2019-07-24 23:04       ` Song Liu
2019-07-25 23:25         ` Brian Vazquez
2019-07-25 23:54           ` Alexei Starovoitov
2019-07-26  1:02             ` Willem de Bruijn
2019-07-26  1:24             ` Brian Vazquez
2019-07-26  1:47               ` Alexei Starovoitov
2019-07-26  6:10                 ` Yonghong Song
2019-07-26 23:36                   ` Brian Vazquez
2019-07-27  0:02                     ` Jakub Kicinski [this message]
2019-07-27 17:54                     ` Yonghong Song
2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
2019-07-24 21:41   ` Song Liu
2019-07-24 23:10   ` Andrii Nakryiko
2019-07-25 23:27     ` Brian Vazquez
2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
2019-07-24 19:51   ` Willem de Bruijn
2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
2019-07-24 21:58   ` Song Liu
2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
2019-07-24 22:01   ` Song Liu
2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
2019-07-24 22:15   ` Brian Vazquez
