bpf.vger.kernel.org archive mirror
From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Brian Vazquez <brianvv.kernel@gmail.com>
Cc: Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Song Liu <liu.song.a23@gmail.com>,
	Brian Vazquez <brianvv@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S . Miller" <davem@davemloft.net>,
	Stanislav Fomichev <sdf@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Petar Penkov <ppenkov@google.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
Date: Fri, 26 Jul 2019 17:02:34 -0700	[thread overview]
Message-ID: <20190726170234.1b925a97@cakuba.netronome.com> (raw)
In-Reply-To: <CABCgpaVB+iDGO132d9CTtC_GYiKJuuL6pe5_Krm3-THgvfMO=A@mail.gmail.com>

On Fri, 26 Jul 2019 16:36:19 -0700, Brian Vazquez wrote:
> > In bcc, we have many instances like this:
> >     getting all (key value) pairs, do some analysis and output,
> >     delete all keys
> >
> > The implementation is typically like:
> >     /* to get all (key, value) pairs */
> >     while(bpf_get_next_key() == 0)
> >       bpf_map_lookup()
> >     /* do analysis and output */
> >     for (all keys)
> >       bpf_map_delete()  
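
For concreteness, the bcc-style pass above can be sketched as the loop below. This is an illustrative userspace sketch against a toy in-memory map (struct toy_map and the toy_* helpers are invented for the example); a real tool would instead call libbpf's bpf_map_get_next_key(), bpf_map_lookup_elem() and bpf_map_delete_elem() on a map fd:

```c
#include <stddef.h>

/* Toy stand-in for a BPF hash map: fixed slots of (key, value). */
#define SLOTS 8

struct toy_map {
	int used[SLOTS];
	unsigned int key[SLOTS];
	long value[SLOTS];
};

/* Mimics bpf_map_get_next_key(): prev == NULL yields the first key,
 * otherwise the key stored after *prev. Returns 0 on success, -1 at end. */
static int toy_get_next_key(struct toy_map *m, const unsigned int *prev,
			    unsigned int *next)
{
	int i = 0;

	if (prev) {
		for (i = 0; i < SLOTS; i++)
			if (m->used[i] && m->key[i] == *prev)
				break;
		i++;
	}
	for (; i < SLOTS; i++) {
		if (m->used[i]) {
			*next = m->key[i];
			return 0;
		}
	}
	return -1;
}

static int toy_lookup(struct toy_map *m, unsigned int key, long *value)
{
	int i;

	for (i = 0; i < SLOTS; i++)
		if (m->used[i] && m->key[i] == key) {
			*value = m->value[i];
			return 0;
		}
	return -1;
}

static int toy_delete(struct toy_map *m, unsigned int key)
{
	int i;

	for (i = 0; i < SLOTS; i++)
		if (m->used[i] && m->key[i] == key) {
			m->used[i] = 0;
			return 0;
		}
	return -1;
}

/* The bcc-style pass: walk all (key, value) pairs, "analyze" (here: sum
 * the values), then delete every key that was seen. */
static int scan_analyze_delete(struct toy_map *m, long *sum)
{
	unsigned int cur, next, keys[SLOTS];
	unsigned int *prev = NULL;
	int n = 0, i;
	long val;

	*sum = 0;
	while (toy_get_next_key(m, prev, &next) == 0) {
		if (toy_lookup(m, next, &val) == 0) {
			*sum += val;
			keys[n++] = next;
		}
		cur = next;
		prev = &cur;
	}
	for (i = 0; i < n; i++)
		toy_delete(m, keys[i]);
	return n;
}
```

Note the gap this models: keys inserted between the scan and the delete pass are simply never seen, while values updated between lookup and delete are lost.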
> 
> If you do that in a map that is being modified while you are doing the
> analysis and output, you will lose some new data by deleting the keys,
> right?
> 
> > get_next+lookup+delete will be definitely useful.
> > batching will be even better to save the number of syscalls.
> >
> > An alternative is to do batch get_next+lookup and batch delete
> > to achieve similar goal as the above code.  
> 
> What I mentioned above is what makes me think that, with the
> deletion, it'd be better to perform these 3 operations at once:
> get_next+lookup+delete as a jumbo/atomic command, and batch them later?

Hm. The lookup+delete is only "atomic" if we insert an RCU sync in
between, right? The elements' lifetime is protected by RCU.

	CPU 1				CPU 2
					val = lookup(map, key)				
	val = lookup(map, key)
	delete(map, key)
	dump(val)
					val->counter++

So we'd need to walk the hash table, disconnect all lists from the
buckets and splice them onto one combined list, then synchronize_rcu()
and only after that we can dump the elements and free them.
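
A minimal userspace model of that sequence might look like the sketch below; fake_synchronize_rcu() is a no-op stand-in, since the real synchronize_rcu() grace-period primitive exists only in the kernel, and the bucket array is a plain singly linked table rather than the kernel's hlist/RCU machinery:

```c
#include <stdlib.h>
#include <stddef.h>

struct elem {
	unsigned int key;
	long value;
	struct elem *next;
};

#define NBUCKETS 4

/* Stub: in the kernel this would be synchronize_rcu(), which waits for
 * all pre-existing RCU read-side critical sections to finish. */
static void fake_synchronize_rcu(void) { }

/* Disconnect every bucket and splice the chains onto one combined list.
 * After this, no new reader can find the elements via the table. */
static struct elem *detach_all(struct elem *buckets[NBUCKETS])
{
	struct elem *all = NULL;
	int b;

	for (b = 0; b < NBUCKETS; b++) {
		struct elem *e = buckets[b];

		buckets[b] = NULL;	/* bucket now empty for new lookups */
		while (e) {
			struct elem *n = e->next;

			e->next = all;
			all = e;
			e = n;
		}
	}
	return all;
}

/* Dump (here: sum the values) and free, but only after the grace period,
 * so no reader can still be incrementing a value we're dumping. */
static long dump_and_free(struct elem *buckets[NBUCKETS], int *count)
{
	struct elem *all = detach_all(buckets);
	long sum = 0;

	fake_synchronize_rcu();
	*count = 0;
	while (all) {
		struct elem *n = all->next;

		sum += all->value;
		(*count)++;
		free(all);
		all = n;
	}
	return sum;
}
```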

> > There is a minor difference between this approach
> > and the above get_next+lookup+delete.
> > While scanning the hash map, get_next+lookup may return fewer
> > elements than get_next+lookup+delete, as the latter may pick up
> > more hash elements inserted after the operation starts. But both
> > are inaccurate, so the difference is probably minor.

> > For not changing map, looping of (get_next, lookup) and batch
> > get_next+lookup should have the same results.  
> 
> This is true for the API I'm presenting; the only thing I was
> missing was what to do for changing maps to avoid the weird scenario
> (getting the first key due to a concurrent deletion). And, in my
> opinion, the way to go should be what Willem also supported: return
> the error to the caller and restart the dumping. I could do this with
> existing code just by detecting that we did provide a prev_key and
> got the first_key instead of the next_key, or even implement a new
> function if you want to.
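
The detection Brian describes can be sketched roughly as below, again against a toy map (toy_map, toy_get_next_key() and dump_next() are invented names for illustration): BPF hashtab's get_next_key falls back to the first key when the passed key is gone, so seeing the first key again while holding a prev_key signals that the walk should restart:

```c
#include <stddef.h>
#include <errno.h>

#define SLOTS 8

struct toy_map {
	int used[SLOTS];
	unsigned int key[SLOTS];
};

/* Models BPF hashtab get_next_key semantics: if prev is NULL, or no
 * longer present (e.g. deleted concurrently), fall back to the first key. */
static int toy_get_next_key(struct toy_map *m, const unsigned int *prev,
			    unsigned int *next)
{
	int i, start = 0;

	if (prev) {
		for (i = 0; i < SLOTS; i++)
			if (m->used[i] && m->key[i] == *prev)
				break;
		if (i < SLOTS)
			start = i + 1;	/* prev found: continue after it */
		/* prev not found: start stays 0 -> first key */
	}
	for (i = start; i < SLOTS; i++)
		if (m->used[i]) {
			*next = m->key[i];
			return 0;
		}
	return -ENOENT;
}

/* Dump step: if we passed a prev_key but got the map's first key back,
 * prev_key must have been deleted under us -- tell the caller to restart. */
static int dump_next(struct toy_map *m, const unsigned int *prev,
		     unsigned int *out)
{
	unsigned int first;

	if (toy_get_next_key(m, prev, out))
		return -ENOENT;		/* end of map */
	if (prev && toy_get_next_key(m, NULL, &first) == 0 && *out == first)
		return -EAGAIN;		/* wrapped to first key: restart */
	return 0;
}
```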

My knee-jerk reaction to Willem's proposal was to nitpick that sizing
the dump space to the map's max entries doesn't guarantee all entries
will fit if an entry can be removed and re-added in a different place
while the dumper proceeds.

If we keep elements disconnected from the hash array _without_
decrementing element count until synchronize_rcu() completes we have
that guarantee, but OTOH it may not be possible to add any new entries
from the datapath, so that doesn't sound great either :/


Thread overview: 31+ messages
2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
2019-07-24 20:53   ` Song Liu
2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 19:54   ` Willem de Bruijn
2019-07-24 22:26     ` Brian Vazquez
2019-07-24 22:33       ` Willem de Bruijn
2019-07-24 21:40   ` Song Liu
2019-07-24 22:44     ` Brian Vazquez
2019-07-24 23:04       ` Song Liu
2019-07-25 23:25         ` Brian Vazquez
2019-07-25 23:54           ` Alexei Starovoitov
2019-07-26  1:02             ` Willem de Bruijn
2019-07-26  1:24             ` Brian Vazquez
2019-07-26  1:47               ` Alexei Starovoitov
2019-07-26  6:10                 ` Yonghong Song
2019-07-26 23:36                   ` Brian Vazquez
2019-07-27  0:02                     ` Jakub Kicinski [this message]
2019-07-27 17:54                     ` Yonghong Song
2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
2019-07-24 21:41   ` Song Liu
2019-07-24 23:10   ` Andrii Nakryiko
2019-07-25 23:27     ` Brian Vazquez
2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
2019-07-24 19:51   ` Willem de Bruijn
2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
2019-07-24 21:58   ` Song Liu
2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
2019-07-24 22:01   ` Song Liu
2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
2019-07-24 22:15   ` Brian Vazquez
