bpf.vger.kernel.org archive mirror
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yhs@fb.com>,
	Jakub Kicinski <jakub.kicinski@netronome.com>,
	Brian Vazquez <brianvv@google.com>,
	Alexei Starovoitov <ast@fb.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
Date: Tue, 3 Sep 2019 14:01:29 -0700
Message-ID: <20190903210127.z6mhkryqg6qz62dq@ast-mbp.dhcp.thefacebook.com>
In-Reply-To: <20190830211809.GB2101@mini-arch>

On Fri, Aug 30, 2019 at 02:18:09PM -0700, Stanislav Fomichev wrote:
> > > 
> > > I personally like Jakub's/Quentin's proposal more. So if I get to choose
> > > between this series and Jakub's filter+dump in BPF, I'd pick filter+dump
> > > (pending per-cpu issue which we actually care about).
> > > 
> > > But if we can have both, I don't have any objections; this patch

I think we need to have both.
imo Jakub's and Yonghong's approaches are solving slightly different cases.

filter+dump via a program is better suited for LRU map walks where the filter
prog would do some non-trivial logic, whereas a plain 'delete all' or 'dump all'
is much simpler to use without loading yet another prog just to dump a map.
bpf infra today isn't quite ready for these very short-lived auxiliary progs.
At prog load, pages get a read-only mapping, TLBs across cpus are flushed,
kallsyms are populated, FDs are allocated, etc.
Loading a prog is a heavy operation. There was chatter before about having
built-in progs. This filter+dump could benefit from built-in 'allow all'
or 'delete all' progs, but imo that complicates the design and raises even
more questions than it answers. Should these built-in progs show up
in 'bpftool prog show'? When do they load/unload? Do they have the same safety
requirements as normal progs? etc.
imo it's fine to have a little bit of overlap between apis.
So I think we should proceed with both batching apis.

Having said that, I think both suffer from the important issue pointed out
by Brian: when the kernel deletes an element, a get_next_key iterator over a
hash/lru map will produce duplicates.
The number of duplicates can be huge. When the batched iterator is slow and a
bpf prog is doing a lot of updates/deletes, there could be 10x worth of
duplicates, since the walk will resume from the beginning.
User space cannot be tasked to deal with that.
I think this issue has to be solved in the kernel first, and it may require a
different batching api.
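
To illustrate where the duplicates come from, here is a minimal sketch of
today's user-space walk (using libbpf's bpf_map_get_next_key() and
bpf_map_lookup_elem(); the function name and the key/value sizes are made up
for illustration). If the key fed back into get_next_key has been deleted in
the meantime, the kernel cannot find it and restarts from the first element,
so already-dumped entries come back again:

#include <errno.h>
#include <string.h>
#include <bpf/bpf.h>

/* Illustrative walk over a hash map with the existing get_next_key API. */
static int dump_all(int map_fd, size_t key_size, size_t value_size)
{
	char cur[key_size], next[key_size], value[value_size];
	void *prev = NULL;	/* NULL means "give me the first key" */

	for (;;) {
		if (bpf_map_get_next_key(map_fd, prev, next)) {
			if (errno == ENOENT)
				return 0;	/* reached the end of the map */
			return -errno;
		}
		/* the element can disappear between these two syscalls */
		if (bpf_map_lookup_elem(map_fd, next, value) == 0) {
			/* consume next/value here */
		}

		memcpy(cur, next, key_size);
		prev = cur;	/* if this key is deleted before the next call,
				 * the walk resumes from element 0 and produces
				 * duplicates */
	}
}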

One idea is to use the bucket spin_lock and process the map bucket-at-a-time.
From an api pov user space will tell the kernel:
- here is a buffer for N elements; start the dump from the beginning.
- the kernel will return <= N elements and an iterator.
- user space will pass this opaque iterator back to get another batch.
For a well behaved hash/lru map there will be zero or one elements per bucket.
When there are 2+ the batching logic can process them together.
If a 'lookup' is requested, the kernel can check whether user space provided
enough space for those 2 elements and, if not, abort the batch early.
get_next_key won't be used. Instead some sort of opaque iterator
will be returned to user space, so the next batch lookup can start from it.
This iterator could be the index of the last dumped bucket.
This idea won't work for pathological hash tables though.
A single bucket may hold more elements than there is room for in one batch.
In that case the iterator would get stuck, since num_of_elements_in_bucket >
batch_buf_size. Maybe a special error code can be used to solve that?
A rough sketch of this protocol follows below.
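
To make the handshake concrete, here is a rough user-space sketch of that
protocol. The struct, its fields, the map_lookup_batch() stand-in and the
-ENOSPC convention are all made up for illustration and are not an existing
uapi; they only show the buffer + opaque-cursor flow described above:

#include <errno.h>
#include <linux/types.h>

/* Hypothetical attributes for a bucket-at-a-time batch lookup. */
struct map_batch_attr {
	__u32 map_fd;
	__u32 count;	/* in: capacity of keys/values buffers (in elements),
			 * out: number of elements returned in this batch */
	__u64 iter;	/* in/out: opaque cursor, e.g. index of the last fully
			 * dumped bucket; 0 starts from the beginning */
	__u64 keys;	/* user pointer to count * key_size bytes */
	__u64 values;	/* user pointer to count * value_size bytes */
};

/* Stand-in for the proposed (not yet existing) batched lookup command. */
int map_lookup_batch(struct map_batch_attr *attr);

/* Keep requesting batches, passing the kernel-provided cursor back
 * unmodified, until the kernel reports the end of the table. */
static int dump_batched(int map_fd, void *keys, void *values, __u32 capacity)
{
	struct map_batch_attr attr = {
		.map_fd = map_fd,
		.count  = capacity,
		.iter   = 0,
		.keys   = (__u64)(unsigned long)keys,
		.values = (__u64)(unsigned long)values,
	};

	for (;;) {
		int err = map_lookup_batch(&attr);

		if (err == -ENOENT)
			return 0;	/* all buckets have been dumped */
		if (err == -ENOSPC)
			return err;	/* one bucket holds more elements than
					 * the buffer: grow it and retry */
		if (err)
			return err;

		/* consume attr.count key/value pairs here */

		attr.count = capacity;	/* reset capacity; attr.iter already
					 * points past the last dumped bucket */
	}
}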

I hope we can come up with other ideas for a stable iterator over a hash table.
Let's use email to describe the ideas and the upcoming LPC conference to
sort out the details and finalize the one to use.


Thread overview: 38+ messages
2019-08-29  6:45 [PATCH bpf-next 00/13] bpf: adding map batch processing support Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 01/13] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Yonghong Song
2019-08-29 22:04   ` Song Liu
2019-08-30  6:40     ` Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 02/13] bpf: refactor map_update_elem() Yonghong Song
2019-08-29 23:37   ` Song Liu
2019-08-29  6:45 ` [PATCH bpf-next 03/13] bpf: refactor map_delete_elem() Yonghong Song
2019-08-29 23:39   ` Song Liu
2019-08-29  6:45 ` [PATCH bpf-next 04/13] bpf: refactor map_get_next_key() Yonghong Song
2019-08-29 23:39   ` Song Liu
2019-08-29  6:45 ` [PATCH bpf-next 05/13] bpf: adding map batch processing support Yonghong Song
2019-08-29 23:01   ` Brian Vazquez
2019-08-30  6:39     ` Yonghong Song
2019-08-30  6:58       ` Alexei Starovoitov
2019-08-29  6:45 ` [PATCH bpf-next 06/13] tools/bpf: sync uapi header bpf.h Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 07/13] tools/bpf: implement libbpf API functions for map batch operations Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 08/13] tools/bpf: add test for bpf_map_update_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 09/13] tools/bpf: add test for bpf_map_lookup_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 10/13] tools/bpf: add test for bpf_map_lookup_and_delete_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 11/13] tools/bpf: add test for bpf_map_delete_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 12/13] tools/bpf: add a multithreaded test for map batch operations Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 13/13] tools/bpf: measure map batching perf Yonghong Song
2019-08-29 18:39 ` [PATCH bpf-next 00/13] bpf: adding map batch processing support Jakub Kicinski
2019-08-29 23:13   ` Brian Vazquez
2019-08-30  0:15     ` Jakub Kicinski
2019-08-30 20:15       ` Stanislav Fomichev
2019-08-30 20:55         ` Yonghong Song
2019-08-30 21:10           ` Jakub Kicinski
2019-08-30 22:24             ` Yonghong Song
2019-08-30 21:18           ` Stanislav Fomichev
2019-09-03 21:01             ` Alexei Starovoitov [this message]
2019-09-03 22:30               ` Stanislav Fomichev
2019-09-03 23:07                 ` Brian Vazquez
2019-09-04  1:35                   ` Alexei Starovoitov
2019-09-03 23:07                 ` Yonghong Song
2019-08-30  7:25   ` Yonghong Song
2019-08-30 21:35     ` Jakub Kicinski
2019-08-30 22:38       ` Yonghong Song
