From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Yonghong Song <yhs@fb.com>, Alexei Starovoitov <ast@fb.com>
Cc: <bpf@vger.kernel.org>, <netdev@vger.kernel.org>,
	Brian Vazquez <brianvv@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>, <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
Date: Thu, 29 Aug 2019 11:39:32 -0700	[thread overview]
Message-ID: <20190829113932.5c058194@cakuba.netronome.com> (raw)
In-Reply-To: <20190829064502.2750303-1-yhs@fb.com>

On Wed, 28 Aug 2019 23:45:02 -0700, Yonghong Song wrote:
> Brian Vazquez has proposed a BPF_MAP_DUMP command to look up more than
> one map entry per syscall.
>   https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@mail.gmail.com/T/#t
> 
> During discussion, we found that more use cases can be supported in a
> similar map operation batching framework, for example batched map
> lookup-and-delete, which can be really helpful for bcc.
>   https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
>   https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138
>     
> Also, in bcc, we have API to delete all entries in a map.
>   https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264
> 
> For map update, batched operations are also useful, as applications
> sometimes need to populate initial maps with more than one entry. One
> example is in kernel/samples/bpf/xdp_redirect_cpu_user.c:
>   https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550
> 
> This patch set addresses all the above use cases. To keep the uapi
> stable, it also covers other potential use cases. Four bpf syscall
> subcommands are introduced:
>     BPF_MAP_LOOKUP_BATCH
>     BPF_MAP_LOOKUP_AND_DELETE_BATCH
>     BPF_MAP_UPDATE_BATCH
>     BPF_MAP_DELETE_BATCH
> 
> In userspace, an application can iterate through the whole map one
> batch at a time, e.g., with bpf_map_lookup_batch() as below:
>     p_key = NULL;
>     p_next_key = &key;
>     while (true) {
>        err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
>                                   &batch_size, elem_flags, flags);
>        if (err) ...
>        if (!p_next_key) break; // done, no more entries
>        if (!p_key) p_key = p_next_key;
>     }
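> 
> For illustration only (process_elem, key_size and value_size are
> placeholders here, and this assumes batch_size is updated to the number
> of entries actually returned), each returned batch would then be
> consumed from the parallel keys/values arrays roughly like:
>     for (i = 0; i < batch_size; i++)
>         process_elem(keys + i * key_size, values + i * value_size);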
> Please look at the individual patches for details of the new syscall
> subcommands and examples of user code.
> 
> Testing was done in a qemu VM environment:
>       measure_lookup: max_entries 1000000, batch 10, time 342ms
>       measure_lookup: max_entries 1000000, batch 1000, time 295ms
>       measure_lookup: max_entries 1000000, batch 1000000, time 270ms
>       measure_lookup: max_entries 1000000, no batching, time 1346ms
>       measure_lookup_delete: max_entries 1000000, batch 10, time 433ms
>       measure_lookup_delete: max_entries 1000000, batch 1000, time 363ms
>       measure_lookup_delete: max_entries 1000000, batch 1000000, time 357ms
>       measure_lookup_delete: max_entries 1000000, not batch, time 1894ms
>       measure_delete: max_entries 1000000, batch, time 220ms
>       measure_delete: max_entries 1000000, not batch, time 1289ms
> For a 1M-entry hash table, a batch size of 10 already reduces cpu time
> by roughly 70% (342ms vs. 1346ms for lookup). Please see patch
> "tools/bpf: measure map batching perf" for details of the test code.

Hi Yonghong!

Great to see this! We have been looking at implementing some way to
speed up map walks as well.

The direction we were looking in after previous discussions [1],
however, was to let user space provide a BPF program which runs the
filtering/dumping logic entirely within the kernel.

We have a rough PoC on the FW side (we can offload the program that
walks the map, which is pretty neat), but the kernel verifier side
hasn't really progressed yet. It will soon.

The rough idea is that user space provides two programs, a "filter"
and a "dumper":

	bpftool map exec id XYZ filter pinned /some/prog \
				dumper pinned /some/other_prog

Both programs get this context:

struct map_op_ctx {
	u64 key;
	u64 value;
};
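
To make that concrete, here is a rough sketch of what a filter could
look like for the flow pruning case mentioned further down. Everything
specific in it is made up for illustration: the ACT_* return bits, the
"map_filter" section name, the flow_value layout, and the assumption
that key/value in map_op_ctx carry pointers to the entry's key and
value.

	/* Sketch only: ACT_* bits, struct flow_value and the meaning of
	 * the map_op_ctx fields are assumptions, not an existing UAPI. */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	#define ACT_DELETE	(1U << 0)
	#define ACT_STOP	(1U << 1)

	struct map_op_ctx {
		__u64 key;
		__u64 value;
	};

	/* hypothetical value layout of the flow map being walked */
	struct flow_value {
		__u64 last_seen_ns;
		__u64 packets;
	};

	#define FLOW_TIMEOUT_NS	(1ULL * 1000 * 1000 * 1000)

	SEC("map_filter")
	int prune_old_flows(struct map_op_ctx *ctx)
	{
		struct flow_value *val;
		__u64 now = bpf_ktime_get_ns();

		val = (struct flow_value *)(unsigned long)ctx->value;
		if (now - val->last_seen_ns > FLOW_TIMEOUT_NS)
			return ACT_DELETE;
		return 0;
	}

	char _license[] SEC("license") = "GPL";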

We need a per-map implementation of the exec side, but roughly each
map type would do:

	LIST_HEAD(deleted);

	for entry in map {
		struct map_op_ctx ctx = {
			.key	= entry->key,
			.value	= entry->value,
		};

		act = BPF_PROG_RUN(filter, &ctx);
		if (act & ~ACT_BITS)
			return -EINVAL;

		if (act & DELETE) {
			map_unlink(entry);
			list_add(entry, &deleted);
		}
		if (act & STOP)
			break;
	}

	synchronize_rcu();

	for entry in deleted {
		struct map_op_ctx ctx = {
			.key	= entry->key,
			.value	= entry->value,
		};

		BPF_PROG_RUN(dumper, &ctx);
		map_free(entry);
	}

The filter program can't perform any map operations other than lookup,
otherwise we won't be able to guarantee that we'll walk the entire map
(if the filter program deletes some entries in an unfortunate order).

If user space just wants a pure dump it can simply load a program which
dumps the entries into a perf ring.
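
For illustration, a minimal dumper along those lines might look like
the sketch below. The "map_dumper" section name, the map_op_ctx layout
and the idea of pushing one fixed-size record per visited entry are
assumptions, not an existing interface; only bpf_perf_event_output()
and the perf event array map type (here in the legacy bpf_map_def
definition style) are real.

	/* Sketch only: dumps one record per visited entry into a perf
	 * ring that user space then reads. */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	struct map_op_ctx {
		__u64 key;
		__u64 value;
	};

	struct bpf_map_def SEC("maps") dump_events = {
		.type		= BPF_MAP_TYPE_PERF_EVENT_ARRAY,
		.key_size	= sizeof(__u32),
		.value_size	= sizeof(__u32),
		.max_entries	= 64,	/* >= number of possible CPUs */
	};

	struct dump_record {
		__u64 key;
		__u64 value;
	};

	SEC("map_dumper")
	int dump_entry(struct map_op_ctx *ctx)
	{
		struct dump_record rec = {
			.key	= ctx->key,
			.value	= ctx->value,
		};

		bpf_perf_event_output(ctx, &dump_events, BPF_F_CURRENT_CPU,
				      &rec, sizeof(rec));
		return 0;
	}

	char _license[] SEC("license") = "GPL";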

I'm bringing this up because that mechanism should cover what is
achieved with this patch set and much more. 

In particular, for networking workloads where old flows have to be
pruned from the map periodically, it's far more efficient to
communicate to user space only the flows which timed out (the delete
batching from this set won't help at all).

With a 2M entry map and this patch set we still won't be able to prune
once a second on one core: going by the batched lookup-and-delete
numbers above, 1M entries already cost roughly 0.36s of kernel time,
leaving little headroom for any user space processing.

[1]
https://lore.kernel.org/netdev/20190813130921.10704-4-quentin.monnet@netronome.com/

Thread overview: 38+ messages
2019-08-29  6:45 [PATCH bpf-next 00/13] bpf: adding map batch processing support Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 01/13] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Yonghong Song
2019-08-29 22:04   ` Song Liu
2019-08-30  6:40     ` Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 02/13] bpf: refactor map_update_elem() Yonghong Song
2019-08-29 23:37   ` Song Liu
2019-08-29  6:45 ` [PATCH bpf-next 03/13] bpf: refactor map_delete_elem() Yonghong Song
2019-08-29 23:39   ` Song Liu
2019-08-29  6:45 ` [PATCH bpf-next 04/13] bpf: refactor map_get_next_key() Yonghong Song
2019-08-29 23:39   ` Song Liu
2019-08-29  6:45 ` [PATCH bpf-next 05/13] bpf: adding map batch processing support Yonghong Song
2019-08-29 23:01   ` Brian Vazquez
2019-08-30  6:39     ` Yonghong Song
2019-08-30  6:58       ` Alexei Starovoitov
2019-08-29  6:45 ` [PATCH bpf-next 06/13] tools/bpf: sync uapi header bpf.h Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 07/13] tools/bpf: implement libbpf API functions for map batch operations Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 08/13] tools/bpf: add test for bpf_map_update_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 09/13] tools/bpf: add test for bpf_map_lookup_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 10/13] tools/bpf: add test for bpf_map_lookup_and_delete_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 11/13] tools/bpf: add test for bpf_map_delete_batch() Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 12/13] tools/bpf: add a multithreaded test for map batch operations Yonghong Song
2019-08-29  6:45 ` [PATCH bpf-next 13/13] tools/bpf: measure map batching perf Yonghong Song
2019-08-29 18:39 ` Jakub Kicinski [this message]
2019-08-29 23:13   ` [PATCH bpf-next 00/13] bpf: adding map batch processing support Brian Vazquez
2019-08-30  0:15     ` Jakub Kicinski
2019-08-30 20:15       ` Stanislav Fomichev
2019-08-30 20:55         ` Yonghong Song
2019-08-30 21:10           ` Jakub Kicinski
2019-08-30 22:24             ` Yonghong Song
2019-08-30 21:18           ` Stanislav Fomichev
2019-09-03 21:01             ` Alexei Starovoitov
2019-09-03 22:30               ` Stanislav Fomichev
2019-09-03 23:07                 ` Brian Vazquez
2019-09-04  1:35                   ` Alexei Starovoitov
2019-09-03 23:07                 ` Yonghong Song
2019-08-30  7:25   ` Yonghong Song
2019-08-30 21:35     ` Jakub Kicinski
2019-08-30 22:38       ` Yonghong Song
