Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support

From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Yonghong Song <yhs@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Brian Vazquez <brianvv@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>,
	Quentin Monnet <quentin.monnet@netronome.com>
Subject: Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
Date: Fri, 30 Aug 2019 14:35:08 -0700	[thread overview]
Message-ID: <20190830143508.73c30631@cakuba.netronome.com> (raw)
In-Reply-To: <a3422ffd-e9f2-af77-a92d-81393a9f4fc7@fb.com>

On Fri, 30 Aug 2019 07:25:54 +0000, Yonghong Song wrote:
> On 8/29/19 11:39 AM, Jakub Kicinski wrote:
> > On Wed, 28 Aug 2019 23:45:02 -0700, Yonghong Song wrote:  
> >> Brian Vazquez has proposed BPF_MAP_DUMP command to look up more than one
> >> map entries per syscall.
> >>    https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@mail.gmail.com/T/#t
> >>
> >> During discussion, we found more use cases can be supported in a similar
> >> map operation batching framework. For example, batched map lookup and delete,
> >> which can be really helpful for bcc.
> >>    https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
> >>    https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138
> >>      
> >> Also, in bcc, we have API to delete all entries in a map.
> >>    https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264
> >>
> >> For map update, batched operations also useful as sometimes applications need
> >> to populate initial maps with more than one entry. For example, the below
> >> example is from kernel/samples/bpf/xdp_redirect_cpu_user.c:
> >>    https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550
> >>
> >> This patch addresses all the above use cases. To make uapi stable, it also
> >> covers other potential use cases. Four bpf syscall subcommands are introduced:
> >>      BPF_MAP_LOOKUP_BATCH
> >>      BPF_MAP_LOOKUP_AND_DELETE_BATCH
> >>      BPF_MAP_UPDATE_BATCH
> >>      BPF_MAP_DELETE_BATCH
> >>
> >> In userspace, application can iterate through the whole map one batch
> >> as a time, e.g., bpf_map_lookup_batch() in the below:
> >>      p_key = NULL;
> >>      p_next_key = &key;
> >>      while (true) {
> >>         err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
> >>                                    &batch_size, elem_flags, flags);
> >>         if (err) ...
> >>         if (p_next_key) break; // done
> >>         if (!p_key) p_key = p_next_key;
> >>      }
> >> Please look at individual patches for details of new syscall subcommands
> >> and examples of user codes.
> >>
> >> The testing is also done in a qemu VM environment:
> >>        measure_lookup: max_entries 1000000, batch 10, time 342ms
> >>        measure_lookup: max_entries 1000000, batch 1000, time 295ms
> >>        measure_lookup: max_entries 1000000, batch 1000000, time 270ms
> >>        measure_lookup: max_entries 1000000, no batching, time 1346ms
> >>        measure_lookup_delete: max_entries 1000000, batch 10, time 433ms
> >>        measure_lookup_delete: max_entries 1000000, batch 1000, time 363ms
> >>        measure_lookup_delete: max_entries 1000000, batch 1000000, time 357ms
> >>        measure_lookup_delete: max_entries 1000000, not batch, time 1894ms
> >>        measure_delete: max_entries 1000000, batch, time 220ms
> >>        measure_delete: max_entries 1000000, not batch, time 1289ms
> >> For a 1M entry hash table, batch size of 10 can reduce cpu time
> >> by 70%. Please see patch "tools/bpf: measure map batching perf"
> >> for details of test codes.  
> > 
> > Hi Yonghong!
> > 
> > great to see this, we have been looking at implementing some way to
> > speed up map walks as well.
> > 
> > The direction we were looking in, after previous discussions [1],
> > however, was to provide a BPF program which can run the logic entirely
> > within the kernel.
> > 
> > We have a rough PoC on the FW side (we can offload the program which
> > walks the map, which is pretty neat), but the kernel verifier side
> > hasn't really progressed. It will soon.
> > 
> > The rough idea is that the user space provides two programs, "filter"
> > and "dumper":
> > 
> > 	bpftool map exec id XYZ filter pinned /some/prog \
> > 				dumper pinned /some/other_prog
> > 
> > Both programs get this context:
> > 
> > struct map_op_ctx {
> > 	u64 key;
> > 	u64 value;
> > }
> > 
> > We need a per-map implementation of the exec side, but roughly maps
> > would do:
> > 
> > 	LIST_HEAD(deleted);
> > 
> > 	for entry in map {
> > 		struct map_op_ctx {
> > 			.key	= entry->key,
> > 			.value	= entry->value,
> > 		};
> > 
> > 		act = BPF_PROG_RUN(filter, &map_op_ctx);
> > 		if (act & ~ACT_BITS)
> > 			return -EINVAL;
> > 
> > 		if (act & DELETE) {
> > 			map_unlink(entry);
> > 			list_add(entry, &deleted);
> > 		}
> > 		if (act & STOP)
> > 			break;
> > 	}
> > 
> > 	synchronize_rcu();
> > 
> > 	for entry in deleted {
> > 		struct map_op_ctx {
> > 			.key	= entry->key,
> > 			.value	= entry->value,
> > 		};
> > 		
> > 		BPF_PROG_RUN(dumper, &map_op_ctx);
> > 		map_free(entry);
> > 	}
> > 
> > The filter program can't perform any map operations other than lookup,
> > otherwise we won't be able to guarantee that we'll walk the entire map
> > (if the filter program deletes some entries in a unfortunate order).  
> 
> Looks like you will provide a new program type and per-map 
> implementation of above code. My patch set indeed avoided per-map 
> implementation for all of lookup/delete/get-next-key...

Indeed, the simple batched ops are undeniably lower LoC.

> > If user space just wants a pure dump it can simply load a program which
> > dumps the entries into a perf ring.  
> 
> percpu perf ring is not really ideal for user space which simply just
> want to get some key/value pairs back. Some kind of generate non-per-cpu
> ring buffer might be better for such cases.

I don't think it had to be per-cpu, but I may be blissfully ignorant
about the perf ring details :) bpf_perf_event_output() takes flags,
which are effectively selecting the "output CPU", no?

> > I'm bringing this up because that mechanism should cover what is
> > achieved with this patch set and much more.  
> 
> The only case it did not cover is batched update. But that may not
> be super critical.

Right, my other concern (which admittedly is slightly pedantic) is the
potential loss of modifications. the lookup_and_delete() operation does
not guarantee that the result returned to user space is the "final"
value. We'd need to wait for a RCU grace period between lookup and dump.
The "map is definitely unused now"/RCU guarantee had came up in the
past.

> Your approach give each element an action choice through another bpf 
> program. This indeed powerful. My use case is simpler than your use case
> below, hence the implementation.

Agreed, the simplicity is tempting. 

> > In particular for networking workloads where old flows have to be
> > pruned from the map periodically it's far more efficient to communicate
> > to user space only the flows which timed out (the delete batching from
> > this set won't help at all).  
> 
> Maybe LRU map will help in this case? It is designed for such
> use cases.

LRU map would be perfect if it dumped the entry somewhere before it got
reused... Perhaps we need to attach a program to the LRU map that'd get
run when entries get reaped. That'd be cool 🤔 We could trivially reuse
the "dumper" prog type for this.

> > With a 2M entry map and this patch set we still won't be able to prune
> > once a second on one core.
> > 
> > [1]
> > https://lore.kernel.org/netdev/20190813130921.10704-4-quentin.monnet@netronome.com/