Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Hou Tao <houtao1@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>, Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Song Liu <songliubraving@fb.com>, KP Singh <kpsingh@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
Date: Tue, 5 Jul 2022 15:37:01 -0700	[thread overview]
Message-ID: <CAEf4BzY+HRxgW_3=-b5Brp8BrEn4ZrCEb70q5FoNVusBu+VyLA@mail.gmail.com> (raw)
In-Reply-To: <503873ca-b094-961b-0fd6-cf842b39cf84@huawei.com>

On Wed, Jun 8, 2022 at 2:00 AM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>
> On 4/27/2022 11:57 AM, Andrii Nakryiko wrote:
> > On Tue, Apr 26, 2022 at 1:03 AM Hou Tao <houtao1@huawei.com> wrote:
> snip
> >>> I'm biased :) But I like the idea of qp-trie as a general purpose
> >>> ordered and dynamically sized BPF map. It makes no assumption about
> >>> data being string-like and sharing common prefixes. It can be made to
> >>> work just as fine with any array of bytes, making it very suitable as
> >>> a generic lookup table map. Note that upstream implementation does
> >>> assume zero-terminated strings and no key being a prefix of another
> >>> key. But all that can be removed. For fixed-length keys this can never
> >>> happen by construction, for variable-length keys (and we'll be able to
> >>> support this finally with bpf_dynptr's help very soon), we can record
> >>> length of the key in each leaf and use that during comparisons.
> >> Using the trailing zero byte to make sure no key will be a prefix of another is
> >> simple, but I will check whether or not there is other way to make the bytes
> >> array key work out. Alexei had suggest me to use the length of key as part of
> >> key just like bpf_lpm_trie_key does, maybe i can try it first.
> > Yeah, using key length as part of the key during comparison is what I
> > meant as well. I didn't mean to aritificially add trailing zero (this
> > idea doesn't work for arbitrary binary data).
> >
> >>> Also note that qp-trie can be internally used by BPF_MAP_TYPE_LPM_TRIE
> >>> very efficiently and speed it up considerable in the process (and
> >>> especially to get rid of the global lock).
> >>>
> >>> So if you were to invest in a proper full-featured production
> >>> implementation of a BPF map, I'd start with qp-trie. From available
> >>> benchmarks it's both faster and more memory efficient than Red-Black
> >>> trees, which could be an alternative underlying implementation of such
> >>> ordered and "resizable" map.
> Recently I tried to add concurrent update and deletion support for qp-trie, and
> found out that hand-over-hand lock scheme may don't work for qp-trie update. The
> short explanation is that update procedure needs traverse qp-trie twice and the
> position tuple got in the first pass may be stale due to concurrent updates
> occurred in the second pass. The detailed explanation is shown below.
>
> To reduce space usage for qp-trie, there is no common prefix saved in branch
> node, so the update of qp-trie needs to traversing qp-trie to find the most
> similar key firstly, comparing with it to get a (index, flag,) tuple for the
> leaf key, then traversing qp-trie again to find the insert position by using the
> tuple. The problem is that, the position tuple may be stale due to concurrent
> updates occurred in different branch. Considering the following case:
>
> When inserting "aa_bind_mount" and "aa_buffers_lock" concurrently into the
> following qp-trie. The most similar key for "aa_bind_mount" is leaf X, and for
> "aa_buffers_lock" it is leaf Y. The calculated index tuple for both new keys are
> the same: (index=3, flag=2). Assuming "aa_bind_mount" is inserted firstly, so
> when inserting "aa_buffers_lock", the correct index will be (index=4, flag=1)
> instead of (index=3, flag=2) and the result will be incorrect.
>
> branch: index  1 flags 1 bitmap 0x00088
> * leaf: a.81577 0
> * branch: index  4 flags 1 bitmap 0x00180
> * * branch: index  4 flags 2 bitmap 0x02080
> * * * leaf: aa_af_perm 1 (leaf X, for aa_bind_mount)
> * * branch: index  4 flags 2 bitmap 0x00052
> * * * leaf: aa_apply_modes_to_perms 6
> * * * leaf: aa_asprint_hashstr 7 (leaf Y, for aa_buffers_lock)
>
> In order to get a correct position tuple, the intuitive solution is adding
> common prefix in branch node and letting update procedure to find the insertion
> position by comparing with the prefix, so it only needs to traverse qp-trie once
> and hand-over-hand locking scheme can work. I plan to work on qp-trie with
> common prefix and will update its memory usage and concurrent update/insert
> speed in this thread once the demo is ready.  So any more suggestions ?
>

Yeah, that sucks. I'm not sure I completely understand the common
prefix solution and whether it will still be qp-trie after that, but
if you try that, it would be interesting to learn about the results
you get!

I think in the worst case we'll have to do tree-wide lock, perhaps
maybe having an initial 256-way root node for first byte, and each of
256 subtrees could have their own lock.

Alternatively we can do optimistic lockless lookup (in the hope to
find a match), but if that fails - take tree-wide lock, and perform
the search and insertion again. This will favor lookup hits,
obviously, but hopefully that's going to be a common use case where
keys are mostly matching.

WDYT?

> Regards,
> Tao
>
> >> Thanks for your suggestions. I will give it a try.
> > Awesome!
> >
> >> Regards,
> >> Tao
> >>
> >>>> Regards,
> >>>> Tao
> >>>>>>> This prefix sharing is nice when you have a lot of long common
> >>>>>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> >>>>>>> structure it's going to be that beneficial. 192 bytes of common
> >>>>>>> prefixes seems like a very unusual dataset :)
> >>>>>> Yes. The case with common prefix I known is full file path.
> >>>>>>> More specifically about TST implementation in your paches. One global
> >>>>>>> per-map lock I think is a very big downside. We have LPM trie which is
> >>>>>>> very slow in big part due to global lock. It might be possible to
> >>>>>>> design more granular schema for TST, but this whole in-place splitting
> >>>>>>> logic makes this harder. I think qp-trie can be locked in a granular
> >>>>>>> fashion much more easily by having a "hand over hand" locking: lock
> >>>>>>> parent, find child, lock child, unlock parent, move into child node.
> >>>>>>> Something like that would be more scalable overall, especially if the
> >>>>>>> access pattern is not focused on a narrow set of nodes.
> >>>>>> Yes. The global lock is a problem but the splitting is not in-place. I will try
> >>>>>> to figure out whether the lock can be more scalable after the benchmark test
> >>>>>> between qp-trie and tst.
> >>>>> Great, looking forward!
> >>>>>
> >>>>>> Regards,
> >>>>>> Tao
> >>>>>>
> >>>>>> [0]: https://github.com/Tessil/hat-trie
> >>>>>>> Anyways, I love data structures and this one is an interesting idea.
> >>>>>>> But just my few cents of "production-readiness" for general-purpose
> >>>>>>> data structures for BPF.
> >>>>>>>
> >>>>>>>   [0] https://dotat.at/prog/qp/README.html
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Tao
> >>>>>>>>
> >>>>>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
> >>>>>>>>
> >>>>>>>> Hou Tao (2):
> >>>>>>>>   bpf: Introduce ternary search tree for string key
> >>>>>>>>   selftests/bpf: add benchmark for ternary search tree map
> >>>>>>>>
> >>>>>>>>  include/linux/bpf_types.h                     |   1 +
> >>>>>>>>  include/uapi/linux/bpf.h                      |   1 +
> >>>>>>>>  kernel/bpf/Makefile                           |   1 +
> >>>>>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
> >>>>>>>>  tools/include/uapi/linux/bpf.h                |   1 +
> >>>>>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
> >>>>>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
> >>>>>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
> >>>>>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
> >>>>>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
> >>>>>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
> >>>>>>>>  create mode 100644 kernel/bpf/bpf_tst.c
> >>>>>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
> >>>>>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
> >>>>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> 2.31.1
> >>>>>>>>
> >>>>>>> .
> >>>>> .
> >>> .
> > .
>