* [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
@ 2022-09-24 13:36 Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 01/13] bpf: Export bpf_dynptr_set_size() Hou Tao
                   ` (14 more replies)
  0 siblings, 15 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length, as
discussed in [0]. As a big-endian lexicographically ordered map, it can
also be used for any binary data with fixed or variable length.

Now that the basic functionality of qp-trie is ready, I am posting it to
get more feedback or suggestions about qp-trie, especially feedback on
the following questions:

(1) Use cases for qp-trie
Andrii proposed re-implementing lpm-trie on top of qp-trie. The advantage
would be faster lookups due to the lower tree depth of qp-trie, and
update performance may also improve. But are there any other use cases
for qp-trie? Especially cases which need both ordering and memory
efficiency, or cases in which qp-trie has a high fan-out and its lookup
performance is much better than hash table, as shown below:

  Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
  htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
  htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
  htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
  htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
  htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
  
  qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
  qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
  qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
  qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
  qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)

  * non-zero drops are due to duplicated keys in the generated data set.
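
For reference, a BPF program would use such a qp-trie map roughly as in
the sketch below. This is only an illustration, not part of the series:
the map definition follows the conventions of the selftests in this
series (key type "struct bpf_dynptr", lower 32 bits of map_extra as the
maximum key size), while the section name, key contents and value type
are arbitrary assumptions.

/* Sketch only: a qp-trie map keyed by the used part of a byte buffer. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_QP_TRIE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__uint(max_entries, 16384);
	__uint(map_extra, 255);		/* maximum key length in bytes */
	__type(key, struct bpf_dynptr);
	__type(value, __u32);
} qp_trie SEC(".maps");

SEC("tp/syscalls/sys_enter_getpgid")
int lookup_str_key(void *ctx)
{
	char buf[16] = "abc";
	struct bpf_dynptr key;
	__u32 *value;

	/* Build a dynptr covering only the 3 used bytes of the buffer */
	if (bpf_dynptr_from_mem(buf, 3, 0, &key))
		return 0;

	value = bpf_map_lookup_elem(&qp_trie, &key);
	if (value)
		bpf_printk("abc -> %u", *value);
	return 0;
}

char _license[] SEC("license") = "GPL";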

(2) Improve update/delete performance for qp-trie
Now the top-5 overheads in update/delete operations are:

    21.23%  bench    [kernel.vmlinux]    [k] qp_trie_update_elem
    13.98%  bench    [kernel.vmlinux]    [k] qp_trie_delete_elem
     7.96%  bench    [kernel.vmlinux]    [k] native_queued_spin_lock_slowpath
     5.16%  bench    [kernel.vmlinux]    [k] memcpy_erms
     5.00%  bench    [kernel.vmlinux]    [k] __kmalloc_node

The top-2 overheads are due to memory accesses and the atomic ops on
max_entries. I tried memory prefetching, but it didn't work out; maybe I
did it wrong. For the subtree spinlock overhead, I also tried a
hierarchical hand-over-hand locking scheme, but it didn't scale well [1].
I will try increasing the number of subtrees from 256 to 1024, 4096 or
more and check whether it makes any difference.

For the atomic ops and kmalloc overhead, I think I can reuse ideas from
the patchset "bpf: BPF specific memory allocator". I gave bpf_mem_alloc a
quick try and ran into some problems. One problem is the immediate reuse
of freed objects in the bpf memory allocator: because qp-trie uses the
allocator to allocate and free qp_trie_branch, the lookup procedure may
oops due to stale content if a qp_trie_branch is reused immediately.
Another problem is that the size limit of bpf_mem_alloc() is 4096 bytes,
which may be a little small for the combined key and value size, but
maybe I can use two separate bpf_mem_alloc instances for key and value.

Strings in BTF string sections (entries=115980, max size=71, threads=4)

On hash-table:
Iter   0 ( 21.872us): hits   11.496M/s (  2.874M/prod), drops    0.000M/s, total operations   11.496M/s
Iter   1 (  4.816us): hits   12.701M/s (  3.175M/prod), drops    0.000M/s, total operations   12.701M/s
Iter   2 (-11.006us): hits   12.443M/s (  3.111M/prod), drops    0.000M/s, total operations   12.443M/s
Iter   3 (  2.628us): hits   11.223M/s (  2.806M/prod), drops    0.000M/s, total operations   11.223M/s
Slab: 22.225 MiB

On qp-trie:
Iter   0 ( 24.388us): hits    4.062M/s (  1.016M/prod), drops    0.000M/s, total operations    4.062M/s
Iter   1 ( 20.612us): hits    3.884M/s (  0.971M/prod), drops    0.000M/s, total operations    3.884M/s
Iter   2 (-21.533us): hits    4.046M/s (  1.012M/prod), drops    0.000M/s, total operations    4.046M/s
Iter   3 (  2.090us): hits    3.971M/s (  0.993M/prod), drops    0.000M/s, total operations    3.971M/s
Slab: 10.849 MiB

(3) Improve memory efficiency further
When using the strings in BTF string sections as the data set for
qp-trie, the slab memory usage shown in the cgroup memory.stat file is
about 11MB for qp-trie and 22MB for the hash table, as shown above.
However, the theoretical memory usage for qp-trie is ~6.8MB (~4.9MB if
the "parent" & "rcu" fields were removed from qp_trie_branch), and the
extra memory usage (about 38% of the total) mainly comes from internal
fragmentation in slab (namely the 2^n alignment of allocations) and from
kmem-cgroup accounting overhead. We could reduce the internal
fragmentation by creating separate kmem_caches for qp_trie_branch nodes
with different numbers of children, but I am not sure whether it is
worth it.

In order to avoid allocating an rcu_head for each leaf node, only branch
nodes are RCU-freed for now, so when replacing a leaf node, a new branch
node and a new leaf node are allocated instead of replacing the old leaf
node in place and RCU-freeing it.

I am not sure whether there is a way to remove rcu_head from
qp_trie_branch completely. Headless kfree_rcu() doesn't need an rcu_head,
but it might sleep. The BPF memory allocator seems like a better choice
for now because it replaces rcu_head with llist_node.

  Sorted strings in BTF string sections (entries=115980)
  htab lookup      (1  thread)    6.915 ± 0.029M/s (drops 0.000 ± 0.000M/s mem 22.216 MiB)
  qp-trie lookup   (1  thread)    6.791 ± 0.005M/s (drops 0.000 ± 0.000M/s mem 11.273 MiB)
  
  All files under linux kernel source directory (entries=74359)
  htab lookup      (1  thread)    7.978 ± 0.009M/s (drops 0.000 ± 0.000M/s mem 14.272 MiB)
  qp-trie lookup   (1  thread)    5.521 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.367 MiB)
  
  Domain names for Alexa top million web site (entries=1000000)
  htab lookup      (1  thread)    3.148 ± 0.039M/s (drops 0.000 ± 0.000M/s mem 190.831 MiB)
  qp-trie lookup   (1  thread)    2.374 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 83.733 MiB)

Comments and suggestions are always welcome.

[0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/db34696a-cbfe-16e8-6dd5-8174b97dcf1d@huawei.com/

Change Log:
v2:
 * Always use copy_to_user() in bpf_copy_to_dynptr_ukey() (from kernel test robot <lkp@intel.com>)
   No-MMU ARM32 host doesn't support 8-byte get_user().
 * Remove BPF_F_DYNPTR_KEY and use the more extensible dynptr_key_off
 * Remove the unnecessary rcu_barrier in qp_trie_free()
 * Add a new libbpf helper bpf_dynptr_user_trim() for use by bpftool
 * Support qp-trie map in bpftool
 * Add 2 more test_progs test cases for qp-trie map
 * Fix test_progs-no_alu32 failure for test_progs test case
 * Add tests for not-supported operations in test_maps

v1: https://lore.kernel.org/bpf/20220917153125.2001645-1-houtao@huaweicloud.com/
 * Use bpf_dynptr as map key type instead of bpf_lpm_trie_key-styled key (Suggested by Andrii)
 * Fix build error and RCU related sparse errors reported by lkp robot
 * Copy the passed key first in qp_trie_update_elem(), because the content of the passed key
   may change and break the assumption of the two-round lookup process during update.
 * Add the missing rcu_barrier in qp_trie_free()

RFC: https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@huawei.com/

Hou Tao (13):
  bpf: Export bpf_dynptr_set_size()
  bpf: Add helper btf_find_dynptr()
  bpf: Support bpf_dynptr-typed map key in bpf syscall
  bpf: Support bpf_dynptr-typed map key in verifier
  libbpf: Add helpers for bpf_dynptr_user
  bpf: Add support for qp-trie map with dynptr key
  libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE
  bpftool: Add support for qp-trie map
  selftests/bpf: Add two new dynptr_fail cases for map key
  selftests/bpf: Move ENOTSUPP into bpf_util.h
  selftests/bpf: Add prog tests for qp-trie map
  selftests/bpf: Add benchmark for qp-trie map
  selftests/bpf: Add map tests for qp-trie by using bpf syscall

 include/linux/bpf.h                           |   11 +-
 include/linux/bpf_types.h                     |    1 +
 include/linux/btf.h                           |    1 +
 include/uapi/linux/bpf.h                      |    7 +
 kernel/bpf/Makefile                           |    1 +
 kernel/bpf/bpf_qp_trie.c                      | 1057 ++++++++++++++
 kernel/bpf/btf.c                              |   13 +
 kernel/bpf/helpers.c                          |    9 +-
 kernel/bpf/map_in_map.c                       |    3 +
 kernel/bpf/syscall.c                          |  121 +-
 kernel/bpf/verifier.c                         |   17 +-
 .../bpf/bpftool/Documentation/bpftool-map.rst |    4 +-
 tools/bpf/bpftool/btf_dumper.c                |   33 +
 tools/bpf/bpftool/map.c                       |  149 +-
 tools/include/uapi/linux/bpf.h                |    7 +
 tools/lib/bpf/bpf.h                           |   29 +
 tools/lib/bpf/libbpf.c                        |    1 +
 tools/lib/bpf/libbpf_probes.c                 |   25 +
 tools/testing/selftests/bpf/Makefile          |    5 +-
 tools/testing/selftests/bpf/bench.c           |   10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      |  511 +++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
 tools/testing/selftests/bpf/bpf_util.h        |    4 +
 .../selftests/bpf/map_tests/qp_trie_map.c     | 1209 +++++++++++++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     |    4 -
 .../testing/selftests/bpf/prog_tests/dynptr.c |    2 +
 .../selftests/bpf/prog_tests/lsm_cgroup.c     |    4 -
 .../selftests/bpf/prog_tests/qp_trie_test.c   |  214 +++
 .../testing/selftests/bpf/progs/dynptr_fail.c |   43 +
 .../selftests/bpf/progs/qp_trie_bench.c       |  236 ++++
 .../selftests/bpf/progs/qp_trie_test.c        |  200 +++
 tools/testing/selftests/bpf/test_maps.c       |    4 -
 32 files changed, 3931 insertions(+), 59 deletions(-)
 create mode 100644 kernel/bpf/bpf_qp_trie.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/map_tests/qp_trie_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_test.c

-- 
2.29.2



* [PATCH bpf-next v2 01/13] bpf: Export bpf_dynptr_set_size()
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 02/13] bpf: Add helper btf_find_dynptr() Hou Tao
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

For maps with a bpf_dynptr-typed key, the lookup and update procedures
will use bpf_dynptr_get_size() to get the length of the key, and the
iteration procedure will use bpf_dynptr_set_size() to set the length of
the returned key.

The implementation of bpf_dynptr_set_size() is taken from Joanne's patch
"bpf: Add bpf_dynptr_trim and bpf_dynptr_advance". Also add a const
qualifier to the dynptr argument of bpf_dynptr_get_size().

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h  | 3 ++-
 kernel/bpf/helpers.c | 9 ++++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index edd43edb27d6..66a18dc67b46 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2660,7 +2660,8 @@ void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
 		     enum bpf_dynptr_type type, u32 offset, u32 size);
 void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
 int bpf_dynptr_check_size(u32 size);
-u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr);
+u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr);
+void bpf_dynptr_set_size(struct bpf_dynptr_kern *ptr, u32 new_size);
 
 #ifdef CONFIG_BPF_LSM
 void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index b069517a3da0..a9ca2de8f8cd 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1408,11 +1408,18 @@ static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_typ
 	ptr->size |= type << DYNPTR_TYPE_SHIFT;
 }
 
-u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
+u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr)
 {
 	return ptr->size & DYNPTR_SIZE_MASK;
 }
 
+void bpf_dynptr_set_size(struct bpf_dynptr_kern *ptr, u32 new_size)
+{
+	u32 metadata = ptr->size & ~DYNPTR_SIZE_MASK;
+
+	ptr->size = new_size | metadata;
+}
+
 int bpf_dynptr_check_size(u32 size)
 {
 	return size > DYNPTR_MAX_SIZE ? -E2BIG : 0;
-- 
2.29.2



* [PATCH bpf-next v2 02/13] bpf: Add helper btf_find_dynptr()
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 01/13] bpf: Export bpf_dynptr_set_size() Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall Hou Tao
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add the helper btf_find_dynptr() to check whether or not the passed btf
type is bpf_dynptr. It can be extended later to find an embedded dynptr
in the passed btf type if needed.

It will be used by bpf maps (e.g. qp-trie) with a bpf_dynptr key.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/btf.h |  1 +
 kernel/bpf/btf.c    | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index f9aababc5d78..5bf508c2bad2 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -165,6 +165,7 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 					  const struct btf_type *t);
+int btf_find_dynptr(const struct btf *btf, const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a44ad4b347ff..fefbe84c6998 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3522,6 +3522,19 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 	return ERR_PTR(ret);
 }
 
+/* Now only allow to use 'struct bpf_dynptr' as map key.
+ * Map key with embedded bpf_dynptr is not allowed.
+ */
+int btf_find_dynptr(const struct btf *btf, const struct btf_type *t)
+{
+	/* Only allow struct type */
+	if (__btf_type_is_struct(t) && t->size == sizeof(struct bpf_dynptr) &&
+	    !strcmp("bpf_dynptr", __btf_name_by_offset(btf, t->name_off)))
+		return 0;
+
+	return -EINVAL;
+}
+
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
-- 
2.29.2



* [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 01/13] bpf: Export bpf_dynptr_set_size() Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 02/13] bpf: Add helper btf_find_dynptr() Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-29  0:16   ` Andrii Nakryiko
  2022-09-24 13:36 ` [PATCH bpf-next v2 04/13] bpf: Support bpf_dynptr-typed map key in verifier Hou Tao
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

A userspace application uses the bpf syscall to look up or update a bpf
map and passes a pointer to a fixed-size buffer to the kernel to
represent the map key. To support maps with variable-length keys,
introduce bpf_dynptr_user, which allows userspace to pass a pointer to a
bpf_dynptr_user that specifies the address and length of the key buffer.
In order to represent a dynptr coming from userspace, add a new dynptr
type: BPF_DYNPTR_TYPE_USER. Because a BPF_DYNPTR_TYPE_USER-typed dynptr
is not available from bpf programs, no verifier update is needed.

Add dynptr_key_off to bpf_map to distinguish maps with a fixed-size key
from maps with a variable-length key. dynptr_key_off is less than zero
for a fixed-size key and can only be zero for a dynptr key.

For a dynptr-key map, the key btf type is bpf_dynptr and the key size is
16, so the lower 32 bits of map_extra are used to specify the maximum
size of the dynptr key.
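
As an illustration (not part of this patch), iterating all keys of a
dynptr-key map from userspace could look like the sketch below. The
caller supplies the buffer behind the 'data' pointer, which must be able
to hold the map's maximum key size, and the kernel writes the actual key
length back into 'size'. The map fd and the 256-byte buffers are
assumptions, and the example relies on the updated uapi header for
struct bpf_dynptr_user.

/* Sketch only: walk all keys of a dynptr-key map via the standard
 * libbpf bpf_map_get_next_key() wrapper around BPF_MAP_GET_NEXT_KEY.
 */
#include <stdio.h>
#include <string.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

static void dump_key_sizes(int map_fd)
{
	char cur_buf[256], next_buf[256];
	struct bpf_dynptr_user cur, next, *key = NULL;

	memset(&cur, 0, sizeof(cur));
	cur.data = (__u64)(unsigned long)cur_buf;
	memset(&next, 0, sizeof(next));
	next.data = (__u64)(unsigned long)next_buf;

	while (!bpf_map_get_next_key(map_fd, key, &next)) {
		/* The kernel has set next.size to the returned key length */
		printf("got key with %u bytes\n", next.size);

		/* Use the returned key as the cursor for the next call */
		memcpy(cur_buf, next_buf, next.size);
		cur.size = next.size;
		key = &cur;
	}
}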

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h            |   8 +++
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/map_in_map.c        |   3 +
 kernel/bpf/syscall.c           | 121 +++++++++++++++++++++++++++------
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 125 insertions(+), 19 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 66a18dc67b46..44bef4110179 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -216,6 +216,7 @@ struct bpf_map {
 	int spin_lock_off; /* >=0 valid offset, <0 error */
 	struct bpf_map_value_off *kptr_off_tab;
 	int timer_off; /* >=0 valid offset, <0 error */
+	int dynptr_key_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
 	u32 btf_key_type_id;
@@ -265,6 +266,11 @@ static inline bool map_value_has_kptrs(const struct bpf_map *map)
 	return !IS_ERR_OR_NULL(map->kptr_off_tab);
 }
 
+static inline bool map_key_has_dynptr(const struct bpf_map *map)
+{
+	return map->dynptr_key_off >= 0;
+}
+
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
 	if (unlikely(map_value_has_spin_lock(map)))
@@ -2654,6 +2660,8 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_LOCAL,
 	/* Underlying data is a kernel-produced ringbuf record */
 	BPF_DYNPTR_TYPE_RINGBUF,
+	/* Points to memory copied from/to userspace */
+	BPF_DYNPTR_TYPE_USER,
 };
 
 void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ead35f39f185..3466bcc9aeca 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6814,6 +6814,12 @@ struct bpf_timer {
 	__u64 :64;
 } __attribute__((aligned(8)));
 
+struct bpf_dynptr_user {
+	__u64 data;
+	__u32 size;
+	__u32 :32;
+} __attribute__((aligned(8)));
+
 struct bpf_dynptr {
 	__u64 :64;
 	__u64 :64;
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 135205d0d560..8ba96337893b 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	inner_map_meta->max_entries = inner_map->max_entries;
 	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
 	inner_map_meta->timer_off = inner_map->timer_off;
+	inner_map_meta->dynptr_key_off = inner_map->dynptr_key_off;
 	inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
 	if (inner_map->btf) {
 		btf_get(inner_map->btf);
@@ -85,7 +86,9 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
 		meta0->key_size == meta1->key_size &&
 		meta0->value_size == meta1->value_size &&
 		meta0->timer_off == meta1->timer_off &&
+		meta0->dynptr_key_off == meta1->dynptr_key_off &&
 		meta0->map_flags == meta1->map_flags &&
+		meta0->map_extra == meta1->map_extra &&
 		bpf_map_equal_kptr_off_tab(meta0, meta1);
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 372fad5ef3d3..70919155c4ed 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -996,6 +996,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		key_type = btf_type_id_size(btf, &btf_key_id, &key_size);
 		if (!key_type || key_size != map->key_size)
 			return -EINVAL;
+		map->dynptr_key_off = btf_find_dynptr(btf, key_type);
 	} else {
 		key_type = btf_type_by_id(btf, 0);
 		if (!map->ops->map_check_btf)
@@ -1089,10 +1090,6 @@ static int map_create(union bpf_attr *attr)
 		return -EINVAL;
 	}
 
-	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
-	    attr->map_extra != 0)
-		return -EINVAL;
-
 	f_flags = bpf_get_file_flag(attr->map_flags);
 	if (f_flags < 0)
 		return f_flags;
@@ -1119,6 +1116,7 @@ static int map_create(union bpf_attr *attr)
 
 	map->spin_lock_off = -EINVAL;
 	map->timer_off = -EINVAL;
+	map->dynptr_key_off = -EINVAL;
 	if (attr->btf_key_type_id || attr->btf_value_type_id ||
 	    /* Even the map's value is a kernel's struct,
 	     * the bpf_prog.o must have BTF to begin with
@@ -1154,6 +1152,20 @@ static int map_create(union bpf_attr *attr)
 			attr->btf_vmlinux_value_type_id;
 	}
 
+	if (map_key_has_dynptr(map)) {
+		/* The lower 32-bits of map_extra specifies the maximum size
+		 * of bpf_dynptr-typed key
+		 */
+		if (!attr->map_extra || (attr->map_extra >> 32) ||
+		    bpf_dynptr_check_size((u32)attr->map_extra)) {
+			err = -EINVAL;
+			goto free_map;
+		}
+	} else if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER && attr->map_extra != 0) {
+		err = -EINVAL;
+		goto free_map;
+	}
+
 	err = bpf_map_alloc_off_arr(map);
 	if (err)
 		goto free_map;
@@ -1280,10 +1292,41 @@ int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 	return -ENOTSUPP;
 }
 
-static void *__bpf_copy_key(void __user *ukey, u64 key_size)
+static void *bpf_copy_from_dynptr_ukey(bpfptr_t ukey)
+{
+	struct bpf_dynptr_kern *kptr;
+	struct bpf_dynptr_user uptr;
+	bpfptr_t data;
+
+	if (copy_from_bpfptr(&uptr, ukey, sizeof(uptr)))
+		return ERR_PTR(-EFAULT);
+
+	if (!uptr.size || bpf_dynptr_check_size(uptr.size))
+		return ERR_PTR(-EINVAL);
+
+	/* Allocate and free bpf_dynptr_kern and its data together */
+	kptr = kvmalloc(sizeof(*kptr) + uptr.size, GFP_USER | __GFP_NOWARN);
+	if (!kptr)
+		return ERR_PTR(-ENOMEM);
+
+	data = make_bpfptr(uptr.data, bpfptr_is_kernel(ukey));
+	if (copy_from_bpfptr(&kptr[1], data, uptr.size)) {
+		kvfree(kptr);
+		return ERR_PTR(-EFAULT);
+	}
+
+	bpf_dynptr_init(kptr, &kptr[1], BPF_DYNPTR_TYPE_USER, 0, uptr.size);
+
+	return kptr;
+}
+
+static void *__bpf_copy_key(const struct bpf_map *map, void __user *ukey)
 {
-	if (key_size)
-		return vmemdup_user(ukey, key_size);
+	if (map_key_has_dynptr(map))
+		return bpf_copy_from_dynptr_ukey(USER_BPFPTR(ukey));
+
+	if (map->key_size)
+		return vmemdup_user(ukey, map->key_size);
 
 	if (ukey)
 		return ERR_PTR(-EINVAL);
@@ -1291,10 +1334,13 @@ static void *__bpf_copy_key(void __user *ukey, u64 key_size)
 	return NULL;
 }
 
-static void *___bpf_copy_key(bpfptr_t ukey, u64 key_size)
+static void *___bpf_copy_key(const struct bpf_map *map, bpfptr_t ukey)
 {
-	if (key_size)
-		return kvmemdup_bpfptr(ukey, key_size);
+	if (map_key_has_dynptr(map))
+		return bpf_copy_from_dynptr_ukey(ukey);
+
+	if (map->key_size)
+		return kvmemdup_bpfptr(ukey, map->key_size);
 
 	if (!bpfptr_is_null(ukey))
 		return ERR_PTR(-EINVAL);
@@ -1302,6 +1348,38 @@ static void *___bpf_copy_key(bpfptr_t ukey, u64 key_size)
 	return NULL;
 }
 
+static void *bpf_new_dynptr_key(u32 key_size)
+{
+	struct bpf_dynptr_kern *kptr;
+
+	kptr = kvmalloc(sizeof(*kptr) + key_size, GFP_USER | __GFP_NOWARN);
+	if (kptr)
+		bpf_dynptr_init(kptr, &kptr[1], BPF_DYNPTR_TYPE_USER, 0, key_size);
+	return kptr;
+}
+
+static int bpf_copy_to_dynptr_ukey(struct bpf_dynptr_user __user *uptr,
+				   struct bpf_dynptr_kern *kptr)
+{
+	struct {
+		unsigned int size;
+		unsigned int zero;
+	} tuple;
+	u64 udata;
+
+	if (copy_from_user(&udata, &uptr->data, sizeof(udata)))
+		return -EFAULT;
+
+	/* Also zeroing the reserved field in uptr */
+	tuple.size = bpf_dynptr_get_size(kptr);
+	tuple.zero = 0;
+	if (copy_to_user(u64_to_user_ptr(udata), kptr->data + kptr->offset, tuple.size) ||
+	    copy_to_user(&uptr->size, &tuple, sizeof(tuple)))
+		return -EFAULT;
+
+	return 0;
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD flags
 
@@ -1337,7 +1415,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = __bpf_copy_key(ukey, map->key_size);
+	key = __bpf_copy_key(map, ukey);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1377,7 +1455,6 @@ static int map_lookup_elem(union bpf_attr *attr)
 	return err;
 }
 
-
 #define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
 
 static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
@@ -1410,7 +1487,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	key = ___bpf_copy_key(ukey, map->key_size);
+	key = ___bpf_copy_key(map, ukey);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1458,7 +1535,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	key = ___bpf_copy_key(ukey, map->key_size);
+	key = ___bpf_copy_key(map, ukey);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1514,7 +1591,7 @@ static int map_get_next_key(union bpf_attr *attr)
 	}
 
 	if (ukey) {
-		key = __bpf_copy_key(ukey, map->key_size);
+		key = __bpf_copy_key(map, ukey);
 		if (IS_ERR(key)) {
 			err = PTR_ERR(key);
 			goto err_put;
@@ -1524,7 +1601,10 @@ static int map_get_next_key(union bpf_attr *attr)
 	}
 
 	err = -ENOMEM;
-	next_key = kvmalloc(map->key_size, GFP_USER);
+	if (map_key_has_dynptr(map))
+		next_key = bpf_new_dynptr_key(map->map_extra);
+	else
+		next_key = kvmalloc(map->key_size, GFP_USER | __GFP_NOWARN);
 	if (!next_key)
 		goto free_key;
 
@@ -1540,8 +1620,11 @@ static int map_get_next_key(union bpf_attr *attr)
 	if (err)
 		goto free_next_key;
 
-	err = -EFAULT;
-	if (copy_to_user(unext_key, next_key, map->key_size) != 0)
+	if (map_key_has_dynptr(map))
+		err = bpf_copy_to_dynptr_ukey(unext_key, next_key);
+	else
+		err = copy_to_user(unext_key, next_key, map->key_size) != 0 ? -EFAULT : 0;
+	if (err)
 		goto free_next_key;
 
 	err = 0;
@@ -1815,7 +1898,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = __bpf_copy_key(ukey, map->key_size);
+	key = __bpf_copy_key(map, ukey);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ead35f39f185..3466bcc9aeca 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6814,6 +6814,12 @@ struct bpf_timer {
 	__u64 :64;
 } __attribute__((aligned(8)));
 
+struct bpf_dynptr_user {
+	__u64 data;
+	__u32 size;
+	__u32 :32;
+} __attribute__((aligned(8)));
+
 struct bpf_dynptr {
 	__u64 :64;
 	__u64 :64;
-- 
2.29.2



* [PATCH bpf-next v2 04/13] bpf: Support bpf_dynptr-typed map key in verifier
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (2 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 05/13] libbpf: Add helpers for bpf_dynptr_user Hou Tao
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

For maps with a dynptr key, only allow a bpf_dynptr on the stack to be
used as the map key.
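
For illustration (not part of the patch), the sketch below shows the kind
of program this change is meant to accept: the key passed to a map helper
is an initialized bpf_dynptr living on the program stack. The map type
comes from the later qp-trie patch in this series; the section name, key
and value are arbitrary assumptions.

/* Sketch only: update a qp-trie element with a variable-length key.
 * The key must be an initialized dynptr on the program stack; passing
 * a plain buffer or an uninitialized dynptr is rejected.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_QP_TRIE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__uint(max_entries, 1024);
	__uint(map_extra, 64);	/* maximum key length in bytes */
	__type(key, struct bpf_dynptr);
	__type(value, __u64);
} counts SEC(".maps");

SEC("tp/syscalls/sys_enter_getpgid")
int update_with_dynptr_key(void *ctx)
{
	char name[8] = "getpgid";
	struct bpf_dynptr key;
	__u64 one = 1;

	if (bpf_dynptr_from_mem(name, 7, 0, &key))
		return 0;

	bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
	return 0;
}

char _license[] SEC("license") = "GPL";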

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/verifier.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6f6d2d511c06..5d2868a798d6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6020,9 +6020,20 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "invalid map_ptr to access map->key\n");
 			return -EACCES;
 		}
-		err = check_helper_mem_access(env, regno,
-					      meta->map_ptr->key_size, false,
-					      NULL);
+		/* Allow bpf_dynptr to be used as map key */
+		if (map_key_has_dynptr(meta->map_ptr)) {
+			if (base_type(reg->type) != PTR_TO_STACK ||
+			    !is_dynptr_reg_valid_init(env, reg) ||
+			    !is_dynptr_type_expected(env, reg, ARG_PTR_TO_DYNPTR)) {
+				verbose(env, "expect R%d to be dynptr instead of %s\n",
+					regno, reg_type_str(env, reg->type));
+				return -EACCES;
+			}
+		} else {
+			err = check_helper_mem_access(env, regno,
+						      meta->map_ptr->key_size, false,
+						      NULL);
+		}
 		break;
 	case ARG_PTR_TO_MAP_VALUE:
 		if (type_may_be_null(arg_type) && register_is_null(reg))
-- 
2.29.2



* [PATCH bpf-next v2 05/13] libbpf: Add helpers for bpf_dynptr_user
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (3 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 04/13] bpf: Support bpf_dynptr-typed map key in verifier Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 06/13] bpf: Add support for qp-trie map with dynptr key Hou Tao
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add bpf_dynptr_user_init() to initialize a bpf_dynptr_user,
bpf_dynptr_user_get_{data,size}() to get the address and length of the
dynptr, and bpf_dynptr_user_trim() to trim the size of the dynptr.

Instead of exporting these symbols, simply add these helpers as inline
functions in bpf.h.
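
A minimal usage sketch of these helpers (not part of the patch); the map
fd is assumed to refer to a dynptr-key map (e.g. qp-trie) with a __u32
value:

/* Sketch only: update and look up one variable-length key with the new
 * inline helpers. 'map_fd', the key string and the value type are
 * assumptions of the example.
 */
#include <string.h>
#include <bpf/bpf.h>

static int update_and_lookup(int map_fd)
{
	char buf[] = "abc";
	struct bpf_dynptr_user key;
	__u32 value = 1, got;

	/* Only strlen(buf) bytes are used as the key, not sizeof(buf) */
	bpf_dynptr_user_init(buf, strlen(buf), &key);

	if (bpf_map_update_elem(map_fd, &key, &value, BPF_ANY))
		return -1;
	if (bpf_map_lookup_elem(map_fd, &key, &got))
		return -1;

	return got == value ? 0 : -1;
}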

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/lib/bpf/bpf.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9c50beabdd14..6b91a8c2b2ae 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -371,6 +371,35 @@ LIBBPF_API int bpf_btf_get_fd_by_id(__u32 id);
 LIBBPF_API int bpf_link_get_fd_by_id(__u32 id);
 LIBBPF_API int bpf_obj_get_info_by_fd(int bpf_fd, void *info, __u32 *info_len);
 
+/* sys_bpf() will check the validity of size */
+static inline void bpf_dynptr_user_init(void *data, __u32 size,
+					struct bpf_dynptr_user *dynptr)
+{
+	/* Zero padding bytes */
+	memset(dynptr, 0, sizeof(*dynptr));
+	dynptr->data = (__u64)(unsigned long)data;
+	dynptr->size = size;
+}
+
+static inline __u32
+bpf_dynptr_user_get_size(const struct bpf_dynptr_user *dynptr)
+{
+	return dynptr->size;
+}
+
+static inline void *
+bpf_dynptr_user_get_data(const struct bpf_dynptr_user *dynptr)
+{
+	return (void *)(unsigned long)dynptr->data;
+}
+
+static inline void bpf_dynptr_user_trim(struct bpf_dynptr_user *dynptr,
+					__u32 new_size)
+{
+	if (new_size < dynptr->size)
+		dynptr->size = new_size;
+}
+
 struct bpf_prog_query_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
 	__u32 query_flags;
-- 
2.29.2



* [PATCH bpf-next v2 06/13] bpf: Add support for qp-trie map with dynptr key
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (4 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 05/13] libbpf: Add helpers for bpf_dynptr_user Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 07/13] libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE Hou Tao
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length. Moreover,
as a big-endian lexicographically ordered map, qp-trie can also be used
for any binary data with fixed or variable length.

The memory efficiency of qp-trie comes partly from its design, which
doesn't store the key in branch nodes and uses a sparse array to store
child nodes, and partly from the support for bpf_dynptr-typed keys: only
the used part of the key is saved.

But the memory efficiency and ordered keys come with a cost: for strings
(e.g. symbol names in /proc/kallsyms), the lookup performance of qp-trie
is about 30% or more slower than that of hash table. However, the lookup
performance is not always worse than hash table: for a randomly generated
binary data set with big differences in length, the lookup performance of
qp-trie is twice as good as hash table, as shown in the following
benchmark.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_types.h      |    1 +
 include/uapi/linux/bpf.h       |    1 +
 kernel/bpf/Makefile            |    1 +
 kernel/bpf/bpf_qp_trie.c       | 1057 ++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |    1 +
 5 files changed, 1061 insertions(+)
 create mode 100644 kernel/bpf/bpf_qp_trie.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2c6a4f2562a7..c50233463e9b 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -127,6 +127,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_QP_TRIE, qp_trie_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3466bcc9aeca..bdd964d51c38 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -929,6 +929,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
+	BPF_MAP_TYPE_QP_TRIE,
 };
 
 /* Note that tracing related programs such as
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..8419f44fea50 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_i
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_qp_trie.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
diff --git a/kernel/bpf/bpf_qp_trie.c b/kernel/bpf/bpf_qp_trie.c
new file mode 100644
index 000000000000..19974f63a102
--- /dev/null
+++ b/kernel/bpf/bpf_qp_trie.c
@@ -0,0 +1,1057 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Derived from qp.c in https://github.com/fanf2/qp.git
+ *
+ * Copyright (C) 2022. Huawei Technologies Co., Ltd
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+
+/* qp-trie (quadbit popcount trie) is a memory efficient trie. Unlike
+ * normal trie which uses byte as lookup key, qp-trie interprets its keys
+ * as quadbit/nibble array and uses one nibble each time during lookup.
+ * The most significant nibble (upper nibble) of byte N in the key will
+ * be the 2*N element of nibble array, and the least significant nibble
+ * (lower nibble) of byte N will be the 2*N+1 element in nibble array.
+ *
+ * For normal trie, it may have 256 child nodes, and for qp-trie one branch
+ * node may have 17 child nodes. #0 child node is special because it must
+ * be a leaf node and its key is the same as the branch node. #1~#16 child
+ * nodes represent leaf nodes or branch nodes which have different keys
+ * with parent node. The key of branch node is the common prefix for these
+ * child nodes, and the index of child node minus one is the value of first
+ * different nibble between these child nodes.
+ *
+ * qp-trie reduces memory usage through two methods:
+ * (1) Branch node doesn't store the key. It only stores the position of
+ *     the first nibble which differentiates child nodes.
+ * (2) Branch node doesn't store all 17 child nodes. It uses a bitmap and
+ *     popcount() to implement a sparse array and only allocates memory
+ *     for those present children.
+ *
+ * Like normal trie, qp-trie is also ordered and is in big-endian
+ * lexicographic order. If traverse qp-trie in a depth-first way, it will
+ * return a string of ordered keys.
+ *
+ * The following diagrams show the construction of a tiny qp-trie:
+ *
+ * (1) insert abc
+ *
+ *          [ leaf node: abc ]
+ *
+ * (2) insert abc_d
+ *
+ * The first different nibble between "abc" and "abc_d" is the upper nibble
+ * of character '_' (0x5), and its position in nibble array is 6
+ * (starts from 0).
+ *
+ *          [ branch node ] bitmap: 0x41 diff pos: 6
+ *                 |
+ *                 *
+ *             children
+ *          [0]        [6]
+ *           |          |
+ *       [leaf: abc] [leaf: abc_d]
+ *
+ * (3) insert abc_e
+ *
+ * The first different nibble between "abc_d" and "abc_e" is the lower
+ * nibble of character 'd'/'e', and its position in array is 9.
+ *
+ *          [ branch node ] bitmap: 0x41 diff pos: 6
+ *                 |
+ *                 *
+ *             children
+ *          [0]        [6]
+ *           |          |
+ *       [leaf: abc]    |
+ *                      *
+ *                [ branch node ] bitmap: 0x60 diff pos: 9
+ *                      |
+ *                      *
+ *                   children
+ *                [5]        [6]
+ *                 |          |
+ *          [leaf: abc_d]  [leaf: abc_e]
+ */
+
+#define QP_TRIE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | \
+				  BPF_F_ACCESS_MASK)
+
+/* bit[0] of nodes in qp_trie_branch is used to tell node type:
+ *
+ * bit[0]: 0-branch node
+ * bit[0]: 1-leaf node
+ *
+ * Size of qp_trie_branch is already 2-bytes aligned, so only need to make
+ * allocation of leaf node to be 2-bytes aligned.
+ */
+#define QP_TRIE_LEAF_NODE_MASK 1UL
+#define QP_TRIE_LEAF_ALLOC_ALIGN 2
+
+/* To reduce memory usage, only qp_trie_branch is RCU-freed. To handle
+ * freeing of the last leaf node, an extra qp_trie_branch node is
+ * allocated. The branch node has only one child and its index is 0. It
+ * is set as root node after adding the first leaf node.
+ */
+#define QP_TRIE_ROOT_NODE_INDEX 0
+#define QP_TRIE_NON_ROOT_NODE_MASK 1
+
+#define QP_TRIE_NIBBLE_SHIFT 1
+#define QP_TRIE_BYTE_INDEX_SHIFT 2
+
+#define QP_TRIE_TWIGS_FREE_NONE_IDX 17
+
+struct qp_trie_branch {
+	/* The bottom two bits of index are used as special flags:
+	 *
+	 * bit[0]: 0-root, 1-not root
+	 * bit[1]: 0-upper nibble, 1-lower nibble
+	 *
+	 * bit[2:31]: byte index for key
+	 */
+	unsigned int index;
+	/* 17 bits are used to accommodate arbitrary keys, even when there are
+	 * zero-bytes in these keys.
+	 *
+	 * bit[0]: a leaf node has the same key as the prefix of parent node
+	 * bit[N]: a child node with the value of nibble at index as (N - 1)
+	 */
+	unsigned int bitmap:17;
+	/* The index of leaf node will be RCU-freed together */
+	unsigned int to_free_idx:5;
+	struct qp_trie_branch __rcu *parent;
+	struct rcu_head rcu;
+	void __rcu *nodes[0];
+};
+
+#define QP_TRIE_NR_SUBTREE 256
+
+struct qp_trie {
+	struct bpf_map map;
+	atomic_t entries;
+	void __rcu *roots[QP_TRIE_NR_SUBTREE];
+	spinlock_t locks[QP_TRIE_NR_SUBTREE];
+};
+
+/* Internally use qp_trie_key instead of bpf_dynptr_kern
+ * to reduce memory usage
+ */
+struct qp_trie_key {
+	/* the length of blob data */
+	unsigned int len;
+	/* blob data */
+	unsigned char data[0];
+};
+
+struct qp_trie_diff {
+	unsigned int index;
+	unsigned int sibling_bm;
+	unsigned int new_bm;
+};
+
+static inline void *to_child_node(const struct qp_trie_key *key)
+{
+	return (void *)((long)key | QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline struct qp_trie_key *to_leaf_node(void *node)
+{
+	return (void *)((long)node & ~QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline bool is_branch_node(void *node)
+{
+	return !((long)node & QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline bool is_same_key(const struct qp_trie_key *k, const unsigned char *data,
+			       unsigned int len)
+{
+	return k->len == len && !memcmp(k->data, data, len);
+}
+
+static inline void *qp_trie_leaf_value(const struct qp_trie_key *key)
+{
+	return (void *)key + sizeof(*key) + key->len;
+}
+
+static inline unsigned int calc_twig_index(unsigned int mask, unsigned int bitmap)
+{
+	return hweight32(mask & (bitmap - 1));
+}
+
+static inline unsigned int calc_twig_nr(unsigned int bitmap)
+{
+	return hweight32(bitmap);
+}
+
+static inline unsigned int nibble_to_bitmap(unsigned char nibble)
+{
+	return 1U << (nibble + 1);
+}
+
+static inline unsigned int index_to_byte_index(unsigned int index)
+{
+	return index >> QP_TRIE_BYTE_INDEX_SHIFT;
+}
+
+static inline unsigned int calc_br_bitmap(unsigned int index, const unsigned char *data,
+					  unsigned int len)
+{
+	unsigned char nibble;
+	unsigned int byte;
+
+	if (index == QP_TRIE_ROOT_NODE_INDEX)
+		return 1;
+
+	byte = index_to_byte_index(index);
+	if (byte >= len)
+		return 1;
+
+	nibble = data[byte];
+	/* lower nibble */
+	if ((index >> QP_TRIE_NIBBLE_SHIFT) & 1)
+		nibble &= 0xf;
+	else
+		nibble >>= 4;
+	return nibble_to_bitmap(nibble);
+}
+
+static void qp_trie_free_twigs_rcu(struct rcu_head *rcu)
+{
+	struct qp_trie_branch *twigs = container_of(rcu, struct qp_trie_branch, rcu);
+	unsigned int idx = twigs->to_free_idx;
+
+	if (idx != QP_TRIE_TWIGS_FREE_NONE_IDX)
+		kfree(to_leaf_node(rcu_access_pointer(twigs->nodes[idx])));
+	kfree(twigs);
+}
+
+static void qp_trie_branch_free(struct qp_trie_branch *twigs, unsigned int to_free_idx)
+{
+	twigs->to_free_idx = to_free_idx;
+	call_rcu(&twigs->rcu, qp_trie_free_twigs_rcu);
+}
+
+static inline struct qp_trie_branch *
+qp_trie_branch_new(struct bpf_map *map, unsigned int nr)
+{
+	struct qp_trie_branch *a;
+
+	a = bpf_map_kmalloc_node(map, sizeof(*a) + nr * sizeof(*a->nodes),
+				 GFP_NOWAIT | __GFP_NOWARN, map->numa_node);
+	return a;
+}
+
+static inline void qp_trie_assign_parent(struct qp_trie_branch *parent, void *node)
+{
+	if (is_branch_node(node))
+		rcu_assign_pointer(((struct qp_trie_branch *)node)->parent, parent);
+}
+
+static void qp_trie_update_parent(struct qp_trie_branch *parent, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++)
+		qp_trie_assign_parent(parent, rcu_dereference_protected(parent->nodes[i], 1));
+}
+
+/* new_node can be either a leaf node or a branch node */
+static struct qp_trie_branch *
+qp_trie_branch_replace(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap,
+		       void *new_node)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap), p;
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr);
+	if (!twigs)
+		return NULL;
+
+	p = calc_twig_index(old->bitmap, bitmap);
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+
+	rcu_assign_pointer(twigs->nodes[p], new_node);
+
+	if (nr - 1 > p)
+		memcpy(&twigs->nodes[p+1], &old->nodes[p+1], (nr - 1 - p) * sizeof(*twigs->nodes));
+
+	twigs->index = old->index;
+	twigs->bitmap = old->bitmap;
+	/* twigs will not be visible to reader until rcu_assign_pointer(), so
+	 * use RCU_INIT_POINTER() here.
+	 */
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	/* Initialize ->parent of parent node first, then update ->parent for
+	 * child nodes after parent node is fully initialized.
+	 */
+	qp_trie_update_parent(twigs, nr);
+
+	return twigs;
+}
+
+static struct qp_trie_branch *
+qp_trie_branch_insert(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap,
+		      const struct qp_trie_key *new)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap), p;
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr + 1);
+	if (!twigs)
+		return NULL;
+
+	p = calc_twig_index(old->bitmap, bitmap);
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+
+	rcu_assign_pointer(twigs->nodes[p], to_child_node(new));
+
+	if (nr > p)
+		memcpy(&twigs->nodes[p+1], &old->nodes[p], (nr - p) * sizeof(*twigs->nodes));
+
+	twigs->bitmap = old->bitmap | bitmap;
+	twigs->index = old->index;
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	qp_trie_update_parent(twigs, nr + 1);
+
+	return twigs;
+}
+
+static struct qp_trie_branch *
+qp_trie_branch_remove(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap), p;
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr - 1);
+	if (!twigs)
+		return NULL;
+
+	p = calc_twig_index(old->bitmap, bitmap);
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+	if (nr - 1 > p)
+		memcpy(&twigs->nodes[p], &old->nodes[p+1], (nr - 1 - p) * sizeof(*twigs->nodes));
+
+	twigs->bitmap = old->bitmap & ~bitmap;
+	twigs->index = old->index;
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	qp_trie_update_parent(twigs, nr - 1);
+
+	return twigs;
+}
+
+static struct qp_trie_key *
+qp_trie_init_leaf_node(struct bpf_map *map, const struct bpf_dynptr_kern *k, void *v)
+{
+	unsigned int key_size, total;
+	struct qp_trie_key *new;
+
+	key_size = bpf_dynptr_get_size(k);
+	if (!key_size || key_size > (u32)map->map_extra)
+		return ERR_PTR(-EINVAL);
+
+	total = round_up(sizeof(*new) + key_size + map->value_size, QP_TRIE_LEAF_ALLOC_ALIGN);
+	new = bpf_map_kmalloc_node(map, total, GFP_NOWAIT | __GFP_NOWARN, map->numa_node);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	new->len = key_size;
+	memcpy(new->data, k->data + k->offset, key_size);
+	memcpy((void *)&new[1] + key_size, v, map->value_size);
+
+	return new;
+}
+
+static bool calc_prefix_len(const struct qp_trie_key *s_key, const struct qp_trie_key *n_key,
+			    unsigned int *index)
+{
+	unsigned int i, len = min(s_key->len, n_key->len);
+	unsigned char diff = 0;
+
+	for (i = 0; i < len; i++) {
+		diff = s_key->data[i] ^ n_key->data[i];
+		if (diff)
+			break;
+	}
+
+	*index = (i << QP_TRIE_BYTE_INDEX_SHIFT) | QP_TRIE_NON_ROOT_NODE_MASK;
+	if (!diff)
+		return s_key->len == n_key->len;
+
+	*index += (diff & 0xf0) ? 0 : (1U << QP_TRIE_NIBBLE_SHIFT);
+	return false;
+}
+
+static int qp_trie_new_branch(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+			      unsigned int bitmap, void *sibling, struct qp_trie_diff *d,
+			      const struct qp_trie_key *leaf)
+{
+	struct qp_trie_branch *new_child_twigs, *new_twigs, *old_twigs;
+	struct bpf_map *map;
+	unsigned int iip;
+	int err;
+
+	map = &trie->map;
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	new_child_twigs = qp_trie_branch_new(map, 2);
+	if (!new_child_twigs) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+
+	new_child_twigs->index = d->index;
+	new_child_twigs->bitmap = d->sibling_bm | d->new_bm;
+
+	iip = calc_twig_index(new_child_twigs->bitmap, d->sibling_bm);
+	RCU_INIT_POINTER(new_child_twigs->nodes[iip], sibling);
+	rcu_assign_pointer(new_child_twigs->nodes[!iip], to_child_node(leaf));
+	RCU_INIT_POINTER(new_child_twigs->parent, NULL);
+
+	old_twigs = rcu_dereference_protected(*parent, 1);
+	new_twigs = qp_trie_branch_replace(map, old_twigs, bitmap, new_child_twigs);
+	if (!new_twigs) {
+		err = -ENOMEM;
+		goto free_child_twigs;
+	}
+
+	qp_trie_assign_parent(new_child_twigs, sibling);
+	rcu_assign_pointer(*parent, new_twigs);
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+
+	return 0;
+
+free_child_twigs:
+	kfree(new_child_twigs);
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_ext_branch(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+			      const struct qp_trie_key *new, unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map;
+	int err;
+
+	map = &trie->map;
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	old_twigs = rcu_dereference_protected(*parent, 1);
+	new_twigs = qp_trie_branch_insert(map, old_twigs, bitmap, new);
+	if (!new_twigs) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+
+	rcu_assign_pointer(*parent, new_twigs);
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+
+	return 0;
+
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_add_leaf_node(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+				 const struct qp_trie_key *new)
+{
+	struct bpf_map *map = &trie->map;
+	struct qp_trie_branch *twigs;
+	int err;
+
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	twigs = qp_trie_branch_new(map, 1);
+	if (!twigs) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+	twigs->index = QP_TRIE_ROOT_NODE_INDEX;
+	twigs->bitmap = 1;
+	RCU_INIT_POINTER(twigs->parent, NULL);
+	rcu_assign_pointer(twigs->nodes[0], to_child_node(new));
+
+	rcu_assign_pointer(*parent, twigs);
+
+	return 0;
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_rep_leaf_node(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+				 const struct qp_trie_key *new, unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map = &trie->map;
+
+	/* Only branch node is freed by RCU, so replace the old branch node
+	 * and free the old leaf node together with the old branch node.
+	 */
+	old_twigs = rcu_dereference_protected(*parent, 1);
+	new_twigs = qp_trie_branch_replace(map, old_twigs, bitmap, to_child_node(new));
+	if (!new_twigs)
+		return -ENOMEM;
+
+	rcu_assign_pointer(*parent, new_twigs);
+
+	qp_trie_branch_free(old_twigs, calc_twig_index(old_twigs->bitmap, bitmap));
+
+	return 0;
+}
+
+static int qp_trie_remove_leaf(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+			       unsigned int bitmap, const struct qp_trie_key *node)
+{
+	struct bpf_map *map = &trie->map;
+	struct qp_trie_branch *new, *old;
+	unsigned int nr;
+
+	old = rcu_dereference_protected(*parent, 1);
+	nr = calc_twig_nr(old->bitmap);
+	if (nr > 2) {
+		new = qp_trie_branch_remove(map, old, bitmap);
+		if (!new)
+			return -ENOMEM;
+	} else {
+		new = NULL;
+	}
+
+	rcu_assign_pointer(*parent, new);
+
+	qp_trie_branch_free(old, calc_twig_index(old->bitmap, bitmap));
+
+	atomic_dec(&trie->entries);
+
+	return 0;
+}
+
+static int qp_trie_merge_node(struct qp_trie *trie, struct qp_trie_branch __rcu **grand_parent,
+			      struct qp_trie_branch *parent, unsigned int parent_bitmap,
+			      unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map = &trie->map;
+	void *new_sibling;
+	unsigned int iip;
+
+	iip = calc_twig_index(parent->bitmap, bitmap);
+	new_sibling = rcu_dereference_protected(parent->nodes[!iip], 1);
+
+	old_twigs = rcu_dereference_protected(*grand_parent, 1);
+	new_twigs = qp_trie_branch_replace(map, old_twigs, parent_bitmap, new_sibling);
+	if (!new_twigs)
+		return -ENOMEM;
+
+	rcu_assign_pointer(*grand_parent, new_twigs);
+
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+	qp_trie_branch_free(parent, iip);
+
+	atomic_dec(&trie->entries);
+
+	return 0;
+}
+
+static int qp_trie_alloc_check(union bpf_attr *attr)
+{
+	if (!bpf_capable())
+		return -EPERM;
+
+	if ((attr->map_flags & BPF_F_NO_PREALLOC) != BPF_F_NO_PREALLOC ||
+	    attr->map_flags & ~QP_TRIE_CREATE_FLAG_MASK ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return -EINVAL;
+
+	if (!attr->max_entries || !attr->value_size)
+		return -EINVAL;
+
+	/* Key and value are allocated together in qp_trie_init_leaf_node() */
+	if (round_up((u64)sizeof(struct qp_trie_key) + (u32)attr->map_extra + attr->value_size,
+		     QP_TRIE_LEAF_ALLOC_ALIGN) >= KMALLOC_MAX_SIZE)
+		return -E2BIG;
+
+	return 0;
+}
+
+static struct bpf_map *qp_trie_alloc(union bpf_attr *attr)
+{
+	struct qp_trie *trie;
+	unsigned int i;
+
+	trie = bpf_map_area_alloc(sizeof(*trie), bpf_map_attr_numa_node(attr));
+	if (!trie)
+		return ERR_PTR(-ENOMEM);
+
+	/* roots are zeroed by bpf_map_area_alloc() */
+	for (i = 0; i < QP_TRIE_NR_SUBTREE; i++)
+		spin_lock_init(&trie->locks[i]);
+
+	atomic_set(&trie->entries, 0);
+	bpf_map_init_from_attr(&trie->map, attr);
+
+	return &trie->map;
+}
+
+static void qp_trie_free_subtree(void *root)
+{
+	struct qp_trie_branch *parent = NULL;
+	struct qp_trie_key *cur = NULL;
+	void *node = root;
+
+	/*
+	 * Depth-first deletion
+	 *
+	 * 1. find left-most key and its parent
+	 * 2. get next sibling Y from parent
+	 * (a) Y is leaf node: continue
+	 * (b) Y is branch node: goto step 1
+	 * (c) no more sibling: backtrace upwards if parent is not NULL and
+	 *     goto step 1
+	 */
+	do {
+		while (is_branch_node(node)) {
+			parent = node;
+			node = rcu_dereference_raw(parent->nodes[0]);
+		}
+
+		cur = to_leaf_node(node);
+		while (parent) {
+			unsigned int iip, bitmap, nr;
+			void *ancestor;
+
+			bitmap = calc_br_bitmap(parent->index, cur->data, cur->len);
+			iip = calc_twig_index(parent->bitmap, bitmap) + 1;
+			nr = calc_twig_nr(parent->bitmap);
+
+			for (; iip < nr; iip++) {
+				kfree(cur);
+
+				node = rcu_dereference_raw(parent->nodes[iip]);
+				if (is_branch_node(node))
+					break;
+
+				cur = to_leaf_node(node);
+			}
+			if (iip < nr)
+				break;
+
+			ancestor = rcu_dereference_raw(parent->parent);
+			kfree(parent);
+			parent = ancestor;
+		}
+	} while (parent);
+
+	kfree(cur);
+}
+
+static void qp_trie_free(struct bpf_map *map)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	unsigned int i;
+
+	/* No need to wait for the pending qp_trie_free_twigs_rcu() here.
+	 * It just call kfree to free memory.
+	 */
+
+	for (i = 0; i < ARRAY_SIZE(trie->roots); i++) {
+		void *root = rcu_dereference_raw(trie->roots[i]);
+
+		if (root)
+			qp_trie_free_subtree(root);
+	}
+	bpf_map_area_free(trie);
+}
+
+static inline void qp_trie_copy_leaf(const struct qp_trie_key *leaf, struct bpf_dynptr_kern *key)
+{
+	memcpy(key->data + key->offset, leaf->data, leaf->len);
+	bpf_dynptr_set_size(key, leaf->len);
+}
+
+static void qp_trie_copy_min_key_from(void *root, struct bpf_dynptr_kern *key)
+{
+	void *node;
+
+	node = root;
+	while (is_branch_node(node))
+		node = rcu_dereference(((struct qp_trie_branch *)node)->nodes[0]);
+
+	qp_trie_copy_leaf(to_leaf_node(node), key);
+}
+
+static int qp_trie_lookup_min_key(struct qp_trie *trie, unsigned int from,
+				  struct bpf_dynptr_kern *key)
+{
+	unsigned int i;
+
+	for (i = from; i < ARRAY_SIZE(trie->roots); i++) {
+		void *root = rcu_dereference(trie->roots[i]);
+
+		if (root) {
+			qp_trie_copy_min_key_from(root, key);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
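+/* Return the index of the first twig in @twigs which is greater than
+ * @bitmap (@bitmap itself may be absent from @twigs), or -1 if there is
+ * no such twig.
+ */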
+static int qp_trie_next_twigs_index(struct qp_trie_branch *twigs, unsigned int bitmap)
+{
+	unsigned int idx, nr, next;
+
+	/* bitmap may not be present in twigs->bitmap */
+	idx = calc_twig_index(twigs->bitmap, bitmap);
+	nr = calc_twig_nr(twigs->bitmap);
+
+	next = idx;
+	if (twigs->bitmap & bitmap)
+		next += 1;
+
+	if (next >= nr)
+		return -1;
+	return next;
+}
+
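+/* Copy the smallest key which is greater than @key into @next_key. If @key
+ * is not found in the trie, fall back to the minimal key of the whole trie.
+ */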
+static int qp_trie_lookup_next_node(struct qp_trie *trie, const struct bpf_dynptr_kern *key,
+				    struct bpf_dynptr_kern *next_key)
+{
+	const struct qp_trie_key *found;
+	struct qp_trie_branch *parent;
+	const unsigned char *data;
+	unsigned int data_len;
+	void *node, *next;
+
+	/* Non-existent key, so restart from the beginning */
+	data = key->data + key->offset;
+	node = rcu_dereference(trie->roots[*data]);
+	if (!node)
+		return qp_trie_lookup_min_key(trie, 0, next_key);
+
+	parent = NULL;
+	data_len = bpf_dynptr_get_size(key);
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int iip, bitmap;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (bitmap & br->bitmap)
+			iip = calc_twig_index(br->bitmap, bitmap);
+		else
+			iip = 0;
+
+		parent = br;
+		node = rcu_dereference(br->nodes[iip]);
+	}
+	found = to_leaf_node(node);
+	if (!is_same_key(found, data, data_len))
+		return qp_trie_lookup_min_key(trie, 0, next_key);
+
+	/* Pairs with the store-release in rcu_assign_pointer(*parent, twigs) to
+	 * ensure reading node->parent will not return the old parent if
+	 * the node is found by following the newly-created parent.
+	 */
+	smp_rmb();
+
+	next = NULL;
+	while (parent) {
+		unsigned int bitmap;
+		int next_idx;
+
+		bitmap = calc_br_bitmap(parent->index, data, data_len);
+		next_idx = qp_trie_next_twigs_index(parent, bitmap);
+		if (next_idx >= 0) {
+			next = rcu_dereference(parent->nodes[next_idx]);
+			break;
+		}
+		parent = rcu_dereference(parent->parent);
+	}
+
+	/* Go to the next sub-tree */
+	if (!next)
+		return qp_trie_lookup_min_key(trie, *data + 1, next_key);
+
+	if (!is_branch_node(next))
+		qp_trie_copy_leaf(to_leaf_node(next), next_key);
+	else
+		qp_trie_copy_min_key_from(next, next_key);
+
+	return 0;
+}
+
+/* Called from syscall */
+static int qp_trie_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	int err;
+
+	if (!key)
+		err = qp_trie_lookup_min_key(trie, 0, next_key);
+	else
+		err = qp_trie_lookup_next_node(trie, key, next_key);
+	return err;
+}
+
+/* Called from syscall or from eBPF program */
+static void *qp_trie_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	const struct bpf_dynptr_kern *dynptr_key = key;
+	const struct qp_trie_key *found;
+	const unsigned char *data;
+	unsigned int data_len;
+	void *node, *value;
+
+	/* A zero-length dynptr is possible, but it is invalid as a qp-trie key */
+	data_len = bpf_dynptr_get_size(dynptr_key);
+	if (!data_len)
+		return NULL;
+
+	data = dynptr_key->data + dynptr_key->offset;
+	node = rcu_dereference_check(trie->roots[*data], rcu_read_lock_bh_held());
+	if (!node)
+		return NULL;
+
+	value = NULL;
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int bitmap;
+		unsigned int iip;
+
+		/* When the byte index equals the key length, the target key
+		 * may be in twigs->nodes[0].
+		 */
+		if (index_to_byte_index(br->index) > data_len)
+			goto done;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto done;
+
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
+	}
+
+	found = to_leaf_node(node);
+	if (is_same_key(found, data, data_len))
+		value = qp_trie_leaf_value(found);
+done:
+	return value;
+}
+
+/* Called from syscall or from eBPF program */
+static int qp_trie_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	const struct qp_trie_key *leaf_key, *new_key;
+	struct qp_trie_branch __rcu **parent;
+	struct qp_trie_diff d;
+	unsigned int bitmap;
+	void __rcu **node;
+	spinlock_t *lock;
+	unsigned char c;
+	bool equal;
+	int err;
+
+	if (flags > BPF_EXIST)
+		return -EINVAL;
+
+	/* The content of the key may change, so copy it first */
+	new_key = qp_trie_init_leaf_node(map, key, value);
+	if (IS_ERR(new_key))
+		return PTR_ERR(new_key);
+
+	c = new_key->data[0];
+	lock = &trie->locks[c];
+	spin_lock(lock);
+	parent = (struct qp_trie_branch __rcu **)&trie->roots[c];
+	if (!rcu_dereference_protected(*parent, 1)) {
+		if (flags == BPF_EXIST) {
+			err = -ENOENT;
+			goto unlock;
+		}
+		err = qp_trie_add_leaf_node(trie, parent, new_key);
+		goto unlock;
+	}
+
+	bitmap = 1;
+	node = &rcu_dereference_protected(*parent, 1)->nodes[0];
+	while (is_branch_node(rcu_dereference_protected(*node, 1))) {
+		struct qp_trie_branch *br = rcu_dereference_protected(*node, 1);
+		unsigned int iip;
+
+		bitmap = calc_br_bitmap(br->index, new_key->data, new_key->len);
+		if (bitmap & br->bitmap)
+			iip = calc_twig_index(br->bitmap, bitmap);
+		else
+			iip = 0;
+		parent = (struct qp_trie_branch __rcu **)node;
+		node = &br->nodes[iip];
+	}
+
+	leaf_key = to_leaf_node(rcu_dereference_protected(*node, 1));
+	equal = calc_prefix_len(leaf_key, new_key, &d.index);
+	if (equal) {
+		if (flags == BPF_NOEXIST) {
+			err = -EEXIST;
+			goto unlock;
+		}
+		err = qp_trie_rep_leaf_node(trie, parent, new_key, bitmap);
+		goto unlock;
+	}
+
+	d.sibling_bm = calc_br_bitmap(d.index, leaf_key->data, leaf_key->len);
+	d.new_bm = calc_br_bitmap(d.index, new_key->data, new_key->len);
+
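+	/* Walk down again to find where the new key diverges: either extend
+	 * the branch whose index equals the divergence index, or create a
+	 * new branch right above the first branch with a bigger index.
+	 */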
+	bitmap = 1;
+	parent = (struct qp_trie_branch __rcu **)&trie->roots[c];
+	node = &rcu_dereference_protected(*parent, 1)->nodes[0];
+	while (is_branch_node(rcu_dereference_protected(*node, 1))) {
+		struct qp_trie_branch *br = rcu_dereference_protected(*node, 1);
+		unsigned int iip;
+
+		if (d.index < br->index)
+			goto new_branch;
+
+		parent = (struct qp_trie_branch __rcu **)node;
+		if (d.index == br->index) {
+			if (flags == BPF_EXIST) {
+				err = -ENOENT;
+				goto unlock;
+			}
+			err = qp_trie_ext_branch(trie, parent, new_key, d.new_bm);
+			goto unlock;
+		}
+
+		bitmap = calc_br_bitmap(br->index, new_key->data, new_key->len);
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = &br->nodes[iip];
+	}
+
+new_branch:
+	if (flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto unlock;
+	}
+	err = qp_trie_new_branch(trie, parent, bitmap, rcu_dereference_protected(*node, 1),
+				 &d, new_key);
+unlock:
+	spin_unlock(lock);
+	if (err)
+		kfree(new_key);
+	return err;
+}
+
+/* Called from syscall or from eBPF program */
+static int qp_trie_delete_elem(struct bpf_map *map, void *key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	struct qp_trie_branch __rcu **parent, **grand_parent;
+	unsigned int bitmap, parent_bitmap, data_len, nr;
+	const struct bpf_dynptr_kern *dynptr_key;
+	const struct qp_trie_key *found;
+	const unsigned char *data;
+	void __rcu **node;
+	spinlock_t *lock;
+	unsigned char c;
+	int err;
+
+	dynptr_key = key;
+	data_len = bpf_dynptr_get_size(dynptr_key);
+	if (!data_len)
+		return -EINVAL;
+
+	err = -ENOENT;
+	data = dynptr_key->data + dynptr_key->offset;
+	c = *data;
+	lock = &trie->locks[c];
+	spin_lock(lock);
+	parent = (struct qp_trie_branch __rcu **)&trie->roots[c];
+	if (!*parent)
+		goto unlock;
+
+	grand_parent = NULL;
+	parent_bitmap = bitmap = 1;
+	node = &rcu_dereference_protected(*parent, 1)->nodes[0];
+	while (is_branch_node(rcu_dereference_protected(*node, 1))) {
+		struct qp_trie_branch *br = rcu_dereference_protected(*node, 1);
+		unsigned int iip;
+
+		if (index_to_byte_index(br->index) > data_len)
+			goto unlock;
+
+		parent_bitmap = bitmap;
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto unlock;
+
+		grand_parent = parent;
+		parent = (struct qp_trie_branch __rcu **)node;
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = &br->nodes[iip];
+	}
+
+	found = to_leaf_node(rcu_dereference_protected(*node, 1));
+	if (!is_same_key(found, data, data_len))
+		goto unlock;
+
+	nr = calc_twig_nr(rcu_dereference_protected(*parent, 1)->bitmap);
+	if (nr != 2)
+		err = qp_trie_remove_leaf(trie, parent, bitmap, found);
+	else
+		err = qp_trie_merge_node(trie, grand_parent, rcu_dereference_protected(*parent, 1),
+					 parent_bitmap, bitmap);
+unlock:
+	spin_unlock(lock);
+	return err;
+}
+
+static int qp_trie_check_btf(const struct bpf_map *map,
+			     const struct btf *btf,
+			     const struct btf_type *key_type,
+			     const struct btf_type *value_type)
+{
+	if (!map->dynptr_key_off && key_type->size == sizeof(struct bpf_dynptr))
+		return 0;
+	return -EINVAL;
+}
+
+BTF_ID_LIST_SINGLE(qp_trie_map_btf_ids, struct, qp_trie)
+const struct bpf_map_ops qp_trie_map_ops = {
+	.map_alloc_check = qp_trie_alloc_check,
+	.map_alloc = qp_trie_alloc,
+	.map_free = qp_trie_free,
+	.map_get_next_key = qp_trie_get_next_key,
+	.map_lookup_elem = qp_trie_lookup_elem,
+	.map_update_elem = qp_trie_update_elem,
+	.map_delete_elem = qp_trie_delete_elem,
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_check_btf = qp_trie_check_btf,
+	.map_btf_id = &qp_trie_map_btf_ids[0],
+};
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3466bcc9aeca..bdd964d51c38 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -929,6 +929,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
+	BPF_MAP_TYPE_QP_TRIE,
 };
 
 /* Note that tracing related programs such as
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH bpf-next v2 07/13] libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (5 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 06/13] bpf: Add support for qp-trie map with dynptr key Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map Hou Tao
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

BPF_MAP_TYPE_QP_TRIE requires BTF, so create a BTF fd to pass the key
and value types to the kernel and initialize map_extra to specify the
maximum size of the bpf_dynptr-typed key.
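
For reference, a qp-trie map declared in a BPF program looks like the
definition used by the selftests later in this series (FILE_PATH_SIZE is
the selftest's maximum key size and is only illustrative here):

  struct {
  	__uint(type, BPF_MAP_TYPE_QP_TRIE);
  	__uint(max_entries, 2);
  	__type(key, struct bpf_dynptr);
  	__type(value, unsigned int);
  	__uint(map_flags, BPF_F_NO_PREALLOC);
  	__uint(map_extra, FILE_PATH_SIZE);
  } trie SEC(".maps");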

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/lib/bpf/libbpf.c        |  1 +
 tools/lib/bpf/libbpf_probes.c | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e691f08a297f..21b5b33fa010 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -164,6 +164,7 @@ static const char * const map_type_name[] = {
 	[BPF_MAP_TYPE_TASK_STORAGE]		= "task_storage",
 	[BPF_MAP_TYPE_BLOOM_FILTER]		= "bloom_filter",
 	[BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
+	[BPF_MAP_TYPE_QP_TRIE]			= "qp_trie",
 };
 
 static const char * const prog_type_name[] = {
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index f3a8e8e74eb8..c4d1b595a4fe 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -188,6 +188,19 @@ static int load_local_storage_btf(void)
 				     strs, sizeof(strs));
 }
 
+static int load_qp_trie_btf(void)
+{
+	const char strs[] = "\0bpf_dynptr";
+	__u32 types[] = {
+		/* struct bpf_dynptr */				/* [1] */
+		BTF_TYPE_ENC(1, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 0), 16),
+		/* unsigned int */				/* [2] */
+		BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),
+	};
+	return libbpf__load_raw_btf((char *)types, sizeof(types),
+				    strs, sizeof(strs));
+}
+
 static int probe_map_create(enum bpf_map_type map_type)
 {
 	LIBBPF_OPTS(bpf_map_create_opts, opts);
@@ -264,6 +277,18 @@ static int probe_map_create(enum bpf_map_type map_type)
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
 		break;
+	case BPF_MAP_TYPE_QP_TRIE:
+		key_size = sizeof(struct bpf_dynptr);
+		value_size = 4;
+		btf_key_type_id = 1;
+		btf_value_type_id = 2;
+		max_entries = 1;
+		opts.map_flags = BPF_F_NO_PREALLOC;
+		opts.map_extra = 1;
+		btf_fd = load_qp_trie_btf();
+		if (btf_fd < 0)
+			return btf_fd;
+		break;
 	case BPF_MAP_TYPE_UNSPEC:
 	default:
 		return -EOPNOTSUPP;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (6 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 07/13] libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-27 11:24   ` Quentin Monnet
  2022-09-24 13:36 ` [PATCH bpf-next v2 09/13] selftests/bpf: Add two new dynptr_fail cases for map key Hou Tao
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Support lookup/update/delete/iterate/dump operations for qp-trie in
bpftool. Mainly add two functions: one to parse a dynptr key and another
to dump a dynptr key. The input format of a dynptr key is
"key [hex] size BYTES" and the output format is "size BYTES".

The following is the output when using bpftool to manipulate a
qp-trie map:

  $ bpftool map pin id 724953 /sys/fs/bpf/qp
  $ bpftool map show pinned /sys/fs/bpf/qp
  724953: qp_trie  name qp_trie  flags 0x1
          key 16B  value 4B  max_entries 2  memlock 65536B  map_extra 8
          btf_id 779
          pids test_qp_trie.bi(109167)
  $ bpftool map dump pinned /sys/fs/bpf/qp
  [{
          "key": {
              "size": 4,
              "data": ["0x0","0x0","0x0","0x0"
              ]
          },
          "value": 0
      },{
          "key": {
              "size": 4,
              "data": ["0x0","0x0","0x0","0x1"
              ]
          },
          "value": 2
      }
  ]
  $ bpftool map lookup pinned /sys/fs/bpf/qp key 4 0 0 0 1
  {
      "key": {
          "size": 4,
          "data": ["0x0","0x0","0x0","0x1"
          ]
      },
      "value": 2
  }
  $ bpftool map update pinned /sys/fs/bpf/qp key 4 0 0 0 1 value 2 0 0 0
  $ bpftool map getnext pinned /sys/fs/bpf/qp
  key: None
  next key:
  00 00 00 00
  $ bpftool map getnext pinned /sys/fs/bpf/qp key 4 0 0 0 0
  key:
  00 00 00 00
  next key:
  00 00 00 01
  $ bpftool map getnext pinned /sys/fs/bpf/qp key 4 0 0 0 1
  Error: can't get next key: No such file or directory
  $ bpftool map delete pinned /sys/fs/bpf/qp key 4 0 0 0 1
  $ bpftool map dump pinned /sys/fs/bpf/qp
  [{
          "key": {
              "size": 4,
              "data": ["0x0","0x0","0x0","0x0"
              ]
          },
          "value": 0
      }
  ]

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../bpf/bpftool/Documentation/bpftool-map.rst |   4 +-
 tools/bpf/bpftool/btf_dumper.c                |  33 ++++
 tools/bpf/bpftool/map.c                       | 149 +++++++++++++++---
 3 files changed, 164 insertions(+), 22 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 7f3b67a8b48f..020df5481fd6 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -45,7 +45,7 @@ MAP COMMANDS
 |	**bpftool** **map help**
 |
 |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* | **name** *MAP_NAME* }
-|	*DATA* := { [**hex**] *BYTES* }
+|	*DATA* := { [**hex**] *BYTES* | [**hex**] *size* *BYTES* }
 |	*PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* | **name** *PROG_NAME* }
 |	*VALUE* := { *DATA* | *MAP* | *PROG* }
 |	*UPDATE_FLAGS* := { **any** | **exist** | **noexist** }
@@ -55,7 +55,7 @@ MAP COMMANDS
 |		| **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash**
 |		| **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
 |		| **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage**
-|		| **task_storage** | **bloom_filter** | **user_ringbuf** }
+|		| **task_storage** | **bloom_filter** | **user_ringbuf** | **qp_trie** }
 
 DESCRIPTION
 ===========
diff --git a/tools/bpf/bpftool/btf_dumper.c b/tools/bpf/bpftool/btf_dumper.c
index 19924b6ce796..817868961963 100644
--- a/tools/bpf/bpftool/btf_dumper.c
+++ b/tools/bpf/bpftool/btf_dumper.c
@@ -462,6 +462,30 @@ static int btf_dumper_int(const struct btf_type *t, __u8 bit_offset,
 	return 0;
 }
 
+static int btf_dumper_dynptr_user(const struct btf_dumper *d,
+				  const struct bpf_dynptr_user *dynptr)
+{
+	unsigned int i, size;
+	unsigned char *data;
+
+	data = bpf_dynptr_user_get_data(dynptr);
+	size = bpf_dynptr_user_get_size(dynptr);
+
+	jsonw_start_object(d->jw);
+
+	jsonw_name(d->jw, "size");
+	jsonw_printf(d->jw, "%u", size);
+	jsonw_name(d->jw, "data");
+	jsonw_start_array(d->jw);
+	for (i = 0; i < size; i++)
+		jsonw_printf(d->jw, "\"0x%hhx\"", data[i]);
+	jsonw_end_array(d->jw);
+
+	jsonw_end_object(d->jw);
+
+	return 0;
+}
+
 static int btf_dumper_struct(const struct btf_dumper *d, __u32 type_id,
 			     const void *data)
 {
@@ -552,6 +576,12 @@ static int btf_dumper_datasec(const struct btf_dumper *d, __u32 type_id,
 	return ret;
 }
 
+static bool btf_is_dynptr(const struct btf *btf, const struct btf_type *t)
+{
+	return t->size == sizeof(struct bpf_dynptr) &&
+	       !strcmp(btf__name_by_offset(btf, t->name_off), "bpf_dynptr");
+}
+
 static int btf_dumper_do_type(const struct btf_dumper *d, __u32 type_id,
 			      __u8 bit_offset, const void *data)
 {
@@ -562,6 +592,9 @@ static int btf_dumper_do_type(const struct btf_dumper *d, __u32 type_id,
 		return btf_dumper_int(t, bit_offset, data, d->jw,
 				     d->is_plain_text);
 	case BTF_KIND_STRUCT:
+		if (btf_is_dynptr(d->btf, t))
+			return btf_dumper_dynptr_user(d, data);
+		/* fallthrough */
 	case BTF_KIND_UNION:
 		return btf_dumper_struct(d, type_id, data);
 	case BTF_KIND_ARRAY:
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 9a6ca9f31133..92c175518293 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -43,6 +43,17 @@ static bool map_is_map_of_progs(__u32 type)
 	return type == BPF_MAP_TYPE_PROG_ARRAY;
 }
 
+static bool map_is_dynptr_key(__u32 type)
+{
+	return type == BPF_MAP_TYPE_QP_TRIE;
+}
+
+static bool map_use_map_extra(__u32 type)
+{
+	return type == BPF_MAP_TYPE_BLOOM_FILTER ||
+	       type == BPF_MAP_TYPE_QP_TRIE;
+}
+
 static int map_type_from_str(const char *type)
 {
 	const char *map_type_str;
@@ -130,14 +141,44 @@ static json_writer_t *get_btf_writer(void)
 	return jw;
 }
 
-static void print_entry_json(struct bpf_map_info *info, unsigned char *key,
+static void print_key_by_hex_data_json(struct bpf_map_info *info, void *key)
+{
+	unsigned int data_size;
+	unsigned char *data;
+
+	if (map_is_dynptr_key(info->type)) {
+		data = bpf_dynptr_user_get_data(key);
+		data_size = bpf_dynptr_user_get_size(key);
+	} else {
+		data = key;
+		data_size = info->key_size;
+	}
+	print_hex_data_json(data, data_size);
+}
+
+static void fprint_key_in_hex(struct bpf_map_info *info, void *key)
+{
+	unsigned int data_size;
+	unsigned char *data;
+
+	if (map_is_dynptr_key(info->type)) {
+		data = bpf_dynptr_user_get_data(key);
+		data_size = bpf_dynptr_user_get_size(key);
+	} else {
+		data = key;
+		data_size = info->key_size;
+	}
+	fprint_hex(stdout, data, data_size, " ");
+}
+
+static void print_entry_json(struct bpf_map_info *info, void *key,
 			     unsigned char *value, struct btf *btf)
 {
 	jsonw_start_object(json_wtr);
 
 	if (!map_is_per_cpu(info->type)) {
 		jsonw_name(json_wtr, "key");
-		print_hex_data_json(key, info->key_size);
+		print_key_by_hex_data_json(info, key);
 		jsonw_name(json_wtr, "value");
 		print_hex_data_json(value, info->value_size);
 		if (btf) {
@@ -242,19 +283,23 @@ print_entry_error(struct bpf_map_info *map_info, void *key, int lookup_errno)
 	}
 }
 
-static void print_entry_plain(struct bpf_map_info *info, unsigned char *key,
+static void print_entry_plain(struct bpf_map_info *info, void *key,
 			      unsigned char *value)
 {
 	if (!map_is_per_cpu(info->type)) {
 		bool single_line, break_names;
+		unsigned int key_size;
 
-		break_names = info->key_size > 16 || info->value_size > 16;
-		single_line = info->key_size + info->value_size <= 24 &&
-			!break_names;
+		if (map_is_dynptr_key(info->type))
+			key_size = bpf_dynptr_user_get_size(key);
+		else
+			key_size = info->key_size;
+		break_names = key_size > 16 || info->value_size > 16;
+		single_line = key_size + info->value_size <= 24 && !break_names;
 
-		if (info->key_size) {
+		if (key_size) {
 			printf("key:%c", break_names ? '\n' : ' ');
-			fprint_hex(stdout, key, info->key_size, " ");
+			fprint_key_in_hex(info, key);
 
 			printf(single_line ? "  " : "\n");
 		}
@@ -316,6 +361,38 @@ static char **parse_bytes(char **argv, const char *name, unsigned char *val,
 	return argv + i;
 }
 
+/* The format of dynptr is "[hex] size BYTES" */
+static char **parse_dynptr(char **argv, const char *name,
+			   struct bpf_dynptr_user *dynptr)
+{
+	unsigned int base = 0, size;
+	char *endptr;
+
+	if (is_prefix(*argv, "hex")) {
+		base = 16;
+		argv++;
+	}
+	if (!*argv) {
+		p_err("no byte length");
+		return NULL;
+	}
+
+	size = strtoul(*argv, &endptr, base);
+	if (*endptr) {
+		p_err("error parsing byte length: %s", *argv);
+		return NULL;
+	}
+	if (!size || size > bpf_dynptr_user_get_size(dynptr)) {
+		p_err("invalid byte length %u (max length %u)",
+		      size, bpf_dynptr_user_get_size(dynptr));
+		return NULL;
+	}
+	bpf_dynptr_user_trim(dynptr, size);
+
+	return parse_bytes(argv + 1, name, bpf_dynptr_user_get_data(dynptr),
+			   size);
+}
+
 /* on per cpu maps we must copy the provided value on all value instances */
 static void fill_per_cpu_value(struct bpf_map_info *info, void *value)
 {
@@ -350,7 +427,10 @@ static int parse_elem(char **argv, struct bpf_map_info *info,
 			return -1;
 		}
 
-		argv = parse_bytes(argv + 1, "key", key, key_size);
+		if (map_is_dynptr_key(info->type))
+			argv = parse_dynptr(argv + 1, "key", key);
+		else
+			argv = parse_bytes(argv + 1, "key", key, key_size);
 		if (!argv)
 			return -1;
 
@@ -568,6 +648,9 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 		printf("  memlock %sB", memlock);
 	free(memlock);
 
+	if (map_use_map_extra(info->type))
+		printf("  map_extra %llu", info->map_extra);
+
 	if (info->type == BPF_MAP_TYPE_PROG_ARRAY) {
 		char *owner_prog_type = get_fdinfo(fd, "owner_prog_type");
 		char *owner_jited = get_fdinfo(fd, "owner_jited");
@@ -820,6 +903,18 @@ static void free_btf_vmlinux(void)
 		btf__free(btf_vmlinux);
 }
 
+static struct bpf_dynptr_user *bpf_dynptr_user_new(__u32 size)
+{
+	struct bpf_dynptr_user *dynptr;
+
+	dynptr = malloc(sizeof(*dynptr) + size);
+	if (!dynptr)
+		return NULL;
+
+	bpf_dynptr_user_init(&dynptr[1], size, dynptr);
+	return dynptr;
+}
+
 static int
 map_dump(int fd, struct bpf_map_info *info, json_writer_t *wtr,
 	 bool show_header)
@@ -829,7 +924,10 @@ map_dump(int fd, struct bpf_map_info *info, json_writer_t *wtr,
 	struct btf *btf = NULL;
 	int err;
 
-	key = malloc(info->key_size);
+	if (map_is_dynptr_key(info->type))
+		key = bpf_dynptr_user_new(info->map_extra);
+	else
+		key = malloc(info->key_size);
 	value = alloc_value(info);
 	if (!key || !value) {
 		p_err("mem alloc failed");
@@ -966,7 +1064,10 @@ static int alloc_key_value(struct bpf_map_info *info, void **key, void **value)
 	*value = NULL;
 
 	if (info->key_size) {
-		*key = malloc(info->key_size);
+		if (map_is_dynptr_key(info->type))
+			*key = bpf_dynptr_user_new(info->map_extra);
+		else
+			*key = malloc(info->key_size);
 		if (!*key) {
 			p_err("key mem alloc failed");
 			return -1;
@@ -1132,8 +1233,13 @@ static int do_getnext(int argc, char **argv)
 	if (fd < 0)
 		return -1;
 
-	key = malloc(info.key_size);
-	nextkey = malloc(info.key_size);
+	if (map_is_dynptr_key(info.type)) {
+		key = bpf_dynptr_user_new(info.map_extra);
+		nextkey = bpf_dynptr_user_new(info.map_extra);
+	} else {
+		key = malloc(info.key_size);
+		nextkey = malloc(info.key_size);
+	}
 	if (!key || !nextkey) {
 		p_err("mem alloc failed");
 		err = -1;
@@ -1160,23 +1266,23 @@ static int do_getnext(int argc, char **argv)
 		jsonw_start_object(json_wtr);
 		if (key) {
 			jsonw_name(json_wtr, "key");
-			print_hex_data_json(key, info.key_size);
+			print_key_by_hex_data_json(&info, key);
 		} else {
 			jsonw_null_field(json_wtr, "key");
 		}
 		jsonw_name(json_wtr, "next_key");
-		print_hex_data_json(nextkey, info.key_size);
+		print_key_by_hex_data_json(&info, nextkey);
 		jsonw_end_object(json_wtr);
 	} else {
 		if (key) {
 			printf("key:\n");
-			fprint_hex(stdout, key, info.key_size, " ");
+			fprint_key_in_hex(&info, key);
 			printf("\n");
 		} else {
 			printf("key: None\n");
 		}
 		printf("next key:\n");
-		fprint_hex(stdout, nextkey, info.key_size, " ");
+		fprint_key_in_hex(&info, nextkey);
 		printf("\n");
 	}
 
@@ -1203,7 +1309,10 @@ static int do_delete(int argc, char **argv)
 	if (fd < 0)
 		return -1;
 
-	key = malloc(info.key_size);
+	if (map_is_dynptr_key(info.type))
+		key = bpf_dynptr_user_new(info.map_extra);
+	else
+		key = malloc(info.key_size);
 	if (!key) {
 		p_err("mem alloc failed");
 		err = -1;
@@ -1449,7 +1558,7 @@ static int do_help(int argc, char **argv)
 		"       %1$s %2$s help\n"
 		"\n"
 		"       " HELP_SPEC_MAP "\n"
-		"       DATA := { [hex] BYTES }\n"
+		"       DATA := { [hex] BYTES | [hex] size BYTES }\n"
 		"       " HELP_SPEC_PROGRAM "\n"
 		"       VALUE := { DATA | MAP | PROG }\n"
 		"       UPDATE_FLAGS := { any | exist | noexist }\n"
@@ -1459,7 +1568,7 @@ static int do_help(int argc, char **argv)
 		"                 devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n"
 		"                 cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n"
 		"                 queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n"
-		"                 task_storage | bloom_filter | user_ringbuf }\n"
+		"                 task_storage | bloom_filter | user_ringbuf | qp_trie }\n"
 		"       " HELP_SPEC_OPTIONS " |\n"
 		"                    {-f|--bpffs} | {-n|--nomount} }\n"
 		"",
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH bpf-next v2 09/13] selftests/bpf: Add two new dynptr_fail cases for map key
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (7 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 10/13] selftests/bpf: Move ENOTSUPP into bpf_util.h Hou Tao
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

For now, bpf_dynptr is also not usable as a map key, so add two test
cases to verify that: one directly uses a bpf_dynptr as the map key and
the other uses a struct with an embedded bpf_dynptr as the map key.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../testing/selftests/bpf/prog_tests/dynptr.c |  2 +
 .../testing/selftests/bpf/progs/dynptr_fail.c | 43 +++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
index 8fc4e6c02bfd..7844f76c4ee6 100644
--- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
@@ -20,6 +20,8 @@ static struct {
 	{"ringbuf_invalid_api", "type=mem expected=alloc_mem"},
 	{"add_dynptr_to_map1", "invalid indirect read from stack"},
 	{"add_dynptr_to_map2", "invalid indirect read from stack"},
+	{"add_dynptr_to_key1", "invalid indirect read from stack"},
+	{"add_dynptr_to_key2", "invalid indirect read from stack"},
 	{"data_slice_out_of_bounds_ringbuf", "value is outside of the allowed memory range"},
 	{"data_slice_out_of_bounds_map_value", "value is outside of the allowed memory range"},
 	{"data_slice_use_after_release1", "invalid mem access 'scalar'"},
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index b0f08ff024fb..85cc32999e8e 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -35,6 +35,20 @@ struct {
 	__type(value, __u32);
 } array_map3 SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, struct bpf_dynptr);
+	__type(value, __u32);
+} hash_map1 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, struct test_info);
+	__type(value, __u32);
+} hash_map2 SEC(".maps");
+
 struct sample {
 	int pid;
 	long value;
@@ -206,6 +220,35 @@ int add_dynptr_to_map2(void *ctx)
 	return 0;
 }
 
+/* Can't use a dynptr as key of normal map */
+SEC("?raw_tp")
+int add_dynptr_to_key1(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	get_map_val_dynptr(&ptr);
+	bpf_map_lookup_elem(&hash_map1, &ptr);
+
+	return 0;
+}
+
+/* Can't use a struct with an embedded dynptr as key of normal map */
+SEC("?raw_tp")
+int add_dynptr_to_key2(void *ctx)
+{
+	struct test_info x;
+
+	x.x = 0;
+	bpf_ringbuf_reserve_dynptr(&ringbuf, val, 0, &x.ptr);
+
+	/* this should fail */
+	bpf_map_lookup_elem(&hash_map2, &x);
+
+	bpf_ringbuf_submit_dynptr(&x.ptr, 0);
+
+	return 0;
+}
+
 /* A data slice can't be accessed out of bounds */
 SEC("?raw_tp")
 int data_slice_out_of_bounds_ringbuf(void *ctx)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH bpf-next v2 10/13] selftests/bpf: Move ENOTSUPP into bpf_util.h
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (8 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 09/13] selftests/bpf: Add two new dynptr_fail cases for map key Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 11/13] selftests/bpf: Add prog tests for qp-trie map Hou Tao
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

ENOTSUPP has been defined in three test files, so move it into
bpf_util.h where it can be used by both prog_tests and map_tests.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/bpf_util.h              | 4 ++++
 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c | 4 ----
 tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c | 4 ----
 tools/testing/selftests/bpf/test_maps.c             | 4 ----
 4 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_util.h b/tools/testing/selftests/bpf/bpf_util.h
index a3352a64c067..8e15f5d024cc 100644
--- a/tools/testing/selftests/bpf/bpf_util.h
+++ b/tools/testing/selftests/bpf/bpf_util.h
@@ -8,6 +8,10 @@
 #include <errno.h>
 #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */
 
+#ifndef ENOTSUPP
+#define ENOTSUPP 524
+#endif
+
 static inline unsigned int bpf_num_possible_cpus(void)
 {
 	int possible_cpus = libbpf_num_possible_cpus();
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index 2959a52ced06..13416775f825 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -13,10 +13,6 @@
 #include "tcp_ca_incompl_cong_ops.skel.h"
 #include "tcp_ca_unsupp_cong_op.skel.h"
 
-#ifndef ENOTSUPP
-#define ENOTSUPP 524
-#endif
-
 static const unsigned int total_bytes = 10 * 1024 * 1024;
 static int expected_stg = 0xeB9F;
 static int stop, duration;
diff --git a/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c b/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c
index 1102e4f42d2d..7df4dc171461 100644
--- a/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c
+++ b/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c
@@ -10,10 +10,6 @@
 #include "cgroup_helpers.h"
 #include "network_helpers.h"
 
-#ifndef ENOTSUPP
-#define ENOTSUPP 524
-#endif
-
 static struct btf *btf;
 
 static __u32 query_prog_cnt(int cgroup_fd, const char *attach_func)
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index b73152822aa2..a1b94fb4ba51 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -26,10 +26,6 @@
 #include "test_maps.h"
 #include "testing_helpers.h"
 
-#ifndef ENOTSUPP
-#define ENOTSUPP 524
-#endif
-
 int skips;
 
 static struct bpf_map_create_opts map_opts = { .sz = sizeof(map_opts) };
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH bpf-next v2 11/13] selftests/bpf: Add prog tests for qp-trie map
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (9 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 10/13] selftests/bpf: Move ENOTSUPP into bpf_util.h Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 12/13] selftests/bpf: Add benchmark " Hou Tao
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add three test cases for the qp-trie map. The first checks that the
basic operations of qp-trie (lookup/update/delete) work properly from
both the bpf syscall and a bpf program, the second ensures that a dynptr
with zero size or null data is handled correctly by qp-trie, and the
last uses the full path from bpf_d_path() as a dynptr key to show a
possible use case for qp-trie.
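
The d_path case boils down to the following condensed sketch of the test
program added below (unlike the real test, the path buffer here lives on
the stack instead of in an array-map value, and error handling is
trimmed):

  unsigned int trie_path_value;

  SEC("fentry/filp_close")
  int BPF_PROG(d_path_key, struct file *filp)
  {
  	char raw[FILE_PATH_SIZE];
  	struct bpf_dynptr ptr;
  	unsigned int *value;
  	long len;

  	len = bpf_d_path(&filp->f_path, raw, FILE_PATH_SIZE);
  	if (len < 1 || len > FILE_PATH_SIZE)
  		return 0;

  	/* Use the variable-length path (including the trailing zero
  	 * byte) directly as the qp-trie key.
  	 */
  	bpf_dynptr_from_mem(raw, len, 0, &ptr);
  	value = bpf_map_lookup_elem(&trie, &ptr);
  	trie_path_value = value ? *value : -1;

  	return 0;
  }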

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/qp_trie_test.c   | 214 ++++++++++++++++++
 .../selftests/bpf/progs/qp_trie_test.c        | 200 ++++++++++++++++
 2 files changed, 414 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_test.c

diff --git a/tools/testing/selftests/bpf/prog_tests/qp_trie_test.c b/tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
new file mode 100644
index 000000000000..fbfee0916f4c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <test_progs.h>
+#include "qp_trie_test.skel.h"
+
+#define FILE_PATH_SIZE 64
+
+static int setup_trie(struct bpf_map *trie, void *data, unsigned int size, unsigned int value)
+{
+	struct bpf_dynptr_user dynptr;
+	char raw[FILE_PATH_SIZE];
+	int fd, err, zero;
+
+	fd = bpf_map__fd(trie);
+	bpf_dynptr_user_init(data, size, &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "trie add data"))
+		return -EINVAL;
+
+	zero = 0;
+	memset(raw, 0, size);
+	bpf_dynptr_user_init(raw, size, &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &zero, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "trie add zero"))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int setup_array(struct bpf_map *array, void *data, unsigned int size)
+{
+	char raw[FILE_PATH_SIZE];
+	int fd, idx, err;
+
+	fd = bpf_map__fd(array);
+
+	idx = 0;
+	memcpy(raw, data, size);
+	memset(raw + size, 0, sizeof(raw) - size);
+	err = bpf_map_update_elem(fd, &idx, raw, BPF_EXIST);
+	if (!ASSERT_OK(err, "array add data"))
+		return -EINVAL;
+
+	idx = 1;
+	memset(raw, 0, sizeof(raw));
+	err = bpf_map_update_elem(fd, &idx, raw, BPF_EXIST);
+	if (!ASSERT_OK(err, "array add zero"))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int setup_htab(struct bpf_map *htab, void *data, unsigned int size, unsigned int value)
+{
+	char raw[FILE_PATH_SIZE];
+	int fd, err;
+
+	fd = bpf_map__fd(htab);
+
+	memcpy(raw, data, size);
+	memset(raw + size, 0, sizeof(raw) - size);
+	err = bpf_map_update_elem(fd, &raw, &value, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "htab add data"))
+		return -EINVAL;
+
+	return 0;
+}
+
+static void test_qp_trie_basic_ops(void)
+{
+	const char *name = "qp_trie_basic_ops";
+	unsigned int value, new_value;
+	struct bpf_dynptr_user dynptr;
+	struct qp_trie_test *skel;
+	char raw[FILE_PATH_SIZE];
+	int err;
+
+	if (!ASSERT_LT(strlen(name), sizeof(raw), "lengthy data"))
+		return;
+
+	skel = qp_trie_test__open();
+	if (!ASSERT_OK_PTR(skel, "qp_trie_test__open()"))
+		return;
+
+	bpf_program__set_autoload(skel->progs.basic_ops, true);
+
+	err = qp_trie_test__load(skel);
+	if (!ASSERT_OK(err, "qp_trie_test__load()"))
+		goto out;
+
+	value = time(NULL);
+	if (setup_trie(skel->maps.trie, (void *)name, strlen(name), value))
+		goto out;
+
+	if (setup_array(skel->maps.array, (void *)name, strlen(name)))
+		goto out;
+
+	skel->bss->key_size = strlen(name);
+	skel->bss->pid = getpid();
+	err = qp_trie_test__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto out;
+
+	usleep(1);
+
+	ASSERT_EQ(skel->bss->lookup_str_value, -1, "trie lookup str");
+	ASSERT_EQ(skel->bss->lookup_bytes_value, value, "trie lookup byte");
+	ASSERT_EQ(skel->bss->delete_again_err, -ENOENT, "trie delete again");
+
+	bpf_dynptr_user_init((void *)name, strlen(name), &dynptr);
+	new_value = 0;
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.trie), &dynptr, &new_value);
+	ASSERT_OK(err, "lookup trie");
+	ASSERT_EQ(new_value, value + 1, "check updated value");
+
+	memset(raw, 0, sizeof(raw));
+	bpf_dynptr_user_init(&raw, strlen(name), &dynptr);
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.trie), &dynptr, &new_value);
+	ASSERT_EQ(err, -ENOENT, "check deleted elem");
+out:
+	qp_trie_test__destroy(skel);
+}
+
+static void test_qp_trie_zero_size_dynptr(void)
+{
+	struct qp_trie_test *skel;
+	int err;
+
+	skel = qp_trie_test__open();
+	if (!ASSERT_OK_PTR(skel, "qp_trie_test__open()"))
+		return;
+
+	bpf_program__set_autoload(skel->progs.zero_size_dynptr, true);
+
+	err = qp_trie_test__load(skel);
+	if (!ASSERT_OK(err, "qp_trie_test__load()"))
+		goto out;
+
+	skel->bss->pid = getpid();
+	err = qp_trie_test__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto out;
+
+	usleep(1);
+
+	ASSERT_OK(skel->bss->zero_size_err, "handle zero sized dynptr");
+out:
+	qp_trie_test__destroy(skel);
+}
+
+static void test_qp_trie_d_path_key(void)
+{
+	const char *name = "/tmp/qp_trie_d_path_key";
+	struct qp_trie_test *skel;
+	char raw[FILE_PATH_SIZE];
+	unsigned int value;
+	int fd, err;
+
+	if (!ASSERT_LT(strlen(name), sizeof(raw), "lengthy data"))
+		return;
+
+	skel = qp_trie_test__open();
+	if (!ASSERT_OK_PTR(skel, "qp_trie_test__open()"))
+		return;
+
+	bpf_program__set_autoload(skel->progs.d_path_key, true);
+
+	err = qp_trie_test__load(skel);
+	if (!ASSERT_OK(err, "qp_trie_test__load()"))
+		goto out;
+
+	value = time(NULL);
+	/* Include the trailing zero byte */
+	if (setup_trie(skel->maps.trie, (void *)name, strlen(name) + 1, value))
+		goto out;
+
+	if (setup_htab(skel->maps.htab, (void *)name, strlen(name) + 1, value))
+		goto out;
+
+	skel->bss->pid = getpid();
+	err = qp_trie_test__attach(skel);
+	/* No support for bpf trampoline ? */
+	if (err == -ENOTSUPP) {
+		test__skip();
+		goto out;
+	}
+	if (!ASSERT_OK(err, "attach"))
+		goto out;
+
+	fd = open(name, O_RDONLY | O_CREAT, 0644);
+	if (!ASSERT_GT(fd, 0, "open tmp file"))
+		goto out;
+	close(fd);
+	unlink(name);
+
+	ASSERT_EQ(skel->bss->trie_path_value, value, "trie lookup");
+	ASSERT_EQ(skel->bss->htab_path_value, -1, "htab lookup");
+out:
+	qp_trie_test__destroy(skel);
+}
+
+void test_qp_trie_test(void)
+{
+	if (test__start_subtest("basic_ops"))
+		test_qp_trie_basic_ops();
+	if (test__start_subtest("zero_size_dynptr"))
+		test_qp_trie_zero_size_dynptr();
+	if (test__start_subtest("d_path_key"))
+		test_qp_trie_d_path_key();
+}
diff --git a/tools/testing/selftests/bpf/progs/qp_trie_test.c b/tools/testing/selftests/bpf/progs/qp_trie_test.c
new file mode 100644
index 000000000000..3aa6c4784564
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/qp_trie_test.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <stdbool.h>
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct path {
+} __attribute__((preserve_access_index));
+
+struct file {
+	struct path f_path;
+} __attribute__((preserve_access_index));
+
+#define FILE_PATH_SIZE 64
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, 4);
+	__uint(value_size, FILE_PATH_SIZE);
+} array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_QP_TRIE);
+	__uint(max_entries, 2);
+	__type(key, struct bpf_dynptr);
+	__type(value, unsigned int);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(map_extra, FILE_PATH_SIZE);
+} trie SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__uint(key_size, FILE_PATH_SIZE);
+	__uint(value_size, 4);
+} htab SEC(".maps");
+
+int pid = 0;
+
+unsigned int key_size;
+unsigned int lookup_str_value;
+unsigned int lookup_bytes_value;
+unsigned int delete_again_err;
+
+unsigned int zero_size_err;
+unsigned int null_data_err;
+
+unsigned int trie_path_value = 0;
+unsigned int htab_path_value = 0;
+
+SEC("?tp/syscalls/sys_enter_nanosleep")
+int BPF_PROG(basic_ops)
+{
+	struct bpf_dynptr str_ptr, bytes_ptr, zero_ptr;
+	unsigned int new_value, byte_size;
+	unsigned int *value;
+	char *raw;
+	int idx;
+
+	if (bpf_get_current_pid_tgid() >> 32 != pid)
+		return 0;
+
+	idx = 0;
+	raw = bpf_map_lookup_elem(&array, &idx);
+	if (!raw)
+		return 0;
+
+	byte_size = key_size;
+	if (!byte_size || byte_size >= FILE_PATH_SIZE)
+		return 0;
+
+	/* Append a zero byte to make it a string */
+	bpf_dynptr_from_mem(raw, byte_size + 1, 0, &str_ptr);
+	value = bpf_map_lookup_elem(&trie, &str_ptr);
+	if (value)
+		lookup_str_value = *value;
+	else
+		lookup_str_value = -1;
+
+	/* Lookup the trie with the raw bytes (without the trailing zero byte) */
+	bpf_dynptr_from_mem(raw, byte_size, 0, &bytes_ptr);
+	value = bpf_map_lookup_elem(&trie, &bytes_ptr);
+	if (value)
+		lookup_bytes_value = *value;
+	else
+		lookup_bytes_value = -1;
+
+	/* Update map and check it in userspace */
+	new_value = lookup_bytes_value + 1;
+	bpf_map_update_elem(&trie, &bytes_ptr, &new_value, BPF_EXIST);
+
+	/* Delete map and check it in userspace */
+	idx = 1;
+	raw = bpf_map_lookup_elem(&array, &idx);
+	if (!raw)
+		return 0;
+	bpf_dynptr_from_mem(raw, byte_size, 0, &zero_ptr);
+	bpf_map_delete_elem(&trie, &zero_ptr);
+	delete_again_err = bpf_map_delete_elem(&trie, &zero_ptr);
+
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_nanosleep")
+int BPF_PROG(zero_size_dynptr)
+{
+	struct bpf_dynptr ptr, bad_ptr;
+	unsigned int new_value;
+	void *value;
+	int idx, err;
+	char *raw;
+
+	if (bpf_get_current_pid_tgid() >> 32 != pid)
+		return 0;
+
+	idx = 0;
+	raw = bpf_map_lookup_elem(&array, &idx);
+	if (!raw)
+		return 0;
+
+	/* Use zero-sized bpf_dynptr */
+	bpf_dynptr_from_mem(raw, 0, 0, &ptr);
+
+	value = bpf_map_lookup_elem(&trie, &ptr);
+	if (value)
+		zero_size_err |= 1;
+	new_value = 0;
+	err = bpf_map_update_elem(&trie, &ptr, &new_value, BPF_ANY);
+	if (err != -EINVAL)
+		zero_size_err |= 2;
+	err = bpf_map_delete_elem(&trie, &ptr);
+	if (err != -EINVAL)
+		zero_size_err |= 4;
+
+	/* A bad dynptr is also zero-sized */
+	bpf_dynptr_from_mem(raw, 1, 1U << 31, &bad_ptr);
+
+	value = bpf_map_lookup_elem(&trie, &bad_ptr);
+	if (value)
+		zero_size_err |= 8;
+	new_value = 0;
+	err = bpf_map_update_elem(&trie, &bad_ptr, &new_value, BPF_ANY);
+	if (err != -EINVAL)
+		zero_size_err |= 16;
+	err = bpf_map_delete_elem(&trie, &bad_ptr);
+	if (err != -EINVAL)
+		zero_size_err |= 32;
+	return 0;
+}
+
+SEC("?fentry/filp_close")
+int BPF_PROG(d_path_key, struct file *filp)
+{
+	struct bpf_dynptr ptr;
+	unsigned int *value;
+	struct path *p;
+	int idx, err;
+	long len;
+	char *raw;
+
+	if (bpf_get_current_pid_tgid() >> 32 != pid)
+		return 0;
+
+	idx = 0;
+	raw = bpf_map_lookup_elem(&array, &idx);
+	if (!raw)
+		return 0;
+
+	p = &filp->f_path;
+	len = bpf_d_path(p, raw, FILE_PATH_SIZE);
+	if (len < 1 || len > FILE_PATH_SIZE)
+		return 0;
+
+	/* Include the trailing zero byte */
+	bpf_dynptr_from_mem(raw, len, 0, &ptr);
+	value = bpf_map_lookup_elem(&trie, &ptr);
+	if (value)
+		trie_path_value = *value;
+	else
+		trie_path_value = -1;
+
+	/* Due to the implementation of bpf_d_path(), there will be "garbage"
+	 * characters instead of zero bytes after raw[len-1], so the lookup
+	 * will fail.
+	 */
+	value = bpf_map_lookup_elem(&htab, raw);
+	if (value)
+		htab_path_value = *value;
+	else
+		htab_path_value = -1;
+
+	return 0;
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH bpf-next v2 12/13] selftests/bpf: Add benchmark for qp-trie map
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (10 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 11/13] selftests/bpf: Add prog tests for qp-trie map Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-24 13:36 ` [PATCH bpf-next v2 13/13] selftests/bpf: Add map tests for qp-trie by using bpf syscall Hou Tao
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add a benchmark for the qp-trie map to compare its lookup, update/delete
performance and memory usage with the hash table.

When the content of the keys is uniformly distributed and there are
large differences between key lengths, qp-trie will be dense and have a
low height, while the lookup overhead of htab increases due to
unnecessary memory comparisons, so the lookup performance of qp-trie
will be much better than that of the hash table, as shown below:

Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
htab update      (1  thread)    2.122 ± 0.021M/s (drops 0.000 ± 0.000M/s mem 8.169 MiB)
htab update      (2  thread)    4.248 ± 0.197M/s (drops 0.000 ± 0.000M/s mem 8.168 MiB)
htab update      (4  thread)    8.475 ± 0.348M/s (drops 0.000 ± 0.000M/s mem 8.180 MiB)
htab update      (8  thread)   16.725 ± 0.633M/s (drops 0.000 ± 0.000M/s mem 8.208 MiB)
htab update      (16 thread)   30.246 ± 0.611M/s (drops 0.000 ± 0.000M/s mem 8.190 MiB)

qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)
qp-trie update   (1  thread)    1.622 ± 0.016M/s (drops 0.000 ± 0.000M/s mem 4.918 MiB)
qp-trie update   (2  thread)    2.688 ± 0.021M/s (drops 0.000 ± 0.000M/s mem 4.874 MiB)
qp-trie update   (4  thread)    4.062 ± 0.128M/s (drops 0.000 ± 0.000M/s mem 4.218 MiB)
qp-trie update   (8  thread)    7.037 ± 0.247M/s (drops 0.000 ± 0.000M/s mem 4.900 MiB)
qp-trie update   (16 thread)   11.024 ± 0.295M/s (drops 0.000 ± 0.000M/s mem 4.830 MiB)

For the strings in /proc/kallsyms, single-thread lookup performance is
about 27% slower compared with the hash table. When the number of
threads increases, lookup performance is almost the same. But the update
and deletion performance of qp-trie is much worse compared with the hash
table, as shown below:

Strings in /proc/kallsyms (key size=83, max entries=170958)
htab lookup      (1  thread)    5.686 ± 0.008M/s (drops 0.345 ± 0.002M/s mem 30.840 MiB)
htab lookup      (2  thread)   10.147 ± 0.067M/s (drops 0.616 ± 0.005M/s mem 30.841 MiB)
htab lookup      (4  thread)   16.503 ± 0.025M/s (drops 1.002 ± 0.004M/s mem 30.841 MiB)
htab lookup      (8  thread)   33.429 ± 0.146M/s (drops 2.028 ± 0.020M/s mem 30.848 MiB)
htab lookup      (16 thread)   67.249 ± 0.577M/s (drops 4.085 ± 0.032M/s mem 30.841 MiB)
htab update      (1  thread)    3.135 ± 0.355M/s (drops 0.000 ± 0.000M/s mem 30.842 MiB)
htab update      (2  thread)    6.269 ± 0.686M/s (drops 0.000 ± 0.000M/s mem 30.841 MiB)
htab update      (4  thread)   11.607 ± 1.632M/s (drops 0.000 ± 0.000M/s mem 30.840 MiB)
htab update      (8  thread)   23.041 ± 0.806M/s (drops 0.000 ± 0.000M/s mem 30.842 MiB)
htab update      (16 thread)   31.393 ± 0.307M/s (drops 0.000 ± 0.000M/s mem 30.835 MiB)

qp-trie lookup   (1  thread)    4.122 ± 0.010M/s (drops 0.250 ± 0.002M/s mem 30.108 MiB)
qp-trie lookup   (2  thread)    9.119 ± 0.057M/s (drops 0.554 ± 0.004M/s mem 17.422 MiB)
qp-trie lookup   (4  thread)   16.605 ± 0.032M/s (drops 1.008 ± 0.006M/s mem 17.203 MiB)
qp-trie lookup   (8  thread)   33.461 ± 0.058M/s (drops 2.032 ± 0.004M/s mem 16.977 MiB)
qp-trie lookup   (16 thread)   67.466 ± 0.145M/s (drops 4.097 ± 0.019M/s mem 17.452 MiB)
qp-trie update   (1  thread)    1.191 ± 0.093M/s (drops 0.000 ± 0.000M/s mem 17.170 MiB)
qp-trie update   (2  thread)    2.057 ± 0.041M/s (drops 0.000 ± 0.000M/s mem 17.058 MiB)
qp-trie update   (4  thread)    2.975 ± 0.035M/s (drops 0.000 ± 0.000M/s mem 17.411 MiB)
qp-trie update   (8  thread)    3.596 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 17.110 MiB)
qp-trie update   (16 thread)    4.200 ± 0.048M/s (drops 0.000 ± 0.000M/s mem 17.228 MiB)

For strings in the BTF string section, the results are similar:

Sorted strings in BTF string sections (key size=71, max entries=115980)
htab lookup      (1  thread)    6.990 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 22.227 MiB)
htab lookup      (2  thread)   12.729 ± 0.059M/s (drops 0.000 ± 0.000M/s mem 22.224 MiB)
htab lookup      (4  thread)   21.202 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 22.218 MiB)
htab lookup      (8  thread)   43.418 ± 0.169M/s (drops 0.000 ± 0.000M/s mem 22.225 MiB)
htab lookup      (16 thread)   88.745 ± 0.410M/s (drops 0.000 ± 0.000M/s mem 22.224 MiB)
htab update      (1  thread)    3.238 ± 0.271M/s (drops 0.000 ± 0.000M/s mem 22.228 MiB)
htab update      (2  thread)    6.483 ± 0.821M/s (drops 0.000 ± 0.000M/s mem 22.227 MiB)
htab update      (4  thread)   12.702 ± 0.924M/s (drops 0.000 ± 0.000M/s mem 22.226 MiB)
htab update      (8  thread)   22.167 ± 1.269M/s (drops 0.000 ± 0.000M/s mem 22.229 MiB)
htab update      (16 thread)   31.225 ± 0.475M/s (drops 0.000 ± 0.000M/s mem 22.239 MiB)

qp-trie lookup   (1  thread)    6.729 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 11.335 MiB)
qp-trie lookup   (2  thread)   13.417 ± 0.010M/s (drops 0.000 ± 0.000M/s mem 11.287 MiB)
qp-trie lookup   (4  thread)   26.399 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 11.111 MiB)
qp-trie lookup   (8  thread)   52.910 ± 0.049M/s (drops 0.000 ± 0.000M/s mem 11.131 MiB)
qp-trie lookup   (16 thread)  105.444 ± 0.064M/s (drops 0.000 ± 0.000M/s mem 11.060 MiB)
qp-trie update   (1  thread)    1.508 ± 0.102M/s (drops 0.000 ± 0.000M/s mem 10.979 MiB)
qp-trie update   (2  thread)    2.877 ± 0.034M/s (drops 0.000 ± 0.000M/s mem 10.843 MiB)
qp-trie update   (4  thread)    5.111 ± 0.083M/s (drops 0.000 ± 0.000M/s mem 10.938 MiB)
qp-trie update   (8  thread)    9.229 ± 0.046M/s (drops 0.000 ± 0.000M/s mem 11.042 MiB)
qp-trie update   (16 thread)   11.625 ± 0.147M/s (drops 0.000 ± 0.000M/s mem 10.930 MiB)

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |  10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      | 511 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |  55 ++
 .../selftests/bpf/progs/qp_trie_bench.c       | 236 ++++++++
 5 files changed, 816 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index d881a23adc84..4cb301bd4204 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -586,11 +586,13 @@ $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
 $(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
 $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
+$(OUTPUT)/bench_qp_trie.o: $(OUTPUT)/qp_trie_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(TESTING_HELPERS) \
 		 $(TRACE_HELPERS) \
+		 $(CGROUP_HELPERS) \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
@@ -600,7 +602,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_strncmp.o \
 		 $(OUTPUT)/bench_bpf_hashmap_full_update.o \
 		 $(OUTPUT)/bench_local_storage.o \
-		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o
+		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
+		 $(OUTPUT)/bench_qp_trie.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index c1f20a147462..618f45fbe6e2 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -275,6 +275,7 @@ extern struct argp bench_bpf_loop_argp;
 extern struct argp bench_local_storage_argp;
 extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
+extern struct argp bench_qp_trie_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -284,6 +285,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{ &bench_local_storage_rcu_tasks_trace_argp, 0,
 		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
+	{ &bench_qp_trie_argp, 0, "qp-trie benchmark", 0 },
 	{},
 };
 
@@ -490,6 +492,10 @@ extern const struct bench bench_local_storage_cache_seq_get;
 extern const struct bench bench_local_storage_cache_interleaved_get;
 extern const struct bench bench_local_storage_cache_hashmap_control;
 extern const struct bench bench_local_storage_tasks_trace;
+extern const struct bench bench_htab_lookup;
+extern const struct bench bench_qp_trie_lookup;
+extern const struct bench bench_htab_update;
+extern const struct bench bench_qp_trie_update;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -529,6 +535,10 @@ static const struct bench *benchs[] = {
 	&bench_local_storage_cache_interleaved_get,
 	&bench_local_storage_cache_hashmap_control,
 	&bench_local_storage_tasks_trace,
+	&bench_htab_lookup,
+	&bench_qp_trie_lookup,
+	&bench_htab_update,
+	&bench_qp_trie_update,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_qp_trie.c b/tools/testing/selftests/bpf/benchs/bench_qp_trie.c
new file mode 100644
index 000000000000..9585e9c83fe8
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_qp_trie.c
@@ -0,0 +1,511 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <argp.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include "bench.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#include "qp_trie_bench.skel.h"
+
+enum {
+	FOR_HTAB = 0,
+	FOR_TRIE,
+};
+
+static struct qp_trie_ctx {
+	struct qp_trie_bench *skel;
+	int cgrp_dfd;
+	u64 map_slab_mem;
+} ctx;
+
+static struct {
+	const char *file;
+	__u32 entries;
+} args;
+
+struct qp_trie_key {
+	__u32 len;
+	unsigned char data[0];
+};
+
+struct run_stat {
+	__u64 stats[2];
+};
+
+enum {
+	ARG_DATA_FILE = 8001,
+	ARG_DATA_ENTRIES = 8002,
+};
+
+static const struct argp_option opts[] = {
+	{ "file", ARG_DATA_FILE, "DATA-FILE", 0, "Set data file" },
+	{ "entries", ARG_DATA_ENTRIES, "DATA-ENTRIES", 0, "Set data entries" },
+	{},
+};
+
+static error_t qp_trie_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_DATA_FILE:
+		args.file = strdup(arg);
+		break;
+	case ARG_DATA_ENTRIES:
+		args.entries = strtoul(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_qp_trie_argp = {
+	.options = opts,
+	.parser = qp_trie_parse_arg,
+};
+
+static int parse_data_set(const char *name, struct qp_trie_key ***set, unsigned int *nr,
+			  unsigned int *max_len)
+{
+#define INT_MAX_DATA_SIZE 1024
+	unsigned int i, nr_items, item_max_len;
+	char line[INT_MAX_DATA_SIZE + 1];
+	struct qp_trie_key **items;
+	struct qp_trie_key *cur;
+	int err = 0;
+	FILE *file;
+	char *got;
+
+	file = fopen(name, "rb");
+	if (!file) {
+		fprintf(stderr, "open %s err %s\n", name, strerror(errno));
+		return -1;
+	}
+
+	got = fgets(line, sizeof(line), file);
+	if (!got) {
+		fprintf(stderr, "empty file ?\n");
+		err = -1;
+		goto out;
+	}
+	if (sscanf(line, "%u", &nr_items) != 1) {
+		fprintf(stderr, "the first line must be the number of items\n");
+		err = -1;
+		goto out;
+	}
+
+	fprintf(stdout, "item %u\n", nr_items);
+
+	items = (struct qp_trie_key **)calloc(nr_items, sizeof(*items) + INT_MAX_DATA_SIZE);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	i = 0;
+	item_max_len = 0;
+	cur = (void *)items + sizeof(*items) * nr_items;
+	while (true) {
+		unsigned int len;
+
+		got = fgets(line, sizeof(line), file);
+		if (!got) {
+			if (!feof(file)) {
+				fprintf(stderr, "read file %s error\n", name);
+				err = -1;
+			}
+			break;
+		}
+
+		len = strlen(got);
+		if (len && got[len - 1] == '\n') {
+			got[len - 1] = 0;
+			len -= 1;
+		}
+		if (!len) {
+			fprintf(stdout, "#%u empty line\n", i + 2);
+			continue;
+		}
+
+		if (i >= nr_items) {
+			fprintf(stderr, "too many lines in %s\n", name);
+			break;
+		}
+
+		if (len > item_max_len)
+			item_max_len = len;
+		cur->len = len;
+		memcpy(cur->data, got, len);
+		items[i++] = cur;
+		cur = (void *)cur + INT_MAX_DATA_SIZE;
+	}
+
+	if (!err) {
+		if (i != nr_items)
+			fprintf(stdout, "fewer lines in %s (exp %u got %u)\n", name, nr_items, i);
+		*nr = i;
+		*set = items;
+		*max_len = item_max_len;
+	} else {
+		free(items);
+	}
+
+out:
+	fclose(file);
+	return err;
+}
+
+static int gen_data_set(struct qp_trie_key ***set, unsigned int *nr, unsigned int *max_len)
+{
+#define RND_MAX_DATA_SIZE 255
+	struct qp_trie_key **items;
+	size_t ptr_size, data_size;
+	struct qp_trie_key *cur;
+	unsigned int i, nr_items;
+	ssize_t got;
+	int err = 0;
+
+	ptr_size = *nr * sizeof(*items);
+	data_size = *nr * (sizeof(*cur) + RND_MAX_DATA_SIZE);
+	items = (struct qp_trie_key **)malloc(ptr_size + data_size);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	cur = (void *)items + ptr_size;
+	got = syscall(__NR_getrandom, cur, data_size, 0);
+	if (got != data_size) {
+		fprintf(stderr, "getrandom error %s\n", strerror(errno));
+		err = -1;
+		goto out;
+	}
+
+	nr_items = 0;
+	for (i = 0; i < *nr; i++) {
+		cur->len &= 0xff;
+		if (cur->len) {
+			items[nr_items++] = cur;
+			memset(cur->data + cur->len, 0, RND_MAX_DATA_SIZE - cur->len);
+		}
+		cur = (void *)cur + (sizeof(*cur) + RND_MAX_DATA_SIZE);
+	}
+	if (!nr_items) {
+		fprintf(stderr, "no valid key in random data\n");
+		err = -1;
+		goto out;
+	}
+	fprintf(stdout, "generate %u random keys\n", nr_items);
+
+	*nr = nr_items;
+	*set = items;
+	*max_len = RND_MAX_DATA_SIZE;
+out:
+	if (err && items)
+		free(items);
+	return err;
+}
+
+static void qp_trie_validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "qp_trie_map benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (!args.file && !args.entries) {
+		fprintf(stderr, "must specify entries when using a randomly generated data set\n");
+		exit(1);
+	}
+
+	if (args.file && access(args.file, R_OK)) {
+		fprintf(stderr, "data file is not accessible\n");
+		exit(1);
+	}
+}
+
+static void qp_trie_init_map_opts(struct qp_trie_bench *skel, unsigned int data_size,
+				  unsigned int nr)
+{
+	bpf_map__set_value_size(skel->maps.htab_array, data_size);
+	bpf_map__set_max_entries(skel->maps.htab_array, nr);
+
+	bpf_map__set_key_size(skel->maps.htab, data_size);
+	bpf_map__set_max_entries(skel->maps.htab, nr);
+
+	bpf_map__set_value_size(skel->maps.trie_array, sizeof(struct qp_trie_key) + data_size);
+	bpf_map__set_max_entries(skel->maps.trie_array, nr);
+
+	bpf_map__set_map_extra(skel->maps.qp_trie, data_size);
+	bpf_map__set_max_entries(skel->maps.qp_trie, nr);
+}
+
+static void qp_trie_setup_key_map(struct bpf_map *map, unsigned int map_type,
+				  struct qp_trie_key **set, unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		void *value;
+		int err;
+
+		value = (map_type != FOR_HTAB) ? (void *)set[i] : (void *)set[i]->data;
+		err = bpf_map_update_elem(fd, &i, value, 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static u64 qp_trie_get_slab_mem(int dfd)
+{
+	const char *magic = "slab ";
+	const char *name = "memory.stat";
+	int fd;
+	ssize_t nr;
+	char buf[4096];
+	char *from;
+
+	fd = openat(dfd, name, 0);
+	if (fd < 0) {
+		fprintf(stdout, "no %s (cgroup v1 ?)\n", name);
+		return 0;
+	}
+
+	nr = read(fd, buf, sizeof(buf));
+	if (nr <= 0) {
+		fprintf(stderr, "empty %s ?\n", name);
+		exit(1);
+	}
+	buf[nr - 1] = 0;
+
+	close(fd);
+
+	from = strstr(buf, magic);
+	if (!from) {
+		fprintf(stderr, "no slab in %s\n", name);
+		exit(1);
+	}
+
+	return strtoull(from + strlen(magic), NULL, 10);
+}
+
+static void qp_trie_setup_lookup_map(struct bpf_map *map, unsigned int map_type,
+				     struct qp_trie_key **set, unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		int err;
+
+		if (map_type == FOR_HTAB) {
+			void *key;
+
+			key = set[i]->data;
+			err = bpf_map_update_elem(fd, key, &i, 0);
+		} else {
+			struct bpf_dynptr_user dynptr;
+
+			bpf_dynptr_user_init(set[i]->data, set[i]->len, &dynptr);
+			err = bpf_map_update_elem(fd, &dynptr, &i, 0);
+		}
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static void qp_trie_setup(unsigned int map_type)
+{
+	struct qp_trie_key **set = NULL;
+	struct qp_trie_bench *skel;
+	unsigned int nr = 0, max_len = 0;
+	struct bpf_map *map;
+	u64 before, after;
+	int dfd;
+	int err;
+
+	if (!args.file) {
+		nr = args.entries;
+		err = gen_data_set(&set, &nr, &max_len);
+	} else {
+		err = parse_data_set(args.file, &set, &nr, &max_len);
+	}
+	if (err < 0)
+		exit(1);
+
+	if (args.entries && args.entries < nr)
+		nr = args.entries;
+
+	dfd = cgroup_setup_and_join("/qp_trie");
+	if (dfd < 0) {
+		fprintf(stderr, "failed to setup cgroup env\n");
+		exit(1);
+	}
+
+	setup_libbpf();
+
+	before = qp_trie_get_slab_mem(dfd);
+
+	skel = qp_trie_bench__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	qp_trie_init_map_opts(skel, max_len, nr);
+
+	skel->rodata->qp_trie_key_size = max_len;
+	skel->bss->update_nr = nr;
+	skel->bss->update_chunk = nr / env.producer_cnt;
+
+	err = qp_trie_bench__load(skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	map = (map_type == FOR_HTAB) ? skel->maps.htab_array : skel->maps.trie_array;
+	qp_trie_setup_key_map(map, map_type, set, nr);
+
+	map = (map_type == FOR_HTAB) ? skel->maps.htab : skel->maps.qp_trie;
+	qp_trie_setup_lookup_map(map, map_type, set, nr);
+
+	after = qp_trie_get_slab_mem(dfd);
+
+	ctx.skel = skel;
+	ctx.cgrp_dfd = dfd;
+	ctx.map_slab_mem = after - before;
+}
+
+static void qp_trie_attach_prog(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+
+	link = bpf_program__attach(prog);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void htab_lookup_setup(void)
+{
+	qp_trie_setup(FOR_HTAB);
+	qp_trie_attach_prog(ctx.skel->progs.htab_lookup);
+}
+
+static void qp_trie_lookup_setup(void)
+{
+	qp_trie_setup(FOR_TRIE);
+	qp_trie_attach_prog(ctx.skel->progs.qp_trie_lookup);
+}
+
+static void htab_update_setup(void)
+{
+	qp_trie_setup(FOR_HTAB);
+	qp_trie_attach_prog(ctx.skel->progs.htab_update);
+}
+
+static void qp_trie_update_setup(void)
+{
+	qp_trie_setup(FOR_TRIE);
+	qp_trie_attach_prog(ctx.skel->progs.qp_trie_update);
+}
+
+static void *qp_trie_producer(void *ctx)
+{
+	while (true)
+		(void)syscall(__NR_getpgid);
+	return NULL;
+}
+
+static void *qp_trie_consumer(void *ctx)
+{
+	return NULL;
+}
+
+static void qp_trie_measure(struct bench_res *res)
+{
+	static __u64 last_hits, last_drops;
+	__u64 total_hits = 0, total_drops = 0;
+	unsigned int i, nr_cpus;
+
+	nr_cpus = bpf_num_possible_cpus();
+	for (i = 0; i < nr_cpus; i++) {
+		struct run_stat *s = (void *)&ctx.skel->bss->percpu_stats[i & 255];
+
+		total_hits += s->stats[0];
+		total_drops += s->stats[1];
+	}
+
+	res->hits = total_hits - last_hits;
+	res->drops = total_drops - last_drops;
+
+	last_hits = total_hits;
+	last_drops = total_drops;
+}
+
+static void qp_trie_report_final(struct bench_res res[], int res_cnt)
+{
+	close(ctx.cgrp_dfd);
+	cleanup_cgroup_environment();
+
+	fprintf(stdout, "Slab: %.3f MiB\n", (float)ctx.map_slab_mem / 1024 / 1024);
+	hits_drops_report_final(res, res_cnt);
+}
+
+const struct bench bench_htab_lookup = {
+	.name = "htab-lookup",
+	.validate = qp_trie_validate,
+	.setup = htab_lookup_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_qp_trie_lookup = {
+	.name = "qp-trie-lookup",
+	.validate = qp_trie_validate,
+	.setup = qp_trie_lookup_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_htab_update = {
+	.name = "htab-update",
+	.validate = qp_trie_validate,
+	.setup = htab_update_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_qp_trie_update = {
+	.name = "qp-trie-update",
+	.validate = qp_trie_validate,
+	.setup = qp_trie_update_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh b/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
new file mode 100755
index 000000000000..0cbcb5bc9292
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2022. Huawei Technologies Co., Ltd
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+mem()
+{
+	echo "$*" | sed -E "s/.*Slab: ([0-9]+\.[0-9]+ MiB).*/\1/"
+}
+
+run_qp_trie_bench()
+{
+	local title=$1
+	local summary
+
+	shift 1
+	summary=$($RUN_BENCH "$@" | grep "Summary\|Slab:")
+	printf "%s %20s (drops %-16s mem %s)\n" "$title" "$(hits $summary)" \
+		"$(drops $summary)" "$(mem $summary)"
+}
+
+run_qp_trie_benchs()
+{
+	local p
+	local m
+	local b
+	local title
+
+	for m in htab qp-trie
+	do
+		for b in lookup update
+		do
+			for p in 1 2 4 8 16
+			do
+				title=$(printf "%-16s (%-2d thread)" "$m $b" $p)
+				run_qp_trie_bench "$title" ${m}-${b} -p $p "$@"
+			done
+		done
+	done
+	echo
+}
+
+echo "Randomly-generated binary data (16K)"
+run_qp_trie_benchs --entries 16384
+
+echo "Strings in /proc/kallsyms"
+TMP_FILE=/tmp/kallsyms.txt
+SRC_FILE=/proc/kallsyms
+trap 'rm -f $TMP_FILE' EXIT
+wc -l $SRC_FILE | awk '{ print $1}' > $TMP_FILE
+awk '{ print $3 }' $SRC_FILE >> $TMP_FILE
+run_qp_trie_benchs --file $TMP_FILE
diff --git a/tools/testing/selftests/bpf/progs/qp_trie_bench.c b/tools/testing/selftests/bpf/progs/qp_trie_bench.c
new file mode 100644
index 000000000000..303cad7e01d6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/qp_trie_bench.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <linux/errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+struct bpf_map;
+
+struct qp_trie_key {
+	__u32 len;
+	unsigned char data[0];
+};
+
+/* value_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} htab_array SEC(".maps");
+
+/* value_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} trie_array SEC(".maps");
+
+/* key_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(value_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
+
+/* map_extra will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_QP_TRIE);
+	__type(key, struct bpf_dynptr);
+	__type(value, unsigned int);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} qp_trie SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__u64 stats[2];
+} __attribute__((__aligned__(128))) percpu_stats[256];
+
+struct update_ctx {
+	unsigned int max;
+	unsigned int from;
+};
+
+volatile const unsigned int qp_trie_key_size;
+
+unsigned int update_nr;
+unsigned int update_chunk;
+
+static __always_inline void update_stats(int idx)
+{
+	__u32 cpu = bpf_get_smp_processor_id();
+
+	percpu_stats[cpu & 255].stats[idx]++;
+}
+
+static int lookup_htab(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	__u32 *index;
+
+	index = bpf_map_lookup_elem(&htab, value);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int update_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&htab_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_update_elem(&htab, value, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&htab_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_delete_elem(&htab, value);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+static int lookup_qp_trie(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	struct qp_trie_key *qp_trie_key = value;
+	struct bpf_dynptr dynptr;
+	__u32 *index;
+
+	if (qp_trie_key->len > qp_trie_key_size)
+		return 0;
+
+	bpf_dynptr_from_mem(qp_trie_key->data, qp_trie_key->len, 0, &dynptr);
+	index = bpf_map_lookup_elem(&qp_trie, &dynptr);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int update_qp_trie_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	struct qp_trie_key *value;
+	struct bpf_dynptr dynptr;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&trie_array, &update->from);
+	if (!value || value->len > qp_trie_key_size)
+		return 1;
+
+	bpf_dynptr_from_mem(value->data, value->len, 0, &dynptr);
+	err = bpf_map_update_elem(&qp_trie, &dynptr, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_qp_trie_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	struct qp_trie_key *value;
+	struct bpf_dynptr dynptr;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&trie_array, &update->from);
+	if (!value || value->len > qp_trie_key_size)
+		return 1;
+
+	bpf_dynptr_from_mem(value->data, value->len, 0, &dynptr);
+	err = bpf_map_delete_elem(&qp_trie, &dynptr);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int htab_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&htab_array, lookup_htab, NULL, 0);
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int qp_trie_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&trie_array, lookup_qp_trie, NULL, 0);
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int htab_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate on a subset of the keys, selected by cpu id */
+	update.from = index;
+	bpf_loop(update_chunk, update_htab_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_htab_loop, &update, 0);
+
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int qp_trie_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate on a subset of the keys, selected by cpu id */
+	update.from = index;
+	bpf_loop(update_chunk, update_qp_trie_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_qp_trie_loop, &update, 0);
+
+	return 0;
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH bpf-next v2 13/13] selftests/bpf: Add map tests for qp-trie by using bpf syscall
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (11 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 12/13] selftests/bpf: Add benchmark " Hou Tao
@ 2022-09-24 13:36 ` Hou Tao
  2022-09-26  1:25 ` [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Alexei Starovoitov
  2022-10-19 17:01 ` Tony Finch
  14 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-24 13:36 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

For creation operations, add test cases to check that both invalid and
valid configurations are handled correctly. For lookup/delete/update
operations, add test cases to ensure that an invalid key is rejected and
that both the return value and the output value are as expected. For
iteration operations, add test cases to ensure that the keys returned
during iteration are always ordered.

Also add a stress test for concurrent operations on qp-trie.
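
For reference, below is a minimal user-space sketch of the dynptr-based
key interface these tests exercise. It assumes 'fd' refers to a
BPF_MAP_TYPE_QP_TRIE map created as in qp_trie_create() in the test file,
and it relies on the bpf_dynptr_user helpers introduced earlier in this
series:

  #include <errno.h>
  #include <string.h>
  #include <bpf/bpf.h>

  /* Insert one variable-length key and read it back. Both operations
   * pass the key as a struct bpf_dynptr_user which records the address
   * and the length of the key bytes.
   */
  static int qp_trie_add_and_lookup(int fd, const char *key)
  {
  	struct bpf_dynptr_user dynptr;
  	unsigned int value = 1, got = 0;
  	int err;

  	bpf_dynptr_user_init((void *)key, strlen(key), &dynptr);

  	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
  	if (err)
  		return err;

  	err = bpf_map_lookup_elem(fd, &dynptr, &got);
  	if (err)
  		return err;

  	return got == value ? 0 : -EINVAL;
  }

Iteration follows the same pattern: bpf_map_get_next_key() fills a
caller-provided bpf_dynptr_user with the next key in lexicographical
order, as test_qp_trie_rdonly_iterate() below demonstrates.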

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/map_tests/qp_trie_map.c     | 1209 +++++++++++++++++
 1 file changed, 1209 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/map_tests/qp_trie_map.c

diff --git a/tools/testing/selftests/bpf/map_tests/qp_trie_map.c b/tools/testing/selftests/bpf/map_tests/qp_trie_map.c
new file mode 100644
index 000000000000..1a353e66e3cc
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/qp_trie_map.c
@@ -0,0 +1,1209 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <unistd.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <endian.h>
+#include <limits.h>
+#include <time.h>
+#include <pthread.h>
+#include <linux/btf.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <test_btf.h>
+#include <test_maps.h>
+
+#include "bpf_util.h"
+
+#define QP_TRIE_KEY_SIZE sizeof(struct bpf_dynptr)
+#define QP_TRIE_DFT_MAX_KEY_LEN 4
+#define QP_TRIE_DFT_VAL_SIZE 4
+#define QP_TRIE_DFT_MAP_FLAGS BPF_F_NO_PREALLOC
+
+#define QP_TRIE_DFT_BTF_KEY_ID 1
+#define QP_TRIE_DFT_BTF_VAL_ID 2
+
+struct qp_trie_create_case {
+	const char *name;
+	int error;
+	unsigned int map_flags;
+	unsigned int max_key_len;
+	unsigned int value_size;
+	unsigned int max_entries;
+	unsigned int btf_key_type_id;
+	unsigned int btf_value_type_id;
+};
+
+struct qp_trie_bytes_key {
+	unsigned int len;
+	unsigned char data[4];
+};
+
+struct qp_trie_int_key {
+	unsigned int len;
+	unsigned int data;
+};
+
+enum {
+	UPDATE_OP = 0,
+	DELETE_OP,
+	LOOKUP_OP,
+	ITERATE_OP,
+	MAX_OP,
+};
+
+struct stress_conf {
+	unsigned int threads[MAX_OP];
+	unsigned int max_key_len;
+	unsigned int loop;
+	unsigned int nr;
+};
+
+struct qp_trie_rw_ctx {
+	unsigned int nr;
+	unsigned int max_key_len;
+	int fd;
+	struct bpf_dynptr_user *set;
+	unsigned int loop;
+	unsigned int nr_delete;
+};
+
+static int qp_trie_load_btf(void)
+{
+	char btf_str_sec[] = "\0bpf_dynptr\0qp_test_key";
+	__u32 btf_raw_types[] = {
+		/* struct bpf_dynptr */				/* [1] */
+		BTF_TYPE_ENC(1, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 0), 16),
+		/* unsigned int */				/* [2] */
+		BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),
+		/* struct qp_test_key */			/* [3] */
+		BTF_TYPE_ENC(12, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 0), 16),
+	};
+	struct btf_header btf_hdr = {
+		.magic = BTF_MAGIC,
+		.version = BTF_VERSION,
+		.hdr_len = sizeof(struct btf_header),
+		.type_len = sizeof(btf_raw_types),
+		.str_off = sizeof(btf_raw_types),
+		.str_len = sizeof(btf_str_sec),
+	};
+	__u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
+		     sizeof(btf_str_sec)];
+
+	memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
+	memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
+	memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
+	       btf_str_sec, sizeof(btf_str_sec));
+
+	return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
+}
+
+struct qp_trie_create_case create_cases[] = {
+	{
+		.name = "tiny qp-trie",
+		.error = 0,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "empty qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 0,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "preallocated qp-trie",
+		.error = -EINVAL,
+		.map_flags = 0,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "mmapable qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS | BPF_F_MMAPABLE,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "no btf qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = 0,
+		.btf_value_type_id = 0,
+	},
+	{
+		.name = "qp_test_key qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = 3,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "zero max key len qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = 0,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "big k-v size qp-trie",
+		.error = -E2BIG,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = 1U << 30,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+};
+
+static void test_qp_trie_create(void)
+{
+	unsigned int i;
+	int btf_fd;
+
+	btf_fd = qp_trie_load_btf();
+	CHECK(btf_fd < 0, "load btf", "error %d\n", btf_fd);
+
+	for (i = 0; i < ARRAY_SIZE(create_cases); i++) {
+		LIBBPF_OPTS(bpf_map_create_opts, opts);
+		int fd;
+
+		opts.map_flags = create_cases[i].map_flags;
+		opts.btf_fd = btf_fd;
+		opts.btf_key_type_id = create_cases[i].btf_key_type_id;
+		opts.btf_value_type_id = create_cases[i].btf_value_type_id;
+		opts.map_extra = create_cases[i].max_key_len;
+		fd = bpf_map_create(BPF_MAP_TYPE_QP_TRIE, "qp_trie", QP_TRIE_KEY_SIZE,
+				    create_cases[i].value_size, create_cases[i].max_entries, &opts);
+		if (!create_cases[i].error) {
+			CHECK(fd < 0, create_cases[i].name, "error %d\n", fd);
+			close(fd);
+		} else {
+			CHECK(fd != create_cases[i].error, create_cases[i].name,
+			      "expect error %d got %d\n", create_cases[i].error, fd);
+		}
+	}
+
+	close(btf_fd);
+}
+
+static int qp_trie_create(unsigned int max_key_len, unsigned int value_size,
+			  unsigned int max_entries)
+{
+	LIBBPF_OPTS(bpf_map_create_opts, opts);
+	int btf_fd, map_fd;
+
+	btf_fd = qp_trie_load_btf();
+	CHECK(btf_fd < 0, "load btf", "error %d\n", btf_fd);
+
+	opts.map_flags = QP_TRIE_DFT_MAP_FLAGS;
+	opts.btf_fd = btf_fd;
+	opts.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID;
+	opts.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID;
+	opts.map_extra = max_key_len;
+	map_fd = bpf_map_create(BPF_MAP_TYPE_QP_TRIE, "qp_trie", QP_TRIE_KEY_SIZE, value_size,
+				max_entries, &opts);
+	CHECK(map_fd < 0, "bpf_map_create", "error %d\n", map_fd);
+
+	close(btf_fd);
+
+	return map_fd;
+}
+
+static void test_qp_trie_unsupported_op(void)
+{
+	unsigned int key, value, cnt, out_batch;
+	struct bpf_dynptr_user dynptr;
+	int fd, err;
+
+	fd = qp_trie_create(sizeof(key), sizeof(value), 2);
+
+	key = 0;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_lookup_and_delete_elem(fd, &dynptr, &value);
+	CHECK(err != -ENOTSUPP, "unsupported lookup_and_delete", "got %d\n", err);
+
+	cnt = 1;
+	key = 1;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_lookup_batch(fd, NULL, &out_batch, &dynptr, &value, &cnt, NULL);
+	CHECK(err != -ENOTSUPP, "unsupported lookup batch", "got %d\n", err);
+
+	cnt = 1;
+	key = 2;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_lookup_and_delete_batch(fd, NULL, &out_batch, &dynptr, &value, &cnt, NULL);
+	CHECK(err != -ENOTSUPP, "unsupported lookup_and_delete batch", "got %d\n", err);
+
+	cnt = 1;
+	key = 3;
+	value = 3;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_update_batch(fd, &dynptr, &value, &cnt, NULL);
+	CHECK(err != -ENOTSUPP, "unsupported update_batch", "got %d\n", err);
+
+	cnt = 1;
+	key = 4;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_delete_batch(fd, &dynptr, &cnt, NULL);
+	CHECK(err != -ENOTSUPP, "unsupported delete_batch", "got %d\n", err);
+
+	close(fd);
+}
+
+static void test_qp_trie_bad_update(void)
+{
+	struct bpf_dynptr_user dynptr;
+	unsigned int key, value;
+	u64 big_key;
+	int fd, err;
+
+	fd = qp_trie_create(sizeof(key), sizeof(value), 1);
+
+	/* Invalid flags (Error) */
+	key = 0;
+	value = 0;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST | BPF_EXIST);
+	CHECK(err != -EINVAL, "invalid update flag", "error %d\n", err);
+
+	/* Invalid key len (Error) */
+	big_key = 1;
+	value = 1;
+	bpf_dynptr_user_init(&big_key, sizeof(big_key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+	CHECK(err != -EINVAL, "invalid data len", "error %d\n", err);
+
+	/* Invalid key (Error) */
+	value = 0;
+	bpf_dynptr_user_init(NULL, 1, &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+	CHECK(err != -EFAULT, "invalid data addr", "error %d\n", err);
+
+	/* Iterate an empty qp-trie (Error) */
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	/* Overwrite an empty qp-trie (Error) */
+	key = 2;
+	value = 2;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_EXIST);
+	CHECK(err != -ENOENT, "overwrite empty qp-trie", "error %d\n", err);
+
+	/* Iterate an empty qp-trie (Error) */
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	close(fd);
+}
+
+static void test_qp_trie_bad_lookup_delete(void)
+{
+	struct bpf_dynptr_user dynptr;
+	unsigned int key, value;
+	int fd, err;
+
+	fd = qp_trie_create(sizeof(key), sizeof(value), 2);
+
+	/* Lookup/Delete non-existent key (Error) */
+	key = 0;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_delete_elem(fd, &dynptr);
+	CHECK(err != -ENOENT, "del non-existent key", "error %d\n", err);
+	err = bpf_map_lookup_elem(fd, &dynptr, &value);
+	CHECK(err != -ENOENT, "lookup non-existent key", "error %d\n", err);
+
+	key = 0;
+	value = 2;
+	bpf_dynptr_user_init(&key, 2, &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+	CHECK(err, "add elem", "error %d\n", err);
+
+	key = 0;
+	value = 4;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+	CHECK(err, "add elem", "error %d\n", err);
+
+	/*
+	 * Lookup/Delete non-existent key, although it is the prefix of
+	 * existent keys (Error)
+	 */
+	key = 0;
+	bpf_dynptr_user_init(&key, 1, &dynptr);
+	err = bpf_map_delete_elem(fd, &dynptr);
+	CHECK(err != -ENOENT, "del non-existent key", "error %d\n", err);
+	err = bpf_map_lookup_elem(fd, &dynptr, &value);
+	CHECK(err != -ENOENT, "lookup non-existent key", "error %d\n", err);
+
+	/* Lookup/Delete non-existent key, although its prefix exists (Error) */
+	key = 0;
+	bpf_dynptr_user_init(&key, 3, &dynptr);
+	err = bpf_map_delete_elem(fd, &dynptr);
+	CHECK(err != -ENOENT, "del non-existent key", "error %d\n", err);
+	err = bpf_map_lookup_elem(fd, &dynptr, &value);
+	CHECK(err != -ENOENT, "lookup non-existent key", "error %d\n", err);
+
+	close(fd);
+}
+
+static int cmp_str(const void *a, const void *b)
+{
+	const char *str_a = *(const char **)a, *str_b = *(const char **)b;
+
+	return strcmp(str_a, str_b);
+}
+
+static void test_qp_trie_one_subtree_update(void)
+{
+	const char *keys[] = {
+		"ab", "abc", "abo", "abS", "abcd",
+	};
+	const char *sorted_keys[ARRAY_SIZE(keys)];
+	unsigned int value, got, i, j;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	char data[4];
+	int fd, err;
+
+	fd = qp_trie_create(4, sizeof(value), ARRAY_SIZE(keys));
+
+	for (i = 0; i < ARRAY_SIZE(keys); i++) {
+		unsigned int flags;
+
+		/* Add i-th element */
+		flags = i % 2 ? BPF_NOEXIST : 0;
+		bpf_dynptr_user_init((void *)keys[i], strlen(keys[i]), &dynptr);
+		value = i + 100;
+		err = bpf_map_update_elem(fd, &dynptr, &value, flags);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err, "lookup elem", "#%u error %d\n", i, err);
+		CHECK(got != value, "lookup elem", "#%u expect %u got %u\n", i, value, got);
+
+		/* Re-add i-th element (Error) */
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err != -EEXIST, "re-add elem", "#%u error %d\n", i, err);
+
+		/* Overwrite i-th element */
+		flags = i % 2 ? 0 : BPF_EXIST;
+		value = i;
+		err = bpf_map_update_elem(fd, &dynptr, &value, flags);
+		CHECK(err, "update elem", "error %d\n", err);
+
+		/* Lookup #[0~i] elements */
+		for (j = 0; j <= i; j++) {
+			bpf_dynptr_user_init((void *)keys[j], strlen(keys[j]), &dynptr);
+			err = bpf_map_lookup_elem(fd, &dynptr, &got);
+			CHECK(err, "lookup elem", "#%u/%u error %d\n", i, j, err);
+			CHECK(got != j, "lookup elem", "#%u/%u expect %u got %u\n",
+			      i, j, value, got);
+		}
+	}
+
+	/* Add element to a full qp-trie (Error) */
+	memset(data, 0, sizeof(data));
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	value = 0;
+	err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+	CHECK(err != -ENOSPC, "add to full qp-trie", "error %d\n", err);
+
+	/* Iterate sorted elements */
+	cur = NULL;
+	memcpy(sorted_keys, keys, sizeof(keys));
+	qsort(sorted_keys, ARRAY_SIZE(sorted_keys), sizeof(sorted_keys[0]), cmp_str);
+	bpf_dynptr_user_init(data, sizeof(data), &dynptr);
+	for (i = 0; i < ARRAY_SIZE(sorted_keys); i++) {
+		unsigned int len;
+		char *got;
+
+		len = strlen(sorted_keys[i]);
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+		CHECK(bpf_dynptr_user_get_size(&dynptr) != len, "iterate",
+		      "#%u invalid len %u expect %u\n",
+		      i, bpf_dynptr_user_get_size(&dynptr), len);
+		got = bpf_dynptr_user_get_data(&dynptr);
+		CHECK(memcmp(sorted_keys[i], got, len), "iterate",
+		      "#%u got %.*s exp %.*s\n", i, len, got, len, sorted_keys[i]);
+
+		if (!cur)
+			cur = &dynptr;
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	/* Delete all elements */
+	for (i = 0; i < ARRAY_SIZE(keys); i++) {
+		bpf_dynptr_user_init((void *)keys[i], strlen(keys[i]), &dynptr);
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u elem error %d\n", i, err);
+
+		/* Lookup deleted element (Error) */
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err != -ENOENT, "lookup elem", "#%u error %d\n", i, err);
+
+		/* Lookup #(i~N] elements */
+		for (j = i + 1; j < ARRAY_SIZE(keys); j++) {
+			bpf_dynptr_user_init((void *)keys[j], strlen(keys[j]), &dynptr);
+			err = bpf_map_lookup_elem(fd, &dynptr, &got);
+			CHECK(err, "lookup elem", "#%u/%u error %d\n", i, j, err);
+			CHECK(got != j, "lookup elem", "#%u/%u expect %u got %u\n",
+			      i, j, value, got);
+		}
+	}
+
+	memset(data, 0, sizeof(data));
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	close(fd);
+}
+
+static void test_qp_trie_all_subtree_update(void)
+{
+	unsigned int i, max_entries, key, value, got;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	int fd, err;
+
+	/* 16 elements per subtree */
+	max_entries = 256 * 16;
+	fd = qp_trie_create(sizeof(key), sizeof(value), max_entries);
+
+	for (i = 0; i < max_entries; i++) {
+		key = htole32(i);
+		bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+		value = i;
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err, "lookup elem", "#%u elem error %d\n", i, err);
+		CHECK(got != value, "lookup elem", "#%u expect %u got %u\n", i, value, got);
+	}
+
+	/* Add element to a full qp-trie (Error) */
+	key = htole32(max_entries + 1);
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	value = 0;
+	err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+	CHECK(err != -ENOSPC, "add to full qp-trie", "error %d\n", err);
+
+	/* Iterate all elements */
+	cur = NULL;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		unsigned int *data;
+		unsigned int exp;
+
+		exp = htole32((i / 16) | ((i & 0xf) << 8));
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+		CHECK(bpf_dynptr_user_get_size(&dynptr) != 4, "iterate",
+		      "#%u invalid len %u\n", i, bpf_dynptr_user_get_size(&dynptr));
+		data = bpf_dynptr_user_get_data(&dynptr);
+		CHECK(data != &key, "dynptr data", "#%u got %p exp %p\n", i, data, &key);
+		CHECK(key != exp, "iterate", "#%u got %u exp %u\n", i, key, exp);
+
+		if (!cur)
+			cur = &dynptr;
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	/* Delete all elements */
+	i = max_entries;
+	while (i-- > 0) {
+		key = i;
+		bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u error %d\n", i, err);
+
+		/* Lookup deleted element (Error) */
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err != -ENOENT, "lookup elem", "#%u error %d\n", i, err);
+	}
+
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	close(fd);
+}
+
+static int binary_insert_data(unsigned int *set, unsigned int nr, unsigned int data)
+{
+	int begin = 0, end = nr - 1, mid, i;
+
+	while (begin <= end) {
+		mid = begin + (end - begin) / 2;
+		if (data == set[mid])
+			return -1;
+		if (data > set[mid])
+			begin = mid + 1;
+		else
+			end = mid - 1;
+	}
+
+	/* Move [begin, nr) backwards and insert new item at begin */
+	i = nr - 1;
+	while (i >= begin) {
+		set[i + 1] = set[i];
+		i--;
+	}
+	set[begin] = data;
+
+	return 0;
+}
+
+/* UINT_MAX will not be in the returned data set */
+static unsigned int *gen_random_unique_data_set(unsigned int max_entries)
+{
+	unsigned int *data_set;
+	unsigned int i, data;
+
+	data_set = malloc(sizeof(*data_set) * max_entries);
+	CHECK(!data_set, "malloc", "no mem");
+
+	for (i = 0; i < max_entries; i++) {
+		while (true) {
+			data = random() % UINT_MAX;
+			if (!binary_insert_data(data_set, i, data))
+				break;
+		}
+	}
+
+	return data_set;
+}
+
+static int cmp_be32(const void *l, const void *r)
+{
+	unsigned int a = htobe32(*(unsigned int *)l), b = htobe32(*(unsigned int *)r);
+
+	if (a < b)
+		return -1;
+	if (a > b)
+		return 1;
+	return 0;
+}
+
+static void test_qp_trie_rdonly_iterate(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 4096;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	/* Iteration results are big-endian ordered */
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		unsigned int *got;
+
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		got = bpf_dynptr_user_get_data(&dynptr);
+		len = bpf_dynptr_user_get_size(&dynptr);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(got != &data, "iterate", "#%u invalid dynptr got %p exp %p\n", i, got, &data);
+		CHECK(*got != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, *got, data_set[i]);
+		cur = &dynptr;
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	/* Iterate from non-existent key */
+	data = htobe32(UINT_MAX);
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	err = bpf_map_get_next_key(fd, &dynptr, &dynptr);
+	CHECK(err, "iterate from non-existent", "error %d\n", err);
+	len = bpf_dynptr_user_get_size(&dynptr);
+	CHECK(len != 4, "iterate", "invalid len %u\n", len);
+	CHECK(data != data_set[0], "iterate", "got 0x%x exp 0x%x\n",
+	      data, data_set[0]);
+
+	free(data_set);
+
+	close(fd);
+}
+
+/*
+ * Delete the current key (also the smallest key) after each iteration step;
+ * the next iteration will return the second smallest key, so the iteration
+ * result is still ordered.
+ */
+static void test_qp_trie_iterate_then_delete(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 4096;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	/* Iteration results are big-endian ordered */
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		len = bpf_dynptr_user_get_size(&dynptr);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(data != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, data, data_set[i]);
+		cur = &dynptr;
+
+		/*
+		 * Delete the minimal key; the next call of bpf_map_get_next_key()
+		 * will return the second minimal key.
+		 */
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u elem error %d\n", i, err);
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	free(data_set);
+
+	close(fd);
+}
+
+/* The range is half-closed: [from, to) */
+static void delete_random_keys_in_range(int fd, unsigned int *data_set,
+					unsigned int from, unsigned int to)
+{
+	unsigned int del_from, del_to;
+
+	if (from >= to)
+		return;
+
+	del_from = random() % (to - from) + from;
+	del_to = random() % (to - del_from) + del_from;
+	for (; del_from <= del_to; del_from++) {
+		struct bpf_dynptr_user dynptr;
+		int err;
+
+		/* Skip deleted keys */
+		if (data_set[del_from] == UINT_MAX)
+			continue;
+
+		bpf_dynptr_user_init(&data_set[del_from], sizeof(data_set[del_from]), &dynptr);
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u range %u-%u error %d\n", del_from, from, to, err);
+		data_set[del_from] = UINT_MAX;
+	}
+}
+
+/* Delete keys randomly and ensure the iteration returns the expected data */
+static void test_qp_trie_iterate_then_batch_delete(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 8192;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		len = bpf_dynptr_user_get_size(&dynptr);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(data != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, data, data_set[i]);
+		cur = &dynptr;
+
+		/* Delete some keys from iterated keys */
+		delete_random_keys_in_range(fd, data_set, 0, i);
+
+		/* Skip deleted keys */
+		while (i + 1 < max_entries) {
+			if (data_set[i + 1] != UINT_MAX)
+				break;
+			i++;
+		}
+
+		/* Delete some keys from to-iterate keys */
+		delete_random_keys_in_range(fd, data_set, i + 1, max_entries);
+
+		/* Skip deleted keys */
+		while (i + 1 < max_entries) {
+			if (data_set[i + 1] != UINT_MAX)
+				break;
+			i++;
+		}
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	free(data_set);
+
+	close(fd);
+}
+
+/*
+ * Add keys with even index first, then add the keys with odd index during
+ * iteration. Check whether or not the whole key set is returned by the
+ * iteration procedure.
+ */
+static void test_qp_trie_iterate_then_add(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr, next_key;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 8192;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		if (i & 1)
+			continue;
+
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	/* Iteration results are big-endian ordered */
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &next_key);
+	for (i = 0; i < max_entries; i++) {
+		err = bpf_map_get_next_key(fd, cur, &next_key);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		len = bpf_dynptr_user_get_size(&next_key);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(data != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, data, data_set[i]);
+		cur = &next_key;
+
+		if ((i & 1) || i + 1 >= max_entries)
+			continue;
+
+		/* Add the odd-index key; it will be returned in the next iteration */
+		bpf_dynptr_user_init(&data_set[i + 1], sizeof(data_set[i + 1]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i + 1, err);
+	}
+	err = bpf_map_get_next_key(fd, cur, &next_key);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	free(data_set);
+
+	close(fd);
+}
+
+static int get_int_from_env(const char *key, int dft)
+{
+	const char *value = getenv(key);
+
+	if (!value)
+		return dft;
+	return atoi(value);
+}
+
+static void free_bytes_set(struct bpf_dynptr_user *set, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++)
+		free(bpf_dynptr_user_get_data(&set[i]));
+	free(set);
+}
+
+struct bpf_dynptr_user *generate_random_bytes_set(unsigned int max_key_len, unsigned int nr)
+{
+	struct bpf_dynptr_user *set;
+	unsigned int i;
+
+	set = malloc(nr * sizeof(*set));
+	CHECK(!set, "malloc", "no mem for set");
+
+	for (i = 0; i < nr; i++) {
+		unsigned char *data;
+		unsigned int len, j;
+
+		len = random() % max_key_len + 1;
+		data = malloc(len);
+		CHECK(!data, "malloc", "no mem for data");
+
+		j = 0;
+		while (j + 4 <= len) {
+			unsigned int rnd = random();
+
+			memcpy(&data[j], &rnd, sizeof(rnd));
+			j += 4;
+		}
+		while (j < len)
+			data[j++] = random();
+
+		bpf_dynptr_user_init(data, len, &set[i]);
+	}
+
+	return set;
+}
+
+static struct bpf_dynptr_user *alloc_dynptr_user(unsigned int len)
+{
+	struct bpf_dynptr_user *dynptr;
+
+	dynptr = malloc(sizeof(*dynptr) + len);
+	if (!dynptr)
+		return NULL;
+
+	bpf_dynptr_user_init(&dynptr[1], len, dynptr);
+
+	return dynptr;
+}
+
+static int cmp_dynptr_user(const struct bpf_dynptr_user *a, const struct bpf_dynptr_user *b)
+{
+	unsigned int a_len = bpf_dynptr_user_get_size(a), b_len = bpf_dynptr_user_get_size(b);
+	unsigned int cmp = a_len < b_len ? a_len : b_len;
+	int ret;
+
+	ret = memcmp(bpf_dynptr_user_get_data(a), bpf_dynptr_user_get_data(b), cmp);
+	if (ret)
+		return ret;
+	return a_len - b_len;
+}
+
+static void dump_dynptr_user(const char *name, const struct bpf_dynptr_user *ptr)
+{
+	unsigned char *data = bpf_dynptr_user_get_data(ptr);
+	unsigned int i, len = bpf_dynptr_user_get_size(ptr);
+
+	fprintf(stderr, "%s dynptr len %u data %p\n", name, len, data);
+
+	for (i = 0; i < len; i++) {
+		fprintf(stderr, "%02x ", data[i]);
+		if (i % 16 == 15)
+			fprintf(stderr, "\n");
+	}
+	fprintf(stderr, "\n");
+}
+
+static void copy_and_reset_dynptr_user(struct bpf_dynptr_user *dst_ptr,
+				       struct bpf_dynptr_user *src_ptr, unsigned int reset_len)
+{
+	unsigned char *dst = bpf_dynptr_user_get_data(dst_ptr);
+	unsigned char *src = bpf_dynptr_user_get_data(src_ptr);
+	unsigned int src_len = bpf_dynptr_user_get_size(src_ptr);
+
+	memcpy(dst, src, src_len);
+	bpf_dynptr_user_init(dst, src_len, dst_ptr);
+	bpf_dynptr_user_init(src, reset_len, src_ptr);
+}
+
+static void *update_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	unsigned int i, j;
+
+	for (i = 0; i < ctx->loop; i++) {
+		for (j = 0; j < ctx->nr; j++) {
+			unsigned int value;
+			int err;
+
+			value = bpf_dynptr_user_get_size(&ctx->set[j]);
+			err = bpf_map_update_elem(ctx->fd, &ctx->set[j], &value, BPF_ANY);
+			if (err) {
+				fprintf(stderr, "update #%u element error %d\n", j, err);
+				return (void *)(long)err;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void *delete_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	unsigned int i, j;
+
+	for (i = 0; i < ctx->loop; i++) {
+		for (j = 0; j < ctx->nr; j++) {
+			int err;
+
+			err = bpf_map_delete_elem(ctx->fd, &ctx->set[j]);
+			if (err && err != -ENOENT) {
+				fprintf(stderr, "delete #%u element error %d\n", j, err);
+				return (void *)(long)err;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void *lookup_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	unsigned int i, j;
+
+	for (i = 0; i < ctx->loop; i++) {
+		for (j = 0; j < ctx->nr; j++) {
+			unsigned int got, value;
+			int err;
+
+			got = 0;
+			value = bpf_dynptr_user_get_size(&ctx->set[j]);
+			err = bpf_map_lookup_elem(ctx->fd, &ctx->set[j], &got);
+			if (!err && got != value) {
+				fprintf(stderr, "lookup #%u element got %u expected %u\n",
+					j, got, value);
+				return (void *)(long)-EINVAL;
+			} else if (err && err != -ENOENT) {
+				fprintf(stderr, "lookup #%u element error %d\n", j, err);
+				return (void *)(long)err;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void *iterate_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	struct bpf_dynptr_user *key, *next_key;
+	unsigned int i;
+	int err;
+
+	key = NULL;
+	next_key = alloc_dynptr_user(ctx->max_key_len);
+	if (!next_key)
+		return (void *)(long)-ENOMEM;
+
+	err = 0;
+	for (i = 0; i < ctx->loop; i++) {
+		while (true) {
+			err = bpf_map_get_next_key(ctx->fd, key, next_key);
+			if (err < 0) {
+				if (err != -ENOENT) {
+					fprintf(stderr, "get key error %d\n", err);
+					goto out;
+				}
+				err = 0;
+				break;
+			}
+
+			/* If no deletion, next key should be greater than key */
+			if (!ctx->nr_delete && key && cmp_dynptr_user(key, next_key) >= 0) {
+				fprintf(stderr, "unordered iteration result\n");
+				dump_dynptr_user("previous key", key);
+				dump_dynptr_user("cur key", next_key);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (!key) {
+				key = alloc_dynptr_user(ctx->max_key_len);
+				if (!key) {
+					err = -ENOMEM;
+					goto out;
+				}
+			}
+
+			/* Copy next_key to key, and reset next_key */
+			copy_and_reset_dynptr_user(key, next_key, ctx->max_key_len);
+		}
+
+		free(key);
+		key = NULL;
+	}
+
+out:
+	free(key);
+	free(next_key);
+	return (void *)(long)err;
+}
+
+static void do_qp_trie_stress_test(const struct stress_conf *conf)
+{
+	void *(*fns[MAX_OP])(void *arg) = {
+		update_fn, delete_fn, lookup_fn, iterate_fn,
+	};
+	unsigned int created[MAX_OP];
+	struct qp_trie_rw_ctx ctx;
+	pthread_t *tids[MAX_OP];
+	unsigned int op, i, err;
+
+	ctx.nr = conf->nr;
+	ctx.max_key_len = conf->max_key_len;
+	ctx.fd = qp_trie_create(ctx.max_key_len, sizeof(unsigned int), ctx.nr);
+	ctx.set = generate_random_bytes_set(ctx.max_key_len, ctx.nr);
+	ctx.loop = conf->loop;
+	ctx.nr_delete = conf->threads[DELETE_OP];
+
+	/* Create threads */
+	for (op = 0; op < ARRAY_SIZE(tids); op++) {
+		if (!conf->threads[op]) {
+			tids[op] = NULL;
+			continue;
+		}
+
+		tids[op] = malloc(conf->threads[op] * sizeof(*tids[op]));
+		CHECK(!tids[op], "malloc", "no mem for op %u threads %u\n", op, conf->threads[op]);
+	}
+
+	for (op = 0; op < ARRAY_SIZE(tids); op++) {
+		for (i = 0; i < conf->threads[op]; i++) {
+			err = pthread_create(&tids[op][i], NULL, fns[op], &ctx);
+			if (err) {
+				fprintf(stderr, "create #%u thread for op %u error %d\n",
+					i, op, err);
+				break;
+			}
+		}
+		created[op] = i;
+	}
+
+	err = 0;
+	for (op = 0; op < ARRAY_SIZE(tids); op++) {
+		for (i = 0; i < created[op]; i++) {
+			void *thread_err = NULL;
+
+			pthread_join(tids[op][i], &thread_err);
+			if (thread_err)
+				err |= 1 << op;
+		}
+	}
+	CHECK(err, "stress operation", "err %u\n", err);
+
+	for (op = 0; op < ARRAY_SIZE(tids); op++)
+		free(tids[op]);
+	free_bytes_set(ctx.set, ctx.nr);
+	close(ctx.fd);
+}
+
+static void test_qp_trie_stress(void)
+{
+	struct stress_conf conf;
+
+	memset(&conf, 0, sizeof(conf));
+
+	/* Test concurrent update, lookup and iterate operations. There is
+	 * no deletion, so iteration can check the order of returned keys.
+	 */
+	conf.threads[UPDATE_OP] = get_int_from_env("QP_TRIE_NR_UPDATE", 8);
+	conf.threads[LOOKUP_OP] = get_int_from_env("QP_TRIE_NR_LOOKUP", 8);
+	conf.threads[ITERATE_OP] = get_int_from_env("QP_TRIE_NR_ITERATE", 8);
+	conf.max_key_len = get_int_from_env("QP_TRIE_MAX_KEY_LEN", 256);
+	conf.loop = get_int_from_env("QP_TRIE_NR_LOOP", 8);
+	conf.nr = get_int_from_env("QP_TRIE_NR_DATA", 8192);
+	do_qp_trie_stress_test(&conf);
+
+	/* Add delete operation */
+	conf.threads[DELETE_OP] = get_int_from_env("QP_TRIE_NR_DELETE", 8);
+	do_qp_trie_stress_test(&conf);
+}
+
+void test_qp_trie_map(void)
+{
+	test_qp_trie_create();
+
+	test_qp_trie_unsupported_op();
+
+	test_qp_trie_bad_update();
+
+	test_qp_trie_bad_lookup_delete();
+
+	test_qp_trie_one_subtree_update();
+
+	test_qp_trie_all_subtree_update();
+
+	test_qp_trie_rdonly_iterate();
+
+	test_qp_trie_iterate_then_delete();
+
+	test_qp_trie_iterate_then_batch_delete();
+
+	test_qp_trie_iterate_then_add();
+
+	test_qp_trie_stress();
+
+	printf("%s:PASS\n", __func__);
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (12 preceding siblings ...)
  2022-09-24 13:36 ` [PATCH bpf-next v2 13/13] selftests/bpf: Add map tests for qp-trie by using bpf syscall Hou Tao
@ 2022-09-26  1:25 ` Alexei Starovoitov
  2022-09-26 13:18   ` Hou Tao
  2022-10-19 17:01 ` Tony Finch
  14 siblings, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-09-26  1:25 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
> 
> The initial motivation for qp-trie map is to reduce memory usage for
> string keys special those with large differencies in length as
> discussed in [0]. And as a big-endian lexicographical ordered map, it
> can also be used for any binary data with fixed or variable length.
> 
> Now the basic functionality of qp-trie is ready, so posting it to get
> more feedback or suggestions about qp-trie. Specially feedback
> about the following questions:
> 
> (1) Use cases for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be the speed up of lookup operations due to lower tree
> depth of qp-trie and the performance of update may also increase.
> But is there any other use cases for qp-trie ? Specially those cases
> which need both ordering and memory efficiency or cases in which qp-trie
> will have high fan-out and its lookup performance will be much better than
> hash-table as shown below:
> 
>   Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
>   htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
>   htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
>   htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
>   htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
>   htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
>   
>   qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
>   qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
>   qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
>   qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
>   qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)
> 
>   * non-zero drops is due to duplicated keys in generated keys.
> 
> (2) Improve update/delete performance for qp-trie
> Now top-5 overheads in update/delete operations are:
> 
>     21.23%  bench    [kernel.vmlinux]    [k] qp_trie_update_elem
>     13.98%  bench    [kernel.vmlinux]    [k] qp_trie_delete_elem
>      7.96%  bench    [kernel.vmlinux]    [k] native_queued_spin_lock_slowpath
>      5.16%  bench    [kernel.vmlinux]    [k] memcpy_erms
>      5.00%  bench    [kernel.vmlinux]    [k] __kmalloc_node
> 
> The top-2 overheads are due to memory access and atomic ops on
> max_entries. I had tried memory prefetch but it didn't work out, maybe
> I did it wrong. For subtree spinlock overhead, I also had tried the
> hierarchical lock by using hand-over-hand lock scheme, but it didn't
> scale well [1]. I will try to increase the number of subtrees from 256
> to 1024, 4096 or bigger and check whether it makes any difference.
> 
> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> a simple try and encounter some problems. One problem is that
> immediate reuse of freed object in bpf memory allocator. Because qp-trie
> uses bpf memory allocator to allocate and free qp_trie_branch, if
> qp_trie_branch is reused immediately, the lookup procedure may oops due
> to the incorrect content in qp_trie_branch. And another problem is the
> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> the total size of key size and value size, but maybe I can use two
> separated bpf_mem_alloc for key and value.

4096 limit for key+value size would be an acceptable trade-off.
With kptrs the user will be able to extend value to much bigger sizes
while doing <= 4096 allocation at a time. Larger allocations are failing
in production more often than not. Any algorithm relying on successful
 >= 4096 allocation is likely to fail. kvmalloc is a fallback that
the kernel is using, but we're not there yet in bpf land.
The benefits of bpf_mem_alloc in qp-trie would be huge though.
qp-trie would work in all contexts including sleepable progs.
As presented the use cases for qp-trie are quite limited.
If I understand correctly the concern for not using bpf_mem_alloc
is that qp_trie_branch can be reused. Can you provide an exact scenario
that will cause issues?
Instead of call_rcu in qp_trie_branch_free (which will work only for
regular progs and have high overhead as demonstrated by mem_alloc patches)
the qp-trie freeing logic can scrub that element, so it's ready to be
reused as another struct qp_trie_branch.
I guess I'm missing how rcu protects these internal data structures of qp-trie.
The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
Is that it?
So to make qp-trie work in sleepable progs the algo would need to
be changed to do both call_rcu and call_rcu_task_trace everywhere
to protect these inner structs?
call_rcu_task_trace can take a long time. So qp_trie_branch-s may linger
around. So quick update/delete (in sleepable with call_rcu_task_trace)
may very well exhaust memory. With bpf_mem_alloc we don't have this issue
since rcu_task_trace gp is observed only when freeing into global mem pool.
Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
What is the worst that can happen? qp_trie_lookup_elem will go into wrong
path, but won't crash, right? Can we do hlist_nulls trick to address that?
In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
Many kernel data structures know how to deal with such object reuse.
We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
construct a logic in a way that obj reuse is not problematic.
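
To make it concrete, a minimal sketch of what such reuse-tolerant logic could
look like (not from this series; QP_TRIE_INDEX_REUSED and the trie->root field
are made up here, the other helpers are the ones from the lookup code below):

static void *qp_trie_lookup_reuse_tolerant(struct qp_trie *trie,
					   const unsigned char *data,
					   unsigned int data_len)
{
	void *node;

retry:
	node = rcu_dereference(trie->root);
	while (node && is_branch_node(node)) {
		struct qp_trie_branch *br = node;
		unsigned int index = READ_ONCE(br->index);
		unsigned int br_bitmap, bitmap, iip;

		/* a freed-and-scrubbed branch advertises the marker */
		if (index == QP_TRIE_INDEX_REUSED)
			goto retry;

		if (index_to_byte_index(index) > data_len)
			return NULL;

		br_bitmap = READ_ONCE(br->bitmap);
		bitmap = calc_br_bitmap(index, data, data_len);
		if (!(bitmap & br_bitmap))
			return NULL;

		iip = calc_twig_index(br_bitmap, bitmap);
		node = rcu_dereference(br->nodes[iip]);
	}
	return node;
}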

Another alternative would be to add explicit rcu_read_lock in qp_trie_lookup_elem
to protect qp_trie_branch during lookup while using bpf_mem_alloc
for both qp_trie_branch and leaf nodes, but that's not a great solution either.
It will allow qp-trie to be usable in sleepable progs, but the use of call_rcu
in update/delete will prevent qp-trie from being usable in tracing progs.

Let's try to brainstorm how to make qp_trie_branch work like SLAB_TYPESAFE_BY_RCU.

Other than this issue the patches look great. This new map would be awesome addition.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-26  1:25 ` [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Alexei Starovoitov
@ 2022-09-26 13:18   ` Hou Tao
  2022-09-27  1:19     ` Alexei Starovoitov
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-26 13:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hi,

On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
SNIP
>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
>> a simple try and encounter some problems. One problem is that
>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
>> uses bpf memory allocator to allocate and free qp_trie_branch, if
>> qp_trie_branch is reused immediately, the lookup procedure may oops due
>> to the incorrect content in qp_trie_branch. And another problem is the
>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
>> the total size of key size and value size, but maybe I can use two
>> separated bpf_mem_alloc for key and value.
> 4096 limit for key+value size would be an acceptable trade-off.
> With kptrs the user will be able to extend value to much bigger sizes
> while doing <= 4096 allocation at a time. Larger allocations are failing
> in production more often than not. Any algorithm relying on successful
>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> the kernel is using, but we're not there yet in bpf land.
> The benefits of bpf_mem_alloc in qp-trie would be huge though.
> qp-trie would work in all contexts including sleepable progs.
> As presented the use cases for qp-trie are quite limited.
> If I understand correctly the concern for not using bpf_mem_alloc
> is that qp_trie_branch can be reused. Can you provide an exact scenario
> that will casue issuses?
The usage of a branch node during lookup is as follows:
(1) check the index field of the branch node, which records the position of the
nibble in which the keys of the child nodes differ
(2) calculate the index of the child node by using the nibble value of the
lookup key at that position
(3) get the pointer to the child node by dereferencing the variable-length
pointer array in the branch node

Because both branch nodes and leaf nodes have variable length, I used one
bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
node, the pointer obtained in step (3) may be invalid.

Even with separate bpf_mem_alloc instances for branch nodes and leaf nodes, it
may still be problematic: the updates to a reused branch node are not atomic,
and branch nodes with different numbers of child nodes will reuse the same
object due to size alignment in the allocator, so the lookup procedure below
may get an uninitialized pointer from the pointer array:

lookup procedure                          update procedure

// three child nodes, 48-bytes
branch node x
                                          // four child nodes, 56-bytes
                                          reuse branch node x
                                          x->bitmap = 0xf
// got an uninitialized pointer
x->nodes[3]
                                          Initialize x->nodes[0~3]

The problem may be solved by zeroing the unused part (or the whole) of the
allocated object. Adding a paired smp_wmb() and smp_rmb() to ensure the update
of the node array happens before the update of the bitmap may also be OK, and
the cost will be much cheaper on x86 hosts.
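
For reference, a rough sketch of that ordering (not the actual patch; nr_child,
children[], new_bitmap and key_bit are placeholders):

/* update side: make the child pointers visible before the bitmap that
 * advertises them
 */
for (i = 0; i < nr_child; i++)
	br->nodes[i] = children[i];
smp_wmb();	/* order the nodes[] stores before the bitmap store */
WRITE_ONCE(br->bitmap, new_bitmap);

/* lookup side: a reader that sees a bit set in the bitmap is then
 * guaranteed to see the corresponding nodes[] slot initialized
 */
key_bit = calc_br_bitmap(br->index, data, data_len);
bitmap = READ_ONCE(br->bitmap);
if (bitmap & key_bit) {
	smp_rmb();	/* pairs with the smp_wmb() above */
	node = rcu_dereference(br->nodes[calc_twig_index(bitmap, key_bit)]);
}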

Besides the lookup procedure, get_next_key() from the syscall path also looks
up the trie locklessly. If a branch node is reused, the order of the returned
keys may be broken. There is also a parent pointer in the branch node which is
used for reverse lookup during get_next_key(), so the reuse may lead to
unexpected skips during iteration.
> Instead of call_rcu in qp_trie_branch_free (which will work only for
> regular progs and have high overhead as demonstrated by mem_alloc patches)
> the qp-trie freeing logic can scrub that element, so it's ready to be
> reused as another struct qp_trie_branch.
> I guess I'm missing how rcu protects this internal data structures of qp-trie.
> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> Is that it?
Yes. The update is made atomic by copying the parent branch node to a new branch
node and replacing the pointer to the parent branch node with the new branch node,
so the lookup procedure finds either the old branch node or the new branch node.
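
In pseudo-C the update path is roughly the following copy-on-write (the helper
names are illustrative, and keeping the child slots sorted is glossed over):

/* build a fully-initialized replacement off to the side */
new_br = qp_trie_branch_alloc(nr_child(old_br) + 1);
if (!new_br)
	return -ENOMEM;
new_br->index = old_br->index;
new_br->bitmap = old_br->bitmap | new_bit;
copy_children_and_insert(new_br, old_br, new_leaf);

/* one atomic publish: concurrent lookups see either old_br or new_br */
rcu_assign_pointer(*parent_slot, new_br);

/* old_br can only be freed/reused after readers are done with it,
 * which is exactly the reuse problem being discussed here
 */
qp_trie_branch_free(old_br);
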
> So to make qp-trie work in sleepable progs the algo would need to
> be changed to do both call_rcu and call_rcu_task_trace everywhere
> to protect these inner structs?
> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> around. So quick update/delete (in sleepable with call_rcu_task_trace)
> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> since rcu_task_trace gp is observed only when freeing into global mem pool.
> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> path, but won't crash, right? Can we do hlist_nulls trick to address that?
> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> Many kernel data structures know how to deal with such object reuse.
> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> construct a logic in a way that obj reuse is not problematic.
As said above, qp_trie_lookup_elem() may be OK with SLAB_TYPESAFE_BY_RCU. But I
don't know how to do it for get_next_key() because the iteration result needs to
be ordered and cannot skip elements that existed before the iteration began.
If we remove immediate reuse from bpf_mem_alloc, besides the possible decrease in
performance, is there any reason we cannot do that?
>
> Another alternative would be to add explicit rcu_read_lock in qp_trie_lookup_elem
> to protect qp_trie_branch during lookup while using bpf_mem_alloc
> for both qp_trie_branch and leaf nodes, but that's not a great solution either.
> It will allow qp-trie to be usable in sleepable, but use of call_rcu
> in update/delete will prevent qp-trie to be usable in tracing progs.
>
> Let's try to brainstorm how to make qp_trie_branch work like SLAB_TYPESAFE_BY_RCU.
>
> Other than this issue the patches look great. This new map would be awesome addition.
Thanks for that.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-26 13:18   ` Hou Tao
@ 2022-09-27  1:19     ` Alexei Starovoitov
  2022-09-27  3:08       ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-09-27  1:19 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> Hi,
> 
> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> > On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> SNIP
> >> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> >> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> >> a simple try and encounter some problems. One problem is that
> >> immediate reuse of freed object in bpf memory allocator. Because qp-trie
> >> uses bpf memory allocator to allocate and free qp_trie_branch, if
> >> qp_trie_branch is reused immediately, the lookup procedure may oops due
> >> to the incorrect content in qp_trie_branch. And another problem is the
> >> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> >> the total size of key size and value size, but maybe I can use two
> >> separated bpf_mem_alloc for key and value.
> > 4096 limit for key+value size would be an acceptable trade-off.
> > With kptrs the user will be able to extend value to much bigger sizes
> > while doing <= 4096 allocation at a time. Larger allocations are failing
> > in production more often than not. Any algorithm relying on successful
> >  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> > the kernel is using, but we're not there yet in bpf land.
> > The benefits of bpf_mem_alloc in qp-trie would be huge though.
> > qp-trie would work in all contexts including sleepable progs.
> > As presented the use cases for qp-trie are quite limited.
> > If I understand correctly the concern for not using bpf_mem_alloc
> > is that qp_trie_branch can be reused. Can you provide an exact scenario
> > that will casue issuses?
> The usage of branch node during lookup is as follows:
> (1) check the index field of branch node which records the position of nibble in
> which the keys of child nodes are different
> (2) calculate the index of child node by using the nibble value of lookup key in
> index position
> (3) get the pointer of child node by dereferencing the variable-length pointer
> array in branch node
> 
> Because both branch node and leaf node have variable length, I used one
> bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
> node, the pointer got in step 3 may be invalid.
> 
> If using separated bpf_mem_alloc for branch node and leaf node, it may still be
> problematic because the updates to a reused branch node are not atomic and
> branch nodes with different child node will reuse the same object due to size
> alignment in allocator, so the lookup procedure below may get an uninitialized
> pointer in the pointer array:
> 
> lookup procedure                                update procedure
> 
> 
> // three child nodes, 48-bytes
> branch node x
>                                                               //  four child
> nodes, 56-bytes
>                                                               reuse branch node x
>                                                               x->bitmap = 0xf
> // got an uninitialized pointer
> x->nodes[3]
>                                                               Initialize
> x->nodes[0~3]

Looking at lookup:
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int bitmap;
+		unsigned int iip;
+
+		/* When byte index equals with key len, the target key
+		 * may be in twigs->nodes[0].
+		 */
+		if (index_to_byte_index(br->index) > data_len)
+			goto done;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto done;
+
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
+	}

To be safe the br->index needs to be initialized after br->nodes and br->bitmap.
While deleting, the br->index can be set to a special value which would mean
restart the lookup from the beginning.
As you're suggesting, with smp_rmb/wmb pairs the lookup will only see a valid br.
Also the race is extremely tight, right?
After br->nodes[iip] + is_branch_node that memory needs to be deleted on another cpu
after spin_lock and reused in update after another spin_lock.
Without an artificial big delay it's hard to imagine how the nodes[iip] pointer
would be initialized to some other qp_trie_branch or leaf during delete,
then the memory reused and nodes[iip] initialized again with the same address.
Theoretically possible, but unlikely, right?
And with correct ordering of scrubbing and updates to
br->nodes, br->bitmap, br->index it can be made safe.

We can add a sequence number to qp_trie_branch as well and read it before and after.
Every reuse would inc the seq.
If the seq number differs, re-read the node pointer from the parent.
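
Something along these lines for one descent step (br->seq being the
hypothetical reuse counter; the miss case is re-checked too, so a reuse
cannot turn into a false negative):

	void *node;
	struct qp_trie_branch *br;
	unsigned int seq, bitmap, iip;
	void *next = NULL;

	do {
		node = rcu_dereference(parent->nodes[pidx]);
		if (!is_branch_node(node))
			break;
		br = node;

		seq = READ_ONCE(br->seq);	/* bumped on every reuse */
		smp_rmb();			/* read seq before index/bitmap/nodes */

		bitmap = calc_br_bitmap(br->index, data, data_len);
		if (bitmap & br->bitmap) {
			iip = calc_twig_index(br->bitmap, bitmap);
			next = rcu_dereference(br->nodes[iip]);
		} else {
			next = NULL;
		}

		smp_rmb();			/* read payload before re-reading seq */
	} while (READ_ONCE(br->seq) != seq);	/* br was reused: re-read it from the parent */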

> The problem may can be solved by zeroing the unused or whole part of allocated
> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
> node array happens before the update of bitmap is also OK and the cost will be
> much cheaper in x86 host.

Something like this, right.
We can also consider doing lookup under spin_lock. For a large branchy trie
the cost of spin_lock may be negligible.

> Beside lookup procedure, get_next_key() from syscall also lookups trie
> locklessly. If the branch node is reused, the order of returned keys may be
> broken. There is also a parent pointer in branch node and it is used for reverse
> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.

qp_trie_lookup_next_node can be done under spin_lock.
Iterating all map elements is a slow operation anyway.

> > Instead of call_rcu in qp_trie_branch_free (which will work only for
> > regular progs and have high overhead as demonstrated by mem_alloc patches)
> > the qp-trie freeing logic can scrub that element, so it's ready to be
> > reused as another struct qp_trie_branch.
> > I guess I'm missing how rcu protects this internal data structures of qp-trie.
> > The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> > Is that it?
> Yes. The update is made atomic by copying the parent branch node to a new branch
> node and replacing the pointer to the parent branch node by the new branch node,
> so the lookup procedure either find the old branch node or the new branch node.
> > So to make qp-trie work in sleepable progs the algo would need to
> > be changed to do both call_rcu and call_rcu_task_trace everywhere
> > to protect these inner structs?
> > call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> > around. So quick update/delete (in sleepable with call_rcu_task_trace)
> > may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> > since rcu_task_trace gp is observed only when freeing into global mem pool.
> > Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> > What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> > path, but won't crash, right? Can we do hlist_nulls trick to address that?
> > In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> > Many kernel data structures know how to deal with such object reuse.
> > We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> > construct a logic in a way that obj reuse is not problematic.
> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
> don't know how to do it for get_next_key because the iteration result needs to
> be ordered and can not skip existed elements before the iterations begins.

imo it's fine to spin_lock in get_next_key.
We should measure the lock overhead in lookup. It might be acceptable too.

> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
> performance, is there any reason we can not do that ?

What do you mean?
Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
As I said above:
" call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
  around. So quick update/delete (in sleepable with call_rcu_task_trace)
  may very well exhaust memory.
"
As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
With call_rcu_tasks_trace it's worse. It cannot sustain such a flood.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-27  1:19     ` Alexei Starovoitov
@ 2022-09-27  3:08       ` Hou Tao
  2022-09-27  3:18         ` Alexei Starovoitov
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-27  3:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hi,

On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
> On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
>>> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>> SNIP
>>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
>>>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
>>>> a simple try and encounter some problems. One problem is that
>>>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
>>>> uses bpf memory allocator to allocate and free qp_trie_branch, if
>>>> qp_trie_branch is reused immediately, the lookup procedure may oops due
>>>> to the incorrect content in qp_trie_branch. And another problem is the
>>>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
>>>> the total size of key size and value size, but maybe I can use two
>>>> separated bpf_mem_alloc for key and value.
>>> 4096 limit for key+value size would be an acceptable trade-off.
>>> With kptrs the user will be able to extend value to much bigger sizes
>>> while doing <= 4096 allocation at a time. Larger allocations are failing
>>> in production more often than not. Any algorithm relying on successful
>>>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
>>> the kernel is using, but we're not there yet in bpf land.
>>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
>>> qp-trie would work in all contexts including sleepable progs.
>>> As presented the use cases for qp-trie are quite limited.
>>> If I understand correctly the concern for not using bpf_mem_alloc
>>> is that qp_trie_branch can be reused. Can you provide an exact scenario
>>> that will casue issuses?
>> The usage of branch node during lookup is as follows:
>> (1) check the index field of branch node which records the position of nibble in
>> which the keys of child nodes are different
>> (2) calculate the index of child node by using the nibble value of lookup key in
>> index position
>> (3) get the pointer of child node by dereferencing the variable-length pointer
>> array in branch node
>>
>> Because both branch node and leaf node have variable length, I used one
>> bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
>> node, the pointer got in step 3 may be invalid.
>>
>> If using separated bpf_mem_alloc for branch node and leaf node, it may still be
>> problematic because the updates to a reused branch node are not atomic and
>> branch nodes with different child node will reuse the same object due to size
>> alignment in allocator, so the lookup procedure below may get an uninitialized
>> pointer in the pointer array:
>>
>> lookup procedure                                update procedure
>>
>>
>> // three child nodes, 48-bytes
>> branch node x
>>                                                               //  four child
>> nodes, 56-bytes
>>                                                               reuse branch node x
>>                                                               x->bitmap = 0xf
>> // got an uninitialized pointer
>> x->nodes[3]
>>                                                               Initialize
>> x->nodes[0~3]
> Looking at lookup:
> +	while (is_branch_node(node)) {
> +		struct qp_trie_branch *br = node;
> +		unsigned int bitmap;
> +		unsigned int iip;
> +
> +		/* When byte index equals with key len, the target key
> +		 * may be in twigs->nodes[0].
> +		 */
> +		if (index_to_byte_index(br->index) > data_len)
> +			goto done;
> +
> +		bitmap = calc_br_bitmap(br->index, data, data_len);
> +		if (!(bitmap & br->bitmap))
> +			goto done;
> +
> +		iip = calc_twig_index(br->bitmap, bitmap);
> +		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
> +	}
>
> To be safe the br->index needs to be initialized after br->nodex and br->bitmap.
> While deleting the br->index can be set to special value which would mean
> restart the lookup from the beginning.
> As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br.
> Also the race is extremely tight, right?
> After brb->nodes[iip] + is_branch_node that memory needs to deleted on other cpu
> after spin_lock and reused in update after another spin_lock.
> Without artifical big delay it's hard to imagine how nodes[iip] pointer
> would be initialized to some other qp_trie_branch or leaf during delete,
> then memory reused and nodes[iip] is initialized again with the same address.
> Theoretically possible, but unlikely, right?
> And with correct ordering of scrubbing and updates to
> br->nodes, br->bitmap, br->index it can be made safe.
The reuse of a node not only introduces a safety problem (e.g. accessing an
invalid pointer), but also incurs a false-negative problem (e.g. not finding an
existing element) as shown below:

lookup A in X on CPU1            update X on CPU 2

     [ branch X v1 ]                   
 leaf A | leaf B | leaf C
                                                 [ branch X v2 ]
                                               leaf A | leaf B | leaf C | leaf D
                                                 
                                                  // free and reuse branch X v1
                                                  [ branch X v1 ]
                                                leaf O | leaf P | leaf Q
// leaf A can not be found
> We can add a sequence number to qp_trie_branch as well and read it before and after.
> Every reuse would inc the seq.
> If seq number differs, re-read the node pointer form parent.
A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
consider the starvation of lookup by update/deletion. Maybe we need to fall back
to the subtree spinlock after some number of rereads.
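
Something like this is what I have in mind (sketch only; the retry limit, the
step helper and the locked slow path are all made up):

	unsigned int retries = 0;
	unsigned long flags;
	void *node = NULL;

	while (qp_trie_descend_step(subtree, &node, data, data_len) == -EAGAIN) {
		if (++retries > QP_TRIE_MAX_LOCKLESS_RETRIES) {
			/* lockless lookup is being starved by updates,
			 * fall back to the subtree lock
			 */
			spin_lock_irqsave(&subtree->lock, flags);
			node = qp_trie_lookup_locked(subtree, data, data_len);
			spin_unlock_irqrestore(&subtree->lock, flags);
			break;
		}
	}
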
>> The problem may can be solved by zeroing the unused or whole part of allocated
>> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
>> node array happens before the update of bitmap is also OK and the cost will be
>> much cheaper in x86 host.
> Something like this, right.
> We can also consider doing lookup under spin_lock. For a large branchy trie
> the cost of spin_lock maybe negligible.
Do you mean adding an extra spinlock to qp_trie_branch to protect against reuse,
or taking the subtree spinlock during lookup? IMO the latter will make the
lookup performance suffer, but I will check it as well.
>
>> Beside lookup procedure, get_next_key() from syscall also lookups trie
>> locklessly. If the branch node is reused, the order of returned keys may be
>> broken. There is also a parent pointer in branch node and it is used for reverse
>> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.
> qp_trie_lookup_next_node can be done under spin_lock.
> Iterating all map elements is a slow operation anyway.
OK. Taking the subtree spinlock is simpler but the scalability will be bad. Not
sure whether the solution for lockless lookup will also work for get_next_key().
Will check.
>
>>> Instead of call_rcu in qp_trie_branch_free (which will work only for
>>> regular progs and have high overhead as demonstrated by mem_alloc patches)
>>> the qp-trie freeing logic can scrub that element, so it's ready to be
>>> reused as another struct qp_trie_branch.
>>> I guess I'm missing how rcu protects this internal data structures of qp-trie.
>>> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
>>> Is that it?
>> Yes. The update is made atomic by copying the parent branch node to a new branch
>> node and replacing the pointer to the parent branch node by the new branch node,
>> so the lookup procedure either find the old branch node or the new branch node.
>>> So to make qp-trie work in sleepable progs the algo would need to
>>> be changed to do both call_rcu and call_rcu_task_trace everywhere
>>> to protect these inner structs?
>>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>>> around. So quick update/delete (in sleepable with call_rcu_task_trace)
>>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
>>> since rcu_task_trace gp is observed only when freeing into global mem pool.
>>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
>>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
>>> path, but won't crash, right? Can we do hlist_nulls trick to address that?
>>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
>>> Many kernel data structures know how to deal with such object reuse.
>>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
>>> construct a logic in a way that obj reuse is not problematic.
>> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
>> don't know how to do it for get_next_key because the iteration result needs to
>> be ordered and can not skip existed elements before the iterations begins.
> imo it's fine to spin_lock in get_next_key.
> We should measure the lock overhead in lookup. It might be acceptable too.
Will check that.
>
>> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
>> performance, is there any reason we can not do that ?
> What do you mean?
> Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
Yes. Does doing call_rcu() + call_rcu_tasks_trace() in batches help, just like
free_bulk() does?
> As I said above:
> " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>   around. So quick update/delete (in sleepable with call_rcu_task_trace)
>   may very well exhaust memory.
> "
> As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
> before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
> With call_rcu_tasks_trace it's worse. It cannot sustain such flood.
> .
Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
exhaust memory if __free_rcu_tasks_trace() cannot be called in time. Will take a
closer look at that.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-27  3:08       ` Hou Tao
@ 2022-09-27  3:18         ` Alexei Starovoitov
  2022-09-27 14:07           ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-09-27  3:18 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
> > On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> >>> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> >>>> From: Hou Tao <houtao1@huawei.com>
> >> SNIP
> >>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> >>>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> >>>> a simple try and encounter some problems. One problem is that
> >>>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
> >>>> uses bpf memory allocator to allocate and free qp_trie_branch, if
> >>>> qp_trie_branch is reused immediately, the lookup procedure may oops due
> >>>> to the incorrect content in qp_trie_branch. And another problem is the
> >>>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> >>>> the total size of key size and value size, but maybe I can use two
> >>>> separated bpf_mem_alloc for key and value.
> >>> 4096 limit for key+value size would be an acceptable trade-off.
> >>> With kptrs the user will be able to extend value to much bigger sizes
> >>> while doing <= 4096 allocation at a time. Larger allocations are failing
> >>> in production more often than not. Any algorithm relying on successful
> >>>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> >>> the kernel is using, but we're not there yet in bpf land.
> >>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
> >>> qp-trie would work in all contexts including sleepable progs.
> >>> As presented the use cases for qp-trie are quite limited.
> >>> If I understand correctly the concern for not using bpf_mem_alloc
> >>> is that qp_trie_branch can be reused. Can you provide an exact scenario
> >>> that will casue issuses?
> >> The usage of branch node during lookup is as follows:
> >> (1) check the index field of branch node which records the position of nibble in
> >> which the keys of child nodes are different
> >> (2) calculate the index of child node by using the nibble value of lookup key in
> >> index position
> >> (3) get the pointer of child node by dereferencing the variable-length pointer
> >> array in branch node
> >>
> >> Because both branch node and leaf node have variable length, I used one
> >> bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
> >> node, the pointer got in step 3 may be invalid.
> >>
> >> If using separated bpf_mem_alloc for branch node and leaf node, it may still be
> >> problematic because the updates to a reused branch node are not atomic and
> >> branch nodes with different child node will reuse the same object due to size
> >> alignment in allocator, so the lookup procedure below may get an uninitialized
> >> pointer in the pointer array:
> >>
> >> lookup procedure                                update procedure
> >>
> >>
> >> // three child nodes, 48-bytes
> >> branch node x
> >>                                                               //  four child
> >> nodes, 56-bytes
> >>                                                               reuse branch node x
> >>                                                               x->bitmap = 0xf
> >> // got an uninitialized pointer
> >> x->nodes[3]
> >>                                                               Initialize
> >> x->nodes[0~3]
> > Looking at lookup:
> > +     while (is_branch_node(node)) {
> > +             struct qp_trie_branch *br = node;
> > +             unsigned int bitmap;
> > +             unsigned int iip;
> > +
> > +             /* When byte index equals with key len, the target key
> > +              * may be in twigs->nodes[0].
> > +              */
> > +             if (index_to_byte_index(br->index) > data_len)
> > +                     goto done;
> > +
> > +             bitmap = calc_br_bitmap(br->index, data, data_len);
> > +             if (!(bitmap & br->bitmap))
> > +                     goto done;
> > +
> > +             iip = calc_twig_index(br->bitmap, bitmap);
> > +             node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
> > +     }
> >
> > To be safe the br->index needs to be initialized after br->nodex and br->bitmap.
> > While deleting the br->index can be set to special value which would mean
> > restart the lookup from the beginning.
> > As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br.
> > Also the race is extremely tight, right?
> > After brb->nodes[iip] + is_branch_node that memory needs to deleted on other cpu
> > after spin_lock and reused in update after another spin_lock.
> > Without artifical big delay it's hard to imagine how nodes[iip] pointer
> > would be initialized to some other qp_trie_branch or leaf during delete,
> > then memory reused and nodes[iip] is initialized again with the same address.
> > Theoretically possible, but unlikely, right?
> > And with correct ordering of scrubbing and updates to
> > br->nodes, br->bitmap, br->index it can be made safe.
> The reuse of node not only introduces the safety problem (e.g. access an invalid
> pointer), but also incur the false negative problem (e.g. can not find an
> existent element) as show below:
>
> lookup A in X on CPU1            update X on CPU 2
>
>      [ branch X v1 ]
>  leaf A | leaf B | leaf C
>                                                  [ branch X v2 ]
>                                                leaf A | leaf B | leaf C | leaf D
>
>                                                   // free and reuse branch X v1
>                                                   [ branch X v1 ]
>                                                 leaf O | leaf P | leaf Q
> // leaf A can not be found

Right. That's why I suggested considering the hlist_nulls-like approach
that htab is using.

> > We can add a sequence number to qp_trie_branch as well and read it before and after.
> > Every reuse would inc the seq.
> > If seq number differs, re-read the node pointer form parent.
> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
> consider the starvation of lookup by update/deletion. Maybe need fallback to the
> subtree spinlock after some reread.

I think the fallback is an overkill. The race is extremely unlikely.

> >> The problem may can be solved by zeroing the unused or whole part of allocated
> >> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
> >> node array happens before the update of bitmap is also OK and the cost will be
> >> much cheaper in x86 host.
> > Something like this, right.
> > We can also consider doing lookup under spin_lock. For a large branchy trie
> > the cost of spin_lock maybe negligible.
> Do you meaning adding an extra spinlock to qp_trie_branch to protect again reuse
> or taking the subtree spinlock during lookup ? IMO the latter will make the
> lookup performance suffer, but I will check it as well.

subtree lock. lookup perf will suffer a bit.
The numbers will tell the true story.

> >
> >> Beside lookup procedure, get_next_key() from syscall also lookups trie
> >> locklessly. If the branch node is reused, the order of returned keys may be
> >> broken. There is also a parent pointer in branch node and it is used for reverse
> >> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.
> > qp_trie_lookup_next_node can be done under spin_lock.
> > Iterating all map elements is a slow operation anyway.
> OK. Taking subtree spinlock is simpler but the scalability will be bad. Not sure
> whether or not the solution for lockless lookup will work for get_next_key. Will
> check.

What kind of scalability are you concerned about?
get_next is done by user space only. Plenty of overhead already.

> >
> >>> Instead of call_rcu in qp_trie_branch_free (which will work only for
> >>> regular progs and have high overhead as demonstrated by mem_alloc patches)
> >>> the qp-trie freeing logic can scrub that element, so it's ready to be
> >>> reused as another struct qp_trie_branch.
> >>> I guess I'm missing how rcu protects this internal data structures of qp-trie.
> >>> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> >>> Is that it?
> >> Yes. The update is made atomic by copying the parent branch node to a new branch
> >> node and replacing the pointer to the parent branch node by the new branch node,
> >> so the lookup procedure either find the old branch node or the new branch node.
> >>> So to make qp-trie work in sleepable progs the algo would need to
> >>> be changed to do both call_rcu and call_rcu_task_trace everywhere
> >>> to protect these inner structs?
> >>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> >>> around. So quick update/delete (in sleepable with call_rcu_task_trace)
> >>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> >>> since rcu_task_trace gp is observed only when freeing into global mem pool.
> >>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> >>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> >>> path, but won't crash, right? Can we do hlist_nulls trick to address that?
> >>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> >>> Many kernel data structures know how to deal with such object reuse.
> >>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> >>> construct a logic in a way that obj reuse is not problematic.
> >> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
> >> don't know how to do it for get_next_key because the iteration result needs to
> >> be ordered and can not skip existed elements before the iterations begins.
> > imo it's fine to spin_lock in get_next_key.
> > We should measure the lock overhead in lookup. It might be acceptable too.
> Will check that.
> >
> >> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
> >> performance, is there any reason we can not do that ?
> > What do you mean?
> > Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
> Yes. Does doing call_rcu() + call_rcu_task_trace in batch help just like
> free_bulk does ?
> > As I said above:
> > " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> >   around. So quick update/delete (in sleepable with call_rcu_task_trace)
> >   may very well exhaust memory.
> > "
> > As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
> > before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
> > With call_rcu_tasks_trace it's worse. It cannot sustain such flood.
> > .
> Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
> exhaust memory if __free_rcu_tasks_trace() can not called timely, Will take a
> close lookup on that.

In theory, yes. The batching makes a big difference.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-24 13:36 ` [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map Hou Tao
@ 2022-09-27 11:24   ` Quentin Monnet
  2022-09-28  4:14     ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Quentin Monnet @ 2022-09-27 11:24 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Sat Sep 24 2022 14:36:15 GMT+0100 (British Summer Time) ~ Hou Tao
<houtao@huaweicloud.com>
> From: Hou Tao <houtao1@huawei.com>
> 
> Support lookup/update/delete/iterate/dump operations for qp-trie in
> bpftool. Mainly add two functions: one function to parse dynptr key and
> another one to dump dynptr key. The input format of dynptr key is:
> "key [hex] size BYTES" and the output format of dynptr key is:
> "size BYTES".
> 
> The following is the output when using bpftool to manipulate
> qp-trie:
> 
>   $ bpftool map pin id 724953 /sys/fs/bpf/qp
>   $ bpftool map show pinned /sys/fs/bpf/qp
>   724953: qp_trie  name qp_trie  flags 0x1
>           key 16B  value 4B  max_entries 2  memlock 65536B  map_extra 8
>           btf_id 779
>           pids test_qp_trie.bi(109167)
>   $ bpftool map dump pinned /sys/fs/bpf/qp
>   [{
>           "key": {
>               "size": 4,
>               "data": ["0x0","0x0","0x0","0x0"
>               ]
>           },
>           "value": 0
>       },{
>           "key": {
>               "size": 4,
>               "data": ["0x0","0x0","0x0","0x1"
>               ]
>           },
>           "value": 2
>       }
>   ]
>   $ bpftool map lookup pinned /sys/fs/bpf/qp key 4 0 0 0 1
>   {
>       "key": {
>           "size": 4,
>           "data": ["0x0","0x0","0x0","0x1"
>           ]
>       },
>       "value": 2
>   }

The bpftool patch looks good, thanks! I have one comment on the syntax for the
keys: I don't find it intuitive to have the size as the first BYTE. It makes it
awkward to understand what the command does if we read it in the wild without
knowing the map type. I can see two alternatives: either adding a keyword
(e.g., "key_size 4 key 0 0 0 1"), or changing parse_bytes() to make it able to
parse as much as it can and then count the bytes, for when we don't know in
advance how many we get.
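
With the keyword alternative, the lookup from your commit message could for
instance be spelled as follows (hypothetical syntax, just to illustrate the
suggestion, not something bpftool supports today):

  $ bpftool map lookup pinned /sys/fs/bpf/qp key_size 4 key 0 0 0 1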

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-27  3:18         ` Alexei Starovoitov
@ 2022-09-27 14:07           ` Hou Tao
  2022-09-28  1:08             ` Alexei Starovoitov
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-27 14:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

Hi,

On 9/27/2022 11:18 AM, Alexei Starovoitov wrote:
> On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>>>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
>>>>>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
>>>>>> a simple try and encounter some problems. One problem is that
>>>>>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
>>>>>> uses bpf memory allocator to allocate and free qp_trie_branch, if
>>>>>> qp_trie_branch is reused immediately, the lookup procedure may oops due
>>>>>> to the incorrect content in qp_trie_branch. And another problem is the
>>>>>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
>>>>>> the total size of key size and value size, but maybe I can use two
>>>>>> separated bpf_mem_alloc for key and value.
>>>>> 4096 limit for key+value size would be an acceptable trade-off.
>>>>> With kptrs the user will be able to extend value to much bigger sizes
>>>>> while doing <= 4096 allocation at a time. Larger allocations are failing
>>>>> in production more often than not. Any algorithm relying on successful
>>>>>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
>>>>> the kernel is using, but we're not there yet in bpf land.
>>>>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
>>>>> qp-trie would work in all contexts including sleepable progs.
>>>>> As presented the use cases for qp-trie are quite limited.
>>>>> If I understand correctly the concern for not using bpf_mem_alloc
>>>>> is that qp_trie_branch can be reused. Can you provide an exact scenario
>>>>> that will casue issuses?
SNIP
>>>> Looking at lookup:
>>>> +     while (is_branch_node(node)) {
>>>> +             struct qp_trie_branch *br = node;
>>>> +             unsigned int bitmap;
>>>> +             unsigned int iip;
>>>> +
>>>> +             /* When byte index equals with key len, the target key
>>>> +              * may be in twigs->nodes[0].
>>>> +              */
>>>> +             if (index_to_byte_index(br->index) > data_len)
>>>> +                     goto done;
>>>> +
>>>> +             bitmap = calc_br_bitmap(br->index, data, data_len);
>>>> +             if (!(bitmap & br->bitmap))
>>>> +                     goto done;
>>>> +
>>>> +             iip = calc_twig_index(br->bitmap, bitmap);
>>>> +             node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
>>>> +     }
>>>>
>>>> To be safe the br->index needs to be initialized after br->nodex and br->bitmap.
>>>> While deleting the br->index can be set to special value which would mean
>>>> restart the lookup from the beginning.
>>>> As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br.
>>>> Also the race is extremely tight, right?
>>>> After brb->nodes[iip] + is_branch_node that memory needs to deleted on other cpu
>>>> after spin_lock and reused in update after another spin_lock.
>>>> Without artifical big delay it's hard to imagine how nodes[iip] pointer
>>>> would be initialized to some other qp_trie_branch or leaf during delete,
>>>> then memory reused and nodes[iip] is initialized again with the same address.
>>>> Theoretically possible, but unlikely, right?
>>>> And with correct ordering of scrubbing and updates to
>>>> br->nodes, br->bitmap, br->index it can be made safe.
>> The reuse of node not only introduces the safety problem (e.g. access an invalid
>> pointer), but also incur the false negative problem (e.g. can not find an
>> existent element) as show below:
>>
>> lookup A in X on CPU1            update X on CPU 2
>>
>>      [ branch X v1 ]
>>  leaf A | leaf B | leaf C
>>                                                  [ branch X v2 ]
>>                                                leaf A | leaf B | leaf C | leaf D
>>
>>                                                   // free and reuse branch X v1
>>                                                   [ branch X v1 ]
>>                                                 leaf O | leaf P | leaf Q
>> // leaf A can not be found
> Right. That's why I suggested to consider hlist_nulls-like approach
> that htab is using.
>
>>> We can add a sequence number to qp_trie_branch as well and read it before and after.
>>> Every reuse would inc the seq.
>>> If seq number differs, re-read the node pointer form parent.
>> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
>> consider the starvation of lookup by update/deletion. Maybe need fallback to the
>> subtree spinlock after some reread.
> I think the fallback is an overkill. The race is extremely unlikely.
OK. Will add a test on a tiny qp-trie to ensure it is OK.
>>>> The problem may can be solved by zeroing the unused or whole part of allocated
>>>> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
>>>> node array happens before the update of bitmap is also OK and the cost will be
>>>> much cheaper in x86 host.
>>> Something like this, right.
>>> We can also consider doing lookup under spin_lock. For a large branchy trie
>>> the cost of spin_lock maybe negligible.
>> Do you meaning adding an extra spinlock to qp_trie_branch to protect again reuse
>> or taking the subtree spinlock during lookup ? IMO the latter will make the
>> lookup performance suffer, but I will check it as well.
> subtree lock. lookup perf will suffer a bit.
> The numbers will tell the true story.
A quick benchmark shows the performance is bad when using the subtree lock for lookup:

Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
* no lock
qp-trie lookup   (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem 0.000 MiB)

* subtree lock
qp-trie lookup   (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem 0.000 MiB)

Strings in /proc/kallsyms (key size=83, max entries=170958)
* no lock
qp-trie lookup   (1  thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)    8.226 ± 0.009M/s (drops 0.500 ± 0.002M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)   15.356 ± 0.034M/s (drops 0.933 ± 0.006M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)   30.037 ± 0.584M/s (drops 1.827 ± 0.037M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)   62.600 ± 0.307M/s (drops 3.808 ± 0.029M/s mem 0.000 MiB)

* subtree lock
qp-trie lookup   (1  thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)    4.883 ± 0.500M/s (drops 0.297 ± 0.031M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)    5.771 ± 0.137M/s (drops 0.351 ± 0.008M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)    5.926 ± 0.104M/s (drops 0.359 ± 0.011M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)    5.947 ± 0.171M/s (drops 0.362 ± 0.023M/s mem 0.000 MiB)
>>>> Beside lookup procedure, get_next_key() from syscall also lookups trie
>>>> locklessly. If the branch node is reused, the order of returned keys may be
>>>> broken. There is also a parent pointer in branch node and it is used for reverse
>>>> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.
>>> qp_trie_lookup_next_node can be done under spin_lock.
>>> Iterating all map elements is a slow operation anyway.
>> OK. Taking subtree spinlock is simpler but the scalability will be bad. Not sure
>> whether or not the solution for lockless lookup will work for get_next_key. Will
>> check.
> What kind of scalability are you concerned about?
> get_next is done by user space only. Plenty of overhead already.
As an ordered map, the next and prev iteration operations may be needed in bpf
programs in the future. For now, I think it is OK.
>>>>> Instead of call_rcu in qp_trie_branch_free (which will work only for
>>>>> regular progs and have high overhead as demonstrated by mem_alloc patches)
>>>>> the qp-trie freeing logic can scrub that element, so it's ready to be
>>>>> reused as another struct qp_trie_branch.
>>>>> I guess I'm missing how rcu protects this internal data structures of qp-trie.
>>>>> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
>>>>> Is that it?
>>>> Yes. The update is made atomic by copying the parent branch node to a new branch
>>>> node and replacing the pointer to the parent branch node by the new branch node,
>>>> so the lookup procedure either find the old branch node or the new branch node.
>>>>> So to make qp-trie work in sleepable progs the algo would need to
>>>>> be changed to do both call_rcu and call_rcu_task_trace everywhere
>>>>> to protect these inner structs?
>>>>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>>>>> around. So quick update/delete (in sleepable with call_rcu_task_trace)
>>>>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
>>>>> since rcu_task_trace gp is observed only when freeing into global mem pool.
>>>>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
>>>>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
>>>>> path, but won't crash, right? Can we do hlist_nulls trick to address that?
>>>>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
>>>>> Many kernel data structures know how to deal with such object reuse.
>>>>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
>>>>> construct a logic in a way that obj reuse is not problematic.
>>>> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
>>>> don't know how to do it for get_next_key because the iteration result needs to
>>>> be ordered and can not skip existed elements before the iterations begins.
>>> imo it's fine to spin_lock in get_next_key.
>>> We should measure the lock overhead in lookup. It might be acceptable too.
>> Will check that.
>>>> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
>>>> performance, is there any reason we can not do that ?
>>> What do you mean?
>>> Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
>> Yes. Does doing call_rcu() + call_rcu_task_trace in batch help just like
>> free_bulk does ?
>>> As I said above:
>>> " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>>>   around. So quick update/delete (in sleepable with call_rcu_task_trace)
>>>   may very well exhaust memory.
>>> "
>>> As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
>>> before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
>>> With call_rcu_tasks_trace it's worse. It cannot sustain such flood.
>>> .
I cannot reproduce the phenomenon that call_rcu consumes 100% of all CPUs in my
local environment. Could you share the setup for it?

The following is the output of perf report (--no-children) for
"./map_perf_test 4 72 10240 100000" on an x86-64 host with 72 CPUs:

    26.63%  map_perf_test    [kernel.vmlinux]                             [k] alloc_htab_elem
    21.57%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_update_elem
    18.08%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_delete_elem
    12.30%  map_perf_test    [kernel.vmlinux]                             [k] free_htab_elem
    10.55%  map_perf_test    [kernel.vmlinux]                             [k] __htab_map_lookup_elem
     1.58%  map_perf_test    [kernel.vmlinux]                             [k] bpf_map_kmalloc_node
     1.39%  map_perf_test    [kernel.vmlinux]                             [k] _raw_spin_lock_irqsave
     1.37%  map_perf_test    [kernel.vmlinux]                             [k] __copy_map_value.constprop.0
     0.45%  map_perf_test    [kernel.vmlinux]                             [k] check_and_free_fields
     0.33%  map_perf_test    [kernel.vmlinux]                             [k] rcu_segcblist_enqueue

The overhead of call_rcu is tiny compared with the hash map operations. Instead,
alloc_htab_elem() and free_htab_elem() are the bottlenecks. The following is
the perf report output after applying bpf_mem_alloc:

    25.35%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_delete_elem
    23.69%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_update_elem
     8.42%  map_perf_test    [kernel.vmlinux]                             [k] __htab_map_lookup_elem
     7.60%  map_perf_test    [kernel.vmlinux]                             [k] alloc_htab_elem
     4.35%  map_perf_test    [kernel.vmlinux]                             [k] free_htab_elem
     2.28%  map_perf_test    [kernel.vmlinux]                             [k] memcpy_erms
     2.24%  map_perf_test    [kernel.vmlinux]                             [k] jhash
     2.02%  map_perf_test    [kernel.vmlinux]                             [k] _raw_spin_lock_irqsave

>> Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
>> exhaust memory if __free_rcu_tasks_trace() can not called timely, Will take a
>> close lookup on that.
> In theory. yes. The batching makes a big difference.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-27 14:07           ` Hou Tao
@ 2022-09-28  1:08             ` Alexei Starovoitov
  2022-09-28  3:27               ` Hou Tao
  2022-09-28  8:45               ` Hou Tao
  0 siblings, 2 replies; 52+ messages in thread
From: Alexei Starovoitov @ 2022-09-28  1:08 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> A quick benchmark show the performance is bad when using subtree lock for lookup:
>
> Randomly-generated binary data (key size=255, max entries=16K, key length
> range:[1, 255])
> * no lock
> qp-trie lookup   (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem
> 0.000 MiB)
>
> * subtree lock
> qp-trie lookup   (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem
> 0.000 MiB)

That's indeed significant.
But I interpret it differently.
Since single-thread perf is close enough while 16 threads
suffer 3x, it means the lock mechanism is inefficient.
It means update/delete performance equally doesn't scale.

> Strings in /proc/kallsyms (key size=83, max entries=170958)
> * no lock
> qp-trie lookup   (1  thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem
> 0.000 MiB)
>
> * subtree lock
> qp-trie lookup   (1  thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem
> 0.000 MiB)

Here a single thread with spin_lock is _faster_ than without.
So it's not about the cost of spin_lock, but its contention.
So all the complexity to do lockless lookup
needs to be considered in this context.
Looks like update/delete don't scale anyway.
So lock-less lookup complexity is justified only
for the case with a lot of concurrent lookups and
little update/delete.
When bpf hash map was added the majority of tracing use cases
had # of lookups == # of updates == # of deletes.
For qp-trie we obviously cannot predict,
but should not pivot strongly into lock-less lookup without data.
The lock-less lookup should have reasonable complexity.
Otherwise let's do spin_lock and work on a different locking scheme
for all operations (lookup/update/delete).

> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
> local environment, could you share the setup for it ?
>
> The following is the output of perf report (--no-children) for "./map_perf_test
> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>
>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_htab_elem
>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_update_elem

Looks like the perf is lost on atomic_inc/dec.
Try a partial revert of mem_alloc.
In particular to make sure
commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
is reverted and call_rcu is in place,
but percpu counter optimization is still there.
Also please use 'map_perf_test 4'.
I doubt 1000 vs 10240 will make a difference, but still.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-28  1:08             ` Alexei Starovoitov
@ 2022-09-28  3:27               ` Hou Tao
  2022-09-28  4:37                 ` Alexei Starovoitov
  2022-09-28  8:45               ` Hou Tao
  1 sibling, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-28  3:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

Hi,

On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> A quick benchmark show the performance is bad when using subtree lock for lookup:
>>
>> Randomly-generated binary data (key size=255, max entries=16K, key length
>> range:[1, 255])
>> * no lock
>> qp-trie lookup   (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem
>> 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup   (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem
>> 0.000 MiB)
> That's indeed significant.
> But I interpret it differently.
> Since single thread perf is close enough while 16 thread
> suffers 3x it means the lock mechanism is inefficient.
> It means update/delete performance equally doesn't scale.
>
>> Strings in /proc/kallsyms (key size=83, max entries=170958)
>> * no lock
>> qp-trie lookup   (1  thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem
>> 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup   (1  thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem
>> 0.000 MiB)
> Here a single thread with spin_lock is _faster_ than without.
That is a little strange, because the lockless lookup does not add any extra overhead
for now. It may be due to CPU frequency boost or something similar. Will check it again.
> So it's not about the cost of spin_lock, but its contention.
> So all the complexity to do lockless lookup
> needs to be considered in this context.
> Looks like update/delete don't scale anyway.
Yes.
> So lock-less lookup complexity is justified only
> for the case with a lot of concurrent lookups and
> little update/delete.
> When bpf hash map was added the majority of tracing use cases
> had # of lookups == # of updates == # of deletes.
> For qp-trie we obviously cannot predict,
> but should not pivot strongly into lock-less lookup without data.
> The lock-less lookup should have reasonable complexity.
> Otherwise let's do spin_lock and work on a different locking scheme
> for all operations (lookup/update/delete).
For lpm-trie, doesn't the use case in Cilium [0] have many lockless lookups and few
updates/deletions? If so, that use case is applicable to qp-trie as well.
Before steering towards spin_lock, let's implement a demo of the lock-less lookup
to see its complexity. As for the spin-lock way, I had implemented a hand-over-hand
locking scheme (with the lookup still lockless), but it did not scale well either.
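
For reference, by hand-over-hand locking I mean the classic lock-coupling scheme:
take the child's lock before releasing the parent's while descending, so concurrent
writers only contend on the nodes along the same path. A minimal sketch (the node
layout and the next-child helper are placeholders, not the code I had):

/* Sketch of lock coupling during an update descent; names are placeholders. */
struct qpt_node {
        spinlock_t lock;
        /* bitmap, children, ... */
};

/* Placeholder: returns the child to follow for the given key, or NULL at a leaf. */
static struct qpt_node *qpt_next_child(struct qpt_node *n, const void *key, u32 key_len);

static struct qpt_node *descend_locked(struct qpt_node *root, const void *key, u32 key_len)
{
        struct qpt_node *cur = root, *next;

        spin_lock(&cur->lock);
        while ((next = qpt_next_child(cur, key, key_len)) != NULL) {
                /* Take the child's lock before dropping the parent's so no
                 * writer can slip in between the two nodes.
                 */
                spin_lock(&next->lock);
                spin_unlock(&cur->lock);
                cur = next;
        }
        /* The returned node is still locked; the caller modifies it and unlocks it. */
        return cur;
}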
>
>> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
>> local environment, could you share the setup for it ?
>>
>> The following is the output of perf report (--no-children) for "./map_perf_test
>> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>>
>>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
>> alloc_htab_elem
>>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_update_elem
> Looks like the perf is lost on atomic_inc/dec.
> Try a partial revert of mem_alloc.
> In particular to make sure
> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> is reverted and call_rcu is in place,
> but percpu counter optimization is still there.
> Also please use 'map_perf_test 4'.
> I doubt 1000 vs 10240 will make a difference, but still.
Will do. The suggestion is reasonable. But when max_entries=1000,
use_percpu_counter will be false.
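
To illustrate why (this is only the shape of the heuristic, not the actual check in
kernel/bpf/hashtab.c): the per-cpu counter is only used when max_entries is large
relative to the number of CPUs times the counter batch size, so a small map on a
host with many CPUs falls back to a plain atomic counter:

/* Illustrative only -- not the actual kernel code. */
static bool use_percpu_counter_example(u32 max_entries, u32 nr_cpus, u32 batch)
{
        /* A per-cpu counter only pays off when every CPU can accumulate a
         * meaningful batch before bumping against the map limit.
         */
        return max_entries / 2 > (u64)nr_cpus * batch;
}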


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-27 11:24   ` Quentin Monnet
@ 2022-09-28  4:14     ` Hou Tao
  2022-09-28  8:40       ` Quentin Monnet
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-28  4:14 UTC (permalink / raw)
  To: Quentin Monnet, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hi,

On 9/27/2022 7:24 PM, Quentin Monnet wrote:
> Sat Sep 24 2022 14:36:15 GMT+0100 (British Summer Time) ~ Hou Tao
> <houtao@huaweicloud.com>
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Support lookup/update/delete/iterate/dump operations for qp-trie in
>> bpftool. Mainly add two functions: one function to parse dynptr key and
>> another one to dump dynptr key. The input format of dynptr key is:
>> "key [hex] size BYTES" and the output format of dynptr key is:
>> "size BYTES".
>>
>> The following is the output when using bpftool to manipulate
>> qp-trie:
>>
>>   $ bpftool map pin id 724953 /sys/fs/bpf/qp
>>   $ bpftool map show pinned /sys/fs/bpf/qp
>>   724953: qp_trie  name qp_trie  flags 0x1
>>           key 16B  value 4B  max_entries 2  memlock 65536B  map_extra 8
>>           btf_id 779
>>           pids test_qp_trie.bi(109167)
>>   $ bpftool map dump pinned /sys/fs/bpf/qp
>>   [{
>>           "key": {
>>               "size": 4,
>>               "data": ["0x0","0x0","0x0","0x0"
>>               ]
>>           },
>>           "value": 0
>>       },{
>>           "key": {
>>               "size": 4,
>>               "data": ["0x0","0x0","0x0","0x1"
>>               ]
>>           },
>>           "value": 2
>>       }
>>   ]
>>   $ bpftool map lookup pinned /sys/fs/bpf/qp key 4 0 0 0 1
>>   {
>>       "key": {
>>           "size": 4,
>>           "data": ["0x0","0x0","0x0","0x1"
>>           ]
>>       },
>>       "value": 2
>>   }
> The bpftool patch looks good, thanks! I have one comment on the syntax
> for the keys, I don't find it intuitive to have the size as the first
> BYTE. It makes it awkward to understand what the command does if we read
> it in the wild without knowing the map type. I can see two alternatives,
> either adding a keyword (e.g., "key_size 4 key 0 0 0 1"), or changing
> parse_bytes() to make it able to parse as much as it can then count the
> bytes, when we don't know in advance how many we get.
The suggestion is reasonable, but there is also a reason for the current choice
(I should have written it down in the commit message). For a dynptr-typed key, both
proposed suggestions will work. But for a key with embedded dynptrs as shown below,
neither an explicit key_size keyword nor an implicit key_size derived from the BYTES
can express the key correctly.

struct map_key {
        unsigned int cookie;
        struct bpf_dynptr name;
        struct bpf_dynptr addr;
        unsigned int flags;
};

I had also thought about adding another keyword "dynptr_key" (or "dyn_key") to
support a dynptr-typed key or a key with embedded dynptrs, and the format would still
be: "dynptr_key size [BYTES]". But at least we can tell it apart from
"key", which is fixed-size. What do you think?
>
> Thanks,
> Quentin
>
> .


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-28  3:27               ` Hou Tao
@ 2022-09-28  4:37                 ` Alexei Starovoitov
  0 siblings, 0 replies; 52+ messages in thread
From: Alexei Starovoitov @ 2022-09-28  4:37 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

On Tue, Sep 27, 2022 at 8:27 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > Also please use 'map_perf_test 4'.
> > I doubt 1000 vs 10240 will make a difference, but still.
> Will do. The suggestion is reasonable. But when max_entries=1000,
> use_percpu_counter will be false.

... when the system has a lot of cpus...
The knobs in the microbenchmark are there to tune it
to observe desired behavior. Please mention the reasons to
use that specific value next time.
When N things have changed it's hard to do apples-to-apples.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-28  4:14     ` Hou Tao
@ 2022-09-28  8:40       ` Quentin Monnet
  2022-09-28  9:05         ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Quentin Monnet @ 2022-09-28  8:40 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Wed Sep 28 2022 05:14:45 GMT+0100 (British Summer Time) ~ Hou Tao
<houtao@huaweicloud.com>
> Hi,
> 
> On 9/27/2022 7:24 PM, Quentin Monnet wrote:
>> Sat Sep 24 2022 14:36:15 GMT+0100 (British Summer Time) ~ Hou Tao
>> <houtao@huaweicloud.com>
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> Support lookup/update/delete/iterate/dump operations for qp-trie in
>>> bpftool. Mainly add two functions: one function to parse dynptr key and
>>> another one to dump dynptr key. The input format of dynptr key is:
>>> "key [hex] size BYTES" and the output format of dynptr key is:
>>> "size BYTES".
>>>
>>> The following is the output when using bpftool to manipulate
>>> qp-trie:
>>>
>>>   $ bpftool map pin id 724953 /sys/fs/bpf/qp
>>>   $ bpftool map show pinned /sys/fs/bpf/qp
>>>   724953: qp_trie  name qp_trie  flags 0x1
>>>           key 16B  value 4B  max_entries 2  memlock 65536B  map_extra 8
>>>           btf_id 779
>>>           pids test_qp_trie.bi(109167)
>>>   $ bpftool map dump pinned /sys/fs/bpf/qp
>>>   [{
>>>           "key": {
>>>               "size": 4,
>>>               "data": ["0x0","0x0","0x0","0x0"
>>>               ]
>>>           },
>>>           "value": 0
>>>       },{
>>>           "key": {
>>>               "size": 4,
>>>               "data": ["0x0","0x0","0x0","0x1"
>>>               ]
>>>           },
>>>           "value": 2
>>>       }
>>>   ]
>>>   $ bpftool map lookup pinned /sys/fs/bpf/qp key 4 0 0 0 1
>>>   {
>>>       "key": {
>>>           "size": 4,
>>>           "data": ["0x0","0x0","0x0","0x1"
>>>           ]
>>>       },
>>>       "value": 2
>>>   }
>> The bpftool patch looks good, thanks! I have one comment on the syntax
>> for the keys, I don't find it intuitive to have the size as the first
>> BYTE. It makes it awkward to understand what the command does if we read
>> it in the wild without knowing the map type. I can see two alternatives,
>> either adding a keyword (e.g., "key_size 4 key 0 0 0 1"), or changing
>> parse_bytes() to make it able to parse as much as it can then count the
>> bytes, when we don't know in advance how many we get.
> The suggestion is reasonable, but there is also reason for the current choice (
> I should written it down in commit message). For dynptr-typed key, these two
> proposed suggestions will work. But for key with embedded dynptrs as show below,
> both explict key_size keyword and implicit key_size in BYTEs can not express the
> key correctly.
> 
> struct map_key {
> unsigned int cookie;
> struct bpf_dynptr name;
> struct bpf_dynptr addr;
> unsigned int flags;
> };

I'm not sure I follow. I don't understand the difference, for dealing
internally with the key, between "key_size N key BYTES" and "key N BYTES"
(or for parsing then counting). Please could you give an example showing
how you would express the key from the structure above with the
syntax you proposed?

> I also had thought about adding another key word "dynptr_key" (or "dyn_key") to
> support dynptr-typed key or key with embedded dynptr, and the format will still
> be: "dynptr_key size [BYTES]". But at least we can tell it is different with
> "key" which is fixed size. What do you think ?
If the other suggestions do not work, then yes, using a dedicated
keyword (Just "dynkey"? We can detail in the docs) sounds better to me.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-28  1:08             ` Alexei Starovoitov
  2022-09-28  3:27               ` Hou Tao
@ 2022-09-28  8:45               ` Hou Tao
  2022-09-28  8:49                 ` Hou Tao
  2022-09-29  3:22                 ` Alexei Starovoitov
  1 sibling, 2 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-28  8:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

Hi,

On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
SNIP
>> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
>> local environment, could you share the setup for it ?
>>
>> The following is the output of perf report (--no-children) for "./map_perf_test
>> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>>
>>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
>> alloc_htab_elem
>>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_update_elem
> Looks like the perf is lost on atomic_inc/dec.
> Try a partial revert of mem_alloc.
> In particular to make sure
> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> is reverted and call_rcu is in place,
> but percpu counter optimization is still there.
> Also please use 'map_perf_test 4'.
> I doubt 1000 vs 10240 will make a difference, but still.
>
I have tried the following two setups:
(1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
# Samples: 1M of event 'cycles:ppp'
# Event count (approx.): 1041345723234
#
# Overhead  Command          Shared Object                                Symbol
# ........  ...............  ...........................................  ...............................................
#
    10.36%  map_perf_test    [kernel.vmlinux]                             [k] bpf_map_get_memcg.isra.0
     9.82%  map_perf_test    [kernel.vmlinux]                             [k] bpf_map_kmalloc_node
     4.24%  map_perf_test    [kernel.vmlinux]                             [k] check_preemption_disabled
     2.86%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_update_elem
     2.80%  map_perf_test    [kernel.vmlinux]                             [k] __kmalloc_node
     2.72%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_delete_elem
     2.30%  map_perf_test    [kernel.vmlinux]                             [k] memcg_slab_post_alloc_hook
     2.21%  map_perf_test    [kernel.vmlinux]                             [k] entry_SYSCALL_64
     2.17%  map_perf_test    [kernel.vmlinux]                             [k] syscall_exit_to_user_mode
     2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
     2.11%  map_perf_test    [kernel.vmlinux]                             [k] syscall_return_via_sysret
     2.05%  map_perf_test    [kernel.vmlinux]                             [k] alloc_htab_elem
     1.94%  map_perf_test    [kernel.vmlinux]                             [k] _raw_spin_lock_irqsave
     1.92%  map_perf_test    [kernel.vmlinux]                             [k] preempt_count_add
     1.92%  map_perf_test    [kernel.vmlinux]                             [k] preempt_count_sub
     1.87%  map_perf_test    [kernel.vmlinux]                             [k] call_rcu


(2) Use bpf_mem_alloc & per-cpu counter in the hash map, but without the batched
call_rcu optimization.
By reverting the following commits:

9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache.
02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs.
dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
96da3f7d489d bpf: Remove tracing program restriction on map types
ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc.
8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache
0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map.

     5.17%  map_perf_test    [kernel.vmlinux]                             [k] check_preemption_disabled
     4.53%  map_perf_test    [kernel.vmlinux]                             [k] __get_obj_cgroup_from_memcg
     2.97%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_update_elem
     2.74%  map_perf_test    [kernel.vmlinux]                             [k] htab_map_delete_elem
     2.62%  map_perf_test    [kernel.vmlinux]                             [k] kmem_cache_alloc_node
     2.57%  map_perf_test    [kernel.vmlinux]                             [k] memcg_slab_post_alloc_hook
     2.34%  map_perf_test    [kernel.vmlinux]                             [k] jhash
     2.30%  map_perf_test    [kernel.vmlinux]                             [k] entry_SYSCALL_64
     2.25%  map_perf_test    [kernel.vmlinux]                             [k] obj_cgroup_charge
     2.23%  map_perf_test    [kernel.vmlinux]                             [k] alloc_htab_elem
     2.17%  map_perf_test    [kernel.vmlinux]                             [k] memcpy_erms
     2.17%  map_perf_test    [kernel.vmlinux]                             [k] syscall_exit_to_user_mode
     2.16%  map_perf_test    [kernel.vmlinux]                             [k] syscall_return_via_sysret
     2.14%  map_perf_test    [kernel.vmlinux]                             [k] _raw_spin_lock_irqsave
     2.13%  map_perf_test    [kernel.vmlinux]                             [k] preempt_count_add
     2.12%  map_perf_test    [kernel.vmlinux]                             [k] preempt_count_sub
     2.00%  map_perf_test    [kernel.vmlinux]                             [k] percpu_counter_add_batch
     1.99%  map_perf_test    [kernel.vmlinux]                             [k] alloc_bulk
     1.97%  map_perf_test    [kernel.vmlinux]                             [k] call_rcu
     1.52%  map_perf_test    [kernel.vmlinux]                             [k] mod_objcg_state
     1.36%  map_perf_test    [kernel.vmlinux]                             [k] allocate_slab

In both of these two setups, the overhead of call_rcu is about 2% and it is not
the biggest overhead.

Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
you think ?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-28  8:45               ` Hou Tao
@ 2022-09-28  8:49                 ` Hou Tao
  2022-09-29  3:22                 ` Alexei Starovoitov
  1 sibling, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-09-28  8:49 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney



On 9/28/2022 4:45 PM, Hou Tao wrote:
> Hi,
>
> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>
> In both of these two setups, the overhead of call_rcu is about 2% and it is not
> the biggest overhead.
Also, note that the overhead of qp-trie update/delete will be much bigger
than that of a hash map update.
>
> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> you think ?
s/reason/reasonable.
> .


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-28  8:40       ` Quentin Monnet
@ 2022-09-28  9:05         ` Hou Tao
  2022-09-28  9:23           ` Quentin Monnet
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-28  9:05 UTC (permalink / raw)
  To: Quentin Monnet, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hi,

On 9/28/2022 4:40 PM, Quentin Monnet wrote:
> Wed Sep 28 2022 05:14:45 GMT+0100 (British Summer Time) ~ Hou Tao
> <houtao@huaweicloud.com>
>> Hi,
>>
>> On 9/27/2022 7:24 PM, Quentin Monnet wrote:
>>> Sat Sep 24 2022 14:36:15 GMT+0100 (British Summer Time) ~ Hou Tao
>>> <houtao@huaweicloud.com>
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Support lookup/update/delete/iterate/dump operations for qp-trie in
>>>> bpftool. Mainly add two functions: one function to parse dynptr key and
>>>> another one to dump dynptr key. The input format of dynptr key is:
>>>> "key [hex] size BYTES" and the output format of dynptr key is:
>>>> "size BYTES".
SNIP
>>> The bpftool patch looks good, thanks! I have one comment on the syntax
>>> for the keys, I don't find it intuitive to have the size as the first
>>> BYTE. It makes it awkward to understand what the command does if we read
>>> it in the wild without knowing the map type. I can see two alternatives,
>>> either adding a keyword (e.g., "key_size 4 key 0 0 0 1"), or changing
>>> parse_bytes() to make it able to parse as much as it can then count the
>>> bytes, when we don't know in advance how many we get.
>> The suggestion is reasonable, but there is also reason for the current choice (
>> I should written it down in commit message). For dynptr-typed key, these two
>> proposed suggestions will work. But for key with embedded dynptrs as show below,
>> both explict key_size keyword and implicit key_size in BYTEs can not express the
>> key correctly.
>>
>> struct map_key {
>> unsigned int cookie;
>> struct bpf_dynptr name;
>> struct bpf_dynptr addr;
>> unsigned int flags;
>> };
> I'm not sure I follow. I don't understand the difference for dealing
> internally with the key between "key_size N key BYTES" and "key N BYTES"
> (or for parsing then counting). Please could you give an example telling
> how you would you express the key from the structure above, with the
> syntax you proposed?
In my understanding, if "key_size N key BYTES" is used to represent map_key, it
cannot tell the exact sizes of "name" and "addr"; it can only tell their total
size. Using "key BYTES" has a similar problem. But with the "key size BYTES"
format, map_key can be expressed as follows:

key c c c c [name_size] n n n [addr_size] a a  f f f f
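
As a made-up example (the values are arbitrary, just to show where the per-field
sizes sit): for cookie=0x11223344, name="eth0" (4 bytes), a 2-byte addr and flags=0,
the key part of the command line would look like:

  key 0x11 0x22 0x33 0x44  4 0x65 0x74 0x68 0x30  2 0x0a 0x01  0x0 0x0 0x0 0x0
  (4 cookie bytes, then name size + name bytes, then addr size + addr bytes, then 4 flags bytes)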
>
>> I also had thought about adding another key word "dynptr_key" (or "dyn_key") to
>> support dynptr-typed key or key with embedded dynptr, and the format will still
>> be: "dynptr_key size [BYTES]". But at least we can tell it is different with
>> "key" which is fixed size. What do you think ?
> If the other suggestions do not work, then yes, using a dedicated
> keyword (Just "dynkey"? We can detail in the docs) sounds better to me.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-28  9:05         ` Hou Tao
@ 2022-09-28  9:23           ` Quentin Monnet
  2022-09-28 10:54             ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Quentin Monnet @ 2022-09-28  9:23 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Wed Sep 28 2022 10:05:55 GMT+0100 (British Summer Time) ~ Hou Tao
<houtao@huaweicloud.com>
> Hi,
> 
> On 9/28/2022 4:40 PM, Quentin Monnet wrote:
>> Wed Sep 28 2022 05:14:45 GMT+0100 (British Summer Time) ~ Hou Tao
>> <houtao@huaweicloud.com>
>>> Hi,
>>>
>>> On 9/27/2022 7:24 PM, Quentin Monnet wrote:
>>>> Sat Sep 24 2022 14:36:15 GMT+0100 (British Summer Time) ~ Hou Tao
>>>> <houtao@huaweicloud.com>
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> Support lookup/update/delete/iterate/dump operations for qp-trie in
>>>>> bpftool. Mainly add two functions: one function to parse dynptr key and
>>>>> another one to dump dynptr key. The input format of dynptr key is:
>>>>> "key [hex] size BYTES" and the output format of dynptr key is:
>>>>> "size BYTES".
> SNIP
>>>> The bpftool patch looks good, thanks! I have one comment on the syntax
>>>> for the keys, I don't find it intuitive to have the size as the first
>>>> BYTE. It makes it awkward to understand what the command does if we read
>>>> it in the wild without knowing the map type. I can see two alternatives,
>>>> either adding a keyword (e.g., "key_size 4 key 0 0 0 1"), or changing
>>>> parse_bytes() to make it able to parse as much as it can then count the
>>>> bytes, when we don't know in advance how many we get.
>>> The suggestion is reasonable, but there is also reason for the current choice (
>>> I should written it down in commit message). For dynptr-typed key, these two
>>> proposed suggestions will work. But for key with embedded dynptrs as show below,
>>> both explict key_size keyword and implicit key_size in BYTEs can not express the
>>> key correctly.
>>>
>>> struct map_key {
>>> unsigned int cookie;
>>> struct bpf_dynptr name;
>>> struct bpf_dynptr addr;
>>> unsigned int flags;
>>> };
>> I'm not sure I follow. I don't understand the difference for dealing
>> internally with the key between "key_size N key BYTES" and "key N BYTES"
>> (or for parsing then counting). Please could you give an example telling
>> how you would you express the key from the structure above, with the
>> syntax you proposed?
> In my understand, if using "key_size N key BYTES" to represent map_key, it can
> not tell the exact size of "name" and "addr" and it only can tell the total size
> of name and addr. If using "key BYTES" to do that, it has the similar problem.
> But if using "key size BYTES" format, map_key can be expressed as follows:
> 
> key c c c c [name_size] n n n [addr_size] a a  f f f f

OK thanks I get it now, you can have multiple sizes within the key, one
for each field. Yes, let's use a new keyword in that case please. Can
you also provide more details in the man page, and ideally add a new
example to the list?

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-28  9:23           ` Quentin Monnet
@ 2022-09-28 10:54             ` Hou Tao
  2022-09-28 11:49               ` Quentin Monnet
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-28 10:54 UTC (permalink / raw)
  To: Quentin Monnet, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hi,

On 9/28/2022 5:23 PM, Quentin Monnet wrote:
> Wed Sep 28 2022 10:05:55 GMT+0100 (British Summer Time) ~ Hou Tao
> <houtao@huaweicloud.com>
>> Hi,
>>
>> On 9/28/2022 4:40 PM, Quentin Monnet wrote:
>>> Wed Sep 28 2022 05:14:45 GMT+0100 (British Summer Time) ~ Hou Tao
>>> <houtao@huaweicloud.com>
>>>> Hi,
>>>>
>>>> On 9/27/2022 7:24 PM, Quentin Monnet wrote:
>>>>> Sat Sep 24 2022 14:36:15 GMT+0100 (British Summer Time) ~ Hou Tao
>>>>> <houtao@huaweicloud.com>
>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>
>>>>>> Support lookup/update/delete/iterate/dump operations for qp-trie in
>>>>>> bpftool. Mainly add two functions: one function to parse dynptr key and
>>>>>> another one to dump dynptr key. The input format of dynptr key is:
>>>>>> "key [hex] size BYTES" and the output format of dynptr key is:
>>>>>> "size BYTES".
>> SNIP
>>>>> The bpftool patch looks good, thanks! I have one comment on the syntax
>>>>> for the keys, I don't find it intuitive to have the size as the first
>>>>> BYTE. It makes it awkward to understand what the command does if we read
>>>>> it in the wild without knowing the map type. I can see two alternatives,
>>>>> either adding a keyword (e.g., "key_size 4 key 0 0 0 1"), or changing
>>>>> parse_bytes() to make it able to parse as much as it can then count the
>>>>> bytes, when we don't know in advance how many we get.
>>>> The suggestion is reasonable, but there is also reason for the current choice (
>>>> I should written it down in commit message). For dynptr-typed key, these two
>>>> proposed suggestions will work. But for key with embedded dynptrs as show below,
>>>> both explict key_size keyword and implicit key_size in BYTEs can not express the
>>>> key correctly.
>>>>
>>>> struct map_key {
>>>> unsigned int cookie;
>>>> struct bpf_dynptr name;
>>>> struct bpf_dynptr addr;
>>>> unsigned int flags;
>>>> };
>>> I'm not sure I follow. I don't understand the difference for dealing
>>> internally with the key between "key_size N key BYTES" and "key N BYTES"
>>> (or for parsing then counting). Please could you give an example telling
>>> how you would you express the key from the structure above, with the
>>> syntax you proposed?
>> In my understand, if using "key_size N key BYTES" to represent map_key, it can
>> not tell the exact size of "name" and "addr" and it only can tell the total size
>> of name and addr. If using "key BYTES" to do that, it has the similar problem.
>> But if using "key size BYTES" format, map_key can be expressed as follows:
>>
>> key c c c c [name_size] n n n [addr_size] a a  f f f f
> OK thanks I get it now, you can have multiple sizes within the key, one
> for each field. Yes, let's use a new keyword in that case please. Can
> you also provide more details in the man page, and ideally add a new
> example to the list?
I forgot to mention that a map key with embedded dynptrs is not supported yet;
for now only a plain dynptr as the map key is supported. So I will add a new
keyword "dynkey" in v3 to support operations on qp-trie.
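
For example, with that (not yet implemented) keyword, the lookup from the commit
message would become something along these lines (exact spelling still open):

  $ bpftool map lookup pinned /sys/fs/bpf/qp dynkey 4 0 0 0 1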
>
> Thanks,
> Quentin
>
> .


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map
  2022-09-28 10:54             ` Hou Tao
@ 2022-09-28 11:49               ` Quentin Monnet
  0 siblings, 0 replies; 52+ messages in thread
From: Quentin Monnet @ 2022-09-28 11:49 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Wed Sep 28 2022 11:54:39 GMT+0100 (British Summer Time) ~ Hou Tao
<houtao@huaweicloud.com>
> Hi,
> 
> On 9/28/2022 5:23 PM, Quentin Monnet wrote:
>> Wed Sep 28 2022 10:05:55 GMT+0100 (British Summer Time) ~ Hou Tao
>> <houtao@huaweicloud.com>
>>> Hi,
>>>
>>> On 9/28/2022 4:40 PM, Quentin Monnet wrote:
>>>> Wed Sep 28 2022 05:14:45 GMT+0100 (British Summer Time) ~ Hou Tao
>>>> <houtao@huaweicloud.com>
>>>>> Hi,
>>>>>
>>>>> On 9/27/2022 7:24 PM, Quentin Monnet wrote:
>>>>>> Sat Sep 24 2022 14:36:15 GMT+0100 (British Summer Time) ~ Hou Tao
>>>>>> <houtao@huaweicloud.com>
>>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>>
>>>>>>> Support lookup/update/delete/iterate/dump operations for qp-trie in
>>>>>>> bpftool. Mainly add two functions: one function to parse dynptr key and
>>>>>>> another one to dump dynptr key. The input format of dynptr key is:
>>>>>>> "key [hex] size BYTES" and the output format of dynptr key is:
>>>>>>> "size BYTES".
>>> SNIP
>>>>>> The bpftool patch looks good, thanks! I have one comment on the syntax
>>>>>> for the keys, I don't find it intuitive to have the size as the first
>>>>>> BYTE. It makes it awkward to understand what the command does if we read
>>>>>> it in the wild without knowing the map type. I can see two alternatives,
>>>>>> either adding a keyword (e.g., "key_size 4 key 0 0 0 1"), or changing
>>>>>> parse_bytes() to make it able to parse as much as it can then count the
>>>>>> bytes, when we don't know in advance how many we get.
>>>>> The suggestion is reasonable, but there is also reason for the current choice (
>>>>> I should written it down in commit message). For dynptr-typed key, these two
>>>>> proposed suggestions will work. But for key with embedded dynptrs as show below,
>>>>> both explict key_size keyword and implicit key_size in BYTEs can not express the
>>>>> key correctly.
>>>>>
>>>>> struct map_key {
>>>>> unsigned int cookie;
>>>>> struct bpf_dynptr name;
>>>>> struct bpf_dynptr addr;
>>>>> unsigned int flags;
>>>>> };
>>>> I'm not sure I follow. I don't understand the difference for dealing
>>>> internally with the key between "key_size N key BYTES" and "key N BYTES"
>>>> (or for parsing then counting). Please could you give an example telling
>>>> how you would you express the key from the structure above, with the
>>>> syntax you proposed?
>>> In my understand, if using "key_size N key BYTES" to represent map_key, it can
>>> not tell the exact size of "name" and "addr" and it only can tell the total size
>>> of name and addr. If using "key BYTES" to do that, it has the similar problem.
>>> But if using "key size BYTES" format, map_key can be expressed as follows:
>>>
>>> key c c c c [name_size] n n n [addr_size] a a  f f f f
>> OK thanks I get it now, you can have multiple sizes within the key, one
>> for each field. Yes, let's use a new keyword in that case please. Can
>> you also provide more details in the man page, and ideally add a new
>> example to the list?
> Forget to mention that the map key with embedded dynptr is not supported yet and
> now only support using dynptr as the map key. So will add a new keyword "dynkey"
> in v3 to support operations on qp-trie.

Sounds good thank you, let's do that and ideally mention it in the
commit log for context.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-09-24 13:36 ` [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall Hou Tao
@ 2022-09-29  0:16   ` Andrii Nakryiko
  2022-09-29  2:11     ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-09-29  0:16 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

On Sat, Sep 24, 2022 at 6:18 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Userspace application uses bpf syscall to lookup or update bpf map. It
> passes a pointer of fixed-size buffer to kernel to represent the map
> key. To support map with variable-length key, introduce bpf_dynptr_user
> to allow userspace to pass a pointer of bpf_dynptr_user to specify the
> address and the length of key buffer. And in order to represent dynptr
> from userspace, adding a new dynptr type: BPF_DYNPTR_TYPE_USER. Because
> BPF_DYNPTR_TYPE_USER-typed dynptr is not available from bpf program, so
> no verifier update is needed.
>
> Add dynptr_key_off in bpf_map to distinguish map with fixed-size key
> from map with variable-length. dynptr_key_off is less than zero for
> fixed-size key and can only be zero for dynptr key.
>
> For dynptr-key map, key btf type is bpf_dynptr and key size is 16, so
> use the lower 32-bits of map_extra to specify the maximum size of dynptr
> key.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---

This is a great feature and you've put lots of high-quality work into
this! Looking forward to having the qp-trie BPF map available. Apart from
your discussion with Alexei about locking and memory
allocation/reuse, I have questions about this dynptr-from-user-space
interface. Let's discuss it in this patch to not interfere.

I'm trying to understand why there should be so many new concepts and
interfaces just to allow variable-sized keys. Can you elaborate on
that? Like why do we even need BPF_DYNPTR_TYPE_USER? Why can't the user
just pass a void * (cast to u64) pointer and the size of the memory it
points to, and the kernel will just copy the necessary amount of data into
a kvmalloc'ed temporary region?
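
Roughly, I'm imagining something like this on the kernel side (just a sketch
to show the idea; the names are made up):

/* Sketch only: copy a variable-sized key described by (pointer, size)
 * from user space into a temporary kernel buffer.
 */
static void *copy_var_key_from_user(__u64 ukey, __u32 key_size)
{
        void *key;

        key = kvmalloc(key_size, GFP_USER);
        if (!key)
                return ERR_PTR(-ENOMEM);
        if (copy_from_user(key, u64_to_user_ptr(ukey), key_size)) {
                kvfree(key);
                return ERR_PTR(-EFAULT);
        }
        return key;
}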

It also seems like you want to allow the key (and maybe the value as well, not
sure) to be a custom user-defined type where some of the fields are
struct bpf_dynptr. I think it's a big overcomplication, tbh. I'd say
it's enough to just say that the entire key has to be described by a
single bpf_dynptr. Then we can have a new bpf_map_lookup_elem_dynptr(map,
key_dynptr, flags) helper to provide a variable-sized key for
lookup.

I think it would keep it much simpler. But if I'm missing something,
it would be good to understand that. Thanks!


>  include/linux/bpf.h            |   8 +++
>  include/uapi/linux/bpf.h       |   6 ++
>  kernel/bpf/map_in_map.c        |   3 +
>  kernel/bpf/syscall.c           | 121 +++++++++++++++++++++++++++------
>  tools/include/uapi/linux/bpf.h |   6 ++
>  5 files changed, 125 insertions(+), 19 deletions(-)
>

[...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-09-29  0:16   ` Andrii Nakryiko
@ 2022-09-29  2:11     ` Hou Tao
  2022-09-30 21:35       ` Andrii Nakryiko
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-09-29  2:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1, Joanne Koong

Hi,

On 9/29/2022 8:16 AM, Andrii Nakryiko wrote:
> On Sat, Sep 24, 2022 at 6:18 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Userspace application uses bpf syscall to lookup or update bpf map. It
>> passes a pointer of fixed-size buffer to kernel to represent the map
>> key. To support map with variable-length key, introduce bpf_dynptr_user
>> to allow userspace to pass a pointer of bpf_dynptr_user to specify the
>> address and the length of key buffer. And in order to represent dynptr
>> from userspace, adding a new dynptr type: BPF_DYNPTR_TYPE_USER. Because
>> BPF_DYNPTR_TYPE_USER-typed dynptr is not available from bpf program, so
>> no verifier update is needed.
>>
>> Add dynptr_key_off in bpf_map to distinguish map with fixed-size key
>> from map with variable-length. dynptr_key_off is less than zero for
>> fixed-size key and can only be zero for dynptr key.
>>
>> For dynptr-key map, key btf type is bpf_dynptr and key size is 16, so
>> use the lower 32-bits of map_extra to specify the maximum size of dynptr
>> key.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
> This is a great feature and you've put lots of high-quality work into
> this! Looking forward to have qp-trie BPF map available. Apart from
> your discussion with Alexie about locking and memory
> allocation/reused, I have questions about this dynptr from user-space
> interface. Let's discuss it in this patch to not interfere.
>
> I'm trying to understand why there should be so many new concepts and
> interfaces just to allow variable-sized keys. Can you elaborate on
> that? Like why do we even need BPF_DYNPTR_TYPE_USER? Why user can't
> just pass a void * (casted to u64) pointer and size of the memory
> pointed to it, and kernel will just copy necessary amount of data into
> kvmalloc'ed temporary region?
The main reason is that map operations from the syscall and from bpf programs use
the same ops in bpf_map_ops (e.g. map_update_elem). If dynptr_kern were only used
for bpf programs, then three new operations would have to be defined for bpf
programs. Even more, after defining two different map ops for the same operation
from the syscall and the bpf program, the internal implementation of qp-trie would
still need to convert these two different representations of the variable-length
key into bpf_qp_trie_key. That introduces an unnecessary conversion, so I think it
may be a good idea to pass dynptr_kern to qp-trie even for the bpf syscall.

And now in bpf_attr, for the BPF_MAP_*_ELEM commands, there is no space to pass an
extra key size. It seems bpf_attr can be extended, but even if it is extended, it
also means libbpf needs to provide a new API group to support operating on a
dynptr-key map, because userspace needs to pass the key size as a new argument.
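
To illustrate the shape of it (just a sketch here; please refer to the patch for
the actual definition, the field names below are not final):

/* A user-space dynptr: the existing key pointer in bpf_attr points to this
 * struct instead of to a fixed-size key buffer. Field names are illustrative.
 */
struct bpf_dynptr_user {
        __u64 data;     /* user-space address of the key bytes */
        __u32 size;     /* length of the key in bytes */
        __u32 reserved;
};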
>
> It also seems like you want to allow key (and maybe value as well, not
> sure) to be a custom user-defined type where some of the fields are
> struct bpf_dynptr. I think it's a big overcomplication, tbh. I'd say
> it's enough to just say that entire key has to be described by a
> single bpf_dynptr. Then we can have bpf_map_lookup_elem_dynptr(map,
> key_dynptr, flags) new helper to provide variable-sized key for
> lookup.
For qp-trie, it will only support a single dynptr as the map key. In the future,
maybe other maps will support map keys with embedded dynptrs. Maybe Joanne can
share some vision about such a use case.
>
> I think it would keep it much simpler. But if I'm missing something,
> it would be good to understand that. Thanks!
>
>
>>  include/linux/bpf.h            |   8 +++
>>  include/uapi/linux/bpf.h       |   6 ++
>>  kernel/bpf/map_in_map.c        |   3 +
>>  kernel/bpf/syscall.c           | 121 +++++++++++++++++++++++++++------
>>  tools/include/uapi/linux/bpf.h |   6 ++
>>  5 files changed, 125 insertions(+), 19 deletions(-)
>>
> [...]
> .


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-28  8:45               ` Hou Tao
  2022-09-28  8:49                 ` Hou Tao
@ 2022-09-29  3:22                 ` Alexei Starovoitov
  2022-10-08  1:56                   ` Hou Tao
  1 sibling, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-09-29  3:22 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> SNIP
> >> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
> >> local environment, could you share the setup for it ?
> >>
> >> The following is the output of perf report (--no-children) for "./map_perf_test
> >> 4 72 10240 100000" on a x86-64 host with 72-cpus:
> >>
> >>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
> >> alloc_htab_elem
> >>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_update_elem
> > Looks like the perf is lost on atomic_inc/dec.
> > Try a partial revert of mem_alloc.
> > In particular to make sure
> > commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > is reverted and call_rcu is in place,
> > but percpu counter optimization is still there.
> > Also please use 'map_perf_test 4'.
> > I doubt 1000 vs 10240 will make a difference, but still.
> >
> I have tried the following two setups:
> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> # Samples: 1M of event 'cycles:ppp'
> # Event count (approx.): 1041345723234
> #
> # Overhead  Command          Shared Object                                Symbol
> # ........  ...............  ...........................................
> ...............................................
> #
>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> bpf_map_get_memcg.isra.0

That is per-cpu counter and it's consuming 10% ?!
Something is really odd in your setup.
A lot of debug configs?

>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> bpf_map_kmalloc_node
>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> check_preemption_disabled

clearly debug build.
Please use production build.

>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_update_elem
>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> __kmalloc_node
>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_delete_elem
>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> memcg_slab_post_alloc_hook
>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> entry_SYSCALL_64
>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_exit_to_user_mode
>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_return_via_sysret
>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_htab_elem
>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> _raw_spin_lock_irqsave
>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_add
>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_sub
>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> call_rcu
>
>
> (2) Use bpf_mem_alloc & per-cpu counter in hash-map, but no batch call_rcu
> optimization
> By revert the following commits:
>
> 9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
> bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache.
> 02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs.
> dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
> 96da3f7d489d bpf: Remove tracing program restriction on map types
> ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
> 4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc.
> 8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
> 7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache
> 0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map.
>
>      5.17%  map_perf_test    [kernel.vmlinux]                             [k]
> check_preemption_disabled
>      4.53%  map_perf_test    [kernel.vmlinux]                             [k]
> __get_obj_cgroup_from_memcg
>      2.97%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_update_elem
>      2.74%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_delete_elem
>      2.62%  map_perf_test    [kernel.vmlinux]                             [k]
> kmem_cache_alloc_node
>      2.57%  map_perf_test    [kernel.vmlinux]                             [k]
> memcg_slab_post_alloc_hook
>      2.34%  map_perf_test    [kernel.vmlinux]                             [k] jhash
>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> entry_SYSCALL_64
>      2.25%  map_perf_test    [kernel.vmlinux]                             [k]
> obj_cgroup_charge
>      2.23%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_htab_elem
>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> memcpy_erms
>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_exit_to_user_mode
>      2.16%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_return_via_sysret
>      2.14%  map_perf_test    [kernel.vmlinux]                             [k]
> _raw_spin_lock_irqsave
>      2.13%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_add
>      2.12%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_sub
>      2.00%  map_perf_test    [kernel.vmlinux]                             [k]
> percpu_counter_add_batch
>      1.99%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_bulk
>      1.97%  map_perf_test    [kernel.vmlinux]                             [k]
> call_rcu
>      1.52%  map_perf_test    [kernel.vmlinux]                             [k]
> mod_objcg_state
>      1.36%  map_perf_test    [kernel.vmlinux]                             [k]
> allocate_slab
>
> In both of these two setups, the overhead of call_rcu is about 2% and it is not
> the biggest overhead.
>
> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> you think ?

We've discussed it twice already. It's not an option due to OOM
and performance considerations.
call_rcu doesn't scale to millions a second.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-09-29  2:11     ` Hou Tao
@ 2022-09-30 21:35       ` Andrii Nakryiko
  2022-10-08  2:40         ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-09-30 21:35 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1, Joanne Koong

On Wed, Sep 28, 2022 at 7:11 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/29/2022 8:16 AM, Andrii Nakryiko wrote:
> > On Sat, Sep 24, 2022 at 6:18 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Userspace application uses bpf syscall to lookup or update bpf map. It
> >> passes a pointer of fixed-size buffer to kernel to represent the map
> >> key. To support map with variable-length key, introduce bpf_dynptr_user
> >> to allow userspace to pass a pointer of bpf_dynptr_user to specify the
> >> address and the length of key buffer. And in order to represent dynptr
> >> from userspace, adding a new dynptr type: BPF_DYNPTR_TYPE_USER. Because
> >> BPF_DYNPTR_TYPE_USER-typed dynptr is not available from bpf program, so
> >> no verifier update is needed.
> >>
> >> Add dynptr_key_off in bpf_map to distinguish map with fixed-size key
> >> from map with variable-length. dynptr_key_off is less than zero for
> >> fixed-size key and can only be zero for dynptr key.
> >>
> >> For dynptr-key map, key btf type is bpf_dynptr and key size is 16, so
> >> use the lower 32-bits of map_extra to specify the maximum size of dynptr
> >> key.
> >>
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> > This is a great feature and you've put lots of high-quality work into
> > this! Looking forward to have qp-trie BPF map available. Apart from
> > your discussion with Alexie about locking and memory
> > allocation/reused, I have questions about this dynptr from user-space
> > interface. Let's discuss it in this patch to not interfere.
> >
> > I'm trying to understand why there should be so many new concepts and
> > interfaces just to allow variable-sized keys. Can you elaborate on
> > that? Like why do we even need BPF_DYNPTR_TYPE_USER? Why user can't
> > just pass a void * (casted to u64) pointer and size of the memory
> > pointed to it, and kernel will just copy necessary amount of data into
> > kvmalloc'ed temporary region?
> The main reason is that map operations from syscall and bpf program use the same
> ops in bpf_map_ops (e.g. map_update_elem). If only use dynptr_kern for bpf
> program, then
> have to define three new operations for bpf program. Even more, after defining
> two different map ops for the same operation from syscall and bpf program, the
> internal  implementation of qp-trie still need to convert these two different
> representations of variable-length key into bpf_qp_trie_key. It introduces
> unnecessary conversion, so I think it may be a good idea to pass dynptr_kern to
> qp-trie even for bpf syscall.
>
> And now in bpf_attr, for BPF_MAP_*_ELEM command, there is no space to pass an
> extra key size. It seems bpf_attr can be extend, but even it is extented, it
> also means in libbpf we need to provide a new API group to support operationg on
> dynptr key map, because the userspace needs to pass the key size as a new argument.


You are right that the current assumption of implicit key/value size
doesn't work for these variable-key/value-length maps. But I think the
right answer is actually to make sure that we have a map_update_elem
callback variant that accepts key/value size explicitly. I still think
that the syscall interface shouldn't introduce a concept of dynptr.
From user-space's point of view dynptr is just a memory pointer +
associated memory size. Let's keep it simple. And yes, it will be a
new libbpf API for bpf_map_lookup_elem/bpf_map_update_elem. That's
fine.
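
For illustration only, here is a minimal sketch of what such size-explicit
user-space wrappers could look like; the names and exact signatures are
hypothetical, not the final libbpf API:

/* Hypothetical wrappers: the key/value sizes are passed explicitly instead
 * of being taken from the map definition, so a variable-length key is just
 * "pointer + size" from user space.
 */
int bpf_map_lookup_elem_sized(int map_fd,
                              const void *key, __u32 key_size,
                              void *value, __u32 value_size);

int bpf_map_update_elem_sized(int map_fd,
                              const void *key, __u32 key_size,
                              const void *value, __u32 value_size,
                              __u64 flags);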


> >
> > It also seems like you want to allow key (and maybe value as well, not
> > sure) to be a custom user-defined type where some of the fields are
> > struct bpf_dynptr. I think it's a big overcomplication, tbh. I'd say
> > it's enough to just say that entire key has to be described by a
> > single bpf_dynptr. Then we can have bpf_map_lookup_elem_dynptr(map,
> > key_dynptr, flags) new helper to provide variable-sized key for
> > lookup.
> For qp-trie, it will only support a single dynptr as the map key. In the future
> maybe other map will support map key with embedded dynptrs. Maybe Joanne can
> share some vision about such use case.

My point was that instead of saying that key is some fixed-size struct
in which one of the fields is dynptr (and then when comparing you have
to compare part of struct, then dynptr contents, then the other part
of struct?), just say that entire key is represented by dynptr,
implicitly (it's just a blob of bytes). That seems more
straightforward.
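
To make the contrast concrete, a rough sketch (struct mixed_key and the
helpers below are made up for the example): with a struct that embeds a
dynptr the key comparison has to be done piecewise, while a key that is a
single blob of bytes needs only one memcmp.

struct mixed_key {                      /* (a) struct with embedded dynptr */
        __u32 prefix;
        struct bpf_dynptr_user name;    /* variable-length part */
        __u32 suffix;
};

/* (a) compare 'prefix', then the bytes behind 'name', then 'suffix' */
bool mixed_key_equal(const struct mixed_key *a, const struct mixed_key *b);

/* (b) the whole key is one blob of 'len' bytes */
static inline bool blob_key_equal(const void *a, const void *b, __u32 len)
{
        return memcmp(a, b, len) == 0;
}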

> >
> > I think it would keep it much simpler. But if I'm missing something,
> > it would be good to understand that. Thanks!
> >
> >
> >>  include/linux/bpf.h            |   8 +++
> >>  include/uapi/linux/bpf.h       |   6 ++
> >>  kernel/bpf/map_in_map.c        |   3 +
> >>  kernel/bpf/syscall.c           | 121 +++++++++++++++++++++++++++------
> >>  tools/include/uapi/linux/bpf.h |   6 ++
> >>  5 files changed, 125 insertions(+), 19 deletions(-)
> >>
> > [...]
> > .
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-29  3:22                 ` Alexei Starovoitov
@ 2022-10-08  1:56                   ` Hou Tao
  2022-10-08  1:59                     ` Alexei Starovoitov
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-10-08  1:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

Hi,

On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
>>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>
>>> Looks like the perf is lost on atomic_inc/dec.
>>> Try a partial revert of mem_alloc.
>>> In particular to make sure
>>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
>>> is reverted and call_rcu is in place,
>>> but percpu counter optimization is still there.
>>> Also please use 'map_perf_test 4'.
>>> I doubt 1000 vs 10240 will make a difference, but still.
>>>
>> I have tried the following two setups:
>> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
>> # Samples: 1M of event 'cycles:ppp'
>> # Event count (approx.): 1041345723234
>> #
>> # Overhead  Command          Shared Object                                Symbol
>> # ........  ...............  ...........................................
>> ...............................................
>> #
>>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
>> bpf_map_get_memcg.isra.0
> That is per-cpu counter and it's consuming 10% ?!
> Something is really odd in your setup.
> A lot of debug configs?
Sorry for the late reply. Just back to work from a long vacation.

My local .config is derived from the Fedora distribution. It indeed has some
DEBUG-related configs. Will turn these configs off and check it again :)
>>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
>> bpf_map_kmalloc_node
>>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
>> check_preemption_disabled
> clearly debug build.
> Please use production build.
check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT, and it is enabled in
the Fedora distribution.
>>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_update_elem
>>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
>> __kmalloc_node
>>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_delete_elem
>>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
>> memcg_slab_post_alloc_hook
>>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
>> entry_SYSCALL_64
>>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
>> syscall_exit_to_user_mode
>>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
>>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
>> syscall_return_via_sysret
>>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
>> alloc_htab_elem
>>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
>> _raw_spin_lock_irqsave
>>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
>> preempt_count_add
>>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
>> preempt_count_sub
>>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
>> call_rcu
SNIP
>> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
>> you think ?
> We've discussed it twice already. It's not an option due to OOM
> and performance considerations.
> call_rcu doesn't scale to millions a second.
Understood. I was just trying to understand the exact performance overhead of
call_rcu(). If the overhead of the map operations is much greater than the
overhead of call_rcu(), then calling call_rcu() a million times per second
would not be a problem, and it would also make the implementation of qp-trie
much simpler. The OOM problem is indeed a problem, although it is also possible
with the current implementation, so I will try to implement a lookup procedure
which handles the reuse problem.

Regards,
Tao
> .


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-08  1:56                   ` Hou Tao
@ 2022-10-08  1:59                     ` Alexei Starovoitov
  2022-10-08 13:22                       ` Paul E. McKenney
  0 siblings, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-10-08  1:59 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Hou Tao

On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>
> >>> Looks like the perf is lost on atomic_inc/dec.
> >>> Try a partial revert of mem_alloc.
> >>> In particular to make sure
> >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> >>> is reverted and call_rcu is in place,
> >>> but percpu counter optimization is still there.
> >>> Also please use 'map_perf_test 4'.
> >>> I doubt 1000 vs 10240 will make a difference, but still.
> >>>
> >> I have tried the following two setups:
> >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> >> # Samples: 1M of event 'cycles:ppp'
> >> # Event count (approx.): 1041345723234
> >> #
> >> # Overhead  Command          Shared Object                                Symbol
> >> # ........  ...............  ...........................................
> >> ...............................................
> >> #
> >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> >> bpf_map_get_memcg.isra.0
> > That is per-cpu counter and it's consuming 10% ?!
> > Something is really odd in your setup.
> > A lot of debug configs?
> Sorry for the late reply. Just back to work from a long vacation.
>
> My local .config is derived from Fedora distribution. It indeed has some DEBUG
> related configs. Will turn these configs off to check it again :)
> >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> >> bpf_map_kmalloc_node
> >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> >> check_preemption_disabled
> > clearly debug build.
> > Please use production build.
> check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> Fedora distribution.
> >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_update_elem
> >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> >> __kmalloc_node
> >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_delete_elem
> >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> >> memcg_slab_post_alloc_hook
> >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> >> entry_SYSCALL_64
> >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> >> syscall_exit_to_user_mode
> >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> >> syscall_return_via_sysret
> >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> >> alloc_htab_elem
> >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> >> _raw_spin_lock_irqsave
> >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> >> preempt_count_add
> >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> >> preempt_count_sub
> >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> >> call_rcu
> SNIP
> >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> >> you think ?
> > We've discussed it twice already. It's not an option due to OOM
> > and performance considerations.
> > call_rcu doesn't scale to millions a second.
> Understand. I was just trying to understand the exact performance overhead of
> call_rcu(). If the overhead of map operations are much greater than the overhead
> of call_rcu(), I think calling call_rcu() one millions a second will be not a
> problem and  it also makes the implementation of qp-trie being much simpler. The
> OOM problem is indeed a problem, although it is also possible for the current
> implementation, so I will try to implement the lookup procedure which handles
> the reuse problem.

call_rcu is not just that particular function.
It's all the work rcu subsystem needs to do to observe gp
and execute that callback. Just see how many kthreads it will
start when overloaded like this.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-09-30 21:35       ` Andrii Nakryiko
@ 2022-10-08  2:40         ` Hou Tao
  2022-10-13 18:04           ` Andrii Nakryiko
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-10-08  2:40 UTC (permalink / raw)
  To: Andrii Nakryiko, Joanne Koong
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hi,

On 10/1/2022 5:35 AM, Andrii Nakryiko wrote:
> On Wed, Sep 28, 2022 at 7:11 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>> I'm trying to understand why there should be so many new concepts and
>>> interfaces just to allow variable-sized keys. Can you elaborate on
>>> that? Like why do we even need BPF_DYNPTR_TYPE_USER? Why user can't
>>> just pass a void * (casted to u64) pointer and size of the memory
>>> pointed to it, and kernel will just copy necessary amount of data into
>>> kvmalloc'ed temporary region?
>> The main reason is that map operations from syscall and bpf program use the same
>> ops in bpf_map_ops (e.g. map_update_elem). If only use dynptr_kern for bpf
>> program, then
>> have to define three new operations for bpf program. Even more, after defining
>> two different map ops for the same operation from syscall and bpf program, the
>> internal  implementation of qp-trie still need to convert these two different
>> representations of variable-length key into bpf_qp_trie_key. It introduces
>> unnecessary conversion, so I think it may be a good idea to pass dynptr_kern to
>> qp-trie even for bpf syscall.
>>
>> And now in bpf_attr, for BPF_MAP_*_ELEM command, there is no space to pass an
>> extra key size. It seems bpf_attr can be extend, but even it is extented, it
>> also means in libbpf we need to provide a new API group to support operationg on
>> dynptr key map, because the userspace needs to pass the key size as a new argument.
> You are right that the current assumption of implicit key/value size
> doesn't work for these variable-key/value-length maps. But I think the
> right answer is actually to make sure that we have a map_update_elem
> callback variant that accepts key/value size explicitly. I still think
> that the syscall interface shouldn't introduce a concept of dynptr.
> From user-space's point of view dynptr is just a memory pointer +
> associated memory size. Let's keep it simple. And yes, it will be a
> new libbpf API for bpf_map_lookup_elem/bpf_map_update_elem. That's
> fine.
Is your point that dynptr is too complicated for user-space and may lead to
confusion with dynptr in kernel space? How about a different name, or a simple
definition just like bpf_lpm_trie_key (see the sketch below)? It would make
both the implementation and the usage much simpler, because the implementation
and the user could still use the same APIs, just as for a fixed-size map.
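
A rough sketch of what such a simple definition could be; the layout below is
only an assumption for illustration, not a proposed uapi:

/* Modelled on struct bpf_lpm_trie_key: a fixed header plus a flexible
 * array carrying the variable-length key bytes.
 */
struct bpf_qp_trie_key {
        __u32 len;      /* number of valid bytes in data[] */
        __u8  data[];   /* variable-length key */
};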

It is not just lookup/update/delete: we would also need to define new ops for
get_next_key/lookup_and_delete_elem, and corresponding new bpf helpers for bpf
programs. And when you say "explicit key/value size", do you mean something
like the following?

int (*map_update_elem)(struct bpf_map *map, void *key, u32 key_size,
                       void *value, u32 value_size, u64 flags);

>
>
>>> It also seems like you want to allow key (and maybe value as well, not
>>> sure) to be a custom user-defined type where some of the fields are
>>> struct bpf_dynptr. I think it's a big overcomplication, tbh. I'd say
>>> it's enough to just say that entire key has to be described by a
>>> single bpf_dynptr. Then we can have bpf_map_lookup_elem_dynptr(map,
>>> key_dynptr, flags) new helper to provide variable-sized key for
>>> lookup.
>> For qp-trie, it will only support a single dynptr as the map key. In the future
>> maybe other map will support map key with embedded dynptrs. Maybe Joanne can
>> share some vision about such use case.
> My point was that instead of saying that key is some fixed-size struct
> in which one of the fields is dynptr (and then when comparing you have
> to compare part of struct, then dynptr contents, then the other part
> of struct?), just say that entire key is represented by dynptr,
> implicitly (it's just a blob of bytes). That seems more
> straightforward.
I see. But I still think there is a possible use case for a struct with an
embedded dynptr. For the bpf map in the kernel, a byte blob is OK. But if it is
also a blob of bytes for the bpf program or the userspace application, the
application may need to marshal and un-marshal between the byte blob and a
meaningful struct type each time before using it (see the sketch below).
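
A small sketch of the marshaling cost being described here; the struct and the
helper below are purely illustrative:

struct app_key {                /* what the application actually works with */
        __u32 id;
        __u32 name_len;
        const char *name;       /* variable-length part */
};

/* Flatten the struct into one contiguous byte blob usable as a map key.
 * Returns the key size to pass to the map operation, or -1 on overflow.
 */
static int marshal_app_key(const struct app_key *k, __u8 *buf, __u32 buf_sz)
{
        __u32 need = sizeof(k->id) + k->name_len;

        if (need > buf_sz)
                return -1;
        memcpy(buf, &k->id, sizeof(k->id));
        memcpy(buf + sizeof(k->id), k->name, k->name_len);
        return need;
}
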
> .


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-08  1:59                     ` Alexei Starovoitov
@ 2022-10-08 13:22                       ` Paul E. McKenney
  2022-10-08 16:40                         ` Alexei Starovoitov
  0 siblings, 1 reply; 52+ messages in thread
From: Paul E. McKenney @ 2022-10-08 13:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hou Tao

On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>
> > >>> Looks like the perf is lost on atomic_inc/dec.
> > >>> Try a partial revert of mem_alloc.
> > >>> In particular to make sure
> > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > >>> is reverted and call_rcu is in place,
> > >>> but percpu counter optimization is still there.
> > >>> Also please use 'map_perf_test 4'.
> > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > >>>
> > >> I have tried the following two setups:
> > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > >> # Samples: 1M of event 'cycles:ppp'
> > >> # Event count (approx.): 1041345723234
> > >> #
> > >> # Overhead  Command          Shared Object                                Symbol
> > >> # ........  ...............  ...........................................
> > >> ...............................................
> > >> #
> > >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> bpf_map_get_memcg.isra.0
> > > That is per-cpu counter and it's consuming 10% ?!
> > > Something is really odd in your setup.
> > > A lot of debug configs?
> > Sorry for the late reply. Just back to work from a long vacation.
> >
> > My local .config is derived from Fedora distribution. It indeed has some DEBUG
> > related configs. Will turn these configs off to check it again :)
> > >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> bpf_map_kmalloc_node
> > >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> check_preemption_disabled
> > > clearly debug build.
> > > Please use production build.
> > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > Fedora distribution.
> > >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> htab_map_update_elem
> > >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> __kmalloc_node
> > >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> htab_map_delete_elem
> > >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> memcg_slab_post_alloc_hook
> > >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> entry_SYSCALL_64
> > >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> syscall_exit_to_user_mode
> > >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> > >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> syscall_return_via_sysret
> > >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> alloc_htab_elem
> > >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> _raw_spin_lock_irqsave
> > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> preempt_count_add
> > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> preempt_count_sub
> > >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> call_rcu
> > SNIP
> > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> > >> you think ?
> > > We've discussed it twice already. It's not an option due to OOM
> > > and performance considerations.
> > > call_rcu doesn't scale to millions a second.
> > Understand. I was just trying to understand the exact performance overhead of
> > call_rcu(). If the overhead of map operations are much greater than the overhead
> > of call_rcu(), I think calling call_rcu() one millions a second will be not a
> > problem and  it also makes the implementation of qp-trie being much simpler. The
> > OOM problem is indeed a problem, although it is also possible for the current
> > implementation, so I will try to implement the lookup procedure which handles
> > the reuse problem.
> 
> call_rcu is not just that particular function.
> It's all the work rcu subsystem needs to do to observe gp
> and execute that callback. Just see how many kthreads it will
> start when overloaded like this.

The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
and rcuo*.  There is also the back-of-interrupt softirq context, which
requires some care to measure accurately.

The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
that the per-element locking overhead for exact iterations was a problem?
If so, what exactly are the consistency rules for iteration?  Presumably
stronger than "if the element existed throughout, it is included in the
iteration; if it did not exist throughout, it is not included; otherwise
it might or might not be included" given that you get that for free.

Either way, could you please tell me the exact iteration rules?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-08 13:22                       ` Paul E. McKenney
@ 2022-10-08 16:40                         ` Alexei Starovoitov
  2022-10-08 20:11                           ` Paul E. McKenney
  0 siblings, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-10-08 16:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hou Tao

On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >
> > > Hi,
> > >
> > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >> Hi,
> > > >>
> > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >>>
> > > >>> Looks like the perf is lost on atomic_inc/dec.
> > > >>> Try a partial revert of mem_alloc.
> > > >>> In particular to make sure
> > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > > >>> is reverted and call_rcu is in place,
> > > >>> but percpu counter optimization is still there.
> > > >>> Also please use 'map_perf_test 4'.
> > > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > > >>>
> > > >> I have tried the following two setups:
> > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > > >> # Samples: 1M of event 'cycles:ppp'
> > > >> # Event count (approx.): 1041345723234
> > > >> #
> > > >> # Overhead  Command          Shared Object                                Symbol
> > > >> # ........  ...............  ...........................................
> > > >> ...............................................
> > > >> #
> > > >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> bpf_map_get_memcg.isra.0
> > > > That is per-cpu counter and it's consuming 10% ?!
> > > > Something is really odd in your setup.
> > > > A lot of debug configs?
> > > Sorry for the late reply. Just back to work from a long vacation.
> > >
> > > My local .config is derived from Fedora distribution. It indeed has some DEBUG
> > > related configs. Will turn these configs off to check it again :)
> > > >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> bpf_map_kmalloc_node
> > > >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> check_preemption_disabled
> > > > clearly debug build.
> > > > Please use production build.
> > > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > > Fedora distribution.
> > > >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> htab_map_update_elem
> > > >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> __kmalloc_node
> > > >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> htab_map_delete_elem
> > > >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> memcg_slab_post_alloc_hook
> > > >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> entry_SYSCALL_64
> > > >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> syscall_exit_to_user_mode
> > > >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> > > >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> syscall_return_via_sysret
> > > >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> alloc_htab_elem
> > > >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> _raw_spin_lock_irqsave
> > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> preempt_count_add
> > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> preempt_count_sub
> > > >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> call_rcu
> > > SNIP
> > > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> > > >> you think ?
> > > > We've discussed it twice already. It's not an option due to OOM
> > > > and performance considerations.
> > > > call_rcu doesn't scale to millions a second.
> > > Understand. I was just trying to understand the exact performance overhead of
> > > call_rcu(). If the overhead of map operations are much greater than the overhead
> > > of call_rcu(), I think calling call_rcu() one millions a second will be not a
> > > problem and  it also makes the implementation of qp-trie being much simpler. The
> > > OOM problem is indeed a problem, although it is also possible for the current
> > > implementation, so I will try to implement the lookup procedure which handles
> > > the reuse problem.
> >
> > call_rcu is not just that particular function.
> > It's all the work rcu subsystem needs to do to observe gp
> > and execute that callback. Just see how many kthreads it will
> > start when overloaded like this.
>
> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> and rcuo*.  There is also the back-of-interrupt softirq context, which
> requires some care to measure accurately.
>
> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> that the per-element locking overhead for exact iterations was a problem?
> If so, what exactly are the consistency rules for iteration?  Presumably
> stronger than "if the element existed throughout, it is included in the
> iteration; if it did not exist throughout, it is not included; otherwise
> it might or might not be included" given that you get that for free.
>
> Either way, could you please tell me the exact iteration rules?

The rules are the way we make them to be.
iteration will be under lock.
lookup needs to be correct. It can retry if necessary (like htab is doing).
Randomly returning 'noexist' is, of course, not acceptable.
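
For reference, a minimal sketch of the retry idea; this is not the actual htab
code, __lookup_nolock() and the element layout are hypothetical, and the
caller is assumed to hold rcu_read_lock():

static struct my_elem *lookup_retry(struct my_map *map,
                                    const void *key, __u32 key_size)
{
        struct my_elem *e;

        do {
                e = __lookup_nolock(map, key, key_size);
                /* The element may have been recycled for a different key
                 * while we walked the structure; re-check and retry if so.
                 */
        } while (e && memcmp(e->key, key, key_size));

        return e;
}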

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-08 16:40                         ` Alexei Starovoitov
@ 2022-10-08 20:11                           ` Paul E. McKenney
  2022-10-09  1:09                             ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Paul E. McKenney @ 2022-10-08 20:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hou Tao

On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> > > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > > >> Hi,
> > > > >>
> > > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > > >>>
> > > > >>> Looks like the perf is lost on atomic_inc/dec.
> > > > >>> Try a partial revert of mem_alloc.
> > > > >>> In particular to make sure
> > > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > > > >>> is reverted and call_rcu is in place,
> > > > >>> but percpu counter optimization is still there.
> > > > >>> Also please use 'map_perf_test 4'.
> > > > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > > > >>>
> > > > >> I have tried the following two setups:
> > > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > > > >> # Samples: 1M of event 'cycles:ppp'
> > > > >> # Event count (approx.): 1041345723234
> > > > >> #
> > > > >> # Overhead  Command          Shared Object                                Symbol
> > > > >> # ........  ...............  ...........................................
> > > > >> ...............................................
> > > > >> #
> > > > >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> bpf_map_get_memcg.isra.0
> > > > > That is per-cpu counter and it's consuming 10% ?!
> > > > > Something is really odd in your setup.
> > > > > A lot of debug configs?
> > > > Sorry for the late reply. Just back to work from a long vacation.
> > > >
> > > > My local .config is derived from Fedora distribution. It indeed has some DEBUG
> > > > related configs. Will turn these configs off to check it again :)
> > > > >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> bpf_map_kmalloc_node
> > > > >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> check_preemption_disabled
> > > > > clearly debug build.
> > > > > Please use production build.
> > > > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > > > Fedora distribution.
> > > > >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> htab_map_update_elem
> > > > >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> __kmalloc_node
> > > > >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> htab_map_delete_elem
> > > > >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> memcg_slab_post_alloc_hook
> > > > >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> entry_SYSCALL_64
> > > > >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> syscall_exit_to_user_mode
> > > > >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> > > > >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> syscall_return_via_sysret
> > > > >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> alloc_htab_elem
> > > > >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> _raw_spin_lock_irqsave
> > > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> preempt_count_add
> > > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> preempt_count_sub
> > > > >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> call_rcu
> > > > SNIP
> > > > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> > > > >> you think ?
> > > > > We've discussed it twice already. It's not an option due to OOM
> > > > > and performance considerations.
> > > > > call_rcu doesn't scale to millions a second.
> > > > Understand. I was just trying to understand the exact performance overhead of
> > > > call_rcu(). If the overhead of map operations are much greater than the overhead
> > > > of call_rcu(), I think calling call_rcu() one millions a second will be not a
> > > > problem and  it also makes the implementation of qp-trie being much simpler. The
> > > > OOM problem is indeed a problem, although it is also possible for the current
> > > > implementation, so I will try to implement the lookup procedure which handles
> > > > the reuse problem.
> > >
> > > call_rcu is not just that particular function.
> > > It's all the work rcu subsystem needs to do to observe gp
> > > and execute that callback. Just see how many kthreads it will
> > > start when overloaded like this.
> >
> > The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> > and rcuo*.  There is also the back-of-interrupt softirq context, which
> > requires some care to measure accurately.
> >
> > The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> > that the per-element locking overhead for exact iterations was a problem?
> > If so, what exactly are the consistency rules for iteration?  Presumably
> > stronger than "if the element existed throughout, it is included in the
> > iteration; if it did not exist throughout, it is not included; otherwise
> > it might or might not be included" given that you get that for free.
> >
> > Either way, could you please tell me the exact iteration rules?
> 
> The rules are the way we make them to be.
> iteration will be under lock.
> lookup needs to be correct. It can retry if necessary (like htab is doing).
> Randomly returning 'noexist' is, of course, not acceptable.

OK, so then it is important that updates to this data structure be
carried out in such a way as to avoid discombobulating lockless readers.
Do the updates have that property?

The usual way to get that property is to leave the old search structure
around, replacing it with the new one, and RCU-freeing the old one.
In case it helps, Kung and Lehman describe how to do that for search trees:

http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf

							Thanx, Paul

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-08 20:11                           ` Paul E. McKenney
@ 2022-10-09  1:09                             ` Hou Tao
  2022-10-09  9:05                               ` Paul E. McKenney
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-10-09  1:09 UTC (permalink / raw)
  To: paulmck, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Hou Tao

Hi Paul,

On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
SNIP
>>>>> Understand. I was just trying to understand the exact performance overhead of
>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
>>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
>>>>> OOM problem is indeed a problem, although it is also possible for the current
>>>>> implementation, so I will try to implement the lookup procedure which handles
>>>>> the reuse problem.
>>>> call_rcu is not just that particular function.
>>>> It's all the work rcu subsystem needs to do to observe gp
>>>> and execute that callback. Just see how many kthreads it will
>>>> start when overloaded like this.
>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
>>> and rcuo*.  There is also the back-of-interrupt softirq context, which
>>> requires some care to measure accurately.
>>>
>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
>>> that the per-element locking overhead for exact iterations was a problem?
>>> If so, what exactly are the consistency rules for iteration?  Presumably
>>> stronger than "if the element existed throughout, it is included in the
>>> iteration; if it did not exist throughout, it is not included; otherwise
>>> it might or might not be included" given that you get that for free.
>>>
>>> Either way, could you please tell me the exact iteration rules?
>> The rules are the way we make them to be.
>> iteration will be under lock.
>> lookup needs to be correct. It can retry if necessary (like htab is doing).
>> Randomly returning 'noexist' is, of course, not acceptable.
> OK, so then it is important that updates to this data structure be
> carried out in such a way as to avoid discombobulating lockless readers.
> Do the updates have that property?
Yes. The update procedure will copy the old pointer array to a new array first,
then update the new array and replace the pointer to the old array with the
pointer to the new array, roughly as sketched below.
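
Roughly, the step being described; the node layout and names below are only
illustrative, not the patchset's actual code:

struct qp_node {
        __u32 bitmap;
        void *twigs[];          /* child pointers */
};

static int replace_child(struct qp_node __rcu **slot, struct qp_node *old,
                         unsigned int nr, unsigned int pos, void *child)
{
        struct qp_node *new;

        new = kmalloc(struct_size(new, twigs, nr), GFP_KERNEL);
        if (!new)
                return -ENOMEM;
        memcpy(new, old, struct_size(old, twigs, nr));  /* copy old array  */
        new->twigs[pos] = child;                        /* update the copy */
        rcu_assign_pointer(*slot, new);                 /* publish new one */
        /* 'old' is handed back to the allocator after a grace period */
        return 0;
}
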
>
> The usual way to get that property is to leave the old search structure
> around, replacing it with the new one, and RCU-freeing the old one.
> In case it helps, Kung and Lehman describe how to do that for search trees:
>
> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
Thanks for the paper. Just skimming through it, it seems that it uses
reference counting and garbage collection to solve the safe memory reclamation
problem. That may be too heavy for qp-trie; we plan to use a seqcount-like way
to check whether the branch or the leaf node was reused during the lookup, and
to retry the lookup if that happened (rough sketch below). I am still checking
the feasibility of the solution and it seems a little more complicated than
expected.
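
A very rough sketch of the seqcount-like idea; this is only the direction
being considered, not a worked-out design, and the field names and barriers
below are assumptions:

/* Each node carries a sequence counter that the updater bumps to an odd
 * value before recycling the node and back to even afterwards.  The lookup
 * snapshots it, does the comparison, and re-reads it to detect reuse.
 */
struct qp_leaf {
        unsigned int seq;
        __u32 key_size;
        __u8 data[];
};

static int leaf_matches(const struct qp_leaf *leaf,
                        const void *key, __u32 key_size)
{
        unsigned int seq = smp_load_acquire(&leaf->seq);
        bool match;

        if (seq & 1)                    /* node is being recycled right now */
                return -EAGAIN;         /* caller restarts the lookup */
        match = leaf->key_size == key_size &&
                !memcmp(leaf->data, key, key_size);
        smp_rmb();                      /* pairs with the updater's barriers */
        if (READ_ONCE(leaf->seq) != seq)
                return -EAGAIN;         /* node was reused underneath us */
        return match;
}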

>
> 							Thanx, Paul
> .


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-09  1:09                             ` Hou Tao
@ 2022-10-09  9:05                               ` Paul E. McKenney
  2022-10-09 10:45                                 ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Paul E. McKenney @ 2022-10-09  9:05 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, David S . Miller, Jakub Kicinski,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Hou Tao

On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
> > On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
> >> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> SNIP
> >>>>> Understand. I was just trying to understand the exact performance overhead of
> >>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
> >>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
> >>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
> >>>>> OOM problem is indeed a problem, although it is also possible for the current
> >>>>> implementation, so I will try to implement the lookup procedure which handles
> >>>>> the reuse problem.
> >>>> call_rcu is not just that particular function.
> >>>> It's all the work rcu subsystem needs to do to observe gp
> >>>> and execute that callback. Just see how many kthreads it will
> >>>> start when overloaded like this.
> >>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> >>> and rcuo*.  There is also the back-of-interrupt softirq context, which
> >>> requires some care to measure accurately.
> >>>
> >>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> >>> that the per-element locking overhead for exact iterations was a problem?
> >>> If so, what exactly are the consistency rules for iteration?  Presumably
> >>> stronger than "if the element existed throughout, it is included in the
> >>> iteration; if it did not exist throughout, it is not included; otherwise
> >>> it might or might not be included" given that you get that for free.
> >>>
> >>> Either way, could you please tell me the exact iteration rules?
> >> The rules are the way we make them to be.
> >> iteration will be under lock.
> >> lookup needs to be correct. It can retry if necessary (like htab is doing).
> >> Randomly returning 'noexist' is, of course, not acceptable.
> > OK, so then it is important that updates to this data structure be
> > carried out in such a way as to avoid discombobulating lockless readers.
> > Do the updates have that property?
> 
> Yes. The update procedure will copy the old pointer array to a new array first,
> then update the new array and replace the pointer of old array by the pointer of
> new array.

Very good.  But then why is there a problem?  Is the iteration using
multiple RCU read-side critical sections or something?

> > The usual way to get that property is to leave the old search structure
> > around, replacing it with the new one, and RCU-freeing the old one.
> > In case it helps, Kung and Lehman describe how to do that for search trees:
> >
> > http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
> Thanks for the paper. Just skimming through it, it seems that it uses
> reference-counting and garbage collection to solve the safe memory reclamation
> problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to
> check whether or not the branch and the leaf node is reused during lookup, and
> retry the lookup if it happened. Now just checking the feasibility of the
> solution and it seems a little complicated than expected.

The main thing in that paper is the handling of rotations in the
search-tree update.  But if you are not using a tree, that won't be all
that relevant.

								Thanx, Paul

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-09  9:05                               ` Paul E. McKenney
@ 2022-10-09 10:45                                 ` Hou Tao
  2022-10-09 11:04                                   ` Paul E. McKenney
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-10-09 10:45 UTC (permalink / raw)
  To: paulmck, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Hou Tao

Hi,

On 10/9/2022 5:05 PM, Paul E. McKenney wrote:
> On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote:
>> Hi Paul,
>>
>> On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
>>> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
>>>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>>>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
>> SNIP
>>>>>>> Understand. I was just trying to understand the exact performance overhead of
>>>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
>>>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
>>>>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
>>>>>>> OOM problem is indeed a problem, although it is also possible for the current
>>>>>>> implementation, so I will try to implement the lookup procedure which handles
>>>>>>> the reuse problem.
>>>>>> call_rcu is not just that particular function.
>>>>>> It's all the work rcu subsystem needs to do to observe gp
>>>>>> and execute that callback. Just see how many kthreads it will
>>>>>> start when overloaded like this.
>>>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
>>>>> and rcuo*.  There is also the back-of-interrupt softirq context, which
>>>>> requires some care to measure accurately.
>>>>>
>>>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
>>>>> that the per-element locking overhead for exact iterations was a problem?
>>>>> If so, what exactly are the consistency rules for iteration?  Presumably
>>>>> stronger than "if the element existed throughout, it is included in the
>>>>> iteration; if it did not exist throughout, it is not included; otherwise
>>>>> it might or might not be included" given that you get that for free.
>>>>>
>>>>> Either way, could you please tell me the exact iteration rules?
>>>> The rules are the way we make them to be.
>>>> iteration will be under lock.
>>>> lookup needs to be correct. It can retry if necessary (like htab is doing).
>>>> Randomly returning 'noexist' is, of course, not acceptable.
>>> OK, so then it is important that updates to this data structure be
>>> carried out in such a way as to avoid discombobulating lockless readers.
>>> Do the updates have that property?
>> Yes. The update procedure will copy the old pointer array to a new array first,
>> then update the new array and replace the pointer of old array by the pointer of
>> new array.
> Very good.  But then why is there a problem?  Is the iteration using
> multiple RCU read-side critical sections or something?
The problem is that although the objects are RCU-freed, they can also be reused
immediately by the bpf memory allocator. The reuse is there for performance and
to reduce the possibility of OOM. Because an object can be reused during an
RCU-protected lookup (even though the possibility of reuse is low), the lookup
procedure needs to check whether reuse happened during the lookup. And I was
arguing with Alexei about whether or not it is reasonable to provide an
optional flag to disable the immediate reuse in the bpf memory allocator.
>
>>> The usual way to get that property is to leave the old search structure
>>> around, replacing it with the new one, and RCU-freeing the old one.
>>> In case it helps, Kung and Lehman describe how to do that for search trees:
>>>
>>> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
>> Thanks for the paper. Just skimming through it, it seems that it uses
>> reference-counting and garbage collection to solve the safe memory reclamation
>> problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to
>> check whether or not the branch and the leaf node is reused during lookup, and
>> retry the lookup if it happened. Now just checking the feasibility of the
>> solution and it seems a little complicated than expected.
> The main thing in that paper is the handling of rotations in the
> search-tree update.  But if you are not using a tree, that won't be all
> that relevant.
I see. Thanks for the explanation.
>
> 								Thanx, Paul


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-09 10:45                                 ` Hou Tao
@ 2022-10-09 11:04                                   ` Paul E. McKenney
  0 siblings, 0 replies; 52+ messages in thread
From: Paul E. McKenney @ 2022-10-09 11:04 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, David S . Miller, Jakub Kicinski,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Hou Tao

On Sun, Oct 09, 2022 at 06:45:22PM +0800, Hou Tao wrote:
> Hi,
> On 10/9/2022 5:05 PM, Paul E. McKenney wrote:
> > On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote:
> >> Hi Paul,
> >>
> >> On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
> >>> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
> >>>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> >> SNIP
> >>>>>>> Understand. I was just trying to understand the exact performance overhead of
> >>>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
> >>>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
> >>>>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
> >>>>>>> OOM problem is indeed a problem, although it is also possible for the current
> >>>>>>> implementation, so I will try to implement the lookup procedure which handles
> >>>>>>> the reuse problem.
> >>>>>> call_rcu is not just that particular function.
> >>>>>> It's all the work rcu subsystem needs to do to observe gp
> >>>>>> and execute that callback. Just see how many kthreads it will
> >>>>>> start when overloaded like this.
> >>>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> >>>>> and rcuo*.  There is also the back-of-interrupt softirq context, which
> >>>>> requires some care to measure accurately.
> >>>>>
> >>>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> >>>>> that the per-element locking overhead for exact iterations was a problem?
> >>>>> If so, what exactly are the consistency rules for iteration?  Presumably
> >>>>> stronger than "if the element existed throughout, it is included in the
> >>>>> iteration; if it did not exist throughout, it is not included; otherwise
> >>>>> it might or might not be included" given that you get that for free.
> >>>>>
> >>>>> Either way, could you please tell me the exact iteration rules?
> >>>> The rules are the way we make them to be.
> >>>> iteration will be under lock.
> >>>> lookup needs to be correct. It can retry if necessary (like htab is doing).
> >>>> Randomly returning 'noexist' is, of course, not acceptable.
> >>> OK, so then it is important that updates to this data structure be
> >>> carried out in such a way as to avoid discombobulating lockless readers.
> >>> Do the updates have that property?
> >> Yes. The update procedure will copy the old pointer array to a new array first,
> >> then update the new array and replace the pointer of old array by the pointer of
> >> new array.
> > Very good.  But then why is there a problem?  Is the iteration using
> > multiple RCU read-side critical sections or something?
> 
> The problem is that although the objects are RCU-freed, they can also be
> reused immediately by the bpf memory allocator. The reuse is done for
> performance and to reduce the possibility of OOM. Because an object can be
> reused during an RCU-protected lookup, even though the possibility of reuse
> is low, the lookup procedure needs to check whether reuse has happened during
> the lookup. And I was arguing with Alexei about whether or not it is
> reasonable to provide an optional flag to disable the immediate reuse in the
> bpf memory allocator.

Indeed, in that case there needs to be a check, for example as described
in the comment preceding the definition of SLAB_TYPESAFE_BY_RCU.
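
In case it helps, the usual shape of that check is something like the
following. This is paraphrased from memory rather than quoted, and
lockless_lookup(), try_get_ref() and put_ref() stand in for whatever
lookup and reference-taking scheme the objects actually use:

	rcu_read_lock();
again:
	obj = lockless_lookup(key);
	if (obj) {
		/* can fail because the object might already be freed */
		if (!try_get_ref(obj))
			goto again;
		/* the object might have been recycled for another key */
		if (obj->key != key) {
			put_ref(obj);
			goto again;
		}
	}
	rcu_read_unlock();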

If the use of the element is read-only on the one hand or
heuristic/statistical on the other, lighter weight approaches are
possible.  Would that help?

							Thanx, Paul

> >>> The usual way to get that property is to leave the old search structure
> >>> around, replacing it with the new one, and RCU-freeing the old one.
> >>> In case it helps, Kung and Lehman describe how to do that for search trees:
> >>>
> >>> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
> >> Thanks for the paper. Just skimming through it, it seems that it uses
> >> reference counting and garbage collection to solve the safe memory reclamation
> >> problem. That may be too heavy for qp-trie, so we plan to use a seqcount-like
> >> way to check whether or not the branch and leaf nodes are reused during lookup,
> >> and to retry the lookup if that happens. For now I am just checking the
> >> feasibility of the solution and it seems a little more complicated than expected.
> > The main thing in that paper is the handling of rotations in the
> > search-tree update.  But if you are not using a tree, that won't be all
> > that relevant.
> I see. Thanks for the explanation.
> >
> > 								Thanx, Paul
> 


* Re: [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-10-08  2:40         ` Hou Tao
@ 2022-10-13 18:04           ` Andrii Nakryiko
  2022-10-14  4:02             ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-10-13 18:04 UTC (permalink / raw)
  To: Hou Tao
  Cc: Joanne Koong, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, houtao1

On Fri, Oct 7, 2022 at 7:40 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 10/1/2022 5:35 AM, Andrii Nakryiko wrote:
> > On Wed, Sep 28, 2022 at 7:11 PM Hou Tao <houtao@huaweicloud.com> wrote:
> SNP
> >>> I'm trying to understand why there should be so many new concepts and
> >>> interfaces just to allow variable-sized keys. Can you elaborate on
> >>> that? Like why do we even need BPF_DYNPTR_TYPE_USER? Why user can't
> >>> just pass a void * (casted to u64) pointer and size of the memory
> >>> pointed to it, and kernel will just copy necessary amount of data into
> >>> kvmalloc'ed temporary region?
> >> The main reason is that map operations from syscall and bpf program use the same
> >> ops in bpf_map_ops (e.g. map_update_elem). If only use dynptr_kern for bpf
> >> program, then
> >> have to define three new operations for bpf program. Even more, after defining
> >> two different map ops for the same operation from syscall and bpf program, the
> >> internal  implementation of qp-trie still need to convert these two different
> >> representations of variable-length key into bpf_qp_trie_key. It introduces
> >> unnecessary conversion, so I think it may be a good idea to pass dynptr_kern to
> >> qp-trie even for bpf syscall.
> >>
> >> And now in bpf_attr, for BPF_MAP_*_ELEM command, there is no space to pass an
> >> extra key size. It seems bpf_attr can be extend, but even it is extented, it
> >> also means in libbpf we need to provide a new API group to support operationg on
> >> dynptr key map, because the userspace needs to pass the key size as a new argument.
> > You are right that the current assumption of implicit key/value size
> > doesn't work for these variable-key/value-length maps. But I think the
> > right answer is actually to make sure that we have a map_update_elem
> > callback variant that accepts key/value size explicitly. I still think
> > that the syscall interface shouldn't introduce a concept of dynptr.
> > >From user-space's point of view dynptr is just a memory pointer +
> > associated memory size. Let's keep it simple. And yes, it will be a
> > new libbpf API for bpf_map_lookup_elem/bpf_map_update_elem. That's
> > fine.
> Is your point that dynptr is too complicated for user-space and may lead to
> confusion between dynptr in kernel space ? How about a different name or a

No, dynptr is just an unnecessary concept for user-space, because
fundamentally it's just a memory region, which in UAPI is represented
by a pointer + size. So why invent new concepts when existing ones
already cover it?
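
Just to illustrate the shape I have in mind (the function name is made up,
this is not an existing libbpf API):

/* hypothetical sketch: explicit sizes instead of a user-space dynptr */
int bpf_map__update_elem_sized(const struct bpf_map *map,
			       const void *key, size_t key_sz,
			       const void *value, size_t value_sz,
			       __u64 flags);

/* usage: variable-length string key, fixed-size value */
err = bpf_map__update_elem_sized(map, name, strlen(name) + 1,
				 &val, sizeof(val), BPF_ANY);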

> simple definition just like bpf_lpm_trie_key ? It will make both the
> implementation and the usage much simpler, because the implementation and the
> user can still use the same APIs just like fixed sized map.
>
> Not just lookup/update/delete, we also need to define a new op for
> get_next_key/lookup_and_delete_elem. And also need to define corresponding new
> bpf helpers for bpf program. And you said "explict key/value size", do you mean
> something below ?
>
> int (*map_update_elem)(struct bpf_map *map, void *key, u32 key_size, void
> *value, u32 value_size, u64 flags);

Yes, something like that. The problem is that up until now we assume
that key_size is fixed and can be derived from map definition. We are
trying to change that, so there needs to be a change in internal APIs.

>
> >
> >
> >>> It also seems like you want to allow key (and maybe value as well, not
> >>> sure) to be a custom user-defined type where some of the fields are
> >>> struct bpf_dynptr. I think it's a big overcomplication, tbh. I'd say
> >>> it's enough to just say that entire key has to be described by a
> >>> single bpf_dynptr. Then we can have bpf_map_lookup_elem_dynptr(map,
> >>> key_dynptr, flags) new helper to provide variable-sized key for
> >>> lookup.
> >> For qp-trie, it will only support a single dynptr as the map key. In the future
> >> maybe other map will support map key with embedded dynptrs. Maybe Joanne can
> >> share some vision about such use case.
> > My point was that instead of saying that key is some fixed-size struct
> > in which one of the fields is dynptr (and then when comparing you have
> > to compare part of struct, then dynptr contents, then the other part
> > of struct?), just say that entire key is represented by dynptr,
> > implicitly (it's just a blob of bytes). That seems more
> > straightforward.
> I see. But I still think there is possible user case for struct with embedded
> dynptr. For bpf map in kernel, byte blob is OK. But If it is also a blob of
> bytes for the bpf program or userspace application, the application may need to
> marshaling and un-marshaling between the bytes blob and a meaningful struct type
> each time before using it.
> > .
>

I'm not sure what you mean by "blob of bytes for userspace
application"? You mean a pointer pointing to some process' memory (not
a kernel memory)? How is that going to work if BPF program can run and
access such blob in any context, not just in the context of original
user-space app that set this value?

If you mean that blob needs to be interpreted as some sort of struct,
then yes, it's easy, we have bpf_dynptr_data() and `void *` -> `struct
my_custom_struct` casting in C.
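
E.g., a sketch on the BPF side, given some already-initialized struct
bpf_dynptr dynptr and a made-up value type:

struct my_value { __u32 flags; __u64 cnt; };
struct my_value *v;

/* returns NULL if the dynptr doesn't cover [0, sizeof(*v)) */
v = bpf_dynptr_data(&dynptr, 0, sizeof(*v));
if (v)
	v->cnt++;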

Or did I miss your point?


* Re: [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-10-13 18:04           ` Andrii Nakryiko
@ 2022-10-14  4:02             ` Hou Tao
  2022-10-18 22:50               ` Andrii Nakryiko
  0 siblings, 1 reply; 52+ messages in thread
From: Hou Tao @ 2022-10-14  4:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, houtao1

Hi,

On 10/14/2022 2:04 AM, Andrii Nakryiko wrote:
> On Fri, Oct 7, 2022 at 7:40 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 10/1/2022 5:35 AM, Andrii Nakryiko wrote:
>>> On Wed, Sep 28, 2022 at 7:11 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> SNP
>>>>> I'm trying to understand why there should be so many new concepts and
>>>>> interfaces just to allow variable-sized keys. Can you elaborate on
>>>>> that? Like why do we even need BPF_DYNPTR_TYPE_USER? Why user can't
>>>>> just pass a void * (casted to u64) pointer and size of the memory
>>>>> pointed to it, and kernel will just copy necessary amount of data into
>>>>> kvmalloc'ed temporary region?
>>>> The main reason is that map operations from syscall and bpf program use the same
>>>> ops in bpf_map_ops (e.g. map_update_elem). If only use dynptr_kern for bpf
>>>> program, then
>>>> have to define three new operations for bpf program. Even more, after defining
>>>> two different map ops for the same operation from syscall and bpf program, the
>>>> internal  implementation of qp-trie still need to convert these two different
>>>> representations of variable-length key into bpf_qp_trie_key. It introduces
>>>> unnecessary conversion, so I think it may be a good idea to pass dynptr_kern to
>>>> qp-trie even for bpf syscall.
>>>>
>>>> And now in bpf_attr, for BPF_MAP_*_ELEM command, there is no space to pass an
>>>> extra key size. It seems bpf_attr can be extend, but even it is extented, it
>>>> also means in libbpf we need to provide a new API group to support operationg on
>>>> dynptr key map, because the userspace needs to pass the key size as a new argument.
>>> You are right that the current assumption of implicit key/value size
>>> doesn't work for these variable-key/value-length maps. But I think the
>>> right answer is actually to make sure that we have a map_update_elem
>>> callback variant that accepts key/value size explicitly. I still think
>>> that the syscall interface shouldn't introduce a concept of dynptr.
>>> >From user-space's point of view dynptr is just a memory pointer +
>>> associated memory size. Let's keep it simple. And yes, it will be a
>>> new libbpf API for bpf_map_lookup_elem/bpf_map_update_elem. That's
>>> fine.
>> Is your point that dynptr is too complicated for user-space and may lead to
>> confusion between dynptr in kernel space ? How about a different name or a
> No, dynptr is just an unnecessary concept for user-space, because
> fundamentally it's just a memory region, which in UAPI is represented
> by a pointer + size. So why invent new concepts when existing ones
> already cover it?
But the problem is that pointer + explicit size is not covered by any existing
API, so we need to add support for it. Using dynptr is one option and directly
passing a pointer + explicit size is another.
>
>> simple definition just like bpf_lpm_trie_key ? It will make both the
>> implementation and the usage much simpler, because the implementation and the
>> user can still use the same APIs just like fixed sized map.
>>
>> Not just lookup/update/delete, we also need to define a new op for
>> get_next_key/lookup_and_delete_elem. And also need to define corresponding new
>> bpf helpers for bpf program. And you said "explict key/value size", do you mean
>> something below ?
>>
>> int (*map_update_elem)(struct bpf_map *map, void *key, u32 key_size, void
>> *value, u32 value_size, u64 flags);
> Yes, something like that. The problem is that up until now we assume
> that key_size is fixed and can be derived from map definition. We are
> trying to change that, so there needs to be a change in internal APIs.
We will need to change both the UAPI and the internal APIs. Should I also take
variable-sized map values into consideration this time? I am afraid that may be
a little over-designed. Maybe I should hack a demo out first to check the
workload and the complexity.
>
>>>
>>>>> It also seems like you want to allow key (and maybe value as well, not
>>>>> sure) to be a custom user-defined type where some of the fields are
>>>>> struct bpf_dynptr. I think it's a big overcomplication, tbh. I'd say
>>>>> it's enough to just say that entire key has to be described by a
>>>>> single bpf_dynptr. Then we can have bpf_map_lookup_elem_dynptr(map,
>>>>> key_dynptr, flags) new helper to provide variable-sized key for
>>>>> lookup.
>>>> For qp-trie, it will only support a single dynptr as the map key. In the future
>>>> maybe other map will support map key with embedded dynptrs. Maybe Joanne can
>>>> share some vision about such use case.
>>> My point was that instead of saying that key is some fixed-size struct
>>> in which one of the fields is dynptr (and then when comparing you have
>>> to compare part of struct, then dynptr contents, then the other part
>>> of struct?), just say that entire key is represented by dynptr,
>>> implicitly (it's just a blob of bytes). That seems more
>>> straightforward.
>> I see. But I still think there is possible user case for struct with embedded
>> dynptr. For bpf map in kernel, byte blob is OK. But If it is also a blob of
>> bytes for the bpf program or userspace application, the application may need to
>> marshaling and un-marshaling between the bytes blob and a meaningful struct type
>> each time before using it.
>>> .
> I'm not sure what you mean by "blob of bytes for userspace
> application"? You mean a pointer pointing to some process' memory (not
> a kernel memory)? How is that going to work if BPF program can run and
> access such blob in any context, not just in the context of original
> user-space app that set this value?
>
> If you mean that blob needs to be interpreted as some sort of struct,
> then yes, it's easy, we have bpf_dynptr_data() and `void *` -> `struct
> my_custom_struct` casting in C.
Yes. I mean we need to cast the blob to a meaningful struct before using it. If
there is a variable-length field in the struct, how would the direct casting
work, as shown below?

struct my_custom_struct {
           struct {
               unsigned int len;
               char *data;
           } name;
           unsigned int pt_code;
};
>
> Or did I miss your point?



* Re: [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall
  2022-10-14  4:02             ` Hou Tao
@ 2022-10-18 22:50               ` Andrii Nakryiko
  0 siblings, 0 replies; 52+ messages in thread
From: Andrii Nakryiko @ 2022-10-18 22:50 UTC (permalink / raw)
  To: Hou Tao
  Cc: Joanne Koong, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, houtao1

On Thu, Oct 13, 2022 at 9:02 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 10/14/2022 2:04 AM, Andrii Nakryiko wrote:
> > On Fri, Oct 7, 2022 at 7:40 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 10/1/2022 5:35 AM, Andrii Nakryiko wrote:
> >>> On Wed, Sep 28, 2022 at 7:11 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> SNP
> >>>>> I'm trying to understand why there should be so many new concepts and
> >>>>> interfaces just to allow variable-sized keys. Can you elaborate on
> >>>>> that? Like why do we even need BPF_DYNPTR_TYPE_USER? Why user can't
> >>>>> just pass a void * (casted to u64) pointer and size of the memory
> >>>>> pointed to it, and kernel will just copy necessary amount of data into
> >>>>> kvmalloc'ed temporary region?
> >>>> The main reason is that map operations from syscall and bpf program use the same
> >>>> ops in bpf_map_ops (e.g. map_update_elem). If only use dynptr_kern for bpf
> >>>> program, then
> >>>> have to define three new operations for bpf program. Even more, after defining
> >>>> two different map ops for the same operation from syscall and bpf program, the
> >>>> internal  implementation of qp-trie still need to convert these two different
> >>>> representations of variable-length key into bpf_qp_trie_key. It introduces
> >>>> unnecessary conversion, so I think it may be a good idea to pass dynptr_kern to
> >>>> qp-trie even for bpf syscall.
> >>>>
> >>>> And now in bpf_attr, for BPF_MAP_*_ELEM command, there is no space to pass an
> >>>> extra key size. It seems bpf_attr can be extend, but even it is extented, it
> >>>> also means in libbpf we need to provide a new API group to support operationg on
> >>>> dynptr key map, because the userspace needs to pass the key size as a new argument.
> >>> You are right that the current assumption of implicit key/value size
> >>> doesn't work for these variable-key/value-length maps. But I think the
> >>> right answer is actually to make sure that we have a map_update_elem
> >>> callback variant that accepts key/value size explicitly. I still think
> >>> that the syscall interface shouldn't introduce a concept of dynptr.
> >>> >From user-space's point of view dynptr is just a memory pointer +
> >>> associated memory size. Let's keep it simple. And yes, it will be a
> >>> new libbpf API for bpf_map_lookup_elem/bpf_map_update_elem. That's
> >>> fine.
> >> Is your point that dynptr is too complicated for user-space and may lead to
> >> confusion between dynptr in kernel space ? How about a different name or a
> > No, dynptr is just an unnecessary concept for user-space, because
> > fundamentally it's just a memory region, which in UAPI is represented
> > by a pointer + size. So why invent new concepts when existing ones
> > already cover it?
> But the problem is that pointer + explicit size is not covered by any existing
> API, so we need to add support for it. Using dynptr is one option and directly
> passing a pointer + explicit size is another.

dynptr is more than a pointer + size (it supports various types of
memory it points to, it supports an offset, etc.); it's a more generic
thing for BPF-side programmability. There is no need to expose it to
user-space. All we care about here is a memory region, which is a
pointer + size. Keep it simple.

> >
> >> simple definition just like bpf_lpm_trie_key ? It will make both the
> >> implementation and the usage much simpler, because the implementation and the
> >> user can still use the same APIs just like fixed sized map.
> >>
> >> Not just lookup/update/delete, we also need to define a new op for
> >> get_next_key/lookup_and_delete_elem. And also need to define corresponding new
> >> bpf helpers for bpf program. And you said "explict key/value size", do you mean
> >> something below ?
> >>
> >> int (*map_update_elem)(struct bpf_map *map, void *key, u32 key_size, void
> >> *value, u32 value_size, u64 flags);
> > Yes, something like that. The problem is that up until now we assume
> > that key_size is fixed and can be derived from map definition. We are
> > trying to change that, so there needs to be a change in internal APIs.
> We will need to change both the UAPI and the internal APIs. Should I also take
> variable-sized map values into consideration this time? I am afraid that may be
> a little over-designed. Maybe I should hack a demo out first to check the
> workload and the complexity.

I think sticking to fixed-size key/value for starters is ok; there are
plenty of things to figure out even without that. We can try attacking
variable-sized-key BPF maps (e.g., technically the BPF hashmap might also
support variable-sized keys or values just as well) as a separate
project.


> >
> >>>
> >>>>> It also seems like you want to allow key (and maybe value as well, not
> >>>>> sure) to be a custom user-defined type where some of the fields are
> >>>>> struct bpf_dynptr. I think it's a big overcomplication, tbh. I'd say
> >>>>> it's enough to just say that entire key has to be described by a
> >>>>> single bpf_dynptr. Then we can have bpf_map_lookup_elem_dynptr(map,
> >>>>> key_dynptr, flags) new helper to provide variable-sized key for
> >>>>> lookup.
> >>>> For qp-trie, it will only support a single dynptr as the map key. In the future
> >>>> maybe other map will support map key with embedded dynptrs. Maybe Joanne can
> >>>> share some vision about such use case.
> >>> My point was that instead of saying that key is some fixed-size struct
> >>> in which one of the fields is dynptr (and then when comparing you have
> >>> to compare part of struct, then dynptr contents, then the other part
> >>> of struct?), just say that entire key is represented by dynptr,
> >>> implicitly (it's just a blob of bytes). That seems more
> >>> straightforward.
> >> I see. But I still think there is possible user case for struct with embedded
> >> dynptr. For bpf map in kernel, byte blob is OK. But If it is also a blob of
> >> bytes for the bpf program or userspace application, the application may need to
> >> marshaling and un-marshaling between the bytes blob and a meaningful struct type
> >> each time before using it.
> >>> .
> > I'm not sure what you mean by "blob of bytes for userspace
> > application"? You mean a pointer pointing to some process' memory (not
> > a kernel memory)? How is that going to work if BPF program can run and
> > access such blob in any context, not just in the context of original
> > user-space app that set this value?
> >
> > If you mean that blob needs to be interpreted as some sort of struct,
> > then yes, it's easy, we have bpf_dynptr_data() and `void *` -> `struct
> > my_custom_struct` casting in C.
> Yes. I mean we need to cast the blob to a meaningful struct before using it. If
> there is a variable-length field in the struct, how would the direct casting
> work, as shown below?
>
> struct my_custom_struct {
>            struct {
>                unsigned int len;
>                char *data;
>            } name;
>            unsigned int pt_code;
> };

I'd imagine that you'd represent the variable-sized part at the end of the
fixed part as a flexible array of bytes:

struct my_custom_struct {
    int pt_code;
    int len;
    char data[];
};
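
Then building such a key is straightforward on either side, e.g. (a sketch,
assuming name/name_len/code are given and some sized update API exists):

size_t key_sz = sizeof(struct my_custom_struct) + name_len;
struct my_custom_struct *key = malloc(key_sz);

key->pt_code = code;
key->len = name_len;
memcpy(key->data, name, name_len);
/* then pass (key, key_sz) to whatever sized lookup/update API we add */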

> >
> > Or did I miss your point?
>


* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
                   ` (13 preceding siblings ...)
  2022-09-26  1:25 ` [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Alexei Starovoitov
@ 2022-10-19 17:01 ` Tony Finch
  2022-10-27 18:52   ` Andrii Nakryiko
  14 siblings, 1 reply; 52+ messages in thread
From: Tony Finch @ 2022-10-19 17:01 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hello all,

I have just found out about this qp-trie work, and I'm pleased to hear
that it is looking promising for you!

I have a few very broad observations:

The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
between branch factor, maximum key length, and size of branch node. The
greater the branch factor, the fewer indirections needed to traverse the
tree; but if you go too wide then prefetching is less effective and branch
nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
nodes from 16 bytes to 24 bytes, and it ended up slower.

Your interior nodes are much bigger than mine, so you might find the
tradeoff is different. I encourage you to try it out.
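
For comparison, extracting the i-th index from the key in the two cases looks
roughly like this (a sketch, not code from either implementation):

/* 4-bit quadbits: trivial to index */
static inline unsigned int nibble(const unsigned char *key, size_t i)
{
	unsigned char byte = key[i / 2];

	return (i & 1) ? (byte & 0xf) : (byte >> 4);
}

/* 5-bit mouthfuls: a field can straddle a byte boundary, so pull in
 * two bytes and shift; the key length is needed to avoid reading
 * past the end */
static inline unsigned int fivebits(const unsigned char *key, size_t len,
				    size_t i)
{
	size_t bit = i * 5;
	unsigned int word = key[bit / 8] << 8;

	if (bit / 8 + 1 < len)
		word |= key[bit / 8 + 1];
	return (word >> (11 - bit % 8)) & 0x1f;
}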

I saw there has been some discussion about locking and RCU. My current
project is integrating a qp-trie into BIND, with the aim of replacing its
old red-black tree for searching DNS records. It's based on a concurrent
qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
BIND). My strategy is based on a custom allocator for interior nodes. This
has two main effects:

  * Node references are now 32 bit indexes into the allocator's pool,
    instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.

  * The allocator supports copy-on-write and safe memory reclamation with
    a fairly small overhead, 3 x 32 bit counters per memory chunk (each
    chunk is roughly page sized).

I wrote some notes when the design was new, but things have changed since
then.

https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html

For memory reclamation the interior nodes get moved / compacted. It's a
kind of garbage collector, but easy-mode because the per-chunk counters
accurately indicate when compaction is worthwhile. I've written some notes
on my several failed GC experiments; the last / current attempt seems (by
and large) good enough.

https://dotat.at/@/2022-06-22-compact-qp.html

For exterior / leaf nodes, I'm using atomic refcounts to know when they
can be reclaimed. The caller is responsible for COWing its leaves when
necessary.

Updates to the tree are transactional in style, and do not block readers:
a single writer gets the write mutex, makes whatever changes it needs
(copying as required), then commits by flipping the tree's root. After a
commit it can free unused chunks. (Compaction can be part of an update
transaction or a transaction of its own.)
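
In outline the write path is just the following (a sketch using
liburcu-flavoured primitives rather than my actual code; apply_changes() and
free_unused_chunks() are placeholders):

pthread_mutex_lock(&tree->write_mutex);

old_root = tree->root;                      /* only the writer modifies it */
new_root = apply_changes(tree, old_root);   /* COW the touched nodes */
rcu_assign_pointer(tree->root, new_root);   /* commit: flip the root */

synchronize_rcu();                          /* wait for readers of old_root */
free_unused_chunks(tree);                   /* now safe to reclaim */

pthread_mutex_unlock(&tree->write_mutex);

/* readers */
rcu_read_lock();
root = rcu_dereference(tree->root);
/* ... lookup under root ... */
rcu_read_unlock();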

I'm currently using a reader-writer lock for the tree root, but I designed
it with liburcu in mind, while trying to keep things simple.

This strategy is very heavily biased in favour of readers, which suits DNS
servers. I don't know enough about BPF to have any idea what kind of
update traffic you need to support.

At the moment I am reworking and simplifying my transaction and
reclamation code and it's all very broken. I guess this isn't the best
possible time to compare notes on qp-trie variants, but I'm happy to hear
from others who have code and ideas to share.

-- 
Tony Finch  <dot@dotat.at>  https://dotat.at/
Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
increasing 6 to gale 8 for a time. Smooth or slight in eastern
shelter, otherwise slight or moderate. Rain or showers. Good,
occasionally poor.


* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-19 17:01 ` Tony Finch
@ 2022-10-27 18:52   ` Andrii Nakryiko
  2022-11-01 12:07     ` Hou Tao
  0 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-10-27 18:52 UTC (permalink / raw)
  To: Tony Finch
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, houtao1

On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@dotat.at> wrote:
>
> Hello all,
>
> I have just found out about this qp-trie work, and I'm pleased to hear
> that it is looking promising for you!
>

This is a very nice data structure, so thank you for doing a great job
explaining it in your post!

> I have a few very broad observations:
>
> The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
> between branch factor, maximum key length, and size of branch node. The
> greater the branch factor, the fewer indirections needed to traverse the
> tree; but if you go too wide then prefetching is less effective and branch
> nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
> 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
> but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
> nodes from 16 bytes to 24 bytes, and it ended up slower.
>
> Your interior nodes are much bigger than mine, so you might find the
> tradeoff is different. I encourage you to try it out.

True, but I think for (at least initial) simplicity, sticking to
half-bytes would simplify the code and let us figure out BPF and
kernel-specific issues without having to worry about the correctness
of the qp-trie core logic itself.

>
> I saw there has been some discussion about locking and RCU. My current
> project is integrating a qp-trie into BIND, with the aim of replacing its
> old red-black tree for searching DNS records. It's based on a concurrent
> qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
> BIND). My strategy is based on a custom allocator for interior nodes. This
> has two main effects:
>
>   * Node references are now 32 bit indexes into the allocator's pool,
>     instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.
>
>   * The allocator supports copy-on-write and safe memory reclamation with
>     a fairly small overhead, 3 x 32 bit counters per memory chunk (each
>     chunk is roughly page sized).
>
> I wrote some notes when the design was new, but things have changed since
> then.
>
> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
>
> For memory reclamation the interior nodes get moved / compacted. It's a
> kind of garbage collector, but easy-mode because the per-chunk counters
> accurately indicate when compaction is worthwhile. I've written some notes
> on my several failed GC experiments; the last / current attempt seems (by
> and large) good enough.
>
> https://dotat.at/@/2022-06-22-compact-qp.html
>
> For exterior / leaf nodes, I'm using atomic refcounts to know when they
> can be reclaimed. The caller is responsible for COWing its leaves when
> necessary.
>
> Updates to the tree are transactional in style, and do not block readers:
> a single writer gets the write mutex, makes whatever changes it needs
> (copying as required), then commits by flipping the tree's root. After a
> commit it can free unused chunks. (Compaction can be part of an update
> transaction or a transaction of its own.)
>
> I'm currently using a reader-writer lock for the tree root, but I designed
> it with liburcu in mind, while trying to keep things simple.
>
> This strategy is very heavily biased in favour of readers, which suits DNS
> servers. I don't know enough about BPF to have any idea what kind of
> update traffic you need to support.

These are some nice ideas, I did a quick read on your latest blog
posts, missed those updates since last time I checked your blog.

One limitation that we have in the BPF world is that BPF programs can
be run in extremely restrictive contexts (e.g., NMI), in which things
that user-space can assume will almost always succeed (like memory
allocation), are not allowed. We do have BPF-specific memory
allocator, but even it can fail to allocate memory, depending on
allocation patterns. So we need to think if this COW approach is
acceptable. I'd love for Hou Tao to think about this and chime in,
though, as he spent a lot of time thinking about particulars.

But very basically, ultimate memory and performance savings are
perhaps less important in trying to fit qp-trie into BPF framework. We
can iterate after with optimizations and improvements, but first we
need to get the things correct and well-behaved.

>
> At the moment I am reworking and simplifying my transaction and
> reclamation code and it's all very broken. I guess this isn't the best
> possible time to compare notes on qp-trie variants, but I'm happy to hear
> from others who have code and ideas to share.

It would be great if you can lend your expertise in reviewing at least
generic qp-trie parts, but also in helping to figure out the overall
concurrency approach we can take in kernel/BPF land (depending on your
familiarity with kernel specifics, of course).

Thanks for offering the latest on qp-trie, exciting to see more
production applications of qp-trie and that you are still actively
working on this!

>
> --
> Tony Finch  <dot@dotat.at>  https://dotat.at/
> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
> increasing 6 to gale 8 for a time. Smooth or slight in eastern
> shelter, otherwise slight or moderate. Rain or showers. Good,
> occasionally poor.


* Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key
  2022-10-27 18:52   ` Andrii Nakryiko
@ 2022-11-01 12:07     ` Hou Tao
  0 siblings, 0 replies; 52+ messages in thread
From: Hou Tao @ 2022-11-01 12:07 UTC (permalink / raw)
  To: Andrii Nakryiko, Tony Finch
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, houtao1

Hi,

On 10/28/2022 2:52 AM, Andrii Nakryiko wrote:
> On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@dotat.at> wrote:
>> Hello all,
>>
>> I have just found out about this qp-trie work, and I'm pleased to hear
>> that it is looking promising for you!
>>
> This is a very nice data structure, so thank you for doing a great job
> explaining it in your post!
Sorry for the late reply, I am still digging into other problems. Also, thanks
to Tony for his work on qp.git.
>
>> I have a few very broad observations:
>>
>> The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
>> between branch factor, maximum key length, and size of branch node. The
>> greater the branch factor, the fewer indirections needed to traverse the
>> tree; but if you go too wide then prefetching is less effective and branch
>> nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
>> 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
>> but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
>> nodes from 16 bytes to 24 bytes, and it ended up slower.
>>
>> Your interior nodes are much bigger than mine, so you might find the
>> tradeoff is different. I encourage you to try it out.
The interior nodes are bigger mainly because the parent field in
qp_trie_branch is used to support non-recursive iteration and the rcu_head is
used for RCU-deferred freeing.
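
Roughly, the interior node carries the following on top of the usual qp-trie
bitmap and children (a simplified sketch, not the exact field layout in the
posted patches):

struct qp_trie_branch {
	struct rcu_head rcu;            /* RCU-deferred freeing */
	struct qp_trie_branch *parent;  /* for non-recursive iteration */
	unsigned int bitmap;            /* which nibble values have a child */
	unsigned int index;             /* position of the nibble being tested */
	void *nodes[];                  /* children: branches or leaves */
};
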
> True, but I think for (at least initial) simplicity, sticking to
> half-bytes would simplify the code and let us figure out BPF and
> kernel-specific issues without having to worry about the correctness
> of the qp-trie core logic itself.
Agreed.
>
>> I saw there has been some discussion about locking and RCU. My current
>> project is integrating a qp-trie into BIND, with the aim of replacing its
>> old red-black tree for searching DNS records. It's based on a concurrent
>> qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
>> BIND). My strategy is based on a custom allocator for interior nodes. This
>> has two main effects:
>>
>>   * Node references are now 32 bit indexes into the allocator's pool,
>>     instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.
>>
>>   * The allocator supports copy-on-write and safe memory reclamation with
>>     a fairly small overhead, 3 x 32 bit counters per memory chunk (each
>>     chunk is roughly page sized).
>>
>> I wrote some notes when the design was new, but things have changed since
>> then.
>>
>> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
>>
>> For memory reclamation the interior nodes get moved / compacted. It's a
>> kind of garbage collector, but easy-mode because the per-chunk counters
>> accurately indicate when compaction is worthwhile. I've written some notes
>> on my several failed GC experiments; the last / current attempt seems (by
>> and large) good enough.
>>
>> https://dotat.at/@/2022-06-22-compact-qp.html
>>
>> For exterior / leaf nodes, I'm using atomic refcounts to know when they
>> can be reclaimed. The caller is responsible for COWing its leaves when
>> necessary.
>>
>> Updates to the tree are transactional in style, and do not block readers:
>> a single writer gets the write mutex, makes whatever changes it needs
>> (copying as required), then commits by flipping the tree's root. After a
>> commit it can free unused chunks. (Compaction can be part of an update
>> transaction or a transaction of its own.)
>>
>> I'm currently using a reader-writer lock for the tree root, but I designed
>> it with liburcu in mind, while trying to keep things simple.
>>
>> This strategy is very heavily biased in favour of readers, which suits DNS
>> servers. I don't know enough about BPF to have any idea what kind of
>> update traffic you need to support.
> These are some nice ideas, I did a quick read on your latest blog
> posts, missed those updates since last time I checked your blog.
>
> One limitation that we have in the BPF world is that BPF programs can
> be run in extremely restrictive contexts (e.g., NMI), in which things
> that user-space can assume will almost always succeed (like memory
> allocation), are not allowed. We do have BPF-specific memory
> allocator, but even it can fail to allocate memory, depending on
> allocation patterns. So we need to think if this COW approach is
> acceptable. I'd love for Hou Tao to think about this and chime in,
> though, as he spent a lot of time thinking about particulars.
The current implementation of BPF_MAP_TYPE_QP_TRIE already uses COW. When
adding or deleting a leaf node, its parent interior node is copied to a new
interior node, the pointer to the old parent node (in the grand-parent interior
node) is updated to point to the new parent node, and the old parent node is
RCU-freed. According to the description above, COW in that design means all
nodes on the path from the root node to the leaf node are COWed, so I think the
current COW implementation is a better fit for the bpf map usage scenario. But
I will check the qp-trie code in BIND [0] later.

0:
https://gitlab.isc.org/isc-projects/bind9/-/commit/ecc555e6ec763c4f8f2495864ec08749202fff1a#65b4d67ce64e9195e41ac43d78af5156f9ebb779_0_553
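
In pseudo-C, the COW step for an update is roughly the following (a simplified
sketch, not the exact patch code; dup_branch(), add_or_remove_child() and the
slot computation are placeholders, and the free primitive depends on the
allocator that is finally used):

new_parent = dup_branch(old_parent);            /* copy the interior node */
add_or_remove_child(new_parent, leaf);          /* modify the copy */

/* publish the copy in the grand-parent's child slot ... */
rcu_assign_pointer(grand_parent->nodes[slot], new_parent);

/* ... and free the old node after a grace period */
kfree_rcu(old_parent, rcu);
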
> But very basically, ultimate memory and performance savings are
> perhaps less important in trying to fit qp-trie into BPF framework. We
> can iterate after with optimizations and improvements, but first we
> need to get the things correct and well-behaved.
Understood.
>
>> At the moment I am reworking and simplifying my transaction and
>> reclamation code and it's all very broken. I guess this isn't the best
>> possible time to compare notes on qp-trie variants, but I'm happy to hear
>> from others who have code and ideas to share.
> It would be great if you can lend your expertise in reviewing at least
> generic qp-trie parts, but also in helping to figure out the overall
> concurrency approach we can take in kernel/BPF land (depending on your
> familiarity with kernel specifics, of course).
>
> Thanks for offering the latest on qp-trie, exciting to see more
> production applications of qp-trie and that you are still actively
> working on this!
Yes, it would be great if Tony could help to review or co-design the bpf
qp-trie map.
>> --
>> Tony Finch  <dot@dotat.at>  https://dotat.at/
>> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
>> increasing 6 to gale 8 for a time. Smooth or slight in eastern
>> shelter, otherwise slight or moderate. Rain or showers. Good,
>> occasionally poor.



end of thread, other threads:[~2022-11-01 12:08 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-24 13:36 [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 01/13] bpf: Export bpf_dynptr_set_size() Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 02/13] bpf: Add helper btf_find_dynptr() Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 03/13] bpf: Support bpf_dynptr-typed map key in bpf syscall Hou Tao
2022-09-29  0:16   ` Andrii Nakryiko
2022-09-29  2:11     ` Hou Tao
2022-09-30 21:35       ` Andrii Nakryiko
2022-10-08  2:40         ` Hou Tao
2022-10-13 18:04           ` Andrii Nakryiko
2022-10-14  4:02             ` Hou Tao
2022-10-18 22:50               ` Andrii Nakryiko
2022-09-24 13:36 ` [PATCH bpf-next v2 04/13] bpf: Support bpf_dynptr-typed map key in verifier Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 05/13] libbpf: Add helpers for bpf_dynptr_user Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 06/13] bpf: Add support for qp-trie map with dynptr key Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 07/13] libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 08/13] bpftool: Add support for qp-trie map Hou Tao
2022-09-27 11:24   ` Quentin Monnet
2022-09-28  4:14     ` Hou Tao
2022-09-28  8:40       ` Quentin Monnet
2022-09-28  9:05         ` Hou Tao
2022-09-28  9:23           ` Quentin Monnet
2022-09-28 10:54             ` Hou Tao
2022-09-28 11:49               ` Quentin Monnet
2022-09-24 13:36 ` [PATCH bpf-next v2 09/13] selftests/bpf: Add two new dynptr_fail cases for map key Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 10/13] selftests/bpf: Move ENOTSUPP into bpf_util.h Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 11/13] selftests/bpf: Add prog tests for qp-trie map Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 12/13] selftests/bpf: Add benchmark " Hou Tao
2022-09-24 13:36 ` [PATCH bpf-next v2 13/13] selftests/bpf: Add map tests for qp-trie by using bpf syscall Hou Tao
2022-09-26  1:25 ` [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key Alexei Starovoitov
2022-09-26 13:18   ` Hou Tao
2022-09-27  1:19     ` Alexei Starovoitov
2022-09-27  3:08       ` Hou Tao
2022-09-27  3:18         ` Alexei Starovoitov
2022-09-27 14:07           ` Hou Tao
2022-09-28  1:08             ` Alexei Starovoitov
2022-09-28  3:27               ` Hou Tao
2022-09-28  4:37                 ` Alexei Starovoitov
2022-09-28  8:45               ` Hou Tao
2022-09-28  8:49                 ` Hou Tao
2022-09-29  3:22                 ` Alexei Starovoitov
2022-10-08  1:56                   ` Hou Tao
2022-10-08  1:59                     ` Alexei Starovoitov
2022-10-08 13:22                       ` Paul E. McKenney
2022-10-08 16:40                         ` Alexei Starovoitov
2022-10-08 20:11                           ` Paul E. McKenney
2022-10-09  1:09                             ` Hou Tao
2022-10-09  9:05                               ` Paul E. McKenney
2022-10-09 10:45                                 ` Hou Tao
2022-10-09 11:04                                   ` Paul E. McKenney
2022-10-19 17:01 ` Tony Finch
2022-10-27 18:52   ` Andrii Nakryiko
2022-11-01 12:07     ` Hou Tao
