* [PATCH bpf-next 00/10] Add support for qp-trie map
@ 2022-09-17 15:31 Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 01/10] bpf: Export bpf_dynptr_{get|set}_size Hou Tao
                   ` (9 more replies)
  0 siblings, 10 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length, as
discussed in [0]. And as a big-endian lexicographically ordered map, it
can also be used for any binary data of fixed or variable length.

Now the basic functionality of qp-trie is ready, so I am posting it to
get more feedback or suggestions about qp-trie, especially feedback
about the following questions:

(1) Use bpf_dynptr in map key
qp-trie now uses bpf_dynptr as the map key. For simplicity a new map
flag, BPF_F_DYNPTR_KEY, is introduced, but it is not usable for a map
key in which a bpf_dynptr is embedded. So maybe a better and more
generic way is to check the offset of bpf_dynptr within the map key
during map creation and use that offset when initializing or verifying
the map key?
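For reference, below is a minimal sketch (not taken verbatim from the
selftests) of how a BPF program would declare and use a dynptr-keyed
qp-trie map with the current approach; the buffer "buf" and length
"len" are only illustrative:

  struct {
          __uint(type, BPF_MAP_TYPE_QP_TRIE);
          __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_DYNPTR_KEY);
          __uint(max_entries, 16384);
          __uint(map_extra, 255);            /* maximum key length */
          __type(key, struct bpf_dynptr);
          __type(value, __u32);
  } trie SEC(".maps");

  /* ... in a program: wrap the key bytes in a dynptr and use it
   * directly as the map key
   */
  struct bpf_dynptr key;
  __u32 *value;

  bpf_dynptr_from_mem(buf, len, 0, &key);
  value = bpf_map_lookup_elem(&trie, &key);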

(2) Use cases for qp-trie
Andrii had proposed to re-implement lpm-trie by using qp-trie. The
advantage would be faster lookups due to the lower tree depth of
qp-trie, and update performance may also improve. But are there any
other use cases for qp-trie? Especially cases which need both ordering
and memory efficiency, or cases in which qp-trie has a high branching
factor and its lookup performance is much better than a hash table, as
shown below:

  Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
  htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
  htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
  htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
  htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
  htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
  
  qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
  qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
  qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
  qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
  qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)

  * non-zero drops are due to duplicate keys in the generated key set.

(3) Improve update performance for qp-trie
qp-trie is now divided into 256 sub-trees by using the first byte of
the key as an index, and each sub-tree is protected by its own
spinlock. I had also tried hierarchical locking using a hand-over-hand
locking scheme [1], but it doesn't scale well. Update performance of
the hash table has been optimized a lot by "bpf: BPF specific memory
allocator", but because qp-trie allocates memory for a variable-length
key and bpf_mem_alloc() only supports allocations of <= 4KB, it doesn't
fit qp-trie well. A simplified sketch of the per-sub-tree locking is
shown after the benchmark below.

  Strings in /proc/kallsyms (key size=83, max entries=170958)
  htab update      (1  thread)    3.135 ± 0.355M/s (drops 0.000 ± 0.000M/s mem 30.842 MiB)
  htab update      (2  thread)    6.269 ± 0.686M/s (drops 0.000 ± 0.000M/s mem 30.841 MiB)
  htab update      (4  thread)   11.607 ± 1.632M/s (drops 0.000 ± 0.000M/s mem 30.840 MiB)
  htab update      (8  thread)   23.041 ± 0.806M/s (drops 0.000 ± 0.000M/s mem 30.842 MiB)
  htab update      (16 thread)   31.393 ± 0.307M/s (drops 0.000 ± 0.000M/s mem 30.835 MiB)
  
  qp-trie update   (1  thread)    1.191 ± 0.093M/s (drops 0.000 ± 0.000M/s mem 17.170 MiB)
  qp-trie update   (2  thread)    2.057 ± 0.041M/s (drops 0.000 ± 0.000M/s mem 17.058 MiB)
  qp-trie update   (4  thread)    2.975 ± 0.035M/s (drops 0.000 ± 0.000M/s mem 17.411 MiB)
  qp-trie update   (8  thread)    3.596 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 17.110 MiB)
  qp-trie update   (16 thread)    4.200 ± 0.048M/s (drops 0.000 ± 0.000M/s mem 17.228 MiB)
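
The per-sub-tree locking mentioned above boils down to the following
simplified sketch (mirroring the qp_trie struct and the update path in
patch 6; the actual trie walk and error handling are omitted):

  struct qp_trie {
          struct bpf_map map;
          atomic_t entries;
          void __rcu *roots[256];
          spinlock_t locks[256];
  };

  static int qp_trie_update_sketch(struct qp_trie *trie,
                                   const unsigned char *key,
                                   unsigned int key_len, void *value)
  {
          unsigned char c = key[0];

          /* the first key byte selects both the sub-tree and its lock */
          spin_lock(&trie->locks[c]);
          /* ... walk and modify trie->roots[c] under this lock ... */
          spin_unlock(&trie->locks[c]);
          return 0;
  }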
  
(4) Improve memory efficiency further
When using the strings in the BTF string section as a data set for
qp-trie, the slab memory usage shown in the cgroup memory.stat file is
about 11MB for qp-trie and 22MB for the hash table, as shown below.
However, the theoretical memory usage for qp-trie is ~6.8MB (~4.9MB if
the "parent" & "rcu" fields are removed from qp_trie_branch), and the
extra memory usage (about 38% of the total) mainly comes from internal
fragmentation in slab (namely 2^n alignment for allocations) and the
overhead of kmem-cgroup accounting. We could reduce the internal
fragmentation by creating separate kmem_caches for qp_trie_branch nodes
with different numbers of children, but I am not sure whether it is
worth it.

And in order to avoid allocating an rcu_head for each leaf node, only
branch nodes are RCU-freed now, so when replacing a leaf node, a new
branch node and a new leaf node are allocated instead of just replacing
the old leaf node and RCU-freeing it. It also seems that rcu_head could
be replaced by a pointer, as the BPF specific memory allocator does, so
8 bytes could be saved for each node, but again I am not sure it is
worth it.

  Sorted strings in BTF string sections (entries=115980)
  htab lookup      (1  thread)    6.915 ± 0.029M/s (drops 0.000 ± 0.000M/s mem 22.216 MiB)
  qp-trie lookup   (1  thread)    6.791 ± 0.005M/s (drops 0.000 ± 0.000M/s mem 11.273 MiB)
  
  All files under linux kernel source directory (entries=74359)
  htab lookup      (1  thread)    7.978 ± 0.009M/s (drops 0.000 ± 0.000M/s mem 14.272 MiB)
  qp-trie lookup   (1  thread)    5.521 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.367 MiB)
  
  Domain names for Alexa top million web sites (entries=1000000)
  htab lookup      (1  thread)    3.148 ± 0.039M/s (drops 0.000 ± 0.000M/s mem 190.831 MiB)
  qp-trie lookup   (1  thread)    2.374 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 83.733 MiB)

Comments and suggestions are always welcome.

Change Log:
v1:
 * Use bpf_dynptr as map key type instead of bpf_lpm_trie_key-styled key (Suggested by Andrii)
 * Fix build errors and RCU-related sparse errors reported by the lkp robot
 * Copy the passed key first in qp_trie_update_elem(), because the content of the passed key may
   change and break the assumption of the two-round lookup process during update.
 * Add the missing rcu_barrier in qp_trie_free()

RFC: https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@huawei.com/

[0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/db34696a-cbfe-16e8-6dd5-8174b97dcf1d@huawei.com/

Hou Tao (10):
  bpf: Export bpf_dynptr_{get|set}_size
  bpf: Add helper btf_type_is_bpf_dynptr()
  bpf: Support bpf_dynptr-typed map key for bpf syscall
  bpf: Support bpf_dynptr-typed map key for bpf program
  libbpf: Add helpers for bpf_dynptr_user
  bpf: Add support for qp-trie map
  selftests/bpf: Add two new dynptr_fail cases for map key
  selftests/bpf: Add test case for basic qp-trie map operations
  selftests/bpf: Add benchmark for qp-trie map
  selftests/bpf: Add tests for qp-trie map by using bpf syscalls

 include/linux/bpf.h                           |    4 +
 include/linux/bpf_types.h                     |    1 +
 include/linux/btf.h                           |    1 +
 include/uapi/linux/bpf.h                      |   10 +
 kernel/bpf/Makefile                           |    1 +
 kernel/bpf/bpf_qp_trie.c                      | 1055 +++++++++++++++
 kernel/bpf/btf.c                              |    7 +
 kernel/bpf/helpers.c                          |    9 +-
 kernel/bpf/syscall.c                          |  108 +-
 kernel/bpf/verifier.c                         |   18 +-
 tools/include/uapi/linux/bpf.h                |    9 +
 tools/lib/bpf/bpf.h                           |   19 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |    1 +
 tools/testing/selftests/bpf/Makefile          |    5 +-
 tools/testing/selftests/bpf/bench.c           |   10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      |  511 ++++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
 .../selftests/bpf/map_tests/qp_trie_map.c     | 1165 +++++++++++++++++
 .../testing/selftests/bpf/prog_tests/dynptr.c |    2 +
 .../selftests/bpf/prog_tests/qp_trie_test.c   |   91 ++
 .../testing/selftests/bpf/progs/dynptr_fail.c |   43 +
 .../selftests/bpf/progs/qp_trie_bench.c       |  236 ++++
 .../selftests/bpf/progs/qp_trie_test.c        |  110 ++
 23 files changed, 3452 insertions(+), 19 deletions(-)
 create mode 100644 kernel/bpf/bpf_qp_trie.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/map_tests/qp_trie_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_test.c

-- 
2.29.2



* [PATCH bpf-next 01/10] bpf: Export bpf_dynptr_{get|set}_size
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 02/10] bpf: Add helper btf_type_is_bpf_dynptr() Hou Tao
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

For maps with a bpf_dynptr-typed key, the lookup and update procedures
will use bpf_dynptr_get_size() to get the length of the key, and the
iteration procedure will use bpf_dynptr_set_size() to set the length of
the returned key.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h  | 2 ++
 kernel/bpf/helpers.c | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e0dbe0c0a17e..8da4a8190413 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2647,6 +2647,8 @@ void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
 		     enum bpf_dynptr_type type, u32 offset, u32 size);
 void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
 int bpf_dynptr_check_size(u32 size);
+u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr);
+void bpf_dynptr_set_size(struct bpf_dynptr_kern *ptr, u32 new_size);
 
 #ifdef CONFIG_BPF_LSM
 void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 41aeaf3862ec..7dbda4acb2c2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1408,11 +1408,18 @@ static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_typ
 	ptr->size |= type << DYNPTR_TYPE_SHIFT;
 }
 
-static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
+u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr)
 {
 	return ptr->size & DYNPTR_SIZE_MASK;
 }
 
+void bpf_dynptr_set_size(struct bpf_dynptr_kern *ptr, u32 new_size)
+{
+	u32 metadata = ptr->size & ~DYNPTR_SIZE_MASK;
+
+	ptr->size = new_size | metadata;
+}
+
 int bpf_dynptr_check_size(u32 size)
 {
 	return size > DYNPTR_MAX_SIZE ? -E2BIG : 0;
-- 
2.29.2



* [PATCH bpf-next 02/10] bpf: Add helper btf_type_is_bpf_dynptr()
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 01/10] bpf: Export bpf_dynptr_{get|set}_size Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 03/10] bpf: Support bpf_dynptr-typed map key for bpf syscall Hou Tao
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add helper btf_type_is_bpf_dynptr() to check whether or not the passed
btf type is bpf_dynptr. It will be used by bpf maps (e.g. qp-trie)
which use bpf_dynptr as the map key.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/btf.h | 1 +
 kernel/bpf/btf.c    | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 1fcc833a8690..d52f9979ea49 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -156,6 +156,7 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 					  const struct btf_type *t);
+bool btf_type_is_bpf_dynptr(const struct btf *btf, const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b3940c605aac..f94b6112a649 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3522,6 +3522,13 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 	return ERR_PTR(ret);
 }
 
+bool btf_type_is_bpf_dynptr(const struct btf *btf, const struct btf_type *t)
+{
+	return __btf_type_is_struct(t) &&
+	       t->size == sizeof(struct bpf_dynptr) &&
+	       !strcmp("bpf_dynptr", __btf_name_by_offset(btf, t->name_off));
+}
+
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
-- 
2.29.2



* [PATCH bpf-next 03/10] bpf: Support bpf_dynptr-typed map key for bpf syscall
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 01/10] bpf: Export bpf_dynptr_{get|set}_size Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 02/10] bpf: Add helper btf_type_is_bpf_dynptr() Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 18:18   ` kernel test robot
  2022-09-17 18:29   ` kernel test robot
  2022-09-17 15:31 ` [PATCH bpf-next 04/10] bpf: Support bpf_dynptr-typed map key for bpf program Hou Tao
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

A userspace application uses the bpf syscall to look up or update a bpf
map. It passes a pointer to a fixed-size buffer to the kernel to
represent the map key. To support maps with variable-length keys,
introduce bpf_dynptr_user to allow userspace to pass a pointer to a
bpf_dynptr_user which specifies the address and the length of the key
buffer. And in order to represent a dynptr coming from userspace, add a
new dynptr type: BPF_DYNPTR_TYPE_USER. Because a
BPF_DYNPTR_TYPE_USER-typed dynptr is not available to bpf programs, the
verifier doesn't need updating.

To distinguish maps with a fixed-size key from maps with a
variable-length one, add a new map flag: BPF_F_DYNPTR_KEY. For a map
with BPF_F_DYNPTR_KEY enabled, the key BTF type must be bpf_dynptr and
the lower 32 bits of map_extra specify the maximum size of the dynptr
key.
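
A rough sketch of the intended usage from userspace (map_fd, the key
string and the value type are only examples):

  struct bpf_dynptr_user dkey = {};
  char name[] = "example_key";
  __u32 value;
  int err;

  dkey.data = (__u64)(unsigned long)name;
  dkey.size = strlen(name);
  err = bpf_map_lookup_elem(map_fd, &dkey, &value);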

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   9 +++
 kernel/bpf/syscall.c           | 108 ++++++++++++++++++++++++++++-----
 tools/include/uapi/linux/bpf.h |   8 +++
 4 files changed, 113 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8da4a8190413..5060d7aee08c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2641,6 +2641,8 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_LOCAL,
 	/* Underlying data is a ringbuf record */
 	BPF_DYNPTR_TYPE_RINGBUF,
+	/* Points to memory copied from/to userspace */
+	BPF_DYNPTR_TYPE_USER,
 };
 
 void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3df78c56c1bf..77a2828f8148 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1246,6 +1246,9 @@ enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+
+/* Map with bpf_dynptr-typed key */
+	BPF_F_DYNPTR_KEY	= (1U << 13),
 };
 
 /* Flags for BPF_PROG_QUERY. */
@@ -6775,6 +6778,12 @@ struct bpf_timer {
 	__u64 :64;
 } __attribute__((aligned(8)));
 
+struct bpf_dynptr_user {
+	__u64 data;
+	__u32 size;
+	__u32 :32;
+} __attribute__((aligned(8)));
+
 struct bpf_dynptr {
 	__u64 :64;
 	__u64 :64;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dab156f09f8d..fd15c13cef24 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -151,6 +151,11 @@ bool bpf_map_write_active(const struct bpf_map *map)
 	return atomic64_read(&map->writecnt) != 0;
 }
 
+static inline bool is_dynptr_key(const struct bpf_map *map)
+{
+	return map->map_flags & BPF_F_DYNPTR_KEY;
+}
+
 static u32 bpf_map_value_size(const struct bpf_map *map)
 {
 	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
@@ -994,7 +999,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	/* Some maps allow key to be unspecified. */
 	if (btf_key_id) {
 		key_type = btf_type_id_size(btf, &btf_key_id, &key_size);
-		if (!key_type || key_size != map->key_size)
+		if (!key_type || key_size != map->key_size ||
+		    (is_dynptr_key(map) && !btf_type_is_bpf_dynptr(btf, key_type)))
 			return -EINVAL;
 	} else {
 		key_type = btf_type_by_id(btf, 0);
@@ -1089,9 +1095,16 @@ static int map_create(union bpf_attr *attr)
 		return -EINVAL;
 	}
 
-	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
-	    attr->map_extra != 0)
+	if (attr->map_flags & BPF_F_DYNPTR_KEY) {
+		/* The lower 32-bits of map_extra specifies the maximum size
+		 * of bpf_dynptr-typed key
+		 */
+		if (!attr->btf_key_type_id || !attr->map_extra || (attr->map_extra >> 32) ||
+		    bpf_dynptr_check_size(attr->map_extra))
+			return -EINVAL;
+	} else if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER && attr->map_extra != 0) {
 		return -EINVAL;
+	}
 
 	f_flags = bpf_get_file_flag(attr->map_flags);
 	if (f_flags < 0)
@@ -1280,8 +1293,39 @@ int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 	return -ENOTSUPP;
 }
 
-static void *__bpf_copy_key(void __user *ukey, u64 key_size)
+static void *bpf_copy_from_dynptr_ukey(bpfptr_t ukey)
 {
+	struct bpf_dynptr_kern *kptr;
+	struct bpf_dynptr_user uptr;
+	bpfptr_t data;
+
+	if (copy_from_bpfptr(&uptr, ukey, sizeof(uptr)))
+		return ERR_PTR(-EFAULT);
+
+	if (!uptr.size || bpf_dynptr_check_size(uptr.size))
+		return ERR_PTR(-EINVAL);
+
+	/* Allocate and free bpf_dynptr_kern and its data together */
+	kptr = kvmalloc(sizeof(*kptr) + uptr.size, GFP_USER | __GFP_NOWARN);
+	if (!kptr)
+		return ERR_PTR(-ENOMEM);
+
+	data = make_bpfptr(uptr.data, bpfptr_is_kernel(ukey));
+	if (copy_from_bpfptr(&kptr[1], data, uptr.size)) {
+		kvfree(kptr);
+		return ERR_PTR(-EFAULT);
+	}
+
+	bpf_dynptr_init(kptr, &kptr[1], BPF_DYNPTR_TYPE_USER, 0, uptr.size);
+
+	return kptr;
+}
+
+static void *__bpf_copy_key(void __user *ukey, u64 key_size, bool dynptr)
+{
+	if (dynptr)
+		return bpf_copy_from_dynptr_ukey(USER_BPFPTR(ukey));
+
 	if (key_size)
 		return vmemdup_user(ukey, key_size);
 
@@ -1291,8 +1335,11 @@ static void *__bpf_copy_key(void __user *ukey, u64 key_size)
 	return NULL;
 }
 
-static void *___bpf_copy_key(bpfptr_t ukey, u64 key_size)
+static void *___bpf_copy_key(bpfptr_t ukey, u64 key_size, bool dynptr)
 {
+	if (dynptr)
+		return bpf_copy_from_dynptr_ukey(ukey);
+
 	if (key_size)
 		return kvmemdup_bpfptr(ukey, key_size);
 
@@ -1302,6 +1349,34 @@ static void *___bpf_copy_key(bpfptr_t ukey, u64 key_size)
 	return NULL;
 }
 
+static void *bpf_new_dynptr_key(u32 key_size)
+{
+	struct bpf_dynptr_kern *kptr;
+
+	kptr = kvmalloc(sizeof(*kptr) + key_size, GFP_USER | __GFP_NOWARN);
+	if (kptr)
+		bpf_dynptr_init(kptr, &kptr[1], BPF_DYNPTR_TYPE_USER, 0, key_size);
+	return kptr;
+}
+
+static int bpf_copy_to_dynptr_ukey(struct bpf_dynptr_user __user *uptr,
+				   struct bpf_dynptr_kern *kptr)
+{
+	unsigned int size;
+	u64 udata;
+
+	if (get_user(udata, &uptr->data))
+		return -EFAULT;
+
+	/* Also zeroing the reserved field in uptr */
+	size = bpf_dynptr_get_size(kptr);
+	if (copy_to_user(u64_to_user_ptr(udata), kptr->data + kptr->offset, size) ||
+	    put_user(size, &uptr->size) || put_user(0, &uptr->size + 1))
+		return -EFAULT;
+
+	return 0;
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD flags
 
@@ -1337,7 +1412,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = __bpf_copy_key(ukey, map->key_size);
+	key = __bpf_copy_key(ukey, map->key_size, is_dynptr_key(map));
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1377,7 +1452,6 @@ static int map_lookup_elem(union bpf_attr *attr)
 	return err;
 }
 
-
 #define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
 
 static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
@@ -1410,7 +1484,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	key = ___bpf_copy_key(ukey, map->key_size);
+	key = ___bpf_copy_key(ukey, map->key_size, is_dynptr_key(map));
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1458,7 +1532,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	key = ___bpf_copy_key(ukey, map->key_size);
+	key = ___bpf_copy_key(ukey, map->key_size, is_dynptr_key(map));
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1514,7 +1588,7 @@ static int map_get_next_key(union bpf_attr *attr)
 	}
 
 	if (ukey) {
-		key = __bpf_copy_key(ukey, map->key_size);
+		key = __bpf_copy_key(ukey, map->key_size, is_dynptr_key(map));
 		if (IS_ERR(key)) {
 			err = PTR_ERR(key);
 			goto err_put;
@@ -1524,7 +1598,10 @@ static int map_get_next_key(union bpf_attr *attr)
 	}
 
 	err = -ENOMEM;
-	next_key = kvmalloc(map->key_size, GFP_USER);
+	if (is_dynptr_key(map))
+		next_key = bpf_new_dynptr_key(map->map_extra);
+	else
+		next_key = kvmalloc(map->key_size, GFP_USER | __GFP_NOWARN);
 	if (!next_key)
 		goto free_key;
 
@@ -1540,8 +1617,11 @@ static int map_get_next_key(union bpf_attr *attr)
 	if (err)
 		goto free_next_key;
 
-	err = -EFAULT;
-	if (copy_to_user(unext_key, next_key, map->key_size) != 0)
+	if (is_dynptr_key(map))
+		err = bpf_copy_to_dynptr_ukey(unext_key, next_key);
+	else
+		err = copy_to_user(unext_key, next_key, map->key_size) != 0 ? -EFAULT : 0;
+	if (err)
 		goto free_next_key;
 
 	err = 0;
@@ -1815,7 +1895,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = __bpf_copy_key(ukey, map->key_size);
+	key = __bpf_copy_key(ukey, map->key_size, is_dynptr_key(map));
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3df78c56c1bf..600c3fcee37a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1246,6 +1246,8 @@ enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+/* Map with bpf_dynptr-typed key */
+	BPF_F_DYNPTR_KEY	= (1U << 13),
 };
 
 /* Flags for BPF_PROG_QUERY. */
@@ -6775,6 +6777,12 @@ struct bpf_timer {
 	__u64 :64;
 } __attribute__((aligned(8)));
 
+struct bpf_dynptr_user {
+	__u64 data;
+	__u32 size;
+	__u32 :32;
+} __attribute__((aligned(8)));
+
 struct bpf_dynptr {
 	__u64 :64;
 	__u64 :64;
-- 
2.29.2



* [PATCH bpf-next 04/10] bpf: Support bpf_dynptr-typed map key for bpf program
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
                   ` (2 preceding siblings ...)
  2022-09-17 15:31 ` [PATCH bpf-next 03/10] bpf: Support bpf_dynptr-typed map key for bpf syscall Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 05/10] libbpf: Add helpers for bpf_dynptr_user Hou Tao
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

When a bpf program manipulates a BPF_F_DYNPTR_KEY-enabled map, only
allow it to use a bpf_dynptr as the map key.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/verifier.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8c6fbcd0afaf..169c0b3e8002 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6000,9 +6000,21 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "invalid map_ptr to access map->key\n");
 			return -EACCES;
 		}
-		err = check_helper_mem_access(env, regno,
-					      meta->map_ptr->key_size, false,
-					      NULL);
+		/* For BPF_F_DYNPTR_KEY-enabled map, only allow bpf_dynptr
+		 * to be used as map key
+		 */
+		if (meta->map_ptr->map_flags & BPF_F_DYNPTR_KEY) {
+			if (base_type(reg->type) != PTR_TO_STACK ||
+			    !is_dynptr_reg_valid_init(env, reg, ARG_PTR_TO_DYNPTR)) {
+				verbose(env, "expect R%d to be dynptr instead of %s\n",
+					regno, reg_type_str(env, reg->type));
+				return -EACCES;
+			}
+		} else {
+			err = check_helper_mem_access(env, regno,
+						      meta->map_ptr->key_size, false,
+						      NULL);
+		}
 		break;
 	case ARG_PTR_TO_MAP_VALUE:
 		if (type_may_be_null(arg_type) && register_is_null(reg))
-- 
2.29.2



* [PATCH bpf-next 05/10] libbpf: Add helpers for bpf_dynptr_user
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
                   ` (3 preceding siblings ...)
  2022-09-17 15:31 ` [PATCH bpf-next 04/10] bpf: Support bpf_dynptr-typed map key for bpf program Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 06/10] bpf: Add support for qp-trie map Hou Tao
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add bpf_dynptr_user_init() to initialize a bpf_dynptr_user, and
bpf_dynptr_user_get_{data,size}() to get the buffer address and buffer
length of a bpf_dynptr_user.

Instead of exporting these symbols, simply add these helpers as inline
functions.
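
A short usage sketch (map_fd, the key and the value below are only
examples):

  struct bpf_dynptr_user dkey;
  char name[] = "abc";
  __u32 value = 1;

  bpf_dynptr_user_init(name, strlen(name), &dkey);
  if (bpf_map_update_elem(map_fd, &dkey, &value, BPF_ANY))
          /* handle error */;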

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/lib/bpf/bpf.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9c50beabdd14..cd3e9dbaffa1 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -371,6 +371,25 @@ LIBBPF_API int bpf_btf_get_fd_by_id(__u32 id);
 LIBBPF_API int bpf_link_get_fd_by_id(__u32 id);
 LIBBPF_API int bpf_obj_get_info_by_fd(int bpf_fd, void *info, __u32 *info_len);
 
+/* sys_bpf() will check the validity of size */
+static inline void bpf_dynptr_user_init(void *data, __u32 size, struct bpf_dynptr_user *dynptr)
+{
+	/* Zero padding bytes */
+	memset(dynptr, 0, sizeof(*dynptr));
+	dynptr->data = (__u64)(unsigned long)data;
+	dynptr->size = size;
+}
+
+static inline __u32 bpf_dynptr_user_get_size(const struct bpf_dynptr_user *dynptr)
+{
+	return dynptr->size;
+}
+
+static inline void *bpf_dynptr_user_get_data(const struct bpf_dynptr_user *dynptr)
+{
+	return (void *)(unsigned long)dynptr->data;
+}
+
 struct bpf_prog_query_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
 	__u32 query_flags;
-- 
2.29.2



* [PATCH bpf-next 06/10] bpf: Add support for qp-trie map
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
                   ` (4 preceding siblings ...)
  2022-09-17 15:31 ` [PATCH bpf-next 05/10] libbpf: Add helpers for bpf_dynptr_user Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 07/10] selftests/bpf: Add two new dynptr_fail cases for map key Hou Tao
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

The initial motivation for the qp-trie map is to reduce memory usage
for string keys, especially those with large differences in length.
Moreover, as a big-endian lexicographically ordered map, qp-trie can
also be used for any binary data of fixed or variable length.

The memory efficiency of qp-trie comes partly from its design, which
doesn't store the key in branch nodes and uses a sparse array to store
child nodes, and partly from the support for bpf_dynptr-typed keys:
only the used part of the key is saved.

But the memory efficiency and ordered keys come at a cost: for strings
(e.g. symbol names in /proc/kallsyms) the lookup performance of qp-trie
is about 30% or more slower compared with a hash table. However, lookup
performance is not always worse than a hash table: for a randomly
generated binary data set with large differences in length, the lookup
performance of qp-trie is twice as good as that of the hash table, as
shown in the benchmarks in the cover letter.
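
The sparse-array scheme mentioned above is just a bitmap plus
popcount; a simplified extract of what calc_twig_index() below does:

  /* A branch node keeps a 17-bit bitmap of present children and a
   * compact nodes[] array; the slot of a child is the number of
   * present children with a lower bit in the bitmap.
   */
  static unsigned int child_slot(unsigned int bitmap, unsigned int child_bit)
  {
          return hweight32(bitmap & (child_bit - 1));
  }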

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_types.h      |    1 +
 include/uapi/linux/bpf.h       |    1 +
 kernel/bpf/Makefile            |    1 +
 kernel/bpf/bpf_qp_trie.c       | 1055 ++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |    1 +
 5 files changed, 1059 insertions(+)
 create mode 100644 kernel/bpf/bpf_qp_trie.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b9112b80171..a73d47f83d74 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -126,6 +126,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_QP_TRIE, qp_trie_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 77a2828f8148..e35e70b5cf0d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -928,6 +928,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_QP_TRIE,
 };
 
 /* Note that tracing related programs such as
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..8419f44fea50 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_i
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_qp_trie.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
diff --git a/kernel/bpf/bpf_qp_trie.c b/kernel/bpf/bpf_qp_trie.c
new file mode 100644
index 000000000000..4f56f69360ce
--- /dev/null
+++ b/kernel/bpf/bpf_qp_trie.c
@@ -0,0 +1,1055 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Derived from qp.c in https://github.com/fanf2/qp.git
+ *
+ * Copyright (C) 2022. Huawei Technologies Co., Ltd
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+
+/* qp-trie (quadbit popcount trie) is a memory-efficient trie. Unlike a
+ * normal trie, which uses a whole byte as the lookup unit, qp-trie
+ * interprets its keys as a nibble (quadbit) array and consumes one nibble per step.
+ * The most significant nibble (upper nibble) of byte N in the key is
+ * element 2*N of the nibble array, and the least significant nibble
+ * (lower nibble) of byte N is element 2*N+1 of the nibble array.
+ *
+ * A branch node in a normal trie may have up to 256 children, while a
+ * qp-trie branch node may have up to 17. Child #0 is special because it
+ * must be a leaf node and its key is the same as the branch node's prefix.
+ * Children #1~#16 are leaf nodes or branch nodes whose keys differ from
+ * the parent node. The key of a branch node is the common prefix of these
+ * child nodes, and the index of a child node minus one is the value of the
+ * first nibble in which these child nodes differ.
+ *
+ * qp-trie reduces memory usage through two methods:
+ * (1) Branch node doesn't store the key. It only stores the position of
+ *     the first nibble which differentiates child nodes.
+ * (2) Branch node doesn't store all 17 child nodes. It uses a bitmap and
+ *     popcount() to implement a sparse array and only allocates memory
+ *     for those present children.
+ *
+ * Like a normal trie, qp-trie is also ordered: keys are kept in big-endian
+ * lexicographic order. Traversing the qp-trie in a depth-first way returns
+ * the keys in sorted order.
+ *
+ * The following diagrams show the construction of a tiny qp-trie:
+ *
+ * (1) insert abc
+ *
+ *          [ leaf node: abc ]
+ *
+ * (2) insert abc_d
+ *
+ * The first different nibble between "abc" and "abc_d" is the upper nibble
+ * of character '_' (0x5), and its position in nibble array is 6
+ * (starts from 0).
+ *
+ *          [ branch node ] bitmap: 0x41 diff pos: 6
+ *                 |
+ *                 *
+ *             children
+ *          [0]        [6]
+ *           |          |
+ *       [leaf: abc] [leaf: abc_d]
+ *
+ * (3) insert abc_e
+ *
+ * The first different nibble between "abc_d" and "abc_e" is the lower
+ * nibble of character 'd'/'e', and its position in array is 9.
+ *
+ *          [ branch node ] bitmap: 0x41 diff pos: 6
+ *                 |
+ *                 *
+ *             children
+ *          [0]        [6]
+ *           |          |
+ *       [leaf: abc]    |
+ *                      *
+ *                [ branch node ] bitmap: 0x60 diff pos: 9
+ *                      |
+ *                      *
+ *                   children
+ *                [5]        [6]
+ *                 |          |
+ *          [leaf: abc_d]  [leaf: abc_e]
+ */
+
+#define QP_TRIE_MANDATORY_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_DYNPTR_KEY)
+#define QP_TRIE_CREATE_FLAG_MASK (QP_TRIE_MANDATORY_FLAG_MASK | BPF_F_NUMA_NODE | \
+				  BPF_F_ACCESS_MASK)
+
+/* bit[0] of nodes in qp_trie_branch is used to tell node type:
+ *
+ * bit[0]: 0-branch node
+ * bit[0]: 1-leaf node
+ *
+ * The size of qp_trie_branch is already 2-byte aligned, so we only need to
+ * make the allocation of leaf nodes 2-byte aligned.
+ */
+#define QP_TRIE_LEAF_NODE_MASK 1UL
+#define QP_TRIE_LEAF_ALLOC_ALIGN 2
+
+/* To reduce memory usage, only qp_trie_branch is RCU-freed. To handle
+ * freeing of the last leaf node, an extra qp_trie_branch node is
+ * allocated. The branch node has only one child and its index is 0. It
+ * is set as root node after adding the first leaf node.
+ */
+#define QP_TRIE_ROOT_NODE_INDEX 0
+#define QP_TRIE_NON_ROOT_NODE_MASK 1
+
+#define QP_TRIE_NIBBLE_SHIFT 1
+#define QP_TRIE_BYTE_INDEX_SHIFT 2
+
+#define QP_TRIE_TWIGS_FREE_NONE_IDX 17
+
+struct qp_trie_branch {
+	/* The bottom two bits of index are used as special flags:
+	 *
+	 * bit[0]: 0-root, 1-not root
+	 * bit[1]: 0-upper nibble, 1-lower nibble
+	 *
+	 * bit[2:31]: byte index for key
+	 */
+	unsigned int index;
+	/* 17 bits are used to accommodate arbitrary keys, even when there are
+	 * zero-bytes in these keys.
+	 *
+	 * bit[0]: a leaf node has the same key as the prefix of parent node
+	 * bit[N]: a child node with the value of nibble at index as (N - 1)
+	 */
+	unsigned int bitmap:17;
+	/* The index of leaf node will be RCU-freed together */
+	unsigned int to_free_idx:5;
+	struct qp_trie_branch __rcu *parent;
+	struct rcu_head rcu;
+	void __rcu *nodes[0];
+};
+
+#define QP_TRIE_NR_SUBTREE 256
+
+struct qp_trie {
+	struct bpf_map map;
+	atomic_t entries;
+	void __rcu *roots[QP_TRIE_NR_SUBTREE];
+	spinlock_t locks[QP_TRIE_NR_SUBTREE];
+};
+
+/* Internally use qp_trie_key instead of bpf_dynptr_kern
+ * to reduce memory usage
+ */
+struct qp_trie_key {
+	/* the length of blob data */
+	unsigned int len;
+	/* blob data */
+	unsigned char data[0];
+};
+
+struct qp_trie_diff {
+	unsigned int index;
+	unsigned int sibling_bm;
+	unsigned int new_bm;
+};
+
+static inline void *to_child_node(const struct qp_trie_key *key)
+{
+	return (void *)((long)key | QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline struct qp_trie_key *to_leaf_node(void *node)
+{
+	return (void *)((long)node & ~QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline bool is_branch_node(void *node)
+{
+	return !((long)node & QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline bool is_same_key(const struct qp_trie_key *k, const unsigned char *data,
+			       unsigned int len)
+{
+	return k->len == len && !memcmp(k->data, data, len);
+}
+
+static inline void *qp_trie_leaf_value(const struct qp_trie_key *key)
+{
+	return (void *)key + sizeof(*key) + key->len;
+}
+
+static inline unsigned int calc_twig_index(unsigned int mask, unsigned int bitmap)
+{
+	return hweight32(mask & (bitmap - 1));
+}
+
+static inline unsigned int calc_twig_nr(unsigned int bitmap)
+{
+	return hweight32(bitmap);
+}
+
+static inline unsigned int nibble_to_bitmap(unsigned char nibble)
+{
+	return 1U << (nibble + 1);
+}
+
+static inline unsigned int index_to_byte_index(unsigned int index)
+{
+	return index >> QP_TRIE_BYTE_INDEX_SHIFT;
+}
+
+static inline unsigned int calc_br_bitmap(unsigned int index, const unsigned char *data,
+					  unsigned int len)
+{
+	unsigned int byte;
+	unsigned char nibble;
+
+	if (index == QP_TRIE_ROOT_NODE_INDEX)
+		return 1;
+
+	byte = index_to_byte_index(index);
+	if (byte >= len)
+		return 1;
+
+	nibble = data[byte];
+	/* lower nibble */
+	if ((index >> QP_TRIE_NIBBLE_SHIFT) & 1)
+		nibble &= 0xf;
+	else
+		nibble >>= 4;
+	return nibble_to_bitmap(nibble);
+}
+
+static void qp_trie_free_twigs_rcu(struct rcu_head *rcu)
+{
+	struct qp_trie_branch *twigs = container_of(rcu, struct qp_trie_branch, rcu);
+	unsigned int idx = twigs->to_free_idx;
+
+	if (idx != QP_TRIE_TWIGS_FREE_NONE_IDX)
+		kfree(to_leaf_node(rcu_access_pointer(twigs->nodes[idx])));
+	kfree(twigs);
+}
+
+static void qp_trie_branch_free(struct qp_trie_branch *twigs, unsigned int to_free_idx)
+{
+	twigs->to_free_idx = to_free_idx;
+	call_rcu(&twigs->rcu, qp_trie_free_twigs_rcu);
+}
+
+static inline struct qp_trie_branch *
+qp_trie_branch_new(struct bpf_map *map, unsigned int nr)
+{
+	struct qp_trie_branch *a;
+
+	a = bpf_map_kmalloc_node(map, sizeof(*a) + nr * sizeof(*a->nodes),
+				 GFP_NOWAIT | __GFP_NOWARN, map->numa_node);
+	return a;
+}
+
+static inline void qp_trie_assign_parent(struct qp_trie_branch *parent, void *node)
+{
+	if (is_branch_node(node))
+		rcu_assign_pointer(((struct qp_trie_branch *)node)->parent, parent);
+}
+
+static void qp_trie_update_parent(struct qp_trie_branch *parent, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++)
+		qp_trie_assign_parent(parent, rcu_dereference_protected(parent->nodes[i], 1));
+}
+
+/* new_node can be either a leaf node or a branch node */
+static struct qp_trie_branch *
+qp_trie_branch_replace(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap,
+		       void *new_node)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap);
+	unsigned int p = calc_twig_index(old->bitmap, bitmap);
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr);
+	if (!twigs)
+		return NULL;
+
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+
+	rcu_assign_pointer(twigs->nodes[p], new_node);
+
+	if (nr - 1 > p)
+		memcpy(&twigs->nodes[p+1], &old->nodes[p+1], (nr - 1 - p) * sizeof(*twigs->nodes));
+
+	twigs->index = old->index;
+	twigs->bitmap = old->bitmap;
+	/* twigs will not be visible to reader until rcu_assign_pointer(), so
+	 * use RCU_INIT_POINTER() here.
+	 */
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	/* Initialize ->parent of parent node first, then update ->parent for
+	 * child nodes after parent node is fully initialized.
+	 */
+	qp_trie_update_parent(twigs, nr);
+
+	return twigs;
+}
+
+static struct qp_trie_branch *
+qp_trie_branch_insert(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap,
+		      const struct qp_trie_key *new)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap);
+	unsigned int p = calc_twig_index(old->bitmap, bitmap);
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr + 1);
+	if (!twigs)
+		return NULL;
+
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+
+	rcu_assign_pointer(twigs->nodes[p], to_child_node(new));
+
+	if (nr > p)
+		memcpy(&twigs->nodes[p+1], &old->nodes[p], (nr - p) * sizeof(*twigs->nodes));
+
+	twigs->bitmap = old->bitmap | bitmap;
+	twigs->index = old->index;
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	qp_trie_update_parent(twigs, nr + 1);
+
+	return twigs;
+}
+
+static struct qp_trie_branch *
+qp_trie_branch_remove(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap);
+	unsigned int p = calc_twig_index(old->bitmap, bitmap);
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr - 1);
+	if (!twigs)
+		return NULL;
+
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+	if (nr - 1 > p)
+		memcpy(&twigs->nodes[p], &old->nodes[p+1], (nr - 1 - p) * sizeof(*twigs->nodes));
+
+	twigs->bitmap = old->bitmap & ~bitmap;
+	twigs->index = old->index;
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	qp_trie_update_parent(twigs, nr - 1);
+
+	return twigs;
+}
+
+static struct qp_trie_key *
+qp_trie_init_leaf_node(struct bpf_map *map, const struct bpf_dynptr_kern *k, void *v)
+{
+	unsigned int key_size, total;
+	struct qp_trie_key *new;
+
+	key_size = bpf_dynptr_get_size(k);
+	if (!key_size || key_size > (u32)map->map_extra)
+		return ERR_PTR(-EINVAL);
+
+	total = round_up(sizeof(*new) + key_size + map->value_size, QP_TRIE_LEAF_ALLOC_ALIGN);
+	new = bpf_map_kmalloc_node(map, total, GFP_NOWAIT | __GFP_NOWARN, map->numa_node);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	new->len = key_size;
+	memcpy(new->data, k->data + k->offset, key_size);
+	memcpy((void *)&new[1] + key_size, v, map->value_size);
+
+	return new;
+}
+
+static bool calc_prefix_len(const struct qp_trie_key *s_key, const struct qp_trie_key *n_key,
+			    unsigned int *index)
+{
+	unsigned int i, len = min(s_key->len, n_key->len);
+	unsigned char diff = 0;
+
+	for (i = 0; i < len; i++) {
+		diff = s_key->data[i] ^ n_key->data[i];
+		if (diff)
+			break;
+	}
+
+	*index = (i << QP_TRIE_BYTE_INDEX_SHIFT) | QP_TRIE_NON_ROOT_NODE_MASK;
+	if (!diff)
+		return s_key->len == n_key->len;
+
+	*index += (diff & 0xf0) ? 0 : (1U << QP_TRIE_NIBBLE_SHIFT);
+	return false;
+}
+
+static int qp_trie_new_branch(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+			      unsigned int bitmap, void *sibling, struct qp_trie_diff *d,
+			      const struct qp_trie_key *leaf)
+{
+	struct qp_trie_branch *new_child_twigs, *new_twigs, *old_twigs;
+	struct bpf_map *map;
+	unsigned int iip;
+	int err;
+
+	map = &trie->map;
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	new_child_twigs = qp_trie_branch_new(map, 2);
+	if (!new_child_twigs) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+
+	new_child_twigs->index = d->index;
+	new_child_twigs->bitmap = d->sibling_bm | d->new_bm;
+
+	iip = calc_twig_index(new_child_twigs->bitmap, d->sibling_bm);
+	RCU_INIT_POINTER(new_child_twigs->nodes[iip], sibling);
+	rcu_assign_pointer(new_child_twigs->nodes[!iip], to_child_node(leaf));
+	RCU_INIT_POINTER(new_child_twigs->parent, NULL);
+
+	old_twigs = rcu_dereference_protected(*parent, 1);
+	new_twigs = qp_trie_branch_replace(map, old_twigs, bitmap, new_child_twigs);
+	if (!new_twigs) {
+		err = -ENOMEM;
+		goto free_child_twigs;
+	}
+
+	qp_trie_assign_parent(new_child_twigs, sibling);
+	rcu_assign_pointer(*parent, new_twigs);
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+
+	return 0;
+
+free_child_twigs:
+	kfree(new_child_twigs);
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_ext_branch(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+			      const struct qp_trie_key *new, unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map;
+	int err;
+
+	map = &trie->map;
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	old_twigs = rcu_dereference_protected(*parent, 1);
+	new_twigs = qp_trie_branch_insert(map, old_twigs, bitmap, new);
+	if (!new_twigs) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+
+	rcu_assign_pointer(*parent, new_twigs);
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+
+	return 0;
+
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_add_leaf_node(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+				 const struct qp_trie_key *new)
+{
+	struct bpf_map *map = &trie->map;
+	struct qp_trie_branch *twigs;
+	int err;
+
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	twigs = qp_trie_branch_new(map, 1);
+	if (!twigs) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+	twigs->index = QP_TRIE_ROOT_NODE_INDEX;
+	twigs->bitmap = 1;
+	RCU_INIT_POINTER(twigs->parent, NULL);
+	rcu_assign_pointer(twigs->nodes[0], to_child_node(new));
+
+	rcu_assign_pointer(*parent, twigs);
+
+	return 0;
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_rep_leaf_node(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+				 const struct qp_trie_key *new, unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map = &trie->map;
+
+	/* Only branch node is freed by RCU, so replace the old branch node
+	 * and free the old leaf node together with the old branch node.
+	 */
+	old_twigs = rcu_dereference_protected(*parent, 1);
+	new_twigs = qp_trie_branch_replace(map, old_twigs, bitmap, to_child_node(new));
+	if (!new_twigs)
+		return -ENOMEM;
+
+	rcu_assign_pointer(*parent, new_twigs);
+
+	qp_trie_branch_free(old_twigs, calc_twig_index(old_twigs->bitmap, bitmap));
+
+	return 0;
+}
+
+static int qp_trie_remove_leaf(struct qp_trie *trie, struct qp_trie_branch __rcu **parent,
+			       unsigned int bitmap, const struct qp_trie_key *node)
+{
+	struct bpf_map *map = &trie->map;
+	struct qp_trie_branch *new, *old;
+	unsigned int nr;
+
+	old = rcu_dereference_protected(*parent, 1);
+	nr = calc_twig_nr(old->bitmap);
+	if (nr > 2) {
+		new = qp_trie_branch_remove(map, old, bitmap);
+		if (!new)
+			return -ENOMEM;
+	} else {
+		new = NULL;
+	}
+
+	rcu_assign_pointer(*parent, new);
+
+	qp_trie_branch_free(old, calc_twig_index(old->bitmap, bitmap));
+
+	atomic_dec(&trie->entries);
+
+	return 0;
+}
+
+static int qp_trie_merge_node(struct qp_trie *trie, struct qp_trie_branch __rcu **grand_parent,
+			      struct qp_trie_branch *parent, unsigned int parent_bitmap,
+			      unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map = &trie->map;
+	void *new_sibling;
+	unsigned int iip;
+
+	iip = calc_twig_index(parent->bitmap, bitmap);
+	new_sibling = rcu_dereference_protected(parent->nodes[!iip], 1);
+
+	old_twigs = rcu_dereference_protected(*grand_parent, 1);
+	new_twigs = qp_trie_branch_replace(map, old_twigs, parent_bitmap, new_sibling);
+	if (!new_twigs)
+		return -ENOMEM;
+
+	rcu_assign_pointer(*grand_parent, new_twigs);
+
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+	qp_trie_branch_free(parent, iip);
+
+	atomic_dec(&trie->entries);
+
+	return 0;
+}
+
+static int qp_trie_alloc_check(union bpf_attr *attr)
+{
+	if (!bpf_capable())
+		return -EPERM;
+
+	if ((attr->map_flags & QP_TRIE_MANDATORY_FLAG_MASK) != QP_TRIE_MANDATORY_FLAG_MASK ||
+	    attr->map_flags & ~QP_TRIE_CREATE_FLAG_MASK ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return -EINVAL;
+
+	if (!attr->max_entries || !attr->value_size)
+		return -EINVAL;
+
+	/* Key and value are allocated together in qp_trie_init_leaf_node() */
+	if (round_up((u64)sizeof(struct qp_trie_key) + (u32)attr->map_extra + attr->value_size,
+		     QP_TRIE_LEAF_ALLOC_ALIGN) >= KMALLOC_MAX_SIZE)
+		return -E2BIG;
+
+	return 0;
+}
+
+static struct bpf_map *qp_trie_alloc(union bpf_attr *attr)
+{
+	struct qp_trie *trie;
+	unsigned int i;
+
+	trie = bpf_map_area_alloc(sizeof(*trie), bpf_map_attr_numa_node(attr));
+	if (!trie)
+		return ERR_PTR(-ENOMEM);
+
+	/* roots are zeroed by bpf_map_area_alloc() */
+	for (i = 0; i < QP_TRIE_NR_SUBTREE; i++)
+		spin_lock_init(&trie->locks[i]);
+
+	atomic_set(&trie->entries, 0);
+	bpf_map_init_from_attr(&trie->map, attr);
+
+	return &trie->map;
+}
+
+static void qp_trie_free_subtree(void *root)
+{
+	struct qp_trie_branch *parent = NULL;
+	struct qp_trie_key *cur = NULL;
+	void *node = root;
+
+	/*
+	 * Depth-first deletion
+	 *
+	 * 1. find left-most key and its parent
+	 * 2. get next sibling Y from parent
+	 * (a) Y is leaf node: continue
+	 * (b) Y is branch node: goto step 1
+	 * (c) no more sibling: backtrack upwards if parent is not NULL and
+	 *     goto step 1
+	 */
+	do {
+		while (is_branch_node(node)) {
+			parent = node;
+			node = rcu_dereference_raw(parent->nodes[0]);
+		}
+
+		cur = to_leaf_node(node);
+		while (parent) {
+			unsigned int iip, bitmap, nr;
+			void *ancestor;
+
+			bitmap = calc_br_bitmap(parent->index, cur->data, cur->len);
+			iip = calc_twig_index(parent->bitmap, bitmap) + 1;
+			nr = calc_twig_nr(parent->bitmap);
+
+			for (; iip < nr; iip++) {
+				kfree(cur);
+
+				node = rcu_dereference_raw(parent->nodes[iip]);
+				if (is_branch_node(node))
+					break;
+
+				cur = to_leaf_node(node);
+			}
+			if (iip < nr)
+				break;
+
+			ancestor = rcu_dereference_raw(parent->parent);
+			kfree(parent);
+			parent = ancestor;
+		}
+	} while (parent);
+
+	kfree(cur);
+}
+
+static void qp_trie_free(struct bpf_map *map)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	unsigned int i;
+
+	/* Wait for the pending qp_trie_free_twigs_rcu() */
+	rcu_barrier();
+
+	for (i = 0; i < ARRAY_SIZE(trie->roots); i++) {
+		void *root = rcu_dereference_raw(trie->roots[i]);
+
+		if (root)
+			qp_trie_free_subtree(root);
+	}
+	bpf_map_area_free(trie);
+}
+
+static inline void qp_trie_copy_leaf(const struct qp_trie_key *leaf, struct bpf_dynptr_kern *key)
+{
+	memcpy(key->data + key->offset, leaf->data, leaf->len);
+	bpf_dynptr_set_size(key, leaf->len);
+}
+
+static void qp_trie_copy_min_key_from(void *root, struct bpf_dynptr_kern *key)
+{
+	void *node;
+
+	node = root;
+	while (is_branch_node(node))
+		node = rcu_dereference(((struct qp_trie_branch *)node)->nodes[0]);
+
+	qp_trie_copy_leaf(to_leaf_node(node), key);
+}
+
+static int qp_trie_lookup_min_key(struct qp_trie *trie, unsigned int from,
+				  struct bpf_dynptr_kern *key)
+{
+	unsigned int i;
+
+	for (i = from; i < ARRAY_SIZE(trie->roots); i++) {
+		void *root = rcu_dereference(trie->roots[i]);
+
+		if (root) {
+			qp_trie_copy_min_key_from(root, key);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static int qp_trie_next_twigs_index(struct qp_trie_branch *twigs, unsigned int bitmap)
+{
+	unsigned int idx, nr, next;
+
+	/* bitmap may not be set in twigs->bitmap */
+	idx = calc_twig_index(twigs->bitmap, bitmap);
+	nr = calc_twig_nr(twigs->bitmap);
+
+	next = idx;
+	if (twigs->bitmap & bitmap)
+		next += 1;
+
+	if (next >= nr)
+		return -1;
+	return next;
+}
+
+static int qp_trie_lookup_next_node(struct qp_trie *trie, const struct bpf_dynptr_kern *key,
+				    struct bpf_dynptr_kern *next_key)
+{
+	const struct qp_trie_key *found;
+	struct qp_trie_branch *parent;
+	const unsigned char *data;
+	unsigned int data_len;
+	void *node, *next;
+
+	/* Non-existent key, so restart from the beginning */
+	data = key->data + key->offset;
+	node = rcu_dereference(trie->roots[*data]);
+	if (!node)
+		return qp_trie_lookup_min_key(trie, 0, next_key);
+
+	parent = NULL;
+	data_len = bpf_dynptr_get_size(key);
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int iip, bitmap;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (bitmap & br->bitmap)
+			iip = calc_twig_index(br->bitmap, bitmap);
+		else
+			iip = 0;
+
+		parent = br;
+		node = rcu_dereference(br->nodes[iip]);
+	}
+	found = to_leaf_node(node);
+	if (!is_same_key(found, data, data_len))
+		return qp_trie_lookup_min_key(trie, 0, next_key);
+
+	/* Pair with store release in rcu_assign_pointer(*parent, twigs) to
+	 * ensure reading node->parent will not return the old parent if
+	 * the node is found by following the newly-created parent.
+	 */
+	smp_rmb();
+
+	next = NULL;
+	while (parent) {
+		unsigned int bitmap;
+		int next_idx;
+
+		bitmap = calc_br_bitmap(parent->index, data, data_len);
+		next_idx = qp_trie_next_twigs_index(parent, bitmap);
+		if (next_idx >= 0) {
+			next = rcu_dereference(parent->nodes[next_idx]);
+			break;
+		}
+		parent = rcu_dereference(parent->parent);
+	}
+
+	/* Goto next sub-tree */
+	if (!next)
+		return qp_trie_lookup_min_key(trie, *data + 1, next_key);
+
+	if (!is_branch_node(next))
+		qp_trie_copy_leaf(to_leaf_node(next), next_key);
+	else
+		qp_trie_copy_min_key_from(next, next_key);
+
+	return 0;
+}
+
+/* Called from syscall */
+static int qp_trie_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	int err;
+
+	if (!key)
+		err = qp_trie_lookup_min_key(trie, 0, next_key);
+	else
+		err = qp_trie_lookup_next_node(trie, key, next_key);
+	return err;
+}
+
+/* Called from syscall or from eBPF program */
+static void *qp_trie_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	const struct bpf_dynptr_kern *dynptr_key = key;
+	const struct qp_trie_key *found;
+	const unsigned char *data;
+	unsigned int data_len;
+	void *node, *value;
+
+	/* Dynptr with zero length is possible, but is invalid for qp-trie */
+	data_len = bpf_dynptr_get_size(dynptr_key);
+	if (!data_len)
+		return NULL;
+
+	data = dynptr_key->data + dynptr_key->offset;
+	node = rcu_dereference_check(trie->roots[*data], rcu_read_lock_bh_held());
+	if (!node)
+		return NULL;
+
+	value = NULL;
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int bitmap;
+		unsigned int iip;
+
+		/* When the byte index equals the key length, the target key
+		 * may be in twigs->nodes[0].
+		 */
+		if (index_to_byte_index(br->index) > data_len)
+			goto done;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto done;
+
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
+	}
+
+	found = to_leaf_node(node);
+	if (is_same_key(found, data, data_len))
+		value = qp_trie_leaf_value(found);
+done:
+	return value;
+}
+
+/* Called from syscall or from eBPF program */
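+/* Update is done in two passes under the per-sub-tree spinlock: the
+ * first descent finds the leaf sharing the longest prefix with the new
+ * key and computes the first differing index, the second descent finds
+ * the branch at which the new leaf (or a brand-new branch node) must be
+ * inserted.
+ */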
+static int qp_trie_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	const struct qp_trie_key *leaf_key, *new_key;
+	struct qp_trie_branch __rcu **parent;
+	struct qp_trie_diff d;
+	unsigned int bitmap;
+	void __rcu **node;
+	spinlock_t *lock;
+	unsigned char c;
+	bool equal;
+	int err;
+
+	if (flags > BPF_EXIST)
+		return -EINVAL;
+
+	/* The content of key may change, so copy it first */
+	new_key = qp_trie_init_leaf_node(map, key, value);
+	if (IS_ERR(new_key))
+		return PTR_ERR(new_key);
+
+	c = new_key->data[0];
+	lock = &trie->locks[c];
+	spin_lock(lock);
+	parent = (struct qp_trie_branch __rcu **)&trie->roots[c];
+	if (!rcu_dereference_protected(*parent, 1)) {
+		if (flags == BPF_EXIST) {
+			err = -ENOENT;
+			goto unlock;
+		}
+		err = qp_trie_add_leaf_node(trie, parent, new_key);
+		goto unlock;
+	}
+
+	bitmap = 1;
+	node = &rcu_dereference_protected(*parent, 1)->nodes[0];
+	while (is_branch_node(rcu_dereference_protected(*node, 1))) {
+		struct qp_trie_branch *br = rcu_dereference_protected(*node, 1);
+		unsigned int iip;
+
+		bitmap = calc_br_bitmap(br->index, new_key->data, new_key->len);
+		if (bitmap & br->bitmap)
+			iip = calc_twig_index(br->bitmap, bitmap);
+		else
+			iip = 0;
+		parent = (struct qp_trie_branch __rcu **)node;
+		node = &br->nodes[iip];
+	}
+
+	leaf_key = to_leaf_node(rcu_dereference_protected(*node, 1));
+	equal = calc_prefix_len(leaf_key, new_key, &d.index);
+	if (equal) {
+		if (flags == BPF_NOEXIST) {
+			err = -EEXIST;
+			goto unlock;
+		}
+		err = qp_trie_rep_leaf_node(trie, parent, new_key, bitmap);
+		goto unlock;
+	}
+
+	d.sibling_bm = calc_br_bitmap(d.index, leaf_key->data, leaf_key->len);
+	d.new_bm = calc_br_bitmap(d.index, new_key->data, new_key->len);
+
+	bitmap = 1;
+	parent = (struct qp_trie_branch __rcu **)&trie->roots[c];
+	node = &rcu_dereference_protected(*parent, 1)->nodes[0];
+	while (is_branch_node(rcu_dereference_protected(*node, 1))) {
+		struct qp_trie_branch *br = rcu_dereference_protected(*node, 1);
+		unsigned int iip;
+
+		if (d.index < br->index)
+			goto new_branch;
+
+		parent = (struct qp_trie_branch __rcu **)node;
+		if (d.index == br->index) {
+			if (flags == BPF_EXIST) {
+				err = -ENOENT;
+				goto unlock;
+			}
+			err = qp_trie_ext_branch(trie, parent, new_key, d.new_bm);
+			goto unlock;
+		}
+
+		bitmap = calc_br_bitmap(br->index, new_key->data, new_key->len);
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = &br->nodes[iip];
+	}
+
+new_branch:
+	if (flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto unlock;
+	}
+	err = qp_trie_new_branch(trie, parent, bitmap, rcu_dereference_protected(*node, 1),
+				 &d, new_key);
+unlock:
+	spin_unlock(lock);
+	if (err)
+		kfree(new_key);
+	return err;
+}
+
+/* Called from syscall or from eBPF program */
+static int qp_trie_delete_elem(struct bpf_map *map, void *key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	unsigned int bitmap, parent_bitmap, data_len, nr;
+	struct qp_trie_branch __rcu **parent, **grand_parent;
+	const struct bpf_dynptr_kern *dynptr_key;
+	const struct qp_trie_key *found;
+	const unsigned char *data;
+	void __rcu **node;
+	spinlock_t *lock;
+	unsigned char c;
+	int err;
+
+	dynptr_key = key;
+	data_len = bpf_dynptr_get_size(dynptr_key);
+	if (!data_len)
+		return -EINVAL;
+
+	err = -ENOENT;
+	data = dynptr_key->data + dynptr_key->offset;
+	c = *data;
+	lock = &trie->locks[c];
+	spin_lock(lock);
+	parent = (struct qp_trie_branch __rcu **)&trie->roots[c];
+	if (!*parent)
+		goto unlock;
+
+	grand_parent = NULL;
+	parent_bitmap = bitmap = 1;
+	node = &rcu_dereference_protected(*parent, 1)->nodes[0];
+	while (is_branch_node(rcu_dereference_protected(*node, 1))) {
+		struct qp_trie_branch *br = rcu_dereference_protected(*node, 1);
+		unsigned int iip;
+
+		if (index_to_byte_index(br->index) > data_len)
+			goto unlock;
+
+		parent_bitmap = bitmap;
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto unlock;
+
+		grand_parent = parent;
+		parent = (struct qp_trie_branch __rcu **)node;
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = &br->nodes[iip];
+	}
+
+	found = to_leaf_node(rcu_dereference_protected(*node, 1));
+	if (!is_same_key(found, data, data_len))
+		goto unlock;
+
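+	/* If removing the leaf would leave the parent branch with a single
+	 * twig, merge the remaining sibling into the grand-parent instead
+	 * of keeping a one-way branch node.
+	 */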
+	nr = calc_twig_nr(rcu_dereference_protected(*parent, 1)->bitmap);
+	if (nr != 2)
+		err = qp_trie_remove_leaf(trie, parent, bitmap, found);
+	else
+		err = qp_trie_merge_node(trie, grand_parent, rcu_dereference_protected(*parent, 1),
+					 parent_bitmap, bitmap);
+unlock:
+	spin_unlock(lock);
+	return err;
+}
+
+static int qp_trie_check_btf(const struct bpf_map *map,
+			     const struct btf *btf,
+			     const struct btf_type *key_type,
+			     const struct btf_type *value_type)
+{
+	return 0;
+}
+
+BTF_ID_LIST_SINGLE(qp_trie_map_btf_ids, struct, qp_trie)
+const struct bpf_map_ops qp_trie_map_ops = {
+	.map_alloc_check = qp_trie_alloc_check,
+	.map_alloc = qp_trie_alloc,
+	.map_free = qp_trie_free,
+	.map_get_next_key = qp_trie_get_next_key,
+	.map_lookup_elem = qp_trie_lookup_elem,
+	.map_update_elem = qp_trie_update_elem,
+	.map_delete_elem = qp_trie_delete_elem,
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_check_btf = qp_trie_check_btf,
+	.map_btf_id = &qp_trie_map_btf_ids[0],
+};
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 600c3fcee37a..1591f0df46f3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -928,6 +928,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_QP_TRIE,
 };
 
 /* Note that tracing related programs such as
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 07/10] selftests/bpf: Add two new dynptr_fail cases for map key
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
                   ` (5 preceding siblings ...)
  2022-09-17 15:31 ` [PATCH bpf-next 06/10] bpf: Add support for qp-trie map Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 08/10] selftests/bpf: Add test case for basic qp-trie map operations Hou Tao
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

bpf_dynptr is still not usable as the key of a normal map either, so add
two test cases to ensure it is rejected: one uses a bpf_dynptr directly
as the map key and the other uses a struct with an embedded bpf_dynptr
as the map key.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../testing/selftests/bpf/prog_tests/dynptr.c |  2 +
 .../testing/selftests/bpf/progs/dynptr_fail.c | 43 +++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
index bcf80b9f7c27..1df50fe1a741 100644
--- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
@@ -20,6 +20,8 @@ static struct {
 	{"ringbuf_invalid_api", "type=mem expected=alloc_mem"},
 	{"add_dynptr_to_map1", "invalid indirect read from stack"},
 	{"add_dynptr_to_map2", "invalid indirect read from stack"},
+	{"add_dynptr_to_key1", "invalid indirect read from stack"},
+	{"add_dynptr_to_key2", "invalid indirect read from stack"},
 	{"data_slice_out_of_bounds_ringbuf", "value is outside of the allowed memory range"},
 	{"data_slice_out_of_bounds_map_value", "value is outside of the allowed memory range"},
 	{"data_slice_use_after_release1", "invalid mem access 'scalar'"},
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index b0f08ff024fb..85cc32999e8e 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -35,6 +35,20 @@ struct {
 	__type(value, __u32);
 } array_map3 SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, struct bpf_dynptr);
+	__type(value, __u32);
+} hash_map1 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, struct test_info);
+	__type(value, __u32);
+} hash_map2 SEC(".maps");
+
 struct sample {
 	int pid;
 	long value;
@@ -206,6 +220,35 @@ int add_dynptr_to_map2(void *ctx)
 	return 0;
 }
 
+/* Can't use a dynptr as key of normal map */
+SEC("?raw_tp")
+int add_dynptr_to_key1(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	get_map_val_dynptr(&ptr);
+	bpf_map_lookup_elem(&hash_map1, &ptr);
+
+	return 0;
+}
+
+/* Can't use a struct with an embedded dynptr as key of normal map */
+SEC("?raw_tp")
+int add_dynptr_to_key2(void *ctx)
+{
+	struct test_info x;
+
+	x.x = 0;
+	bpf_ringbuf_reserve_dynptr(&ringbuf, val, 0, &x.ptr);
+
+	/* this should fail */
+	bpf_map_lookup_elem(&hash_map2, &x);
+
+	bpf_ringbuf_submit_dynptr(&x.ptr, 0);
+
+	return 0;
+}
+
 /* A data slice can't be accessed out of bounds */
 SEC("?raw_tp")
 int data_slice_out_of_bounds_ringbuf(void *ctx)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 08/10] selftests/bpf: Add test case for basic qp-trie map operations
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
                   ` (6 preceding siblings ...)
  2022-09-17 15:31 ` [PATCH bpf-next 07/10] selftests/bpf: Add two new dynptr_fail cases for map key Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 09/10] selftests/bpf: Add benchmark for qp-trie map Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 10/10] selftests/bpf: Add tests for qp-trie map by using bpf syscalls Hou Tao
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

First show that lookup on a qp-trie map differs from lookup on a hash
table, because the qp-trie map only uses the specified length of the key
buffer instead of the full key size. Then check that update and deletion
operations on the qp-trie map work as expected when using bpf_dynptr,
and that a zero-sized bpf_dynptr is rejected by the map operations.
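
Roughly, the difference in user-space lookup looks like the sketch
below (a minimal illustration only; FILE_PATH_SIZE and the two map fd
variables follow the test added by this patch):

  char raw[FILE_PATH_SIZE] = "/tmp/qp_trie_test";
  struct bpf_dynptr_user dynptr;
  unsigned int value;

  /* hash table: the full, fixed-size key buffer is hashed and compared */
  bpf_map_lookup_elem(htab_fd, raw, &value);

  /* qp-trie: only the length given to the dynptr is used as the key,
   * so keys with the same prefix but different lengths are distinct
   */
  bpf_dynptr_user_init(raw, strlen(raw) + 1, &dynptr);
  bpf_map_lookup_elem(trie_fd, &dynptr, &value);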

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
 .../selftests/bpf/prog_tests/qp_trie_test.c   |  91 +++++++++++++++
 .../selftests/bpf/progs/qp_trie_test.c        | 110 ++++++++++++++++++
 3 files changed, 202 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_test.c

diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index 168c5b287b5c..e18d6f5ffeef 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -71,3 +71,4 @@ cb_refs                                  # expected error message unexpected err
 cgroup_hierarchical_stats                # JIT does not support calling kernel function                                (kfunc)
 htab_update                              # failed to attach: ERROR: strerror_r(-524)=22                                (trampoline)
 tracing_struct                           # failed to auto-attach: -524                                                 (trampoline)
+qp_trie_test                             # failed to attach: ERROR: strerror_r(-524)=22                                (trampoline)
diff --git a/tools/testing/selftests/bpf/prog_tests/qp_trie_test.c b/tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
new file mode 100644
index 000000000000..24d5719f4f7f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <test_progs.h>
+#include "qp_trie_test.skel.h"
+
+static int setup_maps(struct qp_trie_test *skel, char *name, unsigned int value)
+{
+#define FILE_PATH_SIZE 64
+	struct bpf_dynptr_user dynptr;
+	char raw[FILE_PATH_SIZE];
+	char zero;
+	int fd, err;
+
+	memset(raw, 0, sizeof(raw));
+	strncpy(raw, name, sizeof(raw) - 1);
+
+	fd = bpf_map__fd(skel->maps.trie);
+	/* The full path returned by bpf_d_path() includes the trailing NUL */
+	bpf_dynptr_user_init(name, strlen(name) + 1, &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "trie add name"))
+		return -EINVAL;
+
+	zero = 0;
+	bpf_dynptr_user_init(&zero, 1, &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "trie add zero"))
+		return -EINVAL;
+
+	fd = bpf_map__fd(skel->maps.htab);
+	err = bpf_map_update_elem(fd, raw, &value, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "htab add"))
+		return -EINVAL;
+
+	return 0;
+}
+
+void test_qp_trie_test(void)
+{
+	char name[] = "/tmp/qp_trie_test";
+	unsigned int value, new_value;
+	struct bpf_dynptr_user dynptr;
+	struct qp_trie_test *skel;
+	int err, fd;
+	char zero;
+
+	skel = qp_trie_test__open();
+	if (!ASSERT_OK_PTR(skel, "qp_trie_test__open()"))
+		return;
+
+	err = qp_trie_test__load(skel);
+	if (!ASSERT_OK(err, "qp_trie_test__load()"))
+		goto out;
+
+	value = time(NULL);
+	if (setup_maps(skel, name, value))
+		goto out;
+
+	skel->bss->pid = getpid();
+	err = qp_trie_test__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto out;
+
+	fd = open(name, O_RDONLY | O_CREAT, 0644);
+	if (!ASSERT_GE(fd, 0, "open tmp file"))
+		goto out;
+	close(fd);
+	unlink(name);
+
+	ASSERT_EQ(skel->bss->trie_value, value, "trie lookup str");
+	ASSERT_EQ(skel->bss->htab_value, -1, "htab lookup bytes");
+	ASSERT_FALSE(skel->bss->zero_sized_key_bad, "zero-sized key");
+
+	bpf_dynptr_user_init(name, strlen(name) + 1, &dynptr);
+	new_value = 0;
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.trie), &dynptr, &new_value);
+	ASSERT_OK(err, "lookup elem");
+	ASSERT_EQ(new_value, value + 1, "check new value");
+
+	zero = 0;
+	bpf_dynptr_user_init(&zero, 1, &dynptr);
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.trie), &dynptr, &new_value);
+	ASSERT_EQ(err, -ENOENT, "lookup deleted elem");
+
+out:
+	qp_trie_test__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/qp_trie_test.c b/tools/testing/selftests/bpf/progs/qp_trie_test.c
new file mode 100644
index 000000000000..c7cf3dfae971
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/qp_trie_test.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <stdbool.h>
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct path {
+} __attribute__((preserve_access_index));
+
+struct file {
+	struct path f_path;
+} __attribute__((preserve_access_index));
+
+#define FILE_PATH_SIZE 64
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, 4);
+	__uint(value_size, FILE_PATH_SIZE);
+} array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_QP_TRIE);
+	__uint(max_entries, 2);
+	__type(key, struct bpf_dynptr);
+	__type(value, unsigned int);
+	__uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_DYNPTR_KEY);
+	__uint(map_extra, FILE_PATH_SIZE);
+} trie SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__uint(key_size, FILE_PATH_SIZE);
+	__uint(value_size, 4);
+} htab SEC(".maps");
+
+int pid = 0;
+unsigned int trie_value = 0;
+unsigned int htab_value = 0;
+bool zero_sized_key_bad = false;
+
+SEC("fentry/filp_close")
+int BPF_PROG(filp_close, struct file *filp)
+{
+	struct bpf_dynptr str_ptr, zero_ptr, zero_sized_ptr;
+	unsigned int new_value, *value;
+	int idx, len, err;
+	struct path *p;
+	char *raw;
+
+	if (bpf_get_current_pid_tgid() >> 32 != pid)
+		return 0;
+
+	idx = 0;
+	raw = bpf_map_lookup_elem(&array, &idx);
+	if (!raw)
+		return 0;
+
+	p = &filp->f_path;
+	len = bpf_d_path(p, raw, FILE_PATH_SIZE);
+	if (len < 0 || len > FILE_PATH_SIZE)
+		return 0;
+
+	bpf_dynptr_from_mem(raw, len, 0, &str_ptr);
+	value = bpf_map_lookup_elem(&trie, &str_ptr);
+	if (value)
+		trie_value = *value;
+	else
+		trie_value = -1;
+
+	value = bpf_map_lookup_elem(&htab, raw);
+	if (value)
+		htab_value = *value;
+	else
+		htab_value = -1;
+
+	/* Update qp_trie map */
+	new_value = trie_value + 1;
+	bpf_map_update_elem(&trie, &str_ptr, &new_value, BPF_ANY);
+
+	idx = 1;
+	raw = bpf_map_lookup_elem(&array, &idx);
+	if (!raw)
+		return 0;
+	bpf_dynptr_from_mem(raw, 1, 0, &zero_ptr);
+	bpf_map_delete_elem(&trie, &zero_ptr);
+
+	/* Use zero-sized bpf_dynptr */
+	bpf_dynptr_from_mem(raw, 0, 0, &zero_sized_ptr);
+
+	value = bpf_map_lookup_elem(&trie, &zero_sized_ptr);
+	if (value)
+		zero_sized_key_bad = true;
+	err = bpf_map_update_elem(&trie, &zero_sized_ptr, &new_value, BPF_ANY);
+	if (err != -EINVAL)
+		zero_sized_key_bad = true;
+	err = bpf_map_delete_elem(&trie, &zero_sized_ptr);
+	if (err != -EINVAL)
+		zero_sized_key_bad = true;
+
+	return 0;
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 09/10] selftests/bpf: Add benchmark for qp-trie map
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
                   ` (7 preceding siblings ...)
  2022-09-17 15:31 ` [PATCH bpf-next 08/10] selftests/bpf: Add test case for basic qp-trie map operations Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  2022-09-17 15:31 ` [PATCH bpf-next 10/10] selftests/bpf: Add tests for qp-trie map by using bpf syscalls Hou Tao
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add a benchmark for qp-trie map to compare lookup, update/delete
performance and memory usage with hash table.

When the content of the keys is uniformly distributed and there are
large differences in key length, the qp-trie will be dense and have a
low height, while the lookup overhead of the hash table increases due to
unnecessary memory comparisons, so the lookup performance of qp-trie is
much better than that of the hash table, as shown below:

Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
htab update      (1  thread)    2.122 ± 0.021M/s (drops 0.000 ± 0.000M/s mem 8.169 MiB)
htab update      (2  thread)    4.248 ± 0.197M/s (drops 0.000 ± 0.000M/s mem 8.168 MiB)
htab update      (4  thread)    8.475 ± 0.348M/s (drops 0.000 ± 0.000M/s mem 8.180 MiB)
htab update      (8  thread)   16.725 ± 0.633M/s (drops 0.000 ± 0.000M/s mem 8.208 MiB)
htab update      (16 thread)   30.246 ± 0.611M/s (drops 0.000 ± 0.000M/s mem 8.190 MiB)

qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)
qp-trie update   (1  thread)    1.622 ± 0.016M/s (drops 0.000 ± 0.000M/s mem 4.918 MiB)
qp-trie update   (2  thread)    2.688 ± 0.021M/s (drops 0.000 ± 0.000M/s mem 4.874 MiB)
qp-trie update   (4  thread)    4.062 ± 0.128M/s (drops 0.000 ± 0.000M/s mem 4.218 MiB)
qp-trie update   (8  thread)    7.037 ± 0.247M/s (drops 0.000 ± 0.000M/s mem 4.900 MiB)
qp-trie update   (16 thread)   11.024 ± 0.295M/s (drops 0.000 ± 0.000M/s mem 4.830 MiB)

For the strings in /proc/kallsyms, single-thread lookup performance of
qp-trie is about 27% slower than that of the hash table. As the number
of threads increases, lookup performance becomes almost the same, but
the update and deletion performance of qp-trie is much worse than that
of the hash table, as shown below:

Strings in /proc/kallsyms (key size=83, max entries=170958)
htab lookup      (1  thread)    5.686 ± 0.008M/s (drops 0.345 ± 0.002M/s mem 30.840 MiB)
htab lookup      (2  thread)   10.147 ± 0.067M/s (drops 0.616 ± 0.005M/s mem 30.841 MiB)
htab lookup      (4  thread)   16.503 ± 0.025M/s (drops 1.002 ± 0.004M/s mem 30.841 MiB)
htab lookup      (8  thread)   33.429 ± 0.146M/s (drops 2.028 ± 0.020M/s mem 30.848 MiB)
htab lookup      (16 thread)   67.249 ± 0.577M/s (drops 4.085 ± 0.032M/s mem 30.841 MiB)
htab update      (1  thread)    3.135 ± 0.355M/s (drops 0.000 ± 0.000M/s mem 30.842 MiB)
htab update      (2  thread)    6.269 ± 0.686M/s (drops 0.000 ± 0.000M/s mem 30.841 MiB)
htab update      (4  thread)   11.607 ± 1.632M/s (drops 0.000 ± 0.000M/s mem 30.840 MiB)
htab update      (8  thread)   23.041 ± 0.806M/s (drops 0.000 ± 0.000M/s mem 30.842 MiB)
htab update      (16 thread)   31.393 ± 0.307M/s (drops 0.000 ± 0.000M/s mem 30.835 MiB)

qp-trie lookup   (1  thread)    4.122 ± 0.010M/s (drops 0.250 ± 0.002M/s mem 30.108 MiB)
qp-trie lookup   (2  thread)    9.119 ± 0.057M/s (drops 0.554 ± 0.004M/s mem 17.422 MiB)
qp-trie lookup   (4  thread)   16.605 ± 0.032M/s (drops 1.008 ± 0.006M/s mem 17.203 MiB)
qp-trie lookup   (8  thread)   33.461 ± 0.058M/s (drops 2.032 ± 0.004M/s mem 16.977 MiB)
qp-trie lookup   (16 thread)   67.466 ± 0.145M/s (drops 4.097 ± 0.019M/s mem 17.452 MiB)
qp-trie update   (1  thread)    1.191 ± 0.093M/s (drops 0.000 ± 0.000M/s mem 17.170 MiB)
qp-trie update   (2  thread)    2.057 ± 0.041M/s (drops 0.000 ± 0.000M/s mem 17.058 MiB)
qp-trie update   (4  thread)    2.975 ± 0.035M/s (drops 0.000 ± 0.000M/s mem 17.411 MiB)
qp-trie update   (8  thread)    3.596 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 17.110 MiB)
qp-trie update   (16 thread)    4.200 ± 0.048M/s (drops 0.000 ± 0.000M/s mem 17.228 MiB)

For strings in BTF string section, the results are similar:

Sorted strings in BTF string sections (key size=71, max entries=115980)
htab lookup      (1  thread)    6.990 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 22.227 MiB)
htab lookup      (2  thread)   12.729 ± 0.059M/s (drops 0.000 ± 0.000M/s mem 22.224 MiB)
htab lookup      (4  thread)   21.202 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 22.218 MiB)
htab lookup      (8  thread)   43.418 ± 0.169M/s (drops 0.000 ± 0.000M/s mem 22.225 MiB)
htab lookup      (16 thread)   88.745 ± 0.410M/s (drops 0.000 ± 0.000M/s mem 22.224 MiB)
htab update      (1  thread)    3.238 ± 0.271M/s (drops 0.000 ± 0.000M/s mem 22.228 MiB)
htab update      (2  thread)    6.483 ± 0.821M/s (drops 0.000 ± 0.000M/s mem 22.227 MiB)
htab update      (4  thread)   12.702 ± 0.924M/s (drops 0.000 ± 0.000M/s mem 22.226 MiB)
htab update      (8  thread)   22.167 ± 1.269M/s (drops 0.000 ± 0.000M/s mem 22.229 MiB)
htab update      (16 thread)   31.225 ± 0.475M/s (drops 0.000 ± 0.000M/s mem 22.239 MiB)

qp-trie lookup   (1  thread)    6.729 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 11.335 MiB)
qp-trie lookup   (2  thread)   13.417 ± 0.010M/s (drops 0.000 ± 0.000M/s mem 11.287 MiB)
qp-trie lookup   (4  thread)   26.399 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 11.111 MiB)
qp-trie lookup   (8  thread)   52.910 ± 0.049M/s (drops 0.000 ± 0.000M/s mem 11.131 MiB)
qp-trie lookup   (16 thread)  105.444 ± 0.064M/s (drops 0.000 ± 0.000M/s mem 11.060 MiB)
qp-trie update   (1  thread)    1.508 ± 0.102M/s (drops 0.000 ± 0.000M/s mem 10.979 MiB)
qp-trie update   (2  thread)    2.877 ± 0.034M/s (drops 0.000 ± 0.000M/s mem 10.843 MiB)
qp-trie update   (4  thread)    5.111 ± 0.083M/s (drops 0.000 ± 0.000M/s mem 10.938 MiB)
qp-trie update   (8  thread)    9.229 ± 0.046M/s (drops 0.000 ± 0.000M/s mem 11.042 MiB)
qp-trie update   (16 thread)   11.625 ± 0.147M/s (drops 0.000 ± 0.000M/s mem 10.930 MiB)

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |  10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      | 511 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |  55 ++
 .../selftests/bpf/progs/qp_trie_bench.c       | 236 ++++++++
 5 files changed, 816 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 5f42adddbbb0..42ea2bc91d3b 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -577,11 +577,13 @@ $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
 $(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
 $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
+$(OUTPUT)/bench_qp_trie.o: $(OUTPUT)/qp_trie_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(TESTING_HELPERS) \
 		 $(TRACE_HELPERS) \
+		 $(CGROUP_HELPERS) \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
@@ -591,7 +593,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_strncmp.o \
 		 $(OUTPUT)/bench_bpf_hashmap_full_update.o \
 		 $(OUTPUT)/bench_local_storage.o \
-		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o
+		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
+		 $(OUTPUT)/bench_qp_trie.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index c1f20a147462..618f45fbe6e2 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -275,6 +275,7 @@ extern struct argp bench_bpf_loop_argp;
 extern struct argp bench_local_storage_argp;
 extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
+extern struct argp bench_qp_trie_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -284,6 +285,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{ &bench_local_storage_rcu_tasks_trace_argp, 0,
 		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
+	{ &bench_qp_trie_argp, 0, "qp-trie benchmark", 0 },
 	{},
 };
 
@@ -490,6 +492,10 @@ extern const struct bench bench_local_storage_cache_seq_get;
 extern const struct bench bench_local_storage_cache_interleaved_get;
 extern const struct bench bench_local_storage_cache_hashmap_control;
 extern const struct bench bench_local_storage_tasks_trace;
+extern const struct bench bench_htab_lookup;
+extern const struct bench bench_qp_trie_lookup;
+extern const struct bench bench_htab_update;
+extern const struct bench bench_qp_trie_update;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -529,6 +535,10 @@ static const struct bench *benchs[] = {
 	&bench_local_storage_cache_interleaved_get,
 	&bench_local_storage_cache_hashmap_control,
 	&bench_local_storage_tasks_trace,
+	&bench_htab_lookup,
+	&bench_qp_trie_lookup,
+	&bench_htab_update,
+	&bench_qp_trie_update,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_qp_trie.c b/tools/testing/selftests/bpf/benchs/bench_qp_trie.c
new file mode 100644
index 000000000000..ea38e272b5e4
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_qp_trie.c
@@ -0,0 +1,511 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <argp.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include "bench.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#include "qp_trie_bench.skel.h"
+
+enum {
+	FOR_HTAB = 0,
+	FOR_TRIE,
+};
+
+static struct qp_trie_ctx {
+	struct qp_trie_bench *skel;
+	int cgrp_dfd;
+	u64 map_slab_mem;
+} ctx;
+
+static struct {
+	const char *file;
+	__u32 entries;
+} args;
+
+struct qp_trie_key {
+	__u32 len;
+	unsigned char data[0];
+};
+
+struct run_stat {
+	__u64 stats[2];
+};
+
+enum {
+	ARG_DATA_FILE = 8001,
+	ARG_DATA_ENTRIES = 8002,
+};
+
+static const struct argp_option opts[] = {
+	{ "file", ARG_DATA_FILE, "DATA-FILE", 0, "Set data file" },
+	{ "entries", ARG_DATA_ENTRIES, "DATA-ENTRIES", 0, "Set data entries" },
+	{},
+};
+
+static error_t qp_trie_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_DATA_FILE:
+		args.file = strdup(arg);
+		break;
+	case ARG_DATA_ENTRIES:
+		args.entries = strtoul(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_qp_trie_argp = {
+	.options = opts,
+	.parser = qp_trie_parse_arg,
+};
+
+static int parse_data_set(const char *name, struct qp_trie_key ***set, unsigned int *nr,
+			  unsigned int *max_len)
+{
+#define INT_MAX_DATA_SIZE 1024
+	unsigned int i, nr_items, item_max_len;
+	char line[INT_MAX_DATA_SIZE + 1];
+	struct qp_trie_key **items;
+	struct qp_trie_key *cur;
+	int err = 0;
+	FILE *file;
+	char *got;
+
+	file = fopen(name, "rb");
+	if (!file) {
+		fprintf(stderr, "open %s err %s\n", name, strerror(errno));
+		return -1;
+	}
+
+	got = fgets(line, sizeof(line), file);
+	if (!got) {
+		fprintf(stderr, "empty file ?\n");
+		err = -1;
+		goto out;
+	}
+	if (sscanf(line, "%u", &nr_items) != 1) {
+		fprintf(stderr, "the first line must be the number of items\n");
+		err = -1;
+		goto out;
+	}
+
+	fprintf(stdout, "item %u\n", nr_items);
+
+	items = (struct qp_trie_key **)calloc(nr_items, sizeof(*items) + INT_MAX_DATA_SIZE);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	i = 0;
+	item_max_len = 0;
+	cur = (void *)items + sizeof(*items) * nr_items;
+	while (true) {
+		unsigned int len;
+
+		got = fgets(line, sizeof(line), file);
+		if (!got) {
+			if (!feof(file)) {
+				fprintf(stderr, "read file %s error\n", name);
+				err = -1;
+			}
+			break;
+		}
+
+		len = strlen(got);
+		if (len && got[len - 1] == '\n') {
+			got[len - 1] = 0;
+			len -= 1;
+		}
+		if (!len) {
+			fprintf(stdout, "#%u empty line\n", i + 2);
+			continue;
+		}
+
+		if (i >= nr_items) {
+			fprintf(stderr, "too many lines in %s\n", name);
+			break;
+		}
+
+		if (len > item_max_len)
+			item_max_len = len;
+		cur->len = len;
+		memcpy(cur->data, got, len);
+		items[i++] = cur;
+		cur = (void *)cur + INT_MAX_DATA_SIZE;
+	}
+
+	if (!err) {
+		if (i != nr_items)
+			fprintf(stdout, "fewer lines than expected in %s (exp %u got %u)\n", name, nr_items, i);
+		*nr = i;
+		*set = items;
+		*max_len = item_max_len;
+	} else {
+		free(items);
+	}
+
+out:
+	fclose(file);
+	return err;
+}
+
+static int gen_data_set(struct qp_trie_key ***set, unsigned int *nr, unsigned int *max_len)
+{
+#define RND_MAX_DATA_SIZE 255
+	struct qp_trie_key **items;
+	size_t ptr_size, data_size;
+	struct qp_trie_key *cur;
+	unsigned int i, nr_items;
+	ssize_t got;
+	int err = 0;
+
+	ptr_size = *nr * sizeof(*items);
+	data_size = *nr * (sizeof(*cur) + RND_MAX_DATA_SIZE);
+	items = (struct qp_trie_key **)malloc(ptr_size + data_size);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	cur = (void *)items + ptr_size;
+	got = syscall(__NR_getrandom, cur, data_size, 0);
+	if (got != data_size) {
+		fprintf(stderr, "getrandom error %s\n", strerror(errno));
+		err = -1;
+		goto out;
+	}
+
+	nr_items = 0;
+	for (i = 0; i < *nr; i++) {
+		cur->len &= 0xff;
+		if (cur->len) {
+			items[nr_items++] = cur;
+			memset(cur->data + cur->len, 0, RND_MAX_DATA_SIZE - cur->len);
+		}
+		cur = (void *)cur + (sizeof(*cur) + RND_MAX_DATA_SIZE);
+	}
+	if (!nr_items) {
+		fprintf(stderr, "no valid key in random data\n");
+		err = -1;
+		goto out;
+	}
+	fprintf(stdout, "generate %u random keys\n", nr_items);
+
+	*nr = nr_items;
+	*set = items;
+	*max_len = RND_MAX_DATA_SIZE;
+out:
+	if (err && items)
+		free(items);
+	return err;
+}
+
+static void qp_trie_validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "qp_trie_map benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (!args.file && !args.entries) {
+		fprintf(stderr, "must specify entries when using a randomly generated data set\n");
+		exit(1);
+	}
+
+	if (args.file && access(args.file, R_OK)) {
+		fprintf(stderr, "data file is not accessible\n");
+		exit(1);
+	}
+}
+
+static void qp_trie_init_map_opts(struct qp_trie_bench *skel, unsigned int data_size,
+				  unsigned int nr)
+{
+	bpf_map__set_value_size(skel->maps.htab_array, data_size);
+	bpf_map__set_max_entries(skel->maps.htab_array, nr);
+
+	bpf_map__set_key_size(skel->maps.htab, data_size);
+	bpf_map__set_max_entries(skel->maps.htab, nr);
+
+	bpf_map__set_value_size(skel->maps.trie_array, sizeof(struct qp_trie_key) + data_size);
+	bpf_map__set_max_entries(skel->maps.trie_array, nr);
+
+	bpf_map__set_map_extra(skel->maps.qp_trie, data_size);
+	bpf_map__set_max_entries(skel->maps.qp_trie, nr);
+}
+
+static void qp_trie_setup_key_map(struct bpf_map *map, unsigned int map_type,
+				  struct qp_trie_key **set, unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		void *value;
+		int err;
+
+		value = (map_type != FOR_HTAB) ? (void *)set[i] : (void *)set[i]->data;
+		err = bpf_map_update_elem(fd, &i, value, 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static u64 qp_trie_get_slab_mem(int dfd)
+{
+	const char *magic = "slab ";
+	const char *name = "memory.stat";
+	int fd;
+	ssize_t nr;
+	char buf[4096];
+	char *from;
+
+	fd = openat(dfd, name, 0);
+	if (fd < 0) {
+		fprintf(stderr, "no %s\n", name);
+		exit(1);
+	}
+
+	nr = read(fd, buf, sizeof(buf));
+	if (nr <= 0) {
+		fprintf(stderr, "empty %s ?\n", name);
+		exit(1);
+	}
+	buf[nr - 1] = 0;
+
+	close(fd);
+
+	from = strstr(buf, magic);
+	if (!from) {
+		fprintf(stderr, "no slab in %s\n", name);
+		exit(1);
+	}
+
+	return strtoull(from + strlen(magic), NULL, 10);
+}
+
+static void qp_trie_setup_lookup_map(struct bpf_map *map, unsigned int map_type,
+				     struct qp_trie_key **set, unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		int err;
+
+		if (map_type == FOR_HTAB) {
+			void *key;
+
+			key = set[i]->data;
+			err = bpf_map_update_elem(fd, key, &i, 0);
+		} else {
+			struct bpf_dynptr_user dynptr;
+
+			bpf_dynptr_user_init(set[i]->data, set[i]->len, &dynptr);
+			err = bpf_map_update_elem(fd, &dynptr, &i, 0);
+		}
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static void qp_trie_setup(unsigned int map_type)
+{
+	struct qp_trie_key **set = NULL;
+	struct qp_trie_bench *skel;
+	unsigned int nr = 0, max_len = 0;
+	struct bpf_map *map;
+	u64 before, after;
+	int dfd;
+	int err;
+
+	if (!args.file) {
+		nr = args.entries;
+		err = gen_data_set(&set, &nr, &max_len);
+	} else {
+		err = parse_data_set(args.file, &set, &nr, &max_len);
+	}
+	if (err < 0)
+		exit(1);
+
+	if (args.entries && args.entries < nr)
+		nr = args.entries;
+
+	dfd = cgroup_setup_and_join("/qp_trie");
+	if (dfd < 0) {
+		fprintf(stderr, "failed to setup cgroup env\n");
+		exit(1);
+	}
+
+	setup_libbpf();
+
+	before = qp_trie_get_slab_mem(dfd);
+
+	skel = qp_trie_bench__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	qp_trie_init_map_opts(skel, max_len, nr);
+
+	skel->rodata->qp_trie_key_size = max_len;
+	skel->bss->update_nr = nr;
+	skel->bss->update_chunk = nr / env.producer_cnt;
+
+	err = qp_trie_bench__load(skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	map = (map_type == FOR_HTAB) ? skel->maps.htab_array : skel->maps.trie_array;
+	qp_trie_setup_key_map(map, map_type, set, nr);
+
+	map = (map_type == FOR_HTAB) ? skel->maps.htab : skel->maps.qp_trie;
+	qp_trie_setup_lookup_map(map, map_type, set, nr);
+
+	after = qp_trie_get_slab_mem(dfd);
+
+	ctx.skel = skel;
+	ctx.cgrp_dfd = dfd;
+	ctx.map_slab_mem = after - before;
+}
+
+static void qp_trie_attach_prog(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+
+	link = bpf_program__attach(prog);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void htab_lookup_setup(void)
+{
+	qp_trie_setup(FOR_HTAB);
+	qp_trie_attach_prog(ctx.skel->progs.htab_lookup);
+}
+
+static void qp_trie_lookup_setup(void)
+{
+	qp_trie_setup(FOR_TRIE);
+	qp_trie_attach_prog(ctx.skel->progs.qp_trie_lookup);
+}
+
+static void htab_update_setup(void)
+{
+	qp_trie_setup(FOR_HTAB);
+	qp_trie_attach_prog(ctx.skel->progs.htab_update);
+}
+
+static void qp_trie_update_setup(void)
+{
+	qp_trie_setup(FOR_TRIE);
+	qp_trie_attach_prog(ctx.skel->progs.qp_trie_update);
+}
+
+static void *qp_trie_producer(void *ctx)
+{
+	while (true)
+		(void)syscall(__NR_getpgid);
+	return NULL;
+}
+
+static void *qp_trie_consumer(void *ctx)
+{
+	return NULL;
+}
+
+static void qp_trie_measure(struct bench_res *res)
+{
+	static __u64 last_hits, last_drops;
+	__u64 total_hits = 0, total_drops = 0;
+	unsigned int i, nr_cpus;
+
+	nr_cpus = bpf_num_possible_cpus();
+	for (i = 0; i < nr_cpus; i++) {
+		struct run_stat *s = (void *)&ctx.skel->bss->percpu_stats[i & 255];
+
+		total_hits += s->stats[0];
+		total_drops += s->stats[1];
+	}
+
+	res->hits = total_hits - last_hits;
+	res->drops = total_drops - last_drops;
+
+	last_hits = total_hits;
+	last_drops = total_drops;
+}
+
+static void qp_trie_report_final(struct bench_res res[], int res_cnt)
+{
+	close(ctx.cgrp_dfd);
+	cleanup_cgroup_environment();
+
+	fprintf(stdout, "Slab: %.3f MiB\n", (float)ctx.map_slab_mem / 1024 / 1024);
+	hits_drops_report_final(res, res_cnt);
+}
+
+const struct bench bench_htab_lookup = {
+	.name = "htab-lookup",
+	.validate = qp_trie_validate,
+	.setup = htab_lookup_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_qp_trie_lookup = {
+	.name = "qp-trie-lookup",
+	.validate = qp_trie_validate,
+	.setup = qp_trie_lookup_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_htab_update = {
+	.name = "htab-update",
+	.validate = qp_trie_validate,
+	.setup = htab_update_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_qp_trie_update = {
+	.name = "qp-trie-update",
+	.validate = qp_trie_validate,
+	.setup = qp_trie_update_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh b/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
new file mode 100755
index 000000000000..0cbcb5bc9292
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2022. Huawei Technologies Co., Ltd
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+mem()
+{
+	echo "$*" | sed -E "s/.*Slab: ([0-9]+\.[0-9]+ MiB).*/\1/"
+}
+
+run_qp_trie_bench()
+{
+	local title=$1
+	local summary
+
+	shift 1
+	summary=$($RUN_BENCH "$@" | grep "Summary\|Slab:")
+	printf "%s %20s (drops %-16s mem %s)\n" "$title" "$(hits $summary)" \
+		"$(drops $summary)" "$(mem $summary)"
+}
+
+run_qp_trie_benchs()
+{
+	local p
+	local m
+	local b
+	local title
+
+	for m in htab qp-trie
+	do
+		for b in lookup update
+		do
+			for p in 1 2 4 8 16
+			do
+				title=$(printf "%-16s (%-2d thread)" "$m $b" $p)
+				run_qp_trie_bench "$title" ${m}-${b} -p $p "$@"
+			done
+		done
+	done
+	echo
+}
+
+echo "Randomly-generated binary data (16K)"
+run_qp_trie_benchs --entries 16384
+
+echo "Strings in /proc/kallsyms"
+TMP_FILE=/tmp/kallsyms.txt
+SRC_FILE=/proc/kallsyms
+trap 'rm -f $TMP_FILE' EXIT
+wc -l $SRC_FILE | awk '{ print $1}' > $TMP_FILE
+awk '{ print $3 }' $SRC_FILE >> $TMP_FILE
+run_qp_trie_benchs --file $TMP_FILE
diff --git a/tools/testing/selftests/bpf/progs/qp_trie_bench.c b/tools/testing/selftests/bpf/progs/qp_trie_bench.c
new file mode 100644
index 000000000000..b60acb7b9f94
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/qp_trie_bench.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <linux/errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+struct bpf_map;
+
+struct qp_trie_key {
+	__u32 len;
+	unsigned char data[0];
+};
+
+/* value_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} htab_array SEC(".maps");
+
+/* value_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} trie_array SEC(".maps");
+
+/* key_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(value_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
+
+/* map_extra will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_QP_TRIE);
+	__type(key, struct bpf_dynptr);
+	__type(value, unsigned int);
+	__uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_DYNPTR_KEY);
+} qp_trie SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__u64 stats[2];
+} __attribute__((__aligned__(128))) percpu_stats[256];
+
+struct update_ctx {
+	unsigned int max;
+	unsigned int from;
+};
+
+volatile const unsigned int qp_trie_key_size;
+
+unsigned int update_nr;
+unsigned int update_chunk;
+
+static __always_inline void update_stats(int idx)
+{
+	__u32 cpu = bpf_get_smp_processor_id();
+
+	percpu_stats[cpu & 255].stats[idx]++;
+}
+
+static int lookup_htab(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	__u32 *index;
+
+	index = bpf_map_lookup_elem(&htab, value);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int update_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&htab_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_update_elem(&htab, value, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&htab_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_delete_elem(&htab, value);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+static int lookup_qp_trie(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	struct qp_trie_key *qp_trie_key = value;
+	struct bpf_dynptr dynptr;
+	__u32 *index;
+
+	if (qp_trie_key->len > qp_trie_key_size)
+		return 0;
+
+	bpf_dynptr_from_mem(qp_trie_key->data, qp_trie_key->len, 0, &dynptr);
+	index = bpf_map_lookup_elem(&qp_trie, &dynptr);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int update_qp_trie_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	struct qp_trie_key *value;
+	struct bpf_dynptr dynptr;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&trie_array, &update->from);
+	if (!value || value->len > qp_trie_key_size)
+		return 1;
+
+	bpf_dynptr_from_mem(value->data, value->len, 0, &dynptr);
+	err = bpf_map_update_elem(&qp_trie, &dynptr, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_qp_trie_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	struct qp_trie_key *value;
+	struct bpf_dynptr dynptr;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&trie_array, &update->from);
+	if (!value || value->len > qp_trie_key_size)
+		return 1;
+
+	bpf_dynptr_from_mem(value->data, value->len, 0, &dynptr);
+	err = bpf_map_delete_elem(&qp_trie, &dynptr);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int htab_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&htab_array, lookup_htab, NULL, 0);
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int qp_trie_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&trie_array, lookup_qp_trie, NULL, 0);
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int htab_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate on a subset of the keys, selected by CPU id */
+	update.from = index;
+	bpf_loop(update_chunk, update_htab_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_htab_loop, &update, 0);
+
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int qp_trie_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate on a subset of the keys, selected by CPU id */
+	update.from = index;
+	bpf_loop(update_chunk, update_qp_trie_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_qp_trie_loop, &update, 0);
+
+	return 0;
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 10/10] selftests/bpf: Add tests for qp-trie map by using bpf syscalls
  2022-09-17 15:31 [PATCH bpf-next 00/10] Add support for qp-trie map Hou Tao
                   ` (8 preceding siblings ...)
  2022-09-17 15:31 ` [PATCH bpf-next 09/10] selftests/bpf: Add benchmark for qp-trie map Hou Tao
@ 2022-09-17 15:31 ` Hou Tao
  9 siblings, 0 replies; 13+ messages in thread
From: Hou Tao @ 2022-09-17 15:31 UTC (permalink / raw)
  To: bpf, Andrii Nakryiko
  Cc: Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

From: Hou Tao <houtao1@huawei.com>

Add test cases for creation, lookup, update, deletion and iteration on
qp-trie, and also check that the keys returned during iteration are
lexicographically ordered.
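
The ordered iteration relies on bpf_map_get_next_key(); a rough sketch
of how the test walks the trie (the buffer size and the map fd variable
are only for illustration):

  struct bpf_dynptr_user next_key;
  unsigned char buf[4];
  int err;

  /* A NULL key returns the minimal key; feeding the returned key back
   * yields its in-order successor, so repeated calls walk the keys in
   * lexicographical order until -ENOENT is returned.
   */
  bpf_dynptr_user_init(buf, sizeof(buf), &next_key);
  err = bpf_map_get_next_key(fd, NULL, &next_key);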

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/map_tests/qp_trie_map.c     | 1165 +++++++++++++++++
 1 file changed, 1165 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/map_tests/qp_trie_map.c

diff --git a/tools/testing/selftests/bpf/map_tests/qp_trie_map.c b/tools/testing/selftests/bpf/map_tests/qp_trie_map.c
new file mode 100644
index 000000000000..27d0805c3049
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/qp_trie_map.c
@@ -0,0 +1,1165 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <unistd.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <endian.h>
+#include <limits.h>
+#include <time.h>
+#include <pthread.h>
+#include <linux/btf.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <test_btf.h>
+#include <test_maps.h>
+
+#include "bpf_util.h"
+
+#define QP_TRIE_KEY_SIZE sizeof(struct bpf_dynptr)
+#define QP_TRIE_DFT_MAX_KEY_LEN 4
+#define QP_TRIE_DFT_VAL_SIZE 4
+#define QP_TRIE_DFT_MAP_FLAGS (BPF_F_NO_PREALLOC | BPF_F_DYNPTR_KEY)
+
+#define QP_TRIE_DFT_BTF_KEY_ID 1
+#define QP_TRIE_DFT_BTF_VAL_ID 2
+
+struct qp_trie_create_case {
+	const char *name;
+	int error;
+	unsigned int map_flags;
+	unsigned int max_key_len;
+	unsigned int value_size;
+	unsigned int max_entries;
+	unsigned int btf_key_type_id;
+	unsigned int btf_value_type_id;
+};
+
+struct qp_trie_bytes_key {
+	unsigned int len;
+	unsigned char data[4];
+};
+
+struct qp_trie_int_key {
+	unsigned int len;
+	unsigned int data;
+};
+
+enum {
+	UPDATE_OP = 0,
+	DELETE_OP,
+	LOOKUP_OP,
+	ITERATE_OP,
+	MAX_OP,
+};
+
+struct stress_conf {
+	unsigned int threads[MAX_OP];
+	unsigned int max_key_len;
+	unsigned int loop;
+	unsigned int nr;
+};
+
+struct qp_trie_rw_ctx {
+	unsigned int nr;
+	unsigned int max_key_len;
+	int fd;
+	struct bpf_dynptr_user *set;
+	unsigned int loop;
+	unsigned int nr_delete;
+};
+
+static int qp_trie_load_btf(void)
+{
+	const char btf_str_sec[] = "\0bpf_dynptr\0qp_test_key";
+	__u32 btf_raw_types[] = {
+		/* struct bpf_dynptr */				/* [1] */
+		BTF_TYPE_ENC(1, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 0), 16),
+		/* unsigned int */				/* [2] */
+		BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),
+		/* struct qp_test_key */			/* [3] */
+		BTF_TYPE_ENC(12, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 0), 16),
+	};
+	struct btf_header btf_hdr = {
+		.magic = BTF_MAGIC,
+		.version = BTF_VERSION,
+		.hdr_len = sizeof(struct btf_header),
+		.type_len = sizeof(btf_raw_types),
+		.str_off = sizeof(btf_raw_types),
+		.str_len = sizeof(btf_str_sec),
+	};
+	__u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
+		     sizeof(btf_str_sec)];
+
+	memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
+	memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
+	memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
+	       btf_str_sec, sizeof(btf_str_sec));
+
+	return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
+}
+
+struct qp_trie_create_case create_cases[] = {
+	{
+		.name = "tiny qp-trie",
+		.error = 0,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "empty qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 0,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "preallocated qp-trie",
+		.error = -EINVAL,
+		.map_flags = BPF_F_DYNPTR_KEY,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "fixed-size key qp-trie",
+		.error = -EINVAL,
+		.map_flags = BPF_F_NO_PREALLOC,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "mmapable qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS | BPF_F_MMAPABLE,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "no btf qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = 0,
+		.btf_value_type_id = 0,
+	},
+	{
+		.name = "qp_test_key qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = 3,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "zero max key len qp-trie",
+		.error = -EINVAL,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = 0,
+		.value_size = QP_TRIE_DFT_VAL_SIZE,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+	{
+		.name = "big k-v size qp-trie",
+		.error = -E2BIG,
+		.map_flags = QP_TRIE_DFT_MAP_FLAGS,
+		.max_key_len = QP_TRIE_DFT_MAX_KEY_LEN,
+		.value_size = 128 << 20,
+		.max_entries = 1,
+		.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID,
+		.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID,
+	},
+};
+
+static void test_qp_trie_create(void)
+{
+	unsigned int i;
+	int btf_fd;
+
+	btf_fd = qp_trie_load_btf();
+	CHECK(btf_fd < 0, "load btf", "error %d\n", btf_fd);
+
+	for (i = 0; i < ARRAY_SIZE(create_cases); i++) {
+		LIBBPF_OPTS(bpf_map_create_opts, opts);
+		int fd;
+
+		opts.map_flags = create_cases[i].map_flags;
+		opts.btf_fd = btf_fd;
+		opts.btf_key_type_id = create_cases[i].btf_key_type_id;
+		opts.btf_value_type_id = create_cases[i].btf_value_type_id;
+		opts.map_extra = create_cases[i].max_key_len;
+		fd = bpf_map_create(BPF_MAP_TYPE_QP_TRIE, "qp_trie", QP_TRIE_KEY_SIZE,
+				    create_cases[i].value_size, create_cases[i].max_entries, &opts);
+		if (!create_cases[i].error) {
+			CHECK(fd < 0, create_cases[i].name, "error %d\n", fd);
+			close(fd);
+		} else {
+			CHECK(fd != create_cases[i].error, create_cases[i].name,
+			      "expect error %d got %d\n", create_cases[i].error, fd);
+		}
+	}
+
+	close(btf_fd);
+}
+
+static int qp_trie_create(unsigned int max_key_len, unsigned int value_size, unsigned int max_entries)
+{
+	LIBBPF_OPTS(bpf_map_create_opts, opts);
+	int btf_fd, map_fd;
+
+	btf_fd = qp_trie_load_btf();
+	CHECK(btf_fd < 0, "load btf", "error %d\n", btf_fd);
+
+	opts.map_flags = QP_TRIE_DFT_MAP_FLAGS;
+	opts.btf_fd = btf_fd;
+	opts.btf_key_type_id = QP_TRIE_DFT_BTF_KEY_ID;
+	opts.btf_value_type_id = QP_TRIE_DFT_BTF_VAL_ID;
+	opts.map_extra = max_key_len;
+	map_fd = bpf_map_create(BPF_MAP_TYPE_QP_TRIE, "qp_trie", QP_TRIE_KEY_SIZE, value_size,
+				max_entries, &opts);
+	CHECK(map_fd < 0, "bpf_map_create", "error %d\n", map_fd);
+
+	close(btf_fd);
+
+	return map_fd;
+}
+
+static void test_qp_trie_bad_update(void)
+{
+	struct bpf_dynptr_user dynptr;
+	unsigned int key, value;
+	u64 big_key;
+	int fd, err;
+
+	fd = qp_trie_create(sizeof(key), sizeof(value), 1);
+
+	/* Invalid flags (Error) */
+	key = 0;
+	value = 0;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST | BPF_EXIST);
+	CHECK(err != -EINVAL, "invalid update flag", "error %d\n", err);
+
+	/* Invalid key len (Error) */
+	big_key = 1;
+	value = 1;
+	bpf_dynptr_user_init(&big_key, sizeof(big_key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+	CHECK(err != -EINVAL, "invalid data len", "error %d\n", err);
+
+	/* Iterate an empty qp-trie (Error) */
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	/* Overwrite an empty qp-trie (Error) */
+	key = 2;
+	value = 2;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_EXIST);
+	CHECK(err != -ENOENT, "overwrite empty qp-trie", "error %d\n", err);
+
+	/* Iterate an empty qp-trie (Error) */
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	close(fd);
+}
+
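+/* Lookup/delete keys which are absent, including prefixes and extensions of stored keys */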
+static void test_qp_trie_bad_lookup_delete(void)
+{
+	struct bpf_dynptr_user dynptr;
+	unsigned int key, value;
+	int fd, err;
+
+	fd = qp_trie_create(sizeof(key), sizeof(value), 2);
+
+	/* Lookup/Delete non-existent key (Error) */
+	key = 0;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_delete_elem(fd, &dynptr);
+	CHECK(err != -ENOENT, "del non-existent key", "error %d\n", err);
+	err = bpf_map_lookup_elem(fd, &dynptr, &value);
+	CHECK(err != -ENOENT, "lookup non-existent key", "error %d\n", err);
+
+	key = 0;
+	value = 2;
+	bpf_dynptr_user_init(&key, 2, &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+	CHECK(err, "add elem", "error %d\n", err);
+
+	key = 0;
+	value = 4;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+	CHECK(err, "add elem", "error %d\n", err);
+
+	/*
+	 * Lookup/Delete non-existent key, although it is the prefix of
+	 * existent keys (Error)
+	 */
+	key = 0;
+	bpf_dynptr_user_init(&key, 1, &dynptr);
+	err = bpf_map_delete_elem(fd, &dynptr);
+	CHECK(err != -ENOENT, "del non-existent key", "error %d\n", err);
+	err = bpf_map_lookup_elem(fd, &dynptr, &value);
+	CHECK(err != -ENOENT, "lookup non-existent key", "error %d\n", err);
+
+	/* Lookup/Delete non-existent key, although its prefix exists (Error) */
+	key = 0;
+	bpf_dynptr_user_init(&key, 3, &dynptr);
+	err = bpf_map_delete_elem(fd, &dynptr);
+	CHECK(err != -ENOENT, "del non-existent key", "error %d\n", err);
+	err = bpf_map_lookup_elem(fd, &dynptr, &value);
+	CHECK(err != -ENOENT, "lookup non-existent key", "error %d\n", err);
+
+	close(fd);
+}
+
+static int cmp_str(const void *a, const void *b)
+{
+	const char *str_a = *(const char **)a, *str_b = *(const char **)b;
+
+	return strcmp(str_a, str_b);
+}
+
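+/* All keys start with 'a', so updates and iteration happen within a single sub-tree */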
+static void test_qp_trie_one_subtree_update(void)
+{
+	const char *keys[] = {
+		"ab", "abc", "abo", "abS", "abcd",
+	};
+	const char *sorted_keys[ARRAY_SIZE(keys)];
+	unsigned int value, got, i, j;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	char data[4];
+	int fd, err;
+
+	fd = qp_trie_create(4, sizeof(value), ARRAY_SIZE(keys));
+
+	for (i = 0; i < ARRAY_SIZE(keys); i++) {
+		unsigned int flags;
+
+		/* Add i-th element */
+		flags = i % 2 ? BPF_NOEXIST : 0;
+		bpf_dynptr_user_init((void *)keys[i], strlen(keys[i]), &dynptr);
+		value = i + 100;
+		err = bpf_map_update_elem(fd, &dynptr, &value, flags);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err, "lookup elem", "#%u error %d\n", i, err);
+		CHECK(got != value, "lookup elem", "#%u expect %u got %u\n", i, value, got);
+
+		/* Re-add i-th element (Error) */
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err != -EEXIST, "re-add elem", "#%u error %d\n", i, err);
+
+		/* Overwrite i-th element */
+		flags = i % 2 ? 0 : BPF_EXIST;
+		value = i;
+		err = bpf_map_update_elem(fd, &dynptr, &value, flags);
+		CHECK(err, "update elem", "error %d\n", err);
+
+		/* Lookup #[0~i] elements */
+		for (j = 0; j <= i; j++) {
+			bpf_dynptr_user_init((void *)keys[j], strlen(keys[j]), &dynptr);
+			err = bpf_map_lookup_elem(fd, &dynptr, &got);
+			CHECK(err, "lookup elem", "#%u/%u error %d\n", i, j, err);
+			CHECK(got != j, "lookup elem", "#%u/%u expect %u got %u\n", i, j, j, got);
+		}
+	}
+
+	/* Add element to a full qp-trie (Error) */
+	memset(data, 0, sizeof(data));
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	value = 0;
+	err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+	CHECK(err != -ENOSPC, "add to full qp-trie", "error %d\n", err);
+
+	/* Iterate sorted elements */
+	cur = NULL;
+	memcpy(sorted_keys, keys, sizeof(keys));
+	qsort(sorted_keys, ARRAY_SIZE(sorted_keys), sizeof(sorted_keys[0]), cmp_str);
+	bpf_dynptr_user_init(data, sizeof(data), &dynptr);
+	for (i = 0; i < ARRAY_SIZE(sorted_keys); i++) {
+		unsigned int len;
+		char *got;
+
+		len = strlen(sorted_keys[i]);
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+		CHECK(bpf_dynptr_user_get_size(&dynptr) != len, "iterate",
+		      "#%u invalid len %u expect %u\n",
+		      i, bpf_dynptr_user_get_size(&dynptr), len);
+		got = bpf_dynptr_user_get_data(&dynptr);
+		CHECK(memcmp(sorted_keys[i], got, len), "iterate",
+		      "#%u got %.*s exp %.*s\n", i, len, got, len, sorted_keys[i]);
+
+		if (!cur)
+			cur = &dynptr;
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	/* Delete all elements */
+	for (i = 0; i < ARRAY_SIZE(keys); i++) {
+		bpf_dynptr_user_init((void *)keys[i], strlen(keys[i]), &dynptr);
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u elem error %d\n", i, err);
+
+		/* Lookup deleted element (Error) */
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err != -ENOENT, "lookup elem", "#%u error %d\n", i, err);
+
+		/* Lookup #(i~N] elements */
+		for (j = i + 1; j < ARRAY_SIZE(keys); j++) {
+			bpf_dynptr_user_init((void *)keys[j], strlen(keys[j]), &dynptr);
+			err = bpf_map_lookup_elem(fd, &dynptr, &got);
+			CHECK(err, "lookup elem", "#%u/%u error %d\n", i, j, err);
+			CHECK(got != j, "lookup elem", "#%u/%u expect %u got %u\n", i, j, j, got);
+		}
+	}
+
+	memset(data, 0, sizeof(data));
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	close(fd);
+}
+
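+/* Little-endian keys cycle through all 256 first-byte values, so every sub-tree is exercised */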
+static void test_qp_trie_all_subtree_update(void)
+{
+	unsigned int i, max_entries, key, value, got;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	int fd, err;
+
+	/* 16 elements per subtree */
+	max_entries = 256 * 16;
+	fd = qp_trie_create(sizeof(key), sizeof(value), max_entries);
+
+	for (i = 0; i < max_entries; i++) {
+		key = htole32(i);
+		bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+		value = i;
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err, "lookup elem", "#%u elem error %d\n", i, err);
+		CHECK(got != value, "lookup elem", "#%u expect %u got %u\n", i, value, got);
+	}
+
+	/* Add element to a full qp-trie (Error) */
+	key = htole32(max_entries + 1);
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	value = 0;
+	err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+	CHECK(err != -ENOSPC, "add to full qp-trie", "error %d\n", err);
+
+	/* Iterate all elements */
+	cur = NULL;
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		unsigned int *data;
+		unsigned int exp;
+
+		exp = htole32((i / 16) | ((i & 0xf) << 8));
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+		CHECK(bpf_dynptr_user_get_size(&dynptr) != 4, "iterate",
+		      "#%u invalid len %u\n", i, bpf_dynptr_user_get_size(&dynptr));
+		data = bpf_dynptr_user_get_data(&dynptr);
+		CHECK(data != &key, "dynptr data", "#%u got %p exp %p\n", i, data, &key);
+		CHECK(key != exp, "iterate", "#%u got %u exp %u\n", i, key, exp);
+
+		if (!cur)
+			cur = &dynptr;
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	/* Delete all elements */
+	i = max_entries;
+	while (i-- > 0) {
+		key = htole32(i);
+		bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u error %d\n", i, err);
+
+		/* Lookup deleted element (Error) */
+		err = bpf_map_lookup_elem(fd, &dynptr, &got);
+		CHECK(err != -ENOENT, "lookup elem", "#%u error %d\n", i, err);
+	}
+
+	bpf_dynptr_user_init(&key, sizeof(key), &dynptr);
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	close(fd);
+}
+
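+/* Insert @data into the sorted array @set with @nr entries; return -1 if it is a duplicate */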
+static int binary_insert_data(unsigned int *set, unsigned int nr, unsigned int data)
+{
+	int begin = 0, end = nr - 1, mid, i;
+
+	while (begin <= end) {
+		mid = begin + (end - begin) / 2;
+		if (data == set[mid])
+			return -1;
+		if (data > set[mid])
+			begin = mid + 1;
+		else
+			end = mid - 1;
+	}
+
+	/* Move [begin, nr) backwards and insert new item at begin */
+	i = nr - 1;
+	while (i >= begin) {
+		set[i + 1] = set[i];
+		i--;
+	}
+	set[begin] = data;
+
+	return 0;
+}
+
+/* UINT_MAX will not be in the returned data set */
+static unsigned int *gen_random_unique_data_set(unsigned int max_entries)
+{
+	unsigned int *data_set;
+	unsigned int i, data;
+
+	data_set = malloc(sizeof(*data_set) * max_entries);
+	CHECK(!data_set, "malloc", "no mem");
+
+	for (i = 0; i < max_entries; i++) {
+		while (true) {
+			data = random() % UINT_MAX;
+			if (!binary_insert_data(data_set, i, data))
+				break;
+		}
+	}
+
+	return data_set;
+}
+
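+/* Order u32 values by the memcmp() order of their in-memory bytes, i.e. the qp-trie iteration order */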
+static int cmp_be32(const void *l, const void *r)
+{
+	unsigned int a = htobe32(*(unsigned int *)l), b = htobe32(*(unsigned int *)r);
+
+	if (a < b)
+		return -1;
+	if (a > b)
+		return 1;
+	return 0;
+}
+
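+/* With no concurrent modification, iteration must return all keys in byte order */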
+static void test_qp_trie_rdonly_iterate(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 4096;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, 0);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	/* Iteration results are big-endian ordered */
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		unsigned int *got;
+
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		got = bpf_dynptr_user_get_data(&dynptr);
+		len = bpf_dynptr_user_get_size(&dynptr);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(got != &data, "iterate", "#%u invalid dynptr got %p exp %p\n", i, got, &data);
+		CHECK(*got != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, *got, data_set[i]);
+		cur = &dynptr;
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	/* Iterate from non-existent key */
+	data = htobe32(UINT_MAX);
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	err = bpf_map_get_next_key(fd, &dynptr, &dynptr);
+	CHECK(err, "iterate from non-existent", "error %d\n", err);
+	len = bpf_dynptr_user_get_size(&dynptr);
+	CHECK(len != 4, "iterate", "invalid len %u\n", len);
+	CHECK(data != data_set[0], "iterate", "got 0x%x exp 0x%x\n",
+	      data, data_set[0]);
+
+	free(data_set);
+
+	close(fd);
+}
+
+/*
+ * Delete current key (also the smallest key) after iteration, the next
+ * iteration will return the second smallest key, so the iteration result
+ * is still ordered.
+ */
+static void test_qp_trie_iterate_then_delete(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 4096;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	/* Iteration results are big-endian ordered */
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		len = bpf_dynptr_user_get_size(&dynptr);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(data != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, data, data_set[i]);
+		cur = &dynptr;
+
+		/*
+		 * Delete the minimal key; the next call of bpf_map_get_next_key()
+		 * will return the second minimal key.
+		 */
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u elem error %d\n", i, err);
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	err = bpf_map_get_next_key(fd, NULL, &dynptr);
+	CHECK(err != -ENOENT, "non-empty qp-trie", "error %d\n", err);
+
+	free(data_set);
+
+	close(fd);
+}
+
+/* The range is half-closed: [from, to) */
+static void delete_random_keys_in_range(int fd, unsigned int *data_set,
+					unsigned int from, unsigned int to)
+{
+	unsigned int del_from, del_to;
+
+	if (from >= to)
+		return;
+
+	del_from = random() % (to - from) + from;
+	del_to = random() % (to - del_from) + del_from;
+	for (; del_from <= del_to; del_from++) {
+		struct bpf_dynptr_user dynptr;
+		int err;
+
+		/* Skip deleted keys */
+		if (data_set[del_from] == UINT_MAX)
+			continue;
+
+		bpf_dynptr_user_init(&data_set[del_from], sizeof(data_set[del_from]), &dynptr);
+		err = bpf_map_delete_elem(fd, &dynptr);
+		CHECK(err, "del elem", "#%u range %u-%u error %d\n", del_from, from, to, err);
+		data_set[del_from] = UINT_MAX;
+	}
+}
+
+/* Delete keys randomly and ensure the iteration returns the expected data */
+static void test_qp_trie_iterate_then_batch_delete(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 8192;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &dynptr);
+	for (i = 0; i < max_entries; i++) {
+		err = bpf_map_get_next_key(fd, cur, &dynptr);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		len = bpf_dynptr_user_get_size(&dynptr);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(data != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, data, data_set[i]);
+		cur = &dynptr;
+
+		/* Delete some keys from iterated keys */
+		delete_random_keys_in_range(fd, data_set, 0, i);
+
+		/* Skip deleted keys */
+		while (i + 1 < max_entries) {
+			if (data_set[i + 1] != UINT_MAX)
+				break;
+			i++;
+		}
+
+		/* Delete some keys from to-iterate keys */
+		delete_random_keys_in_range(fd, data_set, i + 1, max_entries);
+
+		/* Skip deleted keys */
+		while (i + 1 < max_entries) {
+			if (data_set[i + 1] != UINT_MAX)
+				break;
+			i++;
+		}
+	}
+	err = bpf_map_get_next_key(fd, cur, &dynptr);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	free(data_set);
+
+	close(fd);
+}
+
+/*
+ * Add keys with even index first and add keys with odd index during iteration.
+ * Check whether or not the whole key set is returned by the iteration procedure.
+ */
+static void test_qp_trie_iterate_then_add(void)
+{
+	unsigned int i, max_entries, value, data, len;
+	struct bpf_dynptr_user dynptr, next_key;
+	struct bpf_dynptr_user *cur;
+	unsigned int *data_set;
+	int fd, err;
+
+	max_entries = 8192;
+	data_set = gen_random_unique_data_set(max_entries);
+	qsort(data_set, max_entries, sizeof(*data_set), cmp_be32);
+
+	fd = qp_trie_create(sizeof(*data_set), sizeof(value), max_entries);
+	value = 1;
+	for (i = 0; i < max_entries; i++) {
+		if (i & 1)
+			continue;
+
+		bpf_dynptr_user_init(&data_set[i], sizeof(data_set[i]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i, err);
+	}
+
+	/* Iteration results are big-endian ordered */
+	cur = NULL;
+	bpf_dynptr_user_init(&data, sizeof(data), &next_key);
+	for (i = 0; i < max_entries; i++) {
+		err = bpf_map_get_next_key(fd, cur, &next_key);
+		CHECK(err, "iterate", "#%u error %d\n", i, err);
+
+		len = bpf_dynptr_user_get_size(&next_key);
+		CHECK(len != 4, "iterate", "#%u invalid len %u\n", i, len);
+		CHECK(data != data_set[i], "iterate", "#%u got 0x%x exp 0x%x\n",
+		      i, data, data_set[i]);
+		cur = &next_key;
+
+		if ((i & 1) || i + 1 >= max_entries)
+			continue;
+
+		/* Add the key with odd index, which will be returned in the next iteration */
+		bpf_dynptr_user_init(&data_set[i + 1], sizeof(data_set[i + 1]), &dynptr);
+		err = bpf_map_update_elem(fd, &dynptr, &value, BPF_NOEXIST);
+		CHECK(err, "add elem", "#%u error %d\n", i + 1, err);
+	}
+	err = bpf_map_get_next_key(fd, cur, &next_key);
+	CHECK(err != -ENOENT, "more element", "error %d\n", err);
+
+	free(data_set);
+
+	close(fd);
+}
+
+static int get_int_from_env(const char *key, int dft)
+{
+	const char *value = getenv(key);
+
+	if (!value)
+		return dft;
+	return atoi(value);
+}
+
+static void free_bytes_set(struct bpf_dynptr_user *set, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++)
+		free(bpf_dynptr_user_get_data(&set[i]));
+	free(set);
+}
+
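+/* Generate @nr keys with random content and a random length in [1, max_key_len] */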
+static struct bpf_dynptr_user *generate_random_bytes_set(unsigned int max_key_len, unsigned int nr)
+{
+	struct bpf_dynptr_user *set;
+	unsigned int i;
+
+	set = malloc(nr * sizeof(*set));
+	CHECK(!set, "malloc", "no mem for set");
+
+	for (i = 0; i < nr; i++) {
+		unsigned char *data;
+		unsigned int len, j;
+
+		len = random() % max_key_len + 1;
+		data = malloc(len);
+		CHECK(!data, "malloc", "no mem for data");
+
+		j = 0;
+		while (j + 4 <= len) {
+			unsigned int rnd = random();
+
+			memcpy(&data[j], &rnd, sizeof(rnd));
+			j += 4;
+		}
+		while (j < len)
+			data[j++] = random();
+
+		bpf_dynptr_user_init(data, len, &set[i]);
+	}
+
+	return set;
+}
+
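+/* Allocate a dynptr whose data area immediately follows the struct itself */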
+static struct bpf_dynptr_user *alloc_dynptr_user(unsigned int len)
+{
+	struct bpf_dynptr_user *dynptr;
+
+	dynptr = malloc(sizeof(*dynptr) + len);
+	if (!dynptr)
+		return NULL;
+
+	bpf_dynptr_user_init(&dynptr[1], len, dynptr);
+
+	return dynptr;
+}
+
+static int cmp_dynptr_user(const struct bpf_dynptr_user *a, const struct bpf_dynptr_user *b)
+{
+	unsigned int a_len = bpf_dynptr_user_get_size(a), b_len = bpf_dynptr_user_get_size(b);
+	unsigned int cmp = a_len < b_len ? a_len : b_len;
+	int ret;
+
+	ret = memcmp(bpf_dynptr_user_get_data(a), bpf_dynptr_user_get_data(b), cmp);
+	if (ret)
+		return ret;
+	return a_len - b_len;
+}
+
+static void dump_dynptr_user(const char *name, const struct bpf_dynptr_user *ptr)
+{
+	unsigned char *data = bpf_dynptr_user_get_data(ptr);
+	unsigned int i, len = bpf_dynptr_user_get_size(ptr);
+
+	fprintf(stderr, "%s dynptr len %u data %p\n", name, len, data);
+
+	for (i = 0; i < len; i++) {
+		fprintf(stderr, "%02x ", data[i]);
+		if (i % 16 == 15)
+			fprintf(stderr, "\n");
+	}
+	fprintf(stderr, "\n");
+}
+
+static void copy_and_reset_dynptr_user(struct bpf_dynptr_user *dst_ptr,
+				       struct bpf_dynptr_user *src_ptr, unsigned int reset_len)
+{
+	unsigned char *dst = bpf_dynptr_user_get_data(dst_ptr);
+	unsigned char *src = bpf_dynptr_user_get_data(src_ptr);
+	unsigned int src_len = bpf_dynptr_user_get_size(src_ptr);
+
+	memcpy(dst, src, src_len);
+	bpf_dynptr_user_init(dst, src_len, dst_ptr);
+	bpf_dynptr_user_init(src, reset_len, src_ptr);
+}
+
+static void *update_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	unsigned int i, j;
+
+	for (i = 0; i < ctx->loop; i++) {
+		for (j = 0; j < ctx->nr; j++) {
+			unsigned int value;
+			int err;
+
+			value = bpf_dynptr_user_get_size(&ctx->set[j]);
+			err = bpf_map_update_elem(ctx->fd, &ctx->set[j], &value, BPF_ANY);
+			if (err) {
+				fprintf(stderr, "update #%u element error %d\n", j, err);
+				return (void *)(long)err;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void *delete_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	unsigned int i, j;
+
+	for (i = 0; i < ctx->loop; i++) {
+		for (j = 0; j < ctx->nr; j++) {
+			int err;
+
+			err = bpf_map_delete_elem(ctx->fd, &ctx->set[j]);
+			if (err && err != -ENOENT) {
+				fprintf(stderr, "delete #%u element error %d\n", j, err);
+				return (void *)(long)err;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void *lookup_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	unsigned int i, j;
+
+	for (i = 0; i < ctx->loop; i++) {
+		for (j = 0; j < ctx->nr; j++) {
+			unsigned int got, value;
+			int err;
+
+			got = 0;
+			value = bpf_dynptr_user_get_size(&ctx->set[j]);
+			err = bpf_map_lookup_elem(ctx->fd, &ctx->set[j], &got);
+			if (!err && got != value) {
+				fprintf(stderr, "lookup #%u element got %u expected %u\n", j, got, value);
+				return (void *)(long)-EINVAL;
+			} else if (err && err != -ENOENT) {
+				fprintf(stderr, "lookup #%u element error %d\n", j, err);
+				return (void *)(long)err;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void *iterate_fn(void *arg)
+{
+	const struct qp_trie_rw_ctx *ctx = arg;
+	struct bpf_dynptr_user *key, *next_key;
+	unsigned int i;
+	int err;
+
+	key = NULL;
+	next_key = alloc_dynptr_user(ctx->max_key_len);
+	if (!next_key)
+		return (void *)(long)-ENOMEM;
+
+	err = 0;
+	for (i = 0; i < ctx->loop; i++) {
+		while (true) {
+			err = bpf_map_get_next_key(ctx->fd, key, next_key);
+			if (err < 0) {
+				if (err != -ENOENT) {
+					fprintf(stderr, "get key error %d\n", err);
+					goto out;
+				}
+				err = 0;
+				break;
+			}
+
+			/* If no deletion, next key should be greater than key */
+			if (!ctx->nr_delete && key && cmp_dynptr_user(key, next_key) >= 0) {
+				fprintf(stderr, "unordered iteration result\n");
+				dump_dynptr_user("previous key", key);
+				dump_dynptr_user("cur key", next_key);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (!key) {
+				key = alloc_dynptr_user(ctx->max_key_len);
+				if (!key) {
+					err = -ENOMEM;
+					goto out;
+				}
+			}
+
+			/* Copy next_key to key, and reset next_key */
+			copy_and_reset_dynptr_user(key, next_key, ctx->max_key_len);
+		}
+
+		free(key);
+		key = NULL;
+	}
+
+out:
+	free(key);
+	free(next_key);
+	return (void *)(long)err;
+}
+
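+/* Run the configured numbers of update, delete, lookup and iterate threads concurrently on one qp-trie */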
+static void do_qp_trie_stress_test(const struct stress_conf *conf)
+{
+	void *(*fns[MAX_OP])(void *arg) = {
+		update_fn, delete_fn, lookup_fn, iterate_fn,
+	};
+	unsigned int created[MAX_OP];
+	struct qp_trie_rw_ctx ctx;
+	pthread_t *tids[MAX_OP];
+	unsigned int op, i, err;
+
+	ctx.nr = conf->nr;
+	ctx.max_key_len = conf->max_key_len;
+	ctx.fd = qp_trie_create(ctx.max_key_len, sizeof(unsigned int), ctx.nr);
+	ctx.set = generate_random_bytes_set(ctx.max_key_len, ctx.nr);
+	ctx.loop = conf->loop;
+	ctx.nr_delete = conf->threads[DELETE_OP];
+
+	/* Create threads */
+	for (op = 0; op < ARRAY_SIZE(tids); op++) {
+		if (!conf->threads[op]) {
+			tids[op] = NULL;
+			continue;
+		}
+
+		tids[op] = malloc(conf->threads[op] * sizeof(*tids[op]));
+		CHECK(!tids[op], "malloc", "no mem for op %u threads %u\n", op, conf->threads[op]);
+	}
+
+	for (op = 0; op < ARRAY_SIZE(tids); op++) {
+		for (i = 0; i < conf->threads[op]; i++) {
+			err = pthread_create(&tids[op][i], NULL, fns[op], &ctx);
+			if (err) {
+				fprintf(stderr, "create #%u thread for op %u error %d\n", i, op, err);
+				break;
+			}
+		}
+		created[op] = i;
+	}
+
+	err = 0;
+	for (op = 0; op < ARRAY_SIZE(tids); op++) {
+		for (i = 0; i < created[op]; i++) {
+			void *thread_err = NULL;
+
+			pthread_join(tids[op][i], &thread_err);
+			if (thread_err)
+				err |= 1 << op;
+		}
+	}
+	CHECK(err, "stress operation", "err %u\n", err);
+
+	for (op = 0; op < ARRAY_SIZE(tids); op++)
+		free(tids[op]);
+	free_bytes_set(ctx.set, ctx.nr);
+	close(ctx.fd);
+}
+
+static void test_qp_trie_stress(void)
+{
+	struct stress_conf conf;
+
+	memset(&conf, 0, sizeof(conf));
+
+	/* Test concurrent update, lookup and iterate operations. There is
+	 * no deletion, so iteration can check the order of returned keys.
+	 */
+	conf.threads[UPDATE_OP] = get_int_from_env("QP_TRIE_NR_UPDATE", 8);
+	conf.threads[LOOKUP_OP] = get_int_from_env("QP_TRIE_NR_LOOKUP", 8);
+	conf.threads[ITERATE_OP] = get_int_from_env("QP_TRIE_NR_ITERATE", 8);
+	conf.max_key_len = get_int_from_env("QP_TRIE_MAX_KEY_LEN", 256);
+	conf.loop = get_int_from_env("QP_TRIE_NR_LOOP", 32);
+	conf.nr = get_int_from_env("QP_TRIE_NR_DATA", 8192);
+	do_qp_trie_stress_test(&conf);
+
+	/* Add delete operation */
+	conf.threads[DELETE_OP] = get_int_from_env("QP_TRIE_NR_DELETE", 8);
+	do_qp_trie_stress_test(&conf);
+}
+
+void test_qp_trie_map(void)
+{
+	test_qp_trie_create();
+
+	test_qp_trie_bad_update();
+
+	test_qp_trie_bad_lookup_delete();
+
+	test_qp_trie_one_subtree_update();
+
+	test_qp_trie_all_subtree_update();
+
+	test_qp_trie_rdonly_iterate();
+
+	test_qp_trie_iterate_then_delete();
+
+	test_qp_trie_iterate_then_batch_delete();
+
+	test_qp_trie_iterate_then_add();
+
+	test_qp_trie_stress();
+
+	printf("%s:PASS\n", __func__);
+}
-- 
2.29.2



* Re: [PATCH bpf-next 03/10] bpf: Support bpf_dynptr-typed map key for bpf syscall
  2022-09-17 15:31 ` [PATCH bpf-next 03/10] bpf: Support bpf_dynptr-typed map key for bpf syscall Hou Tao
@ 2022-09-17 18:18   ` kernel test robot
  2022-09-17 18:29   ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-09-17 18:18 UTC (permalink / raw)
  To: Hou Tao, bpf, Andrii Nakryiko
  Cc: llvm, kbuild-all, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Lorenz Bauer, Paul E . McKenney, houtao1

Hi Hou,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Hou-Tao/Add-support-for-qp-trie-map/20220917-231450
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm-randconfig-r016-20220917 (https://download.01.org/0day-ci/archive/20220918/202209180234.jN58ZaBL-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 791a7ae1ba3efd6bca96338e10ffde557ba83920)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/1995ae6874ef9089b4eee12c00ba9c5af264b8bc
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Hou-Tao/Add-support-for-qp-trie-map/20220917-231450
        git checkout 1995ae6874ef9089b4eee12c00ba9c5af264b8bc
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: __get_user_bad
   >>> referenced by syscall.c:1368 (kernel/bpf/syscall.c:1368)
   >>>               bpf/syscall.o:(bpf_copy_to_dynptr_ukey) in archive kernel/built-in.a

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH bpf-next 03/10] bpf: Support bpf_dynptr-typed map key for bpf syscall
  2022-09-17 15:31 ` [PATCH bpf-next 03/10] bpf: Support bpf_dynptr-typed map key for bpf syscall Hou Tao
  2022-09-17 18:18   ` kernel test robot
@ 2022-09-17 18:29   ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-09-17 18:29 UTC (permalink / raw)
  To: Hou Tao, bpf, Andrii Nakryiko
  Cc: kbuild-all, Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Paul E . McKenney, houtao1

Hi Hou,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Hou-Tao/Add-support-for-qp-trie-map/20220917-231450
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm-randconfig-r002-20220917 (https://download.01.org/0day-ci/archive/20220918/202209180207.PvS9ya46-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/1995ae6874ef9089b4eee12c00ba9c5af264b8bc
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Hou-Tao/Add-support-for-qp-trie-map/20220917-231450
        git checkout 1995ae6874ef9089b4eee12c00ba9c5af264b8bc
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arm-linux-gnueabi-ld: kernel/bpf/syscall.o: in function `bpf_copy_to_dynptr_ukey':
>> kernel/bpf/syscall.c:1368: undefined reference to `__get_user_bad'
   pahole: .tmp_vmlinux.btf: No such file or directory
   .btf.vmlinux.bin.o: file not recognized: file format not recognized


vim +1368 kernel/bpf/syscall.c

  1361	
  1362	static int bpf_copy_to_dynptr_ukey(struct bpf_dynptr_user __user *uptr,
  1363					   struct bpf_dynptr_kern *kptr)
  1364	{
  1365		unsigned int size;
  1366		u64 udata;
  1367	
> 1368		if (get_user(udata, &uptr->data))
  1369			return -EFAULT;
  1370	
  1371		/* Also zeroing the reserved field in uptr */
  1372		size = bpf_dynptr_get_size(kptr);
  1373		if (copy_to_user(u64_to_user_ptr(udata), kptr->data + kptr->offset, size) ||
  1374		    put_user(size, &uptr->size) || put_user(0, &uptr->size + 1))
  1375			return -EFAULT;
  1376	
  1377		return 0;
  1378	}
  1379	
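
The undefined __get_user_bad reference usually means get_user() was asked for an
access size (8 bytes here) the architecture's uaccess code does not implement. One
way this could be avoided -- only a sketch based on the quoted context above, not a
verified fix -- is to read the u64 data field with copy_from_user() instead:

static int bpf_copy_to_dynptr_ukey(struct bpf_dynptr_user __user *uptr,
				   struct bpf_dynptr_kern *kptr)
{
	unsigned int size;
	u64 udata;

	/* copy_from_user() sidesteps the 8-byte get_user() that some
	 * 32-bit architectures do not provide
	 */
	if (copy_from_user(&udata, &uptr->data, sizeof(udata)))
		return -EFAULT;

	/* Also zeroing the reserved field in uptr */
	size = bpf_dynptr_get_size(kptr);
	if (copy_to_user(u64_to_user_ptr(udata), kptr->data + kptr->offset, size) ||
	    put_user(size, &uptr->size) || put_user(0, &uptr->size + 1))
		return -EFAULT;

	return 0;
}

Whether a plain copy_from_user() or an arch-side 8-byte get_user() fallback is
preferable is left to the patch author.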

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

