* [RFC PATCH bpf-next 0/3] Add support for qp-trie map
@ 2022-07-26 13:00 Hou Tao
  2022-07-26 13:00 ` [RFC PATCH bpf-next 1/3] bpf: " Hou Tao
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Hou Tao @ 2022-07-26 13:00 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov, bpf
  Cc: Daniel Borkmann, Martin KaFai Lau, Yonghong Song, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, John Fastabend, houtao1

Hi,

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length, as
discussed in [0]. And as a big-endian lexicographically-ordered map, it
can also be used for any binary data with fixed or variable length.
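
For reference, the sketch below shows how user space might build such a
variable-length key for update/lookup. The layout follows struct
bpf_qp_trie_key added in patch 1; the wrapper struct, its 64-byte buffer
and the helper name are illustrative only:

#include <string.h>
#include <linux/types.h>
#include <bpf/bpf.h>

/* Fixed-capacity wrapper around the variable-length key ABI:
 * a 32-bit length followed by the key bytes. Only the first
 * 'len' bytes of data[] are compared by the qp-trie map.
 */
struct str_key {
	__u32 len;
	char data[64];
};

static int update_str_key(int map_fd, const char *s, __u32 value)
{
	struct str_key key = {};

	key.len = strlen(s);	/* whether to count the trailing NUL is up to the user */
	memcpy(key.data, s, key.len);
	return bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
}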

Now the basic functionality of qp-trie is ready, so I am posting an RFC
version to get more feedback and suggestions about qp-trie, especially
feedback on the following questions:

(1) Application scenario for qp-trie
Andrii proposed re-implementing lpm-trie on top of qp-trie. The
advantage would be faster lookup operations due to the lower tree depth
of qp-trie. Update performance might also improve, although in Cilium
there is a big lock around lpm-trie updates [1]. Are there any other use
cases for qp-trie? Especially cases which need both ordering and memory
efficiency, or cases in which jhash() of htab creates too many
collisions and qp-trie lookup performs better than hash-table lookup, as
shown below (a user-space sketch of ordered iteration follows the
numbers):

  Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)

  htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
  htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
  htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
  htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
  htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
  
  qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
  qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
  qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
  qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
  qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)

  * The non-zero drop rates are due to duplicate keys in the generated data.
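
Since qp-trie keys are returned in lexicographic order, ordered
iteration from user space only needs bpf_map_get_next_key(). A minimal,
illustrative sketch (the fixed-size key wrapper and its 64-byte buffer
are assumptions, not part of the patch; an empty key starts the walk
from the minimum key, per qp_trie_get_next_key() in patch 1):

#include <stdio.h>
#include <linux/types.h>
#include <bpf/bpf.h>

struct str_key {
	__u32 len;	/* number of valid bytes in data[] */
	char data[64];	/* unused tail is ignored by the map */
};

static void dump_keys_in_order(int map_fd)
{
	struct str_key cur = {}, next;	/* len == 0: start from the minimum key */

	while (!bpf_map_get_next_key(map_fd, &cur, &next)) {
		printf("%.*s\n", (int)next.len, next.data);
		cur = next;
	}
}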

(2) More fine-grained locking in qp-trie
Currently qp-trie is divided into 256 sub-trees by using the first byte
of the key, and each sub-tree is protected by its own spinlock. From the
data below, although the update/delete speed of qp-trie is slower than
that of the hash table, it scales similarly. So maybe 256 locks are a
good enough solution? (A toy sketch of the per-subtree locking scheme
follows the numbers below.)

  Strings in /proc/kallsyms
  htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
  htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
  htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
  htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
  htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
  qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
  qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
  qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
  qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
  qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
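
For illustration, a toy user-space analogue of the per-subtree locking
scheme described above (this is not the kernel code; pthread spinlocks
and the helper names stand in for the in-kernel spinlock_t and update
path):

#include <pthread.h>
#include <stddef.h>

#define NR_SUBTREE 256	/* one sub-tree per possible value of the key's first byte */

struct toy_qp_trie {
	void *roots[NR_SUBTREE];
	pthread_spinlock_t locks[NR_SUBTREE];
};

/* Two updates only contend when their keys share the same first byte. */
static int toy_update(struct toy_qp_trie *t, const unsigned char *key, size_t len,
		      int (*update_subtree)(void **root, const unsigned char *key,
					    size_t len))
{
	unsigned char c = key[0];
	int err;

	pthread_spin_lock(&t->locks[c]);
	err = update_subtree(&t->roots[c], key, len);
	pthread_spin_unlock(&t->locks[c]);
	return err;
}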

(3) Improve memory efficiency further
When using the strings in the BTF string section as a data set for
qp-trie, the slab memory usage shown in the cgroup memory.stat file is
about 11MB for qp-trie and 15MB for the hash table, as shown below.
However, the theoretical memory usage for qp-trie is ~6.8MB (~4.9MB if
the "parent" & "rcu" fields are removed from qp_trie_branch), and the
extra memory usage (about 38% of the total) mainly comes from internal
fragmentation in slab (namely rounding allocations up to the next
kmalloc size) and from the overhead of kmem-cgroup accounting. We could
reduce the internal fragmentation by creating separate kmem_caches for
qp_trie_branch with different numbers of child nodes, but I am not sure
whether it is worthwhile.
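
To make the fragmentation concrete, here is a rough, illustrative
estimate of the branch node size on a 64-bit kernel (field sizes follow
struct qp_trie_branch in patch 1; the exact kmalloc cache a given
allocation lands in depends on the kernel configuration):

#include <stdio.h>

/* index (4) + bitmap:17/to_free_idx:5 (4) + parent pointer (8) +
 * rcu_head (two pointers, 16) = 32 bytes of header, plus one 8-byte
 * pointer per child in nodes[].
 */
static size_t branch_size(unsigned int nr_children)
{
	return 32 + nr_children * 8;
}

int main(void)
{
	unsigned int nr;

	/* e.g. a 5-child branch needs 72 bytes but is served from a larger
	 * generic kmalloc cache; per-child-count kmem_caches would avoid
	 * that rounding.
	 */
	for (nr = 1; nr <= 17; nr++)
		printf("%2u children -> %3zu bytes\n", nr, branch_size(nr));
	return 0;
}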

Also, in order to avoid allocating an rcu_head for each leaf node, only
branch nodes are RCU-freed for now. So when a leaf node is replaced, a
new branch node and a new leaf node are allocated, instead of replacing
the old leaf node in place and RCU-freeing it. I am not sure whether
this trade-off is worthwhile either.

  Strings in BTF string section (entries=115980):
  htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
  qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)

  All files under linux kernel source directory (entries=74359):
  htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
  qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)

  Domain names for Alexa top million web site (entries=1000000):
  htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
  qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)

Comments and suggestions are always welcome.

Regards,
Tao

[0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
[1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123

Hou Tao (3):
  bpf: Add support for qp-trie map
  selftests/bpf: add a simple test for qp-trie
  selftests/bpf: add benchmark for qp-trie map

 include/linux/bpf_types.h                     |    1 +
 include/uapi/linux/bpf.h                      |    8 +
 kernel/bpf/Makefile                           |    1 +
 kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |    8 +
 tools/testing/selftests/bpf/Makefile          |    5 +-
 tools/testing/selftests/bpf/bench.c           |   10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
 .../selftests/bpf/prog_tests/str_key.c        |   69 ++
 .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
 tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
 12 files changed, 2022 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/bpf_qp_trie.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/str_key.c

-- 
2.29.2



* [RFC PATCH bpf-next 1/3] bpf: Add support for qp-trie map
  2022-07-26 13:00 [RFC PATCH bpf-next 0/3] Add support for qp-trie map Hou Tao
@ 2022-07-26 13:00 ` Hou Tao
  2022-07-26 13:00 ` [RFC PATCH bpf-next 2/3] selftests/bpf: add a simple test for qp-trie Hou Tao
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2022-07-26 13:00 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov, bpf
  Cc: Daniel Borkmann, Martin KaFai Lau, Yonghong Song, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, John Fastabend, houtao1

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length. Moreover,
as a big-endian lexicographically-ordered map, qp-trie can also be used
for any binary data with fixed or variable length.

The memory efficiency of qp-trie comes partly from its design: a branch
node does not store the key, and a sparse array is used to store the
child nodes. It also comes partly from the support for variable-length
keys: only the used part of the key is stored.

But the memory efficiency and key ordering come at a cost: the lookup
performance of qp-trie is about 30% or more slower compared with the
hash table.
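
For illustration, a small user-space sketch of the nibble indexing and
of the bitmap + popcount child lookup that the implementation relies on
(not part of the patch; calc_twig_index() below is the in-kernel
counterpart of child_slot()):

/* Nibble 2*N of the key is the upper nibble of byte N, nibble 2*N+1 the
 * lower nibble.
 */
static unsigned int nibble_at(const unsigned char *key, unsigned int pos)
{
	unsigned char byte = key[pos / 2];

	return (pos & 1) ? (byte & 0xf) : (byte >> 4);
}

/* Bit 0 of a branch bitmap is reserved for "leaf whose key equals the
 * prefix", so nibble value v maps to bit (v + 1). The slot of that child
 * in the sparse nodes[] array is the number of set bits below its bit.
 */
static unsigned int child_slot(unsigned int bitmap, unsigned int nibble)
{
	unsigned int bit = 1U << (nibble + 1);

	return __builtin_popcount(bitmap & (bit - 1));
}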

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_types.h      |    1 +
 include/uapi/linux/bpf.h       |    8 +
 kernel/bpf/Makefile            |    1 +
 kernel/bpf/bpf_qp_trie.c       | 1064 ++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |    8 +
 5 files changed, 1082 insertions(+)
 create mode 100644 kernel/bpf/bpf_qp_trie.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b9112b80171..a73d47f83d74 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -126,6 +126,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_QP_TRIE, qp_trie_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ffcbf79a556b..39a65bf0d9f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -82,6 +82,13 @@ struct bpf_lpm_trie_key {
 	__u8	data[0];	/* Arbitrary size */
 };
 
+struct bpf_qp_trie_key {
+	/* the length of blob data */
+	__u32 len;
+	/* blob data */
+	__u8 data[0];
+};
+
 struct bpf_cgroup_storage_key {
 	__u64	cgroup_inode_id;	/* cgroup inode id */
 	__u32	attach_type;		/* program attach type (enum bpf_attach_type) */
@@ -909,6 +916,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_QP_TRIE,
 };
 
 /* Note that tracing related programs such as
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..99fd5b10d544 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_i
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_qp_trie.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
diff --git a/kernel/bpf/bpf_qp_trie.c b/kernel/bpf/bpf_qp_trie.c
new file mode 100644
index 000000000000..6b2672e67c87
--- /dev/null
+++ b/kernel/bpf/bpf_qp_trie.c
@@ -0,0 +1,1064 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Derived from qp.c in https://github.com/fanf2/qp.git
+ *
+ * Copyright (C) 2022. Huawei Technologies Co., Ltd
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+
+/* qp-trie (quadbit popcount trie) is a memory efficient trie. Unlike a
+ * normal trie, which consumes one byte of the key per lookup step,
+ * qp-trie interprets its keys as an array of nibbles (quadbits) and
+ * consumes one nibble per step. The upper nibble of byte N in the key
+ * is element 2*N of the nibble array, and the lower nibble of byte N
+ * is element 2*N+1.
+ *
+ * A branch node in a normal trie may have up to 256 child nodes, while
+ * a qp-trie branch node may have up to 17. Child #0 is special: it must
+ * be a leaf node and its key equals the prefix of the branch node.
+ * Children #1~#16 are leaf or branch nodes whose keys differ from the
+ * parent's prefix. The prefix of a branch node is the common prefix of
+ * its child nodes, and the index of a child node minus one is the value
+ * of the first nibble that differs among these child nodes.
+ *
+ * qp-trie reduces memory usage through two methods:
+ * (1) Branch node doesn't store the key. It only stores the position of
+ *     the first nibble which differentiates child nodes.
+ * (2) Branch node doesn't store all 17 child nodes. It uses a bitmap and
+ *     popcount() to implement a sparse array and only allocates memory
+ *     for those present children.
+ *
+ * Like a normal trie, qp-trie is ordered, in big-endian lexicographic
+ * order. Traversing a qp-trie depth-first returns its keys in sorted
+ * order.
+ *
+ * The following diagrams show the construction of a tiny qp-trie:
+ *
+ * (1) insert abc
+ *
+ *          [ leaf node: abc ]
+ *
+ * (2) insert abc_d
+ *
+ * The first different nibble between "abc" and "abc_d" is the upper nibble
+ * of character '_' (0x5), and its position in nibble array is 6
+ * (starts from 0).
+ *
+ *          [ branch node ] bitmap: 0x41 diff pos: 6
+ *                 |
+ *                 *
+ *             children
+ *          [0]        [6]
+ *           |          |
+ *       [leaf: abc] [leaf: abc_d]
+ *
+ * (3) insert abc_e
+ *
+ * The first different nibble between "abc_d" and "abc_e" is the lower
+ * nibble of character 'd'/'e', and its position in array is 9.
+ *
+ *          [ branch node ] bitmap: 0x41 diff pos: 6
+ *                 |
+ *                 *
+ *             children
+ *          [0]        [6]
+ *           |          |
+ *       [leaf: abc]    |
+ *                      *
+ *                [ branch node ] bitmap: 0x60 diff pos: 9
+ *                      |
+ *                      *
+ *                   children
+ *                [5]        [6]
+ *                 |          |
+ *          [leaf: abc_d]  [leaf: abc_e]
+ */
+
+#define QP_TRIE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | \
+				  BPF_F_ACCESS_MASK)
+
+/* bit[0] of nodes in qp_trie_branch is used to tell node type:
+ *
+ * bit[0]: 0-branch node
+ * bit[0]: 1-leaf node
+ *
+ * The size of qp_trie_branch is already 2-byte aligned, so only the leaf
+ * node allocation needs to be made 2-byte aligned.
+ */
+#define QP_TRIE_LEAF_NODE_MASK 1UL
+#define QP_TRIE_LEAF_ALLOC_ALIGN 2
+
+/* To reduce memory usage, only qp_trie_branch is RCU-freed. To handle
+ * freeing of the last leaf node, an extra qp_trie_branch node is
+ * allocated. The branch node has only one child and its index is 0. It
+ * is set as root node after adding the first leaf node.
+ */
+#define QP_TRIE_ROOT_NODE_INDEX 0
+#define QP_TRIE_NON_ROOT_NODE_MASK 1
+
+#define QP_TRIE_NIBBLE_SHIFT 1
+#define QP_TRIE_BYTE_INDEX_SHIFT 2
+
+#define QP_TRIE_MIN_KEY_DATA_SIZE 1
+#define QP_TRIE_MAX_KEY_DATA_SIZE (1U << 30)
+#define QP_TRIE_MIN_KEY_SIZE (sizeof(struct bpf_qp_trie_key) + QP_TRIE_MIN_KEY_DATA_SIZE)
+#define QP_TRIE_MAX_KEY_SIZE (sizeof(struct bpf_qp_trie_key) + QP_TRIE_MAX_KEY_DATA_SIZE)
+
+#define QP_TRIE_TWIGS_FREE_NONE_IDX 17
+
+struct qp_trie_branch {
+	/* The bottom two bits of index are used as special flags:
+	 *
+	 * bit[0]: 0-root, 1-not root
+	 * bit[1]: 0-upper nibble, 1-lower nibble
+	 *
+	 * bit[2:31]: byte index for key
+	 */
+	unsigned int index;
+	/* 17 bits are used to accommodate arbitrary keys, even keys that
+	 * contain zero bytes.
+	 *
+	 * bit[0]: set when a leaf node's key equals the parent node's prefix
+	 * bit[N]: set when a child node's nibble at 'index' is (N - 1)
+	 */
+	unsigned int bitmap:17;
+	/* Index of the leaf node to be RCU-freed together with this branch */
+	unsigned int to_free_idx:5;
+	struct qp_trie_branch __rcu *parent;
+	struct rcu_head rcu;
+	void __rcu *nodes[0];
+};
+
+#define QP_TRIE_NR_SUBTREE 256
+
+struct qp_trie {
+	struct bpf_map map;
+	atomic_t entries;
+	void __rcu *roots[QP_TRIE_NR_SUBTREE];
+	spinlock_t locks[QP_TRIE_NR_SUBTREE];
+};
+
+struct qp_trie_diff {
+	unsigned int index;
+	unsigned int sibling_bm;
+	unsigned int new_bm;
+};
+
+static inline bool is_valid_key_data_len(const struct bpf_qp_trie_key *key,
+					 unsigned int key_size)
+{
+	return key->len >= QP_TRIE_MIN_KEY_DATA_SIZE &&
+	       key->len <= key_size - sizeof(*key);
+}
+
+static inline void *to_child_node(struct bpf_qp_trie_key *key)
+{
+	return (void *)((long)key | QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline struct bpf_qp_trie_key *to_leaf_node(void *node)
+{
+	return (void *)((long)node & ~QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline bool is_branch_node(void *node)
+{
+	return !((long)node & QP_TRIE_LEAF_NODE_MASK);
+}
+
+static inline bool is_same_key(const struct bpf_qp_trie_key *l, const struct bpf_qp_trie_key *r)
+{
+	return l->len == r->len && !memcmp(l->data, r->data, r->len);
+}
+
+static inline void *qp_trie_leaf_value(const struct bpf_qp_trie_key *key)
+{
+	return (void *)key + sizeof(*key) + key->len;
+}
+
+static inline unsigned int calc_twig_index(unsigned int mask, unsigned int bitmap)
+{
+	return hweight32(mask & (bitmap - 1));
+}
+
+static inline unsigned int calc_twig_nr(unsigned int bitmap)
+{
+	return hweight32(bitmap);
+}
+
+static inline unsigned int nibble_to_bitmap(unsigned char nibble)
+{
+	return 1U << (nibble + 1);
+}
+
+static inline unsigned int index_to_byte_index(unsigned int index)
+{
+	return index >> QP_TRIE_BYTE_INDEX_SHIFT;
+}
+
+static inline unsigned int calc_br_bitmap(unsigned int index, const struct bpf_qp_trie_key *key)
+{
+	unsigned int byte;
+	unsigned char nibble;
+
+	if (index == QP_TRIE_ROOT_NODE_INDEX)
+		return 1;
+
+	byte = index_to_byte_index(index);
+	if (byte >= key->len)
+		return 1;
+
+	nibble = key->data[byte];
+	/* lower nibble */
+	if ((index >> QP_TRIE_NIBBLE_SHIFT) & 1)
+		nibble &= 0xf;
+	else
+		nibble >>= 4;
+	return nibble_to_bitmap(nibble);
+}
+
+static void qp_trie_free_twigs_rcu(struct rcu_head *rcu)
+{
+	struct qp_trie_branch *twigs = container_of(rcu, struct qp_trie_branch, rcu);
+	unsigned int idx = twigs->to_free_idx;
+
+	if (idx != QP_TRIE_TWIGS_FREE_NONE_IDX)
+		kfree(to_leaf_node(twigs->nodes[idx]));
+	kfree(twigs);
+}
+
+static void qp_trie_branch_free(struct qp_trie_branch *twigs, unsigned int to_free_idx)
+{
+	twigs->to_free_idx = to_free_idx;
+	call_rcu(&twigs->rcu, qp_trie_free_twigs_rcu);
+}
+
+static inline struct qp_trie_branch *
+qp_trie_branch_new(struct bpf_map *map, unsigned int nr)
+{
+	struct qp_trie_branch *a;
+
+	a = bpf_map_kmalloc_node(map, sizeof(*a) + nr * sizeof(*a->nodes),
+				 GFP_NOWAIT | __GFP_NOWARN, map->numa_node);
+	return a;
+}
+
+static inline void qp_trie_assign_parent(struct qp_trie_branch *parent, void *node)
+{
+	if (is_branch_node(node))
+		rcu_assign_pointer(((struct qp_trie_branch *)node)->parent, parent);
+}
+
+static void qp_trie_update_parent(struct qp_trie_branch *parent, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++)
+		qp_trie_assign_parent(parent, parent->nodes[i]);
+}
+
+/* new_node can be either a leaf node or a branch node */
+static struct qp_trie_branch *
+qp_trie_branch_replace(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap,
+		       void *new_node)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap);
+	unsigned int p = calc_twig_index(old->bitmap, bitmap);
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr);
+	if (!twigs)
+		return NULL;
+
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+
+	rcu_assign_pointer(twigs->nodes[p], new_node);
+
+	if (nr - 1 > p)
+		memcpy(&twigs->nodes[p+1], &old->nodes[p+1], (nr - 1 - p) * sizeof(*twigs->nodes));
+
+	twigs->index = old->index;
+	twigs->bitmap = old->bitmap;
+	/* Initialize parent of upper-level node first,
+	 * then update parent for child nodes after parent node is
+	 * fully initialized
+	 */
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	qp_trie_update_parent(twigs, nr);
+
+	return twigs;
+}
+
+static struct qp_trie_branch *
+qp_trie_branch_insert(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap,
+		      struct bpf_qp_trie_key *new)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap);
+	unsigned int p = calc_twig_index(old->bitmap, bitmap);
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr + 1);
+	if (!twigs)
+		return NULL;
+
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+
+	rcu_assign_pointer(twigs->nodes[p], to_child_node(new));
+
+	if (nr > p)
+		memcpy(&twigs->nodes[p+1], &old->nodes[p], (nr - p) * sizeof(*twigs->nodes));
+
+	twigs->bitmap = old->bitmap | bitmap;
+	twigs->index = old->index;
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	qp_trie_update_parent(twigs, nr + 1);
+
+	return twigs;
+}
+
+static struct qp_trie_branch *
+qp_trie_branch_remove(struct bpf_map *map, struct qp_trie_branch *old, unsigned int bitmap)
+{
+	unsigned int nr = calc_twig_nr(old->bitmap);
+	unsigned int p = calc_twig_index(old->bitmap, bitmap);
+	struct qp_trie_branch *twigs;
+
+	twigs = qp_trie_branch_new(map, nr - 1);
+	if (!twigs)
+		return NULL;
+
+	if (p)
+		memcpy(twigs->nodes, old->nodes, p * sizeof(*twigs->nodes));
+	if (nr - 1 > p)
+		memcpy(&twigs->nodes[p], &old->nodes[p+1], (nr - 1 - p) * sizeof(*twigs->nodes));
+
+	twigs->bitmap = old->bitmap & ~bitmap;
+	twigs->index = old->index;
+	RCU_INIT_POINTER(twigs->parent, old->parent);
+
+	qp_trie_update_parent(twigs, nr - 1);
+
+	return twigs;
+}
+
+static struct bpf_qp_trie_key *qp_trie_init_leaf_node(struct bpf_map *map, void *k, void *v)
+{
+	struct bpf_qp_trie_key *key = k, *new;
+	unsigned int key_size, total_size;
+
+	key_size = sizeof(*key) + key->len;
+	total_size = round_up(key_size + map->value_size, QP_TRIE_LEAF_ALLOC_ALIGN);
+	new = bpf_map_kmalloc_node(map, total_size, GFP_NOWAIT | __GFP_NOWARN, map->numa_node);
+	if (!new)
+		return NULL;
+
+	memcpy(new, k, key_size);
+	memcpy((void *)new + key_size, v, map->value_size);
+
+	return new;
+}
+
+static bool calc_prefix_len(const struct bpf_qp_trie_key *s_key,
+			    const struct bpf_qp_trie_key *n_key, unsigned int *index)
+{
+	unsigned int len = min(s_key->len, n_key->len);
+	unsigned int i;
+	unsigned char diff = 0;
+
+	for (i = 0; i < len; i++) {
+		diff = s_key->data[i] ^ n_key->data[i];
+		if (diff)
+			break;
+	}
+
+	*index = (i << QP_TRIE_BYTE_INDEX_SHIFT) | QP_TRIE_NON_ROOT_NODE_MASK;
+	if (!diff)
+		return s_key->len == n_key->len;
+
+	*index += (diff & 0xf0) ? 0 : (1U << QP_TRIE_NIBBLE_SHIFT);
+	return false;
+}
+
+static int qp_trie_new_branch(struct qp_trie *trie, struct qp_trie_branch **parent,
+			      unsigned int bitmap, void *sibling, struct qp_trie_diff *d,
+			      void *key, void *value)
+{
+	struct qp_trie_branch *new_child_twigs, *new_twigs, *old_twigs;
+	struct bpf_qp_trie_key *leaf;
+	struct bpf_map *map;
+	unsigned int iip;
+	int err;
+
+	map = &trie->map;
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	leaf = qp_trie_init_leaf_node(map, key, value);
+	if (!leaf) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+
+	new_child_twigs = qp_trie_branch_new(map, 2);
+	if (!new_child_twigs) {
+		err = -ENOMEM;
+		goto free_leaf;
+	}
+
+	new_child_twigs->index = d->index;
+	new_child_twigs->bitmap = d->sibling_bm | d->new_bm;
+
+	iip = calc_twig_index(new_child_twigs->bitmap, d->sibling_bm);
+	RCU_INIT_POINTER(new_child_twigs->nodes[iip], sibling);
+	rcu_assign_pointer(new_child_twigs->nodes[!iip], to_child_node(leaf));
+	RCU_INIT_POINTER(new_child_twigs->parent, NULL);
+
+	old_twigs = *parent;
+	new_twigs = qp_trie_branch_replace(map, old_twigs, bitmap, new_child_twigs);
+	if (!new_twigs) {
+		err = -ENOMEM;
+		goto free_child_twigs;
+	}
+
+	qp_trie_assign_parent(new_child_twigs, sibling);
+	rcu_assign_pointer(*parent, new_twigs);
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+
+	return 0;
+
+free_child_twigs:
+	kfree(new_child_twigs);
+free_leaf:
+	kfree(leaf);
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_ext_branch(struct qp_trie *trie, struct qp_trie_branch **parent,
+			      void *key, void *value, unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_qp_trie_key *new;
+	struct bpf_map *map;
+	int err;
+
+	map = &trie->map;
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	new = qp_trie_init_leaf_node(map, key, value);
+	if (!new) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+
+	old_twigs = *parent;
+	new_twigs = qp_trie_branch_insert(map, old_twigs, bitmap, new);
+	if (!new_twigs) {
+		err = -ENOMEM;
+		goto free_leaf;
+	}
+
+	rcu_assign_pointer(*parent, new_twigs);
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+
+	return 0;
+
+free_leaf:
+	kfree(new);
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_add_leaf_node(struct qp_trie *trie, struct qp_trie_branch **parent,
+				 void *key, void *value)
+{
+	struct bpf_map *map = &trie->map;
+	struct qp_trie_branch *twigs;
+	struct bpf_qp_trie_key *new;
+	int err;
+
+	if (atomic_inc_return(&trie->entries) > map->max_entries) {
+		err = -ENOSPC;
+		goto dec_entries;
+	}
+
+	new = qp_trie_init_leaf_node(map, key, value);
+	if (!new) {
+		err = -ENOMEM;
+		goto dec_entries;
+	}
+
+	twigs = qp_trie_branch_new(map, 1);
+	if (!twigs) {
+		err = -ENOMEM;
+		goto free_leaf;
+	}
+	twigs->index = QP_TRIE_ROOT_NODE_INDEX;
+	twigs->bitmap = 1;
+	RCU_INIT_POINTER(twigs->parent, NULL);
+	rcu_assign_pointer(twigs->nodes[0], to_child_node(new));
+
+	rcu_assign_pointer(*parent, twigs);
+
+	return 0;
+free_leaf:
+	kfree(new);
+dec_entries:
+	atomic_dec(&trie->entries);
+	return err;
+}
+
+static int qp_trie_rep_leaf_node(struct qp_trie *trie, struct qp_trie_branch **parent,
+				 void *key, void *value, struct bpf_qp_trie_key *leaf,
+				 unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map = &trie->map;
+	struct bpf_qp_trie_key *new;
+	int err;
+
+	new = qp_trie_init_leaf_node(map, key, value);
+	if (!new)
+		return -ENOMEM;
+
+	old_twigs = *parent;
+	new_twigs = qp_trie_branch_replace(map, old_twigs, bitmap, to_child_node(new));
+	if (!new_twigs) {
+		err = -ENOMEM;
+		goto free_leaf;
+	}
+	rcu_assign_pointer(*parent, new_twigs);
+
+	qp_trie_branch_free(old_twigs, calc_twig_index(old_twigs->bitmap, bitmap));
+
+	return 0;
+free_leaf:
+	kfree(new);
+	return err;
+}
+
+static int qp_trie_remove_leaf(struct qp_trie *trie, struct qp_trie_branch **parent,
+			       unsigned int bitmap, const struct bpf_qp_trie_key *node)
+{
+	struct bpf_map *map = &trie->map;
+	struct qp_trie_branch *new, *old;
+	unsigned int nr;
+
+	old = *parent;
+	nr = calc_twig_nr(old->bitmap);
+	if (nr > 2) {
+		new = qp_trie_branch_remove(map, old, bitmap);
+		if (!new)
+			return -ENOMEM;
+	} else {
+		new = NULL;
+	}
+
+	rcu_assign_pointer(*parent, new);
+
+	qp_trie_branch_free(old, calc_twig_index(old->bitmap, bitmap));
+
+	atomic_dec(&trie->entries);
+
+	return 0;
+}
+
+static int qp_trie_merge_node(struct qp_trie *trie, struct qp_trie_branch **grand_parent,
+			      struct qp_trie_branch *parent, unsigned int parent_bitmap,
+			      unsigned int bitmap)
+{
+	struct qp_trie_branch *old_twigs, *new_twigs;
+	struct bpf_map *map = &trie->map;
+	void *new_sibling;
+	unsigned int iip;
+
+	iip = calc_twig_index(parent->bitmap, bitmap);
+	new_sibling = parent->nodes[!iip];
+
+	old_twigs = *grand_parent;
+	new_twigs = qp_trie_branch_replace(map, old_twigs, parent_bitmap, new_sibling);
+	if (!new_twigs)
+		return -ENOMEM;
+
+	rcu_assign_pointer(*grand_parent, new_twigs);
+
+	qp_trie_branch_free(old_twigs, QP_TRIE_TWIGS_FREE_NONE_IDX);
+	qp_trie_branch_free(parent, iip);
+
+	atomic_dec(&trie->entries);
+
+	return 0;
+}
+
+/* key and value are allocated together in qp_trie_init_leaf_node() */
+static inline bool is_valid_k_v_size(unsigned int key_size, unsigned int value_size)
+{
+	return round_up((u64)key_size + value_size, QP_TRIE_LEAF_ALLOC_ALIGN) <=
+	       KMALLOC_MAX_SIZE;
+}
+
+static int qp_trie_alloc_check(union bpf_attr *attr)
+{
+	if (!bpf_capable())
+		return -EPERM;
+
+	if (!(attr->map_flags & BPF_F_NO_PREALLOC) ||
+	    attr->map_flags & ~QP_TRIE_CREATE_FLAG_MASK ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return -EINVAL;
+
+	if (!attr->max_entries || attr->key_size < QP_TRIE_MIN_KEY_SIZE ||
+	    !attr->value_size)
+		return -EINVAL;
+
+	if (attr->key_size > QP_TRIE_MAX_KEY_SIZE ||
+	    !is_valid_k_v_size(attr->key_size, attr->value_size))
+		return -E2BIG;
+
+	return 0;
+}
+
+static struct bpf_map *qp_trie_alloc(union bpf_attr *attr)
+{
+	struct qp_trie *trie;
+	unsigned int i;
+
+	trie = bpf_map_area_alloc(sizeof(*trie), bpf_map_attr_numa_node(attr));
+	if (!trie)
+		return ERR_PTR(-ENOMEM);
+
+	/* roots are zeroed by bpf_map_area_alloc() */
+	for (i = 0; i < ARRAY_SIZE(trie->locks); i++)
+		spin_lock_init(&trie->locks[i]);
+
+	atomic_set(&trie->entries, 0);
+	bpf_map_init_from_attr(&trie->map, attr);
+
+	return &trie->map;
+}
+
+static void qp_trie_free_subtree(void *root)
+{
+	struct qp_trie_branch *parent = NULL;
+	struct bpf_qp_trie_key *cur = NULL;
+	void *node = root;
+
+	/*
+	 * Depth-first deletion
+	 *
+	 * 1. find left-most key and its parent
+	 * 2. get next sibling Y from parent
+	 * (a) Y is leaf node: continue
+	 * (b) Y is branch node: goto step 1
+	 * (c) no more sibling: backtrace upwards if parent is not NULL and
+	 *     goto step 1
+	 */
+	do {
+		while (is_branch_node(node)) {
+			parent = node;
+			node = rcu_dereference_protected(parent->nodes[0], 1);
+		}
+
+		cur = to_leaf_node(node);
+		while (parent) {
+			unsigned int iip, bitmap, nr;
+			void *ancestor;
+
+			bitmap = calc_br_bitmap(parent->index, cur);
+			iip = calc_twig_index(parent->bitmap, bitmap) + 1;
+			nr = calc_twig_nr(parent->bitmap);
+
+			for (; iip < nr; iip++) {
+				kfree(cur);
+
+				node = rcu_dereference_protected(parent->nodes[iip], 1);
+				if (is_branch_node(node))
+					break;
+
+				cur = to_leaf_node(node);
+			}
+			if (iip < nr)
+				break;
+
+			ancestor = rcu_dereference_protected(parent->parent, 1);
+			kfree(parent);
+			parent = ancestor;
+		}
+	} while (parent);
+
+	kfree(cur);
+}
+
+static void qp_trie_free(struct bpf_map *map)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(trie->roots); i++) {
+		void *root = rcu_dereference_protected(trie->roots[i], 1);
+
+		if (root)
+			qp_trie_free_subtree(root);
+	}
+	bpf_map_area_free(trie);
+}
+
+static inline void qp_trie_copy_leaf(struct bpf_qp_trie_key *leaf, void *key, unsigned int key_size)
+{
+	unsigned int used = sizeof(*leaf) + leaf->len;
+
+	memcpy(key, leaf, used);
+	memset(key + used, 0, key_size - used);
+}
+
+static void qp_trie_copy_min_key_from(void *root, void *next_key, unsigned int key_size)
+{
+	void *node;
+
+	node = root;
+	while (is_branch_node(node))
+		node = rcu_dereference_raw(((struct qp_trie_branch *)node)->nodes[0]);
+
+	qp_trie_copy_leaf(to_leaf_node(node), next_key, key_size);
+}
+
+static int qp_trie_lookup_min_key(struct qp_trie *trie, unsigned int from, void *next_key)
+{
+	unsigned int i;
+
+	for (i = from; i < ARRAY_SIZE(trie->roots); i++) {
+		void *root = rcu_dereference_raw(trie->roots[i]);
+
+		if (root) {
+			qp_trie_copy_min_key_from(root, next_key, trie->map.key_size);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static int qp_trie_next_twigs_index(struct qp_trie_branch *twigs, unsigned int bitmap)
+{
+	unsigned int idx, nr, next;
+
+	/* bitmap may not be present in twigs->bitmap */
+	idx = calc_twig_index(twigs->bitmap, bitmap);
+	nr = calc_twig_nr(twigs->bitmap);
+
+	next = idx;
+	if (twigs->bitmap & bitmap)
+		next += 1;
+
+	if (next >= nr)
+		return -1;
+	return next;
+}
+
+static int qp_trie_lookup_next_node(struct qp_trie *trie, struct bpf_qp_trie_key *key,
+				    void *next_key)
+{
+	struct qp_trie_branch *parent;
+	struct bpf_qp_trie_key *found;
+	void *node, *next;
+	unsigned char c;
+
+	if (!is_valid_key_data_len(key, trie->map.key_size))
+		return -EINVAL;
+
+	/* For a nonexistent key, restart from the beginning */
+	c = key->data[0];
+	node = rcu_dereference_raw(trie->roots[c]);
+	if (!node)
+		return qp_trie_lookup_min_key(trie, 0, next_key);
+
+	parent = NULL;
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int iip, bitmap;
+
+		bitmap = calc_br_bitmap(br->index, key);
+		if (bitmap & br->bitmap)
+			iip = calc_twig_index(br->bitmap, bitmap);
+		else
+			iip = 0;
+
+		parent = br;
+		node = rcu_dereference_raw(br->nodes[iip]);
+	}
+	found = to_leaf_node(node);
+	if (!is_same_key(found, key))
+		return qp_trie_lookup_min_key(trie, 0, next_key);
+
+	/* Pair with store release in rcu_assign_pointer(*parent, twigs) to
+	 * ensure reading node->parent will not return the old parent if
+	 * the node is found by following the newly-created parent.
+	 */
+	smp_rmb();
+
+	next = NULL;
+	while (parent) {
+		unsigned int bitmap;
+		int next_idx;
+
+		bitmap = calc_br_bitmap(parent->index, key);
+		next_idx = qp_trie_next_twigs_index(parent, bitmap);
+		if (next_idx >= 0) {
+			next = rcu_dereference_raw(parent->nodes[next_idx]);
+			break;
+		}
+		parent = rcu_dereference_raw(parent->parent);
+	}
+
+	/* Goto next sub-tree */
+	if (!next)
+		return qp_trie_lookup_min_key(trie, c + 1, next_key);
+
+	if (!is_branch_node(next))
+		qp_trie_copy_leaf(to_leaf_node(next), next_key, trie->map.key_size);
+	else
+		qp_trie_copy_min_key_from(next, next_key, trie->map.key_size);
+
+	return 0;
+}
+
+static int qp_trie_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	struct bpf_qp_trie_key *cur_key = key;
+	int err;
+
+	if (!cur_key || !cur_key->len)
+		err = qp_trie_lookup_min_key(trie, 0, next_key);
+	else
+		err = qp_trie_lookup_next_node(trie, cur_key, next_key);
+	return err;
+}
+
+static void *qp_trie_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	struct bpf_qp_trie_key *trie_key = key, *found;
+	unsigned char c = trie_key->data[0];
+	void *node, *value;
+
+	node = rcu_dereference_raw(trie->roots[c]);
+	if (!node)
+		return NULL;
+
+	value = NULL;
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int bitmap;
+		unsigned int iip;
+
+		/* When the byte index equals the key length, the target key
+		 * may be in twigs->nodes[0].
+		 */
+		if (index_to_byte_index(br->index) > trie_key->len)
+			goto done;
+
+		bitmap = calc_br_bitmap(br->index, trie_key);
+		if (!(bitmap & br->bitmap))
+			goto done;
+
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = rcu_dereference_raw(br->nodes[iip]);
+	}
+
+	found = to_leaf_node(node);
+	if (is_same_key(found, trie_key))
+		value = qp_trie_leaf_value(found);
+done:
+	return value;
+}
+
+static int qp_trie_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	struct bpf_qp_trie_key *new_key = key, *leaf_key;
+	struct qp_trie_branch **parent;
+	struct qp_trie_diff d;
+	unsigned int bitmap;
+	spinlock_t *lock;
+	void **node;
+	unsigned char c;
+	bool equal;
+	int err;
+
+	if (!is_valid_key_data_len(new_key, map->key_size) || flags > BPF_EXIST)
+		return -EINVAL;
+
+	c = new_key->data[0];
+	lock = &trie->locks[c];
+	spin_lock(lock);
+	parent = (struct qp_trie_branch **)&trie->roots[c];
+	if (!*parent) {
+		if (flags == BPF_EXIST) {
+			err = -ENOENT;
+			goto unlock;
+		}
+		err = qp_trie_add_leaf_node(trie, parent, key, value);
+		goto unlock;
+	}
+
+	bitmap = 1;
+	node = &(*parent)->nodes[0];
+	while (is_branch_node(*node)) {
+		struct qp_trie_branch *br = *node;
+		unsigned int iip;
+
+		bitmap = calc_br_bitmap(br->index, new_key);
+		if (bitmap & br->bitmap)
+			iip = calc_twig_index(br->bitmap, bitmap);
+		else
+			iip = 0;
+		parent = (struct qp_trie_branch **)node;
+		node = &br->nodes[iip];
+	}
+
+	leaf_key = to_leaf_node(*node);
+	equal = calc_prefix_len(leaf_key, new_key, &d.index);
+	if (equal) {
+		if (flags == BPF_NOEXIST) {
+			err = -EEXIST;
+			goto unlock;
+		}
+		err = qp_trie_rep_leaf_node(trie, parent, key, value, leaf_key, bitmap);
+		goto unlock;
+	}
+
+	d.sibling_bm = calc_br_bitmap(d.index, leaf_key);
+	d.new_bm = calc_br_bitmap(d.index, new_key);
+
+	bitmap = 1;
+	parent = (struct qp_trie_branch **)&trie->roots[c];
+	node = &(*parent)->nodes[0];
+	while (is_branch_node(*node)) {
+		struct qp_trie_branch *br = *node;
+		unsigned int iip;
+
+		if (d.index < br->index)
+			goto new_branch;
+
+		parent = (struct qp_trie_branch **)node;
+		if (d.index == br->index) {
+			if (flags == BPF_EXIST) {
+				err = -ENOENT;
+				goto unlock;
+			}
+			err = qp_trie_ext_branch(trie, parent, key, value, d.new_bm);
+			goto unlock;
+		}
+
+		bitmap = calc_br_bitmap(br->index, new_key);
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = &br->nodes[iip];
+	}
+
+new_branch:
+	if (flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto unlock;
+	}
+	err = qp_trie_new_branch(trie, parent, bitmap, *node, &d, key, value);
+unlock:
+	spin_unlock(lock);
+	return err;
+}
+
+static int qp_trie_delete_elem(struct bpf_map *map, void *key)
+{
+	struct qp_trie *trie = container_of(map, struct qp_trie, map);
+	struct qp_trie_branch **parent, **grand_parent;
+	unsigned int bitmap = 1, parent_bitmap = 1, nr;
+	struct bpf_qp_trie_key *trie_key = key, *found;
+	spinlock_t *lock;
+	void **node;
+	unsigned char c;
+	int err;
+
+	if (!is_valid_key_data_len(trie_key, map->key_size))
+		return -EINVAL;
+
+	err = -ENOENT;
+	grand_parent = NULL;
+	c = trie_key->data[0];
+	lock = &trie->locks[c];
+	spin_lock(lock);
+	parent = (struct qp_trie_branch **)&trie->roots[c];
+	if (!*parent)
+		goto unlock;
+
+	node = &(*parent)->nodes[0];
+	while (is_branch_node(*node)) {
+		struct qp_trie_branch *br = *node;
+		unsigned int iip;
+
+		if (index_to_byte_index(br->index) > trie_key->len)
+			goto unlock;
+
+		parent_bitmap = bitmap;
+		bitmap = calc_br_bitmap(br->index, trie_key);
+		if (!(bitmap & br->bitmap))
+			goto unlock;
+
+		grand_parent = parent;
+		parent = (struct qp_trie_branch **)node;
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = &br->nodes[iip];
+	}
+
+	found = to_leaf_node(*node);
+	if (!is_same_key(found, trie_key))
+		goto unlock;
+
+	nr = calc_twig_nr((*parent)->bitmap);
+	if (nr != 2)
+		err = qp_trie_remove_leaf(trie, parent, bitmap, found);
+	else
+		err = qp_trie_merge_node(trie, grand_parent, *parent, parent_bitmap, bitmap);
+unlock:
+	spin_unlock(lock);
+	return err;
+}
+
+static int qp_trie_check_btf(const struct bpf_map *map,
+			     const struct btf *btf,
+			     const struct btf_type *key_type,
+			     const struct btf_type *value_type)
+{
+	/* TODO: key_type embeds bpf_qp_trie_key as its first member */
+	return 0;
+}
+
+BTF_ID_LIST_SINGLE(qp_trie_map_btf_ids, struct, qp_trie)
+const struct bpf_map_ops qp_trie_map_ops = {
+	.map_alloc_check = qp_trie_alloc_check,
+	.map_alloc = qp_trie_alloc,
+	.map_free = qp_trie_free,
+	.map_get_next_key = qp_trie_get_next_key,
+	.map_lookup_elem = qp_trie_lookup_elem,
+	.map_update_elem = qp_trie_update_elem,
+	.map_delete_elem = qp_trie_delete_elem,
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_check_btf = qp_trie_check_btf,
+	.map_btf_id = &qp_trie_map_btf_ids[0],
+};
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ffcbf79a556b..39a65bf0d9f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -82,6 +82,13 @@ struct bpf_lpm_trie_key {
 	__u8	data[0];	/* Arbitrary size */
 };
 
+struct bpf_qp_trie_key {
+	/* the length of blob data */
+	__u32 len;
+	/* blob data */
+	__u8 data[0];
+};
+
 struct bpf_cgroup_storage_key {
 	__u64	cgroup_inode_id;	/* cgroup inode id */
 	__u32	attach_type;		/* program attach type (enum bpf_attach_type) */
@@ -909,6 +916,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_QP_TRIE,
 };
 
 /* Note that tracing related programs such as
-- 
2.29.2



* [RFC PATCH bpf-next 2/3] selftests/bpf: add a simple test for qp-trie
  2022-07-26 13:00 [RFC PATCH bpf-next 0/3] Add support for qp-trie map Hou Tao
  2022-07-26 13:00 ` [RFC PATCH bpf-next 1/3] bpf: " Hou Tao
@ 2022-07-26 13:00 ` Hou Tao
  2022-07-26 13:00 ` [RFC PATCH bpf-next 3/3] selftests/bpf: add benchmark for qp-trie map Hou Tao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2022-07-26 13:00 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov, bpf
  Cc: Daniel Borkmann, Martin KaFai Lau, Yonghong Song, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, John Fastabend, houtao1

Add a test to demonstrate that qp-trie doesn't care about the unused
part of the key.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/str_key.c        | 69 +++++++++++++++
 tools/testing/selftests/bpf/progs/str_key.c   | 85 +++++++++++++++++++
 2 files changed, 154 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
 create mode 100644 tools/testing/selftests/bpf/progs/str_key.c

diff --git a/tools/testing/selftests/bpf/prog_tests/str_key.c b/tools/testing/selftests/bpf/prog_tests/str_key.c
new file mode 100644
index 000000000000..8f134dd45902
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/str_key.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <test_progs.h>
+#include "str_key.skel.h"
+
+#define FILE_PATH_SIZE 64
+
+struct file_path_str {
+	unsigned int len;
+	char raw[FILE_PATH_SIZE];
+};
+
+static int setup_maps(struct str_key *skel, const char *name, unsigned int value)
+{
+	struct file_path_str key;
+	int fd, err;
+
+	memset(&key, 0, sizeof(key));
+	strncpy(key.raw, name, sizeof(key.raw) - 1);
+	key.len = strlen(name) + 1;
+
+	fd = bpf_map__fd(skel->maps.trie);
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "trie add"))
+		return -EINVAL;
+
+	fd = bpf_map__fd(skel->maps.htab);
+	err = bpf_map_update_elem(fd, key.raw, &value, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "htab add"))
+		return -EINVAL;
+
+	return 0;
+}
+
+void test_str_key(void)
+{
+	const char *name = "/tmp/str_key_test";
+	struct str_key *skel;
+	unsigned int value;
+	int err, fd;
+
+	skel = str_key__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_load str key"))
+		return;
+
+	value = time(NULL);
+	if (setup_maps(skel, name, value))
+		goto out;
+
+	skel->bss->pid = getpid();
+	err = str_key__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto out;
+
+	fd = open(name, O_RDONLY | O_CREAT, 0644);
+	if (!ASSERT_GE(fd, 0, "open tmp file"))
+		goto out;
+	close(fd);
+	unlink(name);
+
+	ASSERT_EQ(skel->bss->trie_value, value, "trie lookup str");
+	ASSERT_EQ(skel->bss->htab_value, -1, "htab lookup str");
+out:
+	str_key__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/str_key.c b/tools/testing/selftests/bpf/progs/str_key.c
new file mode 100644
index 000000000000..2e8becdd58c6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/str_key.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct path {
+} __attribute__((preserve_access_index));
+
+struct file {
+	struct path f_path;
+} __attribute__((preserve_access_index));
+
+#define FILE_PATH_SIZE 64
+
+struct file_path_str {
+	unsigned int len;
+	char raw[FILE_PATH_SIZE];
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct file_path_str);
+} array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_QP_TRIE);
+	__uint(max_entries, 1);
+	__type(key, struct file_path_str);
+	__type(value, __u32);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} trie SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__uint(key_size, FILE_PATH_SIZE);
+	__uint(value_size, sizeof(__u32));
+} htab SEC(".maps");
+
+int pid = 0;
+unsigned int trie_value = 0;
+unsigned int htab_value = 0;
+
+SEC("fentry/filp_close")
+int BPF_PROG(filp_close, struct file *filp)
+{
+	struct path *p = &filp->f_path;
+	struct file_path_str *str;
+	unsigned int *value;
+	int idx, len;
+
+	if (bpf_get_current_pid_tgid() >> 32 != pid)
+		return 0;
+
+	idx = 0;
+	str = bpf_map_lookup_elem(&array, &idx);
+	if (!str)
+		return 0;
+
+	len = bpf_d_path(p, str->raw, sizeof(str->raw));
+	if (len < 0 || len > sizeof(str->raw))
+		return 0;
+
+	str->len = len;
+	value = bpf_map_lookup_elem(&trie, str);
+	if (value)
+		trie_value = *value;
+	else
+		trie_value = -1;
+
+	value = bpf_map_lookup_elem(&htab, str->raw);
+	if (value)
+		htab_value = *value;
+	else
+		htab_value = -1;
+
+	return 0;
+}
-- 
2.29.2



* [RFC PATCH bpf-next 3/3] selftests/bpf: add benchmark for qp-trie map
  2022-07-26 13:00 [RFC PATCH bpf-next 0/3] Add support for qp-trie map Hou Tao
  2022-07-26 13:00 ` [RFC PATCH bpf-next 1/3] bpf: " Hou Tao
  2022-07-26 13:00 ` [RFC PATCH bpf-next 2/3] selftests/bpf: add a simple test for qp-trie Hou Tao
@ 2022-07-26 13:00 ` Hou Tao
  2022-08-02  6:20 ` [RFC PATCH bpf-next 0/3] Add support " Hou Tao
  2022-08-02 22:38 ` Andrii Nakryiko
  4 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2022-07-26 13:00 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov, bpf
  Cc: Daniel Borkmann, Martin KaFai Lau, Yonghong Song, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, John Fastabend, houtao1

Add a benchmark for the qp-trie map to compare its lookup, update/delete
performance and memory usage with the hash table.

When jhash() of the bpf htab creates too many hash collisions, qp-trie
lookup or update may perform better than htab, as shown below:

Randomly-generated binary data (length range: [1, 256], entries=16K)
htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
htab update      (1  thread)    2.404 ± 0.012M/s (drops 0.000 ± 0.000M/s mem 8.117 MiB)
htab update      (2  thread)    3.874 ± 0.010M/s (drops 0.000 ± 0.000M/s mem 8.106 MiB)
htab update      (4  thread)    5.895 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 8.118 MiB)
htab update      (8  thread)    7.000 ± 0.088M/s (drops 0.000 ± 0.000M/s mem 8.116 MiB)
htab update      (16 thread)    7.076 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 8.120 MiB)
qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
qp-trie update   (1  thread)    1.621 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 3.939 MiB)
qp-trie update   (2  thread)    2.615 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 3.914 MiB)
qp-trie update   (4  thread)    4.017 ± 0.070M/s (drops 0.000 ± 0.000M/s mem 3.585 MiB)
qp-trie update   (8  thread)    6.857 ± 0.146M/s (drops 0.000 ± 0.000M/s mem 4.490 MiB)
qp-trie update   (16 thread)    9.871 ± 0.527M/s (drops 0.000 ± 0.000M/s mem 4.209 MiB)

But for the strings in /proc/kallsyms, the lookup and update/delete
performance of qp-trie is 40% or more slower compared with the hash table:

Strings in /proc/kallsyms (entries=186898)
htab lookup      (1  thread)    6.684 ± 0.005M/s (drops 0.458 ± 0.002M/s mem 33.614 MiB)
htab lookup      (2  thread)   12.644 ± 0.006M/s (drops 0.867 ± 0.003M/s mem 33.563 MiB)
htab lookup      (4  thread)   22.505 ± 0.012M/s (drops 1.542 ± 0.007M/s mem 33.565 MiB)
htab lookup      (8  thread)   45.302 ± 0.048M/s (drops 3.105 ± 0.015M/s mem 33.599 MiB)
htab lookup      (16 thread)   90.828 ± 0.069M/s (drops 6.225 ± 0.021M/s mem 33.564 MiB)
htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
qp-trie lookup   (1  thread)    4.122 ± 0.004M/s (drops 0.282 ± 0.001M/s mem 18.579 MiB)
qp-trie lookup   (2  thread)    8.280 ± 0.011M/s (drops 0.568 ± 0.004M/s mem 18.388 MiB)
qp-trie lookup   (4  thread)   15.927 ± 0.013M/s (drops 1.092 ± 0.007M/s mem 18.613 MiB)
qp-trie lookup   (8  thread)   31.412 ± 0.026M/s (drops 2.153 ± 0.008M/s mem 18.475 MiB)
qp-trie lookup   (16 thread)   63.807 ± 0.071M/s (drops 4.375 ± 0.025M/s mem 18.276 MiB)
qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)

For strings in the BTF string section, the result is similar, except
that the memory saving of qp-trie decreases:

Sorted strings in BTF string sections (entries=115980)
htab lookup      (1  thread)    9.792 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
htab lookup      (2  thread)   19.581 ± 0.008M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
htab lookup      (4  thread)   36.652 ± 0.022M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
htab lookup      (8  thread)   73.799 ± 0.053M/s (drops 0.000 ± 0.000M/s mem 15.068 MiB)
htab lookup      (16 thread)  146.957 ± 0.054M/s (drops 0.000 ± 0.000M/s mem 15.067 MiB)
htab update      (1  thread)    3.497 ± 0.051M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
htab update      (2  thread)    4.976 ± 0.039M/s (drops 0.000 ± 0.000M/s mem 15.067 MiB)
htab update      (4  thread)    6.550 ± 0.025M/s (drops 0.000 ± 0.000M/s mem 15.068 MiB)
htab update      (8  thread)    6.821 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 15.067 MiB)
htab update      (16 thread)    7.163 ± 0.057M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
qp-trie lookup   (1  thread)    6.144 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 11.361 MiB)
qp-trie lookup   (2  thread)   12.461 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 11.031 MiB)
qp-trie lookup   (4  thread)   25.098 ± 0.008M/s (drops 0.000 ± 0.000M/s mem 11.082 MiB)
qp-trie lookup   (8  thread)   49.734 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 11.187 MiB)
qp-trie lookup   (16 thread)   97.829 ± 0.033M/s (drops 0.000 ± 0.000M/s mem 10.803 MiB)
qp-trie update   (1  thread)    1.473 ± 0.112M/s (drops 0.000 ± 0.000M/s mem 10.848 MiB)
qp-trie update   (2  thread)    2.649 ± 0.033M/s (drops 0.000 ± 0.000M/s mem 10.817 MiB)
qp-trie update   (4  thread)    5.129 ± 0.071M/s (drops 0.000 ± 0.000M/s mem 10.901 MiB)
qp-trie update   (8  thread)    9.191 ± 0.049M/s (drops 0.000 ± 0.000M/s mem 10.753 MiB)
qp-trie update   (16 thread)   11.095 ± 0.167M/s (drops 0.000 ± 0.000M/s mem 10.918 MiB)

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |  10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      | 499 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |  55 ++
 .../selftests/bpf/progs/qp_trie_bench.c       | 218 ++++++++
 5 files changed, 786 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 1d1a9f15c140..67a1a79f460d 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -575,11 +575,13 @@ $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
 $(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
 $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
+$(OUTPUT)/bench_qp_trie.o: $(OUTPUT)/qp_trie_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(TESTING_HELPERS) \
 		 $(TRACE_HELPERS) \
+		 $(CGROUP_HELPERS) \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
@@ -589,7 +591,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_strncmp.o \
 		 $(OUTPUT)/bench_bpf_hashmap_full_update.o \
 		 $(OUTPUT)/bench_local_storage.o \
-		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o
+		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
+		 $(OUTPUT)/bench_qp_trie.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index c1f20a147462..618f45fbe6e2 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -275,6 +275,7 @@ extern struct argp bench_bpf_loop_argp;
 extern struct argp bench_local_storage_argp;
 extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
+extern struct argp bench_qp_trie_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -284,6 +285,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{ &bench_local_storage_rcu_tasks_trace_argp, 0,
 		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
+	{ &bench_qp_trie_argp, 0, "qp-trie benchmark", 0 },
 	{},
 };
 
@@ -490,6 +492,10 @@ extern const struct bench bench_local_storage_cache_seq_get;
 extern const struct bench bench_local_storage_cache_interleaved_get;
 extern const struct bench bench_local_storage_cache_hashmap_control;
 extern const struct bench bench_local_storage_tasks_trace;
+extern const struct bench bench_htab_lookup;
+extern const struct bench bench_qp_trie_lookup;
+extern const struct bench bench_htab_update;
+extern const struct bench bench_qp_trie_update;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -529,6 +535,10 @@ static const struct bench *benchs[] = {
 	&bench_local_storage_cache_interleaved_get,
 	&bench_local_storage_cache_hashmap_control,
 	&bench_local_storage_tasks_trace,
+	&bench_htab_lookup,
+	&bench_qp_trie_lookup,
+	&bench_htab_update,
+	&bench_qp_trie_update,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_qp_trie.c b/tools/testing/selftests/bpf/benchs/bench_qp_trie.c
new file mode 100644
index 000000000000..35359e5e442b
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_qp_trie.c
@@ -0,0 +1,499 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <argp.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include "bench.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#include "qp_trie_bench.skel.h"
+
+enum {
+	FOR_HTAB = 0,
+	FOR_TRIE,
+};
+
+static struct qp_trie_ctx {
+	struct qp_trie_bench *skel;
+	int cgrp_dfd;
+	u64 map_slab_mem;
+} ctx;
+
+static struct {
+	const char *file;
+	__u32 entries;
+} args;
+
+struct run_stat {
+	__u64 stats[2];
+};
+
+enum {
+	ARG_DATA_FILE = 8001,
+	ARG_DATA_ENTRIES = 8002,
+};
+
+static const struct argp_option opts[] = {
+	{ "file", ARG_DATA_FILE, "DATA-FILE", 0, "Set data file" },
+	{ "entries", ARG_DATA_ENTRIES, "DATA-ENTRIES", 0, "Set data entries" },
+	{},
+};
+
+static error_t qp_trie_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_DATA_FILE:
+		args.file = strdup(arg);
+		break;
+	case ARG_DATA_ENTRIES:
+		args.entries = strtoul(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_qp_trie_argp = {
+	.options = opts,
+	.parser = qp_trie_parse_arg,
+};
+
+static int parse_data_set(const char *name, struct bpf_qp_trie_key ***set, unsigned int *nr,
+			  unsigned int *max_len)
+{
+#define INT_MAX_DATA_SIZE 1024
+	unsigned int i, nr_items, item_max_len;
+	char line[INT_MAX_DATA_SIZE + 1];
+	struct bpf_qp_trie_key **items;
+	struct bpf_qp_trie_key *cur;
+	int err = 0;
+	FILE *file;
+	char *got;
+
+	file = fopen(name, "rb");
+	if (!file) {
+		fprintf(stderr, "open %s err %s\n", name, strerror(errno));
+		return -1;
+	}
+
+	got = fgets(line, sizeof(line), file);
+	if (!got) {
+		fprintf(stderr, "empty file ?\n");
+		err = -1;
+		goto out;
+	}
+	if (sscanf(line, "%u", &nr_items) != 1) {
+		fprintf(stderr, "the first line must be the number of items\n");
+		err = -1;
+		goto out;
+	}
+
+	fprintf(stdout, "item %u\n", nr_items);
+
+	items = (struct bpf_qp_trie_key **)calloc(nr_items, sizeof(*items) + INT_MAX_DATA_SIZE);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	i = 0;
+	item_max_len = 0;
+	cur = (void *)items + sizeof(*items) * nr_items;
+	while (true) {
+		unsigned int len;
+
+		got = fgets(line, sizeof(line), file);
+		if (!got) {
+			if (!feof(file)) {
+				fprintf(stderr, "read file %s error\n", name);
+				err = -1;
+			}
+			break;
+		}
+
+		len = strlen(got);
+		if (len && got[len - 1] == '\n') {
+			got[len - 1] = 0;
+			len -= 1;
+		}
+		if (!len) {
+			fprintf(stdout, "#%u empty line\n", i + 2);
+			continue;
+		}
+
+		if (i >= nr_items) {
+			fprintf(stderr, "too many line in %s\n", name);
+			break;
+		}
+
+		if (len > item_max_len)
+			item_max_len = len;
+		cur->len = len;
+		memcpy(cur->data, got, len);
+		items[i++] = cur;
+		cur = (void *)cur + INT_MAX_DATA_SIZE;
+	}
+
+	if (!err) {
+		if (i != nr_items)
+			fprintf(stdout, "few lines in %s (exp %u got %u)\n", name, nr_items, i);
+		*nr = i;
+		*set = items;
+		*max_len = item_max_len;
+	} else {
+		free(items);
+	}
+
+out:
+	fclose(file);
+	return err;
+}
+
+static int gen_data_set(struct bpf_qp_trie_key ***set, unsigned int *nr, unsigned int *max_len)
+{
+#define RND_MAX_DATA_SIZE 256
+	struct bpf_qp_trie_key **items;
+	struct bpf_qp_trie_key *cur;
+	size_t ptr_size, data_size;
+	unsigned int i, nr_items;
+	ssize_t got;
+	int err = 0;
+
+	ptr_size = *nr * sizeof(*items);
+	data_size = *nr * (sizeof(*cur) + RND_MAX_DATA_SIZE);
+	items = (struct bpf_qp_trie_key **)malloc(ptr_size + data_size);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	cur = (void *)items + ptr_size;
+	got = syscall(__NR_getrandom, cur, data_size, 0);
+	if (got != data_size) {
+		fprintf(stderr, "getrandom error %s\n", strerror(errno));
+		err = -1;
+		goto out;
+	}
+
+	nr_items = 0;
+	for (i = 0; i < *nr; i++) {
+		cur->len &= 0xff;
+		if (cur->len) {
+			items[nr_items++] = cur;
+			memset(cur->data + cur->len, 0, RND_MAX_DATA_SIZE - cur->len);
+		}
+		cur = (void *)cur + (sizeof(*cur) + RND_MAX_DATA_SIZE);
+	}
+	if (!nr_items) {
+		fprintf(stderr, "no valid key in random data\n");
+		err = -1;
+		goto out;
+	}
+	fprintf(stdout, "generate %u random keys\n", nr_items);
+
+	*nr = nr_items;
+	*set = items;
+	*max_len = RND_MAX_DATA_SIZE;
+out:
+	if (err && items)
+		free(items);
+	return err;
+}
+
+static void qp_trie_validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "qp_trie_map benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (!args.file && !args.entries) {
+		fprintf(stderr, "must specify entries when use random generated data set\n");
+		exit(1);
+	}
+
+	if (args.file && access(args.file, R_OK)) {
+		fprintf(stderr, "data file is un-accessible\n");
+		exit(1);
+	}
+}
+
+static void qp_trie_init_map_opts(struct qp_trie_bench *skel, unsigned int data_size,
+				  unsigned int nr)
+{
+	unsigned int key_size = data_size + sizeof(struct bpf_qp_trie_key);
+
+	bpf_map__set_value_size(skel->maps.htab_array, data_size);
+	bpf_map__set_max_entries(skel->maps.htab_array, nr);
+
+	bpf_map__set_key_size(skel->maps.htab, data_size);
+	bpf_map__set_max_entries(skel->maps.htab, nr);
+
+	bpf_map__set_value_size(skel->maps.trie_array, key_size);
+	bpf_map__set_max_entries(skel->maps.trie_array, nr);
+
+	bpf_map__set_key_size(skel->maps.qp_trie, key_size);
+	bpf_map__set_max_entries(skel->maps.qp_trie, nr);
+}
+
+static void qp_trie_setup_key_map(struct bpf_map *map, unsigned int map_type,
+				  struct bpf_qp_trie_key **set, unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		void *value;
+		int err;
+
+		value = (map_type != FOR_HTAB) ? (void *)set[i] : (void *)set[i]->data;
+		err = bpf_map_update_elem(fd, &i, value, 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static u64 qp_trie_get_slab_mem(int dfd)
+{
+	const char *magic = "slab ";
+	const char *name = "memory.stat";
+	int fd;
+	ssize_t nr;
+	char buf[4096];
+	char *from;
+
+	fd = openat(dfd, name, 0);
+	if (fd < 0) {
+		fprintf(stderr, "no %s\n", name);
+		exit(1);
+	}
+
+	nr = read(fd, buf, sizeof(buf));
+	if (nr <= 0) {
+		fprintf(stderr, "empty %s ?\n", name);
+		exit(1);
+	}
+	buf[nr - 1] = 0;
+
+	close(fd);
+
+	from = strstr(buf, magic);
+	if (!from) {
+		fprintf(stderr, "no slab in %s\n", name);
+		exit(1);
+	}
+
+	return strtoull(from + strlen(magic), NULL, 10);
+}
+
+static void qp_trie_setup_lookup_map(struct bpf_map *map, unsigned int map_type,
+				     struct bpf_qp_trie_key **set, unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		void *key;
+		int err;
+
+		key = (map_type != FOR_HTAB) ? (void *)set[i] : (void *)set[i]->data;
+		err = bpf_map_update_elem(fd, key, &i, 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static void qp_trie_setup(unsigned int map_type)
+{
+	struct bpf_qp_trie_key **set = NULL;
+	struct qp_trie_bench *skel;
+	unsigned int nr = 0, max_len = 0;
+	struct bpf_map *map;
+	u64 before, after;
+	int dfd;
+	int err;
+
+	if (!args.file) {
+		nr = args.entries;
+		err = gen_data_set(&set, &nr, &max_len);
+	} else {
+		err = parse_data_set(args.file, &set, &nr, &max_len);
+	}
+	if (err < 0)
+		exit(1);
+
+	if (args.entries && args.entries < nr)
+		nr = args.entries;
+
+	dfd = cgroup_setup_and_join("/qp_trie");
+	if (dfd < 0) {
+		fprintf(stderr, "failed to setup cgroup env\n");
+		exit(1);
+	}
+
+	setup_libbpf();
+
+	before = qp_trie_get_slab_mem(dfd);
+
+	skel = qp_trie_bench__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	qp_trie_init_map_opts(skel, max_len, nr);
+
+	skel->bss->update_nr = nr;
+	skel->bss->update_chunk = nr / env.producer_cnt;
+
+	err = qp_trie_bench__load(skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	map = (map_type == FOR_HTAB) ? skel->maps.htab_array : skel->maps.trie_array;
+	qp_trie_setup_key_map(map, map_type, set, nr);
+
+	map = (map_type == FOR_HTAB) ? skel->maps.htab : skel->maps.qp_trie;
+	qp_trie_setup_lookup_map(map, map_type, set, nr);
+
+	after = qp_trie_get_slab_mem(dfd);
+
+	ctx.skel = skel;
+	ctx.cgrp_dfd = dfd;
+	ctx.map_slab_mem = after - before;
+}
+
+static void qp_trie_attach_prog(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+
+	link = bpf_program__attach(prog);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void htab_lookup_setup(void)
+{
+	qp_trie_setup(FOR_HTAB);
+	qp_trie_attach_prog(ctx.skel->progs.htab_lookup);
+}
+
+static void qp_trie_lookup_setup(void)
+{
+	qp_trie_setup(FOR_TRIE);
+	qp_trie_attach_prog(ctx.skel->progs.qp_trie_lookup);
+}
+
+static void htab_update_setup(void)
+{
+	qp_trie_setup(FOR_HTAB);
+	qp_trie_attach_prog(ctx.skel->progs.htab_update);
+}
+
+static void qp_trie_update_setup(void)
+{
+	qp_trie_setup(FOR_TRIE);
+	qp_trie_attach_prog(ctx.skel->progs.qp_trie_update);
+}
+
+static void *qp_trie_producer(void *ctx)
+{
+	while (true)
+		(void)syscall(__NR_getpgid);
+	return NULL;
+}
+
+static void *qp_trie_consumer(void *ctx)
+{
+	return NULL;
+}
+
+static void qp_trie_measure(struct bench_res *res)
+{
+	static __u64 last_hits, last_drops;
+	__u64 total_hits = 0, total_drops = 0;
+	unsigned int i, nr_cpus;
+
+	nr_cpus = bpf_num_possible_cpus();
+	for (i = 0; i < nr_cpus; i++) {
+		struct run_stat *s = (void *)&ctx.skel->bss->percpu_stats[i & 255];
+
+		total_hits += s->stats[0];
+		total_drops += s->stats[1];
+	}
+
+	res->hits = total_hits - last_hits;
+	res->drops = total_drops - last_drops;
+
+	last_hits = total_hits;
+	last_drops = total_drops;
+}
+
+static void qp_trie_report_final(struct bench_res res[], int res_cnt)
+{
+	close(ctx.cgrp_dfd);
+	cleanup_cgroup_environment();
+
+	fprintf(stdout, "Slab: %.3f MiB\n", (float)ctx.map_slab_mem / 1024 / 1024);
+	hits_drops_report_final(res, res_cnt);
+}
+
+const struct bench bench_htab_lookup = {
+	.name = "htab-lookup",
+	.validate = qp_trie_validate,
+	.setup = htab_lookup_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_qp_trie_lookup = {
+	.name = "qp-trie-lookup",
+	.validate = qp_trie_validate,
+	.setup = qp_trie_lookup_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_htab_update = {
+	.name = "htab-update",
+	.validate = qp_trie_validate,
+	.setup = htab_update_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
+
+const struct bench bench_qp_trie_update = {
+	.name = "qp-trie-update",
+	.validate = qp_trie_validate,
+	.setup = qp_trie_update_setup,
+	.producer_thread = qp_trie_producer,
+	.consumer_thread = qp_trie_consumer,
+	.measure = qp_trie_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = qp_trie_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh b/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
new file mode 100755
index 000000000000..0cbcb5bc9292
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2022. Huawei Technologies Co., Ltd
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+mem()
+{
+	echo "$*" | sed -E "s/.*Slab: ([0-9]+\.[0-9]+ MiB).*/\1/"
+}
+
+run_qp_trie_bench()
+{
+	local title=$1
+	local summary
+
+	shift 1
+	summary=$($RUN_BENCH "$@" | grep "Summary\|Slab:")
+	printf "%s %20s (drops %-16s mem %s)\n" "$title" "$(hits $summary)" \
+		"$(drops $summary)" "$(mem $summary)"
+}
+
+run_qp_trie_benchs()
+{
+	local p
+	local m
+	local b
+	local title
+
+	for m in htab qp-trie
+	do
+		for b in lookup update
+		do
+			for p in 1 2 4 8 16
+			do
+				title=$(printf "%-16s (%-2d thread)" "$m $b" $p)
+				run_qp_trie_bench "$title" ${m}-${b} -p $p "$@"
+			done
+		done
+	done
+	echo
+}
+
+echo "Randomly-generated binary data (16K)"
+run_qp_trie_benchs --entries 16384
+
+echo "Strings in /proc/kallsyms"
+TMP_FILE=/tmp/kallsyms.txt
+SRC_FILE=/proc/kallsyms
+trap 'rm -f $TMP_FILE' EXIT
+wc -l $SRC_FILE | awk '{ print $1}' > $TMP_FILE
+awk '{ print $3 }' $SRC_FILE >> $TMP_FILE
+run_qp_trie_benchs --file $TMP_FILE
diff --git a/tools/testing/selftests/bpf/progs/qp_trie_bench.c b/tools/testing/selftests/bpf/progs/qp_trie_bench.c
new file mode 100644
index 000000000000..21202c578105
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/qp_trie_bench.c
@@ -0,0 +1,218 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <linux/errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+struct bpf_map;
+
+/* value_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} htab_array SEC(".maps");
+
+/* value_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} trie_array SEC(".maps");
+
+/* key_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(value_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
+
+/* key_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_QP_TRIE);
+	__uint(value_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} qp_trie SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__u64 stats[2];
+} __attribute__((__aligned__(128))) percpu_stats[256];
+
+struct update_ctx {
+	unsigned int max;
+	unsigned int from;
+};
+
+unsigned int update_nr;
+unsigned int update_chunk;
+
+static __always_inline void update_stats(int idx)
+{
+	__u32 cpu = bpf_get_smp_processor_id();
+
+	percpu_stats[cpu & 255].stats[idx]++;
+}
+
+static int lookup_htab(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	__u32 *index;
+
+	index = bpf_map_lookup_elem(&htab, value);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int update_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&htab_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_update_elem(&htab, value, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&htab_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_delete_elem(&htab, value);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+static int lookup_qp_trie(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	__u32 *index;
+
+	index = bpf_map_lookup_elem(&qp_trie, value);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int update_qp_trie_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&trie_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_update_elem(&qp_trie, value, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_qp_trie_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&trie_array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_delete_elem(&qp_trie, value);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int htab_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&htab_array, lookup_htab, NULL, 0);
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int qp_trie_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&trie_array, lookup_qp_trie, NULL, 0);
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int htab_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate part of keys according to cpu id */
+	update.from = index;
+	bpf_loop(update_chunk, update_htab_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_htab_loop, &update, 0);
+
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int qp_trie_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate part of keys according to cpu id */
+	update.from = index;
+	bpf_loop(update_chunk, update_qp_trie_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_qp_trie_loop, &update, 0);
+
+	return 0;
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH bpf-next 0/3] Add support for qp-trie map
  2022-07-26 13:00 [RFC PATCH bpf-next 0/3] Add support for qp-trie map Hou Tao
                   ` (2 preceding siblings ...)
  2022-07-26 13:00 ` [RFC PATCH bpf-next 3/3] selftests/bpf: add benchmark for qp-trie map Hou Tao
@ 2022-08-02  6:20 ` Hou Tao
  2022-08-02 22:38 ` Andrii Nakryiko
  4 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2022-08-02  6:20 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov, bpf
  Cc: Daniel Borkmann, Martin KaFai Lau, Yonghong Song, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, John Fastabend, Andrii Nakryiko

ping ?

On 7/26/2022 9:00 PM, Hou Tao wrote:
> Hi,
>
> The initial motivation for qp-trie map is to reduce memory usage for
> string keys special those with large differencies in length as
> discussed in [0]. And as a big-endian lexicographical ordered map, it
> can also be used for any binary data with fixed or variable length.
>
> Now the basic functionality of qp-trie is ready, so posting a RFC version
> to get more feedback or suggestions about qp-trie. Specially feedback
> about the following questions:
>
> (1) Application scenario for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be the speed up of lookup operations due to lower tree
> depth of qp-trie. Maybe the performance of update could also be improved
> although in cillium there is a big lock during lpm-trie update [1]. Is
> there any other use cases for qp-trie ? Specially those cases which need
> both ordering and memory efficiency or cases in which jhash() of htab
> creates too much collisions and qp-trie lookup performances better than
> hash-table lookup as shown below:
>
>   Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
>
>   htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
>   htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
>   htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
>   htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
>   htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
>   
>   qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
>   qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
>   qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
>   qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
>   qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
>
>   * non-zero drops is due to duplicated keys in generated keys.
>
> (2) more fine-grained lock in qp-trie
> Now qp-trie is divided into 256 sub-trees by using the first character of
> key and one sub-tree is protected one spinlock. From the data below,
> although the update/delete speed of qp-trie is slower compare with hash
> table, but it scales similar with hash table. So maybe 256-locks is a
> good enough solution ?
>
>   Strings in /proc/kallsyms
>   htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>   htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
>   htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
>   htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
>   htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>   qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
>   qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
>   qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
>   qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
>   qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
>
> (3) Improve memory efficiency further
> When using strings in BTF string section as a data set for qp-trie, the
> slab memory usage showed in cgroup memory.stats file is about 11MB for
> qp-trie and 15MB for hash table as shown below. However the theoretical
> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu"
> fields from qp_trie_branch) and the extra memory usage (about 38% of total
> usage) mainly comes from internal fragment in slab (namely 2^n alignment
> for allocation) and overhead in kmem-cgroup accounting. We can reduce the
> internal fragment by creating separated kmem_cache for qp_trie_branch with
> different child nodes, but not sure whether it is worthy or not.
>
> And in order to prevent allocating a rcu_head for each leaf node, now only
> branch node is RCU-freed, so when replacing a leaf node, a new branch node
> and a new leaf node will be allocated instead of replacing the old leaf
> node and RCU-freed the old leaf node. Also not sure whether or not it is
> worthy.
>
>   Strings in BTF string section (entries=115980):
>   htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
>   qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
>
>   All files under linux kernel source directory (entries=74359):
>   htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
>   qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
>
>   Domain names for Alexa top million web site (entries=1000000):
>   htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
>   qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
>
> Comments and suggestions are always welcome.
>
> Regards,
> Tao
>
> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
>
> Hou Tao (3):
>   bpf: Add support for qp-trie map
>   selftests/bpf: add a simple test for qp-trie
>   selftests/bpf: add benchmark for qp-trie map
>
>  include/linux/bpf_types.h                     |    1 +
>  include/uapi/linux/bpf.h                      |    8 +
>  kernel/bpf/Makefile                           |    1 +
>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
>  tools/include/uapi/linux/bpf.h                |    8 +
>  tools/testing/selftests/bpf/Makefile          |    5 +-
>  tools/testing/selftests/bpf/bench.c           |   10 +
>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
>  12 files changed, 2022 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/bpf_qp_trie.c
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH bpf-next 0/3] Add support for qp-trie map
  2022-07-26 13:00 [RFC PATCH bpf-next 0/3] Add support for qp-trie map Hou Tao
                   ` (3 preceding siblings ...)
  2022-08-02  6:20 ` [RFC PATCH bpf-next 0/3] Add support " Hou Tao
@ 2022-08-02 22:38 ` Andrii Nakryiko
  2022-08-08 17:54   ` Toke Høiland-Jørgensen
  2022-08-09  8:25   ` houtao
  4 siblings, 2 replies; 9+ messages in thread
From: Andrii Nakryiko @ 2022-08-02 22:38 UTC (permalink / raw)
  To: Hou Tao
  Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, Daniel Borkmann,
	Martin KaFai Lau, Yonghong Song, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, John Fastabend, Paul E . McKenney, Joanne Koong

On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>

Hey, sorry, I'm catching up on upstream and there are just too many
complicated patch sets coming in, so it takes time to get through
them.

I think you did a great job with this implementation; it's certainly
worth submitting as non-RFC for proper upstream review. I know that
some people didn't get your patches because they got flagged as spam
somehow. So I think it would be great to just resubmit it as non-RFC
so that it appears in patchwork as a to-be-reviewed patch set and
hopefully gets a wider audience to review it.

I've tried to answer some questions below, but would definitely like
more people to chime in. I haven't gone through the implementation in
detail, but superficially it looks pretty clean and certainly ready
for a proper non-RFC review.

One point about the user API: maybe use bpf_dynptr as the interface
for specifying a variable-sized lookup key instead of the hard-coded
bpf_qp_trie_key. Please check the recent work by Joanne on bpf_dynptr.

In short: it looks great, and I think it's certainly worth adding this
as a BPF map type. Please submit as non-RFC and go through a proper
review process. Looking forward to it (even if that means reviewing
1000+ lines of dense algorithmic code :) ).

> The initial motivation for qp-trie map is to reduce memory usage for
> string keys special those with large differencies in length as
> discussed in [0]. And as a big-endian lexicographical ordered map, it
> can also be used for any binary data with fixed or variable length.
>
> Now the basic functionality of qp-trie is ready, so posting a RFC version
> to get more feedback or suggestions about qp-trie. Specially feedback
> about the following questions:
>
> (1) Application scenario for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be the speed up of lookup operations due to lower tree
> depth of qp-trie. Maybe the performance of update could also be improved
> although in cillium there is a big lock during lpm-trie update [1]. Is

Well, using the qp-trie approach as an internal implementation of
lpm-trie is probably a major win already; what's wrong with that?

> there any other use cases for qp-trie ? Specially those cases which need
> both ordering and memory efficiency or cases in which jhash() of htab
> creates too much collisions and qp-trie lookup performances better than
> hash-table lookup as shown below:

I'm thinking about qp-trie as a dynamically growable lookup table.
That's a pretty big use case already. There is an RB tree proposal
under review right now which would also be dynamically growable, but
its focus is slightly different and it remains to be seen how
convenient it will be as a general-purpose resizable alternative to
hashmap. But from the benchmarks I've found online, an RB tree will
definitely use more memory than qp-trie.

The ordered property also seems very useful, but I don't yet have a
specific use case for it. Once we have a data structure like this in
BPF, though, I'm sure use cases will pop up.

>
>   Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
>
>   htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
>   htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
>   htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
>   htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
>   htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
>
>   qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
>   qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
>   qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
>   qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
>   qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
>

Kind of surprised that qp-trie is twice as fast as hashmap. Do you
have any idea why hashmap is slower? Was the htab pre-allocated? What
was its max_entries?

>   * non-zero drops is due to duplicated keys in generated keys.
>
> (2) more fine-grained lock in qp-trie
> Now qp-trie is divided into 256 sub-trees by using the first character of

character -> byte, it's not always strings, right?

> key and one sub-tree is protected one spinlock. From the data below,
> although the update/delete speed of qp-trie is slower compare with hash
> table, but it scales similar with hash table. So maybe 256-locks is a
> good enough solution ?
>
>   Strings in /proc/kallsyms
>   htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>   htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
>   htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
>   htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
>   htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>   qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
>   qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
>   qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
>   qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
>   qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)


qp-trie being slower than htab matches my expectation and what I
observed when trying to use qp-trie with libbpf. But I think for a lot
of cases the memory vs CPU tradeoff, coupled with the ability to grow
qp-trie dynamically, will make this data structure worthwhile.


>
> (3) Improve memory efficiency further
> When using strings in BTF string section as a data set for qp-trie, the
> slab memory usage showed in cgroup memory.stats file is about 11MB for
> qp-trie and 15MB for hash table as shown below. However the theoretical
> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu"
> fields from qp_trie_branch) and the extra memory usage (about 38% of total
> usage) mainly comes from internal fragment in slab (namely 2^n alignment
> for allocation) and overhead in kmem-cgroup accounting. We can reduce the
> internal fragment by creating separated kmem_cache for qp_trie_branch with
> different child nodes, but not sure whether it is worthy or not.
>

Please CC Paul McKenney (paulmck@kernel.org) on your non-RFC patch
set; maybe he has some good ideas on how to avoid having an rcu_head
in each leaf node. Maybe some sort of per-CPU queue of to-be-rcu-freed
elements, so that we don't have to keep 16 bytes in each leaf and
branch node?
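
Something along these lines, purely as a sketch of the idea rather
than code from this series: each CPU keeps its current batch of
to-be-freed leaves and hands the whole batch to call_rcu(), so the
nodes themselves don't need an rcu_head:

struct leaf_free_batch {
	struct rcu_head rcu;	/* one rcu_head per batch, not per leaf */
	unsigned int cnt;
	void *leaves[32];
};

static void leaf_free_batch_rcu(struct rcu_head *rcu)
{
	struct leaf_free_batch *b;
	unsigned int i;

	b = container_of(rcu, struct leaf_free_batch, rcu);
	for (i = 0; i < b->cnt; i++)
		kfree(b->leaves[i]);
	kfree(b);
}

/* on delete: append the leaf to this CPU's batch; once the batch is
 * full, call_rcu(&batch->rcu, leaf_free_batch_rcu) and start a new one.
 */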

> And in order to prevent allocating a rcu_head for each leaf node, now only
> branch node is RCU-freed, so when replacing a leaf node, a new branch node
> and a new leaf node will be allocated instead of replacing the old leaf
> node and RCU-freed the old leaf node. Also not sure whether or not it is
> worthy.
>
>   Strings in BTF string section (entries=115980):
>   htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
>   qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
>
>   All files under linux kernel source directory (entries=74359):
>   htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
>   qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
>
>   Domain names for Alexa top million web site (entries=1000000):
>   htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
>   qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
>
> Comments and suggestions are always welcome.
>
> Regards,
> Tao
>
> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
>
> Hou Tao (3):
>   bpf: Add support for qp-trie map
>   selftests/bpf: add a simple test for qp-trie
>   selftests/bpf: add benchmark for qp-trie map
>

Overall, looks good.



>  include/linux/bpf_types.h                     |    1 +
>  include/uapi/linux/bpf.h                      |    8 +
>  kernel/bpf/Makefile                           |    1 +
>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
>  tools/include/uapi/linux/bpf.h                |    8 +
>  tools/testing/selftests/bpf/Makefile          |    5 +-
>  tools/testing/selftests/bpf/bench.c           |   10 +
>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
>  12 files changed, 2022 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/bpf_qp_trie.c
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
>
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH bpf-next 0/3] Add support for qp-trie map
  2022-08-02 22:38 ` Andrii Nakryiko
@ 2022-08-08 17:54   ` Toke Høiland-Jørgensen
  2022-08-09  8:25   ` houtao
  1 sibling, 0 replies; 9+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-08-08 17:54 UTC (permalink / raw)
  To: Andrii Nakryiko, Hou Tao
  Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, Daniel Borkmann,
	Martin KaFai Lau, Yonghong Song, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, John Fastabend, Paul E . McKenney, Joanne Koong

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
>>
>> Hi,
>>
>
> Hey, sorry I'm catching up on upstream and there is just too many
> complicated patch sets coming in, so it takes time to get through
> them.
>
> I think you did a great job with this implementation, it's certainly
> worth submitting as non-RFC for proper upstream review. I know that
> some people didn't get your patches, they got into spam somehow. So I
> think it would be great to just resubmit it as non-RFC so that it
> appears in patchworks as to-be-reviewew patches and hopefully will get
> a wider audience to review this.
>
> I've tried to answer some questions below, but would definitely like
> more people to chime in. I haven't went through implementation in
> details, but superficially it looks pretty clean and certainly ready
> for proper non-RFC review.
>
> One point about user API would be to maybe instead use bpf_dynptr as
> an interface for specifying variable-sized lookup key instead of
> hard-coded
>  bpf_qp_trie_key. Please check recent work by Joanne on bpf_dynptr.
>
> In short: looks great, I think it's certainly worth adding this as BPF
> map type. Please submit as non-RFC and go through a proper review
> process. Looking forward (even if that means reviewing 1000k lines of
> dense algorithmic code :) ).
>
>> The initial motivation for qp-trie map is to reduce memory usage for
>> string keys special those with large differencies in length as
>> discussed in [0]. And as a big-endian lexicographical ordered map, it
>> can also be used for any binary data with fixed or variable length.
>>
>> Now the basic functionality of qp-trie is ready, so posting a RFC version
>> to get more feedback or suggestions about qp-trie. Specially feedback
>> about the following questions:
>>
>> (1) Application scenario for qp-trie
>> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
>> advantage would be the speed up of lookup operations due to lower tree
>> depth of qp-trie. Maybe the performance of update could also be improved
>> although in cillium there is a big lock during lpm-trie update [1]. Is
>
> Well, using qp-trie approach as an internal implementation of lpm-trie
> is probably a major win already, what's wrong with that?

+1, just improving the LPM trie map type is definitely worthwhile!

>> there any other use cases for qp-trie ? Specially those cases which need
>> both ordering and memory efficiency or cases in which jhash() of htab
>> creates too much collisions and qp-trie lookup performances better than
>> hash-table lookup as shown below:
>
> I'm thinking about qp-trie as dynamically growable lookup table.
> That's a pretty big use case already. There is an RB tree proposal
> under review right now which would also satisfy dynamically growable
> criteria, but its focus is slightly different and it remains to be
> seen how convenient it will be as general-purpose resizable
> alternative to hashmap. But from benchmarks I've found online, RB tree
> will definitely use more memory than qp-trie.
>
> Ordered property seems also very useful, but I don't yet have specific
> use case for that. But once we have data structure like this in BPF,
> I'm sure use cases will pop up.

From the various discussions of the packet queueing schemes (for both
XDP and sch_bpf), it seems some sort of ordered data structure is going
to be useful for that. Not sure this is a great fit for that use case,
but may be worth trying out. Looking harder at this is close to the top
of my list, so will play around with this series as well :)

-Toke

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH bpf-next 0/3] Add support for qp-trie map
  2022-08-02 22:38 ` Andrii Nakryiko
  2022-08-08 17:54   ` Toke Høiland-Jørgensen
@ 2022-08-09  8:25   ` houtao
  2022-08-16  5:33     ` Andrii Nakryiko
  1 sibling, 1 reply; 9+ messages in thread
From: houtao @ 2022-08-09  8:25 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Daniel Borkmann, Martin KaFai Lau,
	Yonghong Song, Song Liu, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	John Fastabend, Paul E . McKenney, Joanne Koong, houtao1

Hi,

On 8/3/2022 6:38 AM, Andrii Nakryiko wrote:
> On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
>> Hi,
>>
> Hey, sorry I'm catching up on upstream and there is just too many
> complicated patch sets coming in, so it takes time to get through
> them.
>
> I think you did a great job with this implementation, it's certainly
> worth submitting as non-RFC for proper upstream review. I know that
> some people didn't get your patches, they got into spam somehow. So I
> think it would be great to just resubmit it as non-RFC so that it
> appears in patchworks as to-be-reviewew patches and hopefully will get
> a wider audience to review this.
Thanks for your encouragement.
The spam issue is due to some misconfiguration in our company's email servers,
but it cannot be fixed soon, so I have switched to another email account and
hope that will fix the spam problem.

> I've tried to answer some questions below, but would definitely like
> more people to chime in. I haven't went through implementation in
> details, but superficially it looks pretty clean and certainly ready
> for proper non-RFC review.
>
> One point about user API would be to maybe instead use bpf_dynptr as
> an interface for specifying variable-sized lookup key instead of
> hard-coded
>  bpf_qp_trie_key. Please check recent work by Joanne on bpf_dynptr.
Will check how to achieve that. Does that mean we will have a real
variable-length key instead of a fixed-allocated-length key with a
variable used length?
> In short: looks great, I think it's certainly worth adding this as BPF
> map type. Please submit as non-RFC and go through a proper review
> process. Looking forward (even if that means reviewing 1000k lines of
> dense algorithmic code :) ).
Thanks. :)
>> The initial motivation for qp-trie map is to reduce memory usage for
>> string keys special those with large differencies in length as
>> discussed in [0]. And as a big-endian lexicographical ordered map, it
>> can also be used for any binary data with fixed or variable length.
>>
>> Now the basic functionality of qp-trie is ready, so posting a RFC version
>> to get more feedback or suggestions about qp-trie. Specially feedback
>> about the following questions:
>>
>> (1) Application scenario for qp-trie
>> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
>> advantage would be the speed up of lookup operations due to lower tree
>> depth of qp-trie. Maybe the performance of update could also be improved
>> although in cillium there is a big lock during lpm-trie update [1]. Is
> Well, using qp-trie approach as an internal implementation of lpm-trie
> is probably a major win already, what's wrong with that?
I am OK with that. My concern is that you had mentioned qp-trie could be used
to remove the global lock in lpm-trie, but there are still subtree locks in
qp-trie, so I am not sure whether the scalability of qp-trie is still a
problem. I searched for users of lpm-trie on github and only found Cilium. In
its source code [0], the update and deletion procedures of lpm-trie are
protected by a big lock, so maybe the global lock is not a big issue right now.

[0]:
https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
>> there any other use cases for qp-trie ? Specially those cases which need
>> both ordering and memory efficiency or cases in which jhash() of htab
>> creates too much collisions and qp-trie lookup performances better than
>> hash-table lookup as shown below:
> I'm thinking about qp-trie as dynamically growable lookup table.
> That's a pretty big use case already. There is an RB tree proposal
> under review right now which would also satisfy dynamically growable
> criteria, but its focus is slightly different and it remains to be
> seen how convenient it will be as general-purpose resizable
> alternative to hashmap. But from benchmarks I've found online, RB tree
> will definitely use more memory than qp-trie.
qp-trie is also RCU-read safe, which is hard for an rb-tree to achieve. I also
hope qp-trie will have better lookup performance than rb-tree.
> Ordered property seems also very useful, but I don't yet have specific
> use case for that. But once we have data structure like this in BPF,
> I'm sure use cases will pop up.
For an ordered map, next() and prev() helpers will be useful, but there is no
such support now. An embedded rb_node in a program-defined structure may make
it easier to implement such helpers. Maybe we can also add such helpers for
qp-trie as a next step?

>>   Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
>>
>>   htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
>>   htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
>>   htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
>>   htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
>>   htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
>>
>>   qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
>>   qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
>>   qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
>>   qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
>>   qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
>>
> Kind of surprised that qp-tire is twice as fast as hashmap. Do you
> have any idea why hashmap is slower? Was htab pre-allocated? What was
> it's max_entries?
The hash table is not pre-allocated, and pre-allocation doesn't help lookup
performance in this benchmark. max_entries is 16K, and if max_entries is set
to 4K or 128K, the performance win of qp-trie is almost the same (it is worse
when the data set is bigger).

The better performance is due to two reasons:
(1) The height of the qp-trie is low thanks to the randomly-generated data.
The top-most branch nodes in the qp-trie are almost full and have 16 or 17
child nodes. When using /proc/kallsyms as the input data set, the top-most
branch nodes only have about 2~4 child nodes.

(2) Large differences between the used lengths of keys.
The max key size is 256, but the range of valid data lengths is [1, 255] and
the unused part is zeroed. htab lookup has to compare 256 bytes each time
during list traversal, but qp-trie only needs to compare the used length of
the key.
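
To make the second point concrete, a rough sketch of the key layout
assumed by the benchmark above (field names follow the benchmark code;
the authoritative definition is in patch 1):

struct bpf_qp_trie_key {
	__u32 len;		/* number of valid bytes in data[] */
	unsigned char data[];	/* key bytes, zero-padded beyond len */
};

With key_size set to 256 plus the length field, htab has to hash and
compare the full buffer on every probe even when len is 1, while
qp-trie only walks and compares the len bytes that are actually used.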

>>   * non-zero drops is due to duplicated keys in generated keys.
>>
>> (2) more fine-grained lock in qp-trie
>> Now qp-trie is divided into 256 sub-trees by using the first character of
> character -> byte, it's not always strings, right?
Yes, qp-trie supports an arbitrary byte array as the key.
>> key and one sub-tree is protected one spinlock. From the data below,
>> although the update/delete speed of qp-trie is slower compare with hash
>> table, but it scales similar with hash table. So maybe 256-locks is a
>> good enough solution ?
>>
>>   Strings in /proc/kallsyms
>>   htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>>   htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
>>   htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
>>   htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
>>   htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>>   qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
>>   qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
>>   qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
>>   qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
>>   qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
> qp-trie being slower than htab matches my expectation and what I
> observed when trying to use qp-trie with libbpf. But I think for a lot
> of cases memory vs CPU tradeoff, coupled with ability to dynamically
> grow qp-trie will make this data structure worthwhile.
>
>
>> (3) Improve memory efficiency further
>> When using strings in BTF string section as a data set for qp-trie, the
>> slab memory usage showed in cgroup memory.stats file is about 11MB for
>> qp-trie and 15MB for hash table as shown below. However the theoretical
>> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu"
>> fields from qp_trie_branch) and the extra memory usage (about 38% of total
>> usage) mainly comes from internal fragment in slab (namely 2^n alignment
>> for allocation) and overhead in kmem-cgroup accounting. We can reduce the
>> internal fragment by creating separated kmem_cache for qp_trie_branch with
>> different child nodes, but not sure whether it is worthy or not.
>>
> Please CC Paul McKenney (paulmck@kernel.org) for your non-RFC patch
> set and maybe he has some good idea how to avoid having rcu_head in
> each leaf node. Maybe some sort of per-CPU queue of to-be-rcu-freed
> elements, so that we don't have to keep 16 bytes in each leaf and
> branch node?
Will cc Paul. I found the head-less kfree_rcu() which only needs a pointer,
but its downside is that the calling context needs to be sleepable, so it
doesn't fit. And it seems that the BPF-specific memory allocator from Alexei
can also help to remove the rcu_head in bpf map elements, right?
>> And in order to prevent allocating a rcu_head for each leaf node, now only
>> branch node is RCU-freed, so when replacing a leaf node, a new branch node
>> and a new leaf node will be allocated instead of replacing the old leaf
>> node and RCU-freed the old leaf node. Also not sure whether or not it is
>> worthy.
>>
>>   Strings in BTF string section (entries=115980):
>>   htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
>>   qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
>>
>>   All files under linux kernel source directory (entries=74359):
>>   htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
>>   qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
>>
>>   Domain names for Alexa top million web site (entries=1000000):
>>   htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
>>   qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
>>
>> Comments and suggestions are always welcome.
>>
>> Regards,
>> Tao
>>
>> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
>> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
>>
>> Hou Tao (3):
>>   bpf: Add support for qp-trie map
>>   selftests/bpf: add a simple test for qp-trie
>>   selftests/bpf: add benchmark for qp-trie map
>>
>
>>  include/linux/bpf_types.h                     |    1 +
>>  include/uapi/linux/bpf.h                      |    8 +
>>  kernel/bpf/Makefile                           |    1 +
>>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
>>  tools/include/uapi/linux/bpf.h                |    8 +
>>  tools/testing/selftests/bpf/Makefile          |    5 +-
>>  tools/testing/selftests/bpf/bench.c           |   10 +
>>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
>>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
>>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
>>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
>>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
>>  12 files changed, 2022 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/bpf/bpf_qp_trie.c
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
>>
>> --
>> 2.29.2
>>
> .


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH bpf-next 0/3] Add support for qp-trie map
  2022-08-09  8:25   ` houtao
@ 2022-08-16  5:33     ` Andrii Nakryiko
  0 siblings, 0 replies; 9+ messages in thread
From: Andrii Nakryiko @ 2022-08-16  5:33 UTC (permalink / raw)
  To: houtao
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Daniel Borkmann,
	Martin KaFai Lau, Yonghong Song, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, John Fastabend, Paul E . McKenney, Joanne Koong,
	houtao1

On Tue, Aug 9, 2022 at 1:25 AM houtao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/3/2022 6:38 AM, Andrii Nakryiko wrote:
> > On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
> >> Hi,
> >>
> > Hey, sorry I'm catching up on upstream and there is just too many
> > complicated patch sets coming in, so it takes time to get through
> > them.
> >
> > I think you did a great job with this implementation, it's certainly
> > worth submitting as non-RFC for proper upstream review. I know that
> > some people didn't get your patches, they got into spam somehow. So I
> > think it would be great to just resubmit it as non-RFC so that it
> > appears in patchworks as to-be-reviewew patches and hopefully will get
> > a wider audience to review this.
> Thanks for your encouragement.
> The spam email is due to some misconfiguration in email servers of our company,
> but it can not be fixed soon, so I change to use another email account and hope
> that will fix the spam problem.
>
> > I've tried to answer some questions below, but would definitely like
> > more people to chime in. I haven't went through implementation in
> > details, but superficially it looks pretty clean and certainly ready
> > for proper non-RFC review.
> >
> > One point about user API would be to maybe instead use bpf_dynptr as
> > an interface for specifying variable-sized lookup key instead of
> > hard-coded
> >  bpf_qp_trie_key. Please check recent work by Joanne on bpf_dynptr.
> Will check how to archive that. Does that means we will have real
> variable-length key instead of fixed-allocated-length key with
> variable-used-length ?

Yes, check the bpf_dynptr patches. We are still working on expanding
bpf_dynptr use cases and APIs, but all the pieces needed for
variable-sized keys are already in place.
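
Just to illustrate what that could look like from the BPF program side
(purely hypothetical: bpf_dynptr_from_mem() already exists, but a map
lookup taking a dynptr key is exactly the piece that would have to be
added for qp-trie):

SEC("tp/syscalls/sys_enter_getpgid")
int dynptr_key_example(void *ctx)
{
	char name[] = "vfs_read";
	struct bpf_dynptr key;

	/* wrap an on-stack buffer of arbitrary length in a dynptr */
	bpf_dynptr_from_mem(name, sizeof(name) - 1, 0, &key);

	/* hypothetical lookup interface, does not exist today:
	 * value = bpf_map_lookup_elem_dynptr(&qp_trie, &key);
	 */
	return 0;
}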

> > In short: looks great, I think it's certainly worth adding this as BPF
> > map type. Please submit as non-RFC and go through a proper review
> > process. Looking forward (even if that means reviewing 1000k lines of
> > dense algorithmic code :) ).
> Thanks. :)
> >> The initial motivation for qp-trie map is to reduce memory usage for
> >> string keys special those with large differencies in length as
> >> discussed in [0]. And as a big-endian lexicographical ordered map, it
> >> can also be used for any binary data with fixed or variable length.
> >>
> >> Now the basic functionality of qp-trie is ready, so posting a RFC version
> >> to get more feedback or suggestions about qp-trie. Specially feedback
> >> about the following questions:
> >>
> >> (1) Application scenario for qp-trie
> >> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> >> advantage would be the speed up of lookup operations due to lower tree
> >> depth of qp-trie. Maybe the performance of update could also be improved
> >> although in cillium there is a big lock during lpm-trie update [1]. Is
> > Well, using qp-trie approach as an internal implementation of lpm-trie
> > is probably a major win already, what's wrong with that?
> I am OK with that. My concern is that you had mentioned that qp-trie could be
> used to remove the global lock in lpm-trie, but now there are still subtree
> locks in qp-trie, so I am not sure whether the scalability of qp-trie is still a
> problem. I had searched users of lpm-trie in github and only found Cilium. In
> its source code [0], the update and deletion procedure of lpm-trie are protected
> by a big lock, so may be the global lock is not a big issue right now.
>

I think typically users just stay away from lpm-trie due to its
slowness. Once this is addressed, we'll have more users, as the
semantics of this BPF map are useful in a bunch of practical use cases.
So yes, it's very much an issue, overall.

> [0]:
> https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
> >> there any other use cases for qp-trie ? Specially those cases which need
> >> both ordering and memory efficiency or cases in which jhash() of htab
> >> creates too much collisions and qp-trie lookup performances better than
> >> hash-table lookup as shown below:
> > I'm thinking about qp-trie as a dynamically growable lookup table.
> > That's a pretty big use case already. There is an RB tree proposal
> > under review right now which would also satisfy the dynamically growable
> > criteria, but its focus is slightly different and it remains to be
> > seen how convenient it will be as a general-purpose resizable
> > alternative to hashmap. But from benchmarks I've found online, an RB tree
> > will definitely use more memory than qp-trie.
> qp-trie is also safe for RCU readers, which is hard for an rb-tree to achieve. I
> also hope qp-trie will have better lookup performance than an rb-tree.
> > The ordered property also seems very useful, but I don't yet have a specific
> > use case for it. But once we have a data structure like this in BPF,
> > I'm sure use cases will pop up.
> For an ordered map, next() and prev() helpers would be useful, but currently
> there is no such support. Embedding an rb_node in a program-defined structure
> may make it easier to implement such helpers. Maybe we can also add such helpers
> for qp-trie as a next step?
>

See the bpf_for_each_map_elem() helper; we already have the ability to
iterate BPF map elements in *some* order, and for qp-trie those
elements would be guaranteed to be visited in sorted order.
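
For reference, a sketch of the existing iteration mechanics, shown here with
a plain hash map since that is already supported today; the point above is
that the same helper over a qp-trie would visit keys in lexicographic order
(map and callback names are illustrative):

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, __u32);
          __type(value, __u64);
  } map SEC(".maps");

  struct cb_ctx {
          __u64 sum;
  };

  static __u64 sum_values(struct bpf_map *m, __u32 *key, __u64 *val,
                          struct cb_ctx *ctx)
  {
          ctx->sum += *val;
          return 0;                       /* 0 = continue, 1 = stop early */
  }

  SEC("tp/syscalls/sys_enter_getpid")
  int iterate(void *c)
  {
          struct cb_ctx ctx = {};

          /* walks every element; with a qp-trie this walk would be in
           * lexicographic key order */
          bpf_for_each_map_elem(&map, sum_values, &ctx, 0);
          return 0;
  }

  char _license[] SEC("license") = "GPL";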

> >>   Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
> >>
> >>   htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
> >>   htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
> >>   htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
> >>   htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
> >>   htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
> >>
> >>   qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
> >>   qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
> >>   qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
> >>   qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
> >>   qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
> >>
> > Kind of surprised that qp-trie is twice as fast as hashmap. Do you
> > have any idea why hashmap is slower? Was htab pre-allocated? What was
> > its max_entries?
> The hash table is not pre-allocated, and pre-allocation doesn't help lookup
> performance in the benchmark. max_entries is 16K, and if max_entries is set
> to 4K or 128K, the performance win of qp-trie is almost the same (and worse
> when the data set is bigger).
>
> The better performance is due to two reasons:
> (1) the height of qp-trie is low due to the randomly-generated data
> The top-most branch nodes in qp-trie are almost full and have 16 or 17 child
> nodes. When using /proc/kallsyms as the input data set, the number of child
> nodes of the top-most branch nodes is only about 2~4.
>

right, makes sense, dense data with a high branching factor is the best
scenario for trie-based data structures

> (2) large difference between the used lengths of keys
> The max key size is 256, but the range of valid data lengths is [1, 255] and the
> unused part is zeroed. An htab lookup has to compare all 256 bytes on each step
> of the list traversal, but qp-trie only needs to compare the used length of the key.

yep, makes sense as well
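
To make the comparison concrete, an illustrative key layout; the exact
bpf_qp_trie_key definition comes from the patch itself, so the names below
are only a sketch:

  struct qp_trie_key_example {            /* illustrative, not the patch's layout */
          __u32 len;                      /* used length, e.g. 1..255 */
          __u8  data[256];                /* zero-padded up to the max key size */
  };

  /* htab hashes and memcmp()s the full fixed-size key on every probe, while
   * qp-trie only walks and compares the first `len` bytes of data. */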

>
> >>   * non-zero drops is due to duplicated keys in generated keys.
> >>
> >> (2) more fine-grained lock in qp-trie
> >> Now qp-trie is divided into 256 sub-trees by using the first character of
> > character -> byte, it's not always strings, right?
> Yes, qp-trie supports an arbitrary byte array as the key.
> >> key and one sub-tree is protected one spinlock. From the data below,
> >> although the update/delete speed of qp-trie is slower compare with hash
> >> table, but it scales similar with hash table. So maybe 256-locks is a
> >> good enough solution ?
> >>
> >>   Strings in /proc/kallsyms
> >>   htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
> >>   htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
> >>   htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
> >>   htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
> >>   htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
> >>   qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
> >>   qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
> >>   qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
> >>   qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
> >>   qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
> > qp-trie being slower than htab matches my expectation and what I
> > observed when trying to use qp-trie with libbpf. But I think in a lot
> > of cases the memory vs CPU tradeoff, coupled with the ability to
> > dynamically grow, will make this data structure worthwhile.
> >
> >
> >> (3) Improve memory efficiency further
> >> When using strings in BTF string section as a data set for qp-trie, the
> >> slab memory usage showed in cgroup memory.stats file is about 11MB for
> >> qp-trie and 15MB for hash table as shown below. However the theoretical
> >> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu"
> >> fields from qp_trie_branch) and the extra memory usage (about 38% of total
> >> usage) mainly comes from internal fragment in slab (namely 2^n alignment
> >> for allocation) and overhead in kmem-cgroup accounting. We can reduce the
> >> internal fragment by creating separated kmem_cache for qp_trie_branch with
> >> different child nodes, but not sure whether it is worthy or not.
> >>
> > Please CC Paul McKenney (paulmck@kernel.org) on your non-RFC patch
> > set; maybe he has some good ideas on how to avoid having an rcu_head in
> > each leaf node. Maybe some sort of per-CPU queue of to-be-rcu-freed
> > elements, so that we don't have to keep 16 bytes in each leaf and
> > branch node?
> Will cc Paul. I found a head-less kfree_rcu() which only needs a pointer, but
> its downside is that the calling context needs to be sleepable, so it doesn't
> fit. And it seems that the BPF-specific memory allocator from Alexei could also
> help to remove the rcu_head from bpf map elements, right?

Hm.. don't know, maybe?
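
For context, a minimal kernel-side sketch (names invented, not the patch's
actual structures) of the per-leaf rcu_head cost being discussed and why the
non-sleepable deletion path needs it:

  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct qp_trie_leaf_example {
          struct rcu_head rcu;            /* the 16 bytes of per-leaf overhead */
          u32 len;
          u8 data[];
  };

  static void leaf_free_rcu(struct rcu_head *rcu)
  {
          kfree(container_of(rcu, struct qp_trie_leaf_example, rcu));
  }

  static void delete_leaf(struct qp_trie_leaf_example *leaf)
  {
          /* deletion runs under a spinlock, so the sleepable, head-less
           * kvfree_rcu(ptr) form does not fit; call_rcu() needs the
           * embedded rcu_head to defer the free past a grace period */
          call_rcu(&leaf->rcu, leaf_free_rcu);
  }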

> >> And in order to prevent allocating a rcu_head for each leaf node, now only
> >> branch node is RCU-freed, so when replacing a leaf node, a new branch node
> >> and a new leaf node will be allocated instead of replacing the old leaf
> >> node and RCU-freed the old leaf node. Also not sure whether or not it is
> >> worthy.
> >>
> >>   Strings in BTF string section (entries=115980):
> >>   htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
> >>   qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
> >>
> >>   All files under linux kernel source directory (entries=74359):
> >>   htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
> >>   qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
> >>
> >>   Domain names for Alexa top million web site (entries=1000000):
> >>   htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
> >>   qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
> >>
> >> Comments and suggestions are always welcome.
> >>
> >> Regards,
> >> Tao
> >>
> >> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
> >> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
> >>
> >> Hou Tao (3):
> >>   bpf: Add support for qp-trie map
> >>   selftests/bpf: add a simple test for qp-trie
> >>   selftests/bpf: add benchmark for qp-trie map
> >>
> >
> >>  include/linux/bpf_types.h                     |    1 +
> >>  include/uapi/linux/bpf.h                      |    8 +
> >>  kernel/bpf/Makefile                           |    1 +
> >>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
> >>  tools/include/uapi/linux/bpf.h                |    8 +
> >>  tools/testing/selftests/bpf/Makefile          |    5 +-
> >>  tools/testing/selftests/bpf/bench.c           |   10 +
> >>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
> >>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
> >>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
> >>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
> >>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
> >>  12 files changed, 2022 insertions(+), 1 deletion(-)
> >>  create mode 100644 kernel/bpf/bpf_qp_trie.c
> >>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
> >>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
> >>
> >> --
> >> 2.29.2
> >>
> > .
>


end of thread, other threads:[~2022-08-16  8:09 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-26 13:00 [RFC PATCH bpf-next 0/3] Add support for qp-trie map Hou Tao
2022-07-26 13:00 ` [RFC PATCH bpf-next 1/3] bpf: " Hou Tao
2022-07-26 13:00 ` [RFC PATCH bpf-next 2/3] selftests/bpf: add a simple test for qp-trie Hou Tao
2022-07-26 13:00 ` [RFC PATCH bpf-next 3/3] selftests/bpf: add benchmark for qp-trie map Hou Tao
2022-08-02  6:20 ` [RFC PATCH bpf-next 0/3] Add support " Hou Tao
2022-08-02 22:38 ` Andrii Nakryiko
2022-08-08 17:54   ` Toke Høiland-Jørgensen
2022-08-09  8:25   ` houtao
2022-08-16  5:33     ` Andrii Nakryiko
