* [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
@ 2022-03-31 12:28 Hou Tao
  2022-03-31 12:28 ` [RFC PATCH bpf-next 1/2] " Hou Tao
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Hou Tao @ 2022-03-31 12:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Yonghong Song
  Cc: Daniel Borkmann, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, John Fastabend,
	netdev, bpf, houtao1

Hi,

The initial motivation for the patchset came from a suggestion by Alexei.
During the discussion about supporting string keys in the hash table, he saw
the space efficiency of the ternary search tree in our early tests and
suggested posting it as a new bpf map [1].

A ternary search tree is a special trie whose nodes are arranged in a
manner similar to a binary search tree, but with up to three children
rather than two. The three children correspond to nodes whose value is
less than, equal to, and greater than the value of the current node,
respectively.

In the ternary search tree map, only the valid content of the string is
saved. The trailing null byte and the unused bytes after it are not saved.
If there are common prefixes between these strings, each prefix is saved
only once. Compared with other space-optimized tries (e.g. HAT-trie,
succinct trie), the advantages of the ternary search tree are its
simplicity and writability.
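
For illustration only, here is a minimal user-space sketch of the classic
ternary search tree lookup (one character per node). The map below stores a
multi-byte prefix per node instead, but the three-way branching is the same:

struct tst_node {
	char ch;                        /* character held by this node */
	struct tst_node *lt, *eq, *gt;  /* less-than, equal and greater-than children */
	void *value;                    /* non-NULL if a key terminates here */
};

static void *tst_lookup(struct tst_node *node, const char *key)
{
	while (node) {
		if (*key < node->ch)
			node = node->lt;
		else if (*key > node->ch)
			node = node->gt;
		else if (*++key)        /* matched: advance to the next character */
			node = node->eq;
		else
			return node->value;
	}
	return NULL;
}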

Below are diagrams of the ternary search tree map when inserting hello, he,
test and tea into it:

1. insert "hello"

        [ hello ]

2. insert "he": need split "hello" into "he" and "llo"

         [ he ]
            |
            *
            |
         [ llo ]

3. insert "test": add it as right child of "he"

         [ he ]
            |
            *-------x
            |       |
         [ llo ] [ test ]

4. insert "tea": split "test" into "te" and "st",
   and insert "a" as left child of "st"

         [ he ]
            |
            *-------x
            |       |
         [ llo ] [ te ]
                    |
                    *
                    |
                 [ st ]
                    |
               x----*
               |
             [ a ]

As shown in the above diagrams, the common prefix between "test" and "tea"
is "te" and it is saved only once. The patchset also adds benchmarks to
compare the memory usage and lookup performance of the ternary search tree
and the hash table. When the common prefix is lengthy (~192 bytes) and the
suffix is about 64 bytes long, there is about a 2~3 fold memory saving
compared with the hash table. But the memory saving comes at a price: the
lookup performance of tst is about 2~3 times slower than the hash table.
See patch #2 for more benchmark details.
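
For illustration, a rough sketch of how a BPF program could declare and
query the proposed map, modeled on progs/tst_bench.c in patch #2 (the
sizes, map name and attach point below are only examples):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_TST);
	__uint(key_size, 256);                  /* longest string incl. trailing '\0' */
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 10000);
	__uint(map_flags, BPF_F_NO_PREALLOC);   /* required at map creation */
} paths SEC(".maps");

SEC("tp/syscalls/sys_enter_getpgid")
int lookup_path(void *ctx)
{
	/* the key must be a non-empty, null-terminated string shorter than key_size */
	char name[256] = "/etc/hosts";
	__u32 *val;

	val = bpf_map_lookup_elem(&paths, name);
	if (val)
		bpf_printk("index %u", *val);
	return 0;
}

char _license[] SEC("license") = "GPL";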

Comments and suggestions are always welcome.

Regards,
Tao

[1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/

Hou Tao (2):
  bpf: Introduce ternary search tree for string key
  selftests/bpf: add benchmark for ternary search tree map

 include/linux/bpf_types.h                     |   1 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/Makefile                           |   1 +
 kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |   6 +
 .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
 tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
 10 files changed, 964 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/bpf_tst.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
 create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c

-- 
2.31.1



* [RFC PATCH bpf-next 1/2] bpf: Introduce ternary search tree for string key
  2022-03-31 12:28 [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Hou Tao
@ 2022-03-31 12:28 ` Hou Tao
  2022-04-08 23:00   ` Alexei Starovoitov
  2022-03-31 12:28 ` [RFC PATCH bpf-next] selftests/bpf: add benchmark for ternary search tree map Hou Tao
  2022-04-06 17:38 ` [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Andrii Nakryiko
  2 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2022-03-31 12:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Yonghong Song
  Cc: Daniel Borkmann, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, John Fastabend,
	netdev, bpf, houtao1

Now for string keys in the hash table, the storage size of the key for
each element is the size of the longest string. If there are large
differences in string length and common prefixes between these strings,
a lot of space is wasted. So introduce a map specifically for string
keys: the ternary search tree.

A ternary search tree is a special trie whose nodes are arranged in a
manner similar to a binary search tree, but with up to three children
rather than two. The three children correspond to nodes whose value is
less than, equal to, and greater than the value of the current node,
respectively.

For each key in the ternary search tree map, only the content before the
terminating null byte is saved. And just like a trie, the common prefixes
between these strings are shared, which can reduce the memory usage
significantly for hierarchical strings (e.g. file paths with a lengthy
prefix). But the space efficiency comes at a price: the lookup
performance is about 2~3 times or more slower than the hash table for
small or medium data sets.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   1 +
 kernel/bpf/Makefile            |   1 +
 kernel/bpf/bpf_tst.c           | 411 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 415 insertions(+)
 create mode 100644 kernel/bpf/bpf_tst.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 48a91c51c015..8ffb3ab7f337 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -126,6 +126,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TST, bpf_tst_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b0383d371b9a..470bf9a9353d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -907,6 +907,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_TST,
 };
 
 /* Note that tracing related programs such as
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..65176d4e99bf 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_i
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_tst.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
diff --git a/kernel/bpf/bpf_tst.c b/kernel/bpf/bpf_tst.c
new file mode 100644
index 000000000000..ab82aecaa023
--- /dev/null
+++ b/kernel/bpf/bpf_tst.c
@@ -0,0 +1,411 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <linux/bpf.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+/*
+ * A ternary search tree is a special trie whose nodes are arranged in
+ * a manner similar to a binary search tree, but with up to three children
+ * rather than two. The three children correspond to nodes whose value is
+ * less than, equal to, and greater than the value of the current node,
+ * respectively.
+ *
+ * The following are illustrations of ternary search tree during inserting
+ * hello, he, test, tea and team:
+ *
+ * 1. insert "hello"
+ *
+ *         [ hello ]
+ *
+ * 2. insert "he": need split "hello" into "he" and "llo"
+ *
+ *          [ he ]
+ *             |
+ *             *
+ *             |
+ *          [ llo ]
+ *
+ * 3. insert "test": add it as right child of "he"
+ *
+ *          [ he ]
+ *             |
+ *             *-------x
+ *             |       |
+ *          [ llo ] [ test ]
+ *
+ * 4. insert "tea": split "test" into "te" and "st",
+ *    and insert "a" as left child of "st"
+ *
+ *          [ he ]
+ *             |
+ *             *-------x
+ *             |       |
+ *          [ llo ] [ te ]
+ *                     |
+ *                     *
+ *                     |
+ *                  [ st ]
+ *                     |
+ *                x----*
+ *                |
+ *              [ a ]
+ *
+ * 5. insert "team": insert "m" as middle child of "a"
+ *
+ *          [ he ]
+ *             |
+ *             *-------x
+ *             |       |
+ *          [ llo ] [ te ]
+ *                     |
+ *                     *
+ *                     |
+ *                  [ st ]
+ *                     |
+ *                x----*
+ *                |
+ *              [ a ]
+ *                |
+ *                *
+ *                |
+ *              [ m ]
+ */
+#define TST_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_NO_PREALLOC | BPF_F_ACCESS_MASK)
+
+struct bpf_tst_node;
+
+struct bpf_tst_node {
+	struct rcu_head rcu;
+	struct bpf_tst_node __rcu *child[3];
+	u32 len;
+	bool leaf;
+	u8 key[];
+};
+
+struct bpf_tst {
+	struct bpf_map map;
+	struct bpf_tst_node __rcu *root;
+	size_t nr_entries;
+	spinlock_t lock;
+};
+
+/*
+ * match_prefix() - check whether prefix is fully matched
+ *
+ * @next: returns the position of the next character to compare in str
+ *
+ * Return 0 if str starts with prefix, 1 if str > prefix and -1 if str < prefix
+ */
+static int match_prefix(const unsigned char *prefix, int len,
+			const unsigned char *str, int *next)
+{
+	int i;
+
+	for (i = 0; i < len; i++) {
+		int cmp = str[i] - prefix[i];
+
+		if (cmp) {
+			*next = i;
+			return cmp > 0 ? 1 : -1;
+		}
+		if (!str[i])
+			break;
+	}
+
+	*next = len;
+	return 0;
+}
+
+/* Called from syscall or from eBPF program */
+static void *tst_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_tst *tst = container_of(map, struct bpf_tst, map);
+	struct bpf_tst_node *node;
+	const unsigned char *c = key;
+
+	/* A null terminated non-empty string */
+	if (!c[0] || c[map->key_size - 1])
+		return NULL;
+
+	node = rcu_dereference_protected(tst->root, rcu_read_lock_held());
+	while (node) {
+		int cmp;
+		int next;
+
+		cmp = match_prefix(node->key, node->len, c, &next);
+		/* Partially match an internal node */
+		if (cmp && next)
+			return NULL;
+
+		c += next;
+		/* Fully match */
+		if (!cmp && !*c) {
+			if (node->leaf)
+				return node->key + node->len;
+			return NULL;
+		}
+
+		node = rcu_dereference_protected(node->child[cmp + 1],
+						 rcu_read_lock_held());
+	}
+
+	return NULL;
+}
+
+/* Split node into two nodes */
+static struct bpf_tst_node *
+split_tst_node(struct bpf_map *map, struct bpf_tst_node *node, int next, void *value)
+{
+	struct bpf_tst_node *bot, *top;
+	size_t size;
+
+	size = sizeof(*bot) + node->len - next;
+	if (node->leaf)
+		size += map->value_size;
+	bot = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_NOWARN,
+				   map->numa_node);
+	if (!bot)
+		return NULL;
+
+	bot->child[0] = NULL;
+	/* node has been initialized, so no rcu_assign_pointer() */
+	bot->child[1] = node->child[1];
+	bot->child[2] = NULL;
+	bot->len = node->len - next;
+	bot->leaf = node->leaf;
+	memcpy(bot->key, node->key + next, bot->len);
+	if (bot->leaf)
+		memcpy(bot->key + bot->len, node->key + node->len,
+		       map->value_size);
+
+	size = sizeof(*top) + next;
+	if (value)
+		size += map->value_size;
+	top = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_NOWARN,
+				   map->numa_node);
+	if (!top) {
+		kfree(bot);
+		return NULL;
+	}
+
+	top->child[0] = node->child[0];
+	rcu_assign_pointer(top->child[1], bot);
+	top->child[2] = node->child[2];
+	top->len = next;
+	top->leaf = !!value;
+	memcpy(top->key, node->key, next);
+	if (value)
+		memcpy(top->key + top->len, value, map->value_size);
+
+	return top;
+}
+
+static struct bpf_tst_node *
+new_leaf_node(struct bpf_map *map, struct bpf_tst_node *node, bool replace,
+	      const void *c, void *value)
+{
+	struct bpf_tst_node *leaf;
+	size_t size;
+	unsigned int str_len;
+
+	/* Newly-created node or replace the original node */
+	if (!replace)
+		str_len = strlen(c);
+	else
+		str_len = node->len;
+	size = sizeof(*leaf) + str_len + map->value_size;
+	leaf = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_NOWARN,
+				    map->numa_node);
+	if (!leaf)
+		return NULL;
+
+	if (!replace) {
+		leaf->child[0] = leaf->child[1] = leaf->child[2] = NULL;
+		leaf->len = str_len;
+		memcpy(leaf->key, c, str_len);
+	} else {
+		memcpy(leaf, node, sizeof(*node) + str_len);
+	}
+	leaf->leaf = true;
+	memcpy(leaf->key + str_len, value, map->value_size);
+
+	return leaf;
+}
+
+/* Called from syscall or from eBPF program */
+static int tst_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
+{
+	struct bpf_tst *tst = container_of(map, struct bpf_tst, map);
+	struct bpf_tst_node __rcu **slot, **new_slot = NULL;
+	struct bpf_tst_node *node, *new_node, *new_intn_node = NULL;
+	unsigned long irq_flags;
+	const unsigned char *c = key;
+	bool replace;
+	int err = 0;
+
+	if (!c[0] || c[map->key_size - 1])
+		return -EINVAL;
+
+	spin_lock_irqsave(&tst->lock, irq_flags);
+	if (tst->nr_entries == map->max_entries) {
+		err = -ENOSPC;
+		goto out;
+	}
+
+	slot = &tst->root;
+	while ((node = rcu_dereference_protected(*slot, lockdep_is_held(&tst->lock)))) {
+		int cmp;
+		int next;
+
+		cmp = match_prefix(node->key, node->len, c, &next);
+		c += next;
+
+		/* Split internal node */
+		if (cmp && next) {
+			/* The split top node is a leaf node */
+			bool top_leaf = !*c;
+
+			new_node = split_tst_node(map, node, next,
+						  top_leaf ? value : NULL);
+			if (!new_node) {
+				err = -ENOMEM;
+				goto out;
+			}
+			if (top_leaf)
+				goto done;
+
+			new_intn_node = new_node;
+			new_slot = &new_node->child[1]->child[cmp + 1];
+			break;
+		}
+
+		/* Fully match */
+		if (!cmp && !*c)
+			break;
+		slot = &node->child[cmp + 1];
+	}
+
+	/* Replace the original node ? */
+	replace = node && !new_intn_node;
+	new_node = new_leaf_node(map, node, replace, c, value);
+	if (!new_node) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* Don't increase if replace a leaf node */
+	if (!replace || !node->leaf)
+		tst->nr_entries++;
+
+	/* Graft the leaf node first for splitting */
+	if (new_intn_node) {
+		rcu_assign_pointer(*new_slot, new_node);
+		new_node = new_intn_node;
+	}
+done:
+	rcu_assign_pointer(*slot, new_node);
+	spin_unlock_irqrestore(&tst->lock, irq_flags);
+	kfree_rcu(node, rcu);
+
+	return 0;
+out:
+	if (new_intn_node) {
+		kfree(new_intn_node->child[1]);
+		kfree(new_intn_node);
+	}
+	spin_unlock_irqrestore(&tst->lock, irq_flags);
+
+	return err;
+}
+
+static int tst_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EOPNOTSUPP;
+}
+
+static int tst_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -EOPNOTSUPP;
+}
+
+static struct bpf_map *tst_alloc(union bpf_attr *attr)
+{
+	struct bpf_tst *tst;
+
+	if (!bpf_capable())
+		return ERR_PTR(-EPERM);
+
+	if (!attr->key_size || !attr->value_size ||
+	    !attr->max_entries ||
+	    !(attr->map_flags & BPF_F_NO_PREALLOC) ||
+	    (attr->map_flags & ~TST_CREATE_FLAG_MASK) ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return ERR_PTR(-EINVAL);
+
+	tst = kzalloc(sizeof(*tst), GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+	if (!tst)
+		return ERR_PTR(-ENOMEM);
+
+	/* copy mandatory map attributes */
+	bpf_map_init_from_attr(&tst->map, attr);
+	spin_lock_init(&tst->lock);
+
+	return &tst->map;
+}
+
+static void tst_free(struct bpf_map *map)
+{
+	struct bpf_tst *tst = container_of(map, struct bpf_tst, map);
+	struct bpf_tst_node __rcu **slot;
+	struct bpf_tst_node *node;
+
+	/*
+	 * Always start at the root and walk down to a node that has no
+	 * children. Then free that node, nullify its reference in the parent
+	 * and start over.
+	 */
+	for (;;) {
+		slot = &tst->root;
+
+		for (;;) {
+			unsigned int i;
+
+			node = rcu_dereference_protected(*slot, 1);
+			if (!node)
+				goto out;
+
+			for (i = 0; i < ARRAY_SIZE(node->child); i++) {
+				if (rcu_access_pointer(node->child[i])) {
+					slot = &node->child[i];
+					break;
+				}
+			}
+
+			if (i < ARRAY_SIZE(node->child))
+				continue;
+
+			kfree(node);
+			RCU_INIT_POINTER(*slot, NULL);
+			break;
+		}
+	}
+
+out:
+	kfree(tst);
+}
+
+static int bpf_tst_map_btf_id;
+const struct bpf_map_ops bpf_tst_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc = tst_alloc,
+	.map_free = tst_free,
+	.map_get_next_key = tst_get_next_key,
+	.map_lookup_elem = tst_lookup_elem,
+	.map_update_elem = tst_update_elem,
+	.map_delete_elem = tst_delete_elem,
+	.map_btf_name = "bpf_tst",
+	.map_btf_id = &bpf_tst_map_btf_id,
+};
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b0383d371b9a..470bf9a9353d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -907,6 +907,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_TST,
 };
 
 /* Note that tracing related programs such as
-- 
2.31.1



* [RFC PATCH bpf-next] selftests/bpf: add benchmark for ternary search tree map
  2022-03-31 12:28 [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Hou Tao
  2022-03-31 12:28 ` [RFC PATCH bpf-next 1/2] " Hou Tao
@ 2022-03-31 12:28 ` Hou Tao
  2022-04-06 17:38 ` [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Andrii Nakryiko
  2 siblings, 0 replies; 14+ messages in thread
From: Hou Tao @ 2022-03-31 12:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Yonghong Song
  Cc: Daniel Borkmann, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	KP Singh, David S . Miller, Jakub Kicinski, John Fastabend,
	netdev, bpf, houtao1

Add a benchmark for the ternary search tree map to compare its lookup
performance and memory usage with the hash table for string keys.

To test the space efficiency of tst for string keys with common prefixes
(hierarchical keys), the benchmark constructs string keys of up to 256
bytes: up to a 64-byte base name and up to a 191-byte prefix. The
resulting space saving is about 2~3 fold, as confirmed by the data in the
tst-N-hk-dl and tst-N-hk-sl sections below.

It also tests the space saving when the string lengths differ: the max
length is 255 bytes and the min length is 16 bytes. There is about
0.5~0.8 space saving in tst compared with htab, as the data in the
tst-N-fk-dl section shows.

When the string keys have the same length and there are very few common
prefixes between them, the space usage of the ternary search tree is
greater than the hash table: about 3%~30% more, as shown in the
tst-N-fk-sl section.

For the hierarchical key case, the lookup performance of tst is about
2~3 times slower than the hash table when the number of elements is less
than 1M. The performance gap decreases as the number of elements
increases.

The following are the detailed outputs of benchmark:

Notations:

* hk: hierarchical string key
* fk: flat string key (non-hierarchical string key)
* dl: the lengths of string keys are different
* sl: the lengths of string keys are the same
* N: the number of string keys

Details:

tst-1000-hk-dl            2.398 ± 0.000M/s (drops 0.000 ± 0.000M/s, mem 0.144 MiB)
tst-10000-hk-dl           2.458 ± 0.001M/s (drops 0.000 ± 0.000M/s, mem 1.479 MiB)
tst-100000-hk-dl          2.744 ± 0.006M/s (drops 0.000 ± 0.000M/s, mem 12.566 MiB)
tst-1000000-hk-dl         2.364 ± 0.007M/s (drops 0.000 ± 0.000M/s, mem 120.862 MiB)
tst-10000000-hk-dl        2.467 ± 0.032M/s (drops 0.000 ± 0.000M/s, mem 1204.017 MiB)

htab-1000-hk-dl           5.930 ± 0.006M/s (drops 0.000 ± 0.000M/s, mem 0.496 MiB)
htab-10000-hk-dl          5.070 ± 0.005M/s (drops 0.000 ± 0.000M/s, mem 4.960 MiB)
htab-100000-hk-dl         2.765 ± 0.022M/s (drops 0.000 ± 0.000M/s, mem 49.591 MiB)
htab-1000000-hk-dl        1.870 ± 0.032M/s (drops 0.000 ± 0.000M/s, mem 495.911 MiB)
htab-10000000-hk-dl       1.854 ± 0.104M/s (drops 0.000 ± 0.000M/s, mem 4959.095 MiB)

tst-1000-hk-sl            0.172 ± 0.384M/s (drops 0.000 ± 0.000M/s, mem 0.184 MiB)
tst-10000-hk-sl           1.831 ± 0.000M/s (drops 0.000 ± 0.000M/s, mem 1.821 MiB)
tst-100000-hk-sl          1.724 ± 0.002M/s (drops 0.000 ± 0.000M/s, mem 15.161 MiB)
tst-1000000-hk-sl         1.611 ± 0.005M/s (drops 0.000 ± 0.000M/s, mem 148.276 MiB)
tst-10000000-hk-sl        1.608 ± 0.007M/s (drops 0.000 ± 0.000M/s, mem 1480.046 MiB)

htab-1000-hk-sl           5.913 ± 0.001M/s (drops 0.000 ± 0.000M/s, mem 0.496 MiB)
htab-10000-hk-sl          5.067 ± 0.003M/s (drops 0.000 ± 0.000M/s, mem 4.960 MiB)
htab-100000-hk-sl         2.816 ± 0.015M/s (drops 0.000 ± 0.000M/s, mem 49.592 MiB)
htab-1000000-hk-sl        1.841 ± 0.026M/s (drops 0.000 ± 0.000M/s, mem 495.911 MiB)
htab-10000000-hk-sl       1.847 ± 0.102M/s (drops 0.000 ± 0.000M/s, mem 4959.107 MiB)

tst-1000-fk-dl            0.180 ± 0.402M/s (drops 0.000 ± 0.000M/s, mem 0.329 MiB)
tst-10000-fk-dl           2.546 ± 0.001M/s (drops 0.000 ± 0.000M/s, mem 3.211 MiB)
tst-100000-fk-dl          1.702 ± 0.003M/s (drops 0.000 ± 0.000M/s, mem 27.835 MiB)
tst-1000000-fk-dl         0.941 ± 0.006M/s (drops 0.000 ± 0.000M/s, mem 262.322 MiB)
tst-10000000-fk-dl        0.658 ± 0.039M/s (drops 0.000 ± 0.000M/s, mem 2618.858 MiB)

htab-1000-fk-dl           5.903 ± 0.002M/s (drops 0.000 ± 0.000M/s, mem 0.496 MiB)
htab-10000-fk-dl          5.057 ± 0.004M/s (drops 0.000 ± 0.000M/s, mem 4.960 MiB)
htab-100000-fk-dl         2.748 ± 0.019M/s (drops 0.000 ± 0.000M/s, mem 49.595 MiB)
htab-1000000-fk-dl        1.857 ± 0.029M/s (drops 0.000 ± 0.000M/s, mem 495.911 MiB)
htab-10000000-fk-dl       1.853 ± 0.103M/s (drops 0.000 ± 0.000M/s, mem 4959.107 MiB)

tst-1000-fk-sl            0.094 ± 0.210M/s (drops 0.000 ± 0.000M/s, mem 0.640 MiB)
tst-10000-fk-sl           1.656 ± 0.000M/s (drops 0.000 ± 0.000M/s, mem 6.251 MiB)
tst-100000-fk-sl          1.088 ± 0.005M/s (drops 0.000 ± 0.000M/s, mem 53.708 MiB)
tst-1000000-fk-sl         0.732 ± 0.020M/s (drops 0.000 ± 0.000M/s, mem 513.421 MiB)
tst-10000000-fk-sl        0.581 ± 0.031M/s (drops 0.000 ± 0.000M/s, mem 5139.383 MiB)

htab-1000-fk-sl           5.861 ± 0.004M/s (drops 0.000 ± 0.000M/s, mem 0.496 MiB)
htab-10000-fk-sl          5.077 ± 0.006M/s (drops 0.000 ± 0.000M/s, mem 4.959 MiB)
htab-100000-fk-sl         2.771 ± 0.020M/s (drops 0.000 ± 0.000M/s, mem 49.592 MiB)
htab-1000000-fk-sl        1.833 ± 0.029M/s (drops 0.000 ± 0.000M/s, mem 495.911 MiB)
htab-10000000-fk-sl       1.828 ± 0.099M/s (drops 0.000 ± 0.000M/s, mem 4959.107 MiB)

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |   6 +
 .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
 tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
 5 files changed, 549 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
 create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e14636d82c6b..625805ba410f 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -538,18 +538,21 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
+$(OUTPUT)/bench_tst_map.o: $(OUTPUT)/tst_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(TESTING_HELPERS) \
 		 $(TRACE_HELPERS) \
+		 $(CGROUP_HELPERS) \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
 		 $(OUTPUT)/bench_ringbufs.o \
 		 $(OUTPUT)/bench_bloom_filter_map.o \
 		 $(OUTPUT)/bench_bpf_loop.o \
-		 $(OUTPUT)/bench_strncmp.o
+		 $(OUTPUT)/bench_strncmp.o \
+		 $(OUTPUT)/bench_tst_map.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index f973320e6dbf..5414ca0fcd65 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -190,12 +190,14 @@ extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
 extern struct argp bench_bpf_loop_argp;
 extern struct argp bench_strncmp_argp;
+extern struct argp bench_tst_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
+	{ &bench_tst_argp, 0, "tst map benchmark", 0 },
 	{},
 };
 
@@ -397,6 +399,8 @@ extern const struct bench bench_hashmap_with_bloom;
 extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
+extern const struct bench bench_htab_lookup;
+extern const struct bench bench_tst_lookup;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -431,6 +435,8 @@ static const struct bench *benchs[] = {
 	&bench_bpf_loop,
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
+	&bench_htab_lookup,
+	&bench_tst_lookup,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_tst_map.c b/tools/testing/selftests/bpf/benchs/bench_tst_map.c
new file mode 100644
index 000000000000..a1c1b86fad44
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_tst_map.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <argp.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include "bench.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#include "tst_bench.skel.h"
+
+#define MAX_KEY_SIZE 256
+
+#define PATH_LEAF_NR 1000
+#define PATH_MID_LVL_NR 10
+
+struct tst_key_component_desc {
+	unsigned int nr;
+	unsigned int len;
+};
+
+static struct tst_ctx {
+	struct tst_bench *skel;
+	struct tst_key_component_desc *desc;
+	unsigned int nr_desc;
+	char (*keys)[MAX_KEY_SIZE];
+	char tmp[MAX_KEY_SIZE];
+	unsigned int cursor;
+	int cgrp_dfd;
+	unsigned long long map_mem;
+} ctx;
+
+static struct {
+	bool flat_key;
+	bool same_len;
+	__u32 max_entries;
+} args = {
+	.flat_key = false,
+	.same_len = false,
+	.max_entries = 1000,
+};
+
+enum {
+	ARG_TST_ENTRIES = 7001,
+	ARG_FLAT_KEY = 7002,
+	ARG_SAME_LEN = 7003,
+};
+
+static const struct argp_option opts[] = {
+	{ "tst-entries", ARG_TST_ENTRIES, "TST_ENTRIES", 0,
+	  "Set the max entries" },
+	{ "flat-key", ARG_FLAT_KEY, NULL, 0,
+	  "Do not generate hierarchical key" },
+	{ "same-len", ARG_SAME_LEN, NULL, 0,
+	  "Generate the key with the same len" },
+	{},
+};
+
+static error_t tst_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_TST_ENTRIES:
+		args.max_entries = strtoul(arg, NULL, 10);
+		if (args.max_entries < PATH_LEAF_NR) {
+			fprintf(stderr, "invalid max entries %u (min %u)\n",
+				args.max_entries, PATH_LEAF_NR);
+			argp_usage(state);
+		}
+		break;
+	case ARG_FLAT_KEY:
+		args.flat_key = true;
+		break;
+	case ARG_SAME_LEN:
+		args.same_len = true;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_tst_argp = {
+	.options = opts,
+	.parser = tst_parse_arg,
+};
+
+static void tst_validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "tst_map benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static char tst_random_c(void)
+{
+	static const char tbl[] = "0123456789abcdefghijklmnopqrstuvwxyz._";
+	return tbl[random() % (sizeof(tbl) - 1)];
+}
+
+static unsigned int tst_calc_hierarchy(unsigned int nr)
+{
+	struct tst_key_component_desc *desc;
+	unsigned int left;
+	unsigned int total;
+	unsigned int depth;
+
+	/* Calculate the depth of hierarchical key */
+	depth = 1;
+	total = PATH_LEAF_NR;
+	left = nr / PATH_LEAF_NR;
+	while (left >= PATH_MID_LVL_NR) {
+		left /= PATH_MID_LVL_NR;
+		total *= PATH_MID_LVL_NR;
+		depth++;
+	}
+	depth++;
+	total *= left;
+
+	desc = calloc(depth, sizeof(*desc));
+	if (!desc) {
+		fprintf(stderr, "failed to alloc mem for desc\n");
+		exit(1);
+	}
+
+	/* Assign number and length for each component */
+	desc[depth - 1].nr = PATH_LEAF_NR;
+	desc[depth - 1].len = MAX_KEY_SIZE / 4;
+
+	desc[0].nr = left;
+	if (depth > 2) {
+		unsigned int avg;
+		unsigned int rem;
+		unsigned int i;
+
+		desc[0].len = MAX_KEY_SIZE / 32;
+
+		/* -1 for the trailing null byte */
+		left = MAX_KEY_SIZE - desc[0].len - desc[depth - 1].len - 1;
+		avg = left / (depth - 2);
+		rem = left - avg * (depth - 2);
+		for (i = 1; i <= depth - 2; i++) {
+			desc[i].nr = PATH_MID_LVL_NR;
+			desc[i].len = avg;
+			if (rem) {
+				desc[i].len += 1;
+				rem--;
+			}
+		}
+	} else {
+		desc[0].len = MAX_KEY_SIZE - desc[depth - 1].len - 1;
+	}
+
+	ctx.desc = desc;
+	ctx.nr_desc = depth;
+
+	return total;
+}
+
+static void tst_init_map_opts(struct tst_bench *skel)
+{
+	bpf_map__set_value_size(skel->maps.array, MAX_KEY_SIZE);
+	bpf_map__set_max_entries(skel->maps.array, args.max_entries);
+
+	bpf_map__set_key_size(skel->maps.htab, MAX_KEY_SIZE);
+	bpf_map__set_max_entries(skel->maps.htab, args.max_entries);
+
+	bpf_map__set_key_size(skel->maps.tst, MAX_KEY_SIZE);
+	bpf_map__set_max_entries(skel->maps.tst, args.max_entries);
+}
+
+static inline unsigned int tst_key_len(unsigned int max_len)
+{
+	unsigned int len;
+
+	if (args.same_len)
+		return max_len;
+
+	/* Make the differences between string length bigger */
+	len = random() % (max_len * 15 / 16 + 1) + max_len / 16;
+	if (len < 2)
+		len = 2;
+	return len;
+}
+
+static void tst_gen_hierarchical_key(unsigned int depth, unsigned int pos)
+{
+	unsigned int i, j, len;
+
+	if (depth >= ctx.nr_desc) {
+		memcpy(ctx.keys[ctx.cursor++], ctx.tmp, pos);
+		return;
+	}
+
+	for (i = 0; i < ctx.desc[depth].nr; i++) {
+		len = tst_key_len(ctx.desc[depth].len);
+
+		ctx.tmp[pos] = '/';
+		for (j = 1; j < len; j++)
+			ctx.tmp[pos + j] = tst_random_c();
+		tst_gen_hierarchical_key(depth + 1, pos + j);
+	}
+}
+
+static void tst_gen_flat_key(void)
+{
+	unsigned int i, j, len;
+
+	for (i = 0; i < args.max_entries; i++) {
+		len = tst_key_len(MAX_KEY_SIZE - 1);
+		for (j = 0; j < len; j++)
+			ctx.keys[i][j] = tst_random_c();
+	}
+}
+
+static void tst_alloc_and_fill_keys(void)
+{
+	ctx.keys = calloc(args.max_entries, sizeof(*ctx.keys));
+	if (!ctx.keys) {
+		fprintf(stderr, "failed to alloc mem for keys\n");
+		exit(1);
+	}
+
+	if (args.flat_key)
+		tst_gen_flat_key();
+	else
+		tst_gen_hierarchical_key(0, 0);
+}
+
+static void tst_setup_key_map(struct bpf_map *map)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < args.max_entries; i++) {
+		int err;
+
+		err = bpf_map_update_elem(fd, &i, ctx.keys[i], 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, ctx.keys[i], bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static unsigned long long tst_get_slab_mem(int dfd)
+{
+	const char *magic = "slab ";
+	const char *name = "memory.stat";
+	int fd;
+	ssize_t nr;
+	char buf[4096];
+	char *from;
+
+	fd = openat(dfd, name, 0);
+	if (fd < 0) {
+		fprintf(stderr, "no %s\n", name);
+		exit(1);
+	}
+
+	memset(buf, 0, sizeof(buf));
+	nr = read(fd, buf, sizeof(buf));
+	if (nr <= 0) {
+		fprintf(stderr, "empty %s ?\n", name);
+		exit(1);
+	}
+
+	close(fd);
+
+	from = strstr(buf, magic);
+	if (!from) {
+		fprintf(stderr, "no slab in %s\n", name);
+		exit(1);
+	}
+
+	return strtoull(from + strlen(magic), NULL, 10);
+}
+
+static void tst_setup_lookup_map(struct bpf_map *map)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+	unsigned long long before, after;
+
+	before = tst_get_slab_mem(ctx.cgrp_dfd);
+	for (i = 0; i < args.max_entries; i++) {
+		int err;
+
+		err = bpf_map_update_elem(fd, ctx.keys[i], &i, 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, ctx.keys[i], bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+	after = tst_get_slab_mem(ctx.cgrp_dfd);
+	ctx.map_mem = after - before;
+}
+
+static void tst_common_setup(void)
+{
+	struct tst_bench *skel;
+	int dfd;
+	int err;
+
+	srandom(time(NULL));
+
+	dfd = cgroup_setup_and_join("/tst");
+	if (dfd < 0) {
+		fprintf(stderr, "failed to setup cgroup env\n");
+		exit(1);
+	}
+	ctx.cgrp_dfd = dfd;
+
+	if (!args.flat_key)
+		args.max_entries = tst_calc_hierarchy(args.max_entries);
+
+	setup_libbpf();
+
+	skel = tst_bench__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	tst_init_map_opts(skel);
+
+	err = tst_bench__load(skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	tst_alloc_and_fill_keys();
+	tst_setup_key_map(skel->maps.array);
+
+	ctx.skel = skel;
+}
+
+static void tst_attach_prog(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+
+	link = bpf_program__attach(prog);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void htab_lookup_setup(void)
+{
+	tst_common_setup();
+	tst_setup_lookup_map(ctx.skel->maps.htab);
+	tst_attach_prog(ctx.skel->progs.htab_lookup);
+}
+
+static void tst_lookup_setup(void)
+{
+	tst_common_setup();
+	tst_setup_lookup_map(ctx.skel->maps.tst);
+	tst_attach_prog(ctx.skel->progs.tst_lookup);
+}
+
+static void *tst_producer(void *ctx)
+{
+	while (true)
+		(void)syscall(__NR_getpgid);
+	return NULL;
+}
+
+static void *tst_consumer(void *ctx)
+{
+	return NULL;
+}
+
+static void tst_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+	res->drops = atomic_swap(&ctx.skel->bss->drops, 0);
+}
+
+static void tst_report_final(struct bench_res res[], int res_cnt)
+{
+	close(ctx.cgrp_dfd);
+	cleanup_cgroup_environment();
+
+	fprintf(stdout, "Memory: %.3f MiB\n", (float)ctx.map_mem / 1024 / 1024);
+	hits_drops_report_final(res, res_cnt);
+}
+
+const struct bench bench_htab_lookup = {
+	.name = "htab-lookup",
+	.validate = tst_validate,
+	.setup = htab_lookup_setup,
+	.producer_thread = tst_producer,
+	.consumer_thread = tst_consumer,
+	.measure = tst_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = tst_report_final,
+};
+
+const struct bench bench_tst_lookup = {
+	.name = "tst-lookup",
+	.validate = tst_validate,
+	.setup = tst_lookup_setup,
+	.producer_thread = tst_producer,
+	.consumer_thread = tst_consumer,
+	.measure = tst_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = tst_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_tst.sh b/tools/testing/selftests/bpf/benchs/run_bench_tst.sh
new file mode 100755
index 000000000000..8209fd1341b7
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_tst.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2022. Huawei Technologies Co., Ltd
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+mem()
+{
+	echo "$*" | sed -E "s/.*Memory: ([0-9]+\.[0-9]+ MiB).*/\1/"
+}
+
+run_test()
+{
+	local title=$1
+	local summary
+
+	shift 1
+	summary=$(sudo ./bench -w1 -d4 -a "$@" | grep "Summary\|Memory:")
+	printf "%-25s %s (drops %s, mem %s)\n" "$title" "$(hits $summary)" \
+		"$(drops $summary)" "$(mem $summary)"
+}
+
+run_tests()
+{
+	local name=$1
+	local map
+	local nr
+	local s
+
+	shift 1
+	for map in tst htab
+	do
+		nr=1000
+		for s in $(seq 1 5)
+		do
+			run_test "$map-$nr-$name" $map-lookup --tst-entries $nr $@
+			let "nr *= 10"
+		done
+		echo
+	done
+}
+
+for key in hk fk
+do
+	opts=""
+	[ $key == "fk" ] && opts="--flat-key"
+	for len in dl sl
+	do
+		[ $len == "sl" ] && opts="$opts --same-len"
+		run_tests "$key-$len" "$opts"
+	done
+done
diff --git a/tools/testing/selftests/bpf/progs/tst_bench.c b/tools/testing/selftests/bpf/progs/tst_bench.c
new file mode 100644
index 000000000000..454c1abc6844
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tst_bench.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+struct bpf_map;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(value_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TST);
+	__uint(value_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} tst SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+long hits = 0;
+long drops = 0;
+
+static int lookup_htab(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	__u32 *index;
+
+	index = bpf_map_lookup_elem(&htab, value);
+	if (index && *index == *key)
+		__sync_add_and_fetch(&hits, 1);
+	else
+		__sync_add_and_fetch(&drops, 1);
+
+	return 0;
+}
+
+static int lookup_tst(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	__u32 *index;
+
+	index = bpf_map_lookup_elem(&tst, value);
+	if (index && *index == *key)
+		__sync_add_and_fetch(&hits, 1);
+	else
+		__sync_add_and_fetch(&drops, 1);
+
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int htab_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&array, lookup_htab, NULL, 0);
+	return 0;
+}
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int tst_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&array, lookup_tst, NULL, 0);
+	return 0;
+}
-- 
2.31.1



* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-03-31 12:28 [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Hou Tao
  2022-03-31 12:28 ` [RFC PATCH bpf-next 1/2] " Hou Tao
  2022-03-31 12:28 ` [RFC PATCH bpf-next] selftests/bpf: add benchmark for ternary search tree map Hou Tao
@ 2022-04-06 17:38 ` Andrii Nakryiko
  2022-04-09  3:07   ` Hou Tao
  2 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2022-04-06 17:38 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>
> The initial motivation for the patchset came from a suggestion by Alexei.
> During the discussion about supporting string keys in the hash table, he saw
> the space efficiency of the ternary search tree in our early tests and
> suggested posting it as a new bpf map [1].
>
> A ternary search tree is a special trie whose nodes are arranged in a
> manner similar to a binary search tree, but with up to three children
> rather than two. The three children correspond to nodes whose value is
> less than, equal to, and greater than the value of the current node,
> respectively.
>
> In the ternary search tree map, only the valid content of the string is
> saved. The trailing null byte and the unused bytes after it are not saved.
> If there are common prefixes between these strings, each prefix is saved
> only once. Compared with other space-optimized tries (e.g. HAT-trie,
> succinct trie), the advantages of the ternary search tree are its
> simplicity and writability.
>
> Below are diagrams for ternary search map when inserting hello, he,
> test and tea into it:
>
> 1. insert "hello"
>
>         [ hello ]
>
> 2. insert "he": need split "hello" into "he" and "llo"
>
>          [ he ]
>             |
>             *
>             |
>          [ llo ]
>
> 3. insert "test": add it as right child of "he"
>
>          [ he ]
>             |
>             *-------x
>             |       |
>          [ llo ] [ test ]
>
> 4. insert "tea": split "test" into "te" and "st",
>    and insert "a" as left child of "st"
>
>          [ he ]
>             |
>             *-------x
>             |       |
>          [ llo ] [ te ]
>                     |
>                     *
>                     |
>                  [ st ]
>                     |
>                x----*
>                |
>              [ a ]
>
> As shown in the above diagrams, the common prefix between "test" and "tea"
> is "te" and it is saved only once. The patchset also adds benchmarks to
> compare the memory usage and lookup performance of the ternary search tree
> and the hash table. When the common prefix is lengthy (~192 bytes) and the
> suffix is about 64 bytes long, there is about a 2~3 fold memory saving
> compared with the hash table. But the memory saving comes at a price: the
> lookup performance of tst is about 2~3 times slower than the hash table.
> See patch #2 for more benchmark details.
>
> Comments and suggestions are always welcome.
>

Have you heard of or tried qp-trie ([0]) by any chance? It is an elegant
and simple data structure. By all the available benchmarks it handily
beats Red-Black trees in terms of memory usage and performance (though
it of course depends on the data set, just like "memory compression"
for ternary tree of yours depends on large set of common prefixes).
qp-trie based BPF map seems (at least on paper) like a better
general-purpose BPF map that is dynamically sized (avoiding current
HASHMAP limitations) and stores keys in sorted order (and thus allows
meaningful ordered iteration *and*, importantly for longest prefix
match tree, allows efficient prefix matches). I did a quick experiment
about a month ago trying to replace libbpf's internal use of hashmap
with qp-trie for BTF string dedup and it was slightly slower than
hashmap (not surprisingly, though, because libbpf over-sizes hashmap
to avoid hash collisions and long chains in buckets), but it was still
very decent even in that scenario. So I've been mulling the idea of
implementing BPF map based on qp-trie elegant design and ideas, but
can't find time to do this.

This prefix sharing is nice when you have a lot of long common
prefixes, but I'm a bit skeptical that as a general-purpose BPF data
structure it's going to be that beneficial. 192 bytes of common
prefixes seems like a very unusual dataset :)

More specifically about the TST implementation in your patches. One global
per-map lock I think is a very big downside. We have LPM trie which is
very slow in big part due to the global lock. It might be possible to
design a more granular locking scheme for TST, but this whole in-place
splitting logic makes this harder. I think qp-trie can be locked in a
granular fashion much more easily with "hand over hand" locking: lock
parent, find child, lock child, unlock parent, move into child node.
Something like that would be more scalable overall, especially if the
access pattern is not focused on a narrow set of nodes.
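
(Not part of the series, just to illustrate the "hand over hand" idea in
kernel-style C; the node layout and the pick_child() helper are made up:)

struct qp_node {
	spinlock_t lock;
	struct qp_node *child[16];
	/* ... key material ... */
};

/* stand-in for whatever selects the next node on the path for @key */
static struct qp_node *pick_child(struct qp_node *node, const void *key);

/* Walk down while holding at most two locks at a time; the last node on
 * the path is returned still locked.
 */
static struct qp_node *descend_locked(struct qp_node *root, const void *key)
{
	struct qp_node *parent = root, *child;

	spin_lock(&parent->lock);
	while ((child = pick_child(parent, key))) {
		spin_lock(&child->lock);        /* take the child lock first ... */
		spin_unlock(&parent->lock);     /* ... then release the parent */
		parent = child;
	}
	return parent;
}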

Anyways, I love data structures and this one is an interesting idea.
But just my few cents of "production-readiness" for general-purpose
data structures for BPF.

  [0] https://dotat.at/prog/qp/README.html

> Regards,
> Tao
>
> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
>
> Hou Tao (2):
>   bpf: Introduce ternary search tree for string key
>   selftests/bpf: add benchmark for ternary search tree map
>
>  include/linux/bpf_types.h                     |   1 +
>  include/uapi/linux/bpf.h                      |   1 +
>  kernel/bpf/Makefile                           |   1 +
>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
>  tools/include/uapi/linux/bpf.h                |   1 +
>  tools/testing/selftests/bpf/Makefile          |   5 +-
>  tools/testing/selftests/bpf/bench.c           |   6 +
>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
>  10 files changed, 964 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/bpf_tst.c
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
>
> --
> 2.31.1
>


* Re: [RFC PATCH bpf-next 1/2] bpf: Introduce ternary search tree for string key
  2022-03-31 12:28 ` [RFC PATCH bpf-next 1/2] " Hou Tao
@ 2022-04-08 23:00   ` Alexei Starovoitov
  0 siblings, 0 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2022-04-08 23:00 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, netdev, bpf

On Thu, Mar 31, 2022 at 08:28:21PM +0800, Hou Tao wrote:
> Now for string keys in the hash table, the storage size of the key for
> each element is the size of the longest string. If there are large
> differences in string length and common prefixes between these strings,
> a lot of space is wasted. So introduce a map specifically for string
> keys: the ternary search tree.
> 
> A ternary search tree is a special trie whose nodes are arranged in a
> manner similar to a binary search tree, but with up to three children
> rather than two. The three children correspond to nodes whose value is
> less than, equal to, and greater than the value of the current node,
> respectively.
> 
> For each key in the ternary search tree map, only the content before the
> terminating null byte is saved. And just like a trie, the common prefixes
> between these strings are shared, which can reduce the memory usage
> significantly for hierarchical strings (e.g. file paths with a lengthy
> prefix). But the space efficiency comes at a price: the lookup
> performance is about 2~3 times or more slower than the hash table for
> small or medium data sets.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/bpf_tst.c           | 411 +++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |   1 +
>  5 files changed, 415 insertions(+)
>  create mode 100644 kernel/bpf/bpf_tst.c
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 48a91c51c015..8ffb3ab7f337 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -126,6 +126,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  #endif
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_TST, bpf_tst_map_ops)
>  
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b0383d371b9a..470bf9a9353d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -907,6 +907,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_INODE_STORAGE,
>  	BPF_MAP_TYPE_TASK_STORAGE,
>  	BPF_MAP_TYPE_BLOOM_FILTER,
> +	BPF_MAP_TYPE_TST,
>  };
>  
>  /* Note that tracing related programs such as
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index c1a9be6a4b9f..65176d4e99bf 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_i
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
> +obj-$(CONFIG_BPF_SYSCALL) += bpf_tst.o
>  obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
> diff --git a/kernel/bpf/bpf_tst.c b/kernel/bpf/bpf_tst.c
> new file mode 100644
> index 000000000000..ab82aecaa023
> --- /dev/null
> +++ b/kernel/bpf/bpf_tst.c
> @@ -0,0 +1,411 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
> +#include <linux/bpf.h>
> +#include <linux/rcupdate.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * A ternary search tree is a special trie whose nodes are arranged in
> + * a manner similar to a binary search tree, but with up to three children
> + * rather than two. The three children correspond to nodes whose value is
> + * less than, equal to, and greater than the value of the current node,
> + * respectively.
> + *
> + * The following are illustrations of ternary search tree during inserting
> + * hello, he, test, tea and team:
> + *
> + * 1. insert "hello"
> + *
> + *         [ hello ]
> + *
> + * 2. insert "he": need split "hello" into "he" and "llo"
> + *
> + *          [ he ]
> + *             |
> + *             *
> + *             |
> + *          [ llo ]
> + *
> + * 3. insert "test": add it as right child of "he"
> + *
> + *          [ he ]
> + *             |
> + *             *-------x
> + *             |       |
> + *          [ llo ] [ test ]
> + *
> + * 4. insert "tea": split "test" into "te" and "st",
> + *    and insert "a" as left child of "st"
> + *
> + *          [ he ]
> + *             |
> + *             *-------x
> + *             |       |
> + *          [ llo ] [ te ]
> + *                     |
> + *                     *
> + *                     |
> + *                  [ st ]
> + *                     |
> + *                x----*
> + *                |
> + *              [ a ]
> + *
> + * 5. insert "team": insert "m" as middle child of "a"
> + *
> + *          [ he ]
> + *             |
> + *             *-------x
> + *             |       |
> + *          [ llo ] [ te ]
> + *                     |
> + *                     *
> + *                     |
> + *                  [ st ]
> + *                     |
> + *                x----*
> + *                |
> + *              [ a ]
> + *                |
> + *                *
> + *                |
> + *              [ m ]
> + */
> +#define TST_CREATE_FLAG_MASK \
> +	(BPF_F_NUMA_NODE | BPF_F_NO_PREALLOC | BPF_F_ACCESS_MASK)
> +
> +struct bpf_tst_node;
> +
> +struct bpf_tst_node {
> +	struct rcu_head rcu;
> +	struct bpf_tst_node __rcu *child[3];
> +	u32 len;
> +	bool leaf;
> +	u8 key[];
> +};
> +
> +struct bpf_tst {
> +	struct bpf_map map;
> +	struct bpf_tst_node __rcu *root;
> +	size_t nr_entries;
> +	spinlock_t lock;
> +};
> +
> +/*
> + * match_prefix() - check whether prefix is fully matched
> + *
> + * @next: returns the position of next-to-compare character in str
> + *
> + * Return 0 if str has prefix, 1 if str > prefix and -1 if str < prefix
> + */
> +static int match_prefix(const unsigned char *prefix, int len,
> +			const unsigned char *str, int *next)
> +{
> +	int i;
> +
> +	for (i = 0; i < len; i++) {
> +		int cmp = str[i] - prefix[i];
> +
> +		if (cmp) {
> +			*next = i;
> +			return cmp > 0 ? 1 : -1;
> +		}
> +		if (!str[i])
> +			break;
> +	}
> +
> +	*next = len;
> +	return 0;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *tst_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	struct bpf_tst *tst = container_of(map, struct bpf_tst, map);
> +	struct bpf_tst_node *node;
> +	const unsigned char *c = key;
> +
> +	/* A null terminated non-empty string */
> +	if (!c[0] || c[map->key_size - 1])
> +		return NULL;

Looks like map->key_size is only used to make sure that
strlen(c) doesn't go over key_size bytes that were guaranteed by the verifier.
Looks like the algorithm should work for any blob of bytes instead of a string.
Can we make 'key' variable length, similar to lpm,
where the first u32 would be the length of the blob?
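
Something along these lines, mirroring struct bpf_lpm_trie_key? (Purely a
hypothetical sketch, not part of the current patch.)

struct bpf_tst_key {
	__u32	len;	/* number of valid bytes in data[] */
	__u8	data[];	/* arbitrary bytes, up to key_size - sizeof(__u32) */
};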

> +
> +	node = rcu_dereference_protected(tst->root, rcu_read_lock_held());
> +	while (node) {
> +		int cmp;
> +		int next;
> +
> +		cmp = match_prefix(node->key, node->len, c, &next);
> +		/* Partially match an internal node */
> +		if (cmp && next)
> +			return NULL;
> +
> +		c += next;
> +		/* Fully match */
> +		if (!cmp && !*c) {
> +			if (node->leaf)
> +				return node->key + node->len;
> +			return NULL;
> +		}
> +
> +		node = rcu_dereference_protected(node->child[cmp + 1],
> +						 rcu_read_lock_held());
> +	}
> +
> +	return NULL;
> +}
> +
> +/* Split node into two nodes */
> +static struct bpf_tst_node *
> +split_tst_node(struct bpf_map *map, struct bpf_tst_node *node, int next, void *value)
> +{
> +	struct bpf_tst_node *bot, *top;
> +	size_t size;
> +
> +	size = sizeof(*bot) + node->len - next;
> +	if (node->leaf)
> +		size += map->value_size;
> +	bot = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_NOWARN,
> +				   map->numa_node);
> +	if (!bot)
> +		return NULL;
> +
> +	bot->child[0] = NULL;
> +	/* node has been initialized, so no rcu_assign_pointer() */
> +	bot->child[1] = node->child[1];
> +	bot->child[2] = NULL;
> +	bot->len = node->len - next;
> +	bot->leaf = node->leaf;
> +	memcpy(bot->key, node->key + next, bot->len);
> +	if (bot->leaf)
> +		memcpy(bot->key + bot->len, node->key + node->len,
> +		       map->value_size);
> +
> +	size = sizeof(*top) + next;
> +	if (value)
> +		size += map->value_size;
> +	top = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_NOWARN,
> +				   map->numa_node);
> +	if (!top) {
> +		kfree(bot);
> +		return NULL;
> +	}
> +
> +	top->child[0] = node->child[0];
> +	rcu_assign_pointer(top->child[1], bot);
> +	top->child[2] = node->child[2];
> +	top->len = next;
> +	top->leaf = !!value;
> +	memcpy(top->key, node->key, next);
> +	if (value)
> +		memcpy(top->key + top->len, value, map->value_size);
> +
> +	return top;
> +}
> +
> +static struct bpf_tst_node *
> +new_leaf_node(struct bpf_map *map, struct bpf_tst_node *node, bool replace,
> +	      const void *c, void *value)
> +{
> +	struct bpf_tst_node *leaf;
> +	size_t size;
> +	unsigned int str_len;
> +
> +	/* Newly-created node or replace the original node */
> +	if (!replace)
> +		str_len = strlen(c);
> +	else
> +		str_len = node->len;
> +	size = sizeof(*leaf) + str_len + map->value_size;
> +	leaf = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_NOWARN,
> +				    map->numa_node);
> +	if (!leaf)
> +		return NULL;
> +
> +	if (!replace) {
> +		leaf->child[0] = leaf->child[1] = leaf->child[2] = NULL;
> +		leaf->len = str_len;
> +		memcpy(leaf->key, c, str_len);
> +	} else {
> +		memcpy(leaf, node, sizeof(*node) + str_len);
> +	}
> +	leaf->leaf = true;
> +	memcpy(leaf->key + str_len, value, map->value_size);
> +
> +	return leaf;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int tst_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
> +{
> +	struct bpf_tst *tst = container_of(map, struct bpf_tst, map);
> +	struct bpf_tst_node __rcu **slot, **new_slot = NULL;
> +	struct bpf_tst_node *node, *new_node, *new_intn_node = NULL;
> +	unsigned long irq_flags;
> +	const unsigned char *c = key;
> +	bool replace;
> +	int err = 0;
> +
> +	if (!c[0] || c[map->key_size - 1])
> +		return -EINVAL;
> +
> +	spin_lock_irqsave(&tst->lock, irq_flags);

The global lock is one of the reasons the LPM map is seldom used.
As Andrii suggested, can you tweak the algorithm to do a subtree lock?
Maybe the lock doesn't need to be taken until the "split" action has
identified the spot. Then lock the parent and recheck the spot.
In other words, would it be possible to move the lock into "if (cmp && next)" below?

> +	if (tst->nr_entries == map->max_entries) {
> +		err = -ENOSPC;
> +		goto out;
> +	}
> +
> +	slot = &tst->root;
> +	while ((node = rcu_dereference_protected(*slot, lockdep_is_held(&tst->lock)))) {
> +		int cmp;
> +		int next;
> +
> +		cmp = match_prefix(node->key, node->len, c, &next);
> +		c += next;
> +
> +		/* Split internal node */
> +		if (cmp && next) {
> +			/* The split top node is a leaf node */
> +			bool top_leaf = !*c;
> +
> +			new_node = split_tst_node(map, node, next,
> +						  top_leaf ? value : NULL);
> +			if (!new_node) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +			if (top_leaf)
> +				goto done;
> +
> +			new_intn_node = new_node;
> +			new_slot = &new_node->child[1]->child[cmp + 1];
> +			break;
> +		}
> +
> +		/* Fully match */
> +		if (!cmp && !*c)
> +			break;
> +		slot = &node->child[cmp + 1];
> +	}
> +
> +	/* Replace the original node ? */
> +	replace = node && !new_intn_node;
> +	new_node = new_leaf_node(map, node, replace, c, value);
> +	if (!new_node) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Don't increase if replace a leaf node */
> +	if (!replace || !node->leaf)
> +		tst->nr_entries++;
> +
> +	/* Graft the leaf node first for splitting */
> +	if (new_intn_node) {
> +		rcu_assign_pointer(*new_slot, new_node);
> +		new_node = new_intn_node;
> +	}
> +done:
> +	rcu_assign_pointer(*slot, new_node);
> +	spin_unlock_irqrestore(&tst->lock, irq_flags);
> +	kfree_rcu(node, rcu);
> +
> +	return 0;
> +out:
> +	if (new_intn_node) {
> +		kfree(new_intn_node->child[1]);
> +		kfree(new_intn_node);
> +	}
> +	spin_unlock_irqrestore(&tst->lock, irq_flags);
> +
> +	return err;
> +}
> +
> +static int tst_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int tst_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +	return -EOPNOTSUPP;
> +}

For the patches to land, delete and get_next_key should be implemented.
We cannot leave them for later.

Could you do a bit more benchmarking to highlight the benefits of this data structure?
For example:
. store kallsyms in this map and compare memory usage vs existing kallsyms
. take vmlinux BTF, store all of its strings, and compare the memory usage

I suspect in both cases the raw strings in BTF and kallsyms consume less
space, but it would be good to compare.
One of the reasons is that we need a fast string search in both cases.
Currently it's done via a for_each linear loop, which is inefficient.
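
For reference, the pattern being replaced is roughly the following kind of
scan over a NUL-separated string section (user-space illustration only, not
the actual kernel code):

#include <string.h>

/* Walk a concatenation of NUL-terminated strings and return the offset of
 * the match, or -1 if the string is not present. */
static int find_str_linear(const char *strs, unsigned int strs_len,
			   const char *s)
{
	unsigned int off = 0;

	while (off < strs_len) {
		if (!strcmp(strs + off, s))
			return off;
		off += strlen(strs + off) + 1;
	}
	return -1;
}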

Thanks for working on this.
Overall looks very promising.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-04-06 17:38 ` [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Andrii Nakryiko
@ 2022-04-09  3:07   ` Hou Tao
  2022-04-13  4:09     ` Andrii Nakryiko
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2022-04-09  3:07 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

Hi,

On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:
> On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@huawei.com> wrote:
>> Hi,
>>
>> The initial motivation for the patchset is due to the suggestion of Alexei.
>> During the discuss of supporting of string key in hash-table, he saw the
>> space efficiency of ternary search tree under our early test and suggest
>> us to post it as a new bpf map [1].
>>
>> Ternary search tree is a special trie where nodes are arranged in a
>> manner similar to binary search tree, but with up to three children
>> rather than two. The three children correpond to nodes whose value is
>> less than, equal to, and greater than the value of current node
>> respectively.
>>
>> In ternary search tree map, only the valid content of string is saved.
>> The trailing null byte and unused bytes after it are not saved. If there
>> are common prefixes between these strings, the prefix is only saved once.
>> Compared with other space optimized trie (e.g. HAT-trie, succinct trie),
>> the advantage of ternary search tree is simple and being writeable.
>>
>> Below are diagrams for ternary search map when inserting hello, he,
>> test and tea into it:
>>
>> 1. insert "hello"
>>
>>         [ hello ]
>>
>> 2. insert "he": need split "hello" into "he" and "llo"
>>
>>          [ he ]
>>             |
>>             *
>>             |
>>          [ llo ]
>>
>> 3. insert "test": add it as right child of "he"
>>
>>          [ he ]
>>             |
>>             *-------x
>>             |       |
>>          [ llo ] [ test ]
>>
>> 5. insert "tea": split "test" into "te" and "st",
>>    and insert "a" as left child of "st"
>>
>>          [ he ]
>>             |
>>      x------*-------x
>>      |      |       |
>>   [ ah ] [ llo ] [ te ]
>>                     |
>>                     *
>>                     |
>>                  [ st ]
>>                     |
>>                x----*
>>                |
>>              [ a ]
>>
>> As showed in above diagrams, the common prefix between "test" and "tea"
>> is "te" and it only is saved once. Also add benchmarks to compare the
>> memory usage and lookup performance between ternary search tree and
>> hash table. When the common prefix is lengthy (~192 bytes) and the
>> length of suffix is about 64 bytes, there are about 2~3 folds memory
>> saving compared with hash table. But the memory saving comes at prices:
>> the lookup performance of tst is about 2~3 slower compared with hash
>> table. See more benchmark details on patch #2.
>>
>> Comments and suggestions are always welcome.
>>
> Have you heard and tried qp-trie ([0]) by any chance? It is elegant
> and simple data structure. By all the available benchmarks it handily
> beats Red-Black trees in terms of memory usage and performance (though
> it of course depends on the data set, just like "memory compression"
> for ternary tree of yours depends on large set of common prefixes).
> qp-trie based BPF map seems (at least on paper) like a better
> general-purpose BPF map that is dynamically sized (avoiding current
> HASHMAP limitations) and stores keys in sorted order (and thus allows
> meaningful ordered iteration *and*, importantly for longest prefix
> match tree, allows efficient prefix matches). I did a quick experiment
> about a month ago trying to replace libbpf's internal use of hashmap
> with qp-trie for BTF string dedup and it was slightly slower than
> hashmap (not surprisingly, though, because libbpf over-sizes hashmap
> to avoid hash collisions and long chains in buckets), but it was still
> very decent even in that scenario. So I've been mulling the idea of
> implementing BPF map based on qp-trie elegant design and ideas, but
> can't find time to do this.
I heard about it when checking the space efficiency of HAT-trie [0]. Because
qp-trie needs to save the whole string key in the leaf node, its space
efficiency cannot be better than a ternary search tree for strings with a
common prefix, so I did not consider it. But I will do some benchmarks to
compare the lookup performance and space efficiency of qp-trie and tst for
strings with a common prefix and for strings without much common prefix.
If qp-trie turns out to be better, I can take the time to post it as a bpf map
if you are OK with that.


>
> This prefix sharing is nice when you have a lot of long common
> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> structure it's going to be that beneficial. 192 bytes of common
> prefixes seems like a very unusual dataset :)
Yes. The case with common prefixes that I know of is full file paths.
> More specifically about TST implementation in your paches. One global
> per-map lock I think is a very big downside. We have LPM trie which is
> very slow in big part due to global lock. It might be possible to
> design more granular schema for TST, but this whole in-place splitting
> logic makes this harder. I think qp-trie can be locked in a granular
> fashion much more easily by having a "hand over hand" locking: lock
> parent, find child, lock child, unlock parent, move into child node.
> Something like that would be more scalable overall, especially if the
> access pattern is not focused on a narrow set of nodes.
Yes. The global lock is a problem, but the splitting is not in-place. I will
try to figure out whether the locking can be made more scalable after the
benchmark comparison between qp-trie and tst.

Regards,
Tao

[0]: https://github.com/Tessil/hat-trie
>
> Anyways, I love data structures and this one is an interesting idea.
> But just my few cents of "production-readiness" for general-purpose
> data structures for BPF.
>
>   [0] https://dotat.at/prog/qp/README.html
>
>> Regards,
>> Tao
>>
>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
>>
>> Hou Tao (2):
>>   bpf: Introduce ternary search tree for string key
>>   selftests/bpf: add benchmark for ternary search tree map
>>
>>  include/linux/bpf_types.h                     |   1 +
>>  include/uapi/linux/bpf.h                      |   1 +
>>  kernel/bpf/Makefile                           |   1 +
>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
>>  tools/include/uapi/linux/bpf.h                |   1 +
>>  tools/testing/selftests/bpf/Makefile          |   5 +-
>>  tools/testing/selftests/bpf/bench.c           |   6 +
>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
>>  10 files changed, 964 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/bpf/bpf_tst.c
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
>>
>> --
>> 2.31.1
>>
> .


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-04-09  3:07   ` Hou Tao
@ 2022-04-13  4:09     ` Andrii Nakryiko
  2022-04-14  1:03       ` Hou Tao
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2022-04-13  4:09 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

On Fri, Apr 8, 2022 at 8:08 PM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>
> On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:
> > On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@huawei.com> wrote:
> >> Hi,
> >>
> >> The initial motivation for the patchset is due to the suggestion of Alexei.
> >> During the discuss of supporting of string key in hash-table, he saw the
> >> space efficiency of ternary search tree under our early test and suggest
> >> us to post it as a new bpf map [1].
> >>
> >> Ternary search tree is a special trie where nodes are arranged in a
> >> manner similar to binary search tree, but with up to three children
> >> rather than two. The three children correpond to nodes whose value is
> >> less than, equal to, and greater than the value of current node
> >> respectively.
> >>
> >> In ternary search tree map, only the valid content of string is saved.
> >> The trailing null byte and unused bytes after it are not saved. If there
> >> are common prefixes between these strings, the prefix is only saved once.
> >> Compared with other space optimized trie (e.g. HAT-trie, succinct trie),
> >> the advantage of ternary search tree is simple and being writeable.
> >>
> >> Below are diagrams for ternary search map when inserting hello, he,
> >> test and tea into it:
> >>
> >> 1. insert "hello"
> >>
> >>         [ hello ]
> >>
> >> 2. insert "he": need split "hello" into "he" and "llo"
> >>
> >>          [ he ]
> >>             |
> >>             *
> >>             |
> >>          [ llo ]
> >>
> >> 3. insert "test": add it as right child of "he"
> >>
> >>          [ he ]
> >>             |
> >>             *-------x
> >>             |       |
> >>          [ llo ] [ test ]
> >>
> >> 5. insert "tea": split "test" into "te" and "st",
> >>    and insert "a" as left child of "st"
> >>
> >>          [ he ]
> >>             |
> >>      x------*-------x
> >>      |      |       |
> >>   [ ah ] [ llo ] [ te ]
> >>                     |
> >>                     *
> >>                     |
> >>                  [ st ]
> >>                     |
> >>                x----*
> >>                |
> >>              [ a ]
> >>
> >> As showed in above diagrams, the common prefix between "test" and "tea"
> >> is "te" and it only is saved once. Also add benchmarks to compare the
> >> memory usage and lookup performance between ternary search tree and
> >> hash table. When the common prefix is lengthy (~192 bytes) and the
> >> length of suffix is about 64 bytes, there are about 2~3 folds memory
> >> saving compared with hash table. But the memory saving comes at prices:
> >> the lookup performance of tst is about 2~3 slower compared with hash
> >> table. See more benchmark details on patch #2.
> >>
> >> Comments and suggestions are always welcome.
> >>
> > Have you heard and tried qp-trie ([0]) by any chance? It is elegant
> > and simple data structure. By all the available benchmarks it handily
> > beats Red-Black trees in terms of memory usage and performance (though
> > it of course depends on the data set, just like "memory compression"
> > for ternary tree of yours depends on large set of common prefixes).
> > qp-trie based BPF map seems (at least on paper) like a better
> > general-purpose BPF map that is dynamically sized (avoiding current
> > HASHMAP limitations) and stores keys in sorted order (and thus allows
> > meaningful ordered iteration *and*, importantly for longest prefix
> > match tree, allows efficient prefix matches). I did a quick experiment
> > about a month ago trying to replace libbpf's internal use of hashmap
> > with qp-trie for BTF string dedup and it was slightly slower than
> > hashmap (not surprisingly, though, because libbpf over-sizes hashmap
> > to avoid hash collisions and long chains in buckets), but it was still
> > very decent even in that scenario. So I've been mulling the idea of
> > implementing BPF map based on qp-trie elegant design and ideas, but
> > can't find time to do this.
> I have heard about it when check the space efficient of HAT trie [0], because
> qp-trie needs to save the whole string key in the leaf node and its space
> efficiency can not be better than ternary search tree for strings with common
> prefix, so I did not consider about it. But I will do some benchmarks to check
> the lookup performance and space efficiency of qp-trie and tst for string with
> common prefix and strings without much common prefix.
> If qp-trie is better, I think I can take the time to post it as a bpf map if you
> are OK with that.

You can probably always craft a data set where prefix sharing is so
prevalent that space savings are very significant. But I think for a
lot of real-world data it won't be as extreme and qp-trie might be
very comparable (if not more memory-efficient) due to very compact
node layout (which was the point of qp-trie). So I'd be really curious
to see some comparisons. Would be great if you can try both!

>
>
> >
> > This prefix sharing is nice when you have a lot of long common
> > prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> > structure it's going to be that beneficial. 192 bytes of common
> > prefixes seems like a very unusual dataset :)
> Yes. The case with common prefix I known is full file path.
> > More specifically about TST implementation in your paches. One global
> > per-map lock I think is a very big downside. We have LPM trie which is
> > very slow in big part due to global lock. It might be possible to
> > design more granular schema for TST, but this whole in-place splitting
> > logic makes this harder. I think qp-trie can be locked in a granular
> > fashion much more easily by having a "hand over hand" locking: lock
> > parent, find child, lock child, unlock parent, move into child node.
> > Something like that would be more scalable overall, especially if the
> > access pattern is not focused on a narrow set of nodes.
> Yes. The global lock is a problem but the splitting is not in-place. I will try
> to figure out whether the lock can be more scalable after the benchmark test
> between qp-trie and tst.

Great, looking forward!

>
> Regards,
> Tao
>
> [0]: https://github.com/Tessil/hat-trie
> >
> > Anyways, I love data structures and this one is an interesting idea.
> > But just my few cents of "production-readiness" for general-purpose
> > data structures for BPF.
> >
> >   [0] https://dotat.at/prog/qp/README.html
> >
> >> Regards,
> >> Tao
> >>
> >> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
> >>
> >> Hou Tao (2):
> >>   bpf: Introduce ternary search tree for string key
> >>   selftests/bpf: add benchmark for ternary search tree map
> >>
> >>  include/linux/bpf_types.h                     |   1 +
> >>  include/uapi/linux/bpf.h                      |   1 +
> >>  kernel/bpf/Makefile                           |   1 +
> >>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
> >>  tools/include/uapi/linux/bpf.h                |   1 +
> >>  tools/testing/selftests/bpf/Makefile          |   5 +-
> >>  tools/testing/selftests/bpf/bench.c           |   6 +
> >>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
> >>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
> >>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
> >>  10 files changed, 964 insertions(+), 1 deletion(-)
> >>  create mode 100644 kernel/bpf/bpf_tst.c
> >>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
> >>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
> >>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
> >>
> >> --
> >> 2.31.1
> >>
> > .
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-04-13  4:09     ` Andrii Nakryiko
@ 2022-04-14  1:03       ` Hou Tao
  2022-04-14 21:25         ` Andrii Nakryiko
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2022-04-14  1:03 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov
  Cc: Yonghong Song, Daniel Borkmann, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, KP Singh, David S . Miller,
	Jakub Kicinski, John Fastabend, Networking, bpf, houtao1

Hi,

(I mistakenly sent my previous reply in HTML mode and the mailing list didn't
receive it, so I am sending it again in plain text mode.)

On 4/13/2022 12:09 PM, Andrii Nakryiko wrote:
> On Fri, Apr 8, 2022 at 8:08 PM Hou Tao <houtao1@huawei.com> wrote:
>> Hi,
>>
>> On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:
>>> On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@huawei.com> wrote:
>>>> Hi,
>>>>
>>>> The initial motivation for the patchset is due to the suggestion of Alexei.
>>>> During the discuss of supporting of string key in hash-table, he saw the
>>>> space efficiency of ternary search tree under our early test and suggest
>>>> us to post it as a new bpf map [1].
>>>>
>>>> Ternary search tree is a special trie where nodes are arranged in a
>>>> manner similar to binary search tree, but with up to three children
>>>> rather than two. The three children correpond to nodes whose value is
>>>> less than, equal to, and greater than the value of current node
>>>> respectively.
>>>>
>>>> In ternary search tree map, only the valid content of string is saved.
>>>> The trailing null byte and unused bytes after it are not saved. If there
>>>> are common prefixes between these strings, the prefix is only saved once.
>>>> Compared with other space optimized trie (e.g. HAT-trie, succinct trie),
>>>> the advantage of ternary search tree is simple and being writeable.
snip
>>>>
>>> Have you heard and tried qp-trie ([0]) by any chance? It is elegant
>>> and simple data structure. By all the available benchmarks it handily
>>> beats Red-Black trees in terms of memory usage and performance (though
>>> it of course depends on the data set, just like "memory compression"
>>> for ternary tree of yours depends on large set of common prefixes).
>>> qp-trie based BPF map seems (at least on paper) like a better
>>> general-purpose BPF map that is dynamically sized (avoiding current
>>> HASHMAP limitations) and stores keys in sorted order (and thus allows
>>> meaningful ordered iteration *and*, importantly for longest prefix
>>> match tree, allows efficient prefix matches). I did a quick experiment
>>> about a month ago trying to replace libbpf's internal use of hashmap
>>> with qp-trie for BTF string dedup and it was slightly slower than
>>> hashmap (not surprisingly, though, because libbpf over-sizes hashmap
>>> to avoid hash collisions and long chains in buckets), but it was still
>>> very decent even in that scenario. So I've been mulling the idea of
>>> implementing BPF map based on qp-trie elegant design and ideas, but
>>> can't find time to do this.
>> I have heard about it when check the space efficient of HAT trie [0], because
>> qp-trie needs to save the whole string key in the leaf node and its space
>> efficiency can not be better than ternary search tree for strings with common
>> prefix, so I did not consider about it. But I will do some benchmarks to check
>> the lookup performance and space efficiency of qp-trie and tst for string with
>> common prefix and strings without much common prefix.
>> If qp-trie is better, I think I can take the time to post it as a bpf map if you
>> are OK with that.
> You can probably always craft a data set where prefix sharing is so
> prevalent that space savings are very significant. But I think for a
> lot of real-world data it won't be as extreme and qp-trie might be
> very comparable (if not more memory-efficient) due to very compact
> node layout (which was the point of qp-trie). So I'd be really curious
> to see some comparisons. Would be great if you can try both!
It is a bit surprising to me that qp-trie has better memory efficiency (and
sometimes better lookup performance) than tst when there are not many common
prefixes between the input strings (all tests below were conducted by
implementing the data structures in user space):

* all unique symbols in /proc/kallsyms (171428 sorted symbols,  4.2MB in total)

                       | qp-trie | tst   | hash  |
total memory used (MB) | 8.6     | 11.2  | 22.3  |
total update time (us) | 94623   | 87396 | 24477 |
total lookup time (us) | 50681   | 67395 | 22842 |

* all strings in BTF string area (115980 unsorted strings, 2MB in total)

                       | qp-trie | tst   | hash  |
total memory used (MB) | 5.0     | 7.3   | 13.5  |
total update time (us) | 67764   | 43484 | 16462 |
total lookup time (us) | 33732   | 31612 | 16462 |

* all strings in BTF string area (115980 sorted strings, 2MB in total)

                       | qp-trie | tst   | hash  |
total memory used (MB) | 5.0     | 7.3   | 13.5  |
total update time (us) | 58745   | 57756 | 16210 |
total lookup time (us) | 26922   | 40850 | 16896 |

* all files under the Linux kernel source tree (2.7MB, 74359 file paths
generated by the find utility with "./" stripped)

                       | qp-trie | tst   | hash  |
total memory used (MB) | 4.6     | 5.2   | 11.6  |
total update time (us) | 50422   | 28842 | 15255 |
total lookup time (us) | 22543   | 18252 | 11836 |

When the length of the common prefix increases, the ternary search tree
becomes better than qp-trie.

* all files under the Linux kernel source tree with a common prefix prepended
(e.g. "/home/houtao")

                       | qp-trie | tst   | hash  |
total memory used (MB) | 5.5     | 5.2   | 12.2  |
total update time (us) | 51558   | 29835 | 15345 |
total lookup time (us) | 23121   | 19638 | 11540 |

Because such lengthy common prefixes are not so common, and for a string map I
think memory efficiency and lookup performance matter more than update
performance, maybe qp-trie is a better choice for a string map? Any suggestions?

Regards,
Tao
>>
>>> This prefix sharing is nice when you have a lot of long common
>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
>>> structure it's going to be that beneficial. 192 bytes of common
>>> prefixes seems like a very unusual dataset :)
>> Yes. The case with common prefix I known is full file path.
>>> More specifically about TST implementation in your paches. One global
>>> per-map lock I think is a very big downside. We have LPM trie which is
>>> very slow in big part due to global lock. It might be possible to
>>> design more granular schema for TST, but this whole in-place splitting
>>> logic makes this harder. I think qp-trie can be locked in a granular
>>> fashion much more easily by having a "hand over hand" locking: lock
>>> parent, find child, lock child, unlock parent, move into child node.
>>> Something like that would be more scalable overall, especially if the
>>> access pattern is not focused on a narrow set of nodes.
>> Yes. The global lock is a problem but the splitting is not in-place. I will try
>> to figure out whether the lock can be more scalable after the benchmark test
>> between qp-trie and tst.
> Great, looking forward!
>
>> Regards,
>> Tao
>>
>> [0]: https://github.com/Tessil/hat-trie
>>> Anyways, I love data structures and this one is an interesting idea.
>>> But just my few cents of "production-readiness" for general-purpose
>>> data structures for BPF.
>>>
>>>   [0] https://dotat.at/prog/qp/README.html
>>>
>>>> Regards,
>>>> Tao
>>>>
>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
>>>>
>>>> Hou Tao (2):
>>>>   bpf: Introduce ternary search tree for string key
>>>>   selftests/bpf: add benchmark for ternary search tree map
>>>>
>>>>  include/linux/bpf_types.h                     |   1 +
>>>>  include/uapi/linux/bpf.h                      |   1 +
>>>>  kernel/bpf/Makefile                           |   1 +
>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
>>>>  tools/include/uapi/linux/bpf.h                |   1 +
>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
>>>>  create mode 100644 kernel/bpf/bpf_tst.c
>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
>>>>
>>>> --
>>>> 2.31.1
>>>>
>>> .
> .


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-04-14  1:03       ` Hou Tao
@ 2022-04-14 21:25         ` Andrii Nakryiko
  2022-04-26  8:03           ` Hou Tao
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2022-04-14 21:25 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

On Wed, Apr 13, 2022 at 6:03 PM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>
> (I send my previous reply in HTML mode mistakenly and the mail list doesn't
> receive it, so send it again in the plain text mode.)
>
> On 4/13/2022 12:09 PM, Andrii Nakryiko wrote:
> > On Fri, Apr 8, 2022 at 8:08 PM Hou Tao <houtao1@huawei.com> wrote:
> >> Hi,
> >>
> >> On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:
> >>> On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@huawei.com> wrote:
> >>>> Hi,
> >>>>
> >>>> The initial motivation for the patchset is due to the suggestion of Alexei.
> >>>> During the discuss of supporting of string key in hash-table, he saw the
> >>>> space efficiency of ternary search tree under our early test and suggest
> >>>> us to post it as a new bpf map [1].
> >>>>
> >>>> Ternary search tree is a special trie where nodes are arranged in a
> >>>> manner similar to binary search tree, but with up to three children
> >>>> rather than two. The three children correpond to nodes whose value is
> >>>> less than, equal to, and greater than the value of current node
> >>>> respectively.
> >>>>
> >>>> In ternary search tree map, only the valid content of string is saved.
> >>>> The trailing null byte and unused bytes after it are not saved. If there
> >>>> are common prefixes between these strings, the prefix is only saved once.
> >>>> Compared with other space optimized trie (e.g. HAT-trie, succinct trie),
> >>>> the advantage of ternary search tree is simple and being writeable.
> snip
> >>>>
> >>> Have you heard and tried qp-trie ([0]) by any chance? It is elegant
> >>> and simple data structure. By all the available benchmarks it handily
> >>> beats Red-Black trees in terms of memory usage and performance (though
> >>> it of course depends on the data set, just like "memory compression"
> >>> for ternary tree of yours depends on large set of common prefixes).
> >>> qp-trie based BPF map seems (at least on paper) like a better
> >>> general-purpose BPF map that is dynamically sized (avoiding current
> >>> HASHMAP limitations) and stores keys in sorted order (and thus allows
> >>> meaningful ordered iteration *and*, importantly for longest prefix
> >>> match tree, allows efficient prefix matches). I did a quick experiment
> >>> about a month ago trying to replace libbpf's internal use of hashmap
> >>> with qp-trie for BTF string dedup and it was slightly slower than
> >>> hashmap (not surprisingly, though, because libbpf over-sizes hashmap
> >>> to avoid hash collisions and long chains in buckets), but it was still
> >>> very decent even in that scenario. So I've been mulling the idea of
> >>> implementing BPF map based on qp-trie elegant design and ideas, but
> >>> can't find time to do this.
> >> I have heard about it when check the space efficient of HAT trie [0], because
> >> qp-trie needs to save the whole string key in the leaf node and its space
> >> efficiency can not be better than ternary search tree for strings with common
> >> prefix, so I did not consider about it. But I will do some benchmarks to check
> >> the lookup performance and space efficiency of qp-trie and tst for string with
> >> common prefix and strings without much common prefix.
> >> If qp-trie is better, I think I can take the time to post it as a bpf map if you
> >> are OK with that.
> > You can probably always craft a data set where prefix sharing is so
> > prevalent that space savings are very significant. But I think for a
> > lot of real-world data it won't be as extreme and qp-trie might be
> > very comparable (if not more memory-efficient) due to very compact
> > node layout (which was the point of qp-trie). So I'd be really curious
> > to see some comparisons. Would be great if you can try both!
> It is a bit surprised to me that qp-trie has better memory efficiency  (and
> better lookup performance sometimes) compared with tst when there are not so
> many common prefix between input strings (All tests below are conducted by
> implementing the data structure in user-space):

Thanks for a quick follow up and a benchmark!

Low memory use is probably due to the minimal number of pointers and the
small amount of extra metadata per node in qp-trie. The qp-trie approach is
very lean, which is why I was recommending trying it out.

>
> * all unique symbols in /proc/kallsyms (171428 sorted symbols,  4.2MB in total)
>
>                                         | qp-trie   | tst    | hash   |
> total memory used (MB) | 8.6       | 11.2   | 22.3   |
> total update time (us) | 94623     | 87396  | 24477  |
> total lookup time (us) | 50681     | 67395  | 22842  |
>
> * all strings in BTF string area (115980 unsorted strings, 2MB in total)
>
>                                         | qp-trie   | tst    | hash   |
> total memory used (MB) | 5.0       | 7.3    | 13.5   |
> total update time (us) | 67764     | 43484  | 16462  |
> total lookup time (us) | 33732     | 31612  | 16462  |
>
> * all strings in BTF string area (115980 sorted string, 2MB in total)
>
>                                        | qp-trie   | tst    | hash   |
> total memory used (MB) | 5.0       | 7.3    | 13.5   |
> total update time (us) | 58745     | 57756  | 16210  |
> total lookup time (us) | 26922     | 40850  | 16896  |
>
> * all files under Linux kernel (2.7MB, 74359 files generated by find utility
> with "./" stripped)
>
>                                         | qp-trie   | tst    | hash   |
> total memory used (MB) | 4.6       | 5.2    | 11.6   |
> total update time (us) | 50422     | 28842  | 15255  |
> total lookup time (us) | 22543     | 18252  | 11836  |

Seems like lookup time is more or less on par (and for kallsyms it is
noticeably faster), but update is sometimes a bit slower. I don't know
if you wrote your own code or used an open-source implementation, but keep
in mind that the performance of qp-trie depends heavily on a fast
__builtin_popcount, so make sure you are using the proper -march when
compiling. See [0]

  [0] https://stackoverflow.com/questions/52161596/why-is-builtin-popcount-slower-than-my-own-bit-counting-function
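
To give a rough idea of why that matters: a qp-trie branch keeps an occupancy
bitmap and stores its children densely, so every level of the walk does
something like the following (illustrative snippet only, not the actual
qp-trie code):

#include <stdint.h>

/* The index of the child selected by `nibble` is the number of bits set
 * below that position in the branch's occupancy bitmap. */
static inline unsigned int child_index(uint32_t bitmap, unsigned int nibble)
{
	return __builtin_popcount(bitmap & ((1u << nibble) - 1));
}

Without a native POPCNT instruction each of those popcounts falls back to a
much slower generic implementation.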

>
> When the length of common prefix increases, ternary search tree becomes better
> than qp-trie.
>
> * all files under Linux kernel with a comm prefix (e.g. "/home/houtao")
>
>                                         | qp-trie   | tst    | hash   |
> total memory used (MB) | 5.5       | 5.2    | 12.2   |
> total update time (us) | 51558     | 29835  | 15345  |
> total lookup time (us) | 23121     | 19638  | 11540  |
>
> Because the lengthy prefix is not so common, and for string map I think the
> memory efficiency and lookup performance is more importance than update
> performance, so maybe qp-trie is a better choice for string map ?  Any suggestions ?
>

I'm biased :) But I like the idea of qp-trie as a general-purpose
ordered and dynamically sized BPF map. It makes no assumption about the
data being string-like or sharing common prefixes. It can be made to
work just as well with any array of bytes, making it very suitable as
a generic lookup table map. Note that the upstream implementation does
assume zero-terminated strings and no key being a prefix of another
key. But all of that can be removed. For fixed-length keys this can never
happen by construction; for variable-length keys (and we'll be able to
support these finally with bpf_dynptr's help very soon), we can record the
length of the key in each leaf and use that during comparisons.

Also note that qp-trie could be used internally by BPF_MAP_TYPE_LPM_TRIE
very efficiently and speed it up considerably in the process (and
especially get rid of the global lock).

So if you were to invest in a proper full-featured production
implementation of a BPF map, I'd start with qp-trie. From the available
benchmarks it's both faster and more memory-efficient than Red-Black
trees, which would be the alternative underlying implementation of such
an ordered and "resizable" map.


> Regards,
> Tao
> >>
> >>> This prefix sharing is nice when you have a lot of long common
> >>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> >>> structure it's going to be that beneficial. 192 bytes of common
> >>> prefixes seems like a very unusual dataset :)
> >> Yes. The case with common prefix I known is full file path.
> >>> More specifically about TST implementation in your paches. One global
> >>> per-map lock I think is a very big downside. We have LPM trie which is
> >>> very slow in big part due to global lock. It might be possible to
> >>> design more granular schema for TST, but this whole in-place splitting
> >>> logic makes this harder. I think qp-trie can be locked in a granular
> >>> fashion much more easily by having a "hand over hand" locking: lock
> >>> parent, find child, lock child, unlock parent, move into child node.
> >>> Something like that would be more scalable overall, especially if the
> >>> access pattern is not focused on a narrow set of nodes.
> >> Yes. The global lock is a problem but the splitting is not in-place. I will try
> >> to figure out whether the lock can be more scalable after the benchmark test
> >> between qp-trie and tst.
> > Great, looking forward!
> >
> >> Regards,
> >> Tao
> >>
> >> [0]: https://github.com/Tessil/hat-trie
> >>> Anyways, I love data structures and this one is an interesting idea.
> >>> But just my few cents of "production-readiness" for general-purpose
> >>> data structures for BPF.
> >>>
> >>>   [0] https://dotat.at/prog/qp/README.html
> >>>
> >>>> Regards,
> >>>> Tao
> >>>>
> >>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
> >>>>
> >>>> Hou Tao (2):
> >>>>   bpf: Introduce ternary search tree for string key
> >>>>   selftests/bpf: add benchmark for ternary search tree map
> >>>>
> >>>>  include/linux/bpf_types.h                     |   1 +
> >>>>  include/uapi/linux/bpf.h                      |   1 +
> >>>>  kernel/bpf/Makefile                           |   1 +
> >>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
> >>>>  tools/include/uapi/linux/bpf.h                |   1 +
> >>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
> >>>>  tools/testing/selftests/bpf/bench.c           |   6 +
> >>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
> >>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
> >>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
> >>>>  10 files changed, 964 insertions(+), 1 deletion(-)
> >>>>  create mode 100644 kernel/bpf/bpf_tst.c
> >>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
> >>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
> >>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
> >>>>
> >>>> --
> >>>> 2.31.1
> >>>>
> >>> .
> > .
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-04-14 21:25         ` Andrii Nakryiko
@ 2022-04-26  8:03           ` Hou Tao
  2022-04-27  3:57             ` Andrii Nakryiko
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2022-04-26  8:03 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

Hi,

On 4/15/2022 5:25 AM, Andrii Nakryiko wrote:
> On Wed, Apr 13, 2022 at 6:03 PM Hou Tao <houtao1@huawei.com> wrote:
>> Hi,
>>
>> (I send my previous reply in HTML mode mistakenly and the mail list doesn't
>> receive it, so send it again in the plain text mode.)
>>
>> On 4/13/2022 12:09 PM, Andrii Nakryiko wrote:
>>> On Fri, Apr 8, 2022 at 8:08 PM Hou Tao <houtao1@huawei.com> wrote:
>>>> Hi,
>>>>
>>>> On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:
>>>>> On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@huawei.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> The initial motivation for the patchset is due to the suggestion of Alexei.
>>>>>> During the discuss of supporting of string key in hash-table, he saw the
>>>>>> space efficiency of ternary search tree under our early test and suggest
>>>>>> us to post it as a new bpf map [1].
>>>>>>
>>>>>> Ternary search tree is a special trie where nodes are arranged in a
>>>>>> manner similar to binary search tree, but with up to three children
>>>>>> rather than two. The three children correpond to nodes whose value is
>>>>>> less than, equal to, and greater than the value of current node
>>>>>> respectively.
>>>>>>
>>>>>> In ternary search tree map, only the valid content of string is saved.
>>>>>> The trailing null byte and unused bytes after it are not saved. If there
>>>>>> are common prefixes between these strings, the prefix is only saved once.
>>>>>> Compared with other space optimized trie (e.g. HAT-trie, succinct trie),
>>>>>> the advantage of ternary search tree is simple and being writeable.
>> snip
>>>>> Have you heard and tried qp-trie ([0]) by any chance? It is elegant
>>>>> and simple data structure. By all the available benchmarks it handily
>>>>> beats Red-Black trees in terms of memory usage and performance (though
>>>>> it of course depends on the data set, just like "memory compression"
>>>>> for ternary tree of yours depends on large set of common prefixes).
>>>>> qp-trie based BPF map seems (at least on paper) like a better
>>>>> general-purpose BPF map that is dynamically sized (avoiding current
>>>>> HASHMAP limitations) and stores keys in sorted order (and thus allows
>>>>> meaningful ordered iteration *and*, importantly for longest prefix
>>>>> match tree, allows efficient prefix matches). I did a quick experiment
>>>>> about a month ago trying to replace libbpf's internal use of hashmap
>>>>> with qp-trie for BTF string dedup and it was slightly slower than
>>>>> hashmap (not surprisingly, though, because libbpf over-sizes hashmap
>>>>> to avoid hash collisions and long chains in buckets), but it was still
>>>>> very decent even in that scenario. So I've been mulling the idea of
>>>>> implementing BPF map based on qp-trie elegant design and ideas, but
>>>>> can't find time to do this.
>>>> I have heard about it when check the space efficient of HAT trie [0], because
>>>> qp-trie needs to save the whole string key in the leaf node and its space
>>>> efficiency can not be better than ternary search tree for strings with common
>>>> prefix, so I did not consider about it. But I will do some benchmarks to check
>>>> the lookup performance and space efficiency of qp-trie and tst for string with
>>>> common prefix and strings without much common prefix.
>>>> If qp-trie is better, I think I can take the time to post it as a bpf map if you
>>>> are OK with that.
>>> You can probably always craft a data set where prefix sharing is so
>>> prevalent that space savings are very significant. But I think for a
>>> lot of real-world data it won't be as extreme and qp-trie might be
>>> very comparable (if not more memory-efficient) due to very compact
>>> node layout (which was the point of qp-trie). So I'd be really curious
>>> to see some comparisons. Would be great if you can try both!
>> It is a bit surprised to me that qp-trie has better memory efficiency  (and
>> better lookup performance sometimes) compared with tst when there are not so
>> many common prefix between input strings (All tests below are conducted by
>> implementing the data structure in user-space):
> Thanks for a quick follow up and a benchmark!
>
> Low memory use is probably due to the minimal amount of pointers and
> extra metadata used per node in qp-trie. qp-trie approach is very
> lean, which is why I was recommending trying it out.
>
>> * all unique symbols in /proc/kallsyms (171428 sorted symbols,  4.2MB in total)
>>
>>                                         | qp-trie   | tst    | hash   |
>> total memory used (MB) | 8.6       | 11.2   | 22.3   |
>> total update time (us) | 94623     | 87396  | 24477  |
>> total lookup time (us) | 50681     | 67395  | 22842  |
>>
>> * all strings in BTF string area (115980 unsorted strings, 2MB in total)
>>
>>                                         | qp-trie   | tst    | hash   |
>> total memory used (MB) | 5.0       | 7.3    | 13.5   |
>> total update time (us) | 67764     | 43484  | 16462  |
>> total lookup time (us) | 33732     | 31612  | 16462  |
>>
>> * all strings in BTF string area (115980 sorted string, 2MB in total)
>>
>>                                        | qp-trie   | tst    | hash   |
>> total memory used (MB) | 5.0       | 7.3    | 13.5   |
>> total update time (us) | 58745     | 57756  | 16210  |
>> total lookup time (us) | 26922     | 40850  | 16896  |
>>
>> * all files under Linux kernel (2.7MB, 74359 files generated by find utility
>> with "./" stripped)
>>
>>                                         | qp-trie   | tst    | hash   |
>> total memory used (MB) | 4.6       | 5.2    | 11.6   |
>> total update time (us) | 50422     | 28842  | 15255  |
>> total lookup time (us) | 22543     | 18252  | 11836  |
> Seems like lookup time is more or less on par (and for kallsyms
> noticeably faster), but update is sometimes a bit slower. I don't know
> if you did your own code or used open-source implementation, but keep
> in mind that performance of qp-trie very much depends on fast
> __builtin_popcount, so make sure you are using proper -march when
> compiling. See [0]
>
>   [0] https://stackoverflow.com/questions/52161596/why-is-builtin-popcount-slower-than-my-own-bit-counting-function
I used the open source code from github [0] directly. After adding
-march=native, both the lookup and update performance of qp-trie improved.
The lookup performance of qp-trie is now always better than tst, but the
update performance of qp-trie is still worse than tst.

[0]: https://github.com/fanf2/qp.git
>> When the length of common prefix increases, ternary search tree becomes better
>> than qp-trie.
>>
>> * all files under Linux kernel with a comm prefix (e.g. "/home/houtao")
>>
>>                                         | qp-trie   | tst    | hash   |
>> total memory used (MB) | 5.5       | 5.2    | 12.2   |
>> total update time (us) | 51558     | 29835  | 15345  |
>> total lookup time (us) | 23121     | 19638  | 11540  |
>>
>> Because the lengthy prefix is not so common, and for string map I think the
>> memory efficiency and lookup performance is more importance than update
>> performance, so maybe qp-trie is a better choice for string map ?  Any suggestions ?
>>
> I'm biased :) But I like the idea of qp-trie as a general purpose
> ordered and dynamically sized BPF map. It makes no assumption about
> data being string-like and sharing common prefixes. It can be made to
> work just as fine with any array of bytes, making it very suitable as
> a generic lookup table map. Note that upstream implementation does
> assume zero-terminated strings and no key being a prefix of another
> key. But all that can be removed. For fixed-length keys this can never
> happen by construction, for variable-length keys (and we'll be able to
> support this finally with bpf_dynptr's help very soon), we can record
> length of the key in each leaf and use that during comparisons.
Using the trailing zero byte to make sure no key will be a prefix of another is
simple, but I will check whether there is another way to make a byte-array key
work. Alexei had suggested using the length of the key as part of the key, just
like bpf_lpm_trie_key does; maybe I can try that first.
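
Something like the following minimal sketch, where the struct name and helper
are hypothetical (modeled on struct bpf_lpm_trie_key, not part of the posted
patches):

/* Making the length part of the key guarantees no key can compare equal to a
 * strict prefix of another, without relying on a trailing NUL byte. */
struct bpf_str_key {
	__u32	len;	/* number of valid bytes in data[] */
	__u8	data[];	/* key bytes, not necessarily zero-terminated */
};

/* Compare keys byte-wise; on a tie of the common bytes the shorter key
 * orders first. */
static int str_key_cmp(const struct bpf_str_key *a,
		       const struct bpf_str_key *b)
{
	__u32 min_len = a->len < b->len ? a->len : b->len;
	int cmp = memcmp(a->data, b->data, min_len);

	if (cmp)
		return cmp;
	return (a->len > b->len) - (a->len < b->len);
}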
>
> Also note that qp-trie can be internally used by BPF_MAP_TYPE_LPM_TRIE
> very efficiently and speed it up considerable in the process (and
> especially to get rid of the global lock).
>
> So if you were to invest in a proper full-featured production
> implementation of a BPF map, I'd start with qp-trie. From available
> benchmarks it's both faster and more memory efficient than Red-Black
> trees, which could be an alternative underlying implementation of such
> ordered and "resizable" map.
Thanks for your suggestions. I will give it a try.

Regards,
Tao

>> Regards,
>> Tao
>>>>> This prefix sharing is nice when you have a lot of long common
>>>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
>>>>> structure it's going to be that beneficial. 192 bytes of common
>>>>> prefixes seems like a very unusual dataset :)
>>>> Yes. The case with common prefix I known is full file path.
>>>>> More specifically about TST implementation in your paches. One global
>>>>> per-map lock I think is a very big downside. We have LPM trie which is
>>>>> very slow in big part due to global lock. It might be possible to
>>>>> design more granular schema for TST, but this whole in-place splitting
>>>>> logic makes this harder. I think qp-trie can be locked in a granular
>>>>> fashion much more easily by having a "hand over hand" locking: lock
>>>>> parent, find child, lock child, unlock parent, move into child node.
>>>>> Something like that would be more scalable overall, especially if the
>>>>> access pattern is not focused on a narrow set of nodes.
>>>> Yes. The global lock is a problem but the splitting is not in-place. I will try
>>>> to figure out whether the lock can be more scalable after the benchmark test
>>>> between qp-trie and tst.
>>> Great, looking forward!
>>>
>>>> Regards,
>>>> Tao
>>>>
>>>> [0]: https://github.com/Tessil/hat-trie
>>>>> Anyways, I love data structures and this one is an interesting idea.
>>>>> But just my few cents of "production-readiness" for general-purpose
>>>>> data structures for BPF.
>>>>>
>>>>>   [0] https://dotat.at/prog/qp/README.html
>>>>>
>>>>>> Regards,
>>>>>> Tao
>>>>>>
>>>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
>>>>>>
>>>>>> Hou Tao (2):
>>>>>>   bpf: Introduce ternary search tree for string key
>>>>>>   selftests/bpf: add benchmark for ternary search tree map
>>>>>>
>>>>>>  include/linux/bpf_types.h                     |   1 +
>>>>>>  include/uapi/linux/bpf.h                      |   1 +
>>>>>>  kernel/bpf/Makefile                           |   1 +
>>>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
>>>>>>  tools/include/uapi/linux/bpf.h                |   1 +
>>>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
>>>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
>>>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
>>>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
>>>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
>>>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
>>>>>>  create mode 100644 kernel/bpf/bpf_tst.c
>>>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
>>>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
>>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
>>>>>>
>>>>>> --
>>>>>> 2.31.1
>>>>>>
>>>>> .
>>> .
> .


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-04-26  8:03           ` Hou Tao
@ 2022-04-27  3:57             ` Andrii Nakryiko
  2022-06-08  9:00               ` Hou Tao
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2022-04-27  3:57 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

On Tue, Apr 26, 2022 at 1:03 AM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>
> On 4/15/2022 5:25 AM, Andrii Nakryiko wrote:
> > On Wed, Apr 13, 2022 at 6:03 PM Hou Tao <houtao1@huawei.com> wrote:
> >> Hi,
> >>
> >> (I send my previous reply in HTML mode mistakenly and the mail list doesn't
> >> receive it, so send it again in the plain text mode.)
> >>
> >> On 4/13/2022 12:09 PM, Andrii Nakryiko wrote:
> >>> On Fri, Apr 8, 2022 at 8:08 PM Hou Tao <houtao1@huawei.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:
> >>>>> On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@huawei.com> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> The initial motivation for the patchset is due to the suggestion of Alexei.
> >>>>>> During the discuss of supporting of string key in hash-table, he saw the
> >>>>>> space efficiency of ternary search tree under our early test and suggest
> >>>>>> us to post it as a new bpf map [1].
> >>>>>>
> >>>>>> Ternary search tree is a special trie where nodes are arranged in a
> >>>>>> manner similar to binary search tree, but with up to three children
> >>>>>> rather than two. The three children correpond to nodes whose value is
> >>>>>> less than, equal to, and greater than the value of current node
> >>>>>> respectively.
> >>>>>>
> >>>>>> In ternary search tree map, only the valid content of string is saved.
> >>>>>> The trailing null byte and unused bytes after it are not saved. If there
> >>>>>> are common prefixes between these strings, the prefix is only saved once.
> >>>>>> Compared with other space optimized trie (e.g. HAT-trie, succinct trie),
> >>>>>> the advantage of ternary search tree is simple and being writeable.
> >> snip
> >>>>> Have you heard and tried qp-trie ([0]) by any chance? It is elegant
> >>>>> and simple data structure. By all the available benchmarks it handily
> >>>>> beats Red-Black trees in terms of memory usage and performance (though
> >>>>> it of course depends on the data set, just like "memory compression"
> >>>>> for ternary tree of yours depends on large set of common prefixes).
> >>>>> qp-trie based BPF map seems (at least on paper) like a better
> >>>>> general-purpose BPF map that is dynamically sized (avoiding current
> >>>>> HASHMAP limitations) and stores keys in sorted order (and thus allows
> >>>>> meaningful ordered iteration *and*, importantly for longest prefix
> >>>>> match tree, allows efficient prefix matches). I did a quick experiment
> >>>>> about a month ago trying to replace libbpf's internal use of hashmap
> >>>>> with qp-trie for BTF string dedup and it was slightly slower than
> >>>>> hashmap (not surprisingly, though, because libbpf over-sizes hashmap
> >>>>> to avoid hash collisions and long chains in buckets), but it was still
> >>>>> very decent even in that scenario. So I've been mulling the idea of
> >>>>> implementing BPF map based on qp-trie elegant design and ideas, but
> >>>>> can't find time to do this.
> >>>> I have heard about it when check the space efficient of HAT trie [0], because
> >>>> qp-trie needs to save the whole string key in the leaf node and its space
> >>>> efficiency can not be better than ternary search tree for strings with common
> >>>> prefix, so I did not consider about it. But I will do some benchmarks to check
> >>>> the lookup performance and space efficiency of qp-trie and tst for string with
> >>>> common prefix and strings without much common prefix.
> >>>> If qp-trie is better, I think I can take the time to post it as a bpf map if you
> >>>> are OK with that.
> >>> You can probably always craft a data set where prefix sharing is so
> >>> prevalent that space savings are very significant. But I think for a
> >>> lot of real-world data it won't be as extreme and qp-trie might be
> >>> very comparable (if not more memory-efficient) due to very compact
> >>> node layout (which was the point of qp-trie). So I'd be really curious
> >>> to see some comparisons. Would be great if you can try both!
> >> It is a bit surprised to me that qp-trie has better memory efficiency  (and
> >> better lookup performance sometimes) compared with tst when there are not so
> >> many common prefix between input strings (All tests below are conducted by
> >> implementing the data structure in user-space):
> > Thanks for a quick follow up and a benchmark!
> >
> > Low memory use is probably due to the minimal amount of pointers and
> > extra metadata used per node in qp-trie. qp-trie approach is very
> > lean, which is why I was recommending trying it out.
> >
> >> * all unique symbols in /proc/kallsyms (171428 sorted symbols,  4.2MB in total)
> >>
> >>                                         | qp-trie   | tst    | hash   |
> >> total memory used (MB) | 8.6       | 11.2   | 22.3   |
> >> total update time (us) | 94623     | 87396  | 24477  |
> >> total lookup time (us) | 50681     | 67395  | 22842  |
> >>
> >> * all strings in BTF string area (115980 unsorted strings, 2MB in total)
> >>
> >>                                         | qp-trie   | tst    | hash   |
> >> total memory used (MB) | 5.0       | 7.3    | 13.5   |
> >> total update time (us) | 67764     | 43484  | 16462  |
> >> total lookup time (us) | 33732     | 31612  | 16462  |
> >>
> >> * all strings in BTF string area (115980 sorted string, 2MB in total)
> >>
> >>                          | qp-trie   | tst    | hash   |
> >> total memory used (MB)   | 5.0       | 7.3    | 13.5   |
> >> total update time (us)   | 58745     | 57756  | 16210  |
> >> total lookup time (us)   | 26922     | 40850  | 16896  |
> >>
> >> * all files under Linux kernel (2.7MB, 74359 files generated by find utility
> >> with "./" stripped)
> >>
> >>                          | qp-trie   | tst    | hash   |
> >> total memory used (MB)   | 4.6       | 5.2    | 11.6   |
> >> total update time (us)   | 50422     | 28842  | 15255  |
> >> total lookup time (us)   | 22543     | 18252  | 11836  |
> > Seems like lookup time is more or less on par (and for kallsyms
> > noticeably faster), but update is sometimes a bit slower. I don't know
> > if you did your own code or used open-source implementation, but keep
> > in mind that performance of qp-trie very much depends on fast
> > __builtin_popcount, so make sure you are using proper -march when
> > compiling. See [0]
> >
> >   [0] https://stackoverflow.com/questions/52161596/why-is-builtin-popcount-slower-than-my-own-bit-counting-function
> I used the open source code from github [0] directly.  And after adding
> -march=native, both the lookup and update performance of qp-trie are improved.
> And the lookup performance of qp-trie is always better than tst, but the update
> performance of  qp-trie is still worse than tst.

Cool, thanks for the update! If the update is not much worse and the lookup is
better, I think it's a pretty good tradeoff!

>
> [0]: https://github.com/fanf2/qp.git
> >> When the length of common prefix increases, ternary search tree becomes better
> >> than qp-trie.
> >>
> >> * all files under Linux kernel with a common prefix (e.g. "/home/houtao")
> >>
> >>                          | qp-trie   | tst    | hash   |
> >> total memory used (MB)   | 5.5       | 5.2    | 12.2   |
> >> total update time (us)   | 51558     | 29835  | 15345  |
> >> total lookup time (us)   | 23121     | 19638  | 11540  |
> >>
> >> Because the lengthy prefix is not so common, and for string map I think the
> >> memory efficiency and lookup performance is more importance than update
> >> performance, so maybe qp-trie is a better choice for string map ?  Any suggestions ?
> >>
> > I'm biased :) But I like the idea of qp-trie as a general purpose
> > ordered and dynamically sized BPF map. It makes no assumption about
> > data being string-like and sharing common prefixes. It can be made to
> > work just as fine with any array of bytes, making it very suitable as
> > a generic lookup table map. Note that upstream implementation does
> > assume zero-terminated strings and no key being a prefix of another
> > key. But all that can be removed. For fixed-length keys this can never
> > happen by construction, for variable-length keys (and we'll be able to
> > support this finally with bpf_dynptr's help very soon), we can record
> > length of the key in each leaf and use that during comparisons.
> Using the trailing zero byte to make sure no key will be a prefix of another is
> simple, but I will check whether or not there is other way to make the bytes
> array key work out. Alexei had suggest me to use the length of key as part of
> key just like bpf_lpm_trie_key does, maybe i can try it first.

Yeah, using the key length as part of the key during comparison is what I
meant as well. I didn't mean to artificially add a trailing zero (that
idea doesn't work for arbitrary binary data).
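To make that concrete, here is a minimal sketch of such a length-aware
comparison (just an illustration, the helper name and signature are made up):

static int qpt_key_cmp(const unsigned char *a, unsigned int a_len,
		       const unsigned char *b, unsigned int b_len)
{
	unsigned int min_len = a_len < b_len ? a_len : b_len;
	int ret = memcmp(a, b, min_len);

	if (ret)
		return ret;
	/* same bytes so far: the shorter key sorts first, so a key can
	 * never compare equal to one of its own prefixes */
	return a_len < b_len ? -1 : (a_len > b_len ? 1 : 0);
}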

> >
> > Also note that qp-trie can be internally used by BPF_MAP_TYPE_LPM_TRIE
> > very efficiently and speed it up considerable in the process (and
> > especially to get rid of the global lock).
> >
> > So if you were to invest in a proper full-featured production
> > implementation of a BPF map, I'd start with qp-trie. From available
> > benchmarks it's both faster and more memory efficient than Red-Black
> > trees, which could be an alternative underlying implementation of such
> > ordered and "resizable" map.
> Thanks for your suggestions. I will give it a try.

Awesome!

>
> Regards,
> Tao
>
> >> Regards,
> >> Tao
> >>>>> This prefix sharing is nice when you have a lot of long common
> >>>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> >>>>> structure it's going to be that beneficial. 192 bytes of common
> >>>>> prefixes seems like a very unusual dataset :)
> >>>> Yes. The case with common prefix I known is full file path.
> >>>>> More specifically about TST implementation in your paches. One global
> >>>>> per-map lock I think is a very big downside. We have LPM trie which is
> >>>>> very slow in big part due to global lock. It might be possible to
> >>>>> design more granular schema for TST, but this whole in-place splitting
> >>>>> logic makes this harder. I think qp-trie can be locked in a granular
> >>>>> fashion much more easily by having a "hand over hand" locking: lock
> >>>>> parent, find child, lock child, unlock parent, move into child node.
> >>>>> Something like that would be more scalable overall, especially if the
> >>>>> access pattern is not focused on a narrow set of nodes.
> >>>> Yes. The global lock is a problem but the splitting is not in-place. I will try
> >>>> to figure out whether the lock can be more scalable after the benchmark test
> >>>> between qp-trie and tst.
> >>> Great, looking forward!
> >>>
> >>>> Regards,
> >>>> Tao
> >>>>
> >>>> [0]: https://github.com/Tessil/hat-trie
> >>>>> Anyways, I love data structures and this one is an interesting idea.
> >>>>> But just my few cents of "production-readiness" for general-purpose
> >>>>> data structures for BPF.
> >>>>>
> >>>>>   [0] https://dotat.at/prog/qp/README.html
> >>>>>
> >>>>>> Regards,
> >>>>>> Tao
> >>>>>>
> >>>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
> >>>>>>
> >>>>>> Hou Tao (2):
> >>>>>>   bpf: Introduce ternary search tree for string key
> >>>>>>   selftests/bpf: add benchmark for ternary search tree map
> >>>>>>
> >>>>>>  include/linux/bpf_types.h                     |   1 +
> >>>>>>  include/uapi/linux/bpf.h                      |   1 +
> >>>>>>  kernel/bpf/Makefile                           |   1 +
> >>>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
> >>>>>>  tools/include/uapi/linux/bpf.h                |   1 +
> >>>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
> >>>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
> >>>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
> >>>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
> >>>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
> >>>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
> >>>>>>  create mode 100644 kernel/bpf/bpf_tst.c
> >>>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
> >>>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
> >>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
> >>>>>>
> >>>>>> --
> >>>>>> 2.31.1
> >>>>>>
> >>>>> .
> >>> .
> > .
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-04-27  3:57             ` Andrii Nakryiko
@ 2022-06-08  9:00               ` Hou Tao
  2022-07-05 22:37                 ` Andrii Nakryiko
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2022-06-08  9:00 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

Hi,

On 4/27/2022 11:57 AM, Andrii Nakryiko wrote:
> On Tue, Apr 26, 2022 at 1:03 AM Hou Tao <houtao1@huawei.com> wrote:
snip
>>> I'm biased :) But I like the idea of qp-trie as a general purpose
>>> ordered and dynamically sized BPF map. It makes no assumption about
>>> data being string-like and sharing common prefixes. It can be made to
>>> work just as fine with any array of bytes, making it very suitable as
>>> a generic lookup table map. Note that upstream implementation does
>>> assume zero-terminated strings and no key being a prefix of another
>>> key. But all that can be removed. For fixed-length keys this can never
>>> happen by construction, for variable-length keys (and we'll be able to
>>> support this finally with bpf_dynptr's help very soon), we can record
>>> length of the key in each leaf and use that during comparisons.
>> Using the trailing zero byte to make sure no key will be a prefix of another is
>> simple, but I will check whether or not there is other way to make the bytes
>> array key work out. Alexei had suggest me to use the length of key as part of
>> key just like bpf_lpm_trie_key does, maybe i can try it first.
> Yeah, using key length as part of the key during comparison is what I
> meant as well. I didn't mean to aritificially add trailing zero (this
> idea doesn't work for arbitrary binary data).
>
>>> Also note that qp-trie can be internally used by BPF_MAP_TYPE_LPM_TRIE
>>> very efficiently and speed it up considerable in the process (and
>>> especially to get rid of the global lock).
>>>
>>> So if you were to invest in a proper full-featured production
>>> implementation of a BPF map, I'd start with qp-trie. From available
>>> benchmarks it's both faster and more memory efficient than Red-Black
>>> trees, which could be an alternative underlying implementation of such
>>> ordered and "resizable" map.
Recently I tried to add concurrent update and deletion support for qp-trie, and
found out that the hand-over-hand locking scheme may not work for qp-trie
update. The short explanation is that the update procedure needs to traverse the
qp-trie twice, and the position tuple obtained in the first pass may be stale
due to concurrent updates happening before the second pass. The detailed
explanation is shown below.

To reduce space usage, qp-trie saves no common prefix in its branch nodes, so an
update needs to traverse the qp-trie to find the most similar key first, compare
the new key with it to get an (index, flag) tuple for the leaf key, and then
traverse the qp-trie again to find the insert position by using that tuple. The
problem is that the position tuple may become stale due to concurrent updates in
a different branch. Consider the following case:

"aa_bind_mount" and "aa_buffers_lock" are inserted concurrently into the qp-trie
below. The most similar key for "aa_bind_mount" is leaf X, and for
"aa_buffers_lock" it is leaf Y. The calculated index tuple for both new keys is
the same: (index=3, flag=2). Assuming "aa_bind_mount" is inserted first, the
correct tuple for the subsequent insertion of "aa_buffers_lock" becomes
(index=4, flag=1) instead of (index=3, flag=2), so the result will be incorrect.

branch: index  1 flags 1 bitmap 0x00088
* leaf: a.81577 0
* branch: index  4 flags 1 bitmap 0x00180
* * branch: index  4 flags 2 bitmap 0x02080
* * * leaf: aa_af_perm 1 (leaf X, for aa_bind_mount)
* * branch: index  4 flags 2 bitmap 0x00052
* * * leaf: aa_apply_modes_to_perms 6
* * * leaf: aa_asprint_hashstr 7 (leaf Y, for aa_buffers_lock)

In order to get a correct position tuple, the intuitive solution is to add the
common prefix to branch nodes and let the update procedure find the insertion
position by comparing the new key with the prefix, so it only needs to traverse
the qp-trie once and the hand-over-hand locking scheme can work. I plan to work
on a qp-trie with common prefix and will post its memory usage and concurrent
update/insert speed in this thread once the demo is ready. Any more suggestions?
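For illustration, below is a rough sketch of the single locked pass that the
common prefix would enable (all structs and helpers here are made up for the
example, they are not the actual implementation):

struct qpt_node {
	spinlock_t lock;
	bool is_leaf;
	unsigned int prefix_len;	/* branch nodes only */
	unsigned char prefix[];		/* bytes shared by all children */
};

/* Hand-over-hand descent: the stored prefix tells us directly where the new
 * key diverges, so a single pass is enough and the parent lock can be dropped
 * as soon as the child lock is taken. qpt_key_diverges() and qpt_pick_child()
 * are made-up helpers. */
static struct qpt_node *qpt_descend_locked(struct qpt_node *root,
					   const unsigned char *key,
					   unsigned int key_len)
{
	struct qpt_node *node = root, *child;

	spin_lock(&node->lock);
	while (!node->is_leaf && !qpt_key_diverges(node, key, key_len)) {
		child = qpt_pick_child(node, key, key_len);
		spin_lock(&child->lock);
		spin_unlock(&node->lock);
		node = child;
	}
	return node;	/* returned with its lock held, ready for splicing */
}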

Regards,
Tao

>> Thanks for your suggestions. I will give it a try.
> Awesome!
>
>> Regards,
>> Tao
>>
>>>> Regards,
>>>> Tao
>>>>>>> This prefix sharing is nice when you have a lot of long common
>>>>>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
>>>>>>> structure it's going to be that beneficial. 192 bytes of common
>>>>>>> prefixes seems like a very unusual dataset :)
>>>>>> Yes. The case with common prefix I known is full file path.
>>>>>>> More specifically about TST implementation in your paches. One global
>>>>>>> per-map lock I think is a very big downside. We have LPM trie which is
>>>>>>> very slow in big part due to global lock. It might be possible to
>>>>>>> design more granular schema for TST, but this whole in-place splitting
>>>>>>> logic makes this harder. I think qp-trie can be locked in a granular
>>>>>>> fashion much more easily by having a "hand over hand" locking: lock
>>>>>>> parent, find child, lock child, unlock parent, move into child node.
>>>>>>> Something like that would be more scalable overall, especially if the
>>>>>>> access pattern is not focused on a narrow set of nodes.
>>>>>> Yes. The global lock is a problem but the splitting is not in-place. I will try
>>>>>> to figure out whether the lock can be more scalable after the benchmark test
>>>>>> between qp-trie and tst.
>>>>> Great, looking forward!
>>>>>
>>>>>> Regards,
>>>>>> Tao
>>>>>>
>>>>>> [0]: https://github.com/Tessil/hat-trie
>>>>>>> Anyways, I love data structures and this one is an interesting idea.
>>>>>>> But just my few cents of "production-readiness" for general-purpose
>>>>>>> data structures for BPF.
>>>>>>>
>>>>>>>   [0] https://dotat.at/prog/qp/README.html
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tao
>>>>>>>>
>>>>>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
>>>>>>>>
>>>>>>>> Hou Tao (2):
>>>>>>>>   bpf: Introduce ternary search tree for string key
>>>>>>>>   selftests/bpf: add benchmark for ternary search tree map
>>>>>>>>
>>>>>>>>  include/linux/bpf_types.h                     |   1 +
>>>>>>>>  include/uapi/linux/bpf.h                      |   1 +
>>>>>>>>  kernel/bpf/Makefile                           |   1 +
>>>>>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
>>>>>>>>  tools/include/uapi/linux/bpf.h                |   1 +
>>>>>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
>>>>>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
>>>>>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
>>>>>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
>>>>>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
>>>>>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
>>>>>>>>  create mode 100644 kernel/bpf/bpf_tst.c
>>>>>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
>>>>>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
>>>>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.31.1
>>>>>>>>
>>>>>>> .
>>>>> .
>>> .
> .


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-06-08  9:00               ` Hou Tao
@ 2022-07-05 22:37                 ` Andrii Nakryiko
  2022-07-09 14:18                   ` Hou Tao
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2022-07-05 22:37 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

On Wed, Jun 8, 2022 at 2:00 AM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>
> On 4/27/2022 11:57 AM, Andrii Nakryiko wrote:
> > On Tue, Apr 26, 2022 at 1:03 AM Hou Tao <houtao1@huawei.com> wrote:
> snip
> >>> I'm biased :) But I like the idea of qp-trie as a general purpose
> >>> ordered and dynamically sized BPF map. It makes no assumption about
> >>> data being string-like and sharing common prefixes. It can be made to
> >>> work just as fine with any array of bytes, making it very suitable as
> >>> a generic lookup table map. Note that upstream implementation does
> >>> assume zero-terminated strings and no key being a prefix of another
> >>> key. But all that can be removed. For fixed-length keys this can never
> >>> happen by construction, for variable-length keys (and we'll be able to
> >>> support this finally with bpf_dynptr's help very soon), we can record
> >>> length of the key in each leaf and use that during comparisons.
> >> Using the trailing zero byte to make sure no key will be a prefix of another is
> >> simple, but I will check whether or not there is other way to make the bytes
> >> array key work out. Alexei had suggest me to use the length of key as part of
> >> key just like bpf_lpm_trie_key does, maybe i can try it first.
> > Yeah, using key length as part of the key during comparison is what I
> > meant as well. I didn't mean to aritificially add trailing zero (this
> > idea doesn't work for arbitrary binary data).
> >
> >>> Also note that qp-trie can be internally used by BPF_MAP_TYPE_LPM_TRIE
> >>> very efficiently and speed it up considerable in the process (and
> >>> especially to get rid of the global lock).
> >>>
> >>> So if you were to invest in a proper full-featured production
> >>> implementation of a BPF map, I'd start with qp-trie. From available
> >>> benchmarks it's both faster and more memory efficient than Red-Black
> >>> trees, which could be an alternative underlying implementation of such
> >>> ordered and "resizable" map.
> Recently I tried to add concurrent update and deletion support for qp-trie, and
> found out that hand-over-hand lock scheme may don't work for qp-trie update. The
> short explanation is that update procedure needs traverse qp-trie twice and the
> position tuple got in the first pass may be stale due to concurrent updates
> occurred in the second pass. The detailed explanation is shown below.
>
> To reduce space usage for qp-trie, there is no common prefix saved in branch
> node, so the update of qp-trie needs to traversing qp-trie to find the most
> similar key firstly, comparing with it to get a (index, flag,) tuple for the
> leaf key, then traversing qp-trie again to find the insert position by using the
> tuple. The problem is that, the position tuple may be stale due to concurrent
> updates occurred in different branch. Considering the following case:
>
> When inserting "aa_bind_mount" and "aa_buffers_lock" concurrently into the
> following qp-trie. The most similar key for "aa_bind_mount" is leaf X, and for
> "aa_buffers_lock" it is leaf Y. The calculated index tuple for both new keys are
> the same: (index=3, flag=2). Assuming "aa_bind_mount" is inserted firstly, so
> when inserting "aa_buffers_lock", the correct index will be (index=4, flag=1)
> instead of (index=3, flag=2) and the result will be incorrect.
>
> branch: index  1 flags 1 bitmap 0x00088
> * leaf: a.81577 0
> * branch: index  4 flags 1 bitmap 0x00180
> * * branch: index  4 flags 2 bitmap 0x02080
> * * * leaf: aa_af_perm 1 (leaf X, for aa_bind_mount)
> * * branch: index  4 flags 2 bitmap 0x00052
> * * * leaf: aa_apply_modes_to_perms 6
> * * * leaf: aa_asprint_hashstr 7 (leaf Y, for aa_buffers_lock)
>
> In order to get a correct position tuple, the intuitive solution is adding
> common prefix in branch node and letting update procedure to find the insertion
> position by comparing with the prefix, so it only needs to traverse qp-trie once
> and hand-over-hand locking scheme can work. I plan to work on qp-trie with
> common prefix and will update its memory usage and concurrent update/insert
> speed in this thread once the demo is ready.  So any more suggestions ?
>

Yeah, that sucks. I'm not sure I completely understand the common
prefix solution and whether it will still be qp-trie after that, but
if you try that, it would be interesting to learn about the results
you get!

I think in the worst case we'll have to do a tree-wide lock, or perhaps
have an initial 256-way root node for the first byte, so that each of the
256 subtrees could have its own lock.

Alternatively we can do optimistic lockless lookup (in the hope of
finding a match), but if that fails - take the tree-wide lock, and perform
the search and insertion again. This will favor lookup hits, obviously,
but hopefully that's going to be a common use case where keys are mostly
matching.
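Something along these lines (a very rough sketch, all struct/helper names
below are made up):

struct qp_trie {
	spinlock_t lock;	/* hypothetical tree-wide lock */
	/* ... */
};

static int qp_trie_update_elem(struct qp_trie *trie, const void *key,
			       u32 key_len, void *value)
{
	int err;

	/* optimistic path: lockless lookup in the hope that the key already
	 * exists and only its value needs to be replaced (the replacement
	 * itself would have to be atomic, e.g. an xchg of a value pointer) */
	rcu_read_lock();
	err = qpt_lookup_update_value(trie, key, key_len, value);
	rcu_read_unlock();
	if (!err)
		return 0;

	/* slow path: key not found, take the tree-wide lock and redo the
	 * search plus the insertion */
	spin_lock(&trie->lock);
	err = qpt_insert(trie, key, key_len, value);
	spin_unlock(&trie->lock);
	return err;
}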

WDYT?


> Regards,
> Tao
>
> >> Thanks for your suggestions. I will give it a try.
> > Awesome!
> >
> >> Regards,
> >> Tao
> >>
> >>>> Regards,
> >>>> Tao
> >>>>>>> This prefix sharing is nice when you have a lot of long common
> >>>>>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> >>>>>>> structure it's going to be that beneficial. 192 bytes of common
> >>>>>>> prefixes seems like a very unusual dataset :)
> >>>>>> Yes. The case with common prefix I known is full file path.
> >>>>>>> More specifically about TST implementation in your paches. One global
> >>>>>>> per-map lock I think is a very big downside. We have LPM trie which is
> >>>>>>> very slow in big part due to global lock. It might be possible to
> >>>>>>> design more granular schema for TST, but this whole in-place splitting
> >>>>>>> logic makes this harder. I think qp-trie can be locked in a granular
> >>>>>>> fashion much more easily by having a "hand over hand" locking: lock
> >>>>>>> parent, find child, lock child, unlock parent, move into child node.
> >>>>>>> Something like that would be more scalable overall, especially if the
> >>>>>>> access pattern is not focused on a narrow set of nodes.
> >>>>>> Yes. The global lock is a problem but the splitting is not in-place. I will try
> >>>>>> to figure out whether the lock can be more scalable after the benchmark test
> >>>>>> between qp-trie and tst.
> >>>>> Great, looking forward!
> >>>>>
> >>>>>> Regards,
> >>>>>> Tao
> >>>>>>
> >>>>>> [0]: https://github.com/Tessil/hat-trie
> >>>>>>> Anyways, I love data structures and this one is an interesting idea.
> >>>>>>> But just my few cents of "production-readiness" for general-purpose
> >>>>>>> data structures for BPF.
> >>>>>>>
> >>>>>>>   [0] https://dotat.at/prog/qp/README.html
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Tao
> >>>>>>>>
> >>>>>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
> >>>>>>>>
> >>>>>>>> Hou Tao (2):
> >>>>>>>>   bpf: Introduce ternary search tree for string key
> >>>>>>>>   selftests/bpf: add benchmark for ternary search tree map
> >>>>>>>>
> >>>>>>>>  include/linux/bpf_types.h                     |   1 +
> >>>>>>>>  include/uapi/linux/bpf.h                      |   1 +
> >>>>>>>>  kernel/bpf/Makefile                           |   1 +
> >>>>>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
> >>>>>>>>  tools/include/uapi/linux/bpf.h                |   1 +
> >>>>>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
> >>>>>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
> >>>>>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
> >>>>>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
> >>>>>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
> >>>>>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
> >>>>>>>>  create mode 100644 kernel/bpf/bpf_tst.c
> >>>>>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
> >>>>>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
> >>>>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> 2.31.1
> >>>>>>>>
> >>>>>>> .
> >>>>> .
> >>> .
> > .
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key
  2022-07-05 22:37                 ` Andrii Nakryiko
@ 2022-07-09 14:18                   ` Hou Tao
  0 siblings, 0 replies; 14+ messages in thread
From: Hou Tao @ 2022-07-09 14:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Yonghong Song, Daniel Borkmann,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, KP Singh,
	David S . Miller, Jakub Kicinski, John Fastabend, Networking,
	bpf

Hi,

On 7/6/2022 6:37 AM, Andrii Nakryiko wrote:
> On Wed, Jun 8, 2022 at 2:00 AM Hou Tao <houtao1@huawei.com> wrote:
>> Hi,
>>
>> On 4/27/2022 11:57 AM, Andrii Nakryiko wrote:
>>> On Tue, Apr 26, 2022 at 1:03 AM Hou Tao <houtao1@huawei.com> wrote:
>> snip
>>>>> I'm biased :) But I like the idea of qp-trie as a general purpose
>>>>> ordered and dynamically sized BPF map. It makes no assumption about
>>>>> data being string-like and sharing common prefixes. It can be made to
>>>>> work just as fine with any array of bytes, making it very suitable as
>>>>> a generic lookup table map. Note that upstream implementation does
>>>>> assume zero-terminated strings and no key being a prefix of another
>>>>> key. But all that can be removed. For fixed-length keys this can never
>>>>> happen by construction, for variable-length keys (and we'll be able to
>>>>> support this finally with bpf_dynptr's help very soon), we can record
>>>>> length of the key in each leaf and use that during comparisons.
>>>> Using the trailing zero byte to make sure no key will be a prefix of another is
>>>> simple, but I will check whether or not there is other way to make the bytes
>>>> array key work out. Alexei had suggest me to use the length of key as part of
>>>> key just like bpf_lpm_trie_key does, maybe i can try it first.
>>> Yeah, using key length as part of the key during comparison is what I
>>> meant as well. I didn't mean to aritificially add trailing zero (this
>>> idea doesn't work for arbitrary binary data).
>>>
>>>>> Also note that qp-trie can be internally used by BPF_MAP_TYPE_LPM_TRIE
>>>>> very efficiently and speed it up considerable in the process (and
>>>>> especially to get rid of the global lock).
>>>>>
>>>>> So if you were to invest in a proper full-featured production
>>>>> implementation of a BPF map, I'd start with qp-trie. From available
>>>>> benchmarks it's both faster and more memory efficient than Red-Black
>>>>> trees, which could be an alternative underlying implementation of such
>>>>> ordered and "resizable" map.
>> Recently I tried to add concurrent update and deletion support for qp-trie, and
>> found out that hand-over-hand lock scheme may don't work for qp-trie update. The
>> short explanation is that update procedure needs traverse qp-trie twice and the
>> position tuple got in the first pass may be stale due to concurrent updates
>> occurred in the second pass. The detailed explanation is shown below.
>>
>> To reduce space usage for qp-trie, there is no common prefix saved in branch
>> node, so the update of qp-trie needs to traversing qp-trie to find the most
>> similar key firstly, comparing with it to get a (index, flag,) tuple for the
>> leaf key, then traversing qp-trie again to find the insert position by using the
>> tuple. The problem is that, the position tuple may be stale due to concurrent
>> updates occurred in different branch. Considering the following case:
>>
>> When inserting "aa_bind_mount" and "aa_buffers_lock" concurrently into the
>> following qp-trie. The most similar key for "aa_bind_mount" is leaf X, and for
>> "aa_buffers_lock" it is leaf Y. The calculated index tuple for both new keys are
>> the same: (index=3, flag=2). Assuming "aa_bind_mount" is inserted firstly, so
>> when inserting "aa_buffers_lock", the correct index will be (index=4, flag=1)
>> instead of (index=3, flag=2) and the result will be incorrect.
>>
>> branch: index  1 flags 1 bitmap 0x00088
>> * leaf: a.81577 0
>> * branch: index  4 flags 1 bitmap 0x00180
>> * * branch: index  4 flags 2 bitmap 0x02080
>> * * * leaf: aa_af_perm 1 (leaf X, for aa_bind_mount)
>> * * branch: index  4 flags 2 bitmap 0x00052
>> * * * leaf: aa_apply_modes_to_perms 6
>> * * * leaf: aa_asprint_hashstr 7 (leaf Y, for aa_buffers_lock)
>>
>> In order to get a correct position tuple, the intuitive solution is adding
>> common prefix in branch node and letting update procedure to find the insertion
>> position by comparing with the prefix, so it only needs to traverse qp-trie once
>> and hand-over-hand locking scheme can work. I plan to work on qp-trie with
>> common prefix and will update its memory usage and concurrent update/insert
>> speed in this thread once the demo is ready.  So any more suggestions ?
>>
> Yeah, that sucks. I'm not sure I completely understand the common prefix
> solution and whether it will still be qp-trie after that, but if you try that,
> it would be interesting to learn about the results you get!
The code for bpf-qp-trie is mostly ready, and I will show some benchmark numbers
for the prefixed qp-trie and the 256-lock qp-trie below.

For the common prefix solution, each branch node saves the common prefix of its
child nodes, as the following diagram shows:

     branch node A (prefix: a)
            |
            |
   *--------*--------*
   |                 |
leaf X (ab)          |
                 branch node (prefix: d)
              *------*--------*
              |               |
         leaf Y (adc)     leaf Z (admin)
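
In terms of node layout, a branch node would then roughly carry the prefix
alongside the existing index/flags/bitmap fields (a sketch with illustrative
field names, not the actual code):

struct qpt_branch {
	u32 bitmap;		/* which child slots are present */
	u8  index;		/* byte index at which the children differ */
	u8  flags;
	u16 prefix_len;		/* newly added */
	struct qpt_node *twigs;	/* children */
	u8  prefix[];		/* newly added: bytes shared by all children */
};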

The newly-added prefix is only used by the update procedure, and it is modified
by both the update and delete procedures due to prefix splitting and merging.
After adding the common prefix, the hand-over-hand locking scheme works, but the
memory usage is worse compared with the 256-lock qp-trie and the hash table. For
the strings in BTF, the memory usage of the prefixed qp-trie is even bigger than
that of the hash table, as shown below.

* Notation (all tests below are conducted by creating and manipulating bpf maps
in the kernel through libbpf)

tree-lock: prefixed qp-trie using the hand-over-hand locking scheme
256-lock: qp-trie using 256 global locks, one for each of the 256 sub-trees
htab: hashtab

data set                | kallsyms | btf      | knl      | top_1m   |
tree-lock memory  (MB)  | 32.6     | 20.5     | 15.4     | 158.1    |
256-lock memory   (MB)  | 19.1     | 12.0     | 9.9      | 90.6     |
htab memory       (MB)  | 36.7     | 17.1     | 16.2     | 206.8    |

The lookup performance of the prefixed qp-trie is nearly the same as that of the
256-lock qp-trie, but is still worse than the 256-lock qp-trie when the number
of threads is 1 or 2. From the tables below, it seems the lookup procedure
doesn't scale very well, because there is not much difference between the
results for 6 and 8 threads. According to a quick perf report, the overhead
mainly comes from copy_to_user() and fget().

* all unique symbols in /proc/kallsyms
        thread number                   | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock lookup time    (ms)   | 144.8  | 92.5   | 71.7   | 59.9   | 52.9   | 49.1   | 45.8   | 44.3   |
        256-lock lookup time     (ms)   | 136.7  | 91.1   | 69.9   | 58.1   | 51.5   | 47.6   | 44.9   | 44.3   |
        htab lookup time         (ms)   | 118.8  | 81.7   | 59.9   | 47.2   | 44.0   | 41.7   | 40.3   | 40.9   |

* all strings in BTF string area
        thread number                   | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock lookup time    (ms)   | 109.7  | 71.7   | 54.4   | 44.2   | 38.6   | 34.7   | 33.6   | 32.1   |
        256-lock lookup time     (ms)   | 101.3  | 69.0   | 51.5   | 41.2   | 36.0   | 32.8   | 32.3   | 30.4   |
        htab lookup time         (ms)   | 83.2   | 59.3   | 42.7   | 30.7   | 28.7   | 27.8   | 27.4   | 31.7   |

* all files under Linux kernel without prefix generated by find
        thread number                   | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock lookup time    (ms)   | 73.8   | 47.7   | 36.7   | 29.9   | 26.5   | 22.9   | 20.9   | 20.4   |
        256-lock lookup time     (ms)   | 67.9   | 39.0   | 36.2   | 28.1   | 23.1   | 21.4   | 19.6   | 20.1   |
        htab lookup time         (ms)   | 53.9   | 39.0   | 33.2   | 25.8   | 21.1   | 21.9   | 20.9   | 20.5   |

* domain name for the top 1-million sites
        thread number                   | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock lookup time    (ms)   | 1032.6 | 660.8  | 473.8  | 383.0  | 333.4  | 293.2  | 259.0  | 241.2  |
        256-lock lookup time     (ms)   | 972.1  | 607.3  | 435.6  | 348.7  | 298.0  | 264.8  | 240.5  | 222.5  |
        htab lookup time         (ms)   | 590.6  | 435.2  | 341.2  | 268.4  | 229.2  | 206.0  | 197.9  | 194.8  |

For the update procedure, the performance of hand-over-hand locking is much
better than the 256-lock scheme when the number of threads is greater than 4.
But it doesn't win much against the 256-lock scheme for the delete procedure.
The main reason may be that 3 locks are now taken for the delete procedure
(versus 2 locks for the update procedure), because qp_trie_node itself instead
of a pointer to it is now saved in the qp-trie twigs; that layout improves the
lookup performance by ~30% by removing one pointer dereference.

* all unique symbols in /proc/kallsyms
        thread number               | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock update time (ms)  | 305.4  | 243.2  | 199.9  | 189.6  | 188.1  | 191.5  | 185.5  | 191.2  |
        256-lock update time  (ms)  | 251.6  | 209.7  | 202.6  | 202.2  | 216.5  | 250.0  | 209.8  | 236.2  |
        htab update time      (ms)  | 173.6  | 124.1  | 95.3   | 96.9   | 84.8   | 78.2   | 75.8   | 74.4   |
        tree-lock delete time (ms)  | 267.4  | 231.1  | 199.7  | 194.8  | 189.4  | 190.0  | 190.6  | 191.0  |
        256-lock delete time  (ms)  | 209.2  | 161.8  | 166.0  | 164.2  | 162.1  | 161.8  | 160.7  | 163.3  |
        htab delete time      (ms)  | 114.7  | 84.3   | 65.9   | 56.9   | 53.7   | 52.4   | 52.5   | 53.6   |

* all strings in BTF string area
        thread number               | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock update time (ms)  | 215.9  | 184.1  | 146.7  | 127.6  | 93.3   | 86.9   | 81.3   | 77.9   |
        256-lock update time  (ms)  | 173.7  | 130.9  | 116.4  | 107.9  | 101.3  | 98.1   | 95.7   | 91.6   |
        htab update time      (ms)  | 113.1  | 84.5   | 76.8   | 63.3   | 55.1   | 45.5   | 40.7   | 40.5   |
        tree-lock delete time (ms)  | 187.5  | 144.2  | 116.8  | 106.6  | 111.3  | 90.1   | 83.5   | 82.0   |
        256-lock delete time  (ms)  | 148.4  | 117.2  | 120.5  | 104.8  | 94.3   | 88.7   | 95.3   | 90.8   |
        htab delete time      (ms)  | 84.1   | 73.3   | 47.8   | 41.2   | 40.0   | 37.3   | 37.8   | 38.0   |

* all files under Linux kernel without prefix generated by find
        thread number               | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock update time (ms)  | 143.1  | 112.1  | 87.9   | 94.8   | 91.4   | 89.7   | 87.9   | 86.9   |
        256-lock update time  (ms)  | 121.8  | 113.2  | 117.9  | 117.6  | 118.3  | 120.7  | 119.8  | 120.5  |
        htab update time      (ms)  | 85.4   | 66.7   | 52.9   | 44.6   | 39.2   | 37.8   | 36.1   | 34.8   |
        tree-lock delete time (ms)  | 126.1  | 110.1  | 86.4   | 77.7   | 74.4   | 81.8   | 88.3   | 91.7   |
        256-lock delete time  (ms)  | 100.0  | 88.7   | 95.9   | 93.5   | 77.1   | 88.1   | 76.7   | 85.5   |
        htab delete time      (ms)  | 59.0   | 50.0   | 31.7   | 32.5   | 31.4   | 31.1   | 27.1   | 29.6   |

* domain name for the top sites
        thread number               | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock update time (ms)  | 1890.2 | 1289.4 | 949.8  | 790.0  | 677.6  | 614.2  | 567.6  | 536.2  |
        256-lock update time  (ms)  | 1521.4 | 1173.1 | 1136.2 | 1075.2 | 1059.3 | 1043.6 | 1022.2 | 988.1  |
        htab update time      (ms)  | 896.5  | 618.5  | 474.1  | 396.2  | 357.0  | 319.3  | 297.1  | 288.2  |
        tree-lock delete time (ms)  | 1856.3 | 1290.1 | 1007.1 | 881.5  | 804.7  | 755.7  | 725.9  | 700.2  |
        256-lock delete time  (ms)  | 1429.3 | 1066.3 | 1018.1 | 955.2  | 923.4  | 889.3  | 870.3  | 836.6  |
        htab delete time      (ms)  | 648.8  | 480.8  | 377.6  | 313.7  | 284.1  | 280.4  | 273.9  | 269.9  |

But if the input data set is inserted or deleted in sorted order, the
hand-over-hand locking scheme doesn't work well, as shown below:

* all sorted strings in BTF string area

        thread number               | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
        tree-lock update time (ms)  | 201.2  | 185.1  | 156.6  | 131.6  | 151.3  | 134.2  | 144.7  | 130.6  |
        256-lock update time  (ms)  | 162.7  | 132.6  | 135.6  | 136.8  | 138.9  | 137.8  | 149.4  | 161.0  |
        htab update time      (ms)  | 114.5  | 89.3   | 75.1   | 55.1   | 48.2   | 52.5   | 53.7   | 52.5   |
        tree-lock delete time (ms)  | 176.7  | 172.1  | 163.7  | 154.3  | 134.1  | 133.6  | 133.3  | 133.0  |
        256-lock delete time  (ms)  | 139.4  | 128.2  | 138.8  | 131.1  | 112.2  | 107.1  | 128.8  | 130.9  |
        htab delete time      (ms)  | 81.1   | 67.1   | 44.7   | 38.3   | 37.5   | 41.1   | 44.3   | 43.3   |
> I think in the worst case we'll have to do tree-wide lock, perhaps
> maybe having an initial 256-way root node for first byte, and each of
> 256 subtrees could have their own lock.
Yes, I think a 256-way root node with 256 locks is not a bad choice for us, as
we can see from the test results above. And the memory usage of the prefixed
qp-trie is too high, so I prefer the 256 subtree locks for now.
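Roughly what I have in mind for the 256-lock variant (a simplified sketch,
names are illustrative and qpt_subtree_update() is a made-up helper):

struct qp_trie_subtree {
	spinlock_t lock;
	struct qpt_node *root;
};

struct qp_trie {
	/* fan out on the first byte of the key, one lock per subtree */
	struct qp_trie_subtree subtree[256];
};

static int qp_trie_update(struct qp_trie *trie, const unsigned char *key,
			  u32 key_len, void *value)
{
	struct qp_trie_subtree *st = &trie->subtree[key_len ? key[0] : 0];
	int err;

	spin_lock(&st->lock);
	err = qpt_subtree_update(&st->root, key, key_len, value);
	spin_unlock(&st->lock);
	return err;
}
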
> Alternatively we can do optimistic lockless lookup (in the hope to
> find a match), but if that fails - take tree-wide lock, and perform
> the search and insertion again. This will favor lookup hits,
> obviously, but hopefully that's going to be a common use case where
> keys are mostly matching.
Lockless programming is hard, but I think we can try these optimizations after
supporting the basic operations for qp-trie.

Regards,
Tao
>
> WDYT?
>
>
>> Regards,
>> Tao
>>
>>>> Thanks for your suggestions. I will give it a try.
>>> Awesome!
>>>
>>>> Regards,
>>>> Tao
>>>>
>>>>>> Regards,
>>>>>> Tao
>>>>>>>>> This prefix sharing is nice when you have a lot of long common
>>>>>>>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
>>>>>>>>> structure it's going to be that beneficial. 192 bytes of common
>>>>>>>>> prefixes seems like a very unusual dataset :)
>>>>>>>> Yes. The case with common prefix I known is full file path.
>>>>>>>>> More specifically about TST implementation in your paches. One global
>>>>>>>>> per-map lock I think is a very big downside. We have LPM trie which is
>>>>>>>>> very slow in big part due to global lock. It might be possible to
>>>>>>>>> design more granular schema for TST, but this whole in-place splitting
>>>>>>>>> logic makes this harder. I think qp-trie can be locked in a granular
>>>>>>>>> fashion much more easily by having a "hand over hand" locking: lock
>>>>>>>>> parent, find child, lock child, unlock parent, move into child node.
>>>>>>>>> Something like that would be more scalable overall, especially if the
>>>>>>>>> access pattern is not focused on a narrow set of nodes.
>>>>>>>> Yes. The global lock is a problem but the splitting is not in-place. I will try
>>>>>>>> to figure out whether the lock can be more scalable after the benchmark test
>>>>>>>> between qp-trie and tst.
>>>>>>> Great, looking forward!
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tao
>>>>>>>>
>>>>>>>> [0]: https://github.com/Tessil/hat-trie
>>>>>>>>> Anyways, I love data structures and this one is an interesting idea.
>>>>>>>>> But just my few cents of "production-readiness" for general-purpose
>>>>>>>>> data structures for BPF.
>>>>>>>>>
>>>>>>>>>   [0] https://dotat.at/prog/qp/README.html
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Tao
>>>>>>>>>>
>>>>>>>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@mail.gmail.com/
>>>>>>>>>>
>>>>>>>>>> Hou Tao (2):
>>>>>>>>>>   bpf: Introduce ternary search tree for string key
>>>>>>>>>>   selftests/bpf: add benchmark for ternary search tree map
>>>>>>>>>>
>>>>>>>>>>  include/linux/bpf_types.h                     |   1 +
>>>>>>>>>>  include/uapi/linux/bpf.h                      |   1 +
>>>>>>>>>>  kernel/bpf/Makefile                           |   1 +
>>>>>>>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
>>>>>>>>>>  tools/include/uapi/linux/bpf.h                |   1 +
>>>>>>>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
>>>>>>>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
>>>>>>>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
>>>>>>>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
>>>>>>>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
>>>>>>>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
>>>>>>>>>>  create mode 100644 kernel/bpf/bpf_tst.c
>>>>>>>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
>>>>>>>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
>>>>>>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> 2.31.1
>>>>>>>>>>
>>>>>>>>> .
>>>>>>> .
>>>>> .
>>> .
> .


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-07-09 14:18 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-31 12:28 [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Hou Tao
2022-03-31 12:28 ` [RFC PATCH bpf-next 1/2] " Hou Tao
2022-04-08 23:00   ` Alexei Starovoitov
2022-03-31 12:28 ` [RFC PATCH bpf-next] selftests/bpf: add benchmark for ternary search tree map Hou Tao
2022-04-06 17:38 ` [RFC PATCH bpf-next 0/2] bpf: Introduce ternary search tree for string key Andrii Nakryiko
2022-04-09  3:07   ` Hou Tao
2022-04-13  4:09     ` Andrii Nakryiko
2022-04-14  1:03       ` Hou Tao
2022-04-14 21:25         ` Andrii Nakryiko
2022-04-26  8:03           ` Hou Tao
2022-04-27  3:57             ` Andrii Nakryiko
2022-06-08  9:00               ` Hou Tao
2022-07-05 22:37                 ` Andrii Nakryiko
2022-07-09 14:18                   ` Hou Tao
