* [PATCH v3 bpf-next 0/5] Implement bloom filter map
@ 2021-09-21 21:02 Joanne Koong
  2021-09-21 21:02 ` [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation Joanne Koong
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-21 21:02 UTC (permalink / raw)
  To: bpf; +Cc: Kernel-team, Joanne Koong

This patchset adds a new kind of bpf map: the bloom filter map.
Bloom filters are a space-efficient probabilistic data structure
used to quickly test whether an element exists in a set.
For a brief overview of how bloom filters work, see
https://en.wikipedia.org/wiki/Bloom_filter.

One example use-case is an application leveraging a bloom filter
map to determine whether a computationally expensive hashmap
lookup can be avoided: if the element is not found in the bloom
filter map, the hashmap lookup can be skipped (see the sketch below).
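
As a minimal sketch of that pattern from the bpf program side (the map
names here are illustrative, the map definitions are omitted, and error
handling is elided):

	static __u64 *lookup_with_bloom_filter(__u64 *val)
	{
		/* A definite "no" from the bloom filter means the element
		 * cannot be in the hashmap, so the expensive lookup can
		 * be skipped entirely.
		 */
		if (bpf_map_peek_elem(&bloom_filter, val) == -ENOENT)
			return NULL;

		/* "Maybe present" (which may be a false positive): fall
		 * through to the real hashmap lookup.
		 */
		return bpf_map_lookup_elem(&hashmap, val);
	}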

This patchset includes benchmarks measuring the performance of
the bloom filter for different entry sizes and different numbers of
hash functions, as well as comparisons of hashmap lookups
with vs. without the bloom filter.

v2 -> v3:
* Add libbpf changes for supporting nr_hash_funcs, instead of passing the
number of hash functions through map_flags.
* Separate the hashing logic in kernel/bpf/bloom_filter.c into a helper
function

v1 -> v2:
* Remove libbpf changes, and pass the number of hash functions through
map_flags instead.
* Default to using 5 hash functions if no number of hash functions
is specified.
* Use set_bit instead of spinlocks in the bloom filter bitmap. This
improved the speed significantly. For example, using 5 hash functions
with 100k entries, there was roughly a 35% speed increase.
* Use jhash2 (instead of jhash) for u32-aligned value sizes. This
increased the speed by roughly 5 to 15%. When using jhash2 on
non-u32-aligned value sizes (truncating any remainder bits), there was
not a noticeable difference.
* Add test for using the bloom filter as an inner map.
* Reran the benchmarks, updated the commit messages to correspond to
the new results.


Joanne Koong (5):
  bpf: Add bloom filter map implementation
  libbpf: Allow the number of hashes in bloom filter maps to be
    configurable
  selftests/bpf: Add bloom filter map test cases
  bpf/benchs: Add benchmark test for bloom filter maps
  bpf/benchs: Add benchmarks for comparing hashmap lookups with vs.
    without bloom filter

 include/linux/bpf_types.h                     |   1 +
 include/uapi/linux/bpf.h                      |   6 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bloom_filter.c                     | 185 +++++++++
 kernel/bpf/syscall.c                          |  14 +-
 kernel/bpf/verifier.c                         |  19 +-
 tools/include/uapi/linux/bpf.h                |   6 +-
 tools/lib/bpf/bpf.c                           |   2 +
 tools/lib/bpf/bpf.h                           |   1 +
 tools/lib/bpf/libbpf.c                        |  32 +-
 tools/lib/bpf/libbpf.h                        |   2 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/lib/bpf/libbpf_internal.h               |   4 +-
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  57 ++-
 tools/testing/selftests/bpf/bench.h           |   3 +
 .../bpf/benchs/bench_bloom_filter_map.c       | 384 ++++++++++++++++++
 .../bpf/benchs/run_bench_bloom_filter_map.sh  |  43 ++
 .../bpf/benchs/run_bench_ringbufs.sh          |  30 +-
 .../selftests/bpf/benchs/run_common.sh        |  60 +++
 .../bpf/prog_tests/bloom_filter_map.c         | 177 ++++++++
 .../selftests/bpf/progs/bloom_filter_map.c    | 157 +++++++
 22 files changed, 1142 insertions(+), 48 deletions(-)
 create mode 100644 kernel/bpf/bloom_filter.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh
 create mode 100644 tools/testing/selftests/bpf/benchs/run_common.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bloom_filter_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/bloom_filter_map.c

-- 
2.30.2



* [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-21 21:02 [PATCH v3 bpf-next 0/5] Implement bloom filter map Joanne Koong
@ 2021-09-21 21:02 ` Joanne Koong
  2021-09-21 23:44   ` Andrii Nakryiko
  2021-09-21 21:02 ` [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable Joanne Koong
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Joanne Koong @ 2021-09-21 21:02 UTC (permalink / raw)
  To: bpf; +Cc: Kernel-team, Joanne Koong

Bloom filters are a space-efficient probabilistic data structure
used to quickly test whether an element exists in a set.
In a bloom filter, false positives are possible whereas false
negatives should never occur.

This patch adds a bloom filter map for bpf programs.
The bloom filter map supports peek (determining whether an element
is present in the map) and push (adding an element to the map)
operations. These operations are exposed to userspace applications
through the existing syscalls in the following way:

BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_UPDATE_ELEM -> push

The bloom filter map does not have keys, only values. In light of
this, the bloom filter map's API matches that of queue/stack maps:
user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
APIs to query or add an element to the bloom filter map. When the
bloom filter map is created, it must be created with a key_size of 0.

For updates, the user will pass in the element to add to the map
as the value, with a NULL key. For lookups, the user will pass in the
element to query in the map as the value. In the verifier layer, this
requires us to modify the argument type of a bloom filter's
BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; in the
syscall layer, we also need to copy over the user value so that in
bpf_map_peek_elem, we know which specific value to query.
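
A minimal userspace sketch of this flow, using the libbpf syscall
wrappers (map creation and error handling elided; the element value is
illustrative):

	__u64 elem = 0xcafe;
	int err;

	/* push: BPF_MAP_UPDATE_ELEM with a NULL key */
	err = bpf_map_update_elem(map_fd, NULL, &elem, BPF_ANY);

	/* peek: BPF_MAP_LOOKUP_ELEM with a NULL key; the element to
	 * query is passed in through the value. 0 means the element is
	 * possibly present; an ENOENT error means it is definitely not.
	 */
	err = bpf_map_lookup_elem(map_fd, NULL, &elem);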

A few things to please take note of:
 * If there are any concurrent lookups + updates, the user is
responsible for synchronizing them to ensure no false negative lookups
occur.
 * The number of hashes to use for the bloom filter is configurable from
userspace. If no number is specified, the default used will be 5 hash
functions. The benchmarks later in this patchset can help compare the
performance of using different numbers of hashes on different entry
sizes. In general, using more hashes decreases the speed of a lookup, but
lowers the false positive rate of an element being detected in the
bloom filter (a back-of-the-envelope estimate follows this list).
 * Deleting an element in the bloom filter map is not supported.
 * The bloom filter map may be used as an inner map.
 * The "max_entries" size that is specified at map creation time is used to
approximate a reasonable bitmap size for the bloom filter, and is not
otherwise strictly enforced. If the user wishes to insert more entries
into the bloom filter than "max_entries", they may do so, but should be
aware that this will lead to a higher false positive rate.
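
As a back-of-the-envelope check on the false positive rate: for a bloom
filter with m bits, k hash functions, and n inserted elements, the
expected false positive rate is approximately

	p ~= (1 - e^(-k * n / m))^k

and with the bit array sized at m = n * k / ln(2) as done here,
k * n / m = ln(2), so p ~= (1/2)^k, e.g. ~3% for the default k = 5.
The rates measured in the benchmarks later in this patchset come in
below this bound, since the bit array is additionally rounded up to a
power of two.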

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   1 +
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/bloom_filter.c      | 185 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  14 ++-
 kernel/bpf/verifier.c          |  19 +++-
 tools/include/uapi/linux/bpf.h |   1 +
 7 files changed, 217 insertions(+), 6 deletions(-)
 create mode 100644 kernel/bpf/bloom_filter.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 9c81724e4b98..c4424ac2fa02 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6fc59d61937a..fec9fcfe0629 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -906,6 +906,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_BLOOM_FILTER,
 };
 
 /* Note that tracing related programs such as
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 7f33098ca63f..cf6ca339f3cd 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -7,7 +7,7 @@ endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
new file mode 100644
index 000000000000..72254edb24f1
--- /dev/null
+++ b/kernel/bpf/bloom_filter.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <linux/bitmap.h>
+#include <linux/bpf.h>
+#include <linux/err.h>
+#include <linux/jhash.h>
+#include <linux/random.h>
+
+#define BLOOM_FILTER_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK)
+
+struct bpf_bloom_filter {
+	struct bpf_map map;
+	u32 bit_array_mask;
+	u32 hash_seed;
+	/* If the size of the values in the bloom filter is u32 aligned,
+	 * then it is more performant to use jhash2 as the underlying hash
+	 * function, else we use jhash. This tracks the number of u32s
+	 * in an u32-aligned value size. If the value size is not u32 aligned,
+	 * this will be 0.
+	 */
+	u32 aligned_u32_count;
+	u32 nr_hash_funcs;
+	unsigned long bit_array[];
+};
+
+static inline u32 hash(struct bpf_bloom_filter *bloom_filter, void *value,
+		       u64 value_size, u32 index)
+{
+	if (bloom_filter->aligned_u32_count)
+		return jhash2(value, bloom_filter->aligned_u32_count,
+			      bloom_filter->hash_seed + index) &
+			bloom_filter->bit_array_mask;
+
+	return jhash(value, value_size, bloom_filter->hash_seed + index) &
+		bloom_filter->bit_array_mask;
+}
+
+static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
+{
+	struct bpf_bloom_filter *bloom_filter =
+		container_of(map, struct bpf_bloom_filter, map);
+	u32 i;
+
+	for (i = 0; i < bloom_filter->nr_hash_funcs; i++) {
+		if (!test_bit(hash(bloom_filter, value, map->value_size, i),
+			      bloom_filter->bit_array))
+			return -ENOENT;
+	}
+
+	return 0;
+}
+
+static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr)
+{
+	u32 nr_bits, bit_array_bytes, bit_array_mask;
+	int numa_node = bpf_map_attr_numa_node(attr);
+	struct bpf_bloom_filter *bloom_filter;
+	u32 nr_hash_funcs;
+
+	if (!bpf_capable())
+		return ERR_PTR(-EPERM);
+
+	if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 ||
+	    attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return ERR_PTR(-EINVAL);
+
+	/* Default to 5 if no number of hash functions was specified */
+	nr_hash_funcs = attr->nr_hash_funcs == 0 ? 5 : attr->nr_hash_funcs;
+
+	/* For the bloom filter, the optimal bit array size that minimizes the
+	 * false positive probability is n * k / ln(2) where n is the number of
+	 * expected entries in the bloom filter and k is the number of hash
+	 * functions. We use 7 / 5 to approximate 1 / ln(2).
+	 *
+	 * We round this up to the nearest power of two to enable more efficient
+	 * hashing using bitmasks. The bitmask will be the bit array size - 1.
+	 *
+	 * If this overflows a u32, the bit array size will have 2^32 (4
+	 * GB) bits.
+	 */
+	if (check_mul_overflow(attr->max_entries, nr_hash_funcs, &nr_bits) ||
+	    check_mul_overflow(nr_bits / 5, (u32)7, &nr_bits) ||
+	    nr_bits > (1UL << 31)) {
+		/* The bit array size is 2^32 bits but to avoid overflowing the
+		 * u32, we use BITS_TO_BYTES(U32_MAX), which will round up to the
+		 * equivalent number of bytes
+		 */
+		bit_array_bytes = BITS_TO_BYTES(U32_MAX);
+		bit_array_mask = U32_MAX;
+	} else {
+		if (nr_bits <= BITS_PER_LONG)
+			nr_bits = BITS_PER_LONG;
+		else
+			nr_bits = roundup_pow_of_two(nr_bits);
+		bit_array_bytes = BITS_TO_BYTES(nr_bits);
+		bit_array_mask = nr_bits - 1;
+	}
+
+	bit_array_bytes = roundup(bit_array_bytes, sizeof(unsigned long));
+	bloom_filter = bpf_map_area_alloc(sizeof(*bloom_filter) + bit_array_bytes,
+					  numa_node);
+
+	if (!bloom_filter)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&bloom_filter->map, attr);
+
+	bloom_filter->nr_hash_funcs = nr_hash_funcs;
+	bloom_filter->bit_array_mask = bit_array_mask;
+	if ((attr->value_size & (sizeof(u32) - 1)) == 0)
+		bloom_filter->aligned_u32_count = attr->value_size / sizeof(u32);
+
+	if (!(attr->map_flags & BPF_F_ZERO_SEED))
+		bloom_filter->hash_seed = get_random_int();
+
+	return &bloom_filter->map;
+}
+
+static void bloom_filter_map_free(struct bpf_map *map)
+{
+	struct bpf_bloom_filter *bloom_filter =
+		container_of(map, struct bpf_bloom_filter, map);
+
+	bpf_map_area_free(bloom_filter);
+}
+
+static int bloom_filter_map_push_elem(struct bpf_map *map, void *value,
+				      u64 flags)
+{
+	struct bpf_bloom_filter *bloom_filter =
+		container_of(map, struct bpf_bloom_filter, map);
+	u32 i;
+
+	if (flags != BPF_ANY)
+		return -EINVAL;
+
+	for (i = 0; i < bloom_filter->nr_hash_funcs; i++)
+		set_bit(hash(bloom_filter, value, map->value_size, i),
+			bloom_filter->bit_array);
+
+	return 0;
+}
+
+static void *bloom_filter_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	/* The eBPF program should use map_peek_elem instead */
+	return ERR_PTR(-EINVAL);
+}
+
+static int bloom_filter_map_update_elem(struct bpf_map *map, void *key,
+					void *value, u64 flags)
+{
+	/* The eBPF program should use map_push_elem instead */
+	return -EINVAL;
+}
+
+static int bloom_filter_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EOPNOTSUPP;
+}
+
+static int bloom_filter_map_get_next_key(struct bpf_map *map, void *key,
+					 void *next_key)
+{
+	return -EOPNOTSUPP;
+}
+
+static int bloom_filter_map_btf_id;
+const struct bpf_map_ops bloom_filter_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc = bloom_filter_map_alloc,
+	.map_free = bloom_filter_map_free,
+	.map_push_elem = bloom_filter_map_push_elem,
+	.map_peek_elem = bloom_filter_map_peek_elem,
+	.map_lookup_elem = bloom_filter_map_lookup_elem,
+	.map_update_elem = bloom_filter_map_update_elem,
+	.map_delete_elem = bloom_filter_map_delete_elem,
+	.map_get_next_key = bloom_filter_map_get_next_key,
+	.map_check_btf = map_check_no_btf,
+	.map_btf_name = "bpf_bloom_filter",
+	.map_btf_id = &bloom_filter_map_btf_id,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e50c0bfdb7d..9865b5b1e667 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -199,7 +199,8 @@ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key,
 		err = bpf_fd_reuseport_array_update_elem(map, key, value,
 							 flags);
 	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
+		   map->map_type == BPF_MAP_TYPE_STACK ||
+		   map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) {
 		err = map->ops->map_push_elem(map, value, flags);
 	} else {
 		rcu_read_lock();
@@ -238,7 +239,8 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
 		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
 	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
+		   map->map_type == BPF_MAP_TYPE_STACK ||
+		   map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) {
 		err = map->ops->map_peek_elem(map, value);
 	} else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		/* struct_ops map requires directly updating "value" */
@@ -1080,6 +1082,14 @@ static int map_lookup_elem(union bpf_attr *attr)
 	if (!value)
 		goto free_key;
 
+	if (map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) {
+		if (copy_from_user(value, uvalue, value_size))
+			err = -EFAULT;
+		else
+			err = bpf_map_copy_value(map, key, value, attr->flags);
+		goto free_value;
+	}
+
 	err = bpf_map_copy_value(map, key, value, attr->flags);
 	if (err)
 		goto free_value;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e76b55917905..fd2b7f9dccae 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4813,7 +4813,10 @@ static int resolve_map_arg_type(struct bpf_verifier_env *env,
 			return -EINVAL;
 		}
 		break;
-
+	case BPF_MAP_TYPE_BLOOM_FILTER:
+		if (meta->func_id == BPF_FUNC_map_peek_elem)
+			*arg_type = ARG_PTR_TO_MAP_VALUE;
+		break;
 	default:
 		break;
 	}
@@ -5388,6 +5391,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_task_storage_delete)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_BLOOM_FILTER:
+		if (func_id != BPF_FUNC_map_push_elem &&
+		    func_id != BPF_FUNC_map_peek_elem)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -5455,13 +5463,18 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_SOCKHASH)
 			goto error;
 		break;
-	case BPF_FUNC_map_peek_elem:
 	case BPF_FUNC_map_pop_elem:
-	case BPF_FUNC_map_push_elem:
 		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
 		    map->map_type != BPF_MAP_TYPE_STACK)
 			goto error;
 		break;
+	case BPF_FUNC_map_push_elem:
+	case BPF_FUNC_map_peek_elem:
+		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
+		    map->map_type != BPF_MAP_TYPE_STACK &&
+		    map->map_type != BPF_MAP_TYPE_BLOOM_FILTER)
+			goto error;
+		break;
 	case BPF_FUNC_sk_storage_get:
 	case BPF_FUNC_sk_storage_delete:
 		if (map->map_type != BPF_MAP_TYPE_SK_STORAGE)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6fc59d61937a..fec9fcfe0629 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -906,6 +906,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_BLOOM_FILTER,
 };
 
 /* Note that tracing related programs such as
-- 
2.30.2



* [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable
  2021-09-21 21:02 [PATCH v3 bpf-next 0/5] Implement bloom filter map Joanne Koong
  2021-09-21 21:02 ` [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation Joanne Koong
@ 2021-09-21 21:02 ` Joanne Koong
  2021-09-21 22:24   ` Joanne Koong
  2021-09-21 23:57   ` Andrii Nakryiko
  2021-09-21 21:02 ` [PATCH v3 bpf-next 3/5] selftests/bpf: Add bloom filter map test cases Joanne Koong
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-21 21:02 UTC (permalink / raw)
  To: bpf; +Cc: Kernel-team, Joanne Koong

This patch adds the libbpf infrastructure that will allow the user to
specify a configurable number of hash functions to use for the bloom
filter map.

Please note that this patch does not enforce that a pinned bloom filter
map may only be reused if the number of hash functions is the same. If
the numbers differ, the number of hash functions used will be the one
that was set for the pinned map.
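
With this change, the number of hash functions can be specified either
in the BTF map definition or programmatically before the map is
created. A minimal sketch (the map and skeleton names are illustrative):

	/* in the bpf program */
	struct {
		__uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
		__uint(key_size, 0);
		__uint(value_size, sizeof(__u64));
		__uint(max_entries, 1000);
		__uint(nr_hash_funcs, 3);
	} bloom_map SEC(".maps");

	/* or from userspace, before the map is created */
	err = bpf_map__set_nr_hash_funcs(skel->maps.bloom_map, 3);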

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 include/uapi/linux/bpf.h        |  5 ++++-
 tools/include/uapi/linux/bpf.h  |  5 ++++-
 tools/lib/bpf/bpf.c             |  2 ++
 tools/lib/bpf/bpf.h             |  1 +
 tools/lib/bpf/libbpf.c          | 32 +++++++++++++++++++++++++++-----
 tools/lib/bpf/libbpf.h          |  2 ++
 tools/lib/bpf/libbpf.map        |  1 +
 tools/lib/bpf/libbpf_internal.h |  4 +++-
 8 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fec9fcfe0629..2e3048488feb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1262,7 +1262,10 @@ union bpf_attr {
 		__u32	map_flags;	/* BPF_MAP_CREATE related
 					 * flags defined above.
 					 */
-		__u32	inner_map_fd;	/* fd pointing to the inner map */
+		union {
+			__u32	inner_map_fd;	/* fd pointing to the inner map */
+			__u32	nr_hash_funcs;  /* or number of hash functions */
+		};
 		__u32	numa_node;	/* numa node (effective only if
 					 * BPF_F_NUMA_NODE is set).
 					 */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index fec9fcfe0629..2e3048488feb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1262,7 +1262,10 @@ union bpf_attr {
 		__u32	map_flags;	/* BPF_MAP_CREATE related
 					 * flags defined above.
 					 */
-		__u32	inner_map_fd;	/* fd pointing to the inner map */
+		union {
+			__u32	inner_map_fd;	/* fd pointing to the inner map */
+			__u32	nr_hash_funcs;  /* or number of hash functions */
+		};
 		__u32	numa_node;	/* numa node (effective only if
 					 * BPF_F_NUMA_NODE is set).
 					 */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 2401fad090c5..8a9dd4f6d6c8 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -100,6 +100,8 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
 	if (attr.map_type == BPF_MAP_TYPE_STRUCT_OPS)
 		attr.btf_vmlinux_value_type_id =
 			create_attr->btf_vmlinux_value_type_id;
+	else if (attr.map_type == BPF_MAP_TYPE_BLOOM_FILTER)
+		attr.nr_hash_funcs = create_attr->nr_hash_funcs;
 	else
 		attr.inner_map_fd = create_attr->inner_map_fd;
 
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 6fffb3cdf39b..1194b6f01572 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -49,6 +49,7 @@ struct bpf_create_map_attr {
 	union {
 		__u32 inner_map_fd;
 		__u32 btf_vmlinux_value_type_id;
+		__u32 nr_hash_funcs;
 	};
 };
 
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index da65a1666a5e..e51e68a07aaf 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -378,6 +378,7 @@ struct bpf_map {
 	char *pin_path;
 	bool pinned;
 	bool reused;
+	__u32 nr_hash_funcs;
 };
 
 enum extern_type {
@@ -1291,6 +1292,11 @@ static bool bpf_map_type__is_map_in_map(enum bpf_map_type type)
 	return false;
 }
 
+static inline bool bpf_map__is_bloom_filter(const struct bpf_map *map)
+{
+	return map->def.type == BPF_MAP_TYPE_BLOOM_FILTER;
+}
+
 int bpf_object__section_size(const struct bpf_object *obj, const char *name,
 			     __u32 *size)
 {
@@ -2238,6 +2244,10 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
 			}
 			map_def->pinning = val;
 			map_def->parts |= MAP_DEF_PINNING;
+		} else if (strcmp(name, "nr_hash_funcs") == 0) {
+			if (!get_map_field_int(map_name, btf, m, &map_def->nr_hash_funcs))
+				return -EINVAL;
+			map_def->parts |= MAP_DEF_NR_HASH_FUNCS;
 		} else {
 			if (strict) {
 				pr_warn("map '%s': unknown field '%s'.\n", map_name, name);
@@ -2266,6 +2276,7 @@ static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def
 	map->numa_node = def->numa_node;
 	map->btf_key_type_id = def->key_type_id;
 	map->btf_value_type_id = def->value_type_id;
+	map->nr_hash_funcs = def->nr_hash_funcs;
 
 	if (def->parts & MAP_DEF_MAP_TYPE)
 		pr_debug("map '%s': found type = %u.\n", map->name, def->map_type);
@@ -2290,6 +2301,8 @@ static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def
 		pr_debug("map '%s': found pinning = %u.\n", map->name, def->pinning);
 	if (def->parts & MAP_DEF_NUMA_NODE)
 		pr_debug("map '%s': found numa_node = %u.\n", map->name, def->numa_node);
+	if (def->parts & MAP_DEF_NR_HASH_FUNCS)
+		pr_debug("map '%s': found nr_hash_funcs = %u.\n", map->name, def->nr_hash_funcs);
 
 	if (def->parts & MAP_DEF_INNER_MAP)
 		pr_debug("map '%s': found inner map definition.\n", map->name);
@@ -4616,10 +4629,6 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 		create_attr.max_entries = def->max_entries;
 	}
 
-	if (bpf_map__is_struct_ops(map))
-		create_attr.btf_vmlinux_value_type_id =
-			map->btf_vmlinux_value_type_id;
-
 	create_attr.btf_fd = 0;
 	create_attr.btf_key_type_id = 0;
 	create_attr.btf_value_type_id = 0;
@@ -4629,7 +4638,12 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 		create_attr.btf_value_type_id = map->btf_value_type_id;
 	}
 
-	if (bpf_map_type__is_map_in_map(def->type)) {
+	if (bpf_map__is_struct_ops(map)) {
+		create_attr.btf_vmlinux_value_type_id =
+			map->btf_vmlinux_value_type_id;
+	} else if (bpf_map__is_bloom_filter(map)) {
+		create_attr.nr_hash_funcs = map->nr_hash_funcs;
+	} else if (bpf_map_type__is_map_in_map(def->type)) {
 		if (map->inner_map) {
 			err = bpf_object__create_map(obj, map->inner_map, true);
 			if (err) {
@@ -8610,6 +8624,14 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
 	return 0;
 }
 
+int bpf_map__set_nr_hash_funcs(struct bpf_map *map, __u32 nr_hash_funcs)
+{
+	if (map->fd >= 0)
+		return libbpf_err(-EBUSY);
+	map->nr_hash_funcs = nr_hash_funcs;
+	return 0;
+}
+
 __u32 bpf_map__btf_key_type_id(const struct bpf_map *map)
 {
 	return map ? map->btf_key_type_id : 0;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index d0bedd673273..5c441744f766 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -550,6 +550,8 @@ LIBBPF_API __u32 bpf_map__btf_value_type_id(const struct bpf_map *map);
 /* get/set map if_index */
 LIBBPF_API __u32 bpf_map__ifindex(const struct bpf_map *map);
 LIBBPF_API int bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex);
+/* set nr_hash_funcs */
+LIBBPF_API int bpf_map__set_nr_hash_funcs(struct bpf_map *map, __u32 nr_hash_funcs);
 
 typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void *);
 LIBBPF_API int bpf_map__set_priv(struct bpf_map *map, void *priv,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 9e649cf9e771..ee0e1e7648f4 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -385,6 +385,7 @@ LIBBPF_0.5.0 {
 		btf__load_vmlinux_btf;
 		btf_dump__dump_type_data;
 		libbpf_set_strict_mode;
+		bpf_map__set_nr_hash_funcs;
 } LIBBPF_0.4.0;
 
 LIBBPF_0.6.0 {
diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
index ceb0c98979bc..95dbbeba231f 100644
--- a/tools/lib/bpf/libbpf_internal.h
+++ b/tools/lib/bpf/libbpf_internal.h
@@ -186,8 +186,9 @@ enum map_def_parts {
 	MAP_DEF_NUMA_NODE	= 0x080,
 	MAP_DEF_PINNING		= 0x100,
 	MAP_DEF_INNER_MAP	= 0x200,
+	MAP_DEF_NR_HASH_FUNCS	= 0x400,
 
-	MAP_DEF_ALL		= 0x3ff, /* combination of all above */
+	MAP_DEF_ALL		= 0x7ff, /* combination of all above */
 };
 
 struct btf_map_def {
@@ -201,6 +202,7 @@ struct btf_map_def {
 	__u32 map_flags;
 	__u32 numa_node;
 	__u32 pinning;
+	__u32 nr_hash_funcs;
 };
 
 int parse_btf_map_def(const char *map_name, struct btf *btf,
-- 
2.30.2



* [PATCH v3 bpf-next 3/5] selftests/bpf: Add bloom filter map test cases
  2021-09-21 21:02 [PATCH v3 bpf-next 0/5] Implement bloom filter map Joanne Koong
  2021-09-21 21:02 ` [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation Joanne Koong
  2021-09-21 21:02 ` [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable Joanne Koong
@ 2021-09-21 21:02 ` Joanne Koong
  2021-09-21 21:02 ` [PATCH v3 bpf-next 4/5] bpf/benchs: Add benchmark test for bloom filter maps Joanne Koong
  2021-09-21 21:02 ` [PATCH v3 bpf-next 5/5] bpf/benchs: Add benchmarks for comparing hashmap lookups with vs. without bloom filter Joanne Koong
  4 siblings, 0 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-21 21:02 UTC (permalink / raw)
  To: bpf; +Cc: Kernel-team, Joanne Koong

This patch adds test cases for bpf bloom filter maps. They include tests
for invalid operations from userspace, tests for using the bloom filter
map as an inner map, and a bpf program that queries the bloom filter
map for values added by a userspace program.

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 .../bpf/prog_tests/bloom_filter_map.c         | 177 ++++++++++++++++++
 .../selftests/bpf/progs/bloom_filter_map.c    |  83 ++++++++
 2 files changed, 260 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bloom_filter_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/bloom_filter_map.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bloom_filter_map.c b/tools/testing/selftests/bpf/prog_tests/bloom_filter_map.c
new file mode 100644
index 000000000000..134705d74a66
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bloom_filter_map.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <sys/syscall.h>
+#include <test_progs.h>
+#include "bloom_filter_map.skel.h"
+
+static void test_bloom_filter_map_fail(void)
+{
+	struct bpf_create_map_attr xattr = {
+		.name = "bloom_filter_map",
+		.map_type = BPF_MAP_TYPE_BLOOM_FILTER,
+		.max_entries = 100,
+		.value_size = sizeof(__u32),
+		.nr_hash_funcs = 3,
+	};
+	__u32 value;
+	int fd, err;
+
+	/* Invalid key size */
+	xattr.key_size = 4;
+	fd = bpf_create_map_xattr(&xattr);
+	if (!ASSERT_LT(fd, 0, "bpf_create_map bloom filter invalid key size"))
+		close(fd);
+	xattr.key_size = 0;
+
+	/* Invalid value size */
+	xattr.value_size = 0;
+	fd = bpf_create_map_xattr(&xattr);
+	if (!ASSERT_LT(fd, 0, "bpf_create_map bloom filter invalid value size"))
+		close(fd);
+	xattr.value_size = sizeof(__u32);
+
+	/* Invalid max entries size */
+	xattr.max_entries = 0;
+	fd = bpf_create_map_xattr(&xattr);
+	if (!ASSERT_LT(fd, 0, "bpf_create_map bloom filter invalid max entries size"))
+		close(fd);
+	xattr.max_entries = 100;
+
+	/* Bloom filter maps do not support BPF_F_NO_PREALLOC */
+	xattr.map_flags = BPF_F_NO_PREALLOC;
+	fd = bpf_create_map_xattr(&xattr);
+	if (!ASSERT_LT(fd, 0, "bpf_create_map bloom filter invalid flags"))
+		close(fd);
+	xattr.map_flags = 0;
+
+	fd = bpf_create_map_xattr(&xattr);
+	if (!ASSERT_GE(fd, 0, "bpf_create_map bloom filter"))
+		return;
+
+	/* Test invalid flags */
+	err = bpf_map_update_elem(fd, NULL, &value, -1);
+	ASSERT_EQ(err, -EINVAL, "bpf_map_update_elem bloom filter invalid flags");
+
+	err = bpf_map_update_elem(fd, NULL, &value, BPF_EXIST);
+	ASSERT_EQ(err, -EINVAL, "bpf_map_update_elem bloom filter invalid flags");
+
+	err = bpf_map_update_elem(fd, NULL, &value, BPF_F_LOCK);
+	ASSERT_EQ(err, -EINVAL, "bpf_map_update_elem bloom filter invalid flags");
+
+	err = bpf_map_update_elem(fd, NULL, &value, BPF_NOEXIST);
+	ASSERT_EQ(err, -EINVAL, "bpf_map_update_elem bloom filter invalid flags");
+
+	err = bpf_map_update_elem(fd, NULL, &value, 10000);
+	ASSERT_EQ(err, -EINVAL, "bpf_map_update_elem bloom filter invalid flags");
+
+	close(fd);
+}
+
+static void bloom_filter_map(struct bloom_filter_map *skel)
+{
+	const int map_size = bpf_map__max_entries(skel->maps.map_random_data);
+	int err, map_random_data_fd, map_bloom_filter_fd, i;
+	__u64 val;
+	struct bpf_link *link;
+
+	map_random_data_fd = bpf_map__fd(skel->maps.map_random_data);
+	map_bloom_filter_fd = bpf_map__fd(skel->maps.map_bloom_filter);
+
+	/* Generate random values and add them to the maps */
+	for (i = 0; i < map_size; i++) {
+		val = rand();
+		err = bpf_map_update_elem(map_random_data_fd, &i, &val, BPF_ANY);
+		if (!ASSERT_OK(err, "Add random value to map_random_data"))
+			continue;
+
+		err = bpf_map_update_elem(map_bloom_filter_fd, NULL, &val, 0);
+		if (!ASSERT_OK(err, "Add random value to map_bloom_filter"))
+			return;
+	}
+
+	link = bpf_program__attach(skel->progs.prog_bloom_filter);
+	if (!ASSERT_OK_PTR(link, "link"))
+		return;
+
+	syscall(SYS_getpgid);
+
+	ASSERT_EQ(skel->bss->error, 0, "error");
+
+	bpf_link__destroy(link);
+}
+
+static void bloom_filter_inner_map(struct bloom_filter_map *skel)
+{
+	const int map_size = bpf_map__max_entries(skel->maps.map_random_data);
+	int outer_map_fd, inner_map_fd, map_random_data_fd, err, i, key = 0;
+	struct bpf_create_map_attr xattr = {
+		.name = "bloom_filter_inner_map",
+		.map_type = BPF_MAP_TYPE_BLOOM_FILTER,
+		.max_entries = map_size,
+		.value_size = sizeof(__u64),
+	};
+	struct bpf_link *link;
+	__u64 val;
+
+	/* Create a bloom filter map that will be used as the inner map */
+	inner_map_fd = bpf_create_map_xattr(&xattr);
+	if (!ASSERT_GE(inner_map_fd, 0, "bpf_create_map bloom filter map as inner map"))
+		return;
+
+	/* Generate random values and add them to the maps */
+	map_random_data_fd = bpf_map__fd(skel->maps.map_random_data);
+	for (i = 0; i < map_size; i++) {
+		val = rand();
+		err = bpf_map_update_elem(map_random_data_fd, &i, &val, BPF_ANY);
+		if (!ASSERT_OK(err, "Add random value to map_random_data"))
+			continue;
+
+		err = bpf_map_update_elem(inner_map_fd, NULL, &val, 0);
+		if (!ASSERT_OK(err, "Add random value to inner_map_fd"))
+			goto done;
+	}
+
+	outer_map_fd = bpf_map__fd(skel->maps.outer_map);
+	/* Add the bloom filter map to the outer map */
+	err = bpf_map_update_elem(outer_map_fd, &key, &inner_map_fd, 0);
+	if (!ASSERT_OK(err, "Add bloom filter map to outer map"))
+		goto done;
+
+	/* Attach the bloom_filter_inner_map prog */
+	link = bpf_program__attach(skel->progs.prog_bloom_filter_inner_map);
+	if (!ASSERT_OK_PTR(link, "link"))
+		goto delete_inner_map;
+
+	syscall(SYS_getpgid);
+
+	ASSERT_EQ(skel->bss->error, 0, "error");
+
+	bpf_link__destroy(link);
+
+delete_inner_map:
+	/* Ensure the inner bloom filter map can be deleted */
+	err = bpf_map_delete_elem(outer_map_fd, &key);
+	ASSERT_OK(err, "Delete inner bloom filter map");
+
+done:
+	close(inner_map_fd);
+}
+
+void test_bloom_filter_map(void)
+{
+	struct bloom_filter_map *skel;
+
+	test_bloom_filter_map_fail();
+
+	skel = bloom_filter_map__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bloom_filter_map__open_and_load"))
+		goto cleanup;
+
+	bloom_filter_map(skel);
+
+	bloom_filter_inner_map(skel);
+
+cleanup:
+	bloom_filter_map__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bloom_filter_map.c b/tools/testing/selftests/bpf/progs/bloom_filter_map.c
new file mode 100644
index 000000000000..5925d8dce4ec
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bloom_filter_map.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct bpf_map;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1000);
+	__type(key, __u32);
+	__type(value, __u64);
+} map_random_data SEC(".maps");
+
+struct map_bloom_filter_type {
+	__uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
+	__uint(key_size, 0);
+	__uint(value_size, sizeof(__u64));
+	__uint(max_entries, 1000);
+	__uint(nr_hash_funcs, 3);
+} map_bloom_filter SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(int));
+	__array(values, struct map_bloom_filter_type);
+} outer_map SEC(".maps");
+
+struct callback_ctx {
+	struct map_bloom_filter_type *map;
+};
+
+int error = 0;
+
+static __u64
+check_elem(struct bpf_map *map, __u32 *key, __u64 *val,
+	   struct callback_ctx *data)
+{
+	int err;
+
+	err = bpf_map_peek_elem(data->map, val);
+	if (err) {
+		error |= 1;
+		return 1; /* stop the iteration */
+	}
+
+	return 0;
+}
+
+SEC("fentry/__x64_sys_getpgid")
+int prog_bloom_filter(void *ctx)
+{
+	struct callback_ctx data;
+
+	data.map = &map_bloom_filter;
+	bpf_for_each_map_elem(&map_random_data, check_elem, &data, 0);
+
+	return 0;
+}
+
+SEC("fentry/__x64_sys_getpgid")
+int prog_bloom_filter_inner_map(void *ctx)
+{
+	struct map_bloom_filter_type *inner_map;
+	struct callback_ctx data;
+	int key = 0;
+
+	inner_map = bpf_map_lookup_elem(&outer_map, &key);
+	if (!inner_map) {
+		error |= 2;
+		return 0;
+	}
+
+	data.map = inner_map;
+	bpf_for_each_map_elem(&map_random_data, check_elem, &data, 0);
+
+	return 0;
+}
-- 
2.30.2



* [PATCH v3 bpf-next 4/5] bpf/benchs: Add benchmark test for bloom filter maps
  2021-09-21 21:02 [PATCH v3 bpf-next 0/5] Implement bloom filter map Joanne Koong
                   ` (2 preceding siblings ...)
  2021-09-21 21:02 ` [PATCH v3 bpf-next 3/5] selftests/bpf: Add bloom filter map test cases Joanne Koong
@ 2021-09-21 21:02 ` Joanne Koong
  2021-09-21 21:02 ` [PATCH v3 bpf-next 5/5] bpf/benchs: Add benchmarks for comparing hashmap lookups with vs. without bloom filter Joanne Koong
  4 siblings, 0 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-21 21:02 UTC (permalink / raw)
  To: bpf; +Cc: Kernel-team, Joanne Koong

This patch adds benchmark tests for the throughput and false positive
rate of bloom filter map lookups for a given number of entries and a
given number of hash functions.
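
Individual configurations can also be run directly against the bench
harness; for example (flags taken from the common bench options and
the argument parser added here):

	sudo ./bench -w3 -d10 -a -p 1 --nr_hash_funcs 3 \
		--nr_entries 50000 bloom-filter-map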

These benchmarks show that as the number of hash functions increases,
the throughput and the false positive rate of the bloom filter map
both decrease. From the benchmark data, the average false positive
rates are roughly as follows:

1 hash function = ~30%
2 hash functions = ~15%
3 hash functions = ~5%
4 hash functions = ~2.5%
5 hash functions = ~1%
6 hash functions = ~0.5%
7 hash functions = ~0.35%
8 hash functions = ~0.15%
9 hash functions = ~0.1%
10 hash functions = ~0%
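
These roughly track the theoretical (1/2)^k bound for k hash functions
that follows from the n * k / ln(2) bit array sizing in patch 1, with
the measured rates coming in somewhat lower because the bit array is
rounded up to a power of two.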

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  35 ++
 tools/testing/selftests/bpf/bench.h           |   3 +
 .../bpf/benchs/bench_bloom_filter_map.c       | 345 ++++++++++++++++++
 .../bpf/benchs/run_bench_bloom_filter_map.sh  |  28 ++
 .../bpf/benchs/run_bench_ringbufs.sh          |  30 +-
 .../selftests/bpf/benchs/run_common.sh        |  48 +++
 .../selftests/bpf/progs/bloom_filter_map.c    |  74 ++++
 8 files changed, 538 insertions(+), 29 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh
 create mode 100644 tools/testing/selftests/bpf/benchs/run_common.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 326ea75ce99e..5dbaf7f512fd 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -520,13 +520,15 @@ $(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
 $(OUTPUT)/bench_trigger.o: $(OUTPUT)/trigger_bench.skel.h
 $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 			    $(OUTPUT)/perfbuf_bench.skel.h
+$(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_map.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o $(OUTPUT)/testing_helpers.o \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
-		 $(OUTPUT)/bench_ringbufs.o
+		 $(OUTPUT)/bench_ringbufs.o \
+		 $(OUTPUT)/bench_bloom_filter_map.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 6ea15b93a2f8..0bcbdb4405a3 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -51,6 +51,35 @@ void setup_libbpf()
 		fprintf(stderr, "failed to increase RLIMIT_MEMLOCK: %d", err);
 }
 
+void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns)
+{
+	long total = res->false_hits + res->hits + res->drops;
+
+	printf("Iter %3d (%7.3lfus): ",
+	       iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("%ld false hits of %ld total operations. Percentage = %2.2f %%\n",
+	       res->false_hits, total, ((float)res->false_hits / total) * 100);
+}
+
+void false_hits_report_final(struct bench_res res[], int res_cnt)
+{
+	long total_hits = 0, total_drops = 0, total_false_hits = 0, total_ops = 0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		total_hits += res[i].hits;
+		total_false_hits += res[i].false_hits;
+		total_drops += res[i].drops;
+	}
+	total_ops = total_hits + total_false_hits + total_drops;
+
+	printf("Summary: %ld false hits of %ld total operations. ",
+	       total_false_hits, total_ops);
+	printf("Percentage =  %2.2f %%\n",
+	       ((float)total_false_hits / total_ops) * 100);
+}
+
 void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
 {
 	double hits_per_sec, drops_per_sec;
@@ -132,9 +161,11 @@ static const struct argp_option opts[] = {
 };
 
 extern struct argp bench_ringbufs_argp;
+extern struct argp bench_bloom_filter_map_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
+	{ &bench_bloom_filter_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{},
 };
 
@@ -323,6 +354,8 @@ extern const struct bench bench_rb_libbpf;
 extern const struct bench bench_rb_custom;
 extern const struct bench bench_pb_libbpf;
 extern const struct bench bench_pb_custom;
+extern const struct bench bench_bloom_filter_map;
+extern const struct bench bench_bloom_filter_false_positive;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -344,6 +377,8 @@ static const struct bench *benchs[] = {
 	&bench_rb_custom,
 	&bench_pb_libbpf,
 	&bench_pb_custom,
+	&bench_bloom_filter_map,
+	&bench_bloom_filter_false_positive,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index c1f48a473b02..624c6b11501f 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -33,6 +33,7 @@ struct env {
 struct bench_res {
 	long hits;
 	long drops;
+	long false_hits;
 };
 
 struct bench {
@@ -56,6 +57,8 @@ extern const struct bench *bench;
 void setup_libbpf();
 void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void hits_drops_report_final(struct bench_res res[], int res_cnt);
+void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
+void false_hits_report_final(struct bench_res res[], int res_cnt);
 
 static inline __u64 get_time_ns() {
 	struct timespec t;
diff --git a/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c b/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
new file mode 100644
index 000000000000..8b4cd9a52a88
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <argp.h>
+#include <linux/log2.h>
+#include <pthread.h>
+#include "bench.h"
+#include "bloom_filter_map.skel.h"
+#include "bpf_util.h"
+
+static struct ctx {
+	struct bloom_filter_map *skel;
+	pthread_mutex_t map_done_mtx;
+	pthread_cond_t map_done;
+	bool map_prepare_err;
+	__u32 next_map_idx;
+} ctx = {
+	.map_done_mtx = PTHREAD_MUTEX_INITIALIZER,
+	.map_done = PTHREAD_COND_INITIALIZER,
+};
+
+static struct {
+	__u32 nr_entries;
+	__u8 nr_hash_funcs;
+} args = {
+	.nr_entries = 1000,
+	.nr_hash_funcs = 3,
+};
+
+enum {
+	ARG_NR_ENTRIES = 3000,
+	ARG_NR_HASH_FUNCS = 3001,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0,
+		"Set number of entries in the bloom filter map"},
+	{ "nr_hash_funcs", ARG_NR_HASH_FUNCS, "NR_HASH_FUNCS", 0,
+		"Set number of hashes in the bloom filter map"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_NR_ENTRIES:
+		args.nr_entries = strtol(arg, NULL, 10);
+		if (args.nr_entries == 0) {
+			fprintf(stderr, "Invalid nr_entries count.");
+			argp_usage(state);
+		}
+		break;
+	case ARG_NR_HASH_FUNCS:
+		args.nr_hash_funcs = strtol(arg, NULL, 10);
+		if (args.nr_hash_funcs == 0) {
+			fprintf(stderr, "Cannot specify a bloom filter map with 0 hashes.");
+			argp_usage(state);
+		}
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+/* exported into benchmark runner */
+const struct argp bench_bloom_filter_map_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "bloom filter map benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static inline void trigger_bpf_program(void)
+{
+	syscall(__NR_getpgid);
+}
+
+static void *producer(void *input)
+{
+	while (true)
+		trigger_bpf_program();
+
+	return NULL;
+}
+
+static void *map_prepare_thread(void *arg)
+{
+	int err, random_data_fd, bloom_filter_fd, hashmap_fd;
+	__u64 i, val;
+
+	bloom_filter_fd = bpf_map__fd(ctx.skel->maps.map_bloom_filter);
+	random_data_fd = bpf_map__fd(ctx.skel->maps.map_random_data);
+	hashmap_fd = bpf_map__fd(ctx.skel->maps.hashmap);
+
+	while (true) {
+		i = __atomic_add_fetch(&ctx.next_map_idx, 1, __ATOMIC_RELAXED);
+		if (i > args.nr_entries)
+			break;
+again:
+		err = syscall(__NR_getrandom, &val, sizeof(val), 0);
+		if (err != sizeof(val)) {
+			ctx.map_prepare_err = true;
+			fprintf(stderr, "failed to get random value\n");
+			break;
+		}
+		err = bpf_map_update_elem(hashmap_fd, &val, &val, BPF_NOEXIST);
+		if (err) {
+			if (err != -EEXIST) {
+				ctx.map_prepare_err = true;
+				fprintf(stderr, "failed to add elem to hashmap: %d\n", -errno);
+				break;
+			}
+			goto again;
+		}
+
+		i--;
+		err = bpf_map_update_elem(random_data_fd, &i, &val, 0);
+		if (err) {
+			ctx.map_prepare_err = true;
+			fprintf(stderr, "failed to add elem to array: %d\n", -errno);
+			break;
+		}
+
+		err = bpf_map_update_elem(bloom_filter_fd, NULL, &val, 0);
+		if (err) {
+			ctx.map_prepare_err = true;
+			fprintf(stderr, "failed to add elem to bloom_filter: %d\n", -errno);
+			break;
+		}
+	}
+
+	pthread_mutex_lock(&ctx.map_done_mtx);
+	pthread_cond_signal(&ctx.map_done);
+	pthread_mutex_unlock(&ctx.map_done_mtx);
+
+	return NULL;
+}
+
+static void populate_maps(void)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	pthread_t map_thread;
+	int i, err;
+
+	for (i = 0; i < nr_cpus; i++) {
+		err = pthread_create(&map_thread, NULL, map_prepare_thread,
+				     NULL);
+		if (err) {
+			fprintf(stderr, "failed to create pthread: %d\n", -errno);
+			exit(1);
+		}
+	}
+
+	pthread_mutex_lock(&ctx.map_done_mtx);
+	pthread_cond_wait(&ctx.map_done, &ctx.map_done_mtx);
+	pthread_mutex_unlock(&ctx.map_done_mtx);
+
+	if (ctx.map_prepare_err)
+		exit(1);
+}
+
+static struct bloom_filter_map *setup_skeleton(void)
+{
+	struct bloom_filter_map *skel;
+	int err;
+
+	setup_libbpf();
+
+	skel = bloom_filter_map__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	err = bpf_map__resize(skel->maps.map_random_data, args.nr_entries);
+	if (err) {
+		fprintf(stderr, "failed to resize map_random_data\n");
+		exit(1);
+	}
+
+	err = bpf_map__resize(skel->maps.hashmap, args.nr_entries);
+	if (err) {
+		fprintf(stderr, "failed to resize hashmap\n");
+		exit(1);
+	}
+
+	err = bpf_map__resize(skel->maps.map_bloom_filter, args.nr_entries);
+	if (err) {
+		fprintf(stderr, "failed to resize bloom filter\n");
+		exit(1);
+	}
+
+	err = bpf_map__set_nr_hash_funcs(skel->maps.map_bloom_filter, args.nr_hash_funcs);
+	if (err) {
+		fprintf(stderr, "failed to set %u hashes\n", args.nr_hash_funcs);
+		exit(1);
+	}
+
+	if (bloom_filter_map__load(skel)) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	return skel;
+}
+
+static void bloom_filter_map_setup(void)
+{
+	struct bpf_link *link;
+
+	ctx.skel = setup_skeleton();
+
+	populate_maps();
+
+	link = bpf_program__attach(ctx.skel->progs.prog_bloom_filter);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void hashmap_lookup_setup(void)
+{
+	struct bpf_link *link;
+
+	ctx.skel = setup_skeleton();
+
+	populate_maps();
+
+	link = bpf_program__attach(ctx.skel->progs.prog_bloom_filter_hashmap_lookup);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void measure(struct bench_res *res)
+{
+	long total_hits = 0, total_drops = 0, total_false_hits = 0;
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	BPF_DECLARE_PERCPU(__u64, zeroed_values);
+	BPF_DECLARE_PERCPU(__u64, false_hits);
+	BPF_DECLARE_PERCPU(__u64, drops);
+	BPF_DECLARE_PERCPU(__u64, hits);
+	int err, i, percpu_array_fd;
+	__u32 key;
+
+	if (ctx.skel->bss->error != 0) {
+		fprintf(stderr, "error (%d) when searching the bloom filter\n",
+			ctx.skel->bss->error);
+		exit(1);
+	}
+
+	key = ctx.skel->rodata->hit_key;
+	percpu_array_fd = bpf_map__fd(ctx.skel->maps.percpu_array);
+	err = bpf_map_lookup_elem(percpu_array_fd, &key, hits);
+	if (err) {
+		fprintf(stderr, "lookup in the percpu array  for 'hits' failed: %d\n",
+			-errno);
+		exit(1);
+	}
+
+	key = ctx.skel->rodata->drop_key;
+	err = bpf_map_lookup_elem(percpu_array_fd, &key, drops);
+	if (err) {
+		fprintf(stderr, "lookup in the percpu array for 'drops' failed: %d\n",
+			-errno);
+		exit(1);
+	}
+
+	key = ctx.skel->rodata->false_hit_key;
+	err = bpf_map_lookup_elem(percpu_array_fd, &key, false_hits);
+	if (err) {
+		fprintf(stderr, "lookup in the percpu array for 'false hits' failed: %d\n",
+			-errno);
+		exit(1);
+	}
+
+	for (i = 0; i < nr_cpus; i++) {
+		total_hits += bpf_percpu(hits, i);
+		total_drops += bpf_percpu(drops, i);
+		total_false_hits += bpf_percpu(false_hits, i);
+	}
+
+	res->hits = total_hits;
+	res->drops = total_drops;
+	res->false_hits = total_false_hits;
+
+	memset(zeroed_values, 0, sizeof(zeroed_values));
+
+	/* zero out the percpu array */
+	key = ctx.skel->rodata->hit_key;
+	err = bpf_map_update_elem(percpu_array_fd, &key, zeroed_values, BPF_ANY);
+	if (err) {
+		fprintf(stderr, "zeroing the percpu array failed: %d\n", -errno);
+		exit(1);
+	}
+	key = ctx.skel->rodata->drop_key;
+	err = bpf_map_update_elem(percpu_array_fd, &key, zeroed_values, BPF_ANY);
+	if (err) {
+		fprintf(stderr, "zeroing the percpu array failed: %d\n", -errno);
+		exit(1);
+	}
+	key = ctx.skel->rodata->false_hit_key;
+	err = bpf_map_update_elem(percpu_array_fd, &key, zeroed_values, BPF_ANY);
+	if (err) {
+		fprintf(stderr, "zeroing the percpu array failed: %d\n", -errno);
+		exit(1);
+	}
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+const struct bench bench_bloom_filter_map = {
+	.name = "bloom-filter-map",
+	.validate = validate,
+	.setup = bloom_filter_map_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_bloom_filter_false_positive = {
+	.name = "bloom-filter-false-positive",
+	.validate = validate,
+	.setup = hashmap_lookup_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = false_hits_report_progress,
+	.report_final = false_hits_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh b/tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh
new file mode 100755
index 000000000000..0dbbd85937e3
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+header "Bloom filter map"
+for t in 1 4 8; do
+for h in {1..10}; do
+subtitle "# threads: $t, # hashes: $h"
+	for e in 10000 50000 75000 100000 250000 500000 750000 1000000 2500000 5000000; do
+		printf "%'d entries -\n" $e
+		printf "\t"
+		summarize "Total operations: " \
+			"$($RUN_BENCH -p $t --nr_hash_funcs $h --nr_entries $e bloom-filter-map)"
+		printf "\t"
+		summarize_percentage "False positive rate: " \
+			"$($RUN_BENCH -p $t --nr_hash_funcs $h --nr_entries $e bloom-filter-false-positive)"
+	done
+	printf "\n"
+done
+done
+
+header "Bloom filter map, multi-producer contention"
+for t in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do
+	summarize "$t threads - " "$($RUN_BENCH -p $t bloom-filter-map)"
+done
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
index af4aa04caba6..ada028aa9007 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
@@ -1,34 +1,8 @@
 #!/bin/bash
 
-set -eufo pipefail
-
-RUN_BENCH="sudo ./bench -w3 -d10 -a"
-
-function hits()
-{
-	echo "$*" | sed -E "s/.*hits\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
-}
-
-function drops()
-{
-	echo "$*" | sed -E "s/.*drops\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
-}
+source ./benchs/run_common.sh
 
-function header()
-{
-	local len=${#1}
-
-	printf "\n%s\n" "$1"
-	for i in $(seq 1 $len); do printf '='; done
-	printf '\n'
-}
-
-function summarize()
-{
-	bench="$1"
-	summary=$(echo $2 | tail -n1)
-	printf "%-20s %s (drops %s)\n" "$bench" "$(hits $summary)" "$(drops $summary)"
-}
+set -eufo pipefail
 
 header "Single-producer, parallel producer"
 for b in rb-libbpf rb-custom pb-libbpf pb-custom; do
diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh
new file mode 100644
index 000000000000..670f23b037c4
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_common.sh
@@ -0,0 +1,48 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+RUN_BENCH="sudo ./bench -w3 -d10 -a"
+
+function header()
+{
+	local len=${#1}
+
+	printf "\n%s\n" "$1"
+	for i in $(seq 1 $len); do printf '='; done
+	printf '\n'
+}
+
+function subtitle()
+{
+	local len=${#1}
+	printf "\t%s\n" "$1"
+}
+
+function hits()
+{
+	echo "$*" | sed -E "s/.*hits\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
+}
+
+function drops()
+{
+	echo "$*" | sed -E "s/.*drops\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
+}
+
+function percentage()
+{
+	echo "$*" | sed -E "s/.*Percentage\s=\s+([0-9]+\.[0-9]+).*/\1/"
+}
+
+function summarize()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s (drops %s)\n" "$bench" "$(hits $summary)" "$(drops $summary)"
+}
+
+function summarize_percentage()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s%%\n" "$bench" "$(percentage $summary)"
+}
diff --git a/tools/testing/selftests/bpf/progs/bloom_filter_map.c b/tools/testing/selftests/bpf/progs/bloom_filter_map.c
index 5925d8dce4ec..05f9706a5ba6 100644
--- a/tools/testing/selftests/bpf/progs/bloom_filter_map.c
+++ b/tools/testing/selftests/bpf/progs/bloom_filter_map.c
@@ -1,7 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2021 Facebook */
 
+#include <errno.h>
 #include <linux/bpf.h>
+#include <stdbool.h>
 #include <bpf/bpf_helpers.h>
 
 char _license[] SEC("license") = "GPL";
@@ -35,8 +37,38 @@ struct callback_ctx {
 	struct map_bloom_filter_type *map;
 };
 
+/* Tracks the number of hits, drops, and false hits */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, 3);
+	__type(key, __u32);
+	__type(value, __u64);
+} percpu_array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1000);
+	__type(key, __u64);
+	__type(value, __u64);
+} hashmap SEC(".maps");
+
+const __u32 hit_key  = 0;
+const __u32 drop_key  = 1;
+const __u32 false_hit_key = 2;
+
+bool hashmap_use_bloom_filter = true;
+
 int error = 0;
 
+static __always_inline void log_result(__u32 key)
+{
+	__u64 *count;
+
+	count = bpf_map_lookup_elem(&percpu_array, &key);
+	if (count)
+		*count += 1;
+}
+
 static __u64
 check_elem(struct bpf_map *map, __u32 *key, __u64 *val,
 	   struct callback_ctx *data)
@@ -49,6 +81,8 @@ check_elem(struct bpf_map *map, __u32 *key, __u64 *val,
 		return 1; /* stop the iteration */
 	}
 
+	log_result(hit_key);
+
 	return 0;
 }
 
@@ -81,3 +115,43 @@ int prog_bloom_filter_inner_map(void *ctx)
 
 	return 0;
 }
+
+SEC("fentry/__x64_sys_getpgid")
+int prog_bloom_filter_hashmap_lookup(void *ctx)
+{
+	__u64 *result;
+	int i, err;
+
+	union {
+		__u64 data64;
+		__u32 data32[2];
+	} val;
+
+	for (i = 0; i < 512; i++) {
+		val.data32[0] = bpf_get_prandom_u32();
+		val.data32[1] = bpf_get_prandom_u32();
+
+		if (hashmap_use_bloom_filter) {
+			err = bpf_map_peek_elem(&map_bloom_filter, &val);
+			if (err) {
+				if (err != -ENOENT) {
+					error |= 3;
+					return 0;
+				}
+				log_result(drop_key);
+				continue;
+			}
+		}
+
+		result = bpf_map_lookup_elem(&hashmap, &val);
+		if (result) {
+			log_result(hit_key);
+		} else {
+			if (hashmap_use_bloom_filter)
+				log_result(false_hit_key);
+			log_result(drop_key);
+		}
+	}
+
+	return 0;
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 bpf-next 5/5] bpf/benchs: Add benchmarks for comparing hashmap lookups with vs. without bloom filter
  2021-09-21 21:02 [PATCH v3 bpf-next 0/5] Implement bloom filter map Joanne Koong
                   ` (3 preceding siblings ...)
  2021-09-21 21:02 ` [PATCH v3 bpf-next 4/5] bpf/benchs: Add benchmark test for bloom filter maps Joanne Koong
@ 2021-09-21 21:02 ` Joanne Koong
  4 siblings, 0 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-21 21:02 UTC (permalink / raw)
  To: bpf; +Cc: Kernel-team, Joanne Koong

This patch adds benchmark tests for comparing the performance of hashmap
lookups without the bloom filter vs. hashmap lookups with the bloom filter.

Checking the bloom filter first for whether the element exists should
enable a higher overall throughput for hashmap lookups, since if the
element does not exist in the bloom filter, the costly lookup in
the hashmap can be skipped.
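
The core of the measured lookup pattern mirrors the selftest program
added earlier in this series, and is roughly:

	err = bpf_map_peek_elem(&map_bloom_filter, &val);
	if (err) {
		if (err != -ENOENT)
			return 0; /* unexpected error */
		/* definitely not present, skip the hashmap lookup */
		log_result(drop_key);
		continue;
	}
	/* possibly present (or a false positive), do the real lookup */
	result = bpf_map_lookup_elem(&hashmap, &val);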

On average, using 5 hash functions in the bloom filter tended to perform
the best across the widest range of different entry sizes. The benchmark
results using 5 hash functions (running on 8 threads on a machine with one
numa node, and taking the average of 3 runs) were roughly as follows:

value_size = 4 bytes -
	10k entries: 30% faster
	50k entries: 50% faster
	100k entries: 55% faster
	500k entries: 80% faster
	1 million entries: 120% faster
	5 million entries: 135% faster

value_size = 8 bytes -
	10k entries: 35% faster
	50k entries: 55% faster
	100k entries: 70% faster
	500k entries: 110% faster
	1 million entries: 215% faster
	5 million entries: 215% faster

value_size = 16 bytes -
	10k entries: 5% slower
	50k entries: 25% faster
	100k entries: 35% faster
	500k entries: 105% faster
	1 million entries: 130% faster
	5 million entries: 105% faster

value_size = 40 bytes -
	10k entries: 5% slower
	50k entries: 10% faster
	100k entries: 20% faster
	500k entries: 45% faster
	1 million entries: 60% faster
	5 million entries: 75% faster

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 tools/testing/selftests/bpf/bench.c           | 22 ++++++++---
 .../bpf/benchs/bench_bloom_filter_map.c       | 39 +++++++++++++++++++
 .../bpf/benchs/run_bench_bloom_filter_map.sh  | 15 +++++++
 .../selftests/bpf/benchs/run_common.sh        | 12 ++++++
 4 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 0bcbdb4405a3..7da1589a9fe0 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -92,20 +92,21 @@ void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
 	printf("Iter %3d (%7.3lfus): ",
 	       iter, (delta_ns - 1000000000) / 1000.0);
 
-	printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s\n",
-	       hits_per_sec, hits_per_prod, drops_per_sec);
+	printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s, total operations %8.3lfM/s\n",
+	       hits_per_sec, hits_per_prod, drops_per_sec, hits_per_sec + drops_per_sec);
 }
 
 void hits_drops_report_final(struct bench_res res[], int res_cnt)
 {
 	int i;
-	double hits_mean = 0.0, drops_mean = 0.0;
-	double hits_stddev = 0.0, drops_stddev = 0.0;
+	double hits_mean = 0.0, drops_mean = 0.0, total_ops_mean = 0.0;
+	double hits_stddev = 0.0, drops_stddev = 0.0, total_ops_stddev = 0.0;
 
 	for (i = 0; i < res_cnt; i++) {
 		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
 		drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
 	}
+	total_ops_mean = hits_mean + drops_mean;
 
 	if (res_cnt > 1)  {
 		for (i = 0; i < res_cnt; i++) {
@@ -115,14 +116,21 @@ void hits_drops_report_final(struct bench_res res[], int res_cnt)
 			drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
 					(drops_mean - res[i].drops / 1000000.0) /
 					(res_cnt - 1.0);
+			total_ops_stddev += (total_ops_mean -
+					(res[i].hits + res[i].drops) / 1000000.0) *
+					(total_ops_mean - (res[i].hits + res[i].drops) / 1000000.0)
+					/ (res_cnt - 1.0);
 		}
 		hits_stddev = sqrt(hits_stddev);
 		drops_stddev = sqrt(drops_stddev);
+		total_ops_stddev = sqrt(total_ops_stddev);
 	}
 	printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
 	       hits_mean, hits_stddev, hits_mean / env.producer_cnt);
-	printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
+	printf("drops %8.3lf \u00B1 %5.3lfM/s, ",
 	       drops_mean, drops_stddev);
+	printf("total operations %8.3lf \u00B1 %5.3lfM/s\n",
+	       total_ops_mean, total_ops_stddev);
 }
 
 const char *argp_program_version = "benchmark";
@@ -356,6 +364,8 @@ extern const struct bench bench_pb_libbpf;
 extern const struct bench bench_pb_custom;
 extern const struct bench bench_bloom_filter_map;
 extern const struct bench bench_bloom_filter_false_positive;
+extern const struct bench bench_hashmap_without_bloom_filter;
+extern const struct bench bench_hashmap_with_bloom_filter;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -379,6 +389,8 @@ static const struct bench *benchs[] = {
 	&bench_pb_custom,
 	&bench_bloom_filter_map,
 	&bench_bloom_filter_false_positive,
+	&bench_hashmap_without_bloom_filter,
+	&bench_hashmap_with_bloom_filter,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c b/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
index 8b4cd9a52a88..7adf80be7292 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
@@ -242,6 +242,23 @@ static void hashmap_lookup_setup(void)
 	}
 }
 
+static void hashmap_no_bloom_filter_setup(void)
+{
+	struct bpf_link *link;
+
+	ctx.skel = setup_skeleton();
+
+	ctx.skel->data->hashmap_use_bloom_filter = false;
+
+	populate_maps();
+
+	link = bpf_program__attach(ctx.skel->progs.prog_bloom_filter_hashmap_lookup);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
 static void measure(struct bench_res *res)
 {
 	long total_hits = 0, total_drops = 0, total_false_hits = 0;
@@ -343,3 +360,25 @@ const struct bench bench_bloom_filter_false_positive = {
 	.report_progress = false_hits_report_progress,
 	.report_final = false_hits_report_final,
 };
+
+const struct bench bench_hashmap_without_bloom_filter = {
+	.name = "hashmap-without-bloom-filter",
+	.validate = validate,
+	.setup = hashmap_no_bloom_filter_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_hashmap_with_bloom_filter = {
+	.name = "hashmap-with-bloom-filter",
+	.validate = validate,
+	.setup = hashmap_lookup_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh b/tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh
index 0dbbd85937e3..239c040b7aaa 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_bloom_filter_map.sh
@@ -26,3 +26,18 @@ header "Bloom filter map, multi-producer contention"
 for t in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do
 	summarize "$t threads - " "$($RUN_BENCH -p $t bloom-filter-map)"
 done
+
+header "Hashmap without bloom filter vs. hashmap with bloom filter (throughput, 8 threads)"
+for h in {1..10}; do
+subtitle "# hashes: $h"
+	for e in 10000 50000 75000 100000 250000 500000 750000 1000000 2500000 5000000; do
+		printf "%'d entries -\n" $e
+		printf "\t"
+		summarize_total "Hashmap without bloom filter: " \
+			"$($RUN_BENCH --nr_hash_funcs $h --nr_entries $e -p 8 hashmap-without-bloom-filter)"
+		printf "\t"
+		summarize_total "Hashmap with bloom filter: " \
+			"$($RUN_BENCH --nr_hash_funcs $h --nr_entries $e -p 8 hashmap-with-bloom-filter)"
+	done
+	printf "\n"
+done
diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh
index 670f23b037c4..9a16be78b180 100644
--- a/tools/testing/selftests/bpf/benchs/run_common.sh
+++ b/tools/testing/selftests/bpf/benchs/run_common.sh
@@ -33,6 +33,11 @@ function percentage()
 	echo "$*" | sed -E "s/.*Percentage\s=\s+([0-9]+\.[0-9]+).*/\1/"
 }
 
+function total()
+{
+	echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
+}
+
 function summarize()
 {
 	bench="$1"
@@ -46,3 +51,10 @@ function summarize_percentage()
 	summary=$(echo $2 | tail -n1)
 	printf "%-20s %s%%\n" "$bench" "$(percentage $summary)"
 }
+
+function summarize_total()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s\n" "$bench" "$(total $summary)"
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable
  2021-09-21 21:02 ` [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable Joanne Koong
@ 2021-09-21 22:24   ` Joanne Koong
  2021-09-22 23:14     ` Andrii Nakryiko
  2021-09-21 23:57   ` Andrii Nakryiko
  1 sibling, 1 reply; 37+ messages in thread
From: Joanne Koong @ 2021-09-21 22:24 UTC (permalink / raw)
  To: bpf; +Cc: Kernel-team


On 9/21/21 2:02 PM, Joanne Koong wrote:
> This patch adds the libbpf infrastructure that will allow the user to
> specify a configurable number of hash functions to use for the bloom
> filter map.
>
> Please note that this patch does not enforce that a pinned bloom filter
> map may only be reused if the number of hash functions is the same. If
> they are not the same, the number of hash functions used will be the one
> that was set for the pinned map.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>   include/uapi/linux/bpf.h        |  5 ++++-
>   tools/include/uapi/linux/bpf.h  |  5 ++++-
>   tools/lib/bpf/bpf.c             |  2 ++
>   tools/lib/bpf/bpf.h             |  1 +
>   tools/lib/bpf/libbpf.c          | 32 +++++++++++++++++++++++++++-----
>   tools/lib/bpf/libbpf.h          |  2 ++
>   tools/lib/bpf/libbpf.map        |  1 +
>   tools/lib/bpf/libbpf_internal.h |  4 +++-
>   8 files changed, 44 insertions(+), 8 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fec9fcfe0629..2e3048488feb 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1262,7 +1262,10 @@ union bpf_attr {
>   		__u32	map_flags;	/* BPF_MAP_CREATE related
>   					 * flags defined above.
>   					 */
> -		__u32	inner_map_fd;	/* fd pointing to the inner map */
> +		union {
> +			__u32	inner_map_fd;	/* fd pointing to the inner map */
> +			__u32	nr_hash_funcs;  /* or number of hash functions */
> +		};
>   		__u32	numa_node;	/* numa node (effective only if
>   					 * BPF_F_NUMA_NODE is set).
>   					 */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index fec9fcfe0629..2e3048488feb 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1262,7 +1262,10 @@ union bpf_attr {
>   		__u32	map_flags;	/* BPF_MAP_CREATE related
>   					 * flags defined above.
>   					 */
> -		__u32	inner_map_fd;	/* fd pointing to the inner map */
> +		union {
> +			__u32	inner_map_fd;	/* fd pointing to the inner map */
> +			__u32	nr_hash_funcs;  /* or number of hash functions */
> +		};
>   		__u32	numa_node;	/* numa node (effective only if
>   					 * BPF_F_NUMA_NODE is set).
>   					 */
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 2401fad090c5..8a9dd4f6d6c8 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -100,6 +100,8 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
>   	if (attr.map_type == BPF_MAP_TYPE_STRUCT_OPS)
>   		attr.btf_vmlinux_value_type_id =
>   			create_attr->btf_vmlinux_value_type_id;
> +	else if (attr.map_type == BPF_MAP_TYPE_BLOOM_FILTER)
> +		attr.nr_hash_funcs = create_attr->nr_hash_funcs;
>   	else
>   		attr.inner_map_fd = create_attr->inner_map_fd;
>   
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 6fffb3cdf39b..1194b6f01572 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -49,6 +49,7 @@ struct bpf_create_map_attr {
>   	union {
>   		__u32 inner_map_fd;
>   		__u32 btf_vmlinux_value_type_id;
> +		__u32 nr_hash_funcs;
>   	};
>   };
>   
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index da65a1666a5e..e51e68a07aaf 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -378,6 +378,7 @@ struct bpf_map {
>   	char *pin_path;
>   	bool pinned;
>   	bool reused;
> +	__u32 nr_hash_funcs;
>   };
>   
>   enum extern_type {
> @@ -1291,6 +1292,11 @@ static bool bpf_map_type__is_map_in_map(enum bpf_map_type type)
>   	return false;
>   }
>   
> +static inline bool bpf_map__is_bloom_filter(const struct bpf_map *map)
> +{
> +	return map->def.type == BPF_MAP_TYPE_BLOOM_FILTER;
> +}
> +
>   int bpf_object__section_size(const struct bpf_object *obj, const char *name,
>   			     __u32 *size)
>   {
> @@ -2238,6 +2244,10 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
>   			}
>   			map_def->pinning = val;
>   			map_def->parts |= MAP_DEF_PINNING;
> +		} else if (strcmp(name, "nr_hash_funcs") == 0) {
> +			if (!get_map_field_int(map_name, btf, m, &map_def->nr_hash_funcs))
> +				return -EINVAL;
> +			map_def->parts |= MAP_DEF_NR_HASH_FUNCS;
>   		} else {
>   			if (strict) {
>   				pr_warn("map '%s': unknown field '%s'.\n", map_name, name);
> @@ -2266,6 +2276,7 @@ static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def
>   	map->numa_node = def->numa_node;
>   	map->btf_key_type_id = def->key_type_id;
>   	map->btf_value_type_id = def->value_type_id;
> +	map->nr_hash_funcs = def->nr_hash_funcs;
>   
>   	if (def->parts & MAP_DEF_MAP_TYPE)
>   		pr_debug("map '%s': found type = %u.\n", map->name, def->map_type);
> @@ -2290,6 +2301,8 @@ static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def
>   		pr_debug("map '%s': found pinning = %u.\n", map->name, def->pinning);
>   	if (def->parts & MAP_DEF_NUMA_NODE)
>   		pr_debug("map '%s': found numa_node = %u.\n", map->name, def->numa_node);
> +	if (def->parts & MAP_DEF_NR_HASH_FUNCS)
> +		pr_debug("map '%s': found nr_hash_funcs = %u.\n", map->name, def->nr_hash_funcs);
>   
>   	if (def->parts & MAP_DEF_INNER_MAP)
>   		pr_debug("map '%s': found inner map definition.\n", map->name);
> @@ -4616,10 +4629,6 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>   		create_attr.max_entries = def->max_entries;
>   	}
>   
> -	if (bpf_map__is_struct_ops(map))
> -		create_attr.btf_vmlinux_value_type_id =
> -			map->btf_vmlinux_value_type_id;
> -
>   	create_attr.btf_fd = 0;
>   	create_attr.btf_key_type_id = 0;
>   	create_attr.btf_value_type_id = 0;
> @@ -4629,7 +4638,12 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>   		create_attr.btf_value_type_id = map->btf_value_type_id;
>   	}
>   
> -	if (bpf_map_type__is_map_in_map(def->type)) {
> +	if (bpf_map__is_struct_ops(map)) {
> +		create_attr.btf_vmlinux_value_type_id =
> +			map->btf_vmlinux_value_type_id;
> +	} else if (bpf_map__is_bloom_filter(map)) {
> +		create_attr.nr_hash_funcs = map->nr_hash_funcs;
> +	} else if (bpf_map_type__is_map_in_map(def->type)) {
>   		if (map->inner_map) {
>   			err = bpf_object__create_map(obj, map->inner_map, true);
>   			if (err) {
> @@ -8610,6 +8624,14 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
>   	return 0;
>   }
>   
> +int bpf_map__set_nr_hash_funcs(struct bpf_map *map, __u32 nr_hash_funcs)
> +{
> +	if (map->fd >= 0)
> +		return libbpf_err(-EBUSY);
> +	map->nr_hash_funcs = nr_hash_funcs;
> +	return 0;
> +}
> +
>   __u32 bpf_map__btf_key_type_id(const struct bpf_map *map)
>   {
>   	return map ? map->btf_key_type_id : 0;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index d0bedd673273..5c441744f766 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -550,6 +550,8 @@ LIBBPF_API __u32 bpf_map__btf_value_type_id(const struct bpf_map *map);
>   /* get/set map if_index */
>   LIBBPF_API __u32 bpf_map__ifindex(const struct bpf_map *map);
>   LIBBPF_API int bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex);
> +/* set nr_hash_funcs */
> +LIBBPF_API int bpf_map__set_nr_hash_funcs(struct bpf_map *map, __u32 nr_hash_funcs);
>   
>   typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void *);
>   LIBBPF_API int bpf_map__set_priv(struct bpf_map *map, void *priv,
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 9e649cf9e771..ee0e1e7648f4 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -385,6 +385,7 @@ LIBBPF_0.5.0 {
>   		btf__load_vmlinux_btf;
>   		btf_dump__dump_type_data;
>   		libbpf_set_strict_mode;
> +		bpf_map__set_nr_hash_funcs;
>   } LIBBPF_0.4.0;
>   
>   LIBBPF_0.6.0 {
> diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
> index ceb0c98979bc..95dbbeba231f 100644
> --- a/tools/lib/bpf/libbpf_internal.h
> +++ b/tools/lib/bpf/libbpf_internal.h
> @@ -186,8 +186,9 @@ enum map_def_parts {
>   	MAP_DEF_NUMA_NODE	= 0x080,
>   	MAP_DEF_PINNING		= 0x100,
>   	MAP_DEF_INNER_MAP	= 0x200,
> +	MAP_DEF_NR_HASH_FUNCS	= 0x400,
>   
> -	MAP_DEF_ALL		= 0x3ff, /* combination of all above */
> +	MAP_DEF_ALL		= 0x7ff, /* combination of all above */
>   };
>   
>   struct btf_map_def {
> @@ -201,6 +202,7 @@ struct btf_map_def {
>   	__u32 map_flags;
>   	__u32 numa_node;
>   	__u32 pinning;
> +	__u32 nr_hash_funcs;
>   };
>   

I just realized that Andrii's comment on v1 stated that btf_map_def is
fixed indefinitely.

This implies that for bloom filter maps where the number of hash
functions needs to be set, we will not be able to use the BTF-defined
format and will instead need to use the older map definition that uses
bpf_map_def. Is my understanding of this correct? If so, I will go
ahead and fix this for v4.

>   int parse_btf_map_def(const char *map_name, struct btf *btf,

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-21 21:02 ` [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation Joanne Koong
@ 2021-09-21 23:44   ` Andrii Nakryiko
  2021-09-22 19:06     ` Joanne Koong
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-21 23:44 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Kernel Team

On Tue, Sep 21, 2021 at 2:30 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> Bloom filters are a space-efficient probabilistic data structure
> used to quickly test whether an element exists in a set.
> In a bloom filter, false positives are possible whereas false
> negatives should never be.
>
> This patch adds a bloom filter map for bpf programs.
> The bloom filter map supports peek (determining whether an element
> is present in the map) and push (adding an element to the map)
> operations. These operations are exposed to userspace applications
> through the already existing syscalls in the following way:
>
> BPF_MAP_LOOKUP_ELEM -> peek
> BPF_MAP_UPDATE_ELEM -> push
>
> The bloom filter map does not have keys, only values. In light of
> this, the bloom filter map's API matches that of queue stack maps:
> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> APIs to query or add an element to the bloom filter map. When the
> bloom filter map is created, it must be created with a key_size of 0.
>
> For updates, the user will pass in the element to add to the map
> as the value, with a NULL key. For lookups, the user will pass in the
> element to query in the map as the value. In the verifier layer, this
> requires us to modify the argument type of a bloom filter's
> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> the syscall layer, we need to copy over the user value so that in
> bpf_map_peek_elem, we know which specific value to query.
>
> A few things to please take note of:
>  * If there are any concurrent lookups + updates, the user is
> responsible for synchronizing this to ensure no false negative lookups
> occur.
>  * The number of hashes to use for the bloom filter is configurable from
> userspace. If no number is specified, the default used will be 5 hash
> functions. The benchmarks later in this patchset can help compare the
> performance of using different number of hashes on different entry
> sizes. In general, using more hashes decreases the speed of a lookup,
> but lowers the false positive rate of the bloom filter.
>  * Deleting an element in the bloom filter map is not supported.
>  * The bloom filter map may be used as an inner map.
>  * The "max_entries" size that is specified at map creation time is used to
> approximate a reasonable bitmap size for the bloom filter, and is not
> otherwise strictly enforced. If the user wishes to insert more entries into
> the bloom filter than "max_entries", they may do so but they should be
> aware that this may lead to a higher false positive rate.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |   1 +
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/bloom_filter.c      | 185 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  14 ++-
>  kernel/bpf/verifier.c          |  19 +++-
>  tools/include/uapi/linux/bpf.h |   1 +
>  7 files changed, 217 insertions(+), 6 deletions(-)
>  create mode 100644 kernel/bpf/bloom_filter.c
>

See some stylistic nitpicking below (and one non-nitpick, about BTF).

But I just wanted to say that I'm a bit amazed by how much special
casing this BLOOM_FILTER map requires in syscall.c and verifier.c. I
still believe that starting with a BPF helper for hashing would be a
better approach, but oh well.
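
To sketch what I mean (bpf_hash_bytes() here is purely hypothetical,
it's the helper that doesn't exist yet), the BPF program side could
keep the whole filter in a plain global array:

/* NR_WORDS, NR_HASHES, SEED: compile-time constants (assumed) */
__u64 bitset[NR_WORDS]; /* NR_WORDS assumed to be a power of 2 */

static bool bloom_test(void *val, __u32 size)
{
	__u32 i, h;

	for (i = 0; i < NR_HASHES; i++) {
		h = bpf_hash_bytes(val, size, SEED + i); /* hypothetical */
		h &= NR_WORDS * 64 - 1;
		if (!(bitset[h / 64] & (1ULL << (h & 63))))
			return false;
	}
	return true;
}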

[...]

> +
> +static inline u32 hash(struct bpf_bloom_filter *bloom_filter, void *value,
> +                      u64 value_size, u32 index)
> +{
> +       if (bloom_filter->aligned_u32_count)
> +               return jhash2(value, bloom_filter->aligned_u32_count,
> +                             bloom_filter->hash_seed + index) &
> +                       bloom_filter->bit_array_mask;
> +
> +       return jhash(value, value_size, bloom_filter->hash_seed + index) &
> +               bloom_filter->bit_array_mask;

stylistic nit, but this feels way too dense text-wise; this seems
easier to follow

u32 h;

if (bloom_filter->aligned_u32_count)
    h = jhash2(...);
else
    h = jhash(...);
return h & bloom_filter->bit_array_mask;

WDYT?

> +}
> +
> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +       struct bpf_bloom_filter *bloom_filter =
> +               container_of(map, struct bpf_bloom_filter, map);
> +       u32 i;
> +
> +       for (i = 0; i < bloom_filter->nr_hash_funcs; i++) {
> +               if (!test_bit(hash(bloom_filter, value, map->value_size, i),
> +                             bloom_filter->bit_array))
> +                       return -ENOENT;

same here, I think the hash calculation deserves a separate statement
and a local variable

> +       }
> +
> +       return 0;
> +}
> +

[...]

> +static void bloom_filter_map_free(struct bpf_map *map)
> +{
> +       struct bpf_bloom_filter *bloom_filter =
> +               container_of(map, struct bpf_bloom_filter, map);
> +
> +       bpf_map_area_free(bloom_filter);
> +}
> +
> +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value,
> +                                     u64 flags)
> +{
> +       struct bpf_bloom_filter *bloom_filter =
> +               container_of(map, struct bpf_bloom_filter, map);
> +       u32 i;
> +
> +       if (flags != BPF_ANY)
> +               return -EINVAL;
> +
> +       for (i = 0; i < bloom_filter->nr_hash_funcs; i++)
> +               set_bit(hash(bloom_filter, value, map->value_size, i),
> +                       bloom_filter->bit_array);

same as above about hash() call on separate line

> +
> +       return 0;
> +}
> +
> +static void *bloom_filter_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +       /* The eBPF program should use map_peek_elem instead */
> +       return ERR_PTR(-EINVAL);
> +}
> +
> +static int bloom_filter_map_update_elem(struct bpf_map *map, void *key,
> +                                       void *value, u64 flags)
> +{
> +       /* The eBPF program should use map_push_elem instead */
> +       return -EINVAL;
> +}
> +
> +static int bloom_filter_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int bloom_filter_map_get_next_key(struct bpf_map *map, void *key,
> +                                        void *next_key)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int bloom_filter_map_btf_id;
> +const struct bpf_map_ops bloom_filter_map_ops = {
> +       .map_meta_equal = bpf_map_meta_equal,
> +       .map_alloc = bloom_filter_map_alloc,
> +       .map_free = bloom_filter_map_free,
> +       .map_push_elem = bloom_filter_map_push_elem,
> +       .map_peek_elem = bloom_filter_map_peek_elem,
> +       .map_lookup_elem = bloom_filter_map_lookup_elem,
> +       .map_update_elem = bloom_filter_map_update_elem,
> +       .map_delete_elem = bloom_filter_map_delete_elem,
> +       .map_get_next_key = bloom_filter_map_get_next_key,
> +       .map_check_btf = map_check_no_btf,

can you please implement basically a no-op callback here to allow
specifying btf_value_type_id? There is no good reason to restrict this
new map from having a BTF type specified for its value
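
Something like this (untested sketch) should be all that's needed:

static int bloom_filter_map_check_btf(const struct bpf_map *map,
				      const struct btf *btf,
				      const struct btf_type *key_type,
				      const struct btf_type *value_type)
{
	/* the map is keyless, so only a void key makes sense;
	 * any value type is fine
	 */
	return btf_type_is_void(key_type) ? 0 : -EINVAL;
}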

> +       .map_btf_name = "bpf_bloom_filter",
> +       .map_btf_id = &bloom_filter_map_btf_id,
> +};

[...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable
  2021-09-21 21:02 ` [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable Joanne Koong
  2021-09-21 22:24   ` Joanne Koong
@ 2021-09-21 23:57   ` Andrii Nakryiko
  1 sibling, 0 replies; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-21 23:57 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Kernel Team

On Tue, Sep 21, 2021 at 2:30 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> This patch adds the libbpf infrastructure that will allow the user to
> specify a configurable number of hash functions to use for the bloom
> filter map.
>
> Please note that this patch does not enforce that a pinned bloom filter
> map may only be reused if the number of hash functions is the same. If
> they are not the same, the number of hash functions used will be the one
> that was set for the pinned map.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>  include/uapi/linux/bpf.h        |  5 ++++-
>  tools/include/uapi/linux/bpf.h  |  5 ++++-
>  tools/lib/bpf/bpf.c             |  2 ++
>  tools/lib/bpf/bpf.h             |  1 +
>  tools/lib/bpf/libbpf.c          | 32 +++++++++++++++++++++++++++-----
>  tools/lib/bpf/libbpf.h          |  2 ++
>  tools/lib/bpf/libbpf.map        |  1 +
>  tools/lib/bpf/libbpf_internal.h |  4 +++-
>  8 files changed, 44 insertions(+), 8 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fec9fcfe0629..2e3048488feb 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1262,7 +1262,10 @@ union bpf_attr {
>                 __u32   map_flags;      /* BPF_MAP_CREATE related
>                                          * flags defined above.
>                                          */
> -               __u32   inner_map_fd;   /* fd pointing to the inner map */
> +               union {
> +                       __u32   inner_map_fd;   /* fd pointing to the inner map */
> +                       __u32   nr_hash_funcs;  /* or number of hash functions */
> +               };
>                 __u32   numa_node;      /* numa node (effective only if
>                                          * BPF_F_NUMA_NODE is set).
>                                          */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index fec9fcfe0629..2e3048488feb 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1262,7 +1262,10 @@ union bpf_attr {
>                 __u32   map_flags;      /* BPF_MAP_CREATE related
>                                          * flags defined above.
>                                          */
> -               __u32   inner_map_fd;   /* fd pointing to the inner map */
> +               union {
> +                       __u32   inner_map_fd;   /* fd pointing to the inner map */
> +                       __u32   nr_hash_funcs;  /* or number of hash functions */
> +               };
>                 __u32   numa_node;      /* numa node (effective only if
>                                          * BPF_F_NUMA_NODE is set).
>                                          */

these changes should be part of patch 1; otherwise patch 1 leaves the
kernel in a non-compilable state and breaks bisection


> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 2401fad090c5..8a9dd4f6d6c8 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -100,6 +100,8 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
>         if (attr.map_type == BPF_MAP_TYPE_STRUCT_OPS)
>                 attr.btf_vmlinux_value_type_id =
>                         create_attr->btf_vmlinux_value_type_id;
> +       else if (attr.map_type == BPF_MAP_TYPE_BLOOM_FILTER)
> +               attr.nr_hash_funcs = create_attr->nr_hash_funcs;
>         else
>                 attr.inner_map_fd = create_attr->inner_map_fd;
>
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 6fffb3cdf39b..1194b6f01572 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -49,6 +49,7 @@ struct bpf_create_map_attr {
>         union {
>                 __u32 inner_map_fd;
>                 __u32 btf_vmlinux_value_type_id;
> +               __u32 nr_hash_funcs;
>         };
>  };
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index da65a1666a5e..e51e68a07aaf 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -378,6 +378,7 @@ struct bpf_map {
>         char *pin_path;
>         bool pinned;
>         bool reused;
> +       __u32 nr_hash_funcs;

nit: please keep it next to inner_map_fd field

>  };
>
>  enum extern_type {
> @@ -1291,6 +1292,11 @@ static bool bpf_map_type__is_map_in_map(enum bpf_map_type type)
>         return false;
>  }
>
> +static inline bool bpf_map__is_bloom_filter(const struct bpf_map *map)
> +{
> +       return map->def.type == BPF_MAP_TYPE_BLOOM_FILTER;
> +}

this is used in only one place, it's ok to just open-code it

> +
>  int bpf_object__section_size(const struct bpf_object *obj, const char *name,
>                              __u32 *size)
>  {

[...]

> @@ -8610,6 +8624,14 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
>         return 0;
>  }
>
> +int bpf_map__set_nr_hash_funcs(struct bpf_map *map, __u32 nr_hash_funcs)
> +{
> +       if (map->fd >= 0)
> +               return libbpf_err(-EBUSY);

should we disallow this for anything but BLOOM_FILTER?
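
i.e., if we do keep the setter, something like this sketch:

int bpf_map__set_nr_hash_funcs(struct bpf_map *map, __u32 nr_hash_funcs)
{
	if (map->def.type != BPF_MAP_TYPE_BLOOM_FILTER)
		return libbpf_err(-EINVAL);
	if (map->fd >= 0)
		return libbpf_err(-EBUSY);
	map->nr_hash_funcs = nr_hash_funcs;
	return 0;
}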

> +       map->nr_hash_funcs = nr_hash_funcs;
> +       return 0;
> +}
> +
>  __u32 bpf_map__btf_key_type_id(const struct bpf_map *map)
>  {
>         return map ? map->btf_key_type_id : 0;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index d0bedd673273..5c441744f766 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -550,6 +550,8 @@ LIBBPF_API __u32 bpf_map__btf_value_type_id(const struct bpf_map *map);
>  /* get/set map if_index */
>  LIBBPF_API __u32 bpf_map__ifindex(const struct bpf_map *map);
>  LIBBPF_API int bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex);
> +/* set nr_hash_funcs */
> +LIBBPF_API int bpf_map__set_nr_hash_funcs(struct bpf_map *map, __u32 nr_hash_funcs);

Let's not add this setter at all right now. Let's allow setting this
from BPF source code only, in the BTF map definition. It's unlikely
anyone would be setting the number of hash functions dynamically, and I
just hate the idea of a one-off setter for one specific map type. We
can always add it later if there really is a need for it.
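
For reference, the BTF-side definition enabled by the
parse_btf_map_def() change in this patch would look roughly like this
(map name and sizes made up):

struct {
	__uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
	__uint(key_size, 0);
	__uint(value_size, sizeof(__u64));
	__uint(max_entries, 1000);
	__uint(nr_hash_funcs, 3);
} bloom_map SEC(".maps");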

>
>  typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void *);
>  LIBBPF_API int bpf_map__set_priv(struct bpf_map *map, void *priv,

[...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-21 23:44   ` Andrii Nakryiko
@ 2021-09-22 19:06     ` Joanne Koong
  2021-09-22 19:38       ` Martin KaFai Lau
  2021-09-22 20:44       ` Andrii Nakryiko
  0 siblings, 2 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-22 19:06 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Kernel Team


On 9/21/21 4:44 PM, Andrii Nakryiko wrote:
> On Tue, Sep 21, 2021 at 2:30 PM Joanne Koong <joannekoong@fb.com> wrote:
>> Bloom filters are a space-efficient probabilistic data structure
>> used to quickly test whether an element exists in a set.
>> In a bloom filter, false positives are possible whereas false
>> negatives should never be.
>>
>> This patch adds a bloom filter map for bpf programs.
>> The bloom filter map supports peek (determining whether an element
>> is present in the map) and push (adding an element to the map)
>> operations.These operations are exposed to userspace applications
>> through the already existing syscalls in the following way:
>>
>> BPF_MAP_LOOKUP_ELEM -> peek
>> BPF_MAP_UPDATE_ELEM -> push
>>
>> The bloom filter map does not have keys, only values. In light of
>> this, the bloom filter map's API matches that of queue stack maps:
>> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
>> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
>> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
>> APIs to query or add an element to the bloom filter map. When the
>> bloom filter map is created, it must be created with a key_size of 0.
>>
>> For updates, the user will pass in the element to add to the map
>> as the value, with a NULL key. For lookups, the user will pass in the
>> element to query in the map as the value. In the verifier layer, this
>> requires us to modify the argument type of a bloom filter's
>> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
>> the syscall layer, we need to copy over the user value so that in
>> bpf_map_peek_elem, we know which specific value to query.
>>
>> A few things to please take note of:
>>   * If there are any concurrent lookups + updates, the user is
>> responsible for synchronizing this to ensure no false negative lookups
>> occur.
>>   * The number of hashes to use for the bloom filter is configurable from
>> userspace. If no number is specified, the default used will be 5 hash
>> functions. The benchmarks later in this patchset can help compare the
>> performance of using different number of hashes on different entry
>> sizes. In general, using more hashes decreases the speed of a lookup,
>> but lowers the false positive rate of the bloom filter.
>>   * Deleting an element in the bloom filter map is not supported.
>>   * The bloom filter map may be used as an inner map.
>>   * The "max_entries" size that is specified at map creation time is used to
>> approximate a reasonable bitmap size for the bloom filter, and is not
>> otherwise strictly enforced. If the user wishes to insert more entries into
>> the bloom filter than "max_entries", they may do so but they should be
>> aware that this may lead to a higher false positive rate.
>>
>> Signed-off-by: Joanne Koong <joannekoong@fb.com>
>> ---
>>   include/linux/bpf_types.h      |   1 +
>>   include/uapi/linux/bpf.h       |   1 +
>>   kernel/bpf/Makefile            |   2 +-
>>   kernel/bpf/bloom_filter.c      | 185 +++++++++++++++++++++++++++++++++
>>   kernel/bpf/syscall.c           |  14 ++-
>>   kernel/bpf/verifier.c          |  19 +++-
>>   tools/include/uapi/linux/bpf.h |   1 +
>>   7 files changed, 217 insertions(+), 6 deletions(-)
>>   create mode 100644 kernel/bpf/bloom_filter.c
>>
> See some stylistic nitpicking below (and one non-nitpick, about BTF).
>
> But I just wanted to say that I'm a bit amazed by how much special
> casing this BLOOM_FILTER map requires in syscall.c and verifier.c. I
> still believe that starting with a BPF helper for hashing would be a
> better approach, but oh well.
>
> [...]
I liked your comment on v1 regarding using a BPF helper and I agree
with the benefits you outlined. I'm curious to see what the performance
differences between that approach and this one end up being, if any. I
plan to test out the BPF helper approach in a few weeks, and if the
performance is comparable or better, I am definitely open to reverting
this code and just going with the BPF helper approach :)
>> +
>> +static inline u32 hash(struct bpf_bloom_filter *bloom_filter, void *value,
>> +                      u64 value_size, u32 index)
>> +{
>> +       if (bloom_filter->aligned_u32_count)
>> +               return jhash2(value, bloom_filter->aligned_u32_count,
>> +                             bloom_filter->hash_seed + index) &
>> +                       bloom_filter->bit_array_mask;
>> +
>> +       return jhash(value, value_size, bloom_filter->hash_seed + index) &
>> +               bloom_filter->bit_array_mask;
> stylistic nit, but this feels way too dense text-wise; this seems
> easier to follow
>
> u32 h;
>
> if (bloom_filter->aligned_u32_count)
>      h = jhash2(...);
> else
>      h = jhash(...);
> return h & bloom_filter->bit_array_mask;
>
> WDYT?
I think this sounds great! I will make these changes for v4.
>> +}
>> +
>> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
>> +{
>> +       struct bpf_bloom_filter *bloom_filter =
>> +               container_of(map, struct bpf_bloom_filter, map);
>> +       u32 i;
>> +
>> +       for (i = 0; i < bloom_filter->nr_hash_funcs; i++) {
>> +               if (!test_bit(hash(bloom_filter, value, map->value_size, i),
>> +                             bloom_filter->bit_array))
>> +                       return -ENOENT;
> same here, I think the hash calculation deserves a separate statement
> and a local variable
>
>> +       }
>> +
>> +       return 0;
>> +}
>> +
> [...]
>
>> +static void bloom_filter_map_free(struct bpf_map *map)
>> +{
>> +       struct bpf_bloom_filter *bloom_filter =
>> +               container_of(map, struct bpf_bloom_filter, map);
>> +
>> +       bpf_map_area_free(bloom_filter);
>> +}
>> +
>> +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value,
>> +                                     u64 flags)
>> +{
>> +       struct bpf_bloom_filter *bloom_filter =
>> +               container_of(map, struct bpf_bloom_filter, map);
>> +       u32 i;
>> +
>> +       if (flags != BPF_ANY)
>> +               return -EINVAL;
>> +
>> +       for (i = 0; i < bloom_filter->nr_hash_funcs; i++)
>> +               set_bit(hash(bloom_filter, value, map->value_size, i),
>> +                       bloom_filter->bit_array);
> same as above about hash() call on separate line
>
>> +
>> +       return 0;
>> +}
>> +
>> +static void *bloom_filter_map_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> +       /* The eBPF program should use map_peek_elem instead */
>> +       return ERR_PTR(-EINVAL);
>> +}
>> +
>> +static int bloom_filter_map_update_elem(struct bpf_map *map, void *key,
>> +                                       void *value, u64 flags)
>> +{
>> +       /* The eBPF program should use map_push_elem instead */
>> +       return -EINVAL;
>> +}
>> +
>> +static int bloom_filter_map_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +       return -EOPNOTSUPP;
>> +}
>> +
>> +static int bloom_filter_map_get_next_key(struct bpf_map *map, void *key,
>> +                                        void *next_key)
>> +{
>> +       return -EOPNOTSUPP;
>> +}
>> +
>> +static int bloom_filter_map_btf_id;
>> +const struct bpf_map_ops bloom_filter_map_ops = {
>> +       .map_meta_equal = bpf_map_meta_equal,
>> +       .map_alloc = bloom_filter_map_alloc,
>> +       .map_free = bloom_filter_map_free,
>> +       .map_push_elem = bloom_filter_map_push_elem,
>> +       .map_peek_elem = bloom_filter_map_peek_elem,
>> +       .map_lookup_elem = bloom_filter_map_lookup_elem,
>> +       .map_update_elem = bloom_filter_map_update_elem,
>> +       .map_delete_elem = bloom_filter_map_delete_elem,
>> +       .map_get_next_key = bloom_filter_map_get_next_key,
>> +       .map_check_btf = map_check_no_btf,
> can you please implement basically a no-op callback here to allow
> specifying btf_value_type_id? There is no good reason to restrict this
> new map from having a BTF type specified for its value
Sounds great, will add this in v4.
>> +       .map_btf_name = "bpf_bloom_filter",
>> +       .map_btf_id = &bloom_filter_map_btf_id,
>> +};
> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-22 19:06     ` Joanne Koong
@ 2021-09-22 19:38       ` Martin KaFai Lau
  2021-09-22 20:52         ` Andrii Nakryiko
  2021-09-22 20:44       ` Andrii Nakryiko
  1 sibling, 1 reply; 37+ messages in thread
From: Martin KaFai Lau @ 2021-09-22 19:38 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Andrii Nakryiko, bpf, Kernel Team

On Wed, Sep 22, 2021 at 12:06:02PM -0700, Joanne Koong wrote:
> 
> On 9/21/21 4:44 PM, Andrii Nakryiko wrote:
> > On Tue, Sep 21, 2021 at 2:30 PM Joanne Koong <joannekoong@fb.com> wrote:
> > > Bloom filters are a space-efficient probabilistic data structure
> > > used to quickly test whether an element exists in a set.
> > > In a bloom filter, false positives are possible whereas false
> > > negatives should never be.
> > > 
> > > This patch adds a bloom filter map for bpf programs.
> > > The bloom filter map supports peek (determining whether an element
> > > is present in the map) and push (adding an element to the map)
> > > operations. These operations are exposed to userspace applications
> > > through the already existing syscalls in the following way:
> > > 
> > > BPF_MAP_LOOKUP_ELEM -> peek
> > > BPF_MAP_UPDATE_ELEM -> push
> > > 
> > > The bloom filter map does not have keys, only values. In light of
> > > this, the bloom filter map's API matches that of queue stack maps:
> > > user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> > > which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> > > and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> > > APIs to query or add an element to the bloom filter map. When the
> > > bloom filter map is created, it must be created with a key_size of 0.
> > > 
> > > For updates, the user will pass in the element to add to the map
> > > as the value, with a NULL key. For lookups, the user will pass in the
> > > element to query in the map as the value. In the verifier layer, this
> > > requires us to modify the argument type of a bloom filter's
> > > BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> > > the syscall layer, we need to copy over the user value so that in
> > > bpf_map_peek_elem, we know which specific value to query.
> > > 
> > > A few things to please take note of:
> > >   * If there are any concurrent lookups + updates, the user is
> > > responsible for synchronizing this to ensure no false negative lookups
> > > occur.
> > >   * The number of hashes to use for the bloom filter is configurable from
> > > userspace. If no number is specified, the default used will be 5 hash
> > > functions. The benchmarks later in this patchset can help compare the
> > > performance of using different number of hashes on different entry
> > > sizes. In general, using more hashes decreases the speed of a lookup,
> > > but lowers the false positive rate of the bloom filter.
> > >   * Deleting an element in the bloom filter map is not supported.
> > >   * The bloom filter map may be used as an inner map.
> > >   * The "max_entries" size that is specified at map creation time is used to
> > > approximate a reasonable bitmap size for the bloom filter, and is not
> > > otherwise strictly enforced. If the user wishes to insert more entries into
> > > the bloom filter than "max_entries", they may do so but they should be
> > > aware that this may lead to a higher false positive rate.
> > > 
> > > Signed-off-by: Joanne Koong <joannekoong@fb.com>
> > > ---
> > >   include/linux/bpf_types.h      |   1 +
> > >   include/uapi/linux/bpf.h       |   1 +
> > >   kernel/bpf/Makefile            |   2 +-
> > >   kernel/bpf/bloom_filter.c      | 185 +++++++++++++++++++++++++++++++++
> > >   kernel/bpf/syscall.c           |  14 ++-
> > >   kernel/bpf/verifier.c          |  19 +++-
> > >   tools/include/uapi/linux/bpf.h |   1 +
> > >   7 files changed, 217 insertions(+), 6 deletions(-)
> > >   create mode 100644 kernel/bpf/bloom_filter.c
> > > 
> > See some stylistic nitpicking below (and one non-nitpick, about BTF).
> > 
> > But I just wanted to say that I'm a bit amazed by how much special
> > casing this BLOOM_FILTER map requires in syscall.c and verifier.c. I
> > still believe that starting with a BPF helper for hashing would be a
> > better approach, but oh well.
> > 
> > [...]
> I liked your comment on v1 regarding using a BPF helper and I agree with the
> benefits you outlined. I'm curious to see what the performance differences between
> that approach and this one end up being, if any. I plan to test out the BPF helper
> approach in a few weeks, and if the performance is comparable or better, I am definitely open to
> reverting this code and just going with the BPF helper approach :)
Reverting won't be an option and I don't think it is necessary.

Agree that a generic hash helper is useful in general.  It may be
useful for hashing the skb also.  A bpf-prog-only implementation would
have more flexibility in configuring things: whether to round up to a
power of 2, how to hash, how many hashes, the number of bits, etc.  In
the meantime, though, the bpf prog and user space would need to
coordinate more and worry about more things, e.g. how to reuse a bloom
filter with a different nr_hashes or nr_bits, how to handle
synchronization, etc.
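
(For reference, the classic estimate for a filter with m bits, n
entries, and k hashes is a false positive rate of roughly
(1 - e^(-k*n/m))^k, minimized around k = (m/n) * ln 2, which is why
nr_hashes and nr_bits want to be tuned together.)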

It is useful to have a default implementation in the kernel for a map
like this one that works for the most common cases, so that the bpf
user can just pick it up and go, like all other common bpf maps.

imo, the verifier/syscall change here is also quite minimal, and it
mostly rides on top of the existing BPF_MAP_TYPE_STACK support.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-22 19:06     ` Joanne Koong
  2021-09-22 19:38       ` Martin KaFai Lau
@ 2021-09-22 20:44       ` Andrii Nakryiko
  1 sibling, 0 replies; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-22 20:44 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Kernel Team

On Wed, Sep 22, 2021 at 12:06 PM Joanne Koong <joannekoong@fb.com> wrote:
>
>
> On 9/21/21 4:44 PM, Andrii Nakryiko wrote:
> > On Tue, Sep 21, 2021 at 2:30 PM Joanne Koong <joannekoong@fb.com> wrote:
> >> Bloom filters are a space-efficient probabilistic data structure
> >> used to quickly test whether an element exists in a set.
> >> In a bloom filter, false positives are possible whereas false
> >> negatives should never be.
> >>
> >> This patch adds a bloom filter map for bpf programs.
> >> The bloom filter map supports peek (determining whether an element
> >> is present in the map) and push (adding an element to the map)
> >> operations. These operations are exposed to userspace applications
> >> through the already existing syscalls in the following way:
> >>
> >> BPF_MAP_LOOKUP_ELEM -> peek
> >> BPF_MAP_UPDATE_ELEM -> push
> >>
> >> The bloom filter map does not have keys, only values. In light of
> >> this, the bloom filter map's API matches that of queue stack maps:
> >> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> >> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> >> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> >> APIs to query or add an element to the bloom filter map. When the
> >> bloom filter map is created, it must be created with a key_size of 0.
> >>
> >> For updates, the user will pass in the element to add to the map
> >> as the value, with a NULL key. For lookups, the user will pass in the
> >> element to query in the map as the value. In the verifier layer, this
> >> requires us to modify the argument type of a bloom filter's
> >> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> >> the syscall layer, we need to copy over the user value so that in
> >> bpf_map_peek_elem, we know which specific value to query.
> >>
> >> A few things to please take note of:
> >>   * If there are any concurrent lookups + updates, the user is
> >> responsible for synchronizing this to ensure no false negative lookups
> >> occur.
> >>   * The number of hashes to use for the bloom filter is configurable from
> >> userspace. If no number is specified, the default used will be 5 hash
> >> functions. The benchmarks later in this patchset can help compare the
> >> performance of using different number of hashes on different entry
> >> sizes. In general, using more hashes decreases the speed of a lookup,
> >> but lowers the false positive rate of the bloom filter.
> >>   * Deleting an element in the bloom filter map is not supported.
> >>   * The bloom filter map may be used as an inner map.
> >>   * The "max_entries" size that is specified at map creation time is used to
> >> approximate a reasonable bitmap size for the bloom filter, and is not
> >> otherwise strictly enforced. If the user wishes to insert more entries into
> >> the bloom filter than "max_entries", they may do so but they should be
> >> aware that this may lead to a higher false positive rate.
> >>
> >> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> >> ---
> >>   include/linux/bpf_types.h      |   1 +
> >>   include/uapi/linux/bpf.h       |   1 +
> >>   kernel/bpf/Makefile            |   2 +-
> >>   kernel/bpf/bloom_filter.c      | 185 +++++++++++++++++++++++++++++++++
> >>   kernel/bpf/syscall.c           |  14 ++-
> >>   kernel/bpf/verifier.c          |  19 +++-
> >>   tools/include/uapi/linux/bpf.h |   1 +
> >>   7 files changed, 217 insertions(+), 6 deletions(-)
> >>   create mode 100644 kernel/bpf/bloom_filter.c
> >>
> > See some stylistic nitpicking below (and one non-nitpick, about BTF).
> >
> > But I just wanted to say that I'm a bit amazed by how much special
> > casing this BLOOM_FILTER map requires in syscall.c and verifier.c. I
> > still believe that starting with a BPF helper for hashing would be a
> > better approach, but oh well.
> >
> > [...]
> I liked your comment on v1 regarding using a BPF helper and I agree
> with the benefits you outlined. I'm curious to see what the
> performance differences between that approach and this one end up
> being, if any. I plan to test out the BPF helper approach in a few
> weeks, and if the performance is comparable or better, I am
> definitely open to reverting this code and just going with the BPF
> helper approach :)

I got curious myself yesterday and spent the evening trying this out.
There is a slight regression in pure bloom filter query performance.
More confusingly, with such a small difference in bloom filter
performance, there is a much more noticeable difference in the hybrid
bloom+hashmap benchmark. I don't know how to explain the difference. It
might just be the multiplying effect of small perf differences on the
subsequent hashmap lookup. TBH, we are talking about millions of
operations per second, and I don't think it will matter much in
practice. It's like XDP microbenchmarks: as soon as you start doing
something useful, the real work dwarfs whatever small overhead
BPF/XDP/the bloom filter adds. But I'm not going to argue about that
here :)

I just sent an RFC patch set to demonstrate the approach and compare
with your results. They are not in Patchwork yet, but they are on
lore for those following along ([0]).

While doing that I found a problem with the way you implemented the
benchmark measurements. I've also optimized a few stats-collecting
parts. Please take a look at the RFC. If anything, it would be good to
incorporate those fixes.

It would also be good to add a benchmark that tests keys bigger than 8
bytes, as mentioned before. I haven't found that in your benchmarks.
Also, the way the benchmark is implemented right now, it spends lots of
(measured) time doing things like generating random values on the fly.
There is also no control over the ratio of hits and misses. It feels
like the benchmark is over-optimizing for the happy case of the bloom
filter returning a true negative and thus proceeding very fast. There
are barely any hits at all (they don't even register in the benchmark).
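
A hypothetical tweak to the benchmark prog (hit_pct, pool_vals, and
POOL_SZ are all made-up names) that would give control over that ratio:

	if (bpf_get_prandom_u32() % 100 < hit_pct) {
		/* force a hit: draw from a pool of values that were
		 * inserted into the hashmap during setup
		 */
		val.data64 = pool_vals[i & (POOL_SZ - 1)];
	} else {
		/* almost certainly a miss in a 2^64 value space */
		val.data32[0] = bpf_get_prandom_u32();
		val.data32[1] = bpf_get_prandom_u32();
	}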

It also turned out that you don't benchmark updating the bloom filter
from the BPF program side. Was that intentional, or do you anticipate
that in practice it won't be necessary? I was curious to compare those
results because they rely on BPF atomics, but I didn't go all the way
to adding them.
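
(The BPF-side update would boil down to an atomic or on each bitset
word, e.g. __sync_fetch_and_or(&word, 1ULL << bit), which is where I
expected the extra cost to show up.)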


  [0] https://lore.kernel.org/bpf/20210922203224.912809-1-andrii@kernel.org/T/#t

> >> +
> >> +static inline u32 hash(struct bpf_bloom_filter *bloom_filter, void *value,
> >> +                      u64 value_size, u32 index)
> >> +{
> >> +       if (bloom_filter->aligned_u32_count)
> >> +               return jhash2(value, bloom_filter->aligned_u32_count,
> >> +                             bloom_filter->hash_seed + index) &
> >> +                       bloom_filter->bit_array_mask;
> >> +
> >> +       return jhash(value, value_size, bloom_filter->hash_seed + index) &
> >> +               bloom_filter->bit_array_mask;
> > stylistic nit, but this feels way too dense text-wise; this seems
> > easier to follow
> >
> > u32 h;
> >
> > if (bloom_filter->aligned_u32_count)
> >      h = jhash2(...);
> > else
> >      h = jhash(...);
> > return h & bloom_filter->bit_array_mask;
> >
> > WDYT?
> I think this sounds great! I will make these changes for v4.

great

> >> +}
> >> +

[...]

> >> +
> >> +static int bloom_filter_map_btf_id;
> >> +const struct bpf_map_ops bloom_filter_map_ops = {
> >> +       .map_meta_equal = bpf_map_meta_equal,
> >> +       .map_alloc = bloom_filter_map_alloc,
> >> +       .map_free = bloom_filter_map_free,
> >> +       .map_push_elem = bloom_filter_map_push_elem,
> >> +       .map_peek_elem = bloom_filter_map_peek_elem,
> >> +       .map_lookup_elem = bloom_filter_map_lookup_elem,
> >> +       .map_update_elem = bloom_filter_map_update_elem,
> >> +       .map_delete_elem = bloom_filter_map_delete_elem,
> >> +       .map_get_next_key = bloom_filter_map_get_next_key,
> >> +       .map_check_btf = map_check_no_btf,
> > can you please implement basically a no-op callback here to allow
> > specifying btf_value_type_id? There is no good reason to restrict this
> > new map from having a BTF type specified for its value
> Sounds great, will add this in v4.

cool, we should try to support BTF for all new maps


> >> +       .map_btf_name = "bpf_bloom_filter",
> >> +       .map_btf_id = &bloom_filter_map_btf_id,
> >> +};
> > [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-22 19:38       ` Martin KaFai Lau
@ 2021-09-22 20:52         ` Andrii Nakryiko
  2021-09-22 22:08           ` Martin KaFai Lau
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-22 20:52 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Joanne Koong, bpf, Kernel Team

On Wed, Sep 22, 2021 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Sep 22, 2021 at 12:06:02PM -0700, Joanne Koong wrote:
> >
> > On 9/21/21 4:44 PM, Andrii Nakryiko wrote:
> > > On Tue, Sep 21, 2021 at 2:30 PM Joanne Koong <joannekoong@fb.com> wrote:
> > > > Bloom filters are a space-efficient probabilistic data structure
> > > > used to quickly test whether an element exists in a set.
> > > > In a bloom filter, false positives are possible whereas false
> > > > negatives should never be.
> > > >
> > > > This patch adds a bloom filter map for bpf programs.
> > > > The bloom filter map supports peek (determining whether an element
> > > > is present in the map) and push (adding an element to the map)
> > > > operations. These operations are exposed to userspace applications
> > > > through the already existing syscalls in the following way:
> > > >
> > > > BPF_MAP_LOOKUP_ELEM -> peek
> > > > BPF_MAP_UPDATE_ELEM -> push
> > > >
> > > > The bloom filter map does not have keys, only values. In light of
> > > > this, the bloom filter map's API matches that of queue stack maps:
> > > > user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> > > > which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> > > > and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> > > > APIs to query or add an element to the bloom filter map. When the
> > > > bloom filter map is created, it must be created with a key_size of 0.
> > > >
> > > > For updates, the user will pass in the element to add to the map
> > > > as the value, with a NULL key. For lookups, the user will pass in the
> > > > element to query in the map as the value. In the verifier layer, this
> > > > requires us to modify the argument type of a bloom filter's
> > > > BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> > > > the syscall layer, we need to copy over the user value so that in
> > > > bpf_map_peek_elem, we know which specific value to query.
> > > >
> > > > A few things to please take note of:
> > > >   * If there are any concurrent lookups + updates, the user is
> > > > responsible for synchronizing this to ensure no false negative lookups
> > > > occur.
> > > >   * The number of hashes to use for the bloom filter is configurable from
> > > > userspace. If no number is specified, the default used will be 5 hash
> > > > functions. The benchmarks later in this patchset can help compare the
> > > > performance of using different numbers of hashes on different entry
> > > > sizes. In general, using more hashes decreases the speed of a lookup,
> > > > but also decreases the false positive rate of an element being
> > > > detected in the bloom filter.
> > > >   * Deleting an element in the bloom filter map is not supported.
> > > >   * The bloom filter map may be used as an inner map.
> > > >   * The "max_entries" size that is specified at map creation time is used to
> > > > approximate a reasonable bitmap size for the bloom filter, and is not
> > > > otherwise strictly enforced. If the user wishes to insert more entries into
> > > > the bloom filter than "max_entries", they may do so but they should be
> > > > aware that this may lead to a higher false positive rate.
> > > >
> > > > Signed-off-by: Joanne Koong <joannekoong@fb.com>
> > > > ---
> > > >   include/linux/bpf_types.h      |   1 +
> > > >   include/uapi/linux/bpf.h       |   1 +
> > > >   kernel/bpf/Makefile            |   2 +-
> > > >   kernel/bpf/bloom_filter.c      | 185 +++++++++++++++++++++++++++++++++
> > > >   kernel/bpf/syscall.c           |  14 ++-
> > > >   kernel/bpf/verifier.c          |  19 +++-
> > > >   tools/include/uapi/linux/bpf.h |   1 +
> > > >   7 files changed, 217 insertions(+), 6 deletions(-)
> > > >   create mode 100644 kernel/bpf/bloom_filter.c
> > > >
> > > See some stylistic nitpicking below (and not a nitpicking about BTF).
> > >
> > > But I just wanted to say that I'm a bit amazed by how much special
> > > casing this BLOOM_FILTER map requires in syscall.c and verifier.c. I
> > > still believe that starting with a BPF helper for hashing would be a
> > > better approach, but oh well.
> > >
> > > [...]
> > I liked your comment on v1 regarding using a BPF helper and I agree
> > with the benefits you outlined. I'm curious to see what the performance
> > differences between that approach and this one end up being, if any. I
> > plan to test out the BPF helper approach in a few weeks, and if the
> > performance is comparable or better, I am definitely open to reverting
> > this code and just going with the BPF helper approach :)
> Reverting won't be an option and I don't think it is necessary.
>
> Agree that a generic hash helper is in general useful.  It may be
> useful in hashing the skb also.  The bpf-prog-only implementation could
> have more flexibility in configuring roundup to pow2 or not, how to hash,
> how many hashes, nr of bits ...etc.  In the meantime, the bpf prog and

Exactly. If I know better how many bits I need, I'll have to
reverse-engineer the kernel's heuristic to provide such max_entries
values as to arrive at the desired amount of memory that the Bloom
filter will be using.

> user space need to co-ordinate more and worry about more things,
> e.g. how to reuse a bloom filter with different nr_hashes,
> nr_bits, handle synchronization...etc.

Please see my RFC ([0]). I don't think there is much to coordinate. It
could be purely BPF-side code, or BPF + user-space initialization
code, depending on the need. It's a simple and beautiful algorithm,
and BPF is powerful enough to implement a custom version easily.

  [0] https://lore.kernel.org/bpf/20210922203224.912809-1-andrii@kernel.org/T/#t
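
For those who don't want to open the RFC right away, the gist of the
BPF-side code is roughly this (a sketch: bpf_jhash_mem() is the helper
proposed in the RFC, and the filter size, seed handling, and names are
all made up; includes omitted):

/* proposed helper from the RFC, not in any kernel yet */
extern __u64 bpf_jhash_mem(const void *data, __u32 sz, __u32 seed);

#define NR_BITS   (1u << 20)  /* 2^20-bit filter, illustrative */
#define NR_HASHES 5           /* compile-time user choice */

__u64 bloom[NR_BITS / 64];    /* plain global data, mmap-able */
__u32 bloom_seed;             /* set once by user space */

static __always_inline bool bloom_contains(const void *data, __u32 sz)
{
	__u32 i, bit;

	for (i = 0; i < NR_HASHES; i++) {
		bit = bpf_jhash_mem(data, sz, bloom_seed + i) & (NR_BITS - 1);
		if (!(bloom[bit / 64] & (1ULL << (bit & 63))))
			return false; /* definitely not in the set */
	}
	return true;                  /* maybe in the set */
}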

>
> It is useful to have a default implementation in the kernel
> for some useful maps like this one that works for most
> common cases and the bpf user can just use it as get-and-go
> like all other common bpf maps do.

I disagree with the premise that the Bloom filter is a common and
generally useful data structure, tbh. It has its nice niche
applications, but its semantics aren't applicable generally, which is
why I hesitate to claim that this should live in the kernel.

>
> imo, the verifier/syscall change here is quite minimal also and it
> is mostly riding on top of the existing BPF_MAP_TYPE_STACK.

Not exactly true; it adds quite a few new custom pieces. But it's
subjective, so there is no point in arguing.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-22 20:52         ` Andrii Nakryiko
@ 2021-09-22 22:08           ` Martin KaFai Lau
  2021-09-22 23:07             ` Andrii Nakryiko
  0 siblings, 1 reply; 37+ messages in thread
From: Martin KaFai Lau @ 2021-09-22 22:08 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, bpf, Kernel Team

On Wed, Sep 22, 2021 at 01:52:12PM -0700, Andrii Nakryiko wrote:
> > Agree that a generic hash helper is in general useful.  It may be
> > useful in hashing the skb also.  The bpf-prog-only implementation could
> > have more flexibility in configuring roundup to pow2 or not, how to hash,
> > how many hashes, nr of bits ...etc.  In the meantime, the bpf prog and
> 
> Exactly. If I know better how many bits I need, I'll have to reverse
> engineer kernel's heuristic to provide such max_entries values to
> arrive at the desired amount of memory that Bloom filter will be
> using.
Good point, but I don't think it needs to guess.  The formula is stable
and publicly known.  The formula comment from kernel/bpf/bloom_filter.c
should be moved to include/uapi/linux/bpf.h.

> > user space need to co-ordinate more and worry about more things,
> > e.g. how to reuse a bloom filter with different nr_hashes,
> > nr_bits, handle synchronization...etc.
> 
> Please see my RFC ([0]). I don't think there is much to coordinate. It
> could be purely BPF-side code, or BPF + user-space initialization
> code, depending on the need. It's a simple and beautiful algorithm,
> and BPF is powerful enough to implement a custom version easily.
> 
>   [0] https://lore.kernel.org/bpf/20210922203224.912809-1-andrii@kernel.org/T/#t
In practice, the bloom filter will be populated only once by the userspace.

The future update will be done by map-in-map to replace the whole bloom
filter, maybe with more max_entries and more nr_hashes, maybe with fewer
max_entries and fewer nr_hashes.

Currently, the continuously running bpf prog using this bloom filter does
not need to worry about any change in the newer bloom filter
configuration/setup.

I wonder how that may look in the custom bpf bloom filter in the
bench prog for the map-in-map usage.

> 
> >
> > It is useful to have a default implementation in the kernel
> > for some useful maps like this one that works for most
> > common cases and the bpf user can just use it as get-and-go
> > like all other common bpf maps do.
> 
> I disagree with the premise that Bloom filter is a common and
> generally useful data structure, tbh. It has its nice niche
> applications, but its semantics isn't applicable generally, which is
> why I hesitate to claim that this should live in kernel.
I don't agree that the application is a nice niche.  I have encountered this
many times when bumping into networking use-case discussions, and it is not
necessarily limited to security usage either.  Yes, it is not a
linked-list-like data structure, but its usage is very common.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-22 22:08           ` Martin KaFai Lau
@ 2021-09-22 23:07             ` Andrii Nakryiko
  2021-09-23  1:28               ` Martin KaFai Lau
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-22 23:07 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Joanne Koong, bpf, Kernel Team

On Wed, Sep 22, 2021 at 3:08 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Sep 22, 2021 at 01:52:12PM -0700, Andrii Nakryiko wrote:
> > > Agree that a generic hash helper is in general useful.  It may be
> > > useful in hashing the skb also.  The bpf prog only implementation could
> > > have more flexibility in configuring roundup to pow2 or not, how to hash,
> > > how many hashes, nr of bits ...etc.  In the mean time, the bpf prog and
> >
> > Exactly. If I know better how many bits I need, I'll have to reverse
> > engineer kernel's heuristic to provide such max_entries values to
> > arrive at the desired amount of memory that Bloom filter will be
> > using.
> Good point. I don't think it needs to guess.  The formula is stable
> and publicly known also.  The formula comment from kernel/bpf/bloom_filter.c
> should be moved to the include/uapi/linux/bpf.h.

By guessing I mean you'd need to invert the formula to calculate the
max_entries value required just to size the bloom filter to the desired
size, at which point max_entries won't really mean much.

Also, the whole "let's choose 5 as the number of hash functions" approach
seems pretty arbitrary and unexplained. How was this chosen? Based
on one very particular benchmark? How can we tell it's good for most
use cases? How did we arrive at 5 and not 3 or 4?

Looking at [0], there are many ways to go about sizing a Bloom filter.
Assuming the Wikipedia formula we are using (M = N * K / ln(2)), and
given that "for a given false positive probability ε, the length of a
Bloom filter m is proportionate to the number of elements being
filtered n and the required number of hash functions only depends on
the target false positive probability ε" (from [1]) and the
corresponding formula, 5 hash functions give a false positive
probability of around 3%, if my math is right. Did we state those 3%
anywhere and how we arrived at them?

But there are multiple parameters involved in this decision; they are
interconnected, and different subsets of them might be driving the
user's choice:
  - number of expected unique elements;
  - number of bits allocated;
  - acceptable false positive rate;
  - number of hash functions.

What if I want a false positive probability of 1%, or maybe 10%, or
0.1%, but not 3%? What if I'm concerned about memory usage and would
rather save some memory at the expense of more false positives? Or
calling too many hash functions is prohibitive, but I can allocate more
memory to reduce the chance of hash collisions. There are many ways to
slice this. The kernel implementation makes *some* assumptions with
little justification and explanation. It's opinionated in one place
(M = N * K / ln(2)), but leaves it up to the user to make sense of how
many hash functions they need (K = -log2(eps)). That feels quite
unusual and underspecified for a kernel data structure. It would
probably be good to provide a bit of guidance in the map description, I
suppose.
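
Spelled out, the standard sizing relations from [0]/[1] (a minimal
sketch; variable names are mine, nothing here is from the patch itself):

#include <math.h>
#include <stdio.h>

/* For N expected elements and a target false positive rate eps:
 *   K = -log2(eps)        optimal number of hash functions
 *   M = N * K / ln(2)     bits needed at that optimum
 * e.g. K = 5 corresponds to eps = 2^-5, i.e. ~3.1%.
 */
int main(void)
{
	double n = 100000;  /* expected unique elements */
	double eps = 0.01;  /* desired false positive rate (1%) */

	double k = ceil(-log2(eps));     /* 7 for 1% */
	double m = ceil(n * k / log(2)); /* log() is natural log */

	printf("k = %.0f hash funcs, m = %.0f bits (~%.0f KiB)\n",
	       k, m, m / 8 / 1024);
	return 0;
}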

Anyways, I've spent way too much time on this. It was fun, but I'll
shut up and go do something else.

  [0] https://en.wikipedia.org/wiki/Bloom_filter
  [1] https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions

>
> > > user space need to co-ordinate more and worry about more things,
> > > e.g. how to reuse a bloom filter with different nr_hashes,
> > > nr_bits, handle synchronization...etc.
> >
> > Please see my RFC ([0]). I don't think there is much to coordinate. It
> > could be purely BPF-side code, or BPF + user-space initialization
> > code, depending on the need. It's a simple and beautiful algorithm,
> > and BPF is powerful enough to implement a custom version easily.
> >
> >   [0] https://lore.kernel.org/bpf/20210922203224.912809-1-andrii@kernel.org/T/#t
> In practice, the bloom filter will be populated only once by the userspace.
>
> The future update will be done by map-in-map to replace the whole bloom
> filter, maybe with more max_entries and more nr_hashes, maybe with fewer
> max_entries and fewer nr_hashes.
>
> Currently, the continuously running bpf prog using this bloom filter does
> not need to worry about any change in the newer bloom filter
> configuration/setup.
>
> I wonder how that may look in the custom bpf bloom filter in the
> bench prog for the map-in-map usage.

You'd have to use BPF_MAP_TYPE_ARRAY for the map-in-map use case.

>
> >
> > >
> > > It is useful to have a default implementation in the kernel
> > > for some useful maps like this one that works for most
> > > common cases and the bpf user can just use it as get-and-go
> > > like all other common bpf maps do.
> >
> > I disagree with the premise that Bloom filter is a common and
> > generally useful data structure, tbh. It has its nice niche
> > applications, but its semantics isn't applicable generally, which is
> > why I hesitate to claim that this should live in kernel.
> I don't agree that the application is a nice niche.  I have encountered this
> many times when bumping into networking use-case discussions, and it is not
> necessarily limited to security usage either.  Yes, it is not a
> linked-list-like data structure, but its usage is very common.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable
  2021-09-21 22:24   ` Joanne Koong
@ 2021-09-22 23:14     ` Andrii Nakryiko
  0 siblings, 0 replies; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-22 23:14 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Kernel Team

On Tue, Sep 21, 2021 at 7:02 PM Joanne Koong <joannekoong@fb.com> wrote:
>
>
> On 9/21/21 2:02 PM, Joanne Koong wrote:
> > This patch adds the libbpf infrastructure that will allow the user to
> > specify a configurable number of hash functions to use for the bloom
> > filter map.
> >
> > Please note that this patch does not enforce that a pinned bloom filter
> > map may only be reused if the number of hash functions is the same. If
> > they are not the same, the number of hash functions used will be the one
> > that was set for the pinned map.
> >
> > Signed-off-by: Joanne Koong <joannekoong@fb.com>
> > ---
> >   include/uapi/linux/bpf.h        |  5 ++++-
> >   tools/include/uapi/linux/bpf.h  |  5 ++++-
> >   tools/lib/bpf/bpf.c             |  2 ++
> >   tools/lib/bpf/bpf.h             |  1 +
> >   tools/lib/bpf/libbpf.c          | 32 +++++++++++++++++++++++++++-----
> >   tools/lib/bpf/libbpf.h          |  2 ++
> >   tools/lib/bpf/libbpf.map        |  1 +
> >   tools/lib/bpf/libbpf_internal.h |  4 +++-
> >   8 files changed, 44 insertions(+), 8 deletions(-)
> >

[...]

> >
> >   struct btf_map_def {
> > @@ -201,6 +202,7 @@ struct btf_map_def {
> >       __u32 map_flags;
> >       __u32 numa_node;
> >       __u32 pinning;
> > +     __u32 nr_hash_funcs;
> >   };
> >
>
> I just realized that Andrii's comment on v1 stated that btf_map_def is
> fixed indefinitely.
>
> This implies that for bloom filter maps, where the number of hash
> functions needs to be set, we will not be able to use the BTF-defined
> format and will instead need to use the older map definition that uses
> bpf_map_def. Is my understanding of this correct? If so, I will go
> ahead and fix this for v4.

You are confusing bpf_map_def (which is part of the public libbpf API,
even if to-be-deprecated) with btf_map_def, which is only
libbpf-internal. So it's ok to modify the latter struct.

>
> >   int parse_btf_map_def(const char *map_name, struct btf *btf,

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-22 23:07             ` Andrii Nakryiko
@ 2021-09-23  1:28               ` Martin KaFai Lau
  2021-09-23 18:42                 ` Andrii Nakryiko
  0 siblings, 1 reply; 37+ messages in thread
From: Martin KaFai Lau @ 2021-09-23  1:28 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, bpf, Kernel Team

On Wed, Sep 22, 2021 at 04:07:52PM -0700, Andrii Nakryiko wrote:
> > > Please see my RFC ([0]). I don't think there is much to coordinate. It
> > > could be purely BPF-side code, or BPF + user-space initialization
> > > code, depending on the need. It's a simple and beautiful algorithm,
> > > and BPF is powerful enough to implement a custom version easily.
> > >
> > >   [0] https://lore.kernel.org/bpf/20210922203224.912809-1-andrii@kernel.org/T/#t
> > In practice, the bloom filter will be populated only once by the userspace.
> >
> > The future update will be done by map-in-map to replace the whole bloom
> > filter, maybe with more max_entries and more nr_hashes, maybe with fewer
> > max_entries and fewer nr_hashes.
> >
> > Currently, the continuously running bpf prog using this bloom filter does
> > not need to worry about any change in the newer bloom filter
> > configuration/setup.
> >
> > I wonder how that may look in the custom bpf bloom filter in the
> > bench prog for the map-in-map usage.
> 
> You'd have to use BPF_MAP_TYPE_ARRAY for the map-in-map use case.
Right, another map is needed.  When the user space generates
a new bloom filter as an inner map, it is likely to have a different
number of entries, so the map size is different.

The old and new inner array maps need to at least have the same value_size,
so a one-element array with a different value_size will not work.

The inner array map with BPF_F_INNER_MAP can have different max_entries,
but then there is no inline lookup code generation.  It may not be too
bad to call it multiple times to look up a value, considering that
array_map_lookup_elem will still be directly called without a retpoline.
The next part is how to learn those "const volatile __u32 bloom_*;"
values of the new inner map.  I think the max_entries can be obtained
from map_ptr->max_entries.   Other vars (e.g. hash_cnt and seed) can
be non-const globals, allowing updates, and a brief moment of
inconsistency may be fine.

It all sounds doable but all these small need-to-pay-attention
things add up.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-23  1:28               ` Martin KaFai Lau
@ 2021-09-23 18:42                 ` Andrii Nakryiko
  2021-09-23 19:42                   ` Martin KaFai Lau
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-23 18:42 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Joanne Koong, bpf, Kernel Team

On Wed, Sep 22, 2021 at 6:28 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Sep 22, 2021 at 04:07:52PM -0700, Andrii Nakryiko wrote:
> > > > Please see my RFC ([0]). I don't think there is much to coordinate. It
> > > > could be purely BPF-side code, or BPF + user-space initialization
> > > > code, depending on the need. It's a simple and beautiful algorithm,
> > > > and BPF is powerful enough to implement a custom version easily.
> > > >
> > > >   [0] https://lore.kernel.org/bpf/20210922203224.912809-1-andrii@kernel.org/T/#t
> > > In practice, the bloom filter will be populated only once by the userspace.
> > >
> > > The future update will be done by map-in-map to replace the whole bloom
> > > filter, maybe with more max_entries and more nr_hashes, maybe with fewer
> > > max_entries and fewer nr_hashes.
> > >
> > > Currently, the continuously running bpf prog using this bloom filter does
> > > not need to worry about any change in the newer bloom filter
> > > configuration/setup.
> > >
> > > I wonder how that may look in the custom bpf bloom filter in the
> > > bench prog for the map-in-map usage.
> >
> > You'd have to use BPF_MAP_TYPE_ARRAY for the map-in-map use case.
> Right, another map is needed.  When the user space generates
> a new bloom filter as inner map, it is likely that it has different
> number of entries, so the map size is different.
>
> The old and new inner array maps need to at least have the same value_size,
> so a one-element array with a different value_size will not work.
>
> The inner array map with BPF_F_INNER_MAP can have different max_entries
> but then there is no inline code lookup generation.  It may not be too
> bad to call it multiple times to lookup a value considering the
> array_map_lookup_elem will still be directly called without retpoline.

All true, of course, due to generic BPF limitations. In practice, I'd
decide what's the maximum size of the bloom filter I'd need and use
that as an inner map definition. If I understand correctly, there is
going to be only one "active" Bloom filter map and it's generally not
that big (a few megabytes cover tons of "max_entries"), so I'd just
work with the maximum expected size.

If I absolutely needed variable-sized filters, I'd consider doing a
multi-element array as you suggested, but I'd expect lower
performance, as you mentioned.

> The next part is how to learn those "const volatile __u32 bloom_*;"
> values of the new inner map.  I think the max_entries can be obtained
> by map_ptr->max_entries.   Other vars (e.g. hash_cnt and seed) can
> be used as non-const globals, allowing updates, and a brief moment of
> inconsistency may be fine.

For single-element array with fixed value_size I'd put those in first 8 bytes:

struct my_bloom {
    __u32 msk;
    __u32 seed;
    __u64 data[];
}

For multi-element BPF_MAP_TYPE_ARRAY I'd put a mask and seed into elem[0].

I'd expect that hash_cnt would be just hard-coded because, as I
mentioned before, it determines the probability of a false positive,
which is what end-users probably care about the most and set upfront.
At least they should be coming at this from the perspective of "1% of
false positives is acceptable" rather than "hmm... 3 hash functions is
probably acceptable", no? But if not, the first two elements would be
taken.
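
Concretely, the check might read roughly like this (a sketch only;
bpf_jhash_mem() is the hypothetical hashing helper from my RFC, and I'm
glossing over making the variable bit index verifier-friendly):

static __always_inline bool bloom_may_contain(struct my_bloom *b,
					      const void *data, __u32 sz,
					      __u32 nr_hashes)
{
	__u32 i, bit;

	/* b comes from bpf_map_lookup_elem() on the single-element inner
	 * array; msk and seed live in its first 8 bytes
	 */
	for (i = 0; i < nr_hashes; i++) {
		bit = bpf_jhash_mem(data, sz, b->seed + i) & b->msk;
		if (!(b->data[bit / 64] & (1ULL << (bit & 63))))
			return false; /* definitely not in the set */
	}
	return true;                  /* maybe in the set */
}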

>
> It all sounds doable but all these small need-to-pay-attention
> things add up.

Of course, there is always a tension between "make it simple and
provide a dedicated BPF helper/BPF map" and "let users implement it on
their own". I'm saying I'm not convinced that it has to be the former
in this case. Bloom filter is just a glorified bit set, once you have
a hashing helper. I don't think we've added BPF_MAP_TYPE_BITSET yet,
though it probably would be pretty useful in some cases, just like the
Bloom filter. Similarly, we don't have BPF_MAP_TYPE_HASHSET in
addition to BPF_MAP_TYPE_HASHMAP. I've seen many cases where HASHMAP
is used as HASHSET, yet we didn't have a dedicated map for that
either. I'm just curious where we draw the line between what should be
added to the kernel for BPF, if there are reasonable ways to avoid
that.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-23 18:42                 ` Andrii Nakryiko
@ 2021-09-23 19:42                   ` Martin KaFai Lau
  2021-09-23 20:30                     ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Martin KaFai Lau @ 2021-09-23 19:42 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, bpf, Kernel Team

On Thu, Sep 23, 2021 at 11:42:55AM -0700, Andrii Nakryiko wrote:
> On Wed, Sep 22, 2021 at 6:28 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Sep 22, 2021 at 04:07:52PM -0700, Andrii Nakryiko wrote:
> > > > > Please see my RFC ([0]). I don't think there is much to coordinate. It
> > > > > could be purely BPF-side code, or BPF + user-space initialization
> > > > > code, depending on the need. It's a simple and beautiful algorithm,
> > > > > and BPF is powerful enough to implement a custom version easily.
> > > > >
> > > > >   [0] https://lore.kernel.org/bpf/20210922203224.912809-1-andrii@kernel.org/T/#t
> > > > In practice, the bloom filter will be populated only once by the userspace.
> > > >
> > > > The future update will be done by map-in-map to replace the whole bloom
> > > > filter, maybe with more max_entries and more nr_hashes, maybe with fewer
> > > > max_entries and fewer nr_hashes.
> > > >
> > > > Currently, the continuously running bpf prog using this bloom filter does
> > > > not need to worry about any change in the newer bloom filter
> > > > configuration/setup.
> > > >
> > > > I wonder how that may look in the custom bpf bloom filter in the
> > > > bench prog for the map-in-map usage.
> > >
> > > You'd have to use BPF_MAP_TYPE_ARRAY for the map-in-map use case.
> > Right, another map is needed.  When the user space generates
> > a new bloom filter as inner map, it is likely that it has different
> > number of entries, so the map size is different.
> >
> > The old and new inner array maps need to at least have the same value_size,
> > so a one-element array with a different value_size will not work.
> >
> > The inner array map with BPF_F_INNER_MAP can have different max_entries
> > but then there is no inline code lookup generation.  It may not be too
> > bad to call it multiple times to lookup a value considering the
> > array_map_lookup_elem will still be directly called without retpoline.
> 
> All true, of course, due to generic BPF limitations. In practice, I'd
> decide what's the maximum size of the bloom filter I'd need and use
> that as an inner map definition. If I understand correctly, there is
> going to be only one "active" Bloom filter map and it's generally not
> > that big (a few megabytes cover tons of "max_entries"), so I'd just
> > work with the maximum expected size.
A constant size defeats the configurability/flexibility and memory-usage
arguments for the custom bpf map.

> 
> If I absolutely needed variable-sized filters, I'd consider doing a
> multi-element array as you suggested, but I'd expect lower
> performance, as you mentioned.
Yep.  I also expect it will be worse, and it reduces the benefit of
using a bloom filter.

> > The next part is how to learn those "const volatile __u32 bloom_*;"
> > values of the new inner map.  I think the max_entries can be obtained
> > by map_ptr->max_entries.   Other vars (e.g. hash_cnt and seed) can
> > be used as non-const globals, allowing updates, and a brief moment of
> > inconsistency may be fine.
> 
> For single-element array with fixed value_size I'd put those in first 8 bytes:
> 
> struct my_bloom {
>     __u32 msk;
>     __u32 seed;
>     __u64 data[];
> }
> 
> For multi-element BPF_MAP_TYPE_ARRAY I'd put a mask and seed into elem[0].
Yes, that is the thing that came to my mind also, but I was very hesitant
even to mention it: one extra thing to remind the user that he/she needs
to do to avoid potential false negatives during the map switch, which the
user expects will not happen for a bloom filter once the map is fully
populated.

> I'd expect that hash_cnt would be just hard-coded, because as I
> mentioned before, it determines the probability of false positive,
> which is what end-users probably care about the most and set upfront,
> at least they should be coming at this from the perspective "1% of
> false positives is acceptable" rather than "hmm... 3 hash functions is
> probably acceptable", no? But if not, first two elements would be
> taken.
I am not sure a constant number of hashes / false positive rate is always
the use case either.  But yes, another element in the arraymap will be needed.

> > It all sounds doable but all these small need-to-pay-attention
> > things add up.
> 
> Of course, there is always a tension between "make it simple and
> provide a dedicated BPF helper/BPF map" and "let users implement it on
> their own". I'm saying I'm not convinced that it has to be the former
> in this case. Bloom filter is just a glorified bit set, once you have
> a hashing helper. I don't think we've added BPF_MAP_TYPE_BITSET yet,
> though it probably would be pretty useful in some cases, just like the
> Bloom filter. Similarly, we don't have BPF_MAP_TYPE_HASHSET in
> addition to BPF_MAP_TYPE_HASHMAP. I've seen many cases where HASHMAP
> is used as HASHSET, yet we didn't have a dedicated map for that
> either. I'm just curious where we draw the line between what should be
> added to the kernel for BPF, if there are reasonable ways to avoid
> that.
I think the ups and downs of both arguments are pretty clear
in terms of performance, configurability, and map-in-map.

How do we move forward from here?  Is it a must to keep the
bloomfilter-like map in pure bpf only, meaning we should stop
the current work?

Or is it useful to have a kernel bloom filter that provides
a get-and-go usage and a consistent experience for the user in
map-in-map, which is the primary use case here?  I don't think
this necessarily blocks the custom bpf map effort either.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-23 19:42                   ` Martin KaFai Lau
@ 2021-09-23 20:30                     ` Alexei Starovoitov
  2021-09-23 21:12                       ` Andrii Nakryiko
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2021-09-23 20:30 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Andrii Nakryiko, Joanne Koong, bpf, Kernel Team

On Thu, Sep 23, 2021 at 12:42:33PM -0700, Martin KaFai Lau wrote:
> 
> How to move it forward from here?  Is it a must to keep the
> bloomfilter-like map in pure bpf only and we should stop
> the current work?
> 
> or it is useful to have the kernel bloom filter that provides
> a get-and-go usage and a consistent experience for user in
> map-in-map which is the primary use case here.  I don't think
> this is necessarily blocking the custom bpf map effort also.

I think map-based and helper-based bloom filter implementations
are far from equivalent. There are pros and cons to both.
For example, the helper style doesn't have a good way
to query/populate from user space. If it's a single-element
array, user space would be forced to allocate huge buffers
just to read/write a single huge value_size.
With a multi-element array it's sort of easier.
mmap-ing the array could help too,
but in either case user space would need to copy-paste jhash,
which is GPL, and that might be more than just an inconvenience.
We can try siphash in the bpf helper and give it a flag to choose
between hash implementations. That helps, but doesn't completely
make them equivalent.

As far as the map-based bloom filter goes, I think it can combine bitset
and bloomfilter features into one. delete_elem from user space
can be mapped into pop() to clear bits.
Some special value of nr_hashes could mean that no hashing
is applied and a 4 or 8 byte key gets a modulo of max_entries
and is treated as a bit index. Both bpf prog and user space will
have uniform access into such a bitset. With nr_hashes >= 1
it will become a bloom filter.
In that sense maybe max_entries should be specified in bits
and the map called a bitset. With nr_hashes >= 1 the kernel
would accept key_size > 8 and convert it to bloom filter
peek/pop/push. In other words:
nr_hashes == 0: bit_idx == key for set/read/clear
nr_hashes >= 1: bit_idx[1..N] = hash(key, N) for set/read/clear.
If we could teach the verifier to inline the bit lookup
we could potentially get rid of the bloomfilter loop inside peek().
Then the map would be a true bitset without bloomfilter quirks.
Even in such a case it's not equivalent to a bpf_hash(mem_ptr, size, flags) helper.
Thoughts?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-23 20:30                     ` Alexei Starovoitov
@ 2021-09-23 21:12                       ` Andrii Nakryiko
  2021-09-23 22:28                         ` Joanne Koong
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-23 21:12 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Martin KaFai Lau, Joanne Koong, bpf, Kernel Team

On Thu, Sep 23, 2021 at 1:30 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 12:42:33PM -0700, Martin KaFai Lau wrote:
> >
> > How to move it forward from here?  Is it a must to keep the
> > bloomfilter-like map in pure bpf only and we should stop
> > the current work?
> >
> > or it is useful to have the kernel bloom filter that provides
> > a get-and-go usage and a consistent experience for user in
> > map-in-map which is the primary use case here.  I don't think
> > this is necessarily blocking the custom bpf map effort also.
>
> I think map based and helper based bloom filter implementations
> are far from equivalent. There are pros and cons to both.
> For example the helper style doesn't have a good way
> of query/populate from the user space. If it's a single element

I thought about use from user-space. I'd consider two ways of
doing that: one more complicated but with better performance, another
simpler but less performant (though in this case "less performant" is
still equivalent to the in-kernel implementation's performance, or better):

1. Have an identical hash function implementation in user-space, in this
case Jenkins hash. Then memory-map the contents and use exactly the same
bloom filter code to set the bits (as I said, once you have the hash, it's
just a glorified bitset). This has the downside that if there is even
a single bit of difference between the hash produced by the kernel and by
user-space, you are screwed. But you can't beat the performance because
there is no syscall overhead.

2. Use the BPF_PROG_RUN command to execute a custom program that would set
one or multiple provided values in the hashset. Just like we argued
for BPF timers, a BPF program can be a custom "API" that avoids
having separate user-space logic. Pass one or many values through a
global variable/array, then BPF_PROG_RUN a program that iterates the values,
calculates hashes, and sets bits. It actually should be faster than doing
a BPF_MAP_UPDATE_ELEM syscall for each value. The next proposal will be to
add batched update support, of course, so I won't claim victory on
the performance argument here. :)
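
As a sketch of option 2 (reusing bloom[], bloom_seed, NR_BITS, NR_HASHES
and the hypothetical bpf_jhash_mem() from the sketch earlier in the
thread; SEC("syscall") is assumed to be the program type that
BPF_PROG_RUN would execute, and all names/sizes are made up):

#define MAX_VALS 4096

__u64 vals[MAX_VALS]; /* filled in by user space before BPF_PROG_RUN */
__u32 nr_vals;

SEC("syscall")
int bloom_populate(void *ctx)
{
	__u32 i, j, bit;

	for (i = 0; i < MAX_VALS && i < nr_vals; i++) {
		for (j = 0; j < NR_HASHES; j++) {
			bit = bpf_jhash_mem(&vals[i], sizeof(vals[i]),
					    bloom_seed + j) & (NR_BITS - 1);
			/* atomic or: concurrent updaters stay safe */
			__sync_fetch_and_or(&bloom[bit / 64],
					    1ULL << (bit & 63));
		}
	}
	return 0;
}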

But yes, it needs a bit more (but simple) code, than if the kernel
just provided a Bloom filter map out of the box.

> array the user space would be forced to allocate huge buffers
> just to read/write single huge value_size.
> With multi element array it's sort-of easier.
> mmap-ing the array could help too,
> but in either case the user space would need to copy-paste jhash,
> which is GPL, and that might be more than just inconvenience.

From include/linux/jhash.h: "You can use this free for any purpose.
It's in the public domain".

> We can try siphash in the bpf helper and give it a flag to choose

I did bpf_jhash_mem() just to demonstrate the approach quickly. I
think in practice I'd go for a more generic implementation where one
of the parameters is an enum that specifies which supported hash
algorithm is used. That's easier to extend:

u64 bpf_hash_mem(const void *data, u32 sz, u32 seed, enum bpf_hash_algo algo);

enum bpf_hash_algo {
   XOR = 0,
   JENKINS = 1,
   MURMUR3 = 2,
   ...
};

Note the XOR case. If we specify it as "xor u64 values, where the last
<8 bytes are zero extended", it will come in useful below for your
proposal.


> between hash implementations. That helps, but doesn't completely
> make them equivalent.

I don't claim that implementing and using a custom Bloom filter will
be easier in all situations. I think the best we can strive for
is making it not much harder, and I think in this case it is. Of
course we can come up with a bunch of situations where doing it with
pure BPF isn't possible to do equivalently (like map-in-map with a
dynamically sized bit size; well, sorry, the BPF verifier can't validate
stuff like that). A dedicated BPF map or helper (as a general case, not
just this one) is pretty much always going to be easier to use, just
because it's a dedicated and tailored API.


>
> As far as map based bloom filter I think it can combine bitset
> and bloomfilter features into one. delete_elem from user space
> can be mapped into pop() to clear bits.
> Some special value of nr_hashes could mean that no hashing
> is applied and 4 or 8 byte key gets modulo of max_entries
> and treated as a bit index. Both bpf prog and user space will
> have uniform access into such bitset. With nr_hashes >= 1
> it will become a bloom filter.
> In that sense may be max_entries should be specified in bits
> and the map is called bitset. With nr_hashes >= 1 the kernel
> would accept key_size > 8 and convert it to bloom filter
> peek/pop/push. In other words
> nr_hash == 0 bit_idx == key for set/read/clear
> nr_hashes >= 1 bit_idx[1..N] = hash(key, N) for set/read/clear.
> If we could teach the verifier to inline the bit lookup
> we potentially can get rid of bloomfilter loop inside the peek().
> Then the map would be true bitset without bloomfilter quirks.
> Even in such case it's not equivalent to bpf_hash(mem_ptr, size, flags) helper.
> Thoughts?

Sounds a bit complicated from an end-user's perspective, tbh, but a bitset
map (with a generalization for bloom filter) sounds a bit more widely
useful. See above for the bpf_hash_algo proposal. If we allow
specifying nr_hashes and the hash algorithm, then with XOR as defined above
and nr_hashes = 1, you'll get just bitset behavior with no extra
restrictions on key size: you could have 1, 2, 4, 8 and more bytes
(where with more bytes it's some suboptimal bloom filter with one hash
function; not sure why you'd do that).

The biggest quirk is defining that XOR hashes in chunks of 8 bytes
(with zero-extension of anything that is not a multiple of 8 bytes in
length). We can do a special "only 1, 2, 4, and 8 bytes are supported"
rule, of course, but it will be special-cased internally. Not sure which
one is cleaner.
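
To make the chunking concrete, one possible reading of that definition
(plain C, a sketch of the semantics only, not an existing helper):

#include <stdint.h>
#include <string.h>

/* XOR-fold the input in u64 chunks; a tail shorter than 8 bytes is
 * copied into a zeroed u64, i.e. zero-extended. (A real definition
 * would also have to pin down byte order for the tail.)
 */
static uint64_t xor_fold(const void *data, uint32_t sz)
{
	const uint8_t *p = data;
	uint64_t h = 0, chunk;

	while (sz >= 8) {
		memcpy(&chunk, p, 8);
		h ^= chunk;
		p += 8;
		sz -= 8;
	}
	if (sz) {
		chunk = 0;
		memcpy(&chunk, p, sz); /* zero-extend the short tail */
		h ^= chunk;
	}
	return h;
}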

While writing this, another thought was to have a "NOOP" (or
"IDENTITY") hash, where we say that we treat the bytes as one huge number.
Obviously we then truncate to the actual bitmap size, which just
basically means "use up to the lower 8 bytes as a number". But it sucks
for big-endian, because to make it sane we'd need to take the last "up
to 8 bytes", which obviously sounds convoluted. So I don't know, just a
thought.

If we do the map, though, regardless of whether it's a bitset or bloom
filter specifically, maybe we should consider modeling it as an actual
bpf_map_lookup_elem(), where the key is a pointer to whatever we are
hashing and looking up? That makes much more sense; it's how people
model sets based on maps: the key is the element you are looking up, the
value is either true/false or meaningless (at least for me it felt much
more natural to look up by key, not by value). In this case, what if on
successful lookup we returned a pointer to some fixed u8/u32/u64
location in the kernel, some dedicated static variable shared between
all maps? Then NULL means "element is not in the set", non-NULL means it
is in the set. Ideally we'd prevent such an element from being written
to, but it might be too hard to do that as just one exception here,
don't know.
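
Under those semantics, program-side usage might read like this (entirely
hypothetical: neither the set-style lookup behavior nor such a map
declaration exists today, and bloom_map is assumed to be declared
elsewhere):

struct elem {
	__u32 saddr;
	__u32 daddr;
}; /* whatever we are testing membership of */

static __always_inline bool in_set(void *bloom_map, struct elem *e)
{
	/* key = the element itself; a non-NULL result (pointing at some
	 * shared dummy value) means "maybe in the set", NULL means
	 * "definitely not in the set"
	 */
	return bpf_map_lookup_elem(bloom_map, e) != NULL;
}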

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-23 21:12                       ` Andrii Nakryiko
@ 2021-09-23 22:28                         ` Joanne Koong
  2021-09-23 23:46                           ` Martin KaFai Lau
  2021-09-24  2:23                           ` Andrii Nakryiko
  0 siblings, 2 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-23 22:28 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov; +Cc: Martin KaFai Lau, bpf, Kernel Team


On 9/23/21 2:12 PM, Andrii Nakryiko wrote:
> On Thu, Sep 23, 2021 at 1:30 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Thu, Sep 23, 2021 at 12:42:33PM -0700, Martin KaFai Lau wrote:
>>> How to move it forward from here?  Is it a must to keep the
>>> bloomfilter-like map in pure bpf only and we should stop
>>> the current work?
>>>
>>> or it is useful to have the kernel bloom filter that provides
>>> a get-and-go usage and a consistent experience for user in
>>> map-in-map which is the primary use case here.  I don't think
>>> this is necessarily blocking the custom bpf map effort also.
>> I think map based and helper based bloom filter implementations
>> are far from equivalent. There are pros and cons to both.
>> For example the helper style doesn't have a good way
>> of query/populate from the user space. If it's a single element
> I thought about the use from user-space. I'd consider two ways of
> doing that. One more complicated but with better performance, another
> simpler but less performant (but in this case less performant is
> equivalent to in-kernel implementation performance, or still better):
>
> 1. Have identical hash function implementation in user-space. In this
> case Jenkins hash. Then memory-map contents and use exactly the same
> bloom filter code to set the bits (as I said, once you have hash, it's
> just a glorified bitset). This has the downside that if there is even
> a single bit difference between hash produced by kernel and
> user-space, you are screwed. But can't beat the performance because no
> syscall overhead.
>
> 2. Use BPF_PROG_RUN command to execute custom program that would set
> one or multiple provided values in the hashset. Just like we argued
> for BPF timers, BPF program can be a custom "API" that would avoid
> having separate user-space logic. Pass one or many values through a
> global variable/array, BPF_PROG_RUN program that would iterate values,
> calculate hashes, set bits. It actually should be faster than doing
> BPF_MAP_UPDATE_ELEM syscall for each value. Next proposal will be to
> add batched update support, of course, so I won't claim victory for
> the performance argument here. :)
>
> But yes, it needs a bit more (but simple) code, than if the kernel
> just provided a Bloom filter map out of the box.
>
>> array the user space would be forced to allocate huge buffers
>> just to read/write single huge value_size.
>> With multi element array it's sort-of easier.
>> mmap-ing the array could help too,
>> but in either case the user space would need to copy-paste jhash,
>> which is GPL, and that might be more than just inconvenience.
>  From include/linux/jhash.h: "You can use this free for any purpose.
> It's in the public domain".
>
>> We can try siphash in the bpf helper and give it a flag to choose
> I did bpf_jhash_mem() just to demonstrate the approach quickly. I
> think in practice I'd go for a more generic implementation where one
> of the parameters is enum that specifies which supported hash
> algorithm is used. It's easier to extend that:
>
> u64 bpf_hash_mem(const void *data, u32 sz, u32 seed, enum bpf_hash_algo algo);
>
> enum bpf_hash_algo {
>     XOR = 0,
>     JENKINS = 1,
>     MURMUR3 = 2,
>     ...
> }
>
> Note the XOR case. If we specify it as "xor u64 values, where the last
> <8 bytes are zero extended", it will come useful below for your
> proposal.
>
>
>> between hash implementations. That helps, but doesn't completely
> >> make them equivalent.
> I don't claim that implementing and using a custom Bloom filter will
> be easier to use in all situations. I think the best we can strive for
> is making it not much harder, and I think in this case it is. Of
> course we can come up with a bunch of situations where doing it with
> pure BPF isn't possible to do equivalently (like map-in-map with
> dynamically sized bit size, well, sorry, BPF verifier can't validate
> stuff like that). Dedicated BPF map or helper (as a general case, not
> just this one) is pretty much always going to be easier to use just
> because it's a dedicated and tailored API.
>
To me, it seems like we get the best of both worlds by using both of
these two ideas for the bloom filter. For developers who would like
to use a general bloom filter without having to do any extra
implementation work or having to understand how bloom filters are
implemented, they could use the custom bloom filter map with minimal
effort. For developers who would like to customize their bloom filter
to something more specific or fine-tuned, they could craft their own
bloom filter in an ebpf program. To me, these two directions don't seem
mutually exclusive.

>> As far as map based bloom filter I think it can combine bitset
>> and bloomfilter features into one. delete_elem from user space
>> can be mapped into pop() to clear bits.
>> Some special value of nr_hashes could mean that no hashing
>> is applied and 4 or 8 byte key gets modulo of max_entries
>> and treated as a bit index. Both bpf prog and user space will
>> have uniform access into such bitset. With nr_hashes >= 1
>> it will become a bloom filter.
>> In that sense may be max_entries should be specified in bits
>> and the map is called bitset. With nr_hashes >= 1 the kernel
>> would accept key_size > 8 and convert it to bloom filter
>> peek/pop/push. In other words
>> nr_hash == 0 bit_idx == key for set/read/clear
>> nr_hashes >= 1 bit_idx[1..N] = hash(key, N) for set/read/clear.
>> If we could teach the verifier to inline the bit lookup
>> we potentially can get rid of bloomfilter loop inside the peek().
>> Then the map would be true bitset without bloomfilter quirks.
>> Even in such case it's not equivalent to bpf_hash(mem_ptr, size, flags) helper.
>> Thoughts?
This is an interesting suggestion; to me, it seems like the APIs and
code would be more straightforward if the bitset and the bloom filter
were separate maps. With max_entries specified in bits, I think this
also relies on the user to make an educated call on the optimal number
of bits to use for their bloom filter, instead of passing in the number
of entries they expect to have and having the bit size automatically
calculated according to a mathematically optimized equation. I am open
to this idea though.

> Sounds a bit complicated from end-user's perspective, tbh, but bitset
> map (with generalization for bloom filter) sounds a bit more widely
> useful. See above for the bpf_hash_algo proposal. If we allow to
> specify nr_hashes and hash algorithm, then with XOR as defined above
> and nr_hashes = 1, you'll get just bitset behavior with no extra
> restrictions on key size: you could have 1, 2, 4, 8 and more bytes
> (where with more bytes it's some suboptimal bloom filter with one hash
> function, not sure why you'd do that).
>
> The biggest quirk is defining that XOR hashes in chunks of 8 bytes
> (with zero-extending anything that is not a multiple of 8 bytes
> length). We can do special "only 1, 2, 4, and 8 bytes are supported",
> of course, but it will be special-cased internally. Not sure which one
> is cleaner.
>
> While writing this, another thought was to have a "NOOP" (or
> "IDENTITY") hash, where we say that we treat bytes as one huge number.
> Obviously then we truncate to the actual bitmap size, which just
> basically means "use up to lower 8 bytes as a number". But it sucks
> for big-endian, because to make it sane we'd need to take last "up to
> 8 bytes", which obviously sounds convoluted. So I don't know, just a
> thought.
>
> If we do the map, though, regardless if it's bitset or bloom
> specifically. Maybe we should consider modeling as actual
> bpf_map_lookup_elem(), where the key is a pointer to whatever we are
> hashing and looking up? It makes much more sense, that's how people
> model sets based on maps: key is the element you are looking up, value
> is either true/false or meaningless (at least for me it felt much more
> natural that you are looking up by key, not by value). In this case,
> what if on successful lookup we return a pointer to some fixed
> u8/u32/u64 location in the kernel, some dedicated static variable
> shared between all maps. So NULL means "element is not in a set",
> non-NULL means it is in the set.
I think this would then also require that the bpf_map_update_elem() API
from the userspace side would have to pass in a valid memory address for
the "value". I understand what you're saying though about it feeling
more natural for the "key" to be the element here; I agree, but there
doesn't seem to be a clean way of doing this. I think maybe one viable
approach would be allowing map_update_elem to pass in a NULL value in
the kernel if the map is a non-associative map, and refactoring the
push_elem/peek_elem API so that the element can represent either the
key or the value.
>   Ideally we'd prevent such element to
> be written to, but it might be too hard to do that as just one
> exception here, don't know.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-23 22:28                         ` Joanne Koong
@ 2021-09-23 23:46                           ` Martin KaFai Lau
  2021-09-24  2:23                           ` Andrii Nakryiko
  1 sibling, 0 replies; 37+ messages in thread
From: Martin KaFai Lau @ 2021-09-23 23:46 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, Kernel Team

On Thu, Sep 23, 2021 at 03:28:01PM -0700, Joanne Koong wrote:
> > > As far as map based bloom filter I think it can combine bitset
> > > and bloomfilter features into one. delete_elem from user space
> > > can be mapped into pop() to clear bits.
> > > Some special value of nr_hashes could mean that no hashing
> > > is applied and 4 or 8 byte key gets modulo of max_entries
> > > and treated as a bit index. Both bpf prog and user space will
> > > have uniform access into such bitset. With nr_hashes >= 1
> > > it will become a bloom filter.
> > > In that sense may be max_entries should be specified in bits
> > > and the map is called bitset. With nr_hashes >= 1 the kernel
> > > would accept key_size > 8 and convert it to bloom filter
> > > peek/pop/push. In other words
> > > nr_hash == 0 bit_idx == key for set/read/clear
> > > nr_hashes >= 1 bit_idx[1..N] = hash(key, N) for set/read/clear.
I like this bitset+nr_hash semantic; then max_entries logically follows
as the number of bits.

> > If we do the map, though, regardless if it's bitset or bloom
> > specifically. Maybe we should consider modeling as actual
> > bpf_map_lookup_elem(), where the key is a pointer to whatever we are
> > hashing and looking up? It makes much more sense, that's how people
> > model sets based on maps: key is the element you are looking up, value
> > is either true/false or meaningless (at least for me it felt much more
> > natural that you are looking up by key, not by value). In this case,
> > what if on successful lookup we return a pointer to some fixed
> > u8/u32/u64 location in the kernel, some dedicated static variable
> > shared between all maps. So NULL means "element is not in a set",
> > non-NULL means it is in the set.
> I think this would then also require that the bpf_map_update_elem() API from
> the userspace side would have to pass in a valid memory address for the
> "value".  I understand what you're saying though about it feeling more natural
> that the "key" is the element here; I agree but there doesn't seem to be a
> clean way of doing this - I think maybe one viable approach would be allowing
> map_update_elem to pass in a NULL value in the kernel if the map is a non-associative map,
> and refactoring the push_elem/peek_elem API so that the element can represent either the key or
> the value.
> > Ideally we'd prevent such element to
> > be written to, but it might be too hard to do that as just one
> > exception here, don't know.
I don't mind key or value either.  With nr_hash == 0 and the key being
the bit_idx, it may be more correct to say that the bit is
indeed 0/1 instead of returning the bit_idx back.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-23 22:28                         ` Joanne Koong
  2021-09-23 23:46                           ` Martin KaFai Lau
@ 2021-09-24  2:23                           ` Andrii Nakryiko
  2021-09-24 16:32                             ` Joanne Koong
  1 sibling, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-24  2:23 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Alexei Starovoitov, Martin KaFai Lau, bpf, Kernel Team

On Thu, Sep 23, 2021 at 3:28 PM Joanne Koong <joannekoong@fb.com> wrote:
>
>
> On 9/23/21 2:12 PM, Andrii Nakryiko wrote:
> > On Thu, Sep 23, 2021 at 1:30 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Thu, Sep 23, 2021 at 12:42:33PM -0700, Martin KaFai Lau wrote:
> >>> How to move it forward from here?  Is it a must to keep the
> >>> bloomfilter-like map in pure bpf only and we should stop
> >>> the current work?
> >>>
> >>> or it is useful to have the kernel bloom filter that provides
> >>> a get-and-go usage and a consistent experience for user in
> >>> map-in-map which is the primary use case here.  I don't think
> >>> this is necessarily blocking the custom bpf map effort also.
> >> I think map based and helper based bloom filter implementations
> >> are far from equivalent. There are pros and cons to both.
> >> For example the helper style doesn't have a good way
> >> of query/populate from the user space. If it's a single element
> > I thought about the use from user-space. I'd consider two ways of
> > doing that. One more complicated but with better performance, another
> > simpler but less performant (but in this case less performant is
> > equivalent to in-kernel implementation performance, or still better):
> >
> > 1. Have identical hash function implementation in user-space. In this
> > case Jenkins hash. Then memory-map contents and use exactly the same
> > bloom filter code to set the bits (as I said, once you have hash, it's
> > just a glorified bitset). This has the downside that if there is even
> > a single bit difference between hash produced by kernel and
> > user-space, you are screwed. But can't beat the performance because no
> > syscall overhead.
> >
> > 2. Use BPF_PROG_RUN command to execute custom program that would set
> > one or multiple provided values in the hashset. Just like we argued
> > for BPF timers, BPF program can be a custom "API" that would avoid
> > having separate user-space logic. Pass one or many values through a
> > global variable/array, BPF_PROG_RUN program that would iterate values,
> > calculate hashes, set bits. It actually should be faster than doing
> > BPF_MAP_UPDATE_ELEM syscall for each value. Next proposal will be to
> > add batched update support, of course, so I won't claim victory for
> > the performance argument here. :)
> >
> > But yes, it needs a bit more (but simple) code, than if the kernel
> > just provided a Bloom filter map out of the box.
> >
> >> array the user space would be forced to allocate huge buffers
> >> just to read/write single huge value_size.
> >> With multi element array it's sort-of easier.
> >> mmap-ing the array could help too,
> >> but in either case the user space would need to copy-paste jhash,
> >> which is GPL, and that might be more than just inconvenience.
> >  From include/linux/jhash.h: "You can use this free for any purpose.
> > It's in the public domain".
> >
> >> We can try siphash in the bpf helper and give it a flag to choose
> > I did bpf_jhash_mem() just to demonstrate the approach quickly. I
> > think in practice I'd go for a more generic implementation where one
> > of the parameters is enum that specifies which supported hash
> > algorithm is used. It's easier to extend that:
> >
> > u64 bpf_hash_mem(const void *data, u32 sz, u32 seed, enum bpf_hash_algo algo);
> >
> > enum bpf_hash_algo {
> >     XOR = 0,
> >     JENKINS = 1,
> >     MURMUR3 = 2,
> >     ...
> > }
> >
> > Note the XOR case. If we specify it as "xor u64 values, where the last
> > <8 bytes are zero extended", it will come in useful below for your
> > proposal.
> >
> >
> >> between hash implementations. That helps, but doesn't completely
> >> make them equivalent.
> > I don't claim that implementing and using a custom Bloom filter will
> > be easier in all situations. I think the best we can strive for
> > is making it not much harder, and I think in this case it is. Of
> > course we can come up with a bunch of situations where doing it with
> > pure BPF isn't possible to do equivalently (like map-in-map with
> > dynamically sized bit size, well, sorry, BPF verifier can't validate
> > stuff like that). A dedicated BPF map or helper (as a general case, not
> > just this one) will pretty much always be easier to use, just
> > because it's a dedicated and tailored API.
> >
> To me, it seems like we get the best of both worlds by using both of
> these two ideas for the bloom filter. Developers who would like to
> use a general bloom filter, without having to do any extra
> implementation work or having to understand how bloom filters are
> implemented, could use the custom bloom filter map with minimal
> effort. Developers who would like to customize their bloom filter to
> something more specific or fine-tuned could craft their own bloom
> filter in an eBPF program. To me, these two directions don't seem
> mutually exclusive.

They are not mutually exclusive, of course, but adding stuff to the
kernel has its maintenance costs.

>
> >> As far as map based bloom filter I think it can combine bitset
> >> and bloomfilter features into one. delete_elem from user space
> >> can be mapped into pop() to clear bits.
> >> Some special value of nr_hashes could mean that no hashing
> >> is applied and 4 or 8 byte key gets modulo of max_entries
> >> and treated as a bit index. Both bpf prog and user space will
> >> have uniform access into such bitset. With nr_hashes >= 1
> >> it will become a bloom filter.
> >> In that sense maybe max_entries should be specified in bits
> >> and the map is called bitset. With nr_hashes >= 1 the kernel
> >> would accept key_size > 8 and convert it to bloom filter
> >> peek/pop/push. In other words
> >> nr_hash == 0 bit_idx == key for set/read/clear
> >> nr_hashes >= 1 bit_idx[1..N] = hash(key, N) for set/read/clear.
> >> If we could teach the verifier to inline the bit lookup
> >> we could potentially get rid of the bloomfilter loop inside peek().
> >> Then the map would be a true bitset without bloomfilter quirks.
> >> Even in such case it's not equivalent to bpf_hash(mem_ptr, size, flags) helper.
> >> Thoughts?
> This is an interesting suggestion; to me, it seems like the APIs and
> code would be more straightforward if the bitset and the bloom filter
> were separate maps. With max_entries specified in bits, I think this
> also relies on the user to make an educated call on the optimal number
> of bits to use for their bloom filter, instead of passing in the
> number of entries they expect to have and having the bit size
> automatically calculated according to a mathematically optimized
> equation. I am open to this idea though.

We can provide a macro that will calculate the mathematically optimized
value based on the desired number of unique entries and hash functions.
E.g.:

#define BPF_BLOOM_FILTER_BYTE_SZ(nr_uniq_entries, nr_hash_funcs) \
	((nr_uniq_entries) * (nr_hash_funcs) / 5 * 7 / 8)

Kernel code can round up to the closest power of two internally to make
this simpler. So if users don't care or don't know, they'll use the
BPF_BLOOM_FILTER_BYTE_SZ() macro, but if they know better, they'll
just specify the desired number of bytes.
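
To sanity-check the constants (my arithmetic, not something from the
patches): 7/5 = 1.4 approximates 1/ln(2) ~= 1.44, so the macro computes
roughly nr_uniq_entries * nr_hash_funcs / ln(2) bits, converted to
bytes, which is the bit count at which nr_hash_funcs is the optimal
number of hash functions. E.g., for 100k unique entries and 5 hash
functions:

BPF_BLOOM_FILTER_BYTE_SZ(100000, 5)
	= 100000 * 5 / 5 * 7 / 8
	= 87500 bytes (~85 KiB)

which would get rounded up to 128 KiB if the kernel rounds to the
closest power of two.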

>
> > Sounds a bit complicated from end-user's perspective, tbh, but bitset
> > map (with generalization for bloom filter) sounds a bit more widely
> > useful. See above for the bpf_hash_algo proposal. If we allow
> > specifying nr_hashes and the hash algorithm, then with XOR as defined
> > above and nr_hash = 1, you'll get just bitset behavior with no extra
> > restrictions on key size: you could have 1, 2, 4, 8 and more bytes
> > (where with more bytes it's some suboptimal bloom filter with one hash
> > function, not sure why you'd do that).
> >
> > The biggest quirk is defining that XOR hashes in chunks of 8 bytes
> > (with zero-extending anything that is not a multiple of 8 bytes
> > length). We can do special "only 1, 2, 4, and 8 bytes are supported",
> > of course, but it will be special-cased internally. Not sure which one
> > is cleaner.
> >
> > While writing this, another thought was to have a "NOOP" (or
> > "IDENTITY") hash, where we say that we treat bytes as one huge number.
> > Obviously then we truncate to the actual bitmap size, which just
> > basically means "use up to lower 8 bytes as a number". But it sucks
> > for big-endian, because to make it sane we'd need to take last "up to
> > 8 bytes", which obviously sounds convoluted. So I don't know, just a
> > thought.
> >
> > If we do the map, though, regardless of whether it's bitset or bloom
> > specifically, maybe we should consider modeling it as an actual
> > bpf_map_lookup_elem(), where the key is a pointer to whatever we are
> > hashing and looking up? It makes much more sense; that's how people
> > model sets based on maps: the key is the element you are looking up,
> > the value is either true/false or meaningless (at least for me it felt
> > much more natural that you are looking up by key, not by value). In
> > this case, what if on successful lookup we return a pointer to some
> > fixed u8/u32/u64 location in the kernel, some dedicated static
> > variable shared between all maps? So NULL means "element is not in a
> > set", non-NULL means it is in the set.
> I think this would then also require that the bpf_map_update_elem()
> API from the userspace side would have to pass in a valid memory
> address for the "value".
> I understand what you're saying though about it feeling more natural
> that the "key" is the element here; I agree, but there doesn't seem
> to be a clean way of doing this - I think maybe one viable approach
> would be allowing map_update_elem to pass in a NULL value in the
> kernel if the map is a non-associative map, and refactoring the
> push_elem/peek_elem API so that the element can represent either the
> key or the value.

Yeah, we can allow value to be NULL (and key non-NULL). But why
push/peek if we are talking about using standard
lookup_elem/update_elem (and maybe even delete_elem which will reset
bits to 0)?

> >   Ideally we'd prevent such an element from
> > being written to, but it might be too hard to do that as just one
> > exception here, don't know.

BTW, that nr_hash_funcs field in UAPI and in libbpf was still
bothering me. I'd feel better if we generalize this to future map
needs and make it generic. How about adding a "u32 map_extra;" field to
UAPI (as a new field, so it's available for all maps, including
map-in-maps)? The meaning of that field would be per-map-type extra
flags/values/etc. In this case we can define that, for BLOOM_FILTER,
map_extra encodes the number of hash functions. If we're ok adding the
hash function enum that I proposed for the bpf_hash_mem() helper, we
can also include that in map_extra. We can reserve the lower N bits for
the number of hash functions and the next few bits for the
hash algo enum.

So then you'd define map (in libbpf syntax) pretty naturally as:

struct {
    __uint(type, BPF_MAP_TYPE_BLOOM_FILTER); /* or BITSET, don't know
which way this is going */
    ....
    __uint(map_extra, BPF_HASH_SHA256 | 3); /* 3 x SHA256 hash functions */
} my_bloom_filter SEC(".maps");

BPF_HASH_SHA256 would be defined as something like 0x{1,2,4,etc}0,
leaving lower 4 bits for nr_hash_funcs.

And then I'd have no problem supporting map_extra field for any map
definition, with bpf_map__map_extra() and bpf_map__set_map_extra()
getter/setter.

map_flags is different in that it's partially shared by all map types
(e.g., BPF_F_RDONLY_PROG, BPF_F_INNER_MAP, BPF_F_NUMA_NODE can be
specified for lots of different types).

Naming is obviously up for discussion.
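
To make the encoding concrete, a minimal sketch of what I have in mind
(the mask and the BPF_HASH_* values below are made-up placeholders, not
proposed UAPI):

/* hypothetical layout: bits 0-3 = nr_hash_funcs, bits 4+ = hash algo */
#define BPF_MAP_EXTRA_NR_HASH_MASK	0xfU
#define BPF_HASH_JENKINS		0x10U
#define BPF_HASH_SHA256			0x20U

static inline __u32 map_extra_nr_hashes(__u32 map_extra)
{
	return map_extra & BPF_MAP_EXTRA_NR_HASH_MASK;
}

static inline __u32 map_extra_hash_algo(__u32 map_extra)
{
	return map_extra & ~BPF_MAP_EXTRA_NR_HASH_MASK;
}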

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-24  2:23                           ` Andrii Nakryiko
@ 2021-09-24 16:32                             ` Joanne Koong
  2021-09-24 23:12                               ` Andrii Nakryiko
  0 siblings, 1 reply; 37+ messages in thread
From: Joanne Koong @ 2021-09-24 16:32 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Alexei Starovoitov, Martin KaFai Lau, bpf, Kernel Team

On 9/23/21 7:23 PM, Andrii Nakryiko wrote:

> On Thu, Sep 23, 2021 at 3:28 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> On 9/23/21 2:12 PM, Andrii Nakryiko wrote:
>>> On Thu, Sep 23, 2021 at 1:30 PM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Thu, Sep 23, 2021 at 12:42:33PM -0700, Martin KaFai Lau wrote:
>>>>> How to move it forward from here?  Is it a must to keep the
>>>>> bloomfilter-like map in pure bpf only and we should stop
>>>>> the current work?
>>>>>
>>>>> or it is useful to have the kernel bloom filter that provides
>>>>> a get-and-go usage and a consistent experience for user in
>>>>> map-in-map which is the primary use case here.  I don't think
>>>>> this is necessarily blocking the custom bpf map effort also.
>>>> I think map based and helper based bloom filter implementations
>>>> are far from equivalent. There are pros and cons to both.
>>>> For example the helper style doesn't have a good way
>>>> of query/populate from the user space. If it's a single element
>>> I thought about the use from user-space. I'd consider two ways of
>>> doing that. One more complicated but with better performance, another
>>> simpler but less performant (but in this case less performant is
>>> equivalent to in-kernel implementation performance, or still better):
>>>
>>> 1. Have identical hash function implementation in user-space. In this
>>> case Jenkins hash. Then memory-map contents and use exactly the same
>>> bloom filter code to set the bits (as I said, once you have hash, it's
>>> just a glorified bitset). This has the downside that if there is even
>>> a single bit difference between hash produced by kernel and
>>> user-space, you are screwed. But can't beat the performance because no
>>> syscall overhead.
>>>
>>> 2. Use BPF_PROG_RUN command to execute custom program that would set
>>> one or multiple provided values in the hashset. Just like we argued
>>> for BPF timers, BPF program can be a custom "API" that would avoid
>>> having separate user-space logic. Pass one or many values through a
>>> global variable/array, BPF_PROG_RUN program that would iterate values,
>>> calculate hashes, set bits. It actually should be faster than doing
>>> BPF_MAP_UPDATE_ELEM syscall for each value. Next proposal will be to
>>> add batched update support, of course, so I won't claim victory for
>>> the performance argument here. :)
>>>
>>> But yes, it needs a bit more (but simple) code, than if the kernel
>>> just provided a Bloom filter map out of the box.
>>>
>>>> array the user space would be forced to allocate huge buffers
>>>> just to read/write single huge value_size.
>>>> With multi element array it's sort-of easier.
>>>> mmap-ing the array could help too,
>>>> but in either case the user space would need to copy-paste jhash,
>>>> which is GPL, and that might be more than just inconvenience.
>>>   From include/linux/jhash.h: "You can use this free for any purpose.
>>> It's in the public domain".
>>>
>>>> We can try siphash in the bpf helper and give it a flag to choose
>>> I did bpf_jhash_mem() just to demonstrate the approach quickly. I
>>> think in practice I'd go for a more generic implementation where one
>>> of the parameters is enum that specifies which supported hash
>>> algorithm is used. It's easier to extend that:
>>>
>>> u64 bpf_hash_mem(const void *data, u32 sz, u32 seed, enum bpf_hash_algo algo);
>>>
>>> enum bpf_hash_algo {
>>>      XOR = 0,
>>>      JENKINS = 1,
>>>      MURMUR3 = 2,
>>>      ...
>>> }
>>>
>>> Note the XOR case. If we specify it as "xor u64 values, where the last
>>> <8 bytes are zero extended", it will come in useful below for your
>>> proposal.
>>>
>>>
>>>> between hash implementations. That helps, but doesn't completely
>>>> make them equivalent.
>>> I don't claim that implementing and using a custom Bloom filter will
>>> be easier in all situations. I think the best we can strive for
>>> is making it not much harder, and I think in this case it is. Of
>>> course we can come up with a bunch of situations where doing it with
>>> pure BPF isn't possible to do equivalently (like map-in-map with
>>> dynamically sized bit size, well, sorry, BPF verifier can't validate
>>> stuff like that). A dedicated BPF map or helper (as a general case, not
>>> just this one) will pretty much always be easier to use, just
>>> because it's a dedicated and tailored API.
>>>
>> To me, it seems like we get the best of both worlds by using both of
>> these two ideas for the bloom filter. Developers who would like to
>> use a general bloom filter, without having to do any extra
>> implementation work or having to understand how bloom filters are
>> implemented, could use the custom bloom filter map with minimal
>> effort. Developers who would like to customize their bloom filter to
>> something more specific or fine-tuned could craft their own bloom
>> filter in an eBPF program. To me, these two directions don't seem
>> mutually exclusive.
> They are not mutually exclusive, of course, but adding stuff to the
> kernel has its maintenance costs.
>
>>>> As far as map based bloom filter I think it can combine bitset
>>>> and bloomfilter features into one. delete_elem from user space
>>>> can be mapped into pop() to clear bits.
>>>> Some special value of nr_hashes could mean that no hashing
>>>> is applied and 4 or 8 byte key gets modulo of max_entries
>>>> and treated as a bit index. Both bpf prog and user space will
>>>> have uniform access into such bitset. With nr_hashes >= 1
>>>> it will become a bloom filter.
>>>> In that sense maybe max_entries should be specified in bits
>>>> and the map is called bitset. With nr_hashes >= 1 the kernel
>>>> would accept key_size > 8 and convert it to bloom filter
>>>> peek/pop/push. In other words
>>>> nr_hash == 0 bit_idx == key for set/read/clear
>>>> nr_hashes >= 1 bit_idx[1..N] = hash(key, N) for set/read/clear.
>>>> If we could teach the verifier to inline the bit lookup
>>>> we could potentially get rid of the bloomfilter loop inside peek().
>>>> Then the map would be a true bitset without bloomfilter quirks.
>>>> Even in such case it's not equivalent to bpf_hash(mem_ptr, size, flags) helper.
>>>> Thoughts?
>> This is an interesting suggestion; to me, it seems like the APIs and
>> code would be more straightforward if the bitset and the bloom filter
>> were separate maps. With max_entries specified in bits, I think this
>> also relies on the user to make an educated call on the optimal number
>> of bits to use for their bloom filter, instead of passing in the
>> number of entries they expect to have and having the bit size
>> automatically calculated according to a mathematically optimized
>> equation. I am open to this idea though.
> We can provide a macro that will calculate the mathematically optimized
> value based on the desired number of unique entries and hash functions.
> E.g.:
>
> #define BPF_BLOOM_FILTER_BYTE_SZ(nr_uniq_entries, nr_hash_funcs) \
> 	((nr_uniq_entries) * (nr_hash_funcs) / 5 * 7 / 8)
>
> Kernel code can round up to the closest power of two internally to make
> this simpler. So if users don't care or don't know, they'll use the
> BPF_BLOOM_FILTER_BYTE_SZ() macro, but if they know better, they'll
> just specify the desired number of bytes.
Sounds great!
>>> Sounds a bit complicated from end-user's perspective, tbh, but bitset
>>> map (with generalization for bloom filter) sounds a bit more widely
>>> useful. See above for the bpf_hash_algo proposal. If we allow
>>> specifying nr_hashes and the hash algorithm, then with XOR as defined
>>> above and nr_hash = 1, you'll get just bitset behavior with no extra
>>> restrictions on key size: you could have 1, 2, 4, 8 and more bytes
>>> (where with more bytes it's some suboptimal bloom filter with one hash
>>> function, not sure why you'd do that).
>>>
>>> The biggest quirk is defining that XOR hashes in chunks of 8 bytes
>>> (with zero-extending anything that is not a multiple of 8 bytes
>>> length). We can do special "only 1, 2, 4, and 8 bytes are supported",
>>> of course, but it will be special-cased internally. Not sure which one
>>> is cleaner.
>>>
>>> While writing this, another thought was to have a "NOOP" (or
>>> "IDENTITY") hash, where we say that we treat bytes as one huge number.
>>> Obviously then we truncate to the actual bitmap size, which just
>>> basically means "use up to lower 8 bytes as a number". But it sucks
>>> for big-endian, because to make it sane we'd need to take last "up to
>>> 8 bytes", which obviously sounds convoluted. So I don't know, just a
>>> thought.
>>>
>>> If we do the map, though, regardless of whether it's bitset or bloom
>>> specifically, maybe we should consider modeling it as an actual
>>> bpf_map_lookup_elem(), where the key is a pointer to whatever we are
>>> hashing and looking up? It makes much more sense; that's how people
>>> model sets based on maps: the key is the element you are looking up,
>>> the value is either true/false or meaningless (at least for me it felt
>>> much more natural that you are looking up by key, not by value). In
>>> this case, what if on successful lookup we return a pointer to some
>>> fixed u8/u32/u64 location in the kernel, some dedicated static
>>> variable shared between all maps? So NULL means "element is not in a
>>> set", non-NULL means it is in the set.
>> I think this would then also require that the bpf_map_update_elem()
>> API from the userspace side would have to pass in a valid memory
>> address for the "value".
>> I understand what you're saying though about it feeling more natural
>> that the "key" is the element here; I agree, but there doesn't seem
>> to be a clean way of doing this - I think maybe one viable approach
>> would be allowing map_update_elem to pass in a NULL value in the
>> kernel if the map is a non-associative map, and refactoring the
>> push_elem/peek_elem API so that the element can represent either the
>> key or the value.
> Yeah, we can allow value to be NULL (and key non-NULL). But why
> push/peek if we are talking about using standard
> lookup_elem/update_elem (and maybe even delete_elem which will reset
> bits to 0)?
In the syscall layer where we handle lookup_elem, we call
map->ops->map_lookup_elem and expect that the ptr we get back is a ptr
to the value associated with the key (and if the ptr is NULL, we treat
that as an error).

Instead of adding special-casing for bloom filter maps to treat a NULL
ptr value as something that's okay, it seems cleaner to repurpose peek
to be the API we use for all non-associative map types.
>>>    Ideally we'd prevent such an element from
>>> being written to, but it might be too hard to do that as just one
>>> exception here, don't know.
> BTW, that nr_hash_funcs field in UAPI and in libbpf was still
> bothering me. I'd feel better if we generalize this to future map
> needs and make it generic. How about adding a "u32 map_extra;" field to
> UAPI (as a new field, so it's available for all maps, including
> map-in-maps)? The meaning of that field would be per-map-type extra
> flags/values/etc. In this case we can define that, for BLOOM_FILTER,
> map_extra encodes the number of hash functions. If we're ok adding the
> hash function enum that I proposed for the bpf_hash_mem() helper, we
> can also include that in map_extra. We can reserve the lower N bits for
> the number of hash functions and the next few bits for the
> hash algo enum.
>
> So then you'd define map (in libbpf syntax) pretty naturally as:
>
> struct {
>      __uint(type, BPF_MAP_TYPE_BLOOM_FILTER); /* or BITSET, don't know
> which way this is going */
>      ....
>      __uint(map_extra, BPF_HASH_SHA256 | 3); /* 3 x SHA256 hash functions */
> } my_bloom_filter SEC(".maps");
>
> BPF_HASH_SHA256 would be defined as something like 0x{1,2,4,etc}0,
> leaving lower 4 bits for nr_hash_funcs.
>
> And then I'd have no problem supporting map_extra field for any map
> definition, with bpf_map__map_extra() and bpf_map__set_map_extra()
> getter/setter.
>
> map_flags is different in that it's partially shared by all map types
> (e.g., BPF_F_RDONLY_PROG, BPF_F_INNER_MAP, BPF_F_NUMA_NODE can be
> specified for lots of different types).
>
> Naming is obviously up for discussion.

I love this idea!


To make sure we're all aligned on the direction of this, for v4 I'm
going to make the following changes:
* Make the map a bitset + bloom filter map, depending on how many
hashes are passed in
* Write tests for the bitset functionality of the map
* Change nr_hashes to a more generic "u32 map_extra"
* Incorporate some of Andrii's changes to the benchmarks for
accounting + measuring
* Add more documentation regarding the optimal bitmap size equation
"nr_unique_entries * nr_hash_funcs / 5 * 7 / 8" to clear up any confusion


Thanks for the discussion on this, Martin, Andrii, and Alexei!


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-24 16:32                             ` Joanne Koong
@ 2021-09-24 23:12                               ` Andrii Nakryiko
  2021-09-27 16:41                                 ` Alexei Starovoitov
  2021-09-28  1:09                                 ` Martin KaFai Lau
  0 siblings, 2 replies; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-24 23:12 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Alexei Starovoitov, Martin KaFai Lau, bpf, Kernel Team

On Fri, Sep 24, 2021 at 9:32 AM Joanne Koong <joannekoong@fb.com> wrote:
>
> On 9/23/21 7:23 PM, Andrii Nakryiko wrote:
>
> > On Thu, Sep 23, 2021 at 3:28 PM Joanne Koong <joannekoong@fb.com> wrote:
> >
> > On 9/23/21 2:12 PM, Andrii Nakryiko wrote:
> >>> On Thu, Sep 23, 2021 at 1:30 PM Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com> wrote:
> >>>> On Thu, Sep 23, 2021 at 12:42:33PM -0700, Martin KaFai Lau wrote:
> >>>>> How to move it forward from here?  Is it a must to keep the
> >>>>> bloomfilter-like map in pure bpf only and we should stop
> >>>>> the current work?
> >>>>>
> >>>>> or it is useful to have the kernel bloom filter that provides
> >>>>> a get-and-go usage and a consistent experience for user in
> >>>>> map-in-map which is the primary use case here.  I don't think
> >>>>> this is necessarily blocking the custom bpf map effort also.
> >>>> I think map based and helper based bloom filter implementations
> >>>> are far from equivalent. There are pros and cons to both.
> >>>> For example the helper style doesn't have a good way
> >>>> of query/populate from the user space. If it's a single element
> >>> I thought about the use from user-space. I'd consider two ways of
> >>> doing that. One more complicated but with better performance, another
> >>> simpler but less performant (but in this case less performant is
> >>> equivalent to in-kernel implementation performance, or still better):
> >>>
> >>> 1. Have identical hash function implementation in user-space. In this
> >>> case Jenkins hash. Then memory-map contents and use exactly the same
> >>> bloom filter code to set the bits (as I said, once you have hash, it's
> >>> just a glorified bitset). This has the downside that if there is even
> >>> a single bit difference between hash produced by kernel and
> >>> user-space, you are screwed. But can't beat the performance because no
> >>> syscall overhead.
> >>>
> >>> 2. Use BPF_PROG_RUN command to execute custom program that would set
> >>> one or multiple provided values in the hashset. Just like we argued
> >>> for BPF timers, BPF program can be a custom "API" that would avoid
> >>> having separate user-space logic. Pass one or many values through a
> >>> global variable/array, BPF_PROG_RUN program that would iterate values,
> >>> calculate hashes, set bits. It actually should be faster than doing
> >>> BPF_MAP_UPDATE_ELEM syscall for each value. Next proposal will be to
> >>> add batched update support, of course, so I won't claim victory for
> >>> the performance argument here. :)
> >>>
> >>> But yes, it needs a bit more (but simple) code, than if the kernel
> >>> just provided a Bloom filter map out of the box.
> >>>
> >>>> array the user space would be forced to allocate huge buffers
> >>>> just to read/write single huge value_size.
> >>>> With multi element array it's sort-of easier.
> >>>> mmap-ing the array could help too,
> >>>> but in either case the user space would need to copy-paste jhash,
> >>>> which is GPL, and that might be more than just inconvenience.
> >>>   From include/linux/jhash.h: "You can use this free for any purpose.
> >>> It's in the public domain".
> >>>
> >>>> We can try siphash in the bpf helper and give it a flag to choose
> >>> I did bpf_jhash_mem() just to demonstrate the approach quickly. I
> >>> think in practice I'd go for a more generic implementation where one
> >>> of the parameters is enum that specifies which supported hash
> >>> algorithm is used. It's easier to extend that:
> >>>
> >>> u64 bpf_hash_mem(const void *data, u32 sz, u32 seed, enum bpf_hash_algo algo);
> >>>
> >>> enum bpf_hash_algo {

btw, once we have this for the Bloom filter and the BPF helper, we
should consider adding support for it to the STACK_TRACE map (we've
seen a considerable amount of hash collisions in practice), so letting
users choose a hash function would be great for it. I'd use the
proposed map_extra for that as well. This is all beside the Bloom
filter discussion, obviously; just pointing it out so we don't forget
later.

> >>>      XOR = 0,
> >>>      JENKINS = 1,
> >>>      MURMUR3 = 2,
> >>>      ...
> >>> }
> >>>
> >>> Note the XOR case. If we specify it as "xor u64 values, where the last
> >>> <8 bytes are zero extended", it will come in useful below for your
> >>> proposal.
> >>>
> >>>
> >>>> between hash implementations. That helps, but doesn't completely
> >>>> make them equivalent.
> >>> I don't claim that implementing and using a custom Bloom filter will
> >>> be easier in all situations. I think the best we can strive for
> >>> is making it not much harder, and I think in this case it is. Of
> >>> course we can come up with a bunch of situations where doing it with
> >>> pure BPF isn't possible to do equivalently (like map-in-map with
> >>> dynamically sized bit size, well, sorry, BPF verifier can't validate
> >>> stuff like that). A dedicated BPF map or helper (as a general case, not
> >>> just this one) will pretty much always be easier to use, just
> >>> because it's a dedicated and tailored API.
> >>>
> >> To me, it seems like we get the best of both worlds by using both of
> >> these two ideas for the bloom filter. Developers who would like to
> >> use a general bloom filter, without having to do any extra
> >> implementation work or having to understand how bloom filters are
> >> implemented, could use the custom bloom filter map with minimal
> >> effort. Developers who would like to customize their bloom filter to
> >> something more specific or fine-tuned could craft their own bloom
> >> filter in an eBPF program. To me, these two directions don't seem
> >> mutually exclusive.
> > They are not mutually exclusive, of course, but adding stuff to the
> > kernel has its maintenance costs.
> >
> >>>> As far as map based bloom filter I think it can combine bitset
> >>>> and bloomfilter features into one. delete_elem from user space
> >>>> can be mapped into pop() to clear bits.
> >>>> Some special value of nr_hashes could mean that no hashing
> >>>> is applied and 4 or 8 byte key gets modulo of max_entries
> >>>> and treated as a bit index. Both bpf prog and user space will
> >>>> have uniform access into such bitset. With nr_hashes >= 1
> >>>> it will become a bloom filter.
> >>>> In that sense maybe max_entries should be specified in bits
> >>>> and the map is called bitset. With nr_hashes >= 1 the kernel
> >>>> would accept key_size > 8 and convert it to bloom filter
> >>>> peek/pop/push. In other words
> >>>> nr_hash == 0 bit_idx == key for set/read/clear
> >>>> nr_hashes >= 1 bit_idx[1..N] = hash(key, N) for set/read/clear.
> >>>> If we could teach the verifier to inline the bit lookup
> >>>> we could potentially get rid of the bloomfilter loop inside peek().
> >>>> Then the map would be a true bitset without bloomfilter quirks.
> >>>> Even in such case it's not equivalent to bpf_hash(mem_ptr, size, flags) helper.
> >>>> Thoughts?
> >> This is an interesting suggestion; to me, it seems like the APIs and
> >> code would be more straightforward if the bitset and the bloom filter
> >> were separate maps. With max_entries specified in bits, I think this
> >> also relies on the user to make an educated call on the optimal number
> >> of bits to use for their bloom filter, instead of passing in the
> >> number of entries they expect to have and having the bit size
> >> automatically calculated according to a mathematically optimized
> >> equation. I am open to this idea though.
> > We can provide a macro that will calculate the mathematically optimized
> > value based on the desired number of unique entries and hash functions.
> > E.g.:
> >
> > #define BPF_BLOOM_FILTER_BYTE_SZ(nr_uniq_entries, nr_hash_funcs) \
> > 	((nr_uniq_entries) * (nr_hash_funcs) / 5 * 7 / 8)
> >
> > Kernel code can round up to the closest power of two internally to make
> > this simpler. So if users don't care or don't know, they'll use the
> > BPF_BLOOM_FILTER_BYTE_SZ() macro, but if they know better, they'll
> > just specify the desired number of bytes.
> Sounds great!
> >>> Sounds a bit complicated from end-user's perspective, tbh, but bitset
> >>> map (with generalization for bloom filter) sounds a bit more widely
> >>> useful. See above for the bpf_hash_algo proposal. If we allow
> >>> specifying nr_hashes and the hash algorithm, then with XOR as defined
> >>> above and nr_hash = 1, you'll get just bitset behavior with no extra
> >>> restrictions on key size: you could have 1, 2, 4, 8 and more bytes
> >>> (where with more bytes it's some suboptimal bloom filter with one hash
> >>> function, not sure why you'd do that).
> >>>
> >>> The biggest quirk is defining that XOR hashes in chunks of 8 bytes
> >>> (with zero-extending anything that is not a multiple of 8 bytes
> >>> length). We can do special "only 1, 2, 4, and 8 bytes are supported",
> >>> of course, but it will be special-cased internally. Not sure which one
> >>> is cleaner.
> >>>
> >>> While writing this, another thought was to have a "NOOP" (or
> >>> "IDENTITY") hash, where we say that we treat bytes as one huge number.
> >>> Obviously then we truncate to the actual bitmap size, which just
> >>> basically means "use up to lower 8 bytes as a number". But it sucks
> >>> for big-endian, because to make it sane we'd need to take last "up to
> >>> 8 bytes", which obviously sounds convoluted. So I don't know, just a
> >>> thought.
> >>>
> >>> If we do the map, though, regardless of whether it's bitset or bloom
> >>> specifically, maybe we should consider modeling it as an actual
> >>> bpf_map_lookup_elem(), where the key is a pointer to whatever we are
> >>> hashing and looking up? It makes much more sense; that's how people
> >>> model sets based on maps: the key is the element you are looking up,
> >>> the value is either true/false or meaningless (at least for me it felt
> >>> much more natural that you are looking up by key, not by value). In
> >>> this case, what if on successful lookup we return a pointer to some
> >>> fixed u8/u32/u64 location in the kernel, some dedicated static
> >>> variable shared between all maps? So NULL means "element is not in a
> >>> set", non-NULL means it is in the set.
> >> I think this would then also require that the bpf_map_update_elem()
> >> API from the userspace side would have to pass in a valid memory
> >> address for the "value".
> >> I understand what you're saying though about it feeling more natural
> >> that the "key" is the element here; I agree, but there doesn't seem
> >> to be a clean way of doing this - I think maybe one viable approach
> >> would be allowing map_update_elem to pass in a NULL value in the
> >> kernel if the map is a non-associative map, and refactoring the
> >> push_elem/peek_elem API so that the element can represent either the
> >> key or the value.
> > Yeah, we can allow value to be NULL (and key non-NULL). But why
> > push/peek if we are talking about using standard
> > lookup_elem/update_elem (and maybe even delete_elem which will reset
> > bits to 0)?
> In the syscall layer where we handle lookup_elem, we call
> map->ops->map_lookup_elem and expect that the ptr we get back is a ptr
> to the value associated with the key (and if the ptr is NULL, we treat
> that as an error).
>
> Instead of adding special-casing for bloom filter maps to treat a NULL
> ptr value as something that's okay, it seems cleaner to repurpose peek
> to be the API we use for all non-associative map types.

That's not what I proposed. So let's say somewhere in the kernel we
have this variable:

static int bpf_bloom_exists = 1;

Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
all its hashed bits are set in Bloom filter (it "exists"), we return
&bpf_bloom_exists. So it's not a NULL pointer.

Now, despite returning a valid pointer, it would be good to prevent
reading/writing from/to it in a valid BPF program. I'm hoping it is as
easy as just setting map->value_size = 0 during map creation. But
worst case, we can let BPF program just overwrite that 1 with whatever
they want. It doesn't matter because the contract is that
bpf_map_lookup_elem() returns non-NULL for "exists" and NULL
otherwise.

Now, for MAP_LOOKUP_ELEM command on user-space side. I'd say the
contract should be 0 return code on "exists" (and nothing is written
to value pointer, perhaps even disallow to specify the non-NULL value
pointer altogether); -ENOENT, otherwise. Again, worst-case we can
specify that "value_size" is 1 or 4 bytes and write 1 to it.

Does it make sense?
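
To make it concrete, a minimal kernel-side sketch (bloom_bits_all_set()
is a made-up helper standing in for the hash-and-test loop, not code
from the patches):

static int bpf_bloom_exists = 1;

static void *bloom_map_lookup_elem(struct bpf_map *map, void *key)
{
	if (bloom_bits_all_set(map, key))	/* all hashed bits set */
		return &bpf_bloom_exists;	/* "maybe exists" */
	return NULL;				/* "definitely doesn't exist" */
}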


BTW, for STACK/QUEUE maps, I'm not clear why we had to add
map_peek_elem/map_pop_elem/map_push_elem operations, to be honest.
They could have been pretty reasonably mapped to map_lookup_elem (peek
is non-destructive lookup), map_lookup_elem + some flag (pop is
lookup, but destructive, so flag allows "destruction"), and
map_update_elem (for push). map_delete_elem would be pop for when
value can be ignored.

That's to say that I don't consider STACK/QUEUE maps as good examples,
it's rather a counter-example of maps that barely anyone is using, yet
it just adds clutter to BPF internals.



> >>>    Ideally we'd prevent such an element from
> >>> being written to, but it might be too hard to do that as just one
> >>> exception here, don't know.
> > BTW, that nr_hash_funcs field in UAPI and in libbpf was still
> > bothering me. I'd feel better if we generalize this to future map
> > needs and make it generic. How about adding a "u32 map_extra;" field to
> > UAPI (as a new field, so it's available for all maps, including
> > map-in-maps)? The meaning of that field would be per-map-type extra
> > flags/values/etc. In this case we can define that, for BLOOM_FILTER,
> > map_extra encodes the number of hash functions. If we're ok adding the
> > hash function enum that I proposed for the bpf_hash_mem() helper, we
> > can also include that in map_extra. We can reserve the lower N bits for
> > the number of hash functions and the next few bits for the
> > hash algo enum.
> >
> > So then you'd define map (in libbpf syntax) pretty naturally as:
> >
> > struct {
> >      __uint(type, BPF_MAP_TYPE_BLOOM_FILTER); /* or BITSET, don't know
> > which way this is going */
> >      ....
> >      __uint(map_extra, BPF_HASH_SHA256 | 3); /* 3 x SHA256 hash functions */
> > } my_bloom_filter SEC(".maps");
> >
> > BPF_HASH_SHA256 would be defined as something like 0x{1,2,4,etc}0,
> > leaving lower 4 bits for nr_hash_funcs.
> >
> > And then I'd have no problem supporting map_extra field for any map
> > definition, with bpf_map__map_extra() and bpf_map__set_map_extra()
> > getter/setter.
> >
> > map_flags is different in that it's partially shared by all map types
> > (e.g., BPF_F_RDONLY_PROG, BPF_F_INNER_MAP, BPF_F_NUMA_NODE can be
> > specified for lots of different types).
> >
> > Naming is obviously up for discussion.
>
> I love this idea!
>
>
> To make sure we're all aligned on the direction of this, for v4 I'm
> going to make the following changes:
> * Make the map a bitset + bloom filter map, depending on how many
> hashes are passed in

I'm not sure we are all on the same page on the exact encoding and
what to do about the bitset case (do we restrict keys to 1, 4, 8
bytes? or do we go with a "noop" (or xor?) hash function? or?...). It
would be good to hear from Alexei and Martin so as not to waste your
time iterating on this. The rest looks good to me.


> * Write tests for the bitset functionality of the map
> * Change nr_hashes to a more generic "u32 map_extra"

yep, with this being generic I think it's good to have
bpf_map__set_map_extra() and bpf_map__map_extra() APIs in libbpf as
well.

> * Incorporate some of Andrii's changes to the benchmarks for
> accounting + measuring

would be nice to have an in-kernel updates benchmark, if it's not too much work

For the existing lookup benchmarks, bpf_get_prandom_u32() seems to take
a nontrivial amount of time, so I'd recommend pre-generating a bunch of
random data during the preparation phase, and then for each BPF program
invocation calling bpf_get_prandom_u32() just once (before the 512
iterations of the loop) to determine the starting position p. Then just
use the pregenerated data from rand_data[(p + i) % TOTAL_RAND_DATA_CNT].
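
A rough sketch of what I mean on the BPF side (rand_data,
TOTAL_RAND_DATA_CNT, and the bloom_map name are assumptions, filled in
by user space during setup):

__u32 rand_data[TOTAL_RAND_DATA_CNT];	/* pre-generated in user space */

/* one prandom call per program run instead of one per lookup */
__u32 i, p = bpf_get_prandom_u32();

for (i = 0; i < 512; i++) {
	__u32 *val = &rand_data[(p + i) % TOTAL_RAND_DATA_CNT];

	bpf_map_peek_elem(&bloom_map, val);
}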


> * Add more documentation regarding the optimal bitmap size equation
> "nr_unique_entries * nr_hash_funcs / 5 * 7 / 8" to clear up any confusion
>
>
> Thanks for the discussion on this, Martin, Andrii, and Alexei!
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-24 23:12                               ` Andrii Nakryiko
@ 2021-09-27 16:41                                 ` Alexei Starovoitov
  2021-09-27 21:14                                   ` Andrii Nakryiko
  2021-09-28  1:09                                 ` Martin KaFai Lau
  1 sibling, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2021-09-27 16:41 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, Martin KaFai Lau, bpf, Kernel Team

On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> 
> That's not what I proposed. So let's say somewhere in the kernel we
> have this variable:
> 
> static int bpf_bloom_exists = 1;
> 
> Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
> all its hashed bits are set in Bloom filter (it "exists"), we return
> &bpf_bloom_exists. So it's not a NULL pointer.

imo that's too much of a hack.

> Now, despite returning a valid pointer, it would be good to prevent
> reading/writing from/to it in a valid BPF program. I'm hoping it is as
> easy as just setting map->value_size = 0 during map creation. But
> worst case, we can let BPF program just overwrite that 1 with whatever
> they want. It doesn't matter because the contract is that
> bpf_map_lookup_elem() returns non-NULL for "exists" and NULL
> otherwise.
> 
> Now, for MAP_LOOKUP_ELEM command on user-space side. I'd say the
> contract should be 0 return code on "exists" (and nothing is written
> to value pointer, perhaps even disallow to specify the non-NULL value
> pointer altogether); -ENOENT, otherwise. Again, worst-case we can
> specify that "value_size" is 1 or 4 bytes and write 1 to it.
> 
> Does it make sense?
> 
> 
> BTW, for STACK/QUEUE maps, I'm not clear why we had to add
> map_peek_elem/map_pop_elem/map_push_elem operations, to be honest.
> They could have been pretty reasonably mapped to map_lookup_elem (peek
> is non-destructive lookup), map_lookup_elem + some flag (pop is
> lookup, but destructive, so flag allows "destruction"), and
> map_update_elem (for push). map_delete_elem would be pop for when
> value can be ignored.
> 
> That's to say that I don't consider STACK/QUEUE maps as good examples,
> it's rather a counter-example of maps that barely anyone is using, yet
> it just adds clutter to BPF internals.

Repurposing lookup/update ops for stack/queue and forcing bpf progs
to pass a dummy key would have looked quite ugly.
peek/pop/push have one pointer. That pointer points to value.
For bitset we have single pointer as well.
So it makes perfect sense to reuse push/pop/peek and keep bitset
as a value from the verifier side.
bpf_helper_defs.h could have static inline functions:
bool bpf_bitset_clear(map, key);
bool bpf_bitset_set(map, key);
bool bpf_bitset_test(map, key);

But they will map to FUNC_map_pop_elem/push/peek as func ids
and will be seen as values from the verifier pov.

The bpf progs might think of them as keys, but they will be value_size.
The bitset creation could use key_size as an input, but internally
set it as value_size.
Not sure whether such internal vs uapi remapping might lead
to confusion and bugs though.
I agree that key as an input to bitset_clear/set/test makes more sense,
but the verifier needs value_size to plug into the peek/pop infra.

I don't think it's worth extending ops with yet another 3 callbacks
and having a clean ARG_PTR_TO_UNINIT_MAP_KEY there.
That is probably too much clutter.

I think 
bool bpf_bitset_clear(map, value);
bool bpf_bitset_set(map, value);
bool bpf_bitset_test(map, value);
doesn't look too bad either.
At least this way the bit set map_create op will use value_size
and keep it as value_size. The bpf prog won't be confusing.
That's my preference, so far.
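
For illustration, assuming push/pop/peek keep their current helper
signatures and 0-on-success convention, those wrappers could be as thin
as this (a sketch, not a patch):

static inline bool bpf_bitset_set(void *map, void *value)
{
	return bpf_map_push_elem(map, value, 0) == 0;	/* push -> set */
}

static inline bool bpf_bitset_clear(void *map, void *value)
{
	return bpf_map_pop_elem(map, value) == 0;	/* pop -> clear */
}

static inline bool bpf_bitset_test(void *map, void *value)
{
	return bpf_map_peek_elem(map, value) == 0;	/* peek -> test */
}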

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-27 16:41                                 ` Alexei Starovoitov
@ 2021-09-27 21:14                                   ` Andrii Nakryiko
  2021-09-27 23:51                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-27 21:14 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Joanne Koong, Martin KaFai Lau, bpf, Kernel Team

On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> >
> > That's not what I proposed. So let's say somewhere in the kernel we
> > have this variable:
> >
> > static int bpf_bloom_exists = 1;
> >
> > Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
> > all its hashed bits are set in Bloom filter (it "exists"), we return
> > &bpf_bloom_exists. So it's not a NULL pointer.
>
> imo that's too much of a hack.

too bad, because this feels pretty natural in BPF code:

int my_key = 1234;

if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) {
    /* handle "maybe exists" */
} else {
    /* handle "definitely doesn't exist" */
}

With map->value_size == 0, the verifier should be able to ensure that
whatever non-NULL pointer is returned isn't readable/writable. We
might need to fix some BPF verifier internals if that's not supported
yet; I haven't checked thoroughly.

>
> > Now, despite returning a valid pointer, it would be good to prevent
> > reading/writing from/to it in a valid BPF program. I'm hoping it is as
> > easy as just setting map->value_size = 0 during map creation. But
> > worst case, we can let BPF program just overwrite that 1 with whatever
> > they want. It doesn't matter because the contract is that
> > bpf_map_lookup_elem() returns non-NULL for "exists" and NULL
> > otherwise.
> >
> > Now, for MAP_LOOKUP_ELEM command on user-space side. I'd say the
> > contract should be 0 return code on "exists" (and nothing is written
> > to value pointer, perhaps even disallow to specify the non-NULL value
> > pointer altogether); -ENOENT, otherwise. Again, worst-case we can
> > specify that "value_size" is 1 or 4 bytes and write 1 to it.
> >
> > Does it make sense?
> >
> >
> > BTW, for STACK/QUEUE maps, I'm not clear why we had to add
> > map_peek_elem/map_pop_elem/map_push_elem operations, to be honest.
> > They could have been pretty reasonably mapped to map_lookup_elem (peek
> > is non-destructive lookup), map_lookup_elem + some flag (pop is
> > lookup, but destructive, so flag allows "destruction"), and
> > map_update_elem (for push). map_delete_elem would be pop for when
> > value can be ignored.
> >
> > That's to say that I don't consider STACK/QUEUE maps as good examples,
> > it's rather a counter-example of maps that barely anyone is using, yet
> > it just adds clutter to BPF internals.
>
> Repurposing lookup/update ops for stack/queue and forcing bpf progs
> to pass a dummy key would have looked quite ugly.

The idea was to pass a non-NULL *key* pointer. For bpf_map_update_elem()
you'd pass a non-NULL key and a NULL value. But it's fine, we have
peek/pop/push now, might as well use them. From the user-space side it's
still mapped to LOOKUP/UPDATE/DELETE commands, right? BTW, it seems like
we have a LOOKUP_AND_DELETE command as well, which is implemented for
STACK/QUEUE maps, but not BLOOM_FILTER. I guess for consistency we'd
have to implement that one for user-space as well?

> peek/pop/push have one pointer. That pointer points to value.
> For bitset we have single pointer as well.
> So it makes perfect sense to reuse push/pop/peek and keep bitset
> as a value from the verifier side.

Ok.

> bpf_helper_defs.h could have static inline functions:
> bool bpf_bitset_clear(map, key);
> bool bpf_bitset_set(map, key);
> bool bpf_bitset_test(map, key);
>
> But they will map to FUNC_map_pop_elem/push/peek as func ids
> and will be seen as values from the verifier pov.
>
> The bpf progs might think of them as keys, but they will be value_size.
> The bitset creation could use key_size as an input, but internally
> set it as value_size.
> Not sure whether such internal vs uapi remapping might lead
> to confusion and bugs though.

I think it will and I wouldn't do it.

> I agree that key as an input to bitset_clear/set/test makes more sense,
> but the verifier needs value_size to plug into the peek/pop infra.

Which is why I initially proposed to not use the peek/pop infra and
instead do lookup/update/delete, but if a NULL value pointer passed
into bpf_map_update_elem() disqualifies this, then it's fine. But I
didn't ask to repurpose peek/pop for working with key pointers.

>
> I don't think it's worth extending ops with yet another 3 callbacks
> and having a clean ARG_PTR_TO_UNINIT_MAP_KEY there.
> That is probably too much clutter.
>
> I think
> bool bpf_bitset_clear(map, value);
> bool bpf_bitset_set(map, value);
> bool bpf_bitset_test(map, value);
> doesn't look too bad either.
> At least this way the bit set map_create op will use value_size
> and keep it as value_size. The bpf prog won't be confusing.
> That's my preference, so far.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-27 21:14                                   ` Andrii Nakryiko
@ 2021-09-27 23:51                                     ` Alexei Starovoitov
  2021-09-28  0:36                                       ` Andrii Nakryiko
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2021-09-27 23:51 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, Martin KaFai Lau, bpf, Kernel Team

On Mon, Sep 27, 2021 at 02:14:09PM -0700, Andrii Nakryiko wrote:
> On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> > >
> > > That's not what I proposed. So let's say somewhere in the kernel we
> > > have this variable:
> > >
> > > static int bpf_bloom_exists = 1;
> > >
> > > Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
> > > all its hashed bits are set in Bloom filter (it "exists"), we return
> > > &bpf_bloom_exists. So it's not a NULL pointer.
> >
> > imo that's too much of a hack.
> 
> too bad, because this feels pretty natural in BPF code:
> 
> int my_key = 1234;
> 
> if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) {
>     /* handle "maybe exists" */
> } else {
>     /* handle "definitely doesn't exist" */
> }

I don't think it fits the bitset map.
In the bitset the value is zero or one. It always exists.
If the bloomfilter is not a special map and is instead implemented on
top of a generic bitset with a plain loop in a bpf program, then
push -> bit_set
pop -> bit_clear
peek -> bit_test
would be a better fit for the bitset map.

bpf_map_pop_elem() and peek_elem() don't have a 'flags' argument.
In most cases that would be a blocker,
but in this case we can add:
.arg3_type      = ARG_ANYTHING
and ignore it in the case of stack/queue.
The bitset could use the flags as an additional seed into the hash.
So to do a bloomfilter the bpf prog would do:
for (i = 0; i < 5; i++)
   if (bpf_map_peek_elem(map, &value, convert_i_to_seed(i)))

It would probably still be an improvement to add:
static inline long bpf_bit_test(void *map, void *value, long flags)
{
    return bpf_map_peek_elem(map, value, flags);
}
to some header file.

Or maybe the bloomfilter and the loop can be a flavor of the bitset map,
done inside the helper to spare bpf programmers the manual loops.
Or such a loop could be another static inline function in a header file?
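
For that last option, a sketch of such a header-file helper, assuming
the 3-arg peek with the flags-as-seed extension above lands
(convert_i_to_seed() and the bpf_bloom_test() name are illustrative):

static inline bool bpf_bloom_test(void *map, void *value, int nr_hashes)
{
	int i;

	for (i = 0; i < nr_hashes; i++)
		/* peek returns 0 when the bit for this seed is set */
		if (bpf_map_peek_elem(map, value, convert_i_to_seed(i)))
			return false;	/* a bit is unset: definitely absent */
	return true;			/* all bits set: maybe present */
}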

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-27 23:51                                     ` Alexei Starovoitov
@ 2021-09-28  0:36                                       ` Andrii Nakryiko
  2021-09-28 16:21                                         ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-28  0:36 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Joanne Koong, Martin KaFai Lau, bpf, Kernel Team

On Mon, Sep 27, 2021 at 4:51 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 27, 2021 at 02:14:09PM -0700, Andrii Nakryiko wrote:
> > On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> > > >
> > > > That's not what I proposed. So let's say somewhere in the kernel we
> > > > have this variable:
> > > >
> > > > static int bpf_bloom_exists = 1;
> > > >
> > > > Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
> > > > all its hashed bits are set in Bloom filter (it "exists"), we return
> > > > &bpf_bloom_exists. So it's not a NULL pointer.
> > >
> > > imo that's too much of a hack.
> >
> > too bad, because this feels pretty natural in BPF code:
> >
> > int my_key = 1234;
> >
> > if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) {
> >     /* handle "maybe exists" */
> > } else {
> >     /* handle "definitely doesn't exist" */
> > }
>
> I don't think it fits the bitset map.
> In the bitset the value is zero or one. It always exists.
> If the bloomfilter is not a special map and is instead implemented on
> top of a generic bitset with a plain loop in a bpf program, then
> push -> bit_set
> pop -> bit_clear
> peek -> bit_test
> would be a better fit for the bitset map.
>
> bpf_map_pop_elem() and peek_elem() don't have a 'flags' argument.
> In most cases that would be a blocker,
> but in this case we can add:
> .arg3_type      = ARG_ANYTHING
> and ignore it in the case of stack/queue.
> The bitset could use the flags as an additional seed into the hash.
> So to do a bloomfilter the bpf prog would do:
> for (i = 0; i < 5; i++)
>    if (bpf_map_peek_elem(map, &value, convert_i_to_seed(i)))

I think I'm getting lost in the whole unified bitset + bloom filter
design, tbh. In this case, why would you pass the seed to peek()? And
what is value here? Is that the value (N bytes) or the bit index (4
bytes?)? I assumed that once we have a hashing helper and a bitset
map, you'd use that and the seed to calculate the bit index. But now I'm
confused about what this peek operation is doing. I'm sorry if I'm
just slow.

Overall, I think I agree with Joanne that a separate dedicated Bloom
filter map will have simpler and better usability. This bitset + bloom
filter generalization seems to just create unnecessary confusion. I
feel the need for a bitset map even less than I felt the need for a
Bloom filter, given it's an even simpler data structure and is totally
implementable on either a global var array or, if map-in-map and
dynamic sizing are mandatory, BPF_MAP_TYPE_ARRAY.

But given we are converging on having a new map, maybe let's do just a
Bloom filter map with the tweaks we discussed in this thread for
map_extra and max_entries?

>
> It would probably still be an improvement to add:
> static inline long bpf_bit_test(void *map, void *value, long flags)
> {
>     return bpf_map_peek_elem(map, value, flags);
> }
> to some header file.
>
> Or maybe the bloomfilter and the loop can be a flavor of the bitset map,
> done inside the helper to spare bpf programmers the manual loops.
> Or such a loop could be another static inline function in a header file?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-24 23:12                               ` Andrii Nakryiko
  2021-09-27 16:41                                 ` Alexei Starovoitov
@ 2021-09-28  1:09                                 ` Martin KaFai Lau
  1 sibling, 0 replies; 37+ messages in thread
From: Martin KaFai Lau @ 2021-09-28  1:09 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, Alexei Starovoitov, bpf, Kernel Team

On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> > To make sure we're all aligned on the direction of this, for v4 I'm
> > going to make the following changes:
> > * Make the map a bitset + bloom filter map, depending on how many
> > hashes are passed in
> 
> I'm not sure we are all on the same page on the exact encoding and
> what to do about the bitset case (do we restrict keys to 1, 4, 8
> bytes? or do we go with a "noop" (or xor?) hash function? or?...). It
> would be good to hear from Alexei and Martin so as not to waste your
> time iterating on this. The rest looks good to me.
I would prefer the earlier definition of nr_hashes:

> > > nr_hash == 0 bit_idx == key for set/read/clear
With nr_hash == 0, the value (or key) is the position of a bit.
Since it is the position of a bit, it is easier to understand
if the value is an integer, either __u8, __u16, __u32, or __u64,
so I don't see a need for noop.
As the position of a bit, I would even limit the value_size to either
sizeof(__u32) or sizeof(__u64) for this case to keep it simple.

> > > nr_hashes >= 1 bit_idx[1..N] = hash(key, N) for set/read/clear.
With nr_hashes != 0 (let's say nr_hashes == N), the value (or key) is
hashed N times to set N bits.  I don't see a need to limit
value_size to particular non-zero sizes, or to require XOR with nr_hashes == 1.
Some hashes may need special handling for non-4-byte or non-8-byte value_sizes,
but if possible that should be internal to the kernel implementation
instead of limiting what value_size can be hashed.
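
In BPF program terms, the two modes would look roughly like this.  A
fragment only: bitset_map and bloom_map are hypothetical maps declared
elsewhere, and it assumes peek returns 0 on a hit:

	/* nr_hash == 0: pure bitset, the value is the position of a bit */
	__u32 bit_idx = 1234;
	long is_set;

	bpf_map_push_elem(&bitset_map, &bit_idx, 0);	/* set bit 1234 */
	is_set = !bpf_map_peek_elem(&bitset_map, &bit_idx);

	/* nr_hashes == N: the whole value is hashed N times to set N bits */
	struct flow { __u32 saddr, daddr; } key = { 1, 2 };
	long maybe_in;

	bpf_map_push_elem(&bloom_map, &key, 0);		/* add element */
	maybe_in = !bpf_map_peek_elem(&bloom_map, &key);/* false positives possible */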


* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-28  0:36                                       ` Andrii Nakryiko
@ 2021-09-28 16:21                                         ` Alexei Starovoitov
       [not found]                                           ` <aa967ed2-a958-f995-3a09-bbd6b6e775d4@fb.com>
  2021-09-29  0:14                                           ` Andrii Nakryiko
  0 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2021-09-28 16:21 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, Martin KaFai Lau, bpf, Kernel Team

On Mon, Sep 27, 2021 at 05:36:00PM -0700, Andrii Nakryiko wrote:
> On Mon, Sep 27, 2021 at 4:51 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Sep 27, 2021 at 02:14:09PM -0700, Andrii Nakryiko wrote:
> > > On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> > > > >
> > > > > That's not what I proposed. So let's say somewhere in the kernel we
> > > > > have this variable:
> > > > >
> > > > > static int bpf_bloom_exists = 1;
> > > > >
> > > > > Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
> > > > > all its hashed bits are set in Bloom filter (it "exists"), we return
> > > > > &bpf_bloom_exists. So it's not a NULL pointer.
> > > >
> > > > imo that's too much of a hack.
> > >
> > > too bad, because this feels pretty natural in BPF code:
> > >
> > > int my_key = 1234;
> > >
> > > if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) {
> > >     /* handle "maybe exists" */
> > > } else {
> > >     /* handle "definitely doesn't exist" */
> > > }
> >
> > I don't think it fits the bitset map.
> > In the bitset the value is zero or one. It always exists.
> > If bloomfilter is not a special map and instead implemented on top of
> > generic bitset with a plain loop in a bpf program then
> > push -> bit_set
> > pop -> bit_clear
> > peek -> bit_test
> > would be a better fit for bitset map.
> >
> > bpf_map_pop_elem() and peek_elem() don't have 'flags' argument.
> > In most cases that would be a blocker,
> > but in this case we can add:
> > .arg3_type      = ARG_ANYTHING
> > and ignore it in case of stack/queue.
> > While bitset could use the flags as an additional seed into the hash.
> > So to do a bloomfilter the bpf prog would do:
> > for (i = 0; i < 5; i++)
> >    if (bpf_map_peek_elem(map, &value, convert_i_to_seed(i)))
> 
> I think I'm getting lost in the whole unified bitset + bloom filter
> design, tbh. In this case, why would you pass the seed to peek()? And
> what is value here? Is that the value (N bytes) or the bit index (4
> bytes?)? 

The full N byte value, of course.
The pure index has the same downsides as hashing helper:
- hard to make kernel and user space produce the same hash in all cases
- inability to dynamically change max_entries in a clean way

> I assumed that once we have a hashing helper and a bitset
> map, you'd use that and seed to calculate bit index. But now I'm
> confused about what this peek operation is doing. I'm sorry if I'm
> just slow.
> 
> Overall, I think I agree with Joanne that a separate dedicated Bloom
> filter map will have simpler and better usability. This bitset + bloom
> filter generalization seems to just create unnecessary confusion. I
> feel the need for a bitset map even less than I felt the need for a
> Bloom filter, given it's an even simpler data structure and is
> totally implementable on either a global var array or
> BPF_MAP_TYPE_ARRAY, if map-in-map and dynamic sizing is mandatory.

Not really. For two reasons:
- inner array with N 8-byte elements is a slow workaround.
map_lookup is not inlined for inner arrays because max_entries will
be different.
- doing the same hash in user space and the kernel is hard.
For example, iproute2 is using socket(AF_ALG) to compute the same hash
(program tag) as the kernel.
Copy-paste of kernel jhash.h is not possible due to GPL,
but, as you pointed out, it's public domain, so user space would
need to find a public-domain version, reimplement jhash, and then
make sure that it produces the same hash as the kernel.
All these trade-offs point to the need for a dedicated map type
(either a generalized bitset or a true bloomfilter) that does the hashing
and can change its size.
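
fwiw, the AF_ALG dance looks roughly like this from user space (a
sketch with minimal error handling; iproute2 uses "sha1" for the
program tag, and note there is no "jhash" algorithm exposed through
AF_ALG, which is part of the problem):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

static int alg_hash(const char *alg, const void *data, size_t len,
		    void *out, size_t out_len)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
	};
	int sfd, fd, err = -1;

	strncpy((char *)sa.salg_name, alg, sizeof(sa.salg_name) - 1);

	sfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (sfd < 0)
		return -1;
	if (!bind(sfd, (struct sockaddr *)&sa, sizeof(sa))) {
		fd = accept(sfd, NULL, NULL);
		if (fd >= 0) {
			if (write(fd, data, len) == (ssize_t)len &&
			    read(fd, out, out_len) == (ssize_t)out_len)
				err = 0;	/* digest is in 'out' */
			close(fd);
		}
	}
	close(sfd);
	return err;
}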


* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
       [not found]                                           ` <aa967ed2-a958-f995-3a09-bbd6b6e775d4@fb.com>
@ 2021-09-28 23:54                                             ` Andrii Nakryiko
  2021-09-29  1:54                                               ` Joanne Koong
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-28 23:54 UTC (permalink / raw)
  To: Joanne Koong; +Cc: Alexei Starovoitov, Martin KaFai Lau, bpf, Kernel Team

On Tue, Sep 28, 2021 at 1:56 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> On 9/28/21 9:21 AM, Alexei Starovoitov wrote:
>
> > On Mon, Sep 27, 2021 at 05:36:00PM -0700, Andrii Nakryiko wrote:
> >> On Mon, Sep 27, 2021 at 4:51 PM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>> On Mon, Sep 27, 2021 at 02:14:09PM -0700, Andrii Nakryiko wrote:
> >>>> On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov
> >>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>> On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> >>>>>> That's not what I proposed. So let's say somewhere in the kernel we
> >>>>>> have this variable:
> >>>>>>
> >>>>>> static int bpf_bloom_exists = 1;
> >>>>>>
> >>>>>> Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
> >>>>>> all its hashed bits are set in Bloom filter (it "exists"), we return
> >>>>>> &bpf_bloom_exists. So it's not a NULL pointer.
> >>>>> imo that's too much of a hack.
> >>>> too bad, because this feels pretty natural in BPF code:
> >>>>
> >>>> int my_key = 1234;
> >>>>
> >>>> if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) {
> >>>>      /* handle "maybe exists" */
> >>>> } else {
> >>>>      /* handle "definitely doesn't exist" */
> >>>> }
>
> To summarize, Andrii, it seems like you are proposing two possibilities
> for passing the ptr to the data as the key:
>
> 1. Have the value be NULL (value_size = 0)

Yeah, this was my biggest hope (except value is non-NULL, it's just
zero-sized so not readable/writable), but that's academic at this
point, see below.

>
> 2. Have the value be value_size = 1 byte or value_size = 4 bytes in a
> worst-case scenario
>
>
> For #1 where the value_size is 0 and we return something like
> &bpf_bloom_exists for bpf_map_lookup_elem() where the key is found,
> this would still require us to do the following in the syscall layer
> elsewhere:
>
> a) In the syscall layer in map_lookup_elem, add code that will allow
> value_sizes of 0. This would require another change in
> bpf_map_copy_value where we have to also check that if the value_size
> is 0, then we shouldn't copy the resulting ptr of the
> bpf_map_lookup_elem call to the value ptr (the value ptr isn't
> allocated since value_size is 0).
>
> b) In map_update_elem, add code that allows the user to pass in a
> NULL / zero-size value. Currently, there exists only support for
> passing in a NULL/zero-size key (which was added for stack/queue maps
> that pass in NULL keys); we'd have to add in the equivalent for
> NULL/zero-size values. We'd also have to modify the verifier to allow
> bpf_map_update_elem for NULL values (ARG_PTR_TO_UNINIT_MAP_KEY).

This "UNINIT_MAP_KEY" part is confusing me because we are talking
about *NULL value* (so neither uninitialized nor a key), so I must be
missing something important here. I thought it would be
ARG_PTR_TO_MAP_VALUE_OR_NULL, but it does suck that other map types
would then be allowed NULL where they don't expect to get NULL, I
agree.

>
>
> For #2, from the user-side for bpf_map_update_elem, this now means
> the user would have to always pass in some dummy 1-byte or 4-byte value
> since the value_size is no longer 0. This seems like a hacky API.
>
> Repurposing peek/push/pop (in the scenario where value_size = 0) would
> avoid the bpf_map_copy_value change in #1a altogether, which was the
> primary reason for suggesting it.
>
> The approach taken in this patchset (where we have the key as NULL, and
> the value as the ptr to the data) avoids the need to add that
> infrastructure outlined above for allowing NULL values, since it just
> rides on top of the changes that were added to the stack/queue map that
> allows NULL keys.

So overall, I agree that all the above will be an unnecessary
complication for relatively little gain. Just go with peek/pop/push.
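
On the BPF side that then reads something like this (a sketch; the
final map type name and flag values are still being discussed):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_BLOOM_FILTER);	/* proposed type */
	__uint(max_entries, 10000);
	__type(value, __u32);
} bloom SEC(".maps");

SEC("tp/syscalls/sys_enter_getpid")	/* arbitrary attach point */
int check(void *ctx)
{
	__u32 val = 1234;

	bpf_map_push_elem(&bloom, &val, BPF_ANY);	/* add element */
	if (!bpf_map_peek_elem(&bloom, &val)) {
		/* "maybe exists" */
	} else {
		/* definitely doesn't exist */
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";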

>
> >>> I don't think it fits the bitset map.
> >>> In the bitset the value is zero or one. It always exists.
> >>> If bloomfilter is not a special map and instead implemented on top of
> >>> generic bitset with a plain loop in a bpf program then
> >>> push -> bit_set
> >>> pop -> bit_clear
> >>> peek -> bit_test
> >>> would be a better fit for bitset map.
> >>>
> >>> bpf_map_pop_elem() and peek_elem() don't have 'flags' argument.
> >>> In most cases that would be a blocker,
> >>> but in this case we can add:
> >>> .arg3_type      = ARG_ANYTHING
> >>> and ignore it in case of stack/queue.
> >>> While bitset could use the flags as an additional seed into the hash.
> >>> So to do a bloomfilter the bpf prog would do:
> >>> for (i = 0; i < 5; i++)
> >>>     if (bpf_map_peek_elem(map, &value, convert_i_to_seed(i)))
> >> I think I'm getting lost in the whole unified bitset + bloom filter
> >> design, tbh. In this case, why would you pass the seed to peek()? And
> >> what is value here? Is that the value (N bytes) or the bit index (4
> >> bytes?)?
> In that example where seed is passed to peek(), the context is the
> hypothetical scenario where the bloom filter is implemented on top
> of a generic bitset.

But then why does *bitset* do the hashing on behalf of the user,
that's the confusing bit. But I'll reply to Alexei's email in just a
sec.

> > The full N byte value, of course.
> > The pure index has the same downsides as hashing helper:
> > - hard to make kernel and user space produce the same hash in all cases
> > - inability to dynamically change max_entries in a clean way
> >
> >> I assumed that once we have a hashing helper and a bitset
> >> map, you'd use that and seed to calculate bit index. But now I'm
> >> confused about what this peek operation is doing. I'm sorry if I'm
> >> just slow.
> >>
> >> Overall, I think I agree with Joanne that a separate dedicated Bloom
> >> filter map will have simpler and better usability. This bitset + bloom
> >> filter generalization seems to just create unnecessary confusion. I
> >> feel the need for a bitset map even less than I felt the need for a
> >> Bloom filter, given it's an even simpler data structure and is
> >> totally implementable on either a global var array or
> >> BPF_MAP_TYPE_ARRAY, if map-in-map and dynamic sizing is mandatory.
> > Not really. For two reasons:
> > - inner array with N 8-byte elements is a slow workaround.
> > map_lookup is not inlined for inner arrays because max_entries will
> > be different.
> > - doing the same hash in user space and the kernel is hard.
> > For example, iproute2 is using socket(AF_ALG) to compute the same hash
> > (program tag) as the kernel.
> > Copy-paste of kernel jhash.h is not possible due to GPL,
> > but, as you pointed out, it's public domain, so user space would
> > need to find a public-domain version, reimplement jhash, and then
> > make sure that it produces the same hash as the kernel.
> > All these trade-offs point to the need for a dedicated map type
> > (either a generalized bitset or a true bloomfilter) that does the hashing
> > and can change its size.
>
>
> To ensure we are all aligned on this conversation, here is in more
> detail what I am intending for the v4 map changes:
> * A bitset map that also internally functions as a bloom filter if
> nr_hashes > 0 (where nr_hashes is denoted through the map_extra flags).
> max_entries will be the requested size of the bitset. Key_size should always
> be 0.

ok, makes sense. max_entries is the number of bytes or bits? Not sure
which is better (bytes is more consistent with other uses and allows
for bigger bitset/filter, but bits might be more natural for bitset),
just bringing this up as it's unclear.


> * Add the convenience helpers
> bool bpf_bitset_clear(map, value);
> bool bpf_bitset_set(map, value);
> bool bpf_bitset_test(map, value);
>

It maps one-to-one to bpf_map_pop_elem(), bpf_map_push_elem(), and
bpf_map_peek_elem(), right? The signatures for pop and peek are
*identical* (map + value), while push has also extra flags (not a big
deal, we have 0 flags in a lot of helpers). So I don't see much value
for this (and it actually will be more confusing when bitset is really
a bloom filter :) ).

>
> In the case where nr_hashes == 0 (straight bitset):
> * For simplicity, value_size for the bitmap should always be a u32. This
> denotes which index of the bitmap to set/check/clear

SGTM.

>
> In the case where nr_hashes > 0 (bloom filter):
> * The value size can be whatever
> * Clear/delete is not supported
>

Sounds good as well.
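
For concreteness, the declaration side could then end up looking like
this (a sketch only; the BPF_MAP_TYPE_BITSET name and libbpf support
for a map_extra field are assumptions at this point):

struct {
	__uint(type, BPF_MAP_TYPE_BITSET);	/* hypothetical name */
	__uint(map_extra, 5);			/* nr_hashes; 0 = plain bitset */
	__uint(max_entries, 1 << 20);		/* number of bits */
	__type(value, __u32);			/* any type when nr_hashes > 0 */
} filter SEC(".maps");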


* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-28 16:21                                         ` Alexei Starovoitov
       [not found]                                           ` <aa967ed2-a958-f995-3a09-bbd6b6e775d4@fb.com>
@ 2021-09-29  0:14                                           ` Andrii Nakryiko
  2021-09-29  3:17                                             ` Alexei Starovoitov
  1 sibling, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2021-09-29  0:14 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Joanne Koong, Martin KaFai Lau, bpf, Kernel Team

On Tue, Sep 28, 2021 at 9:21 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 27, 2021 at 05:36:00PM -0700, Andrii Nakryiko wrote:
> > On Mon, Sep 27, 2021 at 4:51 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Sep 27, 2021 at 02:14:09PM -0700, Andrii Nakryiko wrote:
> > > > On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
> > > > > >
> > > > > > That's not what I proposed. So let's say somewhere in the kernel we
> > > > > > have this variable:
> > > > > >
> > > > > > static int bpf_bloom_exists = 1;
> > > > > >
> > > > > > Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
> > > > > > all its hashed bits are set in Bloom filter (it "exists"), we return
> > > > > > &bpf_bloom_exists. So it's not a NULL pointer.
> > > > >
> > > > > imo that's too much of a hack.
> > > >
> > > > too bad, because this feels pretty natural in BPF code:
> > > >
> > > > int my_key = 1234;
> > > >
> > > > if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) {
> > > >     /* handle "maybe exists" */
> > > > } else {
> > > >     /* handle "definitely doesn't exist" */
> > > > }
> > >
> > > I don't think it fits the bitset map.
> > > In the bitset the value is zero or one. It always exists.
> > > If bloomfilter is not a special map and instead implemented on top of
> > > generic bitset with a plain loop in a bpf program then
> > > push -> bit_set
> > > pop -> bit_clear
> > > peek -> bit_test
> > > would be a better fit for bitset map.
> > >
> > > bpf_map_pop_elem() and peek_elem() don't have 'flags' argument.
> > > In most cases that would be a blocker,
> > > but in this case we can add:
> > > .arg3_type      = ARG_ANYTHING
> > > and ignore it in case of stack/queue.
> > > While bitset could use the flags as an additional seed into the hash.
> > > So to do a bloomfilter the bpf prog would do:
> > > for (i = 0; i < 5; i++)
> > >    if (bpf_map_peek_elem(map, &value, convert_i_to_seed(i)))
> >
> > I think I'm getting lost in the whole unified bitset + bloom filter
> > design, tbh. In this case, why would you pass the seed to peek()? And
> > what is value here? Is that the value (N bytes) or the bit index (4
> > bytes?)?
>
> The full N byte value, of course.
> The pure index has the same downsides as hashing helper:
> - hard to make kernel and user space produce the same hash in all cases
> - inability to dynamically change max_entries in a clean way

Here's the confusing part. If the map is performing hashing, then why
would we do 5 explicit iterations instead of the map (bloom filter)
just doing it internally in one go (faster and simpler). But if it is
a bitset (and so we have to do 5 iterations to implement Bloom filter
logic), then it is quite unconventional for the bitset data structure
to perform the hashing of a value. The only upside of this hybrid
one-bit-at-a-time-but-with-hashing-included approach would be more
freedom in choosing the hashing seed for each individual bit. But that
hasn't come up as a limitation in the discussion at all, which made me
wonder what we are optimizing for here.

>
> > I assumed that once we have a hashing helper and a bitset
> > map, you'd use that and seed to calculate bit index. But now I'm
> > confused about what this peek operation is doing. I'm sorry if I'm
> > just slow.
> >
> > Overall, I think I agree with Joanne that a separate dedicated Bloom
> > filter map will have simpler and better usability. This bitset + bloom
> > filter generalization seems to just create unnecessary confusion. I
> > don't feel the need for bitset map even more than I didn't feel the
> > need for Bloom filter, given it's even simpler data structure and is
> > totally implementable on either global var array or
> > BPF_MAP_TYPE_ARRAY, if map-in-map and dynamic sizing in mandatory.
>
> Not really. For two reasons:
> - inner array with N 8-byte elements is a slow workaround.
> map_lookup is not inlined for inner arrays because max_entries will
> be different.

If we are talking about a bit set data structure (not a Bloom filter),
then the elementary operation is setting/resetting *one bit*. For that,
looking up an 8-byte element and setting a bit in it is just as
efficient as having a dedicated bpf_map_push_elem(). Keep in mind, I'm
talking about pure bitset logic here. Once you add N hashes (and thus
we start talking about Bloom filter, not a bit set), then you get N
helper calls vs just 1 for bloom filter logic.

So for a Bloom filter you get a performance advantage from a dedicated map
(due to having just 1 helper call to do N hashing operations). For
pure bitset, there seems to be little benefit at all because it is
basically ARRAY. For pure bitset of a fixed size, doing it on a global
var array is faster (no helper overhead). For map-in-map case with
known or unknown size, ARRAY is equivalent to BITSET map (one helper
call + setting 1 bit through returned pointer).

But the nr_hashes == 0 special casing for pure bitset works fine as well.
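
(Concretely, the ARRAY flavor of bit_set is one helper call plus a bit
op through the returned pointer; rough sketch:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, (1 << 20) / 64);	/* 1M bits as __u64 words */
	__type(key, __u32);
	__type(value, __u64);
} bitset SEC(".maps");

static __always_inline void arr_bit_set(__u32 idx)
{
	__u32 word = idx / 64;
	__u64 *w = bpf_map_lookup_elem(&bitset, &word);	/* 1 helper call */

	if (w)
		__sync_fetch_and_or(w, 1ULL << (idx & 63));
}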

> - doing the same hash in user space and the kernel is hard.
> For example, iproute2 is using socket(AF_ALG) to compute the same hash
> (program tag) as the kernel.
> Copy-paste of kernel jhash.h is not possible due to GPL,

I haven't found an SPDX header or any mention of GPL in
include/linux/jhash.h, so I assumed someone could just copy-paste the
code (given the references to public domain). Seems like that's not
the case? Just curious about the implications, license-wise, if there
is no SPDX header? Is it still considered GPL?

> but, as you pointed out, it's public domain, so user space would
> need to find a public-domain version, reimplement jhash, and then
> make sure that it produces the same hash as the kernel.
> All these trade-offs point to the need for a dedicated map type
> (either a generalized bitset or a true bloomfilter) that does the hashing
> and can change its size.

Sure, we've converged on that already. I was confused by the above
example of an explicit loop + bpf_map_peek_elem with hashing.


* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-28 23:54                                             ` Andrii Nakryiko
@ 2021-09-29  1:54                                               ` Joanne Koong
  0 siblings, 0 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-29  1:54 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Alexei Starovoitov, Martin KaFai Lau, bpf, Kernel Team

On 9/28/21 4:54 PM, Andrii Nakryiko wrote:

> On Tue, Sep 28, 2021 at 1:56 PM Joanne Koong <joannekoong@fb.com> wrote:
>> On 9/28/21 9:21 AM, Alexei Starovoitov wrote:
>>
>>> On Mon, Sep 27, 2021 at 05:36:00PM -0700, Andrii Nakryiko wrote:
>>>> On Mon, Sep 27, 2021 at 4:51 PM Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>> On Mon, Sep 27, 2021 at 02:14:09PM -0700, Andrii Nakryiko wrote:
>>>>>> On Mon, Sep 27, 2021 at 9:41 AM Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Fri, Sep 24, 2021 at 04:12:11PM -0700, Andrii Nakryiko wrote:
>>>>>>>> That's not what I proposed. So let's say somewhere in the kernel we
>>>>>>>> have this variable:
>>>>>>>>
>>>>>>>> static int bpf_bloom_exists = 1;
>>>>>>>>
>>>>>>>> Now, for bpf_map_lookup_elem() helper we pass data as key pointer. If
>>>>>>>> all its hashed bits are set in Bloom filter (it "exists"), we return
>>>>>>>> &bpf_bloom_exists. So it's not a NULL pointer.
>>>>>>> imo that's too much of a hack.
>>>>>> too bad, because this feels pretty natural in BPF code:
>>>>>>
>>>>>> int my_key = 1234;
>>>>>>
>>>>>> if (bpf_map_lookup_elem(&my_bloom_filter, &my_key)) {
>>>>>>       /* handle "maybe exists" */
>>>>>> } else {
>>>>>>       /* handle "definitely doesn't exist" */
>>>>>> }
>> To summarize, Andrii, it seems like you are proposing two possibilities
>> for passing the ptr to the data as the key:
>>
>> 1. Have the value be NULL (value_size = 0)
> Yeah, this was my biggest hope (except value is non-NULL, it's just
> zero-sized so not readable/writable), but that's academic at this
> point, see below.
>
>> 2. Have the value be value_size = 1 byte or value_size = 4 bytes in a
>> worst-case scenario
>>
>>
>> For #1 where the value_size is 0 and we return something like
>> &bpf_bloom_exists for bpf_map_lookup_elem() where the key is found,
>> this would still require us to do the following in the syscall layer
>> elsewhere:
>>
>> a) In the syscall layer in map_lookup_elem, add code that will allow
>> value_sizes of 0. This would require another change in
>> bpf_map_copy_value where we have to also check that if the value_size
>> is 0, then we shouldn't copy the resulting ptr of the
>> bpf_map_lookup_elem call to the value ptr (the value ptr isn't
>> allocated since value_size is 0).
>>
>> b) In map_update_elem, add code that allows the user to pass in a
>> NULL / zero-size value. Currently, there exists only support for
>> passing in a NULL/zero-size key (which was added for stack/queue maps
>> that pass in NULL keys); we'd have to add in the equivalent for
>> NULL/zero-size values. We'd also have to modify the verifier to allow
>> bpf_map_update_elem for NULL values (ARG_PTR_TO_UNINIT_MAP_KEY).
> This "UNINIT_MAP_KEY" part is confusing me because we are talking
> about *NULL value* (so neither uninitialized nor a key), so I must be
> missing something important here. I thought it would be
> ARG_PTR_TO_MAP_VALUE_OR_NULL, but it does suck that other map types
> would then be allowed NULL where they don't expect to get NULL, I
> agree.
You're right, this would just be changed to ARG_PTR_TO_MAP_VALUE_OR_NULL;
I mis-pasted ARG_PTR_TO_UNINIT_MAP_KEY from a previous draft. Sorry if
that caused confusion.
>> For #2, from the user-side for bpf_map_update_elem, this now means
>> the user would have to always pass in some dummy 1-byte or 4-byte value
>> since the value_size is no longer 0. This seems like a hacky API.
>>
>> Repurposing peek/push/pop (in the scenario where value_size = 0) would
>> avoid the bpf_map_copy_value change in #1a altogether, which was the
>> primary reason for suggesting it.
>>
>> The approach taken in this patchset (where we have the key as NULL, and
>> the value as the ptr to the data) avoids the need to add that
>> infrastructure outlined above for allowing NULL values, since it just
>> rides on top of the changes that were added to the stack/queue map that
>> allows NULL keys.
> So overall, I agree that all the above will be an unnecessary
> complication for relatively little gain. Just go with peek/pop/push.
>
>>>>> I don't think it fits the bitset map.
>>>>> In the bitset the value is zero or one. It always exists.
>>>>> If bloomfilter is not a special map and instead implemented on top of
>>>>> generic bitset with a plain loop in a bpf program then
>>>>> push -> bit_set
>>>>> pop -> bit_clear
>>>>> peek -> bit_test
>>>>> would be a better fit for bitset map.
>>>>>
>>>>> bpf_map_pop_elem() and peek_elem() don't have 'flags' argument.
>>>>> In most cases that would be a blocker,
>>>>> but in this case we can add:
>>>>> .arg3_type      = ARG_ANYTHING
>>>>> and ignore it in case of stack/queue.
>>>>> While bitset could use the flags as an additional seed into the hash.
>>>>> So to do a bloomfilter the bpf prog would do:
>>>>> for (i = 0; i < 5; i++)
>>>>>      if (bpf_map_peek_elem(map, &value, convert_i_to_seed(i)))
>>>> I think I'm getting lost in the whole unified bitset + bloom filter
>>>> design, tbh. In this case, why would you pass the seed to peek()? And
>>>> what is value here? Is that the value (N bytes) or the bit index (4
>>>> bytes?)?
>> In that example where seed is passed to peek(), the context is the
>> hypothetical scenario where the bloom filter is implemented on top
>> of a generic bitset.
> But then why does *bitset* do the hashing on behalf of the user,
> that's the confusing bit. But I'll reply to Alexei's email in just a
> sec.
>
>>> The full N byte value, of course.
>>> The pure index has the same downsides as hashing helper:
>>> - hard to make kernel and user space produce the same hash in all cases
>>> - inability to dynamically change max_entries in a clean way
>>>
>>>> I assumed that once we have a hashing helper and a bitset
>>>> map, you'd use that and seed to calculate bit index. But now I'm
>>>> confused about what this peek operation is doing. I'm sorry if I'm
>>>> just slow.
>>>>
>>>> Overall, I think I agree with Joanne that a separate dedicated Bloom
>>>> filter map will have simpler and better usability. This bitset + bloom
>>>> filter generalization seems to just create unnecessary confusion. I
>>>> feel the need for a bitset map even less than I felt the need for a
>>>> Bloom filter, given it's an even simpler data structure and is
>>>> totally implementable on either a global var array or
>>>> BPF_MAP_TYPE_ARRAY, if map-in-map and dynamic sizing is mandatory.
>>> Not really. For two reasons:
>>> - inner array with N 8-byte elements is a slow workaround.
>>> map_lookup is not inlined for inner arrays because max_entries will
>>> be different.
>>> - doing the same hash in user space and the kernel is hard.
>>> For example, iproute2 is using socket(AF_ALG) to compute the same hash
>>> (program tag) as the kernel.
>>> Copy-paste of kernel jhash.h is not possible due to GPL,
>>> but, as you pointed out, it's public domain, so user space would
>>> need to find a public-domain version, reimplement jhash, and then
>>> make sure that it produces the same hash as the kernel.
>>> All these trade-offs point to the need for a dedicated map type
>>> (either a generalized bitset or a true bloomfilter) that does the hashing
>>> and can change its size.
>>
>> To ensure we are all aligned on this conversation, here is in more
>> detail what I am intending for the v4 map changes:
>> * A bitset map that also internally functions as a bloom filter if
>> nr_hashes > 0 (where nr_hashes is denoted through the map_extra flags).
>> max_entries will be the requested size of the bitset. Key_size should always
>> be 0.
> ok, makes sense. max_entries is the number of bytes or bits? Not sure
> which is better (bytes is more consistent with other uses and allows
> for bigger bitset/filter, but bits might be more natural for bitset),
> just bringing this up as it's unclear.
>
I think the number of bits would be more ergonomic for the user in the
context of the bitset map, especially since they will be operating at
the granularity of a bit anyway when setting/clearing/checking a
specified bit.
>> * Add the convenience helpers
>> bool bpf_bitset_clear(map, value);
>> bool bpf_bitset_set(map, value);
>> bool bpf_bitset_test(map, value);
>>
> It maps one-to-one to bpf_map_pop_elem(), bpf_map_push_elem(), and
> bpf_map_peek_elem(), right? The signatures for pop and peek are
> *identical* (map + value), while push has also extra flags (not a big
> deal, we have 0 flags in a lot of helpers). So I don't see much value
> for this (and it actually will be more confusing when bitset is really
> a bloom filter :) ).
Makes sense! I will not add these convenience helpers in v4.
>> In the case where nr_hashes == 0 (straight bitset):
>> * For simplicity, value_size for the bitmap should always be a u32. This
>> denotes which index of the bitmap to set/check/clear
> SGTM.
>
>> In the case where nr_hashes > 0 (bloom filter):
>> * The value size can be whatever
>> * Clear/delete is not supported
>>
> Sounds good as well.


* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-29  0:14                                           ` Andrii Nakryiko
@ 2021-09-29  3:17                                             ` Alexei Starovoitov
  2021-09-29  3:38                                               ` Joanne Koong
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2021-09-29  3:17 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Joanne Koong, Martin KaFai Lau, bpf, Kernel Team

On Tue, Sep 28, 2021 at 05:14:55PM -0700, Andrii Nakryiko wrote:
> 
> So for a Bloom filter you get a performance advantage from a dedicated map
> (due to having just 1 helper call to do N hashing operations). For
> pure bitset, there seems to be little benefit at all because it is
> basically ARRAY.

I brought up bitset only as an idea to make the bloomfilter map a bit more
generic, and was looking for feedback on whether to call the new map
"bitset that allows bloomfilter operations" or call it
"bloomfilter that simplifies to bitset" with nr_hashes=0.

It sounds like using bloomfilter as the base name fits better.

> I haven't found an SPDX header or any mention of GPL in
> include/linux/jhash.h, so I assumed someone could just copy-paste the
> code (given the references to public domain). Seems like that's not
> the case? Just curious about the implications, license-wise, if there
> is no SPDX header? Is it still considered GPL?

I believe so. Every project that copy-pasted jhash from the kernel
adds an SPDX GPL-2 tag to its source.


* Re: [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation
  2021-09-29  3:17                                             ` Alexei Starovoitov
@ 2021-09-29  3:38                                               ` Joanne Koong
  0 siblings, 0 replies; 37+ messages in thread
From: Joanne Koong @ 2021-09-29  3:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko; +Cc: Martin KaFai Lau, bpf, Kernel Team

On 9/28/21 8:17 PM, Alexei Starovoitov wrote:

> On Tue, Sep 28, 2021 at 05:14:55PM -0700, Andrii Nakryiko wrote:
>> So for a Bloom filter you get a performance advantage from a dedicated map
>> (due to having just 1 helper call to do N hashing operations). For
>> pure bitset, there seems to be little benefit at all because it is
>> basically ARRAY.
> I brought up bitset only as an idea to make the bloomfilter map a bit more
> generic, and was looking for feedback on whether to call the new map
> "bitset that allows bloomfilter operations" or call it
> "bloomfilter that simplifies to bitset" with nr_hashes=0.
>
> It sounds like using bloomfilter as the base name fits better.
I am in favor of calling the new map "bitset that allows bloom filter
operations" and having the map be BPF_MAP_TYPE_BITSET everywhere.
To me that is more intuitive, since the bloom filter is built on top of
a bitset, and I think more people will generally be using bitsets than
the bloom filter. Additionally, users who don't know or care what a
bloom filter is might skip this map type altogether and not realize it
can also be used as a bitset.
>> I haven't found an SPDX header or any mention of GPL in
>> include/linux/jhash.h, so I assumed someone could just copy-paste the
>> code (given the references to public domain). Seems like that's not
>> the case? Just curious about the implications, license-wise, if there
>> is no SPDX header? Is it still considered GPL?
> I believe so. Every project that copy-pasted jhash from the kernel
> adds an SPDX GPL-2 tag to its source.


end of thread

Thread overview: 37+ messages
2021-09-21 21:02 [PATCH v3 bpf-next 0/5] Implement bloom filter map Joanne Koong
2021-09-21 21:02 ` [PATCH v3 bpf-next 1/5] bpf: Add bloom filter map implementation Joanne Koong
2021-09-21 23:44   ` Andrii Nakryiko
2021-09-22 19:06     ` Joanne Koong
2021-09-22 19:38       ` Martin KaFai Lau
2021-09-22 20:52         ` Andrii Nakryiko
2021-09-22 22:08           ` Martin KaFai Lau
2021-09-22 23:07             ` Andrii Nakryiko
2021-09-23  1:28               ` Martin KaFai Lau
2021-09-23 18:42                 ` Andrii Nakryiko
2021-09-23 19:42                   ` Martin KaFai Lau
2021-09-23 20:30                     ` Alexei Starovoitov
2021-09-23 21:12                       ` Andrii Nakryiko
2021-09-23 22:28                         ` Joanne Koong
2021-09-23 23:46                           ` Martin KaFai Lau
2021-09-24  2:23                           ` Andrii Nakryiko
2021-09-24 16:32                             ` Joanne Koong
2021-09-24 23:12                               ` Andrii Nakryiko
2021-09-27 16:41                                 ` Alexei Starovoitov
2021-09-27 21:14                                   ` Andrii Nakryiko
2021-09-27 23:51                                     ` Alexei Starovoitov
2021-09-28  0:36                                       ` Andrii Nakryiko
2021-09-28 16:21                                         ` Alexei Starovoitov
     [not found]                                           ` <aa967ed2-a958-f995-3a09-bbd6b6e775d4@fb.com>
2021-09-28 23:54                                             ` Andrii Nakryiko
2021-09-29  1:54                                               ` Joanne Koong
2021-09-29  0:14                                           ` Andrii Nakryiko
2021-09-29  3:17                                             ` Alexei Starovoitov
2021-09-29  3:38                                               ` Joanne Koong
2021-09-28  1:09                                 ` Martin KaFai Lau
2021-09-22 20:44       ` Andrii Nakryiko
2021-09-21 21:02 ` [PATCH v3 bpf-next 2/5] libbpf: Allow the number of hashes in bloom filter maps to be configurable Joanne Koong
2021-09-21 22:24   ` Joanne Koong
2021-09-22 23:14     ` Andrii Nakryiko
2021-09-21 23:57   ` Andrii Nakryiko
2021-09-21 21:02 ` [PATCH v3 bpf-next 3/5] selftests/bpf: Add bloom filter map test cases Joanne Koong
2021-09-21 21:02 ` [PATCH v3 bpf-next 4/5] bpf/benchs: Add benchmark test for bloom filter maps Joanne Koong
2021-09-21 21:02 ` [PATCH v3 bpf-next 5/5] bpf/benchs: Add benchmarks for comparing hashmap lookups with vs. without bloom filter Joanne Koong
