From: Joanne Koong <joannekoong@fb.com>
To: John Fastabend <john.fastabend@gmail.com>,
Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next 1/5] bpf: Add bloom filter map implementation
Date: Thu, 2 Sep 2021 15:07:56 -0700
Message-ID: <0c1bb5a6-4ef5-77b4-cd10-aea0060d5349@fb.com>
In-Reply-To: <61305cf822fa_439b208a5@john-XPS-13-9370.notmuch>
On 9/1/21 10:11 PM, John Fastabend wrote:
> Andrii Nakryiko wrote:
>> On Tue, Aug 31, 2021 at 3:51 PM Joanne Koong <joannekoong@fb.com> wrote:
>>> Bloom filters are a space-efficient probabilistic data structure
>>> used to quickly test whether an element exists in a set.
>>> In a bloom filter, false positives are possible whereas false
>>> negatives are not.
>>>
>>> This patch adds a bloom filter map for bpf programs.
>>> The bloom filter map supports peek (determining whether an element
>>> is present in the map) and push (adding an element to the map)
>>> operations. These operations are exposed to userspace applications
>>> through the already existing syscalls in the following way:
>>>
>>> BPF_MAP_LOOKUP_ELEM -> peek
>>> BPF_MAP_UPDATE_ELEM -> push
>>>
>>> The bloom filter map does not have keys, only values. In light of
>>> this, the bloom filter map's API matches that of queue and stack maps:
>>> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
>>> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
>>> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
>>> APIs to query or add an element to the bloom filter map. When the
>>> bloom filter map is created, it must be created with a key_size of 0.
>>>
>>> For updates, the user will pass in the element to add to the map
>>> as the value, with a NULL key. For lookups, the user will pass in the
>>> element to query in the map as the value. In the verifier layer, this
>>> requires us to modify the argument type of a bloom filter's
>>> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
>>> the syscall layer, we need to copy over the user value so that in
>>> bpf_map_peek_elem, we know which specific value to query.
>>>
>>> The maximum number of entries in the bloom filter is not enforced; if
>>> the user wishes to insert more entries into the bloom filter than they
>>> specified as the max entries size of the bloom filter, that is
>>> permitted, but the bloom filter will have a higher false positive rate.
>>>
>>> The number of hashes to use for the bloom filter is configurable from
>>> userspace. The benchmarks later in this patchset can help compare the
>>> performance of different numbers of hashes on different entry
>>> sizes. In general, using more hashes decreases the speed of a lookup,
>>> but decreases the false positive rate of an element being detected in
>>> the bloom filter.
>>>
>>> Signed-off-by: Joanne Koong <joannekoong@fb.com>
>>> ---
>> This looks nice and simple. I left a few comments below.
>>
>> But one high-level point I wanted to discuss was that bloom filter
>> logic is actually simple enough to be implementable by pure BPF
>> program logic. The only problematic part is generic hashing of a piece
>> of memory. Regardless of implementing bloom filter as kernel-provided
>> BPF map or implementing it with custom BPF program logic, having BPF
>> helper for hashing a piece of memory seems extremely useful and very
>> generic. I can't recall if we ever discussed adding such helpers, but
>> maybe we should?
> Aha started typing the same thing :)
>
> Adding generic hash helper has been on my todo list and close to the top
> now. The use case is hashing data from skb payloads and such from kprobe
> and sockmap side. I'm happy to work on it as soon as possible if no one
> else picks it up.
>
>> It would be a really interesting experiment to implement the same
>> logic in pure BPF logic and run it as another benchmark, along the
>> Bloom filter map. BPF has both spinlock and atomic operation, so we
>> can compare and contrast. We only miss hashing BPF helper.
> The one issue I've found writing hash logic is it's a bit tricky
> to get the verifier to consume it. Especially when the hash is nested
> inside a for loop and sometimes a couple for loops so you end up with
> things like,
>
> for (i = 0; i < someTlvs; i++) {
>         for (j = 0; j < someKeys; j++) {
>                 ...
>                 bpf_hash(someValue)
>                 ...
>         }
> }
>
> I've found small seemingly unrelated changes cause the complexity limit
> to explode. Usually we can work around it with code to get pruning
> points and such, but it's a bit ugly. Perhaps this means we need
> to dive into details of why the complexity explodes, but I've not
> got to it yet. The todo list is long.
>
>> Being able to do this in pure BPF code has a bunch of advantages.
>> Depending on specific application, users can decide to:
>> - speed up the operation by ditching spinlock or atomic operation,
>> if the logic allows to lose some bit updates;
>> - decide on optimal size, which might not be a power of 2, depending
>> on memory vs. CPU trade-off in any particular case;
>> - it's also possible to implement a more general Counting Bloom
>> filter, all without modifying the kernel.
> Also it means no call, and if you build it on top of an array
> map of size 1 it's just a load. Could this be a performance win, for
> example, a Bloom filter in XDP for DDoS? Maybe. Not sure if the program
> would be complex enough a call might be in the noise. I don't know.
>
>> We could go further, and start implementing other simple data
>> structures relying on hashing, like HyperLogLog. And all with no
>> kernel modifications. Map-in-map is no issue as well, because there is
>> a choice of using either fixed global data arrays for maximum
>> performance, or using BPF_MAP_TYPE_ARRAY maps that can go inside
>> map-in-map.
> We've been doing most of our array maps as single entry arrays
> at this point and just calculating offsets directly in BPF. Same
> for some simple hashing algorithms.
>
>> Basically, regardless of having this map in the kernel or not, let's
>> have a "universal" hashing function as a BPF helper as well.
> Yes please!
I completely agree!
>> Thoughts?
> I like it, but not the bloom filter expert here.
Ooh, I like your idea of comparing the performance of the bloom filter as
a kernel-provided BPF map vs. as custom BPF program logic using a
hash helper, especially if a BPF hash helper is something useful that
we want to add to the codebase in and of itself!
>>> include/linux/bpf.h | 3 +-
>>> include/linux/bpf_types.h | 1 +
>>> include/uapi/linux/bpf.h | 3 +
>>> kernel/bpf/Makefile | 2 +-
>>> kernel/bpf/bloom_filter.c | 171 +++++++++++++++++++++++++++++++++
>>> kernel/bpf/syscall.c | 20 +++-
>>> kernel/bpf/verifier.c | 19 +++-
>>> tools/include/uapi/linux/bpf.h | 3 +
>>> 8 files changed, 214 insertions(+), 8 deletions(-)
>>> create mode 100644 kernel/bpf/bloom_filter.c
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index f4c16f19f83e..2abaa1052096 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -181,7 +181,8 @@ struct bpf_map {
>>> u32 btf_vmlinux_value_type_id;
>>> bool bypass_spec_v1;
>>> bool frozen; /* write-once; write-protected by freeze_mutex */
>>> - /* 22 bytes hole */
>>> + u32 nr_hashes; /* used for bloom filter maps */
>>> + /* 18 bytes hole */
>>>
>>> /* The 3rd and 4th cacheline with misc members to avoid false sharing
>>> * particularly with refcounting.
>>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>>> index 9c81724e4b98..c4424ac2fa02 100644
>>> --- a/include/linux/bpf_types.h
>>> +++ b/include/linux/bpf_types.h
>>> @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>>> BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>>> #endif
>>> BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>>> +BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>>>
>>> BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>>> BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 791f31dd0abe..c2acb0a510fe 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -906,6 +906,7 @@ enum bpf_map_type {
>>> BPF_MAP_TYPE_RINGBUF,
>>> BPF_MAP_TYPE_INODE_STORAGE,
>>> BPF_MAP_TYPE_TASK_STORAGE,
>>> + BPF_MAP_TYPE_BLOOM_FILTER,
>>> };
>>>
>>> /* Note that tracing related programs such as
>>> @@ -1274,6 +1275,7 @@ union bpf_attr {
>>> * struct stored as the
>>> * map value
>>> */
>>> + __u32 nr_hashes; /* used for configuring bloom filter maps */
>> This feels like a bit too one-off property that won't be ever reused
>> by any other type of map. Also consider that we should probably limit
>> nr_hashes to some pretty small sane value (<16? <64?) to prevent easy
>> DoS from inside BPF programs (e.g., set nr_hashes to 2bln, each
>> operation is now extremely slow and CPU intensive). So with that,
>> maybe let's provide number of hashes as part of map_flags? And as
>> Alexei proposed, zero would mean some recommended value (2 or 3,
>> right?). This would also mean that libbpf won't need to know about
>> one-off map property in parsing BTF map definitions.
I think we can limit nr_hashes to 10, since 10 hashes gives a false
positive rate of roughly 0%.
>>> };
>>>
>>> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
>>> @@ -5594,6 +5596,7 @@ struct bpf_map_info {
>>> __u32 btf_id;
>>> __u32 btf_key_type_id;
>>> __u32 btf_value_type_id;
>>> + __u32 nr_hashes; /* used for bloom filter maps */
>>> } __attribute__((aligned(8)));
>>>
>>> struct bpf_btf_info {
>>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>>> index 7f33098ca63f..cf6ca339f3cd 100644
>>> --- a/kernel/bpf/Makefile
>>> +++ b/kernel/bpf/Makefile
>>> @@ -7,7 +7,7 @@ endif
>>> CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>>>
>>> obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
>>> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>>> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
>>> obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>>> obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>>> obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
>>> diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
>>> new file mode 100644
>>> index 000000000000..3ae799ab3747
>>> --- /dev/null
>>> +++ b/kernel/bpf/bloom_filter.c
>>> @@ -0,0 +1,171 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/* Copyright (c) 2021 Facebook */
>>> +
>>> +#include <linux/bitmap.h>
>>> +#include <linux/bpf.h>
>>> +#include <linux/err.h>
>>> +#include <linux/jhash.h>
>>> +#include <linux/random.h>
>>> +#include <linux/spinlock.h>
>>> +
>>> +#define BLOOM_FILTER_CREATE_FLAG_MASK \
>>> + (BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK)
>>> +
>>> +struct bpf_bloom_filter {
>>> + struct bpf_map map;
>>> + u32 bit_array_mask;
>>> + u32 hash_seed;
>>> + /* Used for synchronizing parallel writes to the bit array */
>>> + spinlock_t spinlock;
>>> + unsigned long bit_array[];
>>> +};
>>> +
>>> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
>>> +{
>>> + struct bpf_bloom_filter *bloom_filter =
>>> + container_of(map, struct bpf_bloom_filter, map);
>>> + u32 i, hash;
>>> +
>>> + for (i = 0; i < bloom_filter->map.nr_hashes; i++) {
>>> + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) &
>>> + bloom_filter->bit_array_mask;
>>> + if (!test_bit(hash, bloom_filter->bit_array))
>>> + return -ENOENT;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr)
>>> +{
>>> + int numa_node = bpf_map_attr_numa_node(attr);
>>> + u32 nr_bits, bit_array_bytes, bit_array_mask;
>>> + struct bpf_bloom_filter *bloom_filter;
>>> +
>>> + if (!bpf_capable())
>>> + return ERR_PTR(-EPERM);
>>> +
>>> + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 ||
>>> + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK ||
>>> + !bpf_map_flags_access_ok(attr->map_flags))
>>> + return ERR_PTR(-EINVAL);
>>> +
>>> + /* For the bloom filter, the optimal bit array size that minimizes the
>>> + * false positive probability is n * k / ln(2) where n is the number of
>>> + * expected entries in the bloom filter and k is the number of hash
>>> + * functions. We use 7 / 5 to approximate 1 / ln(2).
>>> + *
>>> + * We round this up to the nearest power of two to enable more efficient
>>> + * hashing using bitmasks. The bitmask will be the bit array size - 1.
>>> + *
>>> + * If this overflows a u32, the bit array size will have 2^32 (4
>>> + * GB) bits.
> Would it be better to return E2BIG or EINVAL here? Speculating a bit, but if I were
> a user I might want to know that the number of bits I pushed down is not the actual
> number?
I think if we return E2BIG or EINVAL here, this will fail to create the
bloom filter map if max_entries exceeds some limit (~3 billion, according
to my math), whereas automatically setting the bit array size to 2^32 if
max_entries is extraordinarily large will still allow the user to create
and use a bloom filter (albeit one with a higher false positive rate).
> Another thought, would it be simpler to let user do this calculation and just let
> max_elements be number of bits they want? Then we could have examples with the
> above comment. Just a thought...
I like Martin's idea of keeping the max_entries meaning consistent
across all map types.
I think that makes the interface clearer for users.
>>> + */
>>> + if (unlikely(check_mul_overflow(attr->max_entries, attr->nr_hashes, &nr_bits)) ||
>>> + unlikely(check_mul_overflow(nr_bits / 5, (u32)7, &nr_bits)) ||
>>> + unlikely(nr_bits > (1UL << 31))) {
>> nit: map_alloc is not performance-critical (because it's infrequent),
>> so unlikely() are probably unnecessary?
>>
>>> + /* The bit array size is 2^32 bits but to avoid overflowing the
>>> + * u32, we use BITS_TO_BYTES(U32_MAX), which will round up to the
>>> + * equivalent number of bytes
>>> + */
>>> + bit_array_bytes = BITS_TO_BYTES(U32_MAX);
>>> + bit_array_mask = U32_MAX;
>>> + } else {
>>> + if (nr_bits <= BITS_PER_LONG)
>>> + nr_bits = BITS_PER_LONG;
>>> + else
>>> + nr_bits = roundup_pow_of_two(nr_bits);
>>> + bit_array_bytes = BITS_TO_BYTES(nr_bits);
>>> + bit_array_mask = nr_bits - 1;
>>> + }
>>> +
>>> + bit_array_bytes = roundup(bit_array_bytes, sizeof(unsigned long));
>>> + bloom_filter = bpf_map_area_alloc(sizeof(*bloom_filter) + bit_array_bytes,
>>> + numa_node);
>>> +
>>> + if (!bloom_filter)
>>> + return ERR_PTR(-ENOMEM);
>>> +
>>> + bpf_map_init_from_attr(&bloom_filter->map, attr);
>>> + bloom_filter->map.nr_hashes = attr->nr_hashes;
>>> +
>>> + bloom_filter->bit_array_mask = bit_array_mask;
>>> + spin_lock_init(&bloom_filter->spinlock);
>>> +
>>> + if (!(attr->map_flags & BPF_F_ZERO_SEED))
>>> + bloom_filter->hash_seed = get_random_int();
>>> +
>>> + return &bloom_filter->map;
>>> +}
>>> +
>>> +static void bloom_filter_map_free(struct bpf_map *map)
>>> +{
>>> + struct bpf_bloom_filter *bloom_filter =
>>> + container_of(map, struct bpf_bloom_filter, map);
>>> +
>>> + bpf_map_area_free(bloom_filter);
>>> +}
>>> +
>>> +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value,
>>> + u64 flags)
>>> +{
>>> + struct bpf_bloom_filter *bloom_filter =
>>> + container_of(map, struct bpf_bloom_filter, map);
>>> + unsigned long spinlock_flags;
>>> + u32 i, hash;
>>> +
>>> + if (flags != BPF_ANY)
>>> + return -EINVAL;
>>> +
>>> + spin_lock_irqsave(&bloom_filter->spinlock, spinlock_flags);
>>> +
>> If value_size is pretty big, hashing might take a noticeable amount of
>> CPU, during which we'll be keeping spinlock. With what I said above
>> about sane number of hashes, if we bound it to small reasonable number
>> (e.g., 16), we can have a local 16-element array with hashes
>> calculated before we take lock. That way spinlock will be held only
>> for few bit flips.
> +1. Anyway, we are inside an RCU section here and the map shouldn't
> disappear without a grace period, so we can READ_ONCE the seed, right?
> Or are we thinking about sleepable programs here?
>
>> Also, I wonder if ditching spinlock in favor of atomic bit set
>> operation would improve performance in typical scenarios. Seems like
>> set_bit() is an atomic operation, so it should be easy to test. Do you
>> mind running benchmarks with spinlock and with set_bit()?
> With the jhash pulled out of the lock, I think it might be noticeable. Curious
> to see.
Awesome, I will test this out and report back!
>>> + for (i = 0; i < bloom_filter->map.nr_hashes; i++) {
>>> + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) &
>>> + bloom_filter->bit_array_mask;
>>> + bitmap_set(bloom_filter->bit_array, hash, 1);
>>> + }
>>> +
>>> + spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags);
>>> +
>>> + return 0;
>>> +}
>>> +
>> [...]
>