bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Joanne Koong <joannekoong@fb.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	bpf <bpf@vger.kernel.org>, Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH v2 bpf-next 1/4] bpf: Add bloom filter map implementation
Date: Mon, 20 Sep 2021 16:21:09 -0700	[thread overview]
Message-ID: <CAEf4BzYRZM7Oi_CY=trjvefkavt5dwfF-Zu2GKihhfpeopGAnw@mail.gmail.com> (raw)
In-Reply-To: <17d7b319-01d0-163e-57b6-c385d38cc9ad@fb.com>

On Mon, Sep 20, 2021 at 3:52 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> My previous email replied to Alexei's email before I saw Andrii's new
> email, so please
> feel free to disregard my previous email.

Never got that reply of yours. Alexei's email arrived a few hours
after I've already replied to you. It was a time warp anomaly :)

>
> On 9/20/21 1:58 PM, Andrii Nakryiko wrote:
>
> > On Fri, Sep 17, 2021 at 6:08 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
> >>> +
> >>> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
> >>> + * The maximum number of hash functions supported is 15. If this is not set,
> >>> + * the default number of hash functions used will be 5.
> >>> + */
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
> >> The bit selection is unintuitive.
> >> Since key_size has to be zero may be used that instead to indicate the number of hash
> >> functions in the rare case when 5 is not good enough?
> > Hm... I was initially thinking about proposing something like that,
> > but it felt a bit ugly at the time. But now thinking about this a bit
> > more, I think this would be a bit more meaningful if we change the
> > terminology a bit. Instead of saying that Bloom filter has values and
> > no keys, we actually have keys and no values. So all those bytes that
> > are hashed are treated as keys (which is actually how sets are
> > implemented on top of maps, where you have keys and no values, or at
> > least the value is always true).
> >
> > So with that we'll have key/key_size to specify number of bytes that
> > needs to be hashed (and it's type info). And then we can squint a bit
> > and say that number of hashes are specified by value_size, as in
> > values are those nr_hash bits that we set in Bloom filter.
> >
> > Still a bit of terminology stretch, but won't necessitate those
> > specialized fields just for Bloom filter map. But if default value is
> > going to be good enough for most cases and most cases won't need to
> > adjust number of hashes, this seems to be pretty clean to me.
>
> With having bloom filter map keys instead of values,  I think this would
> lead to messier code in the kernel for handling map_lookup_elem
> and map_update_elem calls, due to the fact that the bloom filter map
> is a non-associative map and the current APIs for non-associative map types
> (peek_elem/push_elem/pop_elem) all have the map data as the value and
> not the key.
>
> For example, for map_update_elem, the API from the eBPF program side is
>
> int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64
> flags);
>
> This would require us to either
>
> a) Add some custom logic in syscall.c so that we bypass the
> copy_from_bpfptr call on
> bloom filter map values (necessary because memcpying 0 bytes still
> requires the src pointer
> to be valid), which would allow us to pass in a NULL value
>
> b) Add a new function like
>
> int (*map_push_key)(struct bpf_map *map, void *key, u64 flags)
>
> that eBPF programs can call instead of map_update_elem.
>
> or
>
> c) Try to repurpose the existing map_push_elem API by passing in the key
> instead of the value,
> which would lead to inconsistent use of the API
>
> I think if we could change the non-associative map types (currently only
> stack maps and queue
> maps, I believe) to have their data be a key instead of a value, and
> have the pop/peek APIs use
> keys instead of values, then this would be cleaner, since we could then
> just use the existing peek/pop
> APIs.

I don't think we can change existing map APIs anymore, unfortunately.

>
> >> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
> >> It could be a union:
> >>      __u32   max_entries;    /* max number of entries in a map */
> >>      __u32   map_flags;      /* BPF_MAP_CREATE related
> >>                               * flags defined above.
> >>                               */
> >>      union {
> >>         __u32  inner_map_fd;   /* fd pointing to the inner map */
> >>         __u32  nr_hash_funcs;  /* or number of hash functions */
> >>      };
> > This works as well. A bit more Bloom filter-only terminology
> > throughout UAPI and libbpf, but I'd be fine with that as well.
> >
> Great, it looks like this is the consensus - I will go with this option
> for v3!
> >>      __u32   numa_node;      /* numa node */
> >>
> >>> +struct bpf_bloom_filter {
> >>> +     struct bpf_map map;
> >>> +     u32 bit_array_mask;
> >>> +     u32 hash_seed;
> >>> +     /* If the size of the values in the bloom filter is u32 aligned,
> >>> +      * then it is more performant to use jhash2 as the underlying hash
> >>> +      * function, else we use jhash. This tracks the number of u32s
> >>> +      * in an u32-aligned value size. If the value size is not u32 aligned,
> >>> +      * this will be 0.
> >>> +      */
> >>> +     u32 aligned_u32_count;
> >> what is the performance difference?
> >> May be we enforce 4-byte sized value for simplicity?
> > This might be a bit too surprising, especially if keys are just some
> > strings, where people might not expect that it has to 4-byte multiple
> > size. And debugging this without extra tooling (like retsnoop) is
> > going to be nightmarish.
> >
> > If the performance diff is huge and that if/else logic is
> > unacceptable, we can also internally pad with up to 3 zero bytes and
> > include those into the hash.
> I think the if/else logic is unavoidable if we support non 4-byte
> aligned value sizes,
> unless we are okay with truncating any remainder bytes of non 4-byte
> aligned values
> and stipulating that a bloom filter map value size has to be greater
> than 4 bytes (these
> conditions would allow us to use jhash2 for every value without an
> if/else check). If we
> internally pad, we will have to pad on every update and lookup, which
> would also
> require an if/else.
> Thanks for the comments and reviews, Alexei and Andrii. They are much
> appreciated!

I don't think truncation is an option. And I also forgot that we don't
really store values, so there is nothing to pad, really. So yeah, I'd
keep it as is, especially if that is not expensive (which I assume
it's not). As I mentioned before, that logic can be encapsulated in a
dedicated helper function and reused in a few places.

  reply	other threads:[~2021-09-21  1:59 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-14  4:04 [PATCH v2 bpf-next 0/4] Implement bloom filter map Joanne Koong
2021-09-14  4:04 ` [PATCH v2 bpf-next 1/4] bpf: Add bloom filter map implementation Joanne Koong
2021-09-17 17:01   ` Alexei Starovoitov
2021-09-20 20:58     ` Andrii Nakryiko
2021-09-20 22:52       ` Joanne Koong
2021-09-20 23:21         ` Andrii Nakryiko [this message]
2021-09-20 21:03     ` Joanne Koong
2021-09-17 21:48   ` Andrii Nakryiko
2021-09-14  4:04 ` [PATCH v2 bpf-next 2/4] selftests/bpf: Add bloom filter map test cases Joanne Koong
2021-09-14  4:04 ` [PATCH v2 bpf-next 3/4] bpf/benchs: Add benchmark test for bloom filter maps Joanne Koong
2021-09-14  4:04 ` [PATCH v2 bpf-next 4/4] bpf/benchs: Add benchmarks for comparing hashmap lookups with vs. without bloom filter Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAEf4BzYRZM7Oi_CY=trjvefkavt5dwfF-Zu2GKihhfpeopGAnw@mail.gmail.com' \
    --to=andrii.nakryiko@gmail.com \
    --cc=Kernel-team@fb.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=joannekoong@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).