Re: [PATCH bpf-next 1/5] bpf: Add bloom filter map implementation

From: Martin KaFai Lau <kafai@fb.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Joanne Koong <joannekoong@fb.com>, bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next 1/5] bpf: Add bloom filter map implementation
Date: Wed, 1 Sep 2021 23:16:43 -0700	[thread overview]
Message-ID: <20210902061643.h2o67abkwn26d66a@kafai-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <61305cf822fa_439b208a5@john-XPS-13-9370.notmuch>

On Wed, Sep 01, 2021 at 10:11:20PM -0700, John Fastabend wrote:
[ ... ]

> > > +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr)
> > > +{
> > > +       int numa_node = bpf_map_attr_numa_node(attr);
> > > +       u32 nr_bits, bit_array_bytes, bit_array_mask;
> > > +       struct bpf_bloom_filter *bloom_filter;
> > > +
> > > +       if (!bpf_capable())
> > > +               return ERR_PTR(-EPERM);
> > > +
> > > +       if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 ||
> > > +           attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK ||
> > > +           !bpf_map_flags_access_ok(attr->map_flags))
> > > +               return ERR_PTR(-EINVAL);
> > > +
> > > +       /* For the bloom filter, the optimal bit array size that minimizes the
> > > +        * false positive probability is n * k / ln(2) where n is the number of
> > > +        * expected entries in the bloom filter and k is the number of hash
> > > +        * functions. We use 7 / 5 to approximate 1 / ln(2).
> > > +        *
> > > +        * We round this up to the nearest power of two to enable more efficient
> > > +        * hashing using bitmasks. The bitmask will be the bit array size - 1.
> > > +        *
> > > +        * If this overflows a u32, the bit array size will have 2^32 (4
> > > +        * GB) bits.
> 
> Would it be better to return E2BIG or EINVAL here? Speculating a bit, but if I was
> a user I might want to know that the number of bits I pushed down is not the actual
> number?
> 
> Another thought, would it be simpler to let user do this calculation and just let
> max_elements be number of bits they want? Then we could have examples with the
> above comment. Just a thought...
Instead of having user second guessing on what max_entries means
for a particular map, I think it is better to keep max_entries
meaning as consistent as possible and let the kernel figure out
the correct nr_bits to use.

[ ... ]

> > > +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value,
> > > +                                     u64 flags)
> > > +{
> > > +       struct bpf_bloom_filter *bloom_filter =
> > > +               container_of(map, struct bpf_bloom_filter, map);
> > > +       unsigned long spinlock_flags;
> > > +       u32 i, hash;
> > > +
> > > +       if (flags != BPF_ANY)
> > > +               return -EINVAL;
> > > +
> > > +       spin_lock_irqsave(&bloom_filter->spinlock, spinlock_flags);
> > > +
> > 
> > If value_size is pretty big, hashing might take a noticeable amount of
> > CPU, during which we'll be keeping spinlock. With what I said above
Good catch on big value_size.

> > about sane number of hashes, if we bound it to small reasonable number
> > (e.g., 16), we can have a local 16-element array with hashes
> > calculated before we take lock. That way spinlock will be held only
> > for few bit flips.
> 
> +1. Anyways we are inside a RCU section here and the map shouldn't
> disapper without a grace period so we can READ_ONCE the seed right?
> Or are we thinking about sleepable programs here?
> 
> > Also, I wonder if ditching spinlock in favor of atomic bit set
> > operation would improve performance in typical scenarios. Seems like
> > set_bit() is an atomic operation, so it should be easy to test. Do you
> > mind running benchmarks with spinlock and with set_bit()?
The atomic set_bit() is a good idea.  Then no need to have a 16-element array
and keep thing simple.
It is in general useful to optimize the update/push path (e.g. I would like to
have the bloom-filter bench populating millions entries faster).   Our current
usecase is to have the userspace populates the map (e.g. a lot of suspicious
IP that we have already learned) at the very beginning and then very sparse
update after that.  The bpf prog will mostly only lookup/peek which I think
is a better optimization and benchmark target.

> 
> With the jhash pulled out of lock, I think it might be noticable. Curious
> to see.
> 
> >
> > > +       for (i = 0; i < bloom_filter->map.nr_hashes; i++) {
> > > +               hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) &
> > > +                       bloom_filter->bit_array_mask;
> > > +               bitmap_set(bloom_filter->bit_array, hash, 1);
> > > +       }
> > > +
> > > +       spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > 
> > [...]
> 
>