From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 51777C433EF for ; Tue, 21 Sep 2021 01:59:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 35C8361175 for ; Tue, 21 Sep 2021 01:59:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240737AbhIUCAx (ORCPT ); Mon, 20 Sep 2021 22:00:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32928 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237156AbhIUBwr (ORCPT ); Mon, 20 Sep 2021 21:52:47 -0400 Received: from mail-qk1-x731.google.com (mail-qk1-x731.google.com [IPv6:2607:f8b0:4864:20::731]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DC3AAC0386FC for ; Mon, 20 Sep 2021 16:21:21 -0700 (PDT) Received: by mail-qk1-x731.google.com with SMTP id 73so43472023qki.4 for ; Mon, 20 Sep 2021 16:21:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=znh9ggZizMwDhKAV9Ni97xVIQID9gP6YfoS3LZVvffg=; b=N8ZSpG7C6mN0SQeR7680td5NGAFslnD4wwN1QolFW95kOU8f01aiVDII1nd96YX0zq tYEjYTSEJ69SGgfilLKPpOyt3hXuZkdSr64beccuZjyCmSlymxCAQAJYB2FZEGIJvcKW E8W6p1KVv9DIc0xoYDs5/5cT6YmnWNSj6NEVr57t9Ab8ULlVDvRVnRvUlSUVuSlFOfGo /A17gnCaG8Mm/oSu5/UhTPQg9hNrL2i4vnXzlyXKdCoYzUn4Jw2SnvGH4oKACYAPBTzA 6qny694uMuIWjVF8+pyYtiZUdqnJ7l/bM2cc1JUs7nP/Ycxq/LwTn0m9LJHdQ+uWFull Gm5w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=znh9ggZizMwDhKAV9Ni97xVIQID9gP6YfoS3LZVvffg=; b=bXhE6JEcGPDR77YZO+52D+VJ+U+xtyFQUJ6UldZidKSlj6S3DNv6ScHANdwWQlUexp /FaPKKlaDdn+p2UdPRwuBQbBAHmt1ALgrLA2qOkpdXPrjMQkbqt3052bqJzA+RBlEbd0 vBtnlN/4VUjoy/lzPjJ1TU42Qr835KdLLxYQbqUB75Be0vyXGxFcbz2ArxvaNbXQ2O95 sJgV++F5HfiiDXCmYKtAoOhNEp5FfljAbwMZWOXOfHHeMXlZsRxCtg9H+Gm9qvmrpXoq fOwJ1ebLuVE0ygIWxL/uBYfrIVi3mgbGnX6yOON5u7R93/aK2scZxtcvLyvtChtxUdKp NsxA== X-Gm-Message-State: AOAM530gQ0WAZUs2SPrRCSGcY/moBsVBHcsNUIPqFuiCn8ryPP5ge9m8 S5iS2z1Xs1Z/RrfDmVBcct9biDO9LqSuResFxGw= X-Google-Smtp-Source: ABdhPJyVoVQBXjUMvWz68EcBWd22uXQ3Tvd7Y1AoPTM616Y5BbvZ0Xo5IDxlKWuYzMIQ8VLDPZe3peiKekUOdyxEKI4= X-Received: by 2002:a25:47c4:: with SMTP id u187mr37048023yba.225.1632180080962; Mon, 20 Sep 2021 16:21:20 -0700 (PDT) MIME-Version: 1.0 References: <20210914040433.3184308-1-joannekoong@fb.com> <20210914040433.3184308-2-joannekoong@fb.com> <20210917170130.njmm3dm65ftd76vo@ast-mbp> <17d7b319-01d0-163e-57b6-c385d38cc9ad@fb.com> In-Reply-To: <17d7b319-01d0-163e-57b6-c385d38cc9ad@fb.com> From: Andrii Nakryiko Date: Mon, 20 Sep 2021 16:21:09 -0700 Message-ID: Subject: Re: [PATCH v2 bpf-next 1/4] bpf: Add bloom filter map implementation To: Joanne Koong Cc: Alexei Starovoitov , bpf , Kernel Team Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On Mon, Sep 20, 2021 at 3:52 PM Joanne Koong wrote: > > My previous email replied to Alexei's email before I saw Andrii's new > email, so please > feel free to disregard my previous email. Never got that reply of yours. Alexei's email arrived a few hours after I've already replied to you. It was a time warp anomaly :) > > On 9/20/21 1:58 PM, Andrii Nakryiko wrote: > > > On Fri, Sep 17, 2021 at 6:08 PM Alexei Starovoitov > > wrote: > >> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote: > >>> + > >>> +/* For bloom filter maps, the next 4 bits represent how many hashes to use. > >>> + * The maximum number of hash functions supported is 15. If this is not set, > >>> + * the default number of hash functions used will be 5. > >>> + */ > >>> + BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13), > >>> + BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14), > >>> + BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15), > >>> + BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16), > >> The bit selection is unintuitive. > >> Since key_size has to be zero may be used that instead to indicate the number of hash > >> functions in the rare case when 5 is not good enough? > > Hm... I was initially thinking about proposing something like that, > > but it felt a bit ugly at the time. But now thinking about this a bit > > more, I think this would be a bit more meaningful if we change the > > terminology a bit. Instead of saying that Bloom filter has values and > > no keys, we actually have keys and no values. So all those bytes that > > are hashed are treated as keys (which is actually how sets are > > implemented on top of maps, where you have keys and no values, or at > > least the value is always true). > > > > So with that we'll have key/key_size to specify number of bytes that > > needs to be hashed (and it's type info). And then we can squint a bit > > and say that number of hashes are specified by value_size, as in > > values are those nr_hash bits that we set in Bloom filter. > > > > Still a bit of terminology stretch, but won't necessitate those > > specialized fields just for Bloom filter map. But if default value is > > going to be good enough for most cases and most cases won't need to > > adjust number of hashes, this seems to be pretty clean to me. > > With having bloom filter map keys instead of values, I think this would > lead to messier code in the kernel for handling map_lookup_elem > and map_update_elem calls, due to the fact that the bloom filter map > is a non-associative map and the current APIs for non-associative map types > (peek_elem/push_elem/pop_elem) all have the map data as the value and > not the key. > > For example, for map_update_elem, the API from the eBPF program side is > > int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 > flags); > > This would require us to either > > a) Add some custom logic in syscall.c so that we bypass the > copy_from_bpfptr call on > bloom filter map values (necessary because memcpying 0 bytes still > requires the src pointer > to be valid), which would allow us to pass in a NULL value > > b) Add a new function like > > int (*map_push_key)(struct bpf_map *map, void *key, u64 flags) > > that eBPF programs can call instead of map_update_elem. > > or > > c) Try to repurpose the existing map_push_elem API by passing in the key > instead of the value, > which would lead to inconsistent use of the API > > I think if we could change the non-associative map types (currently only > stack maps and queue > maps, I believe) to have their data be a key instead of a value, and > have the pop/peek APIs use > keys instead of values, then this would be cleaner, since we could then > just use the existing peek/pop > APIs. I don't think we can change existing map APIs anymore, unfortunately. > > >> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter. > >> It could be a union: > >> __u32 max_entries; /* max number of entries in a map */ > >> __u32 map_flags; /* BPF_MAP_CREATE related > >> * flags defined above. > >> */ > >> union { > >> __u32 inner_map_fd; /* fd pointing to the inner map */ > >> __u32 nr_hash_funcs; /* or number of hash functions */ > >> }; > > This works as well. A bit more Bloom filter-only terminology > > throughout UAPI and libbpf, but I'd be fine with that as well. > > > Great, it looks like this is the consensus - I will go with this option > for v3! > >> __u32 numa_node; /* numa node */ > >> > >>> +struct bpf_bloom_filter { > >>> + struct bpf_map map; > >>> + u32 bit_array_mask; > >>> + u32 hash_seed; > >>> + /* If the size of the values in the bloom filter is u32 aligned, > >>> + * then it is more performant to use jhash2 as the underlying hash > >>> + * function, else we use jhash. This tracks the number of u32s > >>> + * in an u32-aligned value size. If the value size is not u32 aligned, > >>> + * this will be 0. > >>> + */ > >>> + u32 aligned_u32_count; > >> what is the performance difference? > >> May be we enforce 4-byte sized value for simplicity? > > This might be a bit too surprising, especially if keys are just some > > strings, where people might not expect that it has to 4-byte multiple > > size. And debugging this without extra tooling (like retsnoop) is > > going to be nightmarish. > > > > If the performance diff is huge and that if/else logic is > > unacceptable, we can also internally pad with up to 3 zero bytes and > > include those into the hash. > I think the if/else logic is unavoidable if we support non 4-byte > aligned value sizes, > unless we are okay with truncating any remainder bytes of non 4-byte > aligned values > and stipulating that a bloom filter map value size has to be greater > than 4 bytes (these > conditions would allow us to use jhash2 for every value without an > if/else check). If we > internally pad, we will have to pad on every update and lookup, which > would also > require an if/else. > Thanks for the comments and reviews, Alexei and Andrii. They are much > appreciated! I don't think truncation is an option. And I also forgot that we don't really store values, so there is nothing to pad, really. So yeah, I'd keep it as is, especially if that is not expensive (which I assume it's not). As I mentioned before, that logic can be encapsulated in a dedicated helper function and reused in a few places.