From: Stefano Brivio <sbrivio@redhat.com>
To: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org,
"Florian Westphal" <fw@strlen.de>,
"Kadlecsik József" <kadlec@blackhole.kfki.hu>,
"Eric Garver" <eric@garver.life>, "Phil Sutter" <phil@nwl.cc>
Subject: Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
Date: Sat, 15 Feb 2020 00:06:01 +0100 [thread overview]
Message-ID: <20200215000601.12ab6626@elisabeth> (raw)
In-Reply-To: <20200214204225.dh3ubs67vorh2ail@salvia>
On Fri, 14 Feb 2020 21:42:25 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Feb 14, 2020 at 08:42:13PM +0100, Stefano Brivio wrote:
> > On Fri, 14 Feb 2020 19:16:34 +0100
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> [...]
> > > You refer to a property that says that you can split a range into a
> > > 2*n netmasks IIRC. Do you know what is the worst case when splitting
> > > ranges?
> >
> > I'm not sure I got your question: that is exactly the worst case, i.e.
> > we can have _up to_ 2 * n netmasks (hence rules) given a range of n
> > bits. There's an additional upper bound on this, given by the address
> > space, but single fields in a concatenation can overlap.
> >
> > For example, we can have up to 128 rules for an IPv6 range where at
> > least 64 bits differ between the endpoints, and which would contain
> > 2 ^ 64 addresses. Or, say, the IPv4 range 1.2.3.4 - 255.255.0.2 is
> > expressed by 42 rules.
> >
> > By the way, 0.0.0.1 - 255.255.255.254 takes 62 rules, so we can
> > *probably* say it's 2 * n - 2, but I don't have a formal proof for that.
>
> By "splitting" I was actually referring to "expanding", so you're
> replying here to my worst-case range-to-rules expansion question.
>
> > I have a couple of ways in mind to get that down to n / 2, but it's not
> > straightforward and it will take me some time (assuming it makes
> > sense). For the n bound, we can introduce negations (proof in
> > literature), and I have some kind of ugly prototype. For the n / 2
> > bound, I'd need some auxiliary data structure to keep insertion
> > invertible.
>
> OK, so there is room to improve the "rule expansion" logic. I didn't
> spend much time on that front yet.
I wrote some (barely understandable if at all) notes here:
https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n271
but I can't guarantee at this stage that it can all be implemented, or
not necessarily in that way. Should you ever (try to) read it, take it
as a rough, buggy draft.
> > In practice, the "average" case is much less, but to define it we would
> > first need to agree on what are the actual components of the
> > multivariate distribution... size and start? Is it a Poisson
> > distribution then? After spending some time on this and disagreeing
> > with myself I'd shyly recommend to skip the topic. :)
>
> Yes, I agree, sticking to something relatively simple and good is just
> fine.
>
> > > There is no ipset set like this, but I agree a use case might come up.
> >
> > Actually, for ipset, a "net,port,net,port" type was proposed
> > (netfilter-devel <20181216213039.399-1-oliver@uptheinter.net>), but when
> > József enquired about the intended use case, none was given. So maybe
> > this whole "net,net,port,mac" story makes even less sense.
>
> Would it make sense to you to restrict pipapo to 3 fields until there
> is someone with a use case for this?
I wouldn't. I'd rather implement the dynamic switch of bucket size if
this is the alternative. I think the current version is rather generic,
doesn't use too much memory, and is also reasonably tested, so I don't
feel like taking working functionality away.
> [...]
> > > The per-cpu scratch index is only required if we cannot fit in the
> > > "result bitmap" into the stack, right?
> >
> > Right.
> >
> > > Probably up to 256 bytes result bitmap in the stack is reasonable?
> > > That makes 8192 pipapo rules. There will be no need to disable bh and
> > > make use of the percpu scratchpad area in that case.
> >
> > Right -- the question is whether that would mean yet another
> > implementation for the lookup function.
>
> This would need another lookup function that can be selected from
> control plane path. The set size and the range-to-rule expansion
> worst-case can tell us if it would fit into the stack. It would be
> just one extra lookup function for this case, ~80-100 LOC.
Ah, you're right, it's not much. On the other hand, the advantages seem
to be:
- dropping the local_bh_{disable,enable}() pair: I can't find numbers
  right now, but the difference in pps between v2 (when Florian pointed
  out it was needed) and v1 wasn't statistically significant
- avoiding the access to per-cpu allocated areas: here I have
  before/after numbers just for net,mac with 1000 entries on the AVX2
  implementation (AMD 7351, one thread), and it didn't look significant:
  3.99Mpps instead of 3.90Mpps. Maybe it would make a difference when
  tested on more threads, though (as a reference, same CPU, 10 threads,
  as of v2: 38.02Mpps), or with a smaller L2
- memory usage (and how that affects locality on some archs), but if we
  do this for cases with small numbers of entries, it's probably not a
  big difference either
Anyway, I guess it's worth a try -- I'll try to find some time for it.
> > > If adjusting the code to deal with variable length "pipapo word" size
> > > is not too convoluted, then you could just deal with the variable word
> > > size from the insert / delete / get (slow) path and register one
> > > lookup function for the version that is optimized for this pipapo word
> > > size.
> >
> > Yes, I like this a lot -- we would also need one function to rebuild
> > tables when the word size changes, but that sounds almost trivial.
> > Changes for the slow path are actually rather simple.
> >
> > Still, I'm starting to doubt quite heavily that my original worst
> > case is reasonable. If we stick to the one you mentioned, or even
> > something in between, it makes no sense to keep 4-bit buckets.
>
> OK, then moving to 8-bits will probably remove a bit of code which is
> dealing with "nibbles".
Right now I have:
1 file changed, 16 insertions(+), 37 deletions(-)
but okay, I'll give the dynamic switch a try, first.
> > By the way, I went ahead and tried the 8-bit bucket version of the C
> > implementation only, on my usual x86_64 box (one thread, AMD Epyc 7351).
> > I think it's worth it:
> >
> >                         4-bit        8-bit
> > net,port
> >   1000 entries     2304165pps   2901299pps
> > port,net
> >   100 entries      4131471pps   4751247pps
> > net6,port
> >   1000 entries     1092557pps   1651037pps
> > port,proto
> >   30000 entries     284147pps    449665pps
> > net6,port,mac
> >   10 entries       2082880pps   2762291pps
> > net6,port,mac,proto
> >   1000 entries      783810pps   1195823pps
> > net,mac
> >   1000 entries     1279122pps   1934003pps
>
> Assuming the same concatenation type, larger bucket size makes pps
> drop in the C implementation?
Uh, no, they go up with larger buckets; in the figures above, that's
+26%, +15%, +51%, +58%, +33%, +53%, +51%.
> > I would now proceed by extending this to the AVX2 implementation and
> > (once I finish it) to the NEON one; I actually expect bigger gains there.
>
> Good. BTW, probably you can add a new NFT_SET_CLASS_JIT class that
> comes before NFT_SET_CLASS_O_1, to make the routine that selects the
> set backend pick the jit version instead.
It's not exactly JIT though :) it's hand-written assembly. It's coming
up a bit nicer for NEON as I can use compiler intrinsics there.
Maybe I didn't understand your point, but I don't think this is needed.
In 9/9 of v4 I simply had, in nft_pipapo_avx2_estimate():
	if (!boot_cpu_has(X86_FEATURE_AVX2) || !boot_cpu_has(X86_FEATURE_AVX))
		return false;
> > > Probably adding helper function to deal with pipapo words would help
> > > to prepare for such update in the future. There is the ->estimate
> > > function that allows to calculate for the best word size depending on
> > > all the information this gets from the set definition.
> >
> > Hm, I really think it should be kind of painless to make this dynamic
> > on insertion/deletion.
>
> OK, good. How would you like to proceed?
Let me give the dynamic bucket size switching a shot. If that looks
reasonable, I'd proceed by extending it to the AVX2 implementation,
finish the NEON implementation including it right away, and prepare a
series. If it doesn't, I'll report back :)
--
Stefano