From: Stefano Brivio <sbrivio@redhat.com>
To: Linus Walleij <linus.walleij@linaro.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>,
	netfilter-devel@vger.kernel.org,
	"David S. Miller" <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
	Arnd Bergmann <arnd@arndb.de>
Subject: Re: [PATCH 19/29] nft_set_pipapo: Introduce AVX2-based lookup implementation
Date: Thu, 19 Mar 2020 12:16:50 +0100	[thread overview]
Message-ID: <20200319121650.44bf3c17@elisabeth> (raw)
In-Reply-To: <CACRpkdbK0dZ87beU8qPSHmRMxTWog-8WbiDQvM-ec06_hAjkoQ@mail.gmail.com>

Hi Linus,

On Thu, 19 Mar 2020 11:20:28 +0100
Linus Walleij <linus.walleij@linaro.org> wrote:

> Hi Pablo,
> 
> First: I really like this type of optimization. It's really cool to
> see this hardware being put to good use. So, for the record,
> I'm impressed with your work here.

Thanks! :)

> On Wed, Mar 18, 2020 at 1:40 AM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > +ifdef CONFIG_X86_64
> > +ifneq (,$(findstring -DCONFIG_AS_AVX2=1,$(KBUILD_CFLAGS)))
> > +nf_tables-objs += nft_set_pipapo_avx2.o
> > +endif
> > +endif  
> 
> So this is the first time I see x86-specific asm optimizations
> in the middle of nftables. That's pretty significant, so I think it
> should be pointed out in the commit message.

It didn't occur to me; you're right, sorry about that (this is already
in net-next).

> I have a question around this:
> 
> > +#define NFT_PIPAPO_LONGS_PER_M256      (XSAVE_YMM_SIZE / BITS_PER_LONG)
> > +
> > +/* Load from memory into YMM register with non-temporal hint ("stream load"),
> > + * that is, don't fetch lines from memory into the cache. This avoids pushing
> > + * precious packet data out of the cache hierarchy, and is appropriate when:
> > + *
> > + * - loading buckets from lookup tables, as they are not going to be used
> > + *   again before packets are entirely classified
> > + *
> > + * - loading the result bitmap from the previous field, as it's never used
> > + *   again
> > + */
> > +#define NFT_PIPAPO_AVX2_LOAD(reg, loc)                                 \
> > +       asm volatile("vmovntdqa %0, %%ymm" #reg : : "m" (loc))  
> 
> (...)
> 
> > +/* Bitwise AND: the staple operation of this algorithm */
> > +#define NFT_PIPAPO_AVX2_AND(dst, a, b)                                 \
> > +       asm volatile("vpand %ymm" #a ", %ymm" #b ", %ymm" #dst)
> > +
> > +/* Jump to label if @reg is zero */
> > +#define NFT_PIPAPO_AVX2_NOMATCH_GOTO(reg, label)                       \
> > +       asm_volatile_goto("vptest %%ymm" #reg ", %%ymm" #reg ";"        \
> > +                         "je %l[" #label "]" : : : : label)
> > +
> > +/* Store 256 bits from YMM register into memory. Contrary to bucket load
> > + * operation, we don't bypass the cache here, as stored matching results
> > + * are always used shortly after.
> > + */
> > +#define NFT_PIPAPO_AVX2_STORE(loc, reg)                                        \
> > +       asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc))
> > +
> > +/* Zero out a complete YMM register, @reg */
> > +#define NFT_PIPAPO_AVX2_ZERO(reg)                                      \
> > +       asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)  
> 
> The usual practice for this kind of asm optimization is to store it
> in the arch.
> 
> See for example
> arch/x86/include/asm/bitops.h
> arch/arm64/include/asm/bitrev.h
> which optimize a few bit operations with inline assembly.
> 
> The upside is that bitwise operations can be optimized per-arch
> depending on available arch instructions.

I spent some time trying to figure out where to fit this, and decided
instead to go the same way as RAID6 and some crypto implementations.

A reasonable threshold (and what appears to be the current practice for
the few examples we have) seems to be how specific to a subsystem an
implementation actually is. From that perspective, this looks to me
conceptually similar to the AVX2 (or NEON) RAID6 implementations.
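
Just to illustrate what "going the RAID6 way" means here: the SIMD code
stays in the subsystem, behind a runtime CPU feature check and the usual
FPU context handling. A rough, hypothetical sketch (the lookup function
names are made up, only the CPU feature and FPU helpers are the real x86
ones):

#include <linux/types.h>
#include <asm/cpufeature.h>
#include <asm/fpu/api.h>

bool example_lookup_scalar(const u8 *key);	/* hypothetical fallback */
bool example_lookup_avx2(const u8 *key);	/* hypothetical AVX2 path */

bool example_lookup(const u8 *key)
{
	bool ret;

	/* Fall back to the scalar implementation if AVX2 is missing or
	 * the FPU can't be used in this context.
	 */
	if (!boot_cpu_has(X86_FEATURE_AVX2) || !irq_fpu_usable())
		return example_lookup_scalar(key);

	kernel_fpu_begin();
	ret = example_lookup_avx2(key);
	kernel_fpu_end();

	return ret;
}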

> If other archs have similar instructions to AVX2 which can
> slot in and optimize the same code, it would make sense to
> move the assembly to the arch and define some new
> bitops for loading, storing, zeroing and bitwise AND, possibly even
> if restricted to 256-bit bitmaps.

I'm currently taking care of that for NEON, and while we'll have obvious
gains from using a vectorised bitwise AND (with different sizes), the
cost of the other operations involved (e.g. branching, or the "refill"
operation) is different, so I'll probably have to arrange the algorithm
steps differently, and use SIMD instructions that are fundamentally not
equivalent.
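
Just for illustration, a minimal sketch of the AND primitive itself with
NEON intrinsics (userspace-style, made-up function name; in-kernel code
would also need kernel_neon_begin()/kernel_neon_end() and would likely
look quite different): the 256-bit AND simply becomes two 128-bit
operations, it's the branching and refill steps around it that really
need rethinking.

#include <arm_neon.h>
#include <stdint.h>

/* AND two 256-bit operands using two 128-bit NEON registers */
static inline void and_256_neon(uint8_t *dst, const uint8_t *a,
				const uint8_t *b)
{
	uint8x16_t lo = vandq_u8(vld1q_u8(a),      vld1q_u8(b));
	uint8x16_t hi = vandq_u8(vld1q_u8(a + 16), vld1q_u8(b + 16));

	vst1q_u8(dst,      lo);
	vst1q_u8(dst + 16, hi);
}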

On top of that, some architectures are not super-scalar, and some are,
but in different ways. Another example: I'm using vmovntdqa here, but
for a generic 256-bit AND operation, vmovdqa (without the non-temporal
memory hint, that is, going through the cache) makes more sense in the
general case.
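
To make that concrete: the difference is just the load instruction, and
a generic 256-bit helper would presumably want the plain, cache-filling
form. A sketch in the same style as the macros above (the name is made
up):

/* Plain 256-bit load, going through the cache, which is what a generic
 * helper would likely use, as opposed to the vmovntdqa stream load above.
 */
#define GENERIC_AVX2_LOAD(reg, loc)					\
	asm volatile("vmovdqa %0, %%ymm" #reg : : "m" (loc))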

So, well, this implementation has to be way more specific (at least for
AVX2 and NEON) than just a random pile of AND operations. :) However,

> We have lib/bitmap.c; I can see that this library contains
> things such as:
> 
> int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
>                                 const unsigned long *bitmap2, unsigned int bits)
> 
> Which intuitively seems like something that could use
> these optimizations. It should be fine to augment the kernel
> to handle arch-specific optimizations of bitmap operations
> just like we do for setting bits or finding the first set bit
> in a bitmap etc. Today only bitops.h contains arch optimizations,
> but if needed surely we can expand on that?

...yes, absolutely, this makes a lot of sense; I've also been thinking
about this.

For instance, I use __bitmap_and() in the non-AVX2 implementation, and
that would benefit from generic vectorised operations on other
architectures (the AltiVec extensions are probably a good example). I
plan to eventually work on this; help would be greatly appreciated
(ARM/MIPS person here :)).

> So I would like to see an explanation of why we cannot take
> an extra step and make this code something that is entirely
> abstracted from x86 and will optimize any arch that can do
> 256-bit bitwise acceleration such as this.

I can add some specific comments if you think it makes sense, detailing
exactly what makes this special compared to a simple sequence of
vectorised 256-bit AND operations. The current comments probably give a
hint about that, but I haven't provided a detailed list there; I can add
one.

-- 
Stefano


Thread overview: 37+ messages
2020-03-18  0:39 [PATCH 00/29] Netfilter updates for net-next Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 01/29] netfilter: flowtable: Use nf_flow_offload_tuple for stats as well Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 02/29] netfilter: xtables: Add snapshot of hardidletimer target Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 03/29] netfilter: nft_tunnel: add support for geneve opts Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 04/29] netfilter: nf_tables: make sets built-in Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 05/29] netfilter: nf_tables: make all set structs const Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 06/29] netfilter: cleanup unused macro Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 07/29] netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' static Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 08/29] netfilter: Replace zero-length array with flexible-array member Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 09/29] netfilter: bitwise: use more descriptive variable-names Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 10/29] netfilter: xt_IDLETIMER: clean up some indenting Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 11/29] netfilter: flowtable: add nf_flow_table_block_offload_init() Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 12/29] netfilter: flowtable: add indr block setup support Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 13/29] netfilter: flowtable: add tunnel match offload support Pablo Neira Ayuso
2020-03-19 19:02   ` Edward Cree
2020-03-19 19:35     ` Pablo Neira Ayuso
2020-03-19 19:41       ` Edward Cree
2020-03-18  0:39 ` [PATCH 14/29] netfilter: flowtable: add tunnel encap/decap action " Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 15/29] nft_set_pipapo: Generalise group size for buckets Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 16/29] nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 17/29] nft_set_pipapo: Prepare for vectorised implementation: alignment Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 18/29] nft_set_pipapo: Prepare for vectorised implementation: helpers Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 19/29] nft_set_pipapo: Introduce AVX2-based lookup implementation Pablo Neira Ayuso
2020-03-19 10:20   ` Linus Walleij
2020-03-19 11:16     ` Stefano Brivio [this message]
2020-03-18  0:39 ` [PATCH 20/29] nft_set_pipapo: Prepare for single ranged field usage Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 21/29] netfilter: nf_tables: add nft_set_elem_expr_alloc() Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 22/29] netfilter: nf_tables: statify nft_expr_init() Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 23/29] netfilter: nf_tables: add elements with stateful expressions Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 24/29] netfilter: nf_tables: add nft_set_elem_update_expr() helper function Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 25/29] netfilter: nft_lookup: update element stateful expression Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 26/29] netfilter: conntrack: re-visit sysctls in unprivileged namespaces Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 27/29] netfilter: Rename ingress hook include file Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 28/29] netfilter: Generalize ingress hook Pablo Neira Ayuso
2020-03-18  0:39 ` [PATCH 29/29] netfilter: Introduce egress hook Pablo Neira Ayuso
2020-03-18  6:55 ` [PATCH 00/29] Netfilter updates for net-next Alexei Starovoitov
2020-03-18  8:11   ` David Miller
