netfilter-devel.vger.kernel.org archive mirror
* [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges
@ 2019-11-22 13:39 Stefano Brivio
  2019-11-22 13:40 ` [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields Stefano Brivio
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Stefano Brivio @ 2019-11-22 13:39 UTC
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

Existing nftables set implementations allow matching either entries
with interval expressions (rbtree), e.g. 192.0.2.1-192.0.2.4, or
entries specifying field concatenation (hash, rhash), e.g.
192.0.2.1:22, but not both.

In other words, none of the set types allows matching on range
expressions for more than one packet field at a time, such as ipset
does with types bitmap:ip,mac, and, to a more limited extent
(netmasks, not arbitrary ranges), with types hash:net,net,
hash:net,port, hash:ip,port,net, and hash:net,port,net.

As a pure hash-based approach is unsuitable for matching on ranges,
and "proxying" the existing red-black tree type looks impractical as
elements would need to be shared and managed across all employed
trees, this new set implementation intends to fill the functionality
gap by employing a relatively novel approach.

The fundamental idea, illustrated in deeper detail in patch 3/8, is to
use lookup tables classifying a small number of grouped bits from each
field, and map the lookup results in a way that yields a verdict for
the full set of specified fields.

The bit-grouping aspect is loosely inspired by the Grouper algorithm,
by Jay Ligatti, Josh Kuhn, and Chris Gage (see patch 3/8 for the full
reference).

A reference, stand-alone implementation of the algorithm itself is
available at:
	https://pipapo.lameexcu.se

Some notes about possible future optimisations are also mentioned
there. This algorithm reduces the matching problem to, essentially,
a repetitive sequence of simple bitwise operations, and is
particularly suitable to be optimised by leveraging SIMD instruction
sets. An AVX2-based implementation is also presented in this series.

I plan to post the adaptation of the existing AVX2 vectorised
implementation for (at least) NEON at a later time.

Patch 1/8 implements the needed UAPI bits: additions to the existing
interface are kept to a minimum by recycling existing concepts for
both ranging and concatenation, as suggested by Florian.

Patch 2/8 adds a new bitmap operation that copies the source bitmap
onto the destination while removing a given region, and is needed to
delete regions of arrays mapping between lookup tables.

Patch 3/8 is the actual set implementation.

Patch 4/8 introduces selftests for the new implementation.

Patch 5/8 provides an easy optimisation with substantial gain on
matching rates.

Patches 6/8 and 7/8 are preparatory work to add an alternative,
vectorised lookup implementation.

Patch 8/8 contains the AVX2-based implementation of the lookup
routines.

The nftables and libnftnl counterparts depend on changes to the UAPI
header file included in patch 1/8.

Credits go to Jay Ligatti, Josh Kuhn, and Chris Gage for their
original Grouper implementation and article from ICCCN proceedings
(see reference in patch 3/8), and to Daniel Lemire for his public
domain implementation of a fast iterator on set bits using built-in
implementations of the CTZL operation, also included in patch 3/8.

Special thanks go to Florian Westphal for all the nftables consulting
and the original interface idea, to Sabrina Dubroca for support with
RCU and bit manipulation topics, to Eric Garver for an early review,
and to Phil Sutter for reaffirming the need for the use case covered
here.

v2: changes listed in messages for 3/8 and 8/8

Stefano Brivio (8):
  netfilter: nf_tables: Support for subkeys, set with multiple ranged
    fields
  bitmap: Introduce bitmap_cut(): cut bits and shift remaining
  nf_tables: Add set type for arbitrary concatenation of ranges
  selftests: netfilter: Introduce tests for sets with range
    concatenation
  nft_set_pipapo: Provide unrolled lookup loops for common field sizes
  nft_set_pipapo: Prepare for vectorised implementation: alignment
  nft_set_pipapo: Prepare for vectorised implementation: helpers
  nft_set_pipapo: Introduce AVX2-based lookup implementation

 include/linux/bitmap.h                        |    4 +
 include/net/netfilter/nf_tables_core.h        |    2 +
 include/uapi/linux/netfilter/nf_tables.h      |   16 +
 lib/bitmap.c                                  |   66 +
 net/netfilter/Makefile                        |    6 +-
 net/netfilter/nf_tables_api.c                 |    4 +-
 net/netfilter/nf_tables_set_core.c            |    8 +
 net/netfilter/nft_set_pipapo.c                | 2165 +++++++++++++++++
 net/netfilter/nft_set_pipapo.h                |  236 ++
 net/netfilter/nft_set_pipapo_avx2.c           |  838 +++++++
 net/netfilter/nft_set_pipapo_avx2.h           |   14 +
 tools/testing/selftests/netfilter/Makefile    |    3 +-
 .../selftests/netfilter/nft_concat_range.sh   | 1481 +++++++++++
 13 files changed, 4839 insertions(+), 4 deletions(-)
 create mode 100644 net/netfilter/nft_set_pipapo.c
 create mode 100644 net/netfilter/nft_set_pipapo.h
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.c
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.h
 create mode 100755 tools/testing/selftests/netfilter/nft_concat_range.sh

-- 
2.20.1



* [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-22 13:39 [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
@ 2019-11-22 13:40 ` Stefano Brivio
  2019-11-23 20:01   ` Pablo Neira Ayuso
  2019-11-22 13:40 ` [PATCH nf-next v2 2/8] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-22 13:40 UTC
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

Introduce a new nested netlink attribute, NFTA_SET_SUBKEY, used to
specify the length of each field in a set concatenation.

This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would indicate this
capability with the NFT_SET_SUBKEY flag.

In order to specify the interval for a set entry, userspace would
simply keep using two elements per entry, as it happens now, with the
end element indicating the upper interval bound. As a single element
can now be a concatenation of several fields, with or without the
NFT_SET_ELEM_INTERVAL_END flag, we obtain a convenient way to support
multiple ranged fields in a set.

While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.

For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements:

  192.0.2.0 . 22
  192.0.2.42 . 25 with NFT_SET_ELEM_INTERVAL_END

and the NFTA_SET_SUBKEY attributes would be 32, 16, in that order.

Note that this does *not* represent the concatenated range:

  0xc0 0x00 0x02 0x00 0x00 0x16 - 0xc0 0x00 0x02 0x2a 0x00 0x19

on the six packet bytes of interest. That is, the range specified
does *not* include e.g. 0xc0 0x00 0x02 0x29 0x00 0x42, which is:
  192.0.2.41 . 66
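
To make the distinction concrete, here is a minimal userspace sketch,
not part of this patch. The struct and function names are made up for
illustration, and the key is assumed to be six unpadded bytes in
network byte order (real keys pad each field to 32 bits). It compares
per-field matching with a single range over the concatenated bytes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout: IPv4 address followed by port, both in
 * network byte order, six contiguous bytes with no padding.
 */
struct concat_key {
	uint8_t addr[4];
	uint8_t port[2];
};

/* Intended semantics: every field has to fall within its own range */
static bool match_per_field(const struct concat_key *start,
			    const struct concat_key *end,
			    const struct concat_key *pkt)
{
	/* Each field's start must not exceed its end */
	assert(memcmp(start->addr, end->addr, 4) <= 0);
	assert(memcmp(start->port, end->port, 2) <= 0);

	return memcmp(pkt->addr, start->addr, 4) >= 0 &&
	       memcmp(pkt->addr, end->addr, 4) <= 0 &&
	       memcmp(pkt->port, start->port, 2) >= 0 &&
	       memcmp(pkt->port, end->port, 2) <= 0;
}

/* One range over all six bytes: this is what the set does *not* mean */
static bool match_flat(const struct concat_key *start,
		       const struct concat_key *end,
		       const struct concat_key *pkt)
{
	return memcmp(pkt, start, sizeof(*pkt)) >= 0 &&
	       memcmp(pkt, end, sizeof(*pkt)) <= 0;
}
```

With the start and end elements from the example, the bytes
0xc0 0x00 0x02 0x29 0x00 0x42 pass match_flat() but fail
match_per_field(), because port 66 lies outside 22-25.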

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: No changes

 include/uapi/linux/netfilter/nf_tables.h | 16 ++++++++++++++++
 net/netfilter/nf_tables_api.c            |  4 ++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index bb9b049310df..f8dbeac14898 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -48,6 +48,7 @@ enum nft_registers {
 
 #define NFT_REG_SIZE	16
 #define NFT_REG32_SIZE	4
+#define NFT_REG32_COUNT	(NFT_REG32_15 - NFT_REG32_00 + 1)
 
 /**
  * enum nft_verdicts - nf_tables internal verdicts
@@ -275,6 +276,7 @@ enum nft_rule_compat_attributes {
  * @NFT_SET_TIMEOUT: set uses timeouts
  * @NFT_SET_EVAL: set can be updated from the evaluation path
  * @NFT_SET_OBJECT: set contains stateful objects
+ * @NFT_SET_SUBKEY: set uses subkeys to map intervals for multiple fields
  */
 enum nft_set_flags {
 	NFT_SET_ANONYMOUS		= 0x1,
@@ -284,6 +286,7 @@ enum nft_set_flags {
 	NFT_SET_TIMEOUT			= 0x10,
 	NFT_SET_EVAL			= 0x20,
 	NFT_SET_OBJECT			= 0x40,
+	NFT_SET_SUBKEY			= 0x80,
 };
 
 /**
@@ -309,6 +312,17 @@ enum nft_set_desc_attributes {
 };
 #define NFTA_SET_DESC_MAX	(__NFTA_SET_DESC_MAX - 1)
 
+/**
+ * enum nft_set_subkey_attributes - subkeys for multiple ranged fields
+ *
+ * @NFTA_SET_SUBKEY_LEN: length of single field, in bits (NLA_U32)
+ */
+enum nft_set_subkey_attributes {
+	NFTA_SET_SUBKEY_LEN,
+	__NFTA_SET_SUBKEY_MAX
+};
+#define NFTA_SET_SUBKEY_MAX	(__NFTA_SET_SUBKEY_MAX - 1)
+
 /**
  * enum nft_set_attributes - nf_tables set netlink attributes
  *
@@ -327,6 +341,7 @@ enum nft_set_desc_attributes {
  * @NFTA_SET_USERDATA: user data (NLA_BINARY)
  * @NFTA_SET_OBJ_TYPE: stateful object type (NLA_U32: NFT_OBJECT_*)
  * @NFTA_SET_HANDLE: set handle (NLA_U64)
+ * @NFTA_SET_SUBKEY: subkeys for multiple ranged fields (NLA_NESTED)
  */
 enum nft_set_attributes {
 	NFTA_SET_UNSPEC,
@@ -346,6 +361,7 @@ enum nft_set_attributes {
 	NFTA_SET_PAD,
 	NFTA_SET_OBJ_TYPE,
 	NFTA_SET_HANDLE,
+	NFTA_SET_SUBKEY,
 	__NFTA_SET_MAX
 };
 #define NFTA_SET_MAX		(__NFTA_SET_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index ff04cdc87f76..a877d60f86a9 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3248,7 +3248,7 @@ EXPORT_SYMBOL_GPL(nft_unregister_set);
 
 #define NFT_SET_FEATURES	(NFT_SET_INTERVAL | NFT_SET_MAP | \
 				 NFT_SET_TIMEOUT | NFT_SET_OBJECT | \
-				 NFT_SET_EVAL)
+				 NFT_SET_EVAL | NFT_SET_SUBKEY)
 
 static bool nft_set_ops_candidate(const struct nft_set_type *type, u32 flags)
 {
@@ -3826,7 +3826,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
 		if (flags & ~(NFT_SET_ANONYMOUS | NFT_SET_CONSTANT |
 			      NFT_SET_INTERVAL | NFT_SET_TIMEOUT |
 			      NFT_SET_MAP | NFT_SET_EVAL |
-			      NFT_SET_OBJECT))
+			      NFT_SET_OBJECT | NFT_SET_SUBKEY))
 			return -EINVAL;
 		/* Only one of these operations is supported */
 		if ((flags & (NFT_SET_MAP | NFT_SET_OBJECT)) ==
-- 
2.20.1



* [PATCH nf-next v2 2/8] bitmap: Introduce bitmap_cut(): cut bits and shift remaining
  2019-11-22 13:39 [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
  2019-11-22 13:40 ` [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields Stefano Brivio
@ 2019-11-22 13:40 ` Stefano Brivio
  2019-11-22 13:40 ` [PATCH nf-next v2 3/8] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Stefano Brivio @ 2019-11-22 13:40 UTC
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

The new bitmap function bitmap_cut() copies bits from source to
destination, removing the region of 'cut' bits that starts at bit
'first', and right shifting the bits above the removed region by
'cut' positions to close the gap.
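
The intended semantics can be modelled bit by bit in userspace. The
sketch below is not the code from this patch: the helper names are
hypothetical, and it assumes that positions vacated at the top are
filled with zeroes, whereas the kernel version shifts whole words and
may treat bits beyond nbits differently.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

static int get_bit(const unsigned long *map, unsigned int n)
{
	return (map[n / WORD_BITS] >> (n % WORD_BITS)) & 1UL;
}

static void put_bit(unsigned long *map, unsigned int n, int val)
{
	unsigned long mask = 1UL << (n % WORD_BITS);

	if (val)
		map[n / WORD_BITS] |= mask;
	else
		map[n / WORD_BITS] &= ~mask;
}

/* Bits below 'first' are kept, bits from 'first + cut' upwards move down
 * by 'cut' positions, and the vacated top positions are cleared (an
 * assumption of this model). Walking upwards only reads positions that
 * were not written yet, so dst may be the same bitmap as src.
 */
static void cut_bits_model(unsigned long *dst, const unsigned long *src,
			   unsigned int first, unsigned int cut,
			   unsigned int nbits)
{
	unsigned int n;

	assert(first <= nbits);

	for (n = 0; n < nbits; n++) {
		int val;

		if (n < first)
			val = get_bit(src, n);
		else if (cut < nbits - n)	/* n + cut is still in range */
			val = get_bit(src, n + cut);
		else
			val = 0;

		put_bit(dst, n, val);
	}
}
```

A model like this is mostly useful as a reference to compare an
optimised implementation against on small bitmaps.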

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: No changes

 include/linux/bitmap.h |  4 +++
 lib/bitmap.c           | 66 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 29fc933df3bf..e66cff371688 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -53,6 +53,7 @@
  *  bitmap_find_next_zero_area_off(buf, len, pos, n, mask)  as above
  *  bitmap_shift_right(dst, src, n, nbits)      *dst = *src >> n
  *  bitmap_shift_left(dst, src, n, nbits)       *dst = *src << n
+ *  bitmap_cut(dst, src, first, n, nbits)       Cut n bits from first, copy rest
  *  bitmap_remap(dst, src, old, new, nbits)     *dst = map(old, new)(src)
  *  bitmap_bitremap(oldbit, old, new, nbits)    newbit = map(old, new)(oldbit)
  *  bitmap_onto(dst, orig, relmap, nbits)       *dst = orig relative to relmap
@@ -130,6 +131,9 @@ extern void __bitmap_shift_right(unsigned long *dst, const unsigned long *src,
 				unsigned int shift, unsigned int nbits);
 extern void __bitmap_shift_left(unsigned long *dst, const unsigned long *src,
 				unsigned int shift, unsigned int nbits);
+extern void bitmap_cut(unsigned long *dst, const unsigned long *src,
+		       unsigned int first, unsigned int cut,
+		       unsigned int nbits);
 extern int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
 			const unsigned long *bitmap2, unsigned int nbits);
 extern void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
diff --git a/lib/bitmap.c b/lib/bitmap.c
index f9e834841e94..90ac4f413275 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -168,6 +168,72 @@ void __bitmap_shift_left(unsigned long *dst, const unsigned long *src,
 }
 EXPORT_SYMBOL(__bitmap_shift_left);
 
+/**
+ * bitmap_cut() - remove bit region from bitmap and right shift remaining bits
+ * @dst: destination bitmap, might overlap with src
+ * @src: source bitmap
+ * @first: start bit of region to be removed
+ * @cut: number of bits to remove
+ * @nbits: bitmap size, in bits
+ *
+ * Set the n-th bit of @dst iff the n-th bit of @src is set and
+ * n is less than @first, or the m-th bit of @src is set for any
+ * m such that @first <= n < nbits, and m = n + @cut.
+ *
+ * In pictures, example for a big-endian 32-bit architecture:
+ *
+ * @src:
+ * 31                                   63
+ * |                                    |
+ * 10000000 11000001 11110010 00010101  10000000 11000001 01110010 00010101
+ *                 |  |              |                                    |
+ *                16  14             0                                   32
+ *
+ * if @cut is 3, and @first is 14, bits 14-16 in @src are cut and @dst is:
+ *
+ * 31                                   63
+ * |                                    |
+ * 10110000 00011000 00110010 00010101  00010000 00011000 00101110 01000010
+ *                    |              |                                    |
+ *                    14 (bit 17     0                                   32
+ *                        from @src)
+ *
+ * Note that @dst and @src might overlap partially or entirely.
+ *
+ * This is implemented in the obvious way, with a shift and carry
+ * step for each moved bit. Optimisation is left as an exercise
+ * for the compiler.
+ */
+void bitmap_cut(unsigned long *dst, const unsigned long *src,
+		unsigned int first, unsigned int cut, unsigned int nbits)
+{
+	unsigned int len = BITS_TO_LONGS(nbits);
+	unsigned long keep = 0, carry;
+	int i;
+
+	memmove(dst, src, len * sizeof(*dst));
+
+	if (first % BITS_PER_LONG) {
+		keep = src[first / BITS_PER_LONG] &
+		       (~0UL >> (BITS_PER_LONG - first % BITS_PER_LONG));
+	}
+
+	while (cut--) {
+		for (i = first / BITS_PER_LONG; i < len; i++) {
+			if (i < len - 1)
+				carry = dst[i + 1] & 1UL;
+			else
+				carry = 0;
+
+			dst[i] = (dst[i] >> 1) | (carry << (BITS_PER_LONG - 1));
+		}
+	}
+
+	dst[first / BITS_PER_LONG] &= ~0UL << (first % BITS_PER_LONG);
+	dst[first / BITS_PER_LONG] |= keep;
+}
+EXPORT_SYMBOL(bitmap_cut);
+
 int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
 				const unsigned long *bitmap2, unsigned int bits)
 {
-- 
2.20.1



* [PATCH nf-next v2 3/8] nf_tables: Add set type for arbitrary concatenation of ranges
  2019-11-22 13:39 [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
  2019-11-22 13:40 ` [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields Stefano Brivio
  2019-11-22 13:40 ` [PATCH nf-next v2 2/8] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
@ 2019-11-22 13:40 ` Stefano Brivio
  2019-11-27  9:29   ` Pablo Neira Ayuso
  2019-11-22 13:40 ` [PATCH nf-next v2 4/8] selftests: netfilter: Introduce tests for sets with range concatenation Stefano Brivio
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-22 13:40 UTC
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

This new set type allows for intervals in concatenated fields.
Fields are expressed in the usual way, that is, simple byte
concatenation with each field padded to 32 bits. Ranges are given
as a start element and an end element, each containing the full
concatenation of the start, or respectively end, values of all
fields.

Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided into 4-bit groups, and for each group, the
lookup table contains 2^4 buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
	http://www.cse.usf.edu/~ligatti/projects/grouper/
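
As an illustration of the expansion step, a range over a 32-bit field
can be split into netmasks with a simple greedy walk. This is only a
sketch under my own naming (struct prefix and range_to_prefixes() are
hypothetical); it is not pipapo_expand() from this patch.

```c
#include <assert.h>
#include <stdint.h>

struct prefix {
	uint32_t base;	/* first value covered by the netmask */
	int len;	/* prefix length, 0..32 */
};

/* Split the inclusive range [start, end] into aligned power-of-two
 * blocks, walking up from 'start'. Returns the number of prefixes
 * written: 'out' needs room for at most 2 * 32 entries.
 */
static int range_to_prefixes(uint32_t start, uint32_t end,
			     struct prefix *out)
{
	uint64_t cur = start;	/* 64 bits: no wrap if end is UINT32_MAX */
	int n = 0;

	assert(start <= end);

	while (cur <= end) {
		int len = 32;

		/* Double the block while it stays aligned and in range */
		while (len > 0) {
			uint64_t size = 1ULL << (32 - len + 1);

			if ((cur & (size - 1)) || cur + size - 1 > end)
				break;
			len--;
		}

		out[n].base = (uint32_t)cur;
		out[n].len = len;
		n++;

		cur += 1ULL << (32 - len);
	}

	return n;
}
```

For 192.168.1.0-192.168.2.1 this yields 192.168.1.0/24 and
192.168.2.0/31, each of which would then be inserted as one rule.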

Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.
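
A minimal sketch of this lookup for a single 16-bit field follows. It
assumes at most 64 rules, so that one bucket fits in a uint64_t, and
uses hypothetical names (struct lookup_table, lt_insert(),
lt_lookup()); the layout in nft_set_pipapo.c is different.

```c
#include <assert.h>
#include <stdint.h>

#define GROUPS	4	/* 16-bit field, 4 bits per group */
#define BUCKETS	16	/* 2^4 possible values for each group */

/* bucket[g][b] has bit r set if rule r accepts value b for group g */
struct lookup_table {
	uint64_t bucket[GROUPS][BUCKETS];
};

/* Add a rule matching the 'len' most significant bits of 'value' */
static void lt_insert(struct lookup_table *t, int rule, uint16_t value,
		      int len)
{
	uint16_t mask = len ? (uint16_t)(0xffffu << (16 - len)) : 0;
	int g, b;

	assert(rule >= 0 && rule < 64 && len >= 0 && len <= 16);

	for (g = 0; g < GROUPS; g++) {
		int shift = 12 - 4 * g;	/* group 0 is most significant */
		int want = (value >> shift) & 0xf;
		int care = (mask >> shift) & 0xf;

		/* Wildcard bits select every bucket that agrees on the rest */
		for (b = 0; b < BUCKETS; b++)
			if ((b & care) == (want & care))
				t->bucket[g][b] |= 1ULL << rule;
	}
}

/* AND one bucket per group: surviving bits are the matching rules */
static uint64_t lt_lookup(const struct lookup_table *t, uint16_t value)
{
	uint64_t res = ~0ULL;
	int g;

	for (g = 0; g < GROUPS; g++)
		res &= t->bucket[g][(value >> (12 - 4 * g)) & 0xf];

	return res;
}
```

The lookup cost depends on the number of groups and on the bucket
width, not on how many rules are set in each bucket.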

In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.
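
The step between two fields can be sketched as follows, again assuming
at most 64 rules per field and using made-up names (struct map_entry,
refill()). Each mapping entry is taken to be a pair of first rule
index and rule count in the next field.

```c
#include <assert.h>
#include <stdint.h>

/* One entry per rule of the current field: the rule maps to 'n' rules
 * of the next field, starting from rule index 'first'.
 */
struct map_entry {
	unsigned int first;
	unsigned int n;
};

/* Build the starting bitmap for the next field from the rules that
 * matched in the current one.
 */
static uint64_t refill(uint64_t matched, const struct map_entry *map,
		       unsigned int rules)
{
	uint64_t next = 0;
	unsigned int r, i;

	assert(rules <= 64);

	for (r = 0; r < rules; r++) {
		if (!(matched & (1ULL << r)))
			continue;

		assert(map[r].first + map[r].n <= 64);

		/* Enable rules first .. first + n - 1 in the next field */
		for (i = 0; i < map[r].n; i++)
			next |= 1ULL << (map[r].first + i);
	}

	return next;
}
```

The bitmap returned here would replace the all-ones starting value
when bucket values for the next field are ANDed together.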

The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.

A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.

A stand-alone implementation of this algorithm is available at:
	https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).

This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.

v2:
 - protect access to scratch maps in nft_pipapo_lookup() with
   local_bh_disable/enable() (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
   already implied (Florian Westphal)
 - explain why partial allocation failures don't need handling
   in pipapo_realloc_scratch(), rename 'm' to clone and update
   related kerneldoc to make it clear we're not operating on
   the live copy (Florian Westphal)
 - add explicit check for priv->start_elem in
   nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
   with a NULL start element, and also zero it out in every
   operation that might make it invalid, so that insertion
   doesn't proceed with an invalid element (Florian Westphal)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/netfilter/nf_tables_core.h |    1 +
 net/netfilter/Makefile                 |    3 +-
 net/netfilter/nf_tables_set_core.c     |    2 +
 net/netfilter/nft_set_pipapo.c         | 2197 ++++++++++++++++++++++++
 4 files changed, 2202 insertions(+), 1 deletion(-)
 create mode 100644 net/netfilter/nft_set_pipapo.c

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index 7281895fa6d9..9759257ec8ec 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -74,6 +74,7 @@ extern struct nft_set_type nft_set_hash_type;
 extern struct nft_set_type nft_set_hash_fast_type;
 extern struct nft_set_type nft_set_rbtree_type;
 extern struct nft_set_type nft_set_bitmap_type;
+extern struct nft_set_type nft_set_pipapo_type;
 
 struct nft_expr;
 struct nft_regs;
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 5e9b2eb24349..3f572e5a975e 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -81,7 +81,8 @@ nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 		  nft_chain_route.o nf_tables_offload.o
 
 nf_tables_set-objs := nf_tables_set_core.o \
-		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o
+		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o \
+		      nft_set_pipapo.o
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NF_TABLES_SET)	+= nf_tables_set.o
diff --git a/net/netfilter/nf_tables_set_core.c b/net/netfilter/nf_tables_set_core.c
index a9fce8d10051..586b621007eb 100644
--- a/net/netfilter/nf_tables_set_core.c
+++ b/net/netfilter/nf_tables_set_core.c
@@ -9,12 +9,14 @@ static int __init nf_tables_set_module_init(void)
 	nft_register_set(&nft_set_rhash_type);
 	nft_register_set(&nft_set_bitmap_type);
 	nft_register_set(&nft_set_rbtree_type);
+	nft_register_set(&nft_set_pipapo_type);
 
 	return 0;
 }
 
 static void __exit nf_tables_set_module_exit(void)
 {
+	nft_unregister_set(&nft_set_pipapo_type);
 	nft_unregister_set(&nft_set_rbtree_type);
 	nft_unregister_set(&nft_set_bitmap_type);
 	nft_unregister_set(&nft_set_rhash_type);
diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
new file mode 100644
index 000000000000..3cad9aedc168
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo.c
@@ -0,0 +1,2197 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* PIPAPO: PIle PAcket POlicies: set for arbitrary concatenations of ranges
+ *
+ * Copyright (c) 2019 Red Hat GmbH
+ *
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+/**
+ * DOC: Theory of Operation
+ *
+ *
+ * Problem
+ * -------
+ *
+ * Match packet bytes against entries composed of ranged or non-ranged packet
+ * field specifiers, mapping them to arbitrary references. For example:
+ *
+ * ::
+ *
+ *               --- fields --->
+ *      |    [net],[port],[net]... => [reference]
+ *   entries [net],[port],[net]... => [reference]
+ *      |    [net],[port],[net]... => [reference]
+ *      V    ...
+ *
+ * where [net] fields can be IP ranges or netmasks, and [port] fields are port
+ * ranges. Arbitrary packet fields can be matched.
+ *
+ *
+ * Algorithm Overview
+ * ------------------
+ *
+ * This algorithm is loosely inspired by [Ligatti 2010], and fundamentally
+ * relies on the observation that every contiguous range in a space of b bits
+ * can be converted into at most b * 2 netmasks, from Theorem 3 in
+ * [Rottenstreich 2010], as also illustrated in Section 9 of [Kogan 2014].
+ *
+ * Classification against a number of entries, that require matching given bits
+ * of a packet field, is performed by grouping those bits in sets of arbitrary
+ * size, and classifying packet bits one group at a time.
+ *
+ * Example:
+ *   to match the source port (16 bits) of a packet, we can divide those 16 bits
+ *   in 4 groups of 4 bits each. Given the entry:
+ *      0000 0001 0101 1001
+ *   and a packet with source port:
+ *      0000 0001 1010 1001
+ *   first and second groups match, but the third doesn't. We conclude that the
+ *   packet doesn't match the given entry.
+ *
+ * Translate the set to a sequence of lookup tables, one per field. Each table
+ * has two dimensions: bit groups to be matched for a single packet field, and
+ * all the possible values of said groups (buckets). Input entries are
+ * represented as one or more rules, depending on the number of composing
+ * netmasks for the given field specifier, and a group match is indicated as a
+ * set bit, with number corresponding to the rule index, in all the buckets
+ * whose value matches the entry for a given group.
+ *
+ * Rules are mapped between fields through an array of x, n pairs, with each
+ * item mapping a matched rule to one or more rules. The position of the pair in
+ * the array indicates the matched rule to be mapped to the next field, x
+ * indicates the first rule index in the next field, and n the amount of
+ * next-field rules the current rule maps to.
+ *
+ * The mapping array for the last field maps to the desired references.
+ *
+ * To match, we perform table lookups using the values of grouped packet bits,
+ * and use a sequence of bitwise operations to progressively evaluate rule
+ * matching.
+ *
+ * A stand-alone, reference implementation, also including notes about possible
+ * future optimisations, is available at:
+ *    https://pipapo.lameexcu.se/
+ *
+ * Insertion
+ * ---------
+ *
+ * - For each packet field:
+ *
+ *   - divide the b packet bits we want to classify into groups of size t,
+ *     obtaining ceil(b / t) groups
+ *
+ *      Example: match on destination IP address, with t = 4: 32 bits, 8 groups
+ *      of 4 bits each
+ *
+ *   - allocate a lookup table with one column ("bucket") for each possible
+ *     value of a group, and with one row for each group
+ *
+ *      Example: 8 groups, 2^4 buckets:
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0
+ *        1
+ *        2
+ *        3
+ *        4
+ *        5
+ *        6
+ *        7
+ *
+ *   - map the bits we want to classify for the current field, for a given
+ *     entry, to a single rule for non-ranged and netmask set items, and to one
+ *     or multiple rules for ranges. Ranges are expanded to composing netmasks
+ *     by pipapo_expand().
+ *
+ *      Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048
+ *      - rule #0: 10.0.0.5
+ *      - rule #1: 192.168.1.0/24
+ *      - rule #2: 192.168.2.0/31
+ *
+ *   - insert references to the rules in the lookup table, selecting buckets
+ *     according to bit values of a rule in the given group. This is done by
+ *     pipapo_insert().
+ *
+ *      Example: given:
+ *      - rule #0: 10.0.0.5 mapping to buckets
+ *        < 0 10  0 0   0 0  0 5 >
+ *      - rule #1: 192.168.1.0/24 mapping to buckets
+ *        < 12 0  10 8  0 1  < 0..15 > < 0..15 > >
+ *      - rule #2: 192.168.2.0/31 mapping to buckets
+ *        < 12 0  10 8  0 2  0 < 0..1 > >
+ *
+ *      these bits are set in the lookup table:
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        1   1,2                                      0
+ *        2    0                                      1,2
+ *        3    0                              1,2
+ *        4  0,1,2
+ *        5    0   1   2
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *
+ *   - if this is not the last field in the set, fill a mapping array that maps
+ *     rules from the lookup table to rules belonging to the same entry in
+ *     the next lookup table, done by pipapo_map().
+ *
+ *     Note that as rules map to contiguous ranges of rules, given how netmask
+ *     expansion and insertion are performed, &union nft_pipapo_map_bucket
+ *     stores this information as pairs of first rule index, rule count.
+ *
+ *      Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048,
+ *      given lookup table #0 for field 0 (see example above):
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        1   1,2                                      0
+ *        2    0                                      1,2
+ *        3    0                              1,2
+ *        4  0,1,2
+ *        5    0   1   2
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *
+ *      and lookup table #1 for field 1 with:
+ *      - rule #0: 1024 mapping to buckets
+ *        < 0  0  4  0 >
+ *      - rule #1: 2048 mapping to buckets
+ *        < 0  0  5  0 >
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0   0,1
+ *        1   0,1
+ *        2                    0   1
+ *        3   0,1
+ *
+ *      we need to map rules for 10.0.0.5 in lookup table #0 (rule #0) to 1024
+ *      in lookup table #1 (rule #0) and rules for 192.168.1.0-192.168.2.1
+ *      (rules #1, #2) to 2048 in lookup table #1 (rule #1):
+ *
+ * ::
+ *
+ *       rule indices in current field: 0    1    2
+ *       map to rules in next field:    0    1    1
+ *
+ *   - if this is the last field in the set, fill a mapping array that maps
+ *     rules from the last lookup table to element pointers, also done by
+ *     pipapo_map().
+ *
+ *     Note that, in this implementation, we have two elements (start, end) for
+ *     each entry. The pointer to the end element is stored in this array, and
+ *     the pointer to the start element is linked from it.
+ *
+ *      Example: entry 10.0.0.5:1024 has a corresponding &struct nft_pipapo_elem
+ *      pointer, 0x66, and element for 192.168.1.0-192.168.2.1:2048 is at 0x42.
+ *      From the rules of lookup table #1 as mapped above:
+ *
+ * ::
+ *
+ *       rule indices in last field:    0    1
+ *       map to elements:             0x66  0x42
+ *
+ *
+ * Matching
+ * --------
+ *
+ * We use a result bitmap, with the size of a single lookup table bucket, to
+ * represent the matching state that applies at every algorithm step. This is
+ * done by pipapo_lookup().
+ *
+ * - For each packet field:
+ *
+ *   - start with an all-ones result bitmap (res_map in pipapo_lookup())
+ *
+ *   - perform a lookup into the table corresponding to the current field,
+ *     for each group, and at every group, AND the current result bitmap with
+ *     the value from the lookup table bucket
+ *
+ * ::
+ *
+ *      Example: 192.168.1.5 < 12 0  10 8  0 1  0 5 >, with lookup table from
+ *      insertion examples.
+ *      Lookup table buckets are at least 3 bits wide, we'll assume 8 bits for
+ *      convenience in this example. Initial result bitmap is 0xff, the steps
+ *      below show the value of the result bitmap after each group is processed:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        result bitmap is now: 0xff & 0x6 [bucket 12] = 0x6
+ *
+ *        1   1,2                                      0
+ *        result bitmap is now: 0x6 & 0x6 [bucket 0] = 0x6
+ *
+ *        2    0                                      1,2
+ *        result bitmap is now: 0x6 & 0x6 [bucket 10] = 0x6
+ *
+ *        3    0                              1,2
+ *        result bitmap is now: 0x6 & 0x6 [bucket 8] = 0x6
+ *
+ *        4  0,1,2
+ *        result bitmap is now: 0x6 & 0x7 [bucket 0] = 0x6
+ *
+ *        5    0   1   2
+ *        result bitmap is now: 0x6 & 0x2 [bucket 1] = 0x2
+ *
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        result bitmap is now: 0x2 & 0x7 [bucket 0] = 0x2
+ *
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *        final result bitmap for this field is: 0x2 & 0x3 [bucket 5] = 0x2
+ *
+ *   - at the next field, start with a new, all-zeroes result bitmap. For each
+ *     bit set in the previous result bitmap, fill the new result bitmap
+ *     (fill_map in nft_pipapo_lookup()) with the rule indices from the
+ *     corresponding buckets of the mapping table for the field just matched,
+ *     done by pipapo_refill()
+ *
+ *      Example: with mapping table from insertion examples, with the current
+ *      result bitmap from the previous example, 0x02:
+ *
+ * ::
+ *
+ *       rule indices in current field: 0    1    2
+ *       map to rules in next field:    0    1    1
+ *
+ *      the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be
+ *      set.
+ *
+ *      We can now extend this example to cover the second iteration of the step
+ *      above (lookup and AND bitmap): assuming the port field is
+ *      2048 < 0  0  5  0 >, with starting result bitmap 0x2, and lookup table
+ *      for "port" field from pre-computation example:
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0   0,1
+ *        1   0,1
+ *        2                    0   1
+ *        3   0,1
+ *
+ *       operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5]
+ *       & 0x3 [bucket 0], resulting bitmap is 0x2.
+ *
+ *   - if this is the last field in the set, look up the value from the mapping
+ *     array corresponding to the final result bitmap
+ *
+ *      Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for
+ *      last field from insertion example:
+ *
+ * ::
+ *
+ *       rule indices in last field:    0    1
+ *       map to elements:             0x66  0x42
+ *
+ *      the matching element is at 0x42.
+ *
+ *
+ * References
+ * ----------
+ *
+ * [Ligatti 2010]
+ *      A Packet-classification Algorithm for Arbitrary Bitmask Rules, with
+ *      Automatic Time-space Tradeoffs
+ *      Jay Ligatti, Josh Kuhn, and Chris Gage.
+ *      Proceedings of the IEEE International Conference on Computer
+ *      Communication Networks (ICCCN), August 2010.
+ *      http://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf
+ *
+ * [Rottenstreich 2010]
+ *      Worst-Case TCAM Rule Expansion
+ *      Ori Rottenstreich and Isaac Keslassy.
+ *      2010 Proceedings IEEE INFOCOM, San Diego, CA, 2010.
+ *      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.4592&rep=rep1&type=pdf
+ *
+ * [Kogan 2014]
+ *      SAX-PAC (Scalable And eXpressive PAcket Classification)
+ *      Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane,
+ *      and Patrick Eugster.
+ *      Proceedings of the 2014 ACM conference on SIGCOMM, August 2014.
+ *      http://www.sigcomm.org/sites/default/files/ccr/papers/2014/August/2619239-2626294.pdf
+ */
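The matching walk-through in the comment above can be reproduced outside the
kernel. What follows is only a userspace sketch, not code from this patch: it
assumes a single 32-bit field, 8-bit buckets as in the example, and the three
rules that can be read back from the lookup table shown there (10.0.0.5,
192.168.1.0/24 and 192.168.2.0/31). All sk_* names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SK_GROUPS	8	/* one 32-bit field, split in 4-bit groups */
#define SK_BUCKETS	16	/* 2 ^ 4 possible values per group */

/* Bit r of sk_lt[group][value] set: rule r matches 'value' in 'group' */
static uint8_t sk_lt[SK_GROUPS][SK_BUCKETS];

/* Value of 4-bit group 'g' in network-order key 'k' */
static int sk_nibble(const uint8_t *k, int g)
{
	return (g % 2) ? (k[g / 2] & 0x0f) : (k[g / 2] >> 4);
}

/* Reference 'rule' from all buckets matching prefix k / mask_bits */
static void sk_insert(int rule, const uint8_t *k, int mask_bits)
{
	assert(rule >= 0 && rule < 8 && mask_bits >= 0 && mask_bits <= 32);

	for (int g = 0; g < SK_GROUPS; g++) {
		int v = sk_nibble(k, g), mask, i;

		if (mask_bits >= (g + 1) * 4)
			mask = 0x00;	/* not masked: one bucket */
		else if (mask_bits <= g * 4)
			mask = 0x0f;	/* completely masked: all buckets */
		else
			mask = 0x0f >> (mask_bits - g * 4);

		for (i = 0; i < SK_BUCKETS; i++) {
			if ((i & ~mask) == (v & ~mask))
				sk_lt[g][i] |= (uint8_t)(1u << rule);
		}
	}
}

/* Fill the table with the three rules of the example */
static void sk_build(void)
{
	static const uint8_t r0[4] = { 10, 0, 0, 5 };
	static const uint8_t r1[4] = { 192, 168, 1, 0 };
	static const uint8_t r2[4] = { 192, 168, 2, 0 };

	sk_insert(0, r0, 32);
	sk_insert(1, r1, 24);
	sk_insert(2, r2, 31);
}

/* Start from an all-ones bitmap, AND one bucket per group */
static uint8_t sk_lookup(const uint8_t *key)
{
	uint8_t res = 0xff;

	for (int g = 0; g < SK_GROUPS; g++)
		res &= sk_lt[g][sk_nibble(key, g)];

	return res;
}
```

After sk_build(), sk_lookup() on the bytes of 192.168.1.5 yields 0x2, the
same final value as in the step-by-step example. A real set needs buckets
wider than 8 bits and one table per field, which this sketch leaves out.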
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/log2.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+#include <uapi/linux/netfilter/nf_tables.h>
+#include <net/ipv6.h>			/* For the maximum length of a field */
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+
+/* Count of concatenated fields depends on count of 32-bit nftables registers */
+#define NFT_PIPAPO_MAX_FIELDS		NFT_REG32_COUNT
+
+/* Largest supported field size */
+#define NFT_PIPAPO_MAX_BYTES		(sizeof(struct in6_addr))
+#define NFT_PIPAPO_MAX_BITS		(NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE)
+
+/* Number of bits to be grouped together in lookup table buckets, arbitrary */
+#define NFT_PIPAPO_GROUP_BITS		4
+#define NFT_PIPAPO_GROUPS_PER_BYTE	(BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS)
+
+/* Fields are padded to 32 bits in input registers */
+#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x)				\
+	(round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32)))
+#define NFT_PIPAPO_GROUPS_PADDING(x)					\
+	(NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE)
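As a worked instance of the two padding macros above, here is a userspace
sketch with plain functions. sk_round_up() is a stand-in written for this
example (it assumes a power-of-two alignment), and the sk_* names are not
part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SK_GROUPS_PER_BYTE	2	/* two 4-bit groups per byte */

/* Stand-in for round_up(), valid for power-of-two 'to' only */
static size_t sk_round_up(size_t x, size_t to)
{
	assert(to && (to & (to - 1)) == 0);

	return (x + to - 1) & ~(to - 1);
}

/* Bytes taken by a field of 'groups' groups in 32-bit registers */
static size_t sk_padded_size(size_t groups)
{
	return sk_round_up(groups / SK_GROUPS_PER_BYTE, 4);
}

/* Trailing bytes to skip after the field data itself */
static size_t sk_padding(size_t groups)
{
	return sk_padded_size(groups) - groups / SK_GROUPS_PER_BYTE;
}
```

For a 16-bit port (4 groups) this gives 4 padded bytes and 2 bytes of
padding; a 32-bit address (8 groups) needs no padding, and a 48-bit MAC
address (12 groups) is padded from 6 to 8 bytes.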
+
+/* Number of buckets, given by 2 ^ n, with n grouped bits */
+#define NFT_PIPAPO_BUCKETS		(1 << NFT_PIPAPO_GROUP_BITS)
+
+/* Each n-bit range maps to up to n * 2 rules */
+#define NFT_PIPAPO_MAP_NBITS		(const_ilog2(NFT_PIPAPO_MAX_BITS * 2))
+
+/* Use the rest of mapping table buckets for rule indices, but it makes no sense
+ * to exceed 32 bits
+ */
+#if BITS_PER_LONG == 64
+#define NFT_PIPAPO_MAP_TOBITS		32
+#else
+#define NFT_PIPAPO_MAP_TOBITS		(BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS)
+#endif
+
+/* ...which gives us the highest allowed index for a rule */
+#define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
+					- (1UL << NFT_PIPAPO_MAP_NBITS))
+
+#define nft_pipapo_for_each_field(field, index, match)		\
+	for ((field) = (match)->f, (index) = 0;			\
+	     (index) < (match)->field_count;			\
+	     (index)++, (field)++)
+
+/**
+ * union nft_pipapo_map_bucket - Bucket of mapping table
+ * @to:		First rule number (in next field) this rule maps to
+ * @n:		Number of rules (in next field) this rule maps to
+ * @e:		If there's no next field, pointer to element this rule maps to
+ */
+union nft_pipapo_map_bucket {
+	struct {
+#if BITS_PER_LONG == 64
+		static_assert(NFT_PIPAPO_MAP_TOBITS <= 32);
+		u32 to;
+
+		static_assert(NFT_PIPAPO_MAP_NBITS <= 32);
+		u32 n;
+#else
+		unsigned long to:NFT_PIPAPO_MAP_TOBITS;
+		unsigned long  n:NFT_PIPAPO_MAP_NBITS;
+#endif
+	};
+	struct nft_pipapo_elem *e;
+};
+
+/**
+ * struct nft_pipapo_field - Lookup, mapping tables and related data for a field
+ * @groups:	Amount of 4-bit groups
+ * @rules:	Number of inserted rules
+ * @bsize:	Size of each bucket in lookup table, in longs
+ * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
+ * @mt:		Mapping table: one bucket per rule
+ */
+struct nft_pipapo_field {
+	int groups;
+	unsigned long rules;
+	size_t bsize;
+	unsigned long *lt;
+	union nft_pipapo_map_bucket *mt;
+};
+
+/**
+ * struct nft_pipapo_match - Data used for lookup and matching
+ * @field_count:	Amount of fields in set
+ * @scratch:		Preallocated per-CPU maps for partial matching results
+ * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
+ * @rcu:		Matching data is swapped on commits
+ * @f:			Fields, with lookup and mapping tables
+ */
+struct nft_pipapo_match {
+	int field_count;
+	unsigned long * __percpu *scratch;
+	size_t bsize_max;
+	struct rcu_head rcu;
+	struct nft_pipapo_field f[0];
+};
+
+/* Current working bitmap index, toggled between field matches */
+static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index);
+
+/**
+ * struct nft_pipapo - Representation of a set
+ * @match:	Currently in-use matching data
+ * @clone:	Copy where pending insertions and deletions are kept
+ * @groups:	Total amount of 4-bit groups for fields in this set
+ * @width:	Total bytes to be matched for one packet, including padding
+ * @dirty:	Working copy has pending insertions or deletions
+ * @last_gc:	Timestamp of last garbage collection run, jiffies
+ * @start_data:	Key data of start element for insertion
+ * @start_elem:	Start element for insertion
+ */
+struct nft_pipapo {
+	struct nft_pipapo_match __rcu *match;
+	struct nft_pipapo_match *clone;
+	int groups;
+	int width;
+	bool dirty;
+	unsigned long last_gc;
+	u8 start_data[NFT_DATA_VALUE_MAXLEN * sizeof(u32)];
+	struct nft_pipapo_elem *start_elem;
+};
+
+struct nft_pipapo_elem;
+
+/**
+ * struct nft_pipapo_elem - API-facing representation of single set element
+ * @start:	Pointer to element that represents start of interval
+ * @ext:	nftables API extensions
+ */
+struct nft_pipapo_elem {
+	struct nft_pipapo_elem *start;
+	struct nft_set_ext ext;
+};
+
+/**
+ * pipapo_refill() - For each set bit, set bits from selected mapping table item
+ * @map:	Bitmap to be scanned for set bits
+ * @len:	Length of bitmap in longs
+ * @rules:	Number of rules in field
+ * @dst:	Destination bitmap
+ * @mt:		Mapping table containing bit set specifiers
+ * @match_only:	Find a single bit and return, don't fill
+ *
+ * Iteration over set bits with __builtin_ctzl(): Daniel Lemire, public domain.
+ *
+ * For each bit set in map, select the bucket from mapping table with index
+ * corresponding to the position of the bit set. Use start bit and amount of
+ * bits specified in bucket to fill region in dst.
+ *
+ * Return: -1 on no match, bit position on 'match_only', 0 otherwise.
+ */
+static int pipapo_refill(unsigned long *map, int len, int rules,
+			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
+			 bool match_only)
+{
+	unsigned long bitset;
+	int k, ret = -1;
+
+	for (k = 0; k < len; k++) {
+		bitset = map[k];
+		while (bitset) {
+			unsigned long t = bitset & -bitset;
+			int r = __builtin_ctzl(bitset);
+			int i = k * BITS_PER_LONG + r;
+
+			if (unlikely(i >= rules)) {
+				map[k] = 0;
+				return -1;
+			}
+
+			if (unlikely(match_only)) {
+				bitmap_clear(map, i, 1);
+				return i;
+			}
+
+			ret = 0;
+
+			bitmap_set(dst, mt[i].to, mt[i].n);
+
+			bitset ^= t;
+		}
+		map[k] = 0;
+	}
+
+	return ret;
+}
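The refill step can be tried in userspace too. This is a sketch under stated
assumptions, not the kernel function: it drops the 'match_only' case, uses a
simplified bucket with plain 'to' and 'n' members, and replaces bitmap_set()
with a naive loop. __builtin_ctzl() is a GCC/Clang builtin; all sk_* names
are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define SK_BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))

/* Simplified mapping bucket: rules to ... to + n - 1 in the next field */
struct sk_map_bucket {
	unsigned int to;
	unsigned int n;
};

/* Naive replacement for bitmap_set(): set n bits starting from 'start' */
static void sk_bitmap_set(unsigned long *dst, unsigned int start,
			  unsigned int n)
{
	for (unsigned int b = start; b < start + n; b++)
		dst[b / SK_BITS_PER_LONG] |= 1UL << (b % SK_BITS_PER_LONG);
}

/* Return 0 if any rule was mapped to dst, -1 otherwise; map is cleared */
static int sk_refill(unsigned long *map, int len, int rules,
		     unsigned long *dst, const struct sk_map_bucket *mt)
{
	int ret = -1;

	for (int k = 0; k < len; k++) {
		unsigned long bitset = map[k];

		while (bitset) {
			unsigned long t = bitset & -bitset; /* lowest set bit */
			int i = k * SK_BITS_PER_LONG + __builtin_ctzl(bitset);

			if (i >= rules) {
				map[k] = 0;
				return -1;
			}

			sk_bitmap_set(dst, mt[i].to, mt[i].n);
			ret = 0;

			bitset ^= t;	/* clear the bit just handled */
		}
		map[k] = 0;
	}

	return ret;
}
```

With the mapping from the matching example (rules 0, 1, 2 mapping to rules
0, 1, 1 in the next field) and a source bitmap of 0x2, the destination
bitmap ends up as 0x2 and the source bitmap is left all-zeroes, ready to be
reused.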
+
+/**
+ * nft_pipapo_lookup() - Lookup function
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @key:	Key data to be matched against set elements
+ * @ext:	nftables API extension pointer, filled with matching reference
+ *
+ * For more details, see DOC: Theory of Operation.
+ *
+ * Return: true on match, false otherwise.
+ */
+static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+			      const u32 *key, const struct nft_set_ext **ext)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	unsigned long *res_map, *fill_map;
+	u8 genmask = nft_genmask_cur(net);
+	const u8 *rp = (const u8 *)key;
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	bool map_index;
+	int i;
+
+	local_bh_disable();
+
+	map_index = raw_cpu_read(nft_pipapo_scratch_index);
+
+	m = rcu_dereference(priv->match);
+
+	if (unlikely(!m || !*raw_cpu_ptr(m->scratch)))
+		goto out;
+
+	res_map  = *raw_cpu_ptr(m->scratch) + (map_index ? m->bsize_max : 0);
+	fill_map = *raw_cpu_ptr(m->scratch) + (map_index ? 0 : m->bsize_max);
+
+	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
+
+	nft_pipapo_for_each_field(f, i, m) {
+		bool last = i == m->field_count - 1;
+		unsigned long *lt = f->lt;
+		int b, group;
+
+		/* For each 4-bit group: select lookup table bucket depending on
+		 * packet bytes value, then AND bucket value
+		 */
+		for (group = 0; group < f->groups; group++) {
+			u8 v;
+
+			if (group % 2) {
+				v = *rp & 0x0f;
+				rp++;
+			} else {
+				v = *rp >> 4;
+			}
+			__bitmap_and(res_map, res_map, lt + v * f->bsize,
+				     f->bsize * BITS_PER_LONG);
+
+			lt += f->bsize * NFT_PIPAPO_BUCKETS;
+		}
+
+		/* Now populate the bitmap for the next field, unless this is
+		 * the last field, in which case return the matched 'ext'
+		 * pointer if any.
+		 *
+		 * Now res_map contains the matching bitmap, and fill_map is the
+		 * bitmap for the next field.
+		 */
+next_match:
+		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
+				  last);
+		if (b < 0) {
+			raw_cpu_write(nft_pipapo_scratch_index, map_index);
+			local_bh_enable();
+
+			return false;
+		}
+
+		if (last) {
+			*ext = &f->mt[b].e->ext;
+			if (unlikely(nft_set_elem_expired(*ext) ||
+				     !nft_set_elem_active(*ext, genmask)))
+				goto next_match;
+
+			/* Last field: we're just returning the key without
+			 * filling the initial bitmap for the next field, so the
+			 * current inactive bitmap is clean and can be reused as
+			 * *next* bitmap (not initial) for the next packet.
+			 */
+			raw_cpu_write(nft_pipapo_scratch_index, map_index);
+			local_bh_enable();
+
+			return true;
+		}
+
+		/* Swap bitmap indices: res_map is the initial bitmap for the
+		 * next field, and fill_map is guaranteed to be all-zeroes at
+		 * this point.
+		 */
+		map_index = !map_index;
+		swap(res_map, fill_map);
+
+		rp += NFT_PIPAPO_GROUPS_PADDING(f->groups);
+	}
+
+out:
+	local_bh_enable();
+	return false;
+}
+
+/**
+ * pipapo_get() - Get matching start or end element reference given key data
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @data:	Key data to be matched against existing elements
+ * @flags:	If NFT_SET_ELEM_INTERVAL_END is passed, return the end element
+ *
+ * This is essentially the same as the lookup function, except that it matches
+ * key data against the uncommitted copy and doesn't use preallocated maps for
+ * bitmap results.
+ *
+ * Return: pointer to &struct nft_pipapo_elem on match, error pointer otherwise.
+ */
+static void *pipapo_get(const struct net *net, const struct nft_set *set,
+			const u8 *data, unsigned int flags)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m = priv->clone;
+	unsigned long *res_map, *fill_map = NULL;
+	void *ret = ERR_PTR(-ENOENT);
+	struct nft_pipapo_field *f;
+	int i;
+
+	res_map = kmalloc_array(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
+	if (!res_map) {
+		ret = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	fill_map = kcalloc(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
+	if (!fill_map) {
+		ret = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
+
+	nft_pipapo_for_each_field(f, i, m) {
+		bool last = i == m->field_count - 1;
+		unsigned long *lt = f->lt;
+		int b, group;
+
+		/* For each 4-bit group: select lookup table bucket depending on
+		 * packet bytes value, then AND bucket value
+		 */
+		for (group = 0; group < f->groups; group++) {
+			u8 v;
+
+			if (group % 2) {
+				v = *data & 0x0f;
+				data++;
+			} else {
+				v = *data >> 4;
+			}
+			__bitmap_and(res_map, res_map, lt + v * f->bsize,
+				     f->bsize * BITS_PER_LONG);
+
+			lt += f->bsize * NFT_PIPAPO_BUCKETS;
+		}
+
+		/* Now populate the bitmap for the next field, unless this is
+		 * the last field, in which case return the matched element
+		 * pointer, if any.
+		 *
+		 * Now res_map contains the matching bitmap, and fill_map is the
+		 * bitmap for the next field.
+		 */
+next_match:
+		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
+				  last);
+		if (b < 0)
+			goto out;
+
+		if (last) {
+			if (nft_set_elem_expired(&f->mt[b].e->ext))
+				goto next_match;
+
+			if (flags & NFT_SET_ELEM_INTERVAL_END)
+				ret = f->mt[b].e;
+			else
+				ret = f->mt[b].e->start;
+			goto out;
+		}
+
+		data += NFT_PIPAPO_GROUPS_PADDING(f->groups);
+
+		/* Swap bitmap indices: fill_map will be the initial bitmap for
+		 * the next field (i.e. the new res_map), and res_map is
+		 * guaranteed to be all-zeroes at this point, ready to be filled
+		 * according to the next mapping table.
+		 */
+		swap(res_map, fill_map);
+	}
+
+out:
+	kfree(fill_map);
+	kfree(res_map);
+	return ret;
+}
+
+/**
+ * nft_pipapo_get() - Get matching element reference given key data
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ * @flags:	If NFT_SET_ELEM_INTERVAL_END is passed, return the end element
+ *
+ * Return: pointer to &struct nft_pipapo_elem on match, error pointer otherwise.
+ */
+static void *nft_pipapo_get(const struct net *net, const struct nft_set *set,
+			    const struct nft_set_elem *elem, unsigned int flags)
+{
+	return pipapo_get(net, set, (const u8 *)elem->key.val.data, flags);
+}
+
+/**
+ * pipapo_resize() - Resize lookup or mapping table, or both
+ * @f:		Field containing lookup and mapping tables
+ * @old_rules:	Previous amount of rules in field
+ * @rules:	New amount of rules
+ *
+ * Increase, decrease or maintain tables size depending on new amount of rules,
+ * and copy data over. In case the new size is smaller, throw away data for
+ * highest-numbered rules.
+ *
+ * Return: 0 on success, -ENOMEM on allocation failure.
+ */
+static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
+{
+	long *new_lt = NULL, *new_p, *old_lt = f->lt, *old_p;
+	union nft_pipapo_map_bucket *new_mt, *old_mt = f->mt;
+	size_t new_bucket_size, copy;
+	int group, bucket;
+
+	new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG);
+
+	if (new_bucket_size == f->bsize)
+		goto mt;
+
+	if (new_bucket_size > f->bsize)
+		copy = f->bsize;
+	else
+		copy = new_bucket_size;
+
+	new_lt = kvzalloc(f->groups * NFT_PIPAPO_BUCKETS * new_bucket_size *
+			  sizeof(*new_lt), GFP_KERNEL);
+	if (!new_lt)
+		return -ENOMEM;
+
+	new_p = new_lt;
+	old_p = old_lt;
+	for (group = 0; group < f->groups; group++) {
+		for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS; bucket++) {
+			memcpy(new_p, old_p, copy * sizeof(*new_p));
+			new_p += copy;
+			old_p += copy;
+
+			if (new_bucket_size > f->bsize)
+				new_p += new_bucket_size - f->bsize;
+			else
+				old_p += f->bsize - new_bucket_size;
+		}
+	}
+
+mt:
+	new_mt = kvmalloc(rules * sizeof(*new_mt), GFP_KERNEL);
+	if (!new_mt) {
+		kvfree(new_lt);
+		return -ENOMEM;
+	}
+
+	memcpy(new_mt, f->mt, min(old_rules, rules) * sizeof(*new_mt));
+	if (rules > old_rules) {
+		memset(new_mt + old_rules, 0,
+		       (rules - old_rules) * sizeof(*new_mt));
+	}
+
+	if (new_lt) {
+		f->bsize = new_bucket_size;
+		f->lt = new_lt;
+		kvfree(old_lt);
+	}
+
+	f->mt = new_mt;
+	kvfree(old_mt);
+
+	return 0;
+}
+
+/**
+ * pipapo_bucket_set() - Set rule bit in bucket given group and group value
+ * @f:		Field containing lookup table
+ * @rule:	Rule index
+ * @group:	Group index
+ * @v:		Value of bit group
+ */
+static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group,
+			      int v)
+{
+	unsigned long *pos;
+
+	pos = f->lt + f->bsize * NFT_PIPAPO_BUCKETS * group;
+	pos += f->bsize * v;
+
+	__set_bit(rule, pos);
+}
+
+/**
+ * pipapo_insert() - Insert new rule in field given input key and mask length
+ * @f:		Field containing lookup table
+ * @k:		Input key for classification, without nftables padding
+ * @mask_bits:	Length of mask; matches field length for non-ranged entry
+ *
+ * Insert a new rule reference in lookup buckets corresponding to k and
+ * mask_bits.
+ *
+ * Return: 1 on success (one rule inserted), negative error code on failure.
+ */
+static int pipapo_insert(struct nft_pipapo_field *f, const uint8_t *k,
+			 int mask_bits)
+{
+	int rule = f->rules++, group, ret;
+
+	ret = pipapo_resize(f, f->rules - 1, f->rules);
+	if (ret)
+		return ret;
+
+	for (group = 0; group < f->groups; group++) {
+		int i, v;
+		u8 mask;
+
+		if (group % 2)
+			v = k[group / 2] & 0x0f;
+		else
+			v = k[group / 2] >> 4;
+
+		if (mask_bits >= (group + 1) * 4) {
+			/* Not masked */
+			pipapo_bucket_set(f, rule, group, v);
+		} else if (mask_bits <= group * 4) {
+			/* Completely masked */
+			for (i = 0; i < NFT_PIPAPO_BUCKETS; i++)
+				pipapo_bucket_set(f, rule, group, i);
+		} else {
+			/* The mask limit falls on this group */
+			mask = 0x0f >> (mask_bits - group * 4);
+			for (i = 0; i < NFT_PIPAPO_BUCKETS; i++) {
+				if ((i & ~mask) == (v & ~mask))
+					pipapo_bucket_set(f, rule, group, i);
+			}
+		}
+	}
+
+	return 1;
+}
+
+/**
+ * pipapo_step_diff() - Check if setting @step bit in netmask would change it
+ * @base:	Mask we are expanding
+ * @step:	Step bit for given expansion step
+ * @len:	Total length of mask space (set and unset bits), bytes
+ *
+ * Convenience function for mask expansion.
+ *
+ * Return: true if step bit changes mask (i.e. isn't set), false otherwise.
+ */
+static bool pipapo_step_diff(u8 *base, int step, int len)
+{
+	/* Network order, byte-addressed */
+#ifdef __BIG_ENDIAN__
+	return !(BIT(step % BITS_PER_BYTE) & base[step / BITS_PER_BYTE]);
+#else
+	return !(BIT(step % BITS_PER_BYTE) &
+		 base[len - 1 - step / BITS_PER_BYTE]);
+#endif
+}
+
+/**
+ * pipapo_step_after_end() - Check if mask exceeds range end with given step
+ * @base:	Mask we are expanding
+ * @end:	End of range
+ * @step:	Step bit for given expansion step, highest bit to be set
+ * @len:	Total length of mask space (set and unset bits), bytes
+ *
+ * Convenience function for mask expansion.
+ *
+ * Return: true if mask exceeds range setting step bits, false otherwise.
+ */
+static bool pipapo_step_after_end(const u8 *base, const u8 *end, int step,
+				  int len)
+{
+	u8 tmp[NFT_PIPAPO_MAX_BYTES];
+	int i;
+
+	memcpy(tmp, base, len);
+
+	/* Network order, byte-addressed */
+	for (i = 0; i <= step; i++)
+#ifdef __BIG_ENDIAN__
+		tmp[i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE);
+#else
+		tmp[len - 1 - i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE);
+#endif
+
+	return memcmp(tmp, end, len) > 0;
+}
+
+/**
+ * pipapo_base_sum() - Sum step bit to given len-sized netmask base with carry
+ * @base:	Netmask base
+ * @step:	Step bit to sum
+ * @len:	Netmask length, bytes
+ */
+static void pipapo_base_sum(u8 *base, int step, int len)
+{
+	bool carry = false;
+	int i;
+
+	/* Network order, byte-addressed */
+#ifdef __BIG_ENDIAN__
+	for (i = step / BITS_PER_BYTE; i < len; i++) {
+#else
+	for (i = len - 1 - step / BITS_PER_BYTE; i >= 0; i--) {
+#endif
+		if (carry)
+			base[i]++;
+		else
+			base[i] += 1 << (step % BITS_PER_BYTE);
+
+		if (base[i])
+			break;
+
+		carry = true;
+	}
+}
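To see what the sum with carry does to a network-order value, here is a
userspace sketch. It is not the function above: it propagates a conventional
carry instead of checking for a byte wrapping to zero, and it only covers
the layout where the last byte is the least significant one. sk_base_sum()
is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Add (1 << step) to a network-order value of 'len' bytes, with carry */
static void sk_base_sum(uint8_t *base, int step, int len)
{
	unsigned int add = 1u << (step % 8);
	int i = len - 1 - step / 8;

	assert(step >= 0 && step < len * 8);

	for (; i >= 0 && add; i--) {
		unsigned int sum = base[i] + add;

		base[i] = (uint8_t)sum;
		add = sum >> 8;	/* carry into next more significant byte */
	}
}
```

For instance, adding step bit 8 to 192.168.1.0 gives 192.168.2.0, which is
how the expansion moves from the end of one netmask to the base of the next.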
+
+/**
+ * pipapo_expand() - Expand range to composing netmasks and insert into lookup table
+ * @f:		Field containing lookup table
+ * @start:	Start of range
+ * @end:	End of range
+ * @len:	Length of value in bits
+ *
+ * Expand range to composing netmasks and insert corresponding rule references
+ * in lookup buckets.
+ *
+ * Return: number of inserted rules on success, negative error code on failure.
+ */
+static int pipapo_expand(struct nft_pipapo_field *f,
+			 const u8 *start, const u8 *end, int len)
+{
+	int step, masks = 0, bytes = DIV_ROUND_UP(len, BITS_PER_BYTE);
+	u8 base[NFT_PIPAPO_MAX_BYTES];
+
+	memcpy(base, start, bytes);
+	while (memcmp(base, end, bytes) <= 0) {
+		int err;
+
+		step = 0;
+		while (pipapo_step_diff(base, step, bytes)) {
+			if (pipapo_step_after_end(base, end, step, bytes))
+				break;
+
+			step++;
+			if (step >= len) {
+				if (!masks) {
+					err = pipapo_insert(f, base, 0);
+					if (err < 0)
+						return err;
+					masks = 1;
+				}
+				goto out;
+			}
+		}
+
+		err = pipapo_insert(f, base, len - step);
+
+		if (err < 0)
+			return err;
+
+		masks++;
+		pipapo_base_sum(base, step, bytes);
+	}
+out:
+	return masks;
+}
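The expansion of a range into netmasks is easier to follow on plain
integers. The sketch below is an assumption-laden userspace rendition, not
the function above: it handles values up to 32 bits held in a uint32_t
instead of byte strings, and collects (base, mask length) pairs in an array
instead of inserting rules. struct sk_prefix and sk_expand() are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct sk_prefix {
	uint32_t base;
	int mask_bits;
};

/* Expand [start, end] of 'len'-bit values into prefixes. The caller
 * provides room for 2 * len entries. Return: number of prefixes.
 */
static int sk_expand(uint32_t start, uint32_t end, int len,
		     struct sk_prefix *out)
{
	uint64_t base = start, limit = end;	/* 64 bits: no wrap-around */
	int n = 0;

	assert(len > 0 && len <= 32 && start <= end);

	while (base <= limit) {
		int step = 0;

		/* Widen the block as long as the step bit is not set in
		 * base and setting bits 0..step doesn't exceed the end
		 */
		while (step < len && !(base & (1ULL << step)) &&
		       (base | ((2ULL << step) - 1)) <= limit)
			step++;

		out[n].base = (uint32_t)base;
		out[n].mask_bits = len - step;
		n++;

		base += 1ULL << step;
	}

	return n;
}
```

Applied to 192.168.1.0-192.168.2.1 this yields two prefixes, 192.168.1.0/24
and 192.168.2.0/31, which correspond to rules 1 and 2 of the lookup table
used in the documentation examples.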
+
+/**
+ * pipapo_map() - Insert rules in mapping tables, mapping them between fields
+ * @m:		Matching data, including mapping table
+ * @map:	Table of rule maps: array of first rule and amount of rules
+ *		in next field a given rule maps to, for each field
+ * @e:		For last field, element pointer matching rules map to
+ */
+static void pipapo_map(struct nft_pipapo_match *m,
+		       union nft_pipapo_map_bucket map[NFT_PIPAPO_MAX_FIELDS],
+		       struct nft_pipapo_elem *e)
+{
+	struct nft_pipapo_field *f;
+	int i, j;
+
+	for (i = 0, f = m->f; i < m->field_count - 1; i++, f++) {
+		for (j = 0; j < map[i].n; j++) {
+			f->mt[map[i].to + j].to = map[i + 1].to;
+			f->mt[map[i].to + j].n = map[i + 1].n;
+		}
+	}
+
+	/* Last field: map to element instead of mapping to next field */
+	for (j = 0; j < map[i].n; j++)
+		f->mt[map[i].to + j].e = e;
+}
+
+/**
+ * pipapo_realloc_scratch() - Reallocate scratch maps for partial match results
+ * @clone:	Copy of matching data with pending insertions and deletions
+ * @bsize_max:	Maximum bucket size, scratch maps cover two buckets
+ *
+ * Return: 0 on success, -ENOMEM on failure.
+ */
+static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
+				  unsigned long bsize_max)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		unsigned long *scratch;
+
+		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2,
+				       GFP_KERNEL, cpu_to_node(i));
+		if (!scratch) {
+			/* On failure, there's no need to undo previous
+			 * allocations: this means that some scratch maps have
+			 * a bigger allocated size now (this is only called on
+			 * insertion), but the extra space won't be used by any
+			 * CPU as new elements are not inserted and m->bsize_max
+			 * is not updated.
+			 */
+			return -ENOMEM;
+		}
+
+		kfree(*per_cpu_ptr(clone->scratch, i));
+
+		*per_cpu_ptr(clone->scratch, i) = scratch;
+	}
+
+	return 0;
+}
+
+/**
+ * nft_pipapo_insert() - Validate and insert ranged elements
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ * @ext2:	Filled with pointer to &struct nft_set_ext in inserted element
+ *
+ * In this set implementation, this function needs to be called twice, with
+ * start and end element (the latter flagged with NFT_SET_ELEM_INTERVAL_END in
+ * its extension flags), to obtain a valid entry insertion.
+ *
+ * Calls to this function are serialised with each other, so we can store
+ * element and key data on the first call with start element, and use it on the
+ * second call once we get the end element too.
+ *
+ * However, userspace could send a single NFT_SET_ELEM_INTERVAL_END element,
+ * without a start element, so we need to check for it explicitly before
+ * inserting an entry, lest we end up in nft_pipapo_walk() with an empty start
+ * element.
+ *
+ * Also, we need to make sure that the start element hasn't been deactivated or
+ * destroyed between the two calls to this function, otherwise we might link an
+ * invalid start item to the end item triggering the insertion. Clear
+ * priv->start_elem on any operation that might render it invalid.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int nft_pipapo_insert(const struct net *net, const struct nft_set *set,
+			     const struct nft_set_elem *elem,
+			     struct nft_set_ext **ext2)
+{
+	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
+	const u8 *data = (const u8 *)elem->key.val.data, *start, *end;
+	union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m = priv->clone;
+	struct nft_pipapo_elem *e = elem->priv;
+	struct nft_pipapo_field *f;
+	int i, bsize_max, err = 0;
+	void *dup;
+
+	dup = nft_pipapo_get(net, set, elem, 0);
+	if (PTR_ERR(dup) != -ENOENT) {
+		priv->start_elem = NULL;
+		if (IS_ERR(dup))
+			return PTR_ERR(dup);
+		*ext2 = &((struct nft_pipapo_elem *)dup)->ext;
+		return -EEXIST;
+	}
+
+	if (!nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) ||
+	    !(*nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END)) {
+		priv->start_elem = e;
+		*ext2 = &e->ext;
+		memcpy(priv->start_data, data, priv->width);
+		return 0;
+	}
+
+	if (!priv->start_elem)
+		return -EINVAL;
+
+	e->start = priv->start_elem;
+
+	/* Validate */
+	start = priv->start_data;
+	end = data;
+
+	nft_pipapo_for_each_field(f, i, m) {
+		if (f->rules >= (unsigned long)NFT_PIPAPO_RULE0_MAX)
+			return -ENOSPC;
+
+		if (memcmp(start, end,
+			   f->groups / NFT_PIPAPO_GROUPS_PER_BYTE) > 0)
+			return -EINVAL;
+
+		start += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+		end += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+	}
+
+	/* Insert */
+	priv->dirty = true;
+
+	bsize_max = m->bsize_max;
+
+	start = priv->start_data;
+	end = data;
+	nft_pipapo_for_each_field(f, i, m) {
+		int ret;
+
+		rulemap[i].to = f->rules;
+
+		ret = memcmp(start, end,
+			     f->groups / NFT_PIPAPO_GROUPS_PER_BYTE);
+		if (!ret) {
+			ret = pipapo_insert(f, start,
+					    f->groups * NFT_PIPAPO_GROUP_BITS);
+		} else {
+			ret = pipapo_expand(f, start, end,
+					    f->groups * NFT_PIPAPO_GROUP_BITS);
+		}
+
+		if (ret < 0)
+			return ret;
+
+		if (f->bsize > bsize_max)
+			bsize_max = f->bsize;
+
+		rulemap[i].n = ret;
+
+		start += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+		end += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+	}
+
+	if (!*this_cpu_ptr(m->scratch) || bsize_max > m->bsize_max) {
+		err = pipapo_realloc_scratch(m, bsize_max);
+		if (err)
+			return err;
+
+		this_cpu_write(nft_pipapo_scratch_index, false);
+
+		m->bsize_max = bsize_max;
+	}
+
+	*ext2 = &e->ext;
+
+	pipapo_map(m, rulemap, e);
+
+	return 0;
+}
+
+/**
+ * pipapo_clone() - Clone matching data to create new working copy
+ * @old:	Existing matching data
+ *
+ * Return: copy of matching data passed as 'old', error pointer on failure
+ */
+static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
+{
+	struct nft_pipapo_field *dst, *src;
+	struct nft_pipapo_match *new;
+	int i;
+
+	new = kmalloc(sizeof(*new) + sizeof(*dst) * old->field_count,
+		      GFP_KERNEL);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	new->field_count = old->field_count;
+	new->bsize_max = old->bsize_max;
+
+	new->scratch = alloc_percpu(*new->scratch);
+	if (!new->scratch)
+		goto out_scratch;
+
+	rcu_head_init(&new->rcu);
+
+	src = old->f;
+	dst = new->f;
+
+	for (i = 0; i < old->field_count; i++) {
+		memcpy(dst, src, offsetof(struct nft_pipapo_field, lt));
+
+		dst->lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS *
+				   src->bsize * sizeof(*dst->lt),
+				   GFP_KERNEL);
+		if (!dst->lt)
+			goto out_lt;
+
+		memcpy(dst->lt, src->lt,
+		       src->bsize * sizeof(*dst->lt) *
+		       src->groups * NFT_PIPAPO_BUCKETS);
+
+		dst->mt = kvmalloc(src->rules * sizeof(*src->mt), GFP_KERNEL);
+		if (!dst->mt)
+			goto out_mt;
+
+		memcpy(dst->mt, src->mt, src->rules * sizeof(*src->mt));
+		src++;
+		dst++;
+	}
+
+	return new;
+
+out_mt:
+	kvfree(dst->lt);
+out_lt:
+	for (dst--; i > 0; i--) {
+		kvfree(dst->mt);
+		kvfree(dst->lt);
+		dst--;
+	}
+	free_percpu(new->scratch);
+out_scratch:
+	kfree(new);
+
+	return ERR_PTR(-ENOMEM);
+}
+
+/**
+ * pipapo_rules_same_key() - Get number of rules originated from the same entry
+ * @f:		Field containing mapping table
+ * @first:	Index of first rule in set of rules mapping to same entry
+ *
+ * Using the fact that all rules in a field that originated from the same entry
+ * will map to the same set of rules in the next field, or to the same element
+ * reference, return the cardinality of the set of rules that originated from
+ * the same entry as the rule with index @first, @first rule included.
+ *
+ * In pictures:
+ *				rules
+ *	field #0		0    1    2    3    4
+ *		map to:		0    1   2-4  2-4  5-9
+ *				.    .    .......   . ...
+ *				|    |    |    | \   \
+ *				|    |    |    |  \   \
+ *				|    |    |    |   \   \
+ *				'    '    '    '    '   \
+ *	in field #1		0    1    2    3    4    5 ...
+ *
+ * if this is called for rule 2 on field #0, it will return 2, as rules 2 and 3
+ * in field #0 both map to the same set of rules (2, 3, 4) in the next field.
+ *
+ * For the last field in a set, we can rely on associated entries to map to the
+ * same element references.
+ *
+ * Return: Number of rules that originated from the same entry as @first.
+ */
+static int pipapo_rules_same_key(struct nft_pipapo_field *f, int first)
+{
+	struct nft_pipapo_elem *e = NULL; /* Keep gcc happy */
+	int r;
+
+	for (r = first; r < f->rules; r++) {
+		if (r != first && e != f->mt[r].e)
+			return r - first;
+
+		e = f->mt[r].e;
+	}
+
+	if (r != first)
+		return r - first;
+
+	return 0;
+}
+
+/**
+ * pipapo_unmap() - Remove rules from mapping tables, renumber remaining ones
+ * @mt:		Mapping array
+ * @rules:	Original amount of rules in mapping table
+ * @start:	First rule index to be removed
+ * @n:		Amount of rules to be removed
+ * @to_offset:	Amount of rules, in next field, this group of rules maps to
+ * @is_last:	If this is the last field, delete reference from mapping array
+ *
+ * This is used to unmap rules from the mapping table for a single field,
+ * maintaining consistency and compactness for the existing ones.
+ *
+ * In pictures: let's assume that we want to delete rules 2 and 3 from the
+ * following mapping array:
+ *
+ *                 rules
+ *               0      1      2      3      4
+ *      map to:  4-10   4-10   11-15  11-15  16-18
+ *
+ * the result will be:
+ *
+ *                 rules
+ *               0      1      2
+ *      map to:  4-10   4-10   11-13
+ *
+ * for fields before the last one. In case this is the mapping table for the
+ * last field in a set, and rules map to pointers to &struct nft_pipapo_elem:
+ *
+ *                      rules
+ *                        0      1      2      3      4
+ *  element pointers:  0x42   0x42   0x33   0x33   0x44
+ *
+ * the result will be:
+ *
+ *                      rules
+ *                        0      1      2
+ *  element pointers:  0x42   0x42   0x44
+ */
+static void pipapo_unmap(union nft_pipapo_map_bucket *mt, int rules,
+			 int start, int n, int to_offset, bool is_last)
+{
+	int i;
+
+	memmove(mt + start, mt + start + n, (rules - start - n) * sizeof(*mt));
+	memset(mt + rules - n, 0, n * sizeof(*mt));
+
+	if (is_last)
+		return;
+
+	for (i = start; i < rules - n; i++)
+		mt[i].to -= to_offset;
+}
+
+/**
+ * pipapo_drop() - Delete entry from lookup and mapping tables, given rule map
+ * @m:		Matching data
+ * @rulemap:	Table of rule maps, arrays of first rule and amount of rules
+ *		in next field a given entry maps to, for each field
+ *
+ * For each rule in lookup table buckets mapping to this set of rules, drop
+ * all bits set in lookup table mapping. In pictures, assuming we want to drop
+ * rules 0 and 1 from this lookup table:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        1   1,2                                      0
+ *        2    0                                      1,2
+ *        3    0                              1,2
+ *        4  0,1,2
+ *        5    0   1   2
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *
+ * rule 2 becomes rule 0, and the result will be:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0                                                    0
+ *        1    0
+ *        2                                            0
+ *        3                                    0
+ *        4    0
+ *        5            0
+ *        6    0
+ *        7    0   0
+ *
+ * once this is done, call pipapo_unmap() to drop all the corresponding rule
+ * references from mapping tables.
+ */
+static void pipapo_drop(struct nft_pipapo_match *m,
+			union nft_pipapo_map_bucket rulemap[])
+{
+	struct nft_pipapo_field *f;
+	int i;
+
+	nft_pipapo_for_each_field(f, i, m) {
+		int g;
+
+		for (g = 0; g < f->groups; g++) {
+			unsigned long *pos;
+			int b;
+
+			pos = f->lt + g * NFT_PIPAPO_BUCKETS * f->bsize;
+
+			for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
+				bitmap_cut(pos, pos, rulemap[i].to,
+					   rulemap[i].n,
+					   f->bsize * BITS_PER_LONG);
+
+				pos += f->bsize;
+			}
+		}
+
+		pipapo_unmap(f->mt, f->rules, rulemap[i].to, rulemap[i].n,
+			     rulemap[i + 1].n, i == m->field_count - 1);
+		if (pipapo_resize(f, f->rules, f->rules - rulemap[i].n)) {
+			/* We can ignore this, a failure to shrink tables down
+			 * doesn't make tables invalid.
+			 */
+			;
+		}
+		f->rules -= rulemap[i].n;
+	}
+}
+
+/**
+ * pipapo_gc() - Drop expired entries from set, destroy start and end elements
+ * @set:	nftables API set representation
+ * @m:		Matching data
+ */
+static void pipapo_gc(const struct nft_set *set, struct nft_pipapo_match *m)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	int rules_f0, first_rule = 0;
+
+	while ((rules_f0 = pipapo_rules_same_key(m->f, first_rule))) {
+		union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
+		struct nft_pipapo_field *f;
+		struct nft_pipapo_elem *e;
+		int i, start, rules_fx;
+
+		start = first_rule;
+		rules_fx = rules_f0;
+
+		nft_pipapo_for_each_field(f, i, m) {
+			rulemap[i].to = start;
+			rulemap[i].n = rules_fx;
+
+			if (i < m->field_count - 1) {
+				rules_fx = f->mt[start].n;
+				start = f->mt[start].to;
+			}
+		}
+
+		/* Pick the last field, and its last index */
+		f--;
+		i--;
+		e = f->mt[rulemap[i].to].e;
+		if (nft_set_elem_expired(&e->ext) &&
+		    !nft_set_elem_mark_busy(&e->ext)) {
+			priv->dirty = true;
+			pipapo_drop(m, rulemap);
+
+			rcu_barrier();
+			nft_set_elem_destroy(set, e->start, true);
+			nft_set_elem_destroy(set, e, true);
+
+			/* And check again current first rule, which is now the
+			 * first we haven't checked.
+			 */
+		} else {
+			first_rule += rules_f0;
+		}
+	}
+
+	priv->last_gc = jiffies;
+}
+
+/**
+ * pipapo_free_fields() - Free per-field tables contained in matching data
+ * @m:		Matching data
+ */
+static void pipapo_free_fields(struct nft_pipapo_match *m)
+{
+	struct nft_pipapo_field *f;
+	int i;
+
+	nft_pipapo_for_each_field(f, i, m) {
+		kvfree(f->lt);
+		kvfree(f->mt);
+	}
+}
+
+/**
+ * pipapo_reclaim_match() - RCU callback to free fields from old matching data
+ * @rcu:	RCU head
+ */
+static void pipapo_reclaim_match(struct rcu_head *rcu)
+{
+	struct nft_pipapo_match *m;
+	int i;
+
+	m = container_of(rcu, struct nft_pipapo_match, rcu);
+
+	for_each_possible_cpu(i)
+		kfree(*per_cpu_ptr(m->scratch, i));
+
+	free_percpu(m->scratch);
+
+	pipapo_free_fields(m);
+
+	kfree(m);
+}
+
+/**
+ * pipapo_commit() - Replace lookup data with current working copy
+ * @set:	nftables API set representation
+ *
+ * While at it, check if we should perform garbage collection on the working
+ * copy before committing it for lookup, and don't replace the table if the
+ * working copy doesn't have pending changes.
+ *
+ * We also need to create a new working copy for subsequent insertions and
+ * deletions.
+ */
+static void pipapo_commit(const struct nft_set *set)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *new_clone, *old;
+
+	if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set)))
+		pipapo_gc(set, priv->clone);
+
+	if (!priv->dirty)
+		return;
+
+	new_clone = pipapo_clone(priv->clone);
+	if (IS_ERR(new_clone))
+		return;
+
+	priv->dirty = false;
+
+	old = rcu_access_pointer(priv->match);
+	rcu_assign_pointer(priv->match, priv->clone);
+	if (old)
+		call_rcu(&old->rcu, pipapo_reclaim_match);
+
+	priv->clone = new_clone;
+}
+
+/**
+ * nft_pipapo_activate() - Mark element reference as active given key, commit
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * On insertion, elements are added to a copy of the matching data currently
+ * in use for lookups, and not directly inserted into current lookup data, so
+ * we'll take care of that by calling pipapo_commit() here. This is probably as
+ * close as we can get to an actual atomic transaction: both nft_pipapo_insert()
+ * and nft_pipapo_activate() are called once for each element, hence we can't
+ * repurpose either one as a real commit operation.
+ */
+static void nft_pipapo_activate(const struct net *net,
+				const struct nft_set *set,
+				const struct nft_set_elem *elem)
+{
+	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
+	struct nft_pipapo_elem *e;
+
+	if (!nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) ||
+	    !(*nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END)) {
+		e = pipapo_get(net, set, (const u8 *)elem->key.val.data, 0);
+		if (IS_ERR(e))
+			return;
+
+		nft_set_elem_change_active(net, set, &e->ext);
+		nft_set_elem_clear_busy(&e->ext);
+
+		return;
+	}
+
+	e = pipapo_get(net, set, (const u8 *)elem->key.val.data,
+		       NFT_SET_ELEM_INTERVAL_END);
+	if (IS_ERR(e))
+		return;
+
+	nft_set_elem_change_active(net, set, &e->ext);
+	nft_set_elem_clear_busy(&e->ext);
+
+	pipapo_commit(set);
+}
+
+/**
+ * pipapo_deactivate() - Check that element is in set, mark as inactive
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @data:	Input key data
+ * @ext:	nftables API extension pointer, used to check for end element
+ *
+ * This is a convenience function that can be called from both
+ * nft_pipapo_deactivate() and nft_pipapo_flush(), as they are in fact the same
+ * operation.
+ *
+ * Return: deactivated element if found, NULL otherwise.
+ */
+static void *pipapo_deactivate(const struct net *net, const struct nft_set *set,
+			       const u8 *data, const struct nft_set_ext *ext)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	u8 genmask = nft_genmask_next(net);
+	struct nft_pipapo_elem *e;
+	unsigned int flags = 0;
+
+	/* See nft_pipapo_insert() */
+	priv->start_elem = NULL;
+
+	if (nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS))
+		flags = *nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END;
+
+	e = pipapo_get(net, set, data, flags);
+	if (IS_ERR(e))
+		return NULL;
+
+	if (nft_set_elem_active(&e->ext, genmask))
+		nft_set_elem_change_active(net, set, &e->ext);
+
+	return e;
+}
+
+/**
+ * nft_pipapo_deactivate() - Call pipapo_deactivate() to make element inactive
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * Return: deactivated element if found, NULL otherwise.
+ */
+static void *nft_pipapo_deactivate(const struct net *net,
+				   const struct nft_set *set,
+				   const struct nft_set_elem *elem)
+{
+	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
+
+	return pipapo_deactivate(net, set, (const u8 *)elem->key.val.data, ext);
+}
+
+/**
+ * nft_pipapo_flush() - Call pipapo_deactivate() to make element inactive
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * This is functionally the same as nft_pipapo_deactivate(), with a slightly
+ * different interface, and it's also called once for each element in a set
+ * being flushed, so we can't implement an atomic flush operation, which would
+ * otherwise be as simple as allocating an empty copy of the matching data.
+ *
+ * Note that we could in theory do that, mark the set as flushed, and ignore
+ * subsequent calls, but we would leak all the elements after the first one,
+ * because they wouldn't then be freed as result of API calls.
+ *
+ * Return: true if element was found and deactivated.
+ */
+static bool nft_pipapo_flush(const struct net *net, const struct nft_set *set,
+			     void *elem)
+{
+	struct nft_pipapo_elem *e = elem;
+
+	return pipapo_deactivate(net, set, (const u8 *)nft_set_ext_key(&e->ext),
+				 &e->ext);
+}
+
+/**
+ * pipapo_get_boundaries() - Get byte interval for associated rules
+ * @f:		Field including lookup table
+ * @first_rule:	First rule (lowest index)
+ * @rule_count:	Number of associated rules
+ * @left:	Byte expression for left boundary (start of range)
+ * @right:	Byte expression for right boundary (end of range)
+ *
+ * Given the first rule and amount of rules that originated from the same entry,
+ * build the original range associated with the entry, and calculate the length
+ * of the originating netmask.
+ *
+ * In pictures:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0                                                   1,2
+ *        1   1,2
+ *        2                                           1,2
+ *        3                                   1,2
+ *        4   1,2
+ *        5        1   2
+ *        6   1,2  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *
+ * this is the lookup table corresponding to the IPv4 range
+ * 192.168.1.0-192.168.2.1, which was expanded to the two composing netmasks,
+ * rule #1: 192.168.1.0/24, and rule #2: 192.168.2.0/31.
+ *
+ * This function fills @left and @right with the byte values of the leftmost
+ * and rightmost bucket indices for the lowest and highest rule indices,
+ * respectively. If @first_rule is 1 and @rule_count is 2, we obtain, in
+ * nibbles:
+ *   left:  < 12, 0, 10, 8, 0, 1, 0, 0 >
+ *   right: < 12, 0, 10, 8, 0, 2, 0, 1 >
+ * corresponding to bytes:
+ *   left:  < 192, 168, 1, 0 >
+ *   right: < 192, 168, 2, 1 >
+ * with mask length irrelevant here, unused on return, as the range is already
+ * defined by its start and end points. The mask length is relevant for a single
+ * ranged entry instead: if @first_rule is 1 and @rule_count is 1, we ignore
+ * rule 2 above: @left becomes < 192, 168, 1, 0 >, @right becomes
+ * < 192, 168, 1, 255 >, and the mask length, calculated from the distances
+ * between leftmost and rightmost bucket indices for each group, would be 24.
+ *
+ * Return: mask length, in bits.
+ */
+static int pipapo_get_boundaries(struct nft_pipapo_field *f, int first_rule,
+				 int rule_count, u8 *left, u8 *right)
+{
+	u8 *l = left, *r = right;
+	int g, mask_len = 0;
+
+	for (g = 0; g < f->groups; g++) {
+		int b, x0, x1;
+
+		x0 = -1;
+		x1 = -1;
+		for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
+			unsigned long *pos;
+
+			pos = f->lt + (g * NFT_PIPAPO_BUCKETS + b) * f->bsize;
+			if (test_bit(first_rule, pos) && x0 == -1)
+				x0 = b;
+			if (test_bit(first_rule + rule_count - 1, pos))
+				x1 = b;
+		}
+
+		if (g % 2) {
+			*(l++) |= x0 & 0x0f;
+			*(r++) |= x1 & 0x0f;
+		} else {
+			*l |= x0 << 4;
+			*r |= x1 << 4;
+		}
+
+		if (x1 - x0 == 0)
+			mask_len += 4;
+		else if (x1 - x0 == 1)
+			mask_len += 3;
+		else if (x1 - x0 == 3)
+			mask_len += 2;
+		else if (x1 - x0 == 7)
+			mask_len += 1;
+	}
+
+	return mask_len;
+}
+
+/**
+ * pipapo_match_field() - Match rules against byte ranges
+ * @f:		Field including the lookup table
+ * @first_rule:	First of associated rules originating from same entry
+ * @rule_count:	Amount of associated rules
+ * @start:	Start of range to be matched
+ * @end:	End of range to be matched
+ *
+ * Return: true on match, false otherwise.
+ */
+static bool pipapo_match_field(struct nft_pipapo_field *f,
+			       int first_rule, int rule_count,
+			       const u8 *start, const u8 *end)
+{
+	u8 right[NFT_PIPAPO_MAX_BYTES] = { 0 };
+	u8 left[NFT_PIPAPO_MAX_BYTES] = { 0 };
+
+	pipapo_get_boundaries(f, first_rule, rule_count, left, right);
+
+	return !memcmp(start, left, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE) &&
+	       !memcmp(end, right, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE);
+}
+
+/**
+ * nft_pipapo_remove() - Remove element given key, commit
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * Similarly to nft_pipapo_activate(), this is used as commit operation by the
+ * API, but it's called once per element in the pending transaction, so we can't
+ * implement an actual, atomic commit operation. Closest we can get is to remove
+ * the matched element here, if any, and commit the updated matching data.
+ */
+static void nft_pipapo_remove(const struct net *net, const struct nft_set *set,
+			      const struct nft_set_elem *elem)
+{
+	const u8 *data = (const u8 *)elem->key.val.data;
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m = priv->clone;
+	const struct nft_set_ext *ext;
+	int rules_f0, first_rule = 0;
+	struct nft_pipapo_elem *e;
+
+	/* See nft_pipapo_insert() */
+	priv->start_elem = NULL;
+
+	ext = nft_set_elem_ext(set, elem->priv);
+	if (!nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) ||
+	    !(*nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END))
+		return;
+
+	e = pipapo_get(net, set, data, NFT_SET_ELEM_INTERVAL_END);
+	if (IS_ERR(e))
+		return;
+
+	while ((rules_f0 = pipapo_rules_same_key(m->f, first_rule))) {
+		union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
+		const u8 *match_start, *match_end;
+		struct nft_pipapo_field *f;
+		int i, start, rules_fx;
+
+		match_start = (const u8 *)nft_set_ext_key(&e->start->ext);
+		match_end = data;
+
+		start = first_rule;
+		rules_fx = rules_f0;
+
+		nft_pipapo_for_each_field(f, i, m) {
+			if (!pipapo_match_field(f, start, rules_fx,
+						match_start, match_end))
+				break;
+
+			rulemap[i].to = start;
+			rulemap[i].n = rules_fx;
+
+			rules_fx = f->mt[start].n;
+			start = f->mt[start].to;
+
+			match_start += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+			match_end += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+		}
+
+		if (i == m->field_count) {
+			priv->dirty = true;
+			pipapo_drop(m, rulemap);
+			pipapo_commit(set);
+			return;
+		}
+
+		first_rule += rules_f0;
+	}
+}
+
+/**
+ * nft_pipapo_walk() - Walk over elements
+ * @ctx:	nftables API context
+ * @set:	nftables API set representation
+ * @iter:	Iterator
+ *
+ * As elements are referenced in the mapping array for the last field, directly
+ * scan that array: there's no need to follow rule mappings from the first
+ * field.
+ *
+ * Note that the iterator callback is invoked twice for each entry, as it is
+ * represented as start and end elements.
+ */
+static void nft_pipapo_walk(const struct nft_ctx *ctx, struct nft_set *set,
+			    struct nft_set_iter *iter)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	int i, r;
+
+	rcu_read_lock();
+	m = rcu_dereference(priv->match);
+
+	if (unlikely(!m))
+		goto out;
+
+	for (i = 0, f = m->f; i < m->field_count - 1; i++, f++)
+		;
+
+	for (r = 0; r < f->rules; r++) {
+		struct nft_set_elem elem_start, elem_end;
+		struct nft_pipapo_elem *e;
+
+		if (r < f->rules - 1 && f->mt[r + 1].e == f->mt[r].e)
+			continue;
+
+		if (iter->count < iter->skip)
+			goto cont;
+
+		e = f->mt[r].e;
+		if (nft_set_elem_expired(&e->ext))
+			goto cont;
+
+		elem_start.priv = e->start;
+
+		iter->err = iter->fn(ctx, set, iter, &elem_start);
+		if (iter->err < 0)
+			goto out;
+
+		elem_end.priv = e;
+
+		iter->err = iter->fn(ctx, set, iter, &elem_end);
+		if (iter->err < 0)
+			goto out;
+
+cont:
+		iter->count += 2;
+	}
+
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * nft_pipapo_privsize() - Return the size of private data for the set
+ * @nla:	netlink attributes, ignored as size doesn't depend on them
+ * @desc:	Set description, ignored as size doesn't depend on it
+ *
+ * Return: size of private data for this set implementation, in bytes
+ */
+static u64 nft_pipapo_privsize(const struct nlattr * const nla[],
+			       const struct nft_set_desc *desc)
+{
+	return sizeof(struct nft_pipapo);
+}
+
+/**
+ * nft_pipapo_estimate() - Estimate set size, space and lookup complexity
+ * @desc:	Set description, initial element count used here
+ * @features:	Flags: NFT_SET_SUBKEY needs to be there
+ * @est:	Storage for estimation data
+ *
+ * The size for this set type can vary dramatically, as it depends on the number
+ * of rules (composing netmasks) the entries expand to. We compute the worst
+ * case here, in order to ensure that other types are used if concatenation of
+ * ranges is not needed.
+ *
+ * In general, for a non-ranged entry or a single composing netmask, we need
+ * one bit in each of the sixteen NFT_PIPAPO_BUCKETS, for each 4-bit group (that
+ * is, each input bit needs four bits of matching data), plus a bucket in the
+ * mapping table for each field.
+ *
+ * Return: true if @features includes NFT_SET_SUBKEY, false otherwise.
+ */
+static bool nft_pipapo_estimate(const struct nft_set_desc *desc, u32 features,
+				struct nft_set_estimate *est)
+{
+	if (!(features & NFT_SET_SUBKEY))
+		return false;
+
+	est->size = sizeof(struct nft_pipapo) + sizeof(struct nft_pipapo_match);
+
+	/* Worst-case with current amount of 32-bit VM registers (16 of them):
+	 * - 2 IPv6 addresses	8 registers
+	 * - 2 interface names	8 registers
+	 * that is, four 128-bit fields:
+	 */
+	est->size += sizeof(struct nft_pipapo_field) * 4;
+
+	/* expanding to worst-case ranges, 128 * 2 rules each, resulting in:
+	 * - 128 4-bit groups
+	 * - each set entry taking 256 bits in each bucket
+	 */
+	est->size += desc->size * NFT_PIPAPO_MAX_BITS / NFT_PIPAPO_GROUP_BITS *
+		     NFT_PIPAPO_BUCKETS * NFT_PIPAPO_MAX_BITS * 2 /
+		     BITS_PER_BYTE;
+
+	/* and we need mapping buckets, too */
+	est->size += desc->size * NFT_PIPAPO_MAP_NBITS *
+		     sizeof(union nft_pipapo_map_bucket);
+
+	est->lookup = NFT_SET_CLASS_O_LOG_N;
+
+	est->space = NFT_SET_CLASS_O_N;
+
+	return true;
+}
+
+/**
+ * nft_pipapo_init() - Initialise data for a set instance
+ * @set:	nftables API set representation
+ * @desc:	Set description
+ * @nla:	netlink attributes
+ *
+ * Validate number and size of fields passed as NFTA_SET_SUBKEY netlink
+ * attributes, initialise internal set parameters, current instance of matching
+ * data and a copy for subsequent insertions.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int nft_pipapo_init(const struct nft_set *set,
+			   const struct nft_set_desc *desc,
+			   const struct nlattr * const nla[])
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	int rem, err = -EINVAL, field_count = 0, i;
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	struct nlattr *attr;
+	unsigned int klen;
+
+	if (!nla || !nla[NFTA_SET_SUBKEY])
+		return -EINVAL;
+
+	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
+		if (++field_count >= NFT_PIPAPO_MAX_FIELDS)
+			return -EINVAL;
+
+		if (nla_len(attr) != sizeof(klen) ||
+		    nla_type(attr) != NFTA_SET_SUBKEY_LEN)
+			return -EINVAL;
+	}
+
+	if (!field_count)
+		return -EINVAL;
+
+	m = kmalloc(sizeof(*priv->match) + sizeof(*f) * field_count,
+		    GFP_KERNEL);
+	if (!m)
+		return -ENOMEM;
+
+	m->field_count = field_count;
+	m->bsize_max = 0;
+
+	m->scratch = alloc_percpu(unsigned long *);
+	if (!m->scratch) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(m->scratch, i) = NULL;
+
+	rcu_head_init(&m->rcu);
+
+	f = m->f;
+	priv->width = 0;
+	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
+		klen = ntohl(nla_get_be32(attr));
+		if (!klen || klen % NFT_PIPAPO_GROUP_BITS)
+			goto out_free;
+
+		if (klen > NFT_PIPAPO_MAX_BITS)
+			goto out_free;
+
+		priv->groups += f->groups = klen / NFT_PIPAPO_GROUP_BITS;
+		priv->width += round_up(klen / BITS_PER_BYTE, sizeof(u32));
+
+		f->bsize = 0;
+		f->rules = 0;
+		f->lt = NULL;
+		f->mt = NULL;
+
+		f++;
+	}
+
+	/* Create an initial clone of matching data for next insertion */
+	priv->clone = pipapo_clone(m);
+	if (IS_ERR(priv->clone)) {
+		err = PTR_ERR(priv->clone);
+		goto out_free;
+	}
+
+	priv->dirty = false;
+
+	rcu_assign_pointer(priv->match, m);
+
+	return 0;
+
+out_free:
+	free_percpu(m->scratch);
+	kfree(m);
+
+	return err;
+}
+
+/**
+ * nft_pipapo_destroy() - Free private data for set and all committed elements
+ * @set:	nftables API set representation
+ */
+static void nft_pipapo_destroy(const struct nft_set *set)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	int i, r, cpu;
+
+	/* See nft_pipapo_insert() */
+	priv->start_elem = NULL;
+
+	m = rcu_dereference_protected(priv->match, true);
+	if (m) {
+		rcu_barrier();
+
+		for (i = 0, f = m->f; i < m->field_count - 1; i++, f++)
+			;
+
+		for (r = 0; r < f->rules; r++) {
+			struct nft_pipapo_elem *e;
+
+			if (r < f->rules - 1 && f->mt[r + 1].e == f->mt[r].e)
+				continue;
+
+			e = f->mt[r].e;
+
+			nft_set_elem_destroy(set, e->start, true);
+			nft_set_elem_destroy(set, e, true);
+		}
+
+		for_each_possible_cpu(cpu)
+			kfree(*per_cpu_ptr(m->scratch, cpu));
+		free_percpu(m->scratch);
+
+		pipapo_free_fields(m);
+		kfree(m);
+		priv->match = NULL;
+	}
+
+	if (priv->clone) {
+		for_each_possible_cpu(cpu)
+			kfree(*per_cpu_ptr(priv->clone->scratch, cpu));
+		free_percpu(priv->clone->scratch);
+
+		pipapo_free_fields(priv->clone);
+		kfree(priv->clone);
+		priv->clone = NULL;
+	}
+}
+
+/**
+ * nft_pipapo_gc_init() - Initialise garbage collection
+ * @set:	nftables API set representation
+ *
+ * Instead of actually setting up a periodic work for garbage collection, as
+ * this operation requires a swap of matching data with the working copy, we'll
+ * do that opportunistically with other commit operations if the interval has
+ * elapsed, so we just need to set the current jiffies timestamp here.
+ */
+static void nft_pipapo_gc_init(const struct nft_set *set)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+
+	priv->last_gc = jiffies;
+}
+
+struct nft_set_type nft_set_pipapo_type __read_mostly = {
+	.owner		= THIS_MODULE,
+	.features	= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT |
+			  NFT_SET_TIMEOUT | NFT_SET_SUBKEY,
+	.ops		= {
+		.lookup		= nft_pipapo_lookup,
+		.insert		= nft_pipapo_insert,
+		.activate	= nft_pipapo_activate,
+		.deactivate	= nft_pipapo_deactivate,
+		.flush		= nft_pipapo_flush,
+		.remove		= nft_pipapo_remove,
+		.walk		= nft_pipapo_walk,
+		.get		= nft_pipapo_get,
+		.privsize	= nft_pipapo_privsize,
+		.estimate	= nft_pipapo_estimate,
+		.init		= nft_pipapo_init,
+		.destroy	= nft_pipapo_destroy,
+		.gc_init	= nft_pipapo_gc_init,
+		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
+	},
+};
-- 
2.20.1
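
The boundary reconstruction described in the pipapo_get_boundaries()
comment above can be tried out in user space. What follows is only a
minimal sketch under stated assumptions, not the code from the patch:
lookup_table, add_prefix_rule() and get_boundaries() are hypothetical
names, each bucket is a single 32-bit rule bitmap rather than an array of
f->bsize longs, and add_prefix_rule() is an assumed, simplified way to
populate the buckets for one composing netmask of an IPv4 field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUP_BITS	4
#define BUCKETS		(1 << GROUP_BITS)
#define GROUPS		8	/* one IPv4 address: 32 bits / 4 */

/* lt[g][b]: bitmap of rules matching nibble value b in group g (assumed
 * layout, one 32-bit word per bucket instead of the kernel's long arrays)
 */
typedef uint32_t lookup_table[GROUPS][BUCKETS];

/* Set the bit for @rule in every bucket that prefix @addr/@plen can match */
static void add_prefix_rule(lookup_table lt, int rule, uint32_t addr, int plen)
{
	for (int g = 0; g < GROUPS; g++) {
		int nibble = (addr >> (28 - GROUP_BITS * g)) & 0xf;
		int fixed = plen - GROUP_BITS * g; /* prefix bits in group */
		int shift;

		if (fixed < 0)
			fixed = 0;
		if (fixed > GROUP_BITS)
			fixed = GROUP_BITS;
		shift = GROUP_BITS - fixed;

		for (int b = 0; b < BUCKETS; b++)
			if ((b >> shift) == (nibble >> shift))
				lt[g][b] |= UINT32_C(1) << rule;
	}
}

/* Leftmost bucket of the first rule and rightmost bucket of the last rule
 * give one nibble each of @left and @right. Returns the mask length.
 */
static int get_boundaries(lookup_table lt, int first_rule, int rule_count,
			  uint8_t *left, uint8_t *right)
{
	int last_rule = first_rule + rule_count - 1;
	int mask_len = 0;

	memset(left, 0, GROUPS / 2);
	memset(right, 0, GROUPS / 2);

	for (int g = 0; g < GROUPS; g++) {
		int x0 = -1, x1 = -1;

		for (int b = 0; b < BUCKETS; b++) {
			if (x0 == -1 && ((lt[g][b] >> first_rule) & 1))
				x0 = b;
			if ((lt[g][b] >> last_rule) & 1)
				x1 = b;
		}
		assert(x0 >= 0 && x1 >= 0); /* both rules must be in the table */

		/* Even groups hold the high nibble of a byte, odd the low */
		left[g / 2] |= (g % 2) ? x0 : x0 << 4;
		right[g / 2] |= (g % 2) ? x1 : x1 << 4;

		/* A span of 2^k buckets leaves 4 - k bits fixed in this group */
		switch (x1 - x0) {
		case 0: mask_len += 4; break;
		case 1: mask_len += 3; break;
		case 3: mask_len += 2; break;
		case 7: mask_len += 1; break;
		}
	}

	return mask_len;
}
```

With a zeroed table populated by add_prefix_rule(lt, 1, 0xc0a80100, 24)
and add_prefix_rule(lt, 2, 0xc0a80200, 31), calling
get_boundaries(lt, 1, 2, left, right) fills left with 192.168.1.0 and
right with 192.168.2.1, and get_boundaries(lt, 1, 1, left, right) returns
24 with right set to 192.168.1.255, matching the example in the comment.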



* [PATCH nf-next v2 4/8] selftests: netfilter: Introduce tests for sets with range concatenation
From: Stefano Brivio @ 2019-11-22 13:40 UTC
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

This test covers functionality and stability of the newly added
nftables set implementation supporting concatenation of ranged
fields.

For some selected set expression types, test:
- correctness, by checking that packets match or don't
- concurrency, by attempting races between insertion, deletion, lookup
- timeout feature, checking that packets don't match expired entries

and (roughly) estimate matching rates, comparing them to baselines for a
simple drop on the netdev ingress hook and for hash and rbtree sets.

In order to send packets, this needs one of sendip, netcat or bash.
To flood with traffic, iperf3, iperf and netperf are supported. For
performance measurements, this relies on the sample pktgen script
pktgen_bench_xmit_mode_netif_receive.sh.

If none of the tools suitable for a given test is available, that test
will be skipped.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: No changes

 tools/testing/selftests/netfilter/Makefile    |    3 +-
 .../selftests/netfilter/nft_concat_range.sh   | 1481 +++++++++++++++++
 2 files changed, 1483 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/netfilter/nft_concat_range.sh

diff --git a/tools/testing/selftests/netfilter/Makefile b/tools/testing/selftests/netfilter/Makefile
index de1032b5ddea..08194aa44006 100644
--- a/tools/testing/selftests/netfilter/Makefile
+++ b/tools/testing/selftests/netfilter/Makefile
@@ -2,6 +2,7 @@
 # Makefile for netfilter selftests
 
 TEST_PROGS := nft_trans_stress.sh nft_nat.sh bridge_brouter.sh \
-	conntrack_icmp_related.sh nft_flowtable.sh ipvs.sh
+	conntrack_icmp_related.sh nft_flowtable.sh ipvs.sh \
+	nft_concat_range.sh
 
 include ../lib.mk
diff --git a/tools/testing/selftests/netfilter/nft_concat_range.sh b/tools/testing/selftests/netfilter/nft_concat_range.sh
new file mode 100755
index 000000000000..aca21dde102a
--- /dev/null
+++ b/tools/testing/selftests/netfilter/nft_concat_range.sh
@@ -0,0 +1,1481 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# nft_concat_range.sh - Tests for sets with concatenation of ranged fields
+#
+# Copyright (c) 2019 Red Hat GmbH
+#
+# Author: Stefano Brivio <sbrivio@redhat.com>
+#
+# shellcheck disable=SC2154,SC2034,SC2016,SC2030,SC2031
+# ^ Configuration and templates sourced with eval, counters reused in subshells
+
+KSELFTEST_SKIP=4
+
+# Available test groups:
+# - correctness: check that packets match given entries, and only those
+# - concurrency: attempt races between insertion, deletion and lookup
+# - timeout: check that packets match entries until they expire
+# - performance: estimate matching rate, compare with rbtree and hash baselines
+TESTS="correctness concurrency timeout"
+[ "${quicktest}" != "1" ] && TESTS="${TESTS} performance"
+
+# Set types, defined by TYPE_ variables below
+TYPES="net_port port_net net6_port port_proto net6_port_mac net6_port_mac_proto
+       net_port_net net_mac net_mac_icmp net6_mac_icmp net6_port_net6_port
+       net_port_mac_proto_net"
+
+# List of possible paths to pktgen script from kernel tree for performance tests
+PKTGEN_SCRIPT_PATHS="
+	../../../samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
+	pktgen/pktgen_bench_xmit_mode_netif_receive.sh"
+
+# Definition of set types:
+# display	display text for test report
+# type_spec	nftables set type specifier
+# chain_spec	nftables type specifier for rules mapping to set
+# dst		call sequence of format_*() functions for destination fields
+# src		call sequence of format_*() functions for source fields
+# start		initial integer used to generate addresses and ports
+# count		count of entries to generate and match
+# src_delta	number added to destination generator for source fields
+# tools		list of tools for correctness and timeout tests, any can be used
+# proto		L4 protocol of test packets
+#
+# race_repeat	race attempts per thread, 0 disables concurrency test for type
+# flood_tools	list of tools for concurrency tests, any can be used
+# flood_proto	L4 protocol of test packets for concurrency tests
+# flood_spec	nftables type specifier for concurrency tests
+#
+# perf_duration	duration of single pktgen injection test
+# perf_spec	nftables type specifier for performance tests
+# perf_dst	format_*() functions for destination fields in performance test
+# perf_src	format_*() functions for source fields in performance test
+# perf_entries	number of set entries for performance test
+# perf_proto	L3 protocol of test packets
+TYPE_net_port="
+display		net,port
+type_spec	ipv4_addr . inet_service
+chain_spec	ip daddr . udp dport
+dst		addr4 port
+src		 
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	udp
+flood_spec	ip daddr . udp dport
+
+perf_duration	5
+perf_spec	ip daddr . udp dport
+perf_dst	addr4 port
+perf_src	 
+perf_entries	1000
+perf_proto	ipv4
+"
+
+TYPE_port_net="
+display		port,net
+type_spec	inet_service . ipv4_addr
+chain_spec	udp dport . ip daddr
+dst		port addr4
+src		 
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	udp
+flood_spec	udp dport . ip daddr
+
+perf_duration	5
+perf_spec	udp dport . ip daddr
+perf_dst	port addr4
+perf_src	 
+perf_entries	100
+perf_proto	ipv4
+"
+
+TYPE_net6_port="
+display		net6,port
+type_spec	ipv6_addr . inet_service
+chain_spec	ip6 daddr . udp dport
+dst		addr6 port
+src		 
+start		10
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp6
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp6
+flood_spec	ip6 daddr . udp dport
+
+perf_duration	5
+perf_spec	ip6 daddr . udp dport
+perf_dst	addr6 port
+perf_src	 
+perf_entries	1000
+perf_proto	ipv6
+"
+
+TYPE_port_proto="
+display		port,proto
+type_spec	inet_service . inet_proto
+chain_spec	udp dport . meta l4proto
+dst		port proto
+src		 
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	udp dport . meta l4proto
+perf_dst	port proto
+perf_src	 
+perf_entries	30000
+perf_proto	ipv4
+"
+
+TYPE_net6_port_mac="
+display		net6,port,mac
+type_spec	ipv6_addr . inet_service . ether_addr
+chain_spec	ip6 daddr . udp dport . ether saddr
+dst		addr6 port
+src		mac
+start		10
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp6
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	ip6 daddr . udp dport . ether daddr
+perf_dst	addr6 port mac
+perf_src	 
+perf_entries	10
+perf_proto	ipv6
+"
+
+TYPE_net6_port_mac_proto="
+display		net6,port,mac,proto
+type_spec	ipv6_addr . inet_service . ether_addr . inet_proto
+chain_spec	ip6 daddr . udp dport . ether saddr . meta l4proto
+dst		addr6 port
+src		mac proto
+start		10
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp6
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	ip6 daddr . udp dport . ether daddr . meta l4proto
+perf_dst	addr6 port mac proto
+perf_src	 
+perf_entries	1000
+perf_proto	ipv6
+"
+
+TYPE_net_port_net="
+display		net,port,net
+type_spec	ipv4_addr . inet_service . ipv4_addr
+chain_spec	ip daddr . udp dport . ip saddr
+dst		addr4 port
+src		addr4
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp
+flood_spec	ip daddr . udp dport . ip saddr
+
+perf_duration	0
+"
+
+TYPE_net6_port_net6_port="
+display		net6,port,net6,port
+type_spec	ipv6_addr . inet_service . ipv6_addr . inet_service
+chain_spec	ip6 daddr . udp dport . ip6 saddr . udp sport
+dst		addr6 port
+src		addr6 port
+start		10
+count		5
+src_delta	2000
+tools		sendip nc
+proto		udp6
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp6
+flood_spec	ip6 daddr . tcp dport . ip6 saddr . tcp sport
+
+perf_duration	0
+"
+
+TYPE_net_port_mac_proto_net="
+display		net,port,mac,proto,net
+type_spec	ipv4_addr . inet_service . ether_addr . inet_proto . ipv4_addr
+chain_spec	ip daddr . udp dport . ether saddr . meta l4proto . ip saddr
+dst		addr4 port
+src		mac proto addr4
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	0
+
+perf_duration	0
+"
+
+TYPE_net_mac="
+display		net,mac
+type_spec	ipv4_addr . ether_addr
+chain_spec	ip daddr . ether saddr
+dst		addr4
+src		mac
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	ip daddr . ether daddr
+perf_dst	addr4 mac
+perf_src	 
+perf_entries	1000
+perf_proto	ipv4
+"
+
+TYPE_net_mac_icmp="
+display		net,mac - ICMP
+type_spec	ipv4_addr . ether_addr
+chain_spec	ip daddr . ether saddr
+dst		addr4
+src		mac
+start		1
+count		5
+src_delta	2000
+tools		ping
+proto		icmp
+
+race_repeat	0
+
+perf_duration	0
+"
+
+TYPE_net6_mac_icmp="
+display		net6,mac - ICMPv6
+type_spec	ipv6_addr . ether_addr
+chain_spec	ip6 daddr . ether saddr
+dst		addr6
+src		mac
+start		10
+count		50
+src_delta	2000
+tools		ping
+proto		icmp6
+
+race_repeat	0
+
+perf_duration	0
+"
+
+TYPE_net_port_proto_net="
+display		net,port,proto,net
+type_spec	ipv4_addr . inet_service . inet_proto . ipv4_addr
+chain_spec	ip daddr . udp dport . meta l4proto . ip saddr
+dst		addr4 port proto
+src		addr4
+start		1
+count		5
+src_delta	2000
+tools		sendip nc
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp
+flood_spec	ip daddr . tcp dport . meta l4proto . ip saddr
+
+perf_duration	0
+"
+
+# Set template for all tests, types and rules are filled in depending on test
+set_template='
+flush ruleset
+
+table inet filter {
+	counter test {
+		packets 0 bytes 0
+	}
+
+	set test {
+		type ${type_spec}
+		flags interval,timeout
+	}
+
+	chain input {
+		type filter hook prerouting priority 0; policy accept;
+		${chain_spec} @test counter name \"test\"
+	}
+}
+
+table netdev perf {
+	counter test {
+		packets 0 bytes 0
+	}
+
+	counter match {
+		packets 0 bytes 0
+	}
+
+	set test {
+		type ${type_spec}
+		flags interval
+	}
+
+	set norange {
+		type ${type_spec}
+	}
+
+	set noconcat {
+		type ${type_spec%% *}
+		flags interval
+	}
+
+	chain test {
+		type filter hook ingress device veth_a priority 0;
+	}
+}
+'
+
+err_buf=
+info_buf=
+
+# Append string to error buffer
+err() {
+	err_buf="${err_buf}${1}
+"
+}
+
+# Append string to information buffer
+info() {
+	info_buf="${info_buf}${1}
+"
+}
+
+# Flush error buffer to stdout
+err_flush() {
+	printf "%s" "${err_buf}"
+	err_buf=
+}
+
+# Flush information buffer to stdout
+info_flush() {
+	printf "%s" "${info_buf}"
+	info_buf=
+}
+
+# Set up veth pair: this namespace receives traffic, B generates it
+setup_veth() {
+	ip netns add B
+	ip link add veth_a type veth peer name veth_b || return 1
+
+	ip link set veth_a up
+	ip link set veth_b netns B
+
+	ip -n B link set veth_b up
+
+	ip addr add dev veth_a 10.0.0.1
+	ip route add default dev veth_a
+
+	ip -6 addr add fe80::1/64 dev veth_a nodad
+	ip -6 addr add 2001:db8::1/64 dev veth_a nodad
+	ip -6 route add default dev veth_a
+
+	ip -n B route add default dev veth_b
+
+	ip -6 -n B addr add fe80::2/64 dev veth_b nodad
+	ip -6 -n B addr add 2001:db8::2/64 dev veth_b nodad
+	ip -6 -n B route add default dev veth_b
+
+	B() {
+		ip netns exec B "$@" >/dev/null 2>&1
+	}
+
+	sleep 2
+}
+
+# Fill in set template and initialise set
+setup_set() {
+	eval "echo \"${set_template}\"" | nft -f -
+}
+
+# Check that at least one of the needed tools is available
+check_tools() {
+	__tools=
+	for tool in ${tools}; do
+		if [ "${tool}" = "nc" ] && [ "${proto}" = "udp6" ] && \
+		   ! nc -u -w0 1.1.1.1 1 2>/dev/null; then
+			# Some GNU netcat builds might not support IPv6
+			__tools="${__tools} netcat-openbsd"
+			continue
+		fi
+		__tools="${__tools} ${tool}"
+
+		command -v "${tool}" >/dev/null && return 0
+	done
+	err "need one of:${__tools}, skipping" && return 1
+}
+
+# Set up function to send ICMP packets
+setup_send_icmp() {
+	send_icmp() {
+		B ping -c1 -W1 "${dst_addr4}" >/dev/null 2>&1
+	}
+}
+
+# Set up function to send ICMPv6 packets
+setup_send_icmp6() {
+	if command -v ping6 >/dev/null; then
+		send_icmp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			B ping6 -q -c1 -W1 "${dst_addr6}"
+		}
+	else
+		send_icmp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			B ping -q -6 -c1 -W1 "${dst_addr6}"
+		}
+	fi
+}
+
+# Set up function to send single UDP packets on IPv4
+setup_send_udp() {
+	if command -v sendip >/dev/null; then
+		send_udp() {
+			[ -n "${src_port}" ] && src_port="-us ${src_port}"
+			[ -n "${dst_port}" ] && dst_port="-ud ${dst_port}"
+			[ -n "${src_addr4}" ] && src_addr4="-is ${src_addr4}"
+
+			# shellcheck disable=SC2086 # sendip needs split options
+			B sendip -p ipv4 -p udp ${src_addr4} ${src_port} \
+						${dst_port} "${dst_addr4}"
+
+			src_port=
+			dst_port=
+			src_addr4=
+		}
+	elif command -v nc >/dev/null; then
+		if nc -u -w0 1.1.1.1 1 2>/dev/null; then
+			# OpenBSD netcat
+			nc_opt="-w0"
+		else
+			# GNU netcat
+			nc_opt="-q0"
+		fi
+
+		send_udp() {
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}" dev veth_b
+				__src_addr4="-s ${src_addr4}"
+			fi
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+			[ -n "${src_port}" ] && src_port="-p ${src_port}"
+
+			echo "" | B nc -u "${nc_opt}" "${__src_addr4}" \
+				  "${src_port}" "${dst_addr4}" "${dst_port}"
+
+			src_addr4=
+			src_port=
+		}
+	elif [ -z "$(bash -c 'type -p')" ]; then
+		send_udp() {
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				B ip route add default dev veth_b
+			fi
+
+			B bash -c "echo > /dev/udp/${dst_addr4}/${dst_port}"
+
+			if [ -n "${src_addr4}" ]; then
+				B ip addr del "${src_addr4}/16" dev veth_b
+			fi
+			src_addr4=
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send single UDP packets on IPv6
+setup_send_udp6() {
+	if command -v sendip >/dev/null; then
+		send_udp6() {
+			[ -n "${src_port}" ] && src_port="-us ${src_port}"
+			[ -n "${dst_port}" ] && dst_port="-ud ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				src_addr6="-6s ${src_addr6}"
+			else
+				src_addr6="-6s 2001:db8::2"
+			fi
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			B sendip -p ipv6 -p udp ${src_addr6} ${src_port} \
+						${dst_port} "${dst_addr6}"
+
+			src_port=
+			dst_port=
+			src_addr6=
+		}
+	elif command -v nc >/dev/null && nc -u -w0 1.1.1.1 1 2>/dev/null; then
+		# GNU netcat might not work with IPv6, try next tool
+		send_udp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+			else
+				src_addr6="2001:db8::2"
+			fi
+			[ -n "${src_port}" ] && src_port="-p ${src_port}"
+
+			# shellcheck disable=SC2086 # this needs split options
+			echo "" | B nc -u -w0 "-s${src_addr6}" ${src_port} \
+					       ${dst_addr6} ${dst_port}
+
+			src_addr6=
+			src_port=
+		}
+	elif [ -z "$(bash -c 'type -p')" ]; then
+		send_udp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			B ip addr add "${src_addr6}" dev veth_b nodad
+			B bash -c "echo > /dev/udp/${dst_addr6}/${dst_port}"
+			ip -6 addr del "${dst_addr6}" dev veth_a 2>/dev/null
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send TCP traffic on IPv4
+setup_flood_tcp() {
+	if command -v iperf3 >/dev/null; then
+		flood_tcp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_port="--cport ${src_port}"
+			fi
+			B ip route add default dev veth_b 2>/dev/null
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf3 -s -DB "${dst_addr4}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf3 -c "${dst_addr4}" ${dst_port} ${src_port} \
+				${src_addr4} -l16 -t 1000
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v iperf >/dev/null; then
+		flood_tcp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2 2>/dev/null
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_addr4="${src_addr4}:${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf -s -DB "${dst_addr4}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf -c "${dst_addr4}" ${dst_port} ${src_addr4} \
+				-l20 -t 1000
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v netperf >/dev/null; then
+		flood_tcp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				dst_port="${dst_port},${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			netserver -4 ${dst_port} -L "${dst_addr4}" \
+				>/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B netperf -4 -H "${dst_addr4}" ${dst_port} \
+				-L "${src_addr4}" -l 1000 -t TCP_STREAM
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send TCP traffic on IPv6
+setup_flood_tcp6() {
+	if command -v iperf3 >/dev/null; then
+		flood_tcp6() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+				src_addr6="-B ${src_addr6}"
+			else
+				src_addr6="-B 2001:db8::2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_port="--cport ${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf3 -s -DB "${dst_addr6}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf3 -c "${dst_addr6}" ${dst_port} \
+				${src_port} ${src_addr6} -l16 -t 1000
+
+			src_addr6=
+			src_port=
+			dst_port=
+		}
+	elif command -v iperf >/dev/null; then
+		flood_tcp6() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+				src_addr6="-B ${src_addr6}"
+			else
+				src_addr6="-B 2001:db8::2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_addr6="${src_addr6}:${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf -s -VDB "${dst_addr6}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf -c "${dst_addr6}" -V ${dst_port} \
+				${src_addr6} -l1 -t 1000
+
+			src_addr6=
+			src_port=
+			dst_port=
+		}
+	elif command -v netperf >/dev/null; then
+		flood_tcp6() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+			else
+				src_addr6="2001:db8::2"
+			fi
+			if [ -n "${src_port}" ]; then
+				dst_port="${dst_port},${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			netserver -6 ${dst_port} -L "${dst_addr6}" \
+				>/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B netperf -6 -H "${dst_addr6}" ${dst_port} \
+				-L "${src_addr6}" -l 1000 -t TCP_STREAM
+
+			src_addr6=
+			src_port=
+			dst_port=
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send UDP traffic on IPv4
+setup_flood_udp() {
+	if command -v iperf3 >/dev/null; then
+		flood_udp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2 2>/dev/null
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_port="--cport ${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf3 -s -DB "${dst_addr4}" ${dst_port}
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf3 -u -c "${dst_addr4}" -Z -b 100M -l16 -t1000 \
+				${dst_port} ${src_port} ${src_addr4}
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v iperf >/dev/null; then
+		flood_udp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_addr4="${src_addr4}:${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf -u -sDB "${dst_addr4}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf -u -c "${dst_addr4}" -b 100M -l1 -t1000 \
+				${dst_port} ${src_addr4}
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v netperf >/dev/null; then
+		flood_udp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				dst_port="${dst_port},${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			netserver -4 ${dst_port} -L "${dst_addr4}" \
+				>/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B netperf -4 -H "${dst_addr4}" ${dst_port} \
+				-L "${src_addr4}" -l 1000 -t UDP_STREAM
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	else
+		return 1
+	fi
+}
+
+# Find pktgen script and set up function to start pktgen injection
+setup_perf() {
+	for pktgen_script_path in ${PKTGEN_SCRIPT_PATHS} __notfound; do
+		command -v "${pktgen_script_path}" >/dev/null && break
+	done
+	[ "${pktgen_script_path}" = "__notfound" ] && return 1
+
+	perf_ipv4() {
+		${pktgen_script_path} -s80 \
+			-i veth_a -d "${dst_addr4}" -p "${dst_port}" \
+			-m "${dst_mac}" \
+			-t $(($(nproc) / 5 + 1)) -b10000 -n0 2>/dev/null &
+		perf_pid=$!
+	}
+	perf_ipv6() {
+		IP6=6 ${pktgen_script_path} -s100 \
+			-i veth_a -d "${dst_addr6}" -p "${dst_port}" \
+			-m "${dst_mac}" \
+			-t $(($(nproc) / 5 + 1)) -b10000 -n0 2>/dev/null &
+		perf_pid=$!
+	}
+}
+
+# Clean up before each test
+cleanup() {
+	nft reset counter inet filter test	>/dev/null 2>&1
+	nft flush ruleset			>/dev/null 2>&1
+	ip link del dummy0			2>/dev/null
+	ip route del default			2>/dev/null
+	ip -6 route del default			2>/dev/null
+	ip netns del B				2>/dev/null
+	ip link del veth_a			2>/dev/null
+	timeout=
+	killall iperf3				2>/dev/null
+	killall iperf				2>/dev/null
+	killall netperf				2>/dev/null
+	killall netserver			2>/dev/null
+	rm -f ${tmp}
+	sleep 2
+}
+
+# Entry point for setup functions
+setup() {
+	if [ "$(id -u)" -ne 0 ]; then
+		echo "  need to run as root"
+		exit ${KSELFTEST_SKIP}
+	fi
+
+	cleanup
+	check_tools || return 1
+	for arg do
+		if ! eval setup_"${arg}"; then
+			err "  ${arg} not supported"
+			return 1
+		fi
+	done
+}
+
+# Format integer into IPv4 address, adding 10.0.0.5 (arbitrary) to it
+format_addr4() {
+	a=$((${1} + 16777216 * 10 + 5))
+	printf "%i.%i.%i.%i"						\
+	       "$((a / 16777216))" "$((a % 16777216 / 65536))"	\
+	       "$((a % 65536 / 256))" "$((a % 256))"
+}
+
+# Format integer into IPv6 address, adding 2001:db8:: to it
+format_addr6() {
+	printf "2001:db8::%04x:%04x" "$((${1} / 65536))" "$((${1} % 65536))"
+}
+
+# Format integer into EUI-48 address, adding 00:01:00:00:00:00 to it
+format_mac() {
+	printf "00:01:%02x:%02x:%02x:%02x" \
+	       "$((${1} / 16777216))" "$((${1} % 16777216 / 65536))"	\
+	       "$((${1} % 65536 / 256))" "$((${1} % 256))"
+}
+
+# Format integer into port, avoid 0 port
+format_port() {
+	printf "%i" "$((${1} % 65534 + 1))"
+}
+
+# Drop suffixed '6' from L4 protocol, if any
+format_proto() {
+	printf "%s" "${proto}" | tr -d 6
+}
+
+# Format destination and source fields into nft concatenated type
+format() {
+	__start=
+	__end=
+	__expr="{ "
+
+	for f in ${dst}; do
+		[ "${__expr}" != "{ " ] && __expr="${__expr} . "
+
+		__start="$(eval format_"${f}" "${start}")"
+		__end="$(eval format_"${f}" "${end}")"
+
+		if [ "${f}" = "proto" ]; then
+			__expr="${__expr}${__start}"
+		else
+			__expr="${__expr}${__start}-${__end}"
+		fi
+	done
+	for f in ${src}; do
+		__expr="${__expr} . "
+		__start="$(eval format_"${f}" "${srcstart}")"
+		__end="$(eval format_"${f}" "${srcend}")"
+
+		if [ "${f}" = "proto" ]; then
+			__expr="${__expr}${__start}"
+		else
+			__expr="${__expr}${__start}-${__end}"
+		fi
+	done
+
+	if [ -n "${timeout}" ]; then
+		echo "${__expr} timeout ${timeout}s }"
+	else
+		echo "${__expr} }"
+	fi
+}
+
+# Format destination and source fields into nft type, start element only
+format_norange() {
+	__expr="{ "
+
+	for f in ${dst}; do
+		[ "${__expr}" != "{ " ] && __expr="${__expr} . "
+
+		__expr="${__expr}$(eval format_"${f}" "${start}")"
+	done
+	for f in ${src}; do
+		__expr="${__expr} . $(eval format_"${f}" "${start}")"
+	done
+
+	echo "${__expr} }"
+}
+
+# Format first destination field into nft type
+format_noconcat() {
+	for f in ${dst}; do
+		__start="$(eval format_"${f}" "${start}")"
+		__end="$(eval format_"${f}" "${end}")"
+
+		if [ "${f}" = "proto" ]; then
+			echo "{ ${__start} }"
+		else
+			echo "{ ${__start}-${__end} }"
+		fi
+		return
+	done
+}
+
+# Add single entry to 'test' set in 'inet filter' table
+add() {
+	if ! nft add element inet filter test "${1}"; then
+		err "Failed to add ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Format and output entries for sets in 'netdev perf' table
+add_perf() {
+	if [ "${1}" = "test" ]; then
+		echo "add element netdev perf test $(format)"
+	elif [ "${1}" = "norange" ]; then
+		echo "add element netdev perf norange $(format_norange)"
+	elif [ "${1}" = "noconcat" ]; then
+		echo "add element netdev perf noconcat $(format_noconcat)"
+	fi
+}
+
+# Add single entry to 'norange' set in 'netdev perf' table
+add_perf_norange() {
+	if ! nft add element netdev perf norange "${1}"; then
+		err "Failed to add ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Add single entry to 'noconcat' set in 'netdev perf' table
+add_perf_noconcat() {
+	if ! nft add element netdev perf noconcat "${1}"; then
+		err "Failed to add ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Delete single entry from set
+del() {
+	if ! nft delete element inet filter test "${1}"; then
+		err "Failed to delete ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Return packet count from 'test' counter in 'inet filter' table
+count_packets() {
+	found=0
+	for token in $(nft list counter inet filter test); do
+		[ ${found} -eq 1 ] && echo "${token}" && return
+		[ "${token}" = "packets" ] && found=1
+	done
+}
+
+# Return packet count from 'test' counter in 'netdev perf' table
+count_perf_packets() {
+	found=0
+	for token in $(nft list counter netdev perf test); do
+		[ ${found} -eq 1 ] && echo "${token}" && return
+		[ "${token}" = "packets" ] && found=1
+	done
+}
+
+# Set MAC addresses, send traffic according to specifier
+flood() {
+	ip link set veth_a address "$(format_mac "${1}")"
+	ip -n B link set veth_b address "$(format_mac "${2}")"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval flood_\$proto
+}
+
+# Set MAC addresses, start pktgen injection
+perf() {
+	dst_mac="$(format_mac "${1}")"
+	ip link set veth_a address "${dst_mac}"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval perf_\$perf_proto
+}
+
+# Set MAC addresses, send single packet, check that it matches, reset counter
+send_match() {
+	ip link set veth_a address "$(format_mac "${1}")"
+	ip -n B link set veth_b address "$(format_mac "${2}")"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval send_\$proto
+	if [ "$(count_packets)" != "1" ]; then
+		err "${proto} packet to:"
+		err "  $(for f in ${dst}; do
+			 eval format_\$f "${1}"; printf ' '; done)"
+		err "from:"
+		err "  $(for f in ${src}; do
+			 eval format_\$f "${2}"; printf ' '; done)"
+		err "should have matched ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+	nft reset counter inet filter test >/dev/null
+}
+
+# Set MAC addresses, send single packet, check that it doesn't match
+send_nomatch() {
+	ip link set veth_a address "$(format_mac "${1}")"
+	ip -n B link set veth_b address "$(format_mac "${2}")"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval send_\$proto
+	if [ "$(count_packets)" != "0" ]; then
+		err "${proto} packet to:"
+		err "  $(for f in ${dst}; do
+			 eval format_\$f "${1}"; printf ' '; done)"
+		err "from:"
+		err "  $(for f in ${src}; do
+			 eval format_\$f "${2}"; printf ' '; done)"
+		err "should not have matched ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Correctness test template:
+# - add ranged element, check that packets match it
+# - check that packets outside range don't match it
+# - remove some elements, check that packets don't match anymore
+test_correctness() {
+	setup veth send_"${proto}" set || return ${KSELFTEST_SKIP}
+
+	range_size=1
+	for i in $(seq "${start}" $((start + count))); do
+		end=$((start + range_size))
+
+		# Avoid negative or zero-sized port ranges
+		if [ $((end / 65534)) -gt $((start / 65534)) ]; then
+			start=${end}
+			end=$((end + 1))
+		fi
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		add "$(format)" || return 1
+		for j in $(seq ${start} $((range_size / 2 + 1)) ${end}); do
+			send_match "${j}" $((j + src_delta)) || return 1
+		done
+		send_nomatch $((end + 1)) $((end + 1 + src_delta)) || return 1
+
+		# Delete elements now and then
+		if [ $((i % 3)) -eq 0 ]; then
+			del "$(format)" || return 1
+			for j in $(seq ${start} \
+				   $((range_size / 2 + 1)) ${end}); do
+				send_nomatch "${j}" $((j + src_delta)) \
+					|| return 1
+			done
+		fi
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+}
+
+# Concurrency test template:
+# - add all the elements
+# - start a subshell for each available CPU (as reported by nproc) that:
+#   - adds all the elements
+#   - flushes the set
+#   - adds all the elements
+#   - flushes the entire ruleset
+#   - adds the set back
+#   - adds all the elements
+#   - deletes all the elements
+test_concurrency() {
+	proto=${flood_proto}
+	tools=${flood_tools}
+	chain_spec=${flood_spec}
+	setup veth flood_"${proto}" set || return ${KSELFTEST_SKIP}
+
+	range_size=1
+	cstart=${start}
+	flood_pids=
+	for i in $(seq ${start} $((start + count))); do
+		end=$((start + range_size))
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		add "$(format)" || return 1
+
+		flood "${i}" $((i + src_delta)) & flood_pids="${flood_pids} $!"
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+
+	sleep 10
+
+	pids=
+	for c in $(seq 1 "$(nproc)"); do (
+		for r in $(seq 1 "${race_repeat}"); do
+			range_size=1
+
+			# $start needs to be local to this subshell
+			# shellcheck disable=SC2030
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				add "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+
+			nft flush inet filter test 2>/dev/null
+
+			range_size=1
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				add "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+
+			nft flush ruleset
+			setup set 2>/dev/null
+
+			range_size=1
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				add "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+
+			range_size=1
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				del "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+		done
+	) & pids="${pids} $!"
+	done
+
+	# shellcheck disable=SC2046,SC2086 # word splitting wanted here
+	wait $(for pid in ${pids}; do echo ${pid}; done)
+	# shellcheck disable=SC2046,SC2086
+	kill $(for pid in ${flood_pids}; do echo ${pid}; done) 2>/dev/null
+	# shellcheck disable=SC2046,SC2086
+	wait $(for pid in ${flood_pids}; do echo ${pid}; done) 2>/dev/null
+
+	return 0
+}
+
+# Timeout test template:
+# - add all the elements with 3s timeout while checking that packets match
+# - wait 3s after the last insertion, check that packets don't match any entry
+test_timeout() {
+	setup veth send_"${proto}" set || return ${KSELFTEST_SKIP}
+
+	timeout=3
+	range_size=1
+	for i in $(seq "${start}" $((start + count))); do
+		end=$((start + range_size))
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		add "$(format)" || return 1
+
+		for j in $(seq ${start} $((range_size / 2 + 1)) ${end}); do
+			send_match "${j}" $((j + src_delta)) || return 1
+		done
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+	sleep 3
+	for i in $(seq ${start} $((start + count))); do
+		end=$((start + range_size))
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		for j in $(seq ${start} $((range_size / 2 + 1)) ${end}); do
+			send_nomatch "${j}" $((j + src_delta)) || return 1
+		done
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+}
+
+# Performance test template:
+# - add concatenated ranged entries
+# - add non-ranged concatenated entries (for hash set matching rate baseline)
+# - add ranged entries with first field only (for rbtree baseline)
+# - start pktgen injection directly on device rx path of this namespace
+# - measure drop-only rate, hash and rbtree baselines, then matching rate
+test_performance() {
+	chain_spec=${perf_spec}
+	dst="${perf_dst}"
+	src="${perf_src}"
+	setup veth perf set || return ${KSELFTEST_SKIP}
+
+	first=${start}
+	range_size=1
+	for set in test norange noconcat; do
+		start=${first}
+		for i in $(seq ${start} $((start + perf_entries))); do
+			end=$((start + range_size))
+			srcstart=$((start + src_delta))
+			srcend=$((end + src_delta))
+
+			if [ $((end / 65534)) -gt $((start / 65534)) ]; then
+				start=${end}
+				end=$((end + 1))
+			elif [ ${start} -eq ${end} ]; then
+				end=$((start + 1))
+			fi
+
+			add_perf ${set}
+
+			start=$((end + range_size))
+		done > "${tmp}"
+		nft -f "${tmp}"
+	done
+
+	perf $((end - 1)) ${srcstart}
+
+	sleep 2
+
+	nft add rule netdev perf test counter name \"test\" drop
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	info "    baseline (drop from netdev hook):            ${pps}pps"
+	handle="$(nft -a list chain netdev perf test | grep counter)"
+	handle="${handle##* }"
+	nft delete rule netdev perf test handle "${handle}"
+
+	nft add rule "netdev perf test ${chain_spec} @norange \
+		counter name \"test\" drop"
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	info "    baseline hash (non-ranged entries):          ${pps}pps"
+	handle="$(nft -a list chain netdev perf test | grep counter)"
+	handle="${handle##* }"
+	nft delete rule netdev perf test handle "${handle}"
+
+	nft add rule "netdev perf test ${chain_spec%%. *} @noconcat \
+		counter name \"test\" drop"
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	info "    baseline rbtree (match on first field only): ${pps}pps"
+	handle="$(nft -a list chain netdev perf test | grep counter)"
+	handle="${handle##* }"
+	nft delete rule netdev perf test handle "${handle}"
+
+	nft add rule "netdev perf test ${chain_spec} @test \
+		counter name \"test\" drop"
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	p5="$(printf %5s "${perf_entries}")"
+	info "    set with ${p5} full, ranged entries:         ${pps}pps"
+	kill "${perf_pid}"
+}
+
+# Run everything in a separate network namespace
+[ "${1}" != "run" ] && { unshare -n "${0}" run; exit $?; }
+tmp="$(mktemp)"
+trap cleanup EXIT
+
+# Entry point for test runs
+passed=0
+for name in ${TESTS}; do
+	printf "TEST: %s\n" "${name}"
+	for type in ${TYPES}; do
+		eval desc=\$TYPE_"${type}"
+		IFS='
+'
+		for __line in ${desc}; do
+			# shellcheck disable=SC2086
+			eval ${__line%%	*}=\"${__line##*	}\";
+		done
+		IFS=' 	
+'
+
+		if [ "${name}" = "concurrency" ] && \
+		   [ "${race_repeat}" = "0" ]; then
+			continue
+		fi
+		if [ "${name}" = "performance" ] && \
+		   [ "${perf_duration}" = "0" ]; then
+			continue
+		fi
+
+		printf "  %-60s  " "${display}"
+		eval test_"${name}"
+		ret=$?
+
+		if [ $ret -eq 0 ]; then
+			printf "[ OK ]\n"
+			info_flush
+			passed=$((passed + 1))
+		elif [ $ret -eq 1 ]; then
+			printf "[FAIL]\n"
+			err_flush
+			exit 1
+		elif [ $ret -eq ${KSELFTEST_SKIP} ]; then
+			printf "[SKIP]\n"
+			err_flush
+		fi
+	done
+done
+
+[ ${passed} -eq 0 ] && exit ${KSELFTEST_SKIP}
-- 
2.20.1



* [PATCH nf-next v2 5/8] nft_set_pipapo: Provide unrolled lookup loops for common field sizes
From: Stefano Brivio @ 2019-11-22 13:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

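Unroll the per-group lookup loop for the most common field sizes: 2, 4,
8, 12 and 32 four-bit groups, that is, protocol, port, IPv4 address,
MAC address and IPv6 address. For non-vectorised lookup
implementations, this increases matching rates by 20 to 30% for most
set types.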

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: No changes
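
For readers who want to try the idea outside the kernel, here is a
minimal user-space sketch of the generic per-group loop next to an
unrolled variant for a 16-bit field (four 4-bit groups). It is not the
kernel code: and_bucket(), match_loop(), match_2(), match_4(),
add_rule() and lookup_port() are names made up for this illustration,
functions stand in for the macros, and buckets are assumed to be one
long wide (bsize == 1) when rules are added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKETS	16	/* 2^4 buckets for each 4-bit group */

/* AND one bucket, 'bsize' longs wide, into the result map */
static void and_bucket(unsigned long *map, const unsigned long *bucket,
		       size_t bsize)
{
	size_t i;

	for (i = 0; i < bsize; i++)
		map[i] &= bucket[i];
}

/* Generic form: one iteration per 4-bit group, high nibble first */
static void match_loop(unsigned long *map, const unsigned long *lt,
		       size_t bsize, const uint8_t *pkt, int groups)
{
	int group;

	for (group = 0; group < groups; group++) {
		uint8_t v;

		if (group % 2)
			v = *pkt++ & 0x0f;
		else
			v = *pkt >> 4;

		and_bucket(map, lt + v * bsize, bsize);
		lt += bsize * BUCKETS;
	}
}

/* Unrolled form for one byte, 'offset' is the first bucket of the group */
static const uint8_t *match_2(unsigned long *map, const unsigned long *lt,
			      size_t bsize, const uint8_t *pkt, size_t offset)
{
	and_bucket(map, lt + (offset + (*pkt >> 4)) * bsize, bsize);
	and_bucket(map, lt + (offset + BUCKETS + (*pkt & 0x0f)) * bsize,
		   bsize);

	return pkt + 1;
}

/* Two bytes: the second byte starts two groups (2 * BUCKETS) further */
static const uint8_t *match_4(unsigned long *map, const unsigned long *lt,
			      size_t bsize, const uint8_t *pkt, size_t offset)
{
	pkt = match_2(map, lt, bsize, pkt, offset);

	return match_2(map, lt, bsize, pkt, offset + 2 * BUCKETS);
}

/* Set bit 'rule' in the bucket selected by each nibble of 'key' (bsize 1) */
static void add_rule(unsigned long *lt, int rule, const uint8_t *key,
		     int groups)
{
	int g;

	assert(rule >= 0 && (size_t)rule < 8 * sizeof(unsigned long));

	for (g = 0; g < groups; g++) {
		uint8_t v = (g % 2) ? (key[g / 2] & 0x0f) : (key[g / 2] >> 4);

		lt[g * BUCKETS + v] |= 1UL << rule;
	}
}

/* Look up a 2-byte value in a table holding rules for 0x1234 and 0x1235 */
static unsigned long lookup_port(const uint8_t *pkt, int unrolled)
{
	static const uint8_t k0[2] = { 0x12, 0x34 }, k1[2] = { 0x12, 0x35 };
	unsigned long lt[4 * BUCKETS] = { 0 };
	unsigned long map = ~0UL;

	add_rule(lt, 0, k0, 4);
	add_rule(lt, 1, k1, 4);

	if (unrolled)
		match_4(&map, lt, 1, pkt, 0);
	else
		match_loop(&map, lt, 1, pkt, 4);

	return map;
}
```

Both variants should leave the same bits set in the result map. The
unrolled one only replaces the loop counter and the running table
pointer with constant bucket offsets.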

 net/netfilter/nft_set_pipapo.c | 86 +++++++++++++++++++++++++++++-----
 1 file changed, 73 insertions(+), 13 deletions(-)

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index 3cad9aedc168..0596dbd11319 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -526,6 +526,51 @@ static int pipapo_refill(unsigned long *map, int len, int rules,
 	return ret;
 }
 
+#define NFT_PIPAPO_AND_BUCKET(map, bucket, bsize, idx)			       \
+	do {								       \
+		for (idx = 0; idx < (bsize); idx++)			       \
+			map[idx] &= *((bucket) + idx);			       \
+	} while (0)
+
+#define NFT_PIPAPO_MATCH_2(map, lt, bsize, pkt, offset, idx)		       \
+	do {								       \
+		NFT_PIPAPO_AND_BUCKET(map,				       \
+				      lt +				       \
+				      (offset +  0 +   (*pkt >> 4)) * bsize,   \
+				      bsize, idx);			       \
+		NFT_PIPAPO_AND_BUCKET(map,				       \
+				      lt +				       \
+				      (offset + 16 + (*pkt & 0x0f)) * bsize,   \
+				      bsize, idx);			       \
+		pkt++;							       \
+	} while (0)
+
+#define NFT_PIPAPO_MATCH_4(map, lt, bsize, pkt, offset, idx)		       \
+	do {								       \
+		NFT_PIPAPO_MATCH_2(map, lt, bsize, pkt, offset, idx);	       \
+		NFT_PIPAPO_MATCH_2(map, lt, bsize, pkt, offset + 2 * 16, idx); \
+	} while (0)
+
+#define NFT_PIPAPO_MATCH_8(map, lt, bsize, pkt, offset, idx)		       \
+	do {								       \
+		NFT_PIPAPO_MATCH_4(map, lt, bsize, pkt, offset, idx);	       \
+		NFT_PIPAPO_MATCH_4(map, lt, bsize, pkt, offset + 4 * 16, idx); \
+	} while (0)
+
+#define NFT_PIPAPO_MATCH_12(map, lt, bsize, pkt, idx)			       \
+	do {								       \
+		NFT_PIPAPO_MATCH_8(map, lt, bsize, pkt, 0, idx);	       \
+		NFT_PIPAPO_MATCH_4(map, lt, bsize, pkt, 8 * 16, idx);	       \
+	} while (0)
+
+#define NFT_PIPAPO_MATCH_32(map, lt, bsize, pkt, idx)			       \
+	do {								       \
+		NFT_PIPAPO_MATCH_8(map, lt, bsize, pkt,  0, idx);	       \
+		NFT_PIPAPO_MATCH_8(map, lt, bsize, pkt,  8 * 16, idx);	       \
+		NFT_PIPAPO_MATCH_8(map, lt, bsize, pkt, 16 * 16, idx);	       \
+		NFT_PIPAPO_MATCH_8(map, lt, bsize, pkt, 24 * 16, idx);	       \
+	} while (0)
+
 /**
  * nft_pipapo_lookup() - Lookup function
  * @net:	Network namespace
@@ -566,24 +611,39 @@ static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 	nft_pipapo_for_each_field(f, i, m) {
 		bool last = i == m->field_count - 1;
 		unsigned long *lt = f->lt;
-		int b, group;
+		int b, group, j;
 
 		/* For each 4-bit group: select lookup table bucket depending on
-		 * packet bytes value, then AND bucket value
+		 * packet bytes value, then AND bucket value. Unroll loops for
+		 * the most common cases (protocol, port, IPv4 address, MAC
+		 * address, IPv6 address).
 		 */
-		for (group = 0; group < f->groups; group++) {
-			u8 v;
+		if (f->groups == 2) {
+			NFT_PIPAPO_MATCH_2(res_map, lt, f->bsize, rp, 0, j);
+		} else if (f->groups == 4) {
+			NFT_PIPAPO_MATCH_4(res_map, lt, f->bsize, rp, 0, j);
+		} else if (f->groups == 8) {
+			NFT_PIPAPO_MATCH_8(res_map, lt, f->bsize, rp, 0, j);
+		} else if (f->groups == 12) {
+			NFT_PIPAPO_MATCH_12(res_map, lt, f->bsize, rp, j);
+		} else if (f->groups == 32) {
+			NFT_PIPAPO_MATCH_32(res_map, lt, f->bsize, rp, j);
+		} else {
+			for (group = 0; group < f->groups; group++) {
+				u8 v;
+
+				if (group % 2) {
+					v = *rp & 0x0f;
+					rp++;
+				} else {
+					v = *rp >> 4;
+				}
+				__bitmap_and(res_map, res_map,
+					     lt + v * f->bsize,
+					     f->bsize * BITS_PER_LONG);
 
-			if (group % 2) {
-				v = *rp & 0x0f;
-				rp++;
-			} else {
-				v = *rp >> 4;
+				lt += f->bsize * NFT_PIPAPO_BUCKETS;
 			}
-			__bitmap_and(res_map, res_map, lt + v * f->bsize,
-				     f->bsize * BITS_PER_LONG);
-
-			lt += f->bsize * NFT_PIPAPO_BUCKETS;
 		}
 
 		/* Now populate the bitmap for the next field, unless this is
-- 
2.20.1



* [PATCH nf-next v2 6/8] nft_set_pipapo: Prepare for vectorised implementation: alignment
From: Stefano Brivio @ 2019-11-22 13:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

SIMD vector extensions require stricter data alignment than scalar
instructions do, either to operate efficiently (AVX, NEON) or for
some instructions to work at all (AltiVec).

Provide facilities to define arbitrary alignment for lookup tables
and scratch maps. If NFT_PIPAPO_ALIGN is defined as the desired
alignment in bytes, the lt_aligned and scratch_aligned pointers
become available.

Additional headroom is allocated, and pointers to the possibly
unaligned, originally allocated areas are kept so that they can
be freed.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: No changes
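
The headroom scheme described above can be tried in user space with
the sketch below. It is only an illustration under stated
assumptions: ALIGN_BYTES, HEADROOM, struct table, ptr_align(),
table_alloc() and table_free() are hypothetical names, calloc() stands
in for kvzalloc(), and _Alignof(max_align_t) plays the role of
ARCH_KMALLOC_MINALIGN as the alignment the allocator already
guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_BYTES	32	/* hypothetical requirement, e.g. 256-bit vectors */

/* The allocator already aligns to max_align_t, so only the difference
 * up to ALIGN_BYTES needs to be allocated on top of the table itself.
 */
#define HEADROOM	(ALIGN_BYTES - _Alignof(max_align_t))

static_assert(ALIGN_BYTES >= _Alignof(max_align_t),
	      "allocator alignment already covers ALIGN_BYTES");

struct table {
	unsigned long *lt;		/* as returned by calloc(), used to free */
	unsigned long *lt_aligned;	/* rounded up to ALIGN_BYTES, used to access */
};

/* Round a pointer up to the next multiple of ALIGN_BYTES */
static unsigned long *ptr_align(unsigned long *p)
{
	uintptr_t v = (uintptr_t)p;

	v = (v + ALIGN_BYTES - 1) & ~(uintptr_t)(ALIGN_BYTES - 1);

	return (unsigned long *)v;
}

/* Allocate 'longs' zeroed longs plus headroom, keep both pointers */
static int table_alloc(struct table *t, size_t longs)
{
	t->lt = calloc(1, longs * sizeof(*t->lt) + HEADROOM);
	if (!t->lt)
		return -1;

	t->lt_aligned = ptr_align(t->lt);
	assert((uintptr_t)t->lt_aligned % ALIGN_BYTES == 0);

	return 0;
}

/* Free the original pointer, never the aligned one */
static void table_free(struct table *t)
{
	free(t->lt);
	t->lt = NULL;
	t->lt_aligned = NULL;
}
```

Keeping both pointers costs one extra word per table, and avoids
having to recover the original address from the aligned one at free
time.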

 net/netfilter/nft_set_pipapo.c | 115 +++++++++++++++++++++++++++------
 1 file changed, 97 insertions(+), 18 deletions(-)

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index 0596dbd11319..92a0e3dc6f79 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -377,6 +377,22 @@
 #define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
 					- (1UL << NFT_PIPAPO_MAP_NBITS))
 
+/* Definitions for vectorised implementations */
+#ifdef NFT_PIPAPO_ALIGN
+#define NFT_PIPAPO_ALIGN_HEADROOM					\
+	(NFT_PIPAPO_ALIGN - ARCH_KMALLOC_MINALIGN)
+#define NFT_PIPAPO_LT_ALIGN(lt)		(PTR_ALIGN((lt), NFT_PIPAPO_ALIGN))
+#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
+	do {								\
+		(field)->lt_aligned = NFT_PIPAPO_LT_ALIGN(x);		\
+		(field)->lt = (x);					\
+	} while (0)
+#else
+#define NFT_PIPAPO_ALIGN_HEADROOM	0
+#define NFT_PIPAPO_LT_ALIGN(lt)		(lt)
+#define NFT_PIPAPO_LT_ASSIGN(field, x)	((field)->lt = (x))
+#endif /* NFT_PIPAPO_ALIGN */
+
 #define nft_pipapo_for_each_field(field, index, match)		\
 	for ((field) = (match)->f, (index) = 0;			\
 	     (index) < (match)->field_count;			\
@@ -410,12 +426,16 @@ union nft_pipapo_map_bucket {
  * @rules:	Number of inserted rules
  * @bsize:	Size of each bucket in lookup table, in longs
  * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
+ * @lt_aligned:	Version of @lt aligned to NFT_PIPAPO_ALIGN bytes
  * @mt:		Mapping table: one bucket per rule
  */
 struct nft_pipapo_field {
 	int groups;
 	unsigned long rules;
 	size_t bsize;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long *lt_aligned;
+#endif
 	unsigned long *lt;
 	union nft_pipapo_map_bucket *mt;
 };
@@ -424,12 +444,16 @@ struct nft_pipapo_field {
  * struct nft_pipapo_match - Data used for lookup and matching
  * @field_count		Amount of fields in set
  * @scratch:		Preallocated per-CPU maps for partial matching results
+ * @scratch_aligned:	Version of @scratch aligned to NFT_PIPAPO_ALIGN bytes
  * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
  * @rcu			Matching data is swapped on commits
  * @f:			Fields, with lookup and mapping tables
  */
 struct nft_pipapo_match {
 	int field_count;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long * __percpu *scratch_aligned;
+#endif
 	unsigned long * __percpu *scratch;
 	size_t bsize_max;
 	struct rcu_head rcu;
@@ -733,8 +757,8 @@ static void *pipapo_get(const struct net *net, const struct nft_set *set,
 	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
 
 	nft_pipapo_for_each_field(f, i, m) {
+		unsigned long *lt = NFT_PIPAPO_LT_ALIGN(f->lt);
 		bool last = i == m->field_count - 1;
-		unsigned long *lt = f->lt;
 		int b, group;
 
 		/* For each 4-bit group: select lookup table bucket depending on
@@ -828,6 +852,10 @@ static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 	int group, bucket;
 
 	new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG);
+#ifdef NFT_PIPAPO_ALIGN
+	new_bucket_size = roundup(new_bucket_size,
+				  NFT_PIPAPO_ALIGN / sizeof(*new_lt));
+#endif
 
 	if (new_bucket_size == f->bsize)
 		goto mt;
@@ -838,12 +866,14 @@ static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 		copy = new_bucket_size;
 
 	new_lt = kvzalloc(f->groups * NFT_PIPAPO_BUCKETS * new_bucket_size *
-			  sizeof(*new_lt), GFP_KERNEL);
+			  sizeof(*new_lt) + NFT_PIPAPO_ALIGN_HEADROOM,
+			  GFP_KERNEL);
 	if (!new_lt)
 		return -ENOMEM;
 
-	new_p = new_lt;
-	old_p = old_lt;
+	new_p = NFT_PIPAPO_LT_ALIGN(new_lt);
+	old_p = NFT_PIPAPO_LT_ALIGN(old_lt);
+
 	for (group = 0; group < f->groups; group++) {
 		for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS; bucket++) {
 			memcpy(new_p, old_p, copy * sizeof(*new_p));
@@ -872,7 +902,7 @@ static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 
 	if (new_lt) {
 		f->bsize = new_bucket_size;
-		f->lt = new_lt;
+		NFT_PIPAPO_LT_ASSIGN(f, new_lt);
 		kvfree(old_lt);
 	}
 
@@ -894,7 +924,8 @@ static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group,
 {
 	unsigned long *pos;
 
-	pos = f->lt + f->bsize * NFT_PIPAPO_BUCKETS * group;
+	pos = NFT_PIPAPO_LT_ALIGN(f->lt);
+	pos += f->bsize * NFT_PIPAPO_BUCKETS * group;
 	pos += f->bsize * v;
 
 	__set_bit(rule, pos);
@@ -1118,8 +1149,12 @@ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 
 	for_each_possible_cpu(i) {
 		unsigned long *scratch;
+#ifdef NFT_PIPAPO_ALIGN
+		unsigned long *scratch_aligned;
+#endif
 
-		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2,
+		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2 +
+				       NFT_PIPAPO_ALIGN_HEADROOM,
 				       GFP_KERNEL, cpu_to_node(i));
 		if (!scratch) {
 			/* On failure, there's no need to undo previous
@@ -1135,6 +1170,11 @@ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 		kfree(*per_cpu_ptr(clone->scratch, i));
 
 		*per_cpu_ptr(clone->scratch, i) = scratch;
+
+#ifdef NFT_PIPAPO_ALIGN
+		scratch_aligned = NFT_PIPAPO_LT_ALIGN(scratch);
+		*per_cpu_ptr(clone->scratch_aligned, i) = scratch_aligned;
+#endif
 	}
 
 	return 0;
@@ -1291,21 +1331,33 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 	if (!new->scratch)
 		goto out_scratch;
 
+#ifdef NFT_PIPAPO_ALIGN
+	new->scratch_aligned = alloc_percpu(*new->scratch_aligned);
+	if (!new->scratch_aligned)
+		goto out_scratch;
+#endif
+
 	rcu_head_init(&new->rcu);
 
 	src = old->f;
 	dst = new->f;
 
 	for (i = 0; i < old->field_count; i++) {
+		unsigned long *new_lt;
+
 		memcpy(dst, src, offsetof(struct nft_pipapo_field, lt));
 
-		dst->lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS *
-				   src->bsize * sizeof(*dst->lt),
-				   GFP_KERNEL);
-		if (!dst->lt)
+		new_lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS *
+				  src->bsize * sizeof(*dst->lt) +
+				  NFT_PIPAPO_ALIGN_HEADROOM,
+				  GFP_KERNEL);
+		if (!new_lt)
 			goto out_lt;
 
-		memcpy(dst->lt, src->lt,
+		NFT_PIPAPO_LT_ASSIGN(dst, new_lt);
+
+		memcpy(NFT_PIPAPO_LT_ALIGN(new_lt),
+		       NFT_PIPAPO_LT_ALIGN(src->lt),
 		       src->bsize * sizeof(*dst->lt) *
 		       src->groups * NFT_PIPAPO_BUCKETS);
 
@@ -1328,8 +1380,11 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 		kvfree(dst->lt);
 		dst--;
 	}
-	free_percpu(new->scratch);
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(new->scratch_aligned);
+#endif
 out_scratch:
+	free_percpu(new->scratch);
 	kfree(new);
 
 	return ERR_PTR(-ENOMEM);
@@ -1485,7 +1540,8 @@ static void pipapo_drop(struct nft_pipapo_match *m,
 			unsigned long *pos;
 			int b;
 
-			pos = f->lt + g * NFT_PIPAPO_BUCKETS * f->bsize;
+			pos = NFT_PIPAPO_LT_ALIGN(f->lt) + g *
+			      NFT_PIPAPO_BUCKETS * f->bsize;
 
 			for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
 				bitmap_cut(pos, pos, rulemap[i].to,
@@ -1590,6 +1646,9 @@ static void pipapo_reclaim_match(struct rcu_head *rcu)
 	for_each_possible_cpu(i)
 		kfree(*per_cpu_ptr(m->scratch, i));
 
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(m->scratch_aligned);
+#endif
 	free_percpu(m->scratch);
 
 	pipapo_free_fields(m);
@@ -1817,7 +1876,8 @@ static int pipapo_get_boundaries(struct nft_pipapo_field *f, int first_rule,
 		for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
 			unsigned long *pos;
 
-			pos = f->lt + (g * NFT_PIPAPO_BUCKETS + b) * f->bsize;
+			pos = NFT_PIPAPO_LT_ALIGN(f->lt) +
+			      (g * NFT_PIPAPO_BUCKETS + b) * f->bsize;
 			if (test_bit(first_rule, pos) && x0 == -1)
 				x0 = b;
 			if (test_bit(first_rule + rule_count - 1, pos))
@@ -2117,11 +2177,21 @@ static int nft_pipapo_init(const struct nft_set *set,
 	m->scratch = alloc_percpu(unsigned long *);
 	if (!m->scratch) {
 		err = -ENOMEM;
-		goto out_free;
+		goto out_scratch;
 	}
 	for_each_possible_cpu(i)
 		*per_cpu_ptr(m->scratch, i) = NULL;
 
+#ifdef NFT_PIPAPO_ALIGN
+	m->scratch_aligned = alloc_percpu(unsigned long *);
+	if (!m->scratch_aligned) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(m->scratch_aligned, i) = NULL;
+#endif
+
 	rcu_head_init(&m->rcu);
 
 	f = m->f;
@@ -2139,7 +2209,7 @@ static int nft_pipapo_init(const struct nft_set *set,
 
 		f->bsize = 0;
 		f->rules = 0;
-		f->lt = NULL;
+		NFT_PIPAPO_LT_ASSIGN(f, NULL);
 		f->mt = NULL;
 
 		f++;
@@ -2159,7 +2229,11 @@ static int nft_pipapo_init(const struct nft_set *set,
 	return 0;
 
 out_free:
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(m->scratch_aligned);
+#endif
 	free_percpu(m->scratch);
+out_scratch:
 	kfree(m);
 
 	return err;
@@ -2198,16 +2272,21 @@ static void nft_pipapo_destroy(const struct nft_set *set)
 			nft_set_elem_destroy(set, e, true);
 		}
 
+#ifdef NFT_PIPAPO_ALIGN
+		free_percpu(m->scratch_aligned);
+#endif
 		for_each_possible_cpu(cpu)
 			kfree(*per_cpu_ptr(m->scratch, cpu));
 		free_percpu(m->scratch);
-
 		pipapo_free_fields(m);
 		kfree(m);
 		priv->match = NULL;
 	}
 
 	if (priv->clone) {
+#ifdef NFT_PIPAPO_ALIGN
+		free_percpu(priv->clone->scratch_aligned);
+#endif
 		for_each_possible_cpu(cpu)
 			kfree(*per_cpu_ptr(priv->clone->scratch, cpu));
 		free_percpu(priv->clone->scratch);
-- 
2.20.1



* [PATCH nf-next v2 7/8] nft_set_pipapo: Prepare for vectorised implementation: helpers
From: Stefano Brivio @ 2019-11-22 13:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

Move most macros and helpers to a header file, so that they can be
conveniently used by related implementations.

No functional changes are intended here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: No changes
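
As a rough worked example of the worst-case size arithmetic that this
patch takes out of nft_pipapo_estimate(), the user-space sketch below
repeats the two per-element terms of the removed expression. It leaves
out the fixed structure sizes, and it assumes a 64-bit build where a
mapping table bucket takes 8 bytes. worst_case_bytes() and the
shortened macro names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS	128	/* largest field, an IPv6 address */
#define GROUP_BITS	4	/* bits grouped in each lookup table row */
#define BUCKETS		(1 << GROUP_BITS)
#define MAP_NBITS	8	/* ilog2(MAX_BITS * 2) */
#define MAP_BUCKET_SIZE	8	/* assumed mapping bucket size, 64-bit build */

static_assert((1 << MAP_NBITS) == MAX_BITS * 2,
	      "MAP_NBITS must be ilog2(MAX_BITS * 2)");

/* Per-element part of the worst-case estimate, in bytes */
static unsigned long long worst_case_bytes(unsigned long long elems)
{
	unsigned long long size;

	/* Lookup tables: groups * buckets * bits per entry, as bytes */
	size = elems * MAX_BITS / GROUP_BITS * BUCKETS * MAX_BITS * 2 / 8;

	/* Mapping table buckets */
	size += elems * MAP_NBITS * MAP_BUCKET_SIZE;

	return size;
}
```

With these assumptions, a single element accounts for 16384 bytes of
lookup table plus 64 bytes of mapping buckets in the worst case, which
is why other set types are preferred whenever concatenation of ranges
is not needed.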

 net/netfilter/nft_set_pipapo.c | 212 ++---------------------------
 net/netfilter/nft_set_pipapo.h | 236 +++++++++++++++++++++++++++++++++
 2 files changed, 244 insertions(+), 204 deletions(-)
 create mode 100644 net/netfilter/nft_set_pipapo.h

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index 92a0e3dc6f79..5076325b8093 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -330,173 +330,20 @@
 
 #include <linux/kernel.h>
 #include <linux/init.h>
-#include <linux/log2.h>
 #include <linux/module.h>
 #include <linux/netlink.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
 #include <uapi/linux/netfilter/nf_tables.h>
-#include <net/ipv6.h>			/* For the maximum length of a field */
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 
-/* Count of concatenated fields depends on count of 32-bit nftables registers */
-#define NFT_PIPAPO_MAX_FIELDS		NFT_REG32_COUNT
-
-/* Largest supported field size */
-#define NFT_PIPAPO_MAX_BYTES		(sizeof(struct in6_addr))
-#define NFT_PIPAPO_MAX_BITS		(NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE)
-
-/* Number of bits to be grouped together in lookup table buckets, arbitrary */
-#define NFT_PIPAPO_GROUP_BITS		4
-#define NFT_PIPAPO_GROUPS_PER_BYTE	(BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS)
-
-/* Fields are padded to 32 bits in input registers */
-#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x)				\
-	(round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32)))
-#define NFT_PIPAPO_GROUPS_PADDING(x)					\
-	(NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE)
-
-/* Number of buckets, given by 2 ^ n, with n grouped bits */
-#define NFT_PIPAPO_BUCKETS		(1 << NFT_PIPAPO_GROUP_BITS)
-
-/* Each n-bit range maps to up to n * 2 rules */
-#define NFT_PIPAPO_MAP_NBITS		(const_ilog2(NFT_PIPAPO_MAX_BITS * 2))
-
-/* Use the rest of mapping table buckets for rule indices, but it makes no sense
- * to exceed 32 bits
- */
-#if BITS_PER_LONG == 64
-#define NFT_PIPAPO_MAP_TOBITS		32
-#else
-#define NFT_PIPAPO_MAP_TOBITS		(BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS)
-#endif
-
-/* ...which gives us the highest allowed index for a rule */
-#define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
-					- (1UL << NFT_PIPAPO_MAP_NBITS))
-
-/* Definitions for vectorised implementations */
-#ifdef NFT_PIPAPO_ALIGN
-#define NFT_PIPAPO_ALIGN_HEADROOM					\
-	(NFT_PIPAPO_ALIGN - ARCH_KMALLOC_MINALIGN)
-#define NFT_PIPAPO_LT_ALIGN(lt)		(PTR_ALIGN((lt), NFT_PIPAPO_ALIGN))
-#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
-	do {								\
-		(field)->lt_aligned = NFT_PIPAPO_LT_ALIGN(x);		\
-		(field)->lt = (x);					\
-	} while (0)
-#else
-#define NFT_PIPAPO_ALIGN_HEADROOM	0
-#define NFT_PIPAPO_LT_ALIGN(lt)		(lt)
-#define NFT_PIPAPO_LT_ASSIGN(field, x)	((field)->lt = (x))
-#endif /* NFT_PIPAPO_ALIGN */
-
-#define nft_pipapo_for_each_field(field, index, match)		\
-	for ((field) = (match)->f, (index) = 0;			\
-	     (index) < (match)->field_count;			\
-	     (index)++, (field)++)
-
-/**
- * union nft_pipapo_map_bucket - Bucket of mapping table
- * @to:		First rule number (in next field) this rule maps to
- * @n:		Number of rules (in next field) this rule maps to
- * @e:		If there's no next field, pointer to element this rule maps to
- */
-union nft_pipapo_map_bucket {
-	struct {
-#if BITS_PER_LONG == 64
-		static_assert(NFT_PIPAPO_MAP_TOBITS <= 32);
-		u32 to;
-
-		static_assert(NFT_PIPAPO_MAP_NBITS <= 32);
-		u32 n;
-#else
-		unsigned long to:NFT_PIPAPO_MAP_TOBITS;
-		unsigned long  n:NFT_PIPAPO_MAP_NBITS;
-#endif
-	};
-	struct nft_pipapo_elem *e;
-};
-
-/**
- * struct nft_pipapo_field - Lookup, mapping tables and related data for a field
- * @groups:	Amount of 4-bit groups
- * @rules:	Number of inserted rules
- * @bsize:	Size of each bucket in lookup table, in longs
- * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
- * @lt_aligned:	Version of @lt aligned to NFT_PIPAPO_ALIGN bytes
- * @mt:		Mapping table: one bucket per rule
- */
-struct nft_pipapo_field {
-	int groups;
-	unsigned long rules;
-	size_t bsize;
-#ifdef NFT_PIPAPO_ALIGN
-	unsigned long *lt_aligned;
-#endif
-	unsigned long *lt;
-	union nft_pipapo_map_bucket *mt;
-};
-
-/**
- * struct nft_pipapo_match - Data used for lookup and matching
- * @field_count		Amount of fields in set
- * @scratch:		Preallocated per-CPU maps for partial matching results
- * @scratch_aligned:	Version of @scratch aligned to NFT_PIPAPO_ALIGN bytes
- * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
- * @rcu			Matching data is swapped on commits
- * @f:			Fields, with lookup and mapping tables
- */
-struct nft_pipapo_match {
-	int field_count;
-#ifdef NFT_PIPAPO_ALIGN
-	unsigned long * __percpu *scratch_aligned;
-#endif
-	unsigned long * __percpu *scratch;
-	size_t bsize_max;
-	struct rcu_head rcu;
-	struct nft_pipapo_field f[0];
-};
+#include "nft_set_pipapo.h"
 
 /* Current working bitmap index, toggled between field matches */
 static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index);
 
-/**
- * struct nft_pipapo - Representation of a set
- * @match:	Currently in-use matching data
- * @clone:	Copy where pending insertions and deletions are kept
- * @groups:	Total amount of 4-bit groups for fields in this set
- * @width:	Total bytes to be matched for one packet, including padding
- * @dirty:	Working copy has pending insertions or deletions
- * @last_gc:	Timestamp of last garbage collection run, jiffies
- * @start_data:	Key data of start element for insertion
- * @start_elem:	Start element for insertion
- */
-struct nft_pipapo {
-	struct nft_pipapo_match __rcu *match;
-	struct nft_pipapo_match *clone;
-	int groups;
-	int width;
-	bool dirty;
-	unsigned long last_gc;
-	u8 start_data[NFT_DATA_VALUE_MAXLEN * sizeof(u32)];
-	struct nft_pipapo_elem *start_elem;
-};
-
-struct nft_pipapo_elem;
-
-/**
- * struct nft_pipapo_elem - API-facing representation of single set element
- * @start:	Pointer to element that represents start of interval
- * @ext:	nftables API extensions
- */
-struct nft_pipapo_elem {
-	struct nft_pipapo_elem *start;
-	struct nft_set_ext ext;
-};
-
 /**
  * pipapo_refill() - For each set bit, set bits from selected mapping table item
  * @map:	Bitmap to be scanned for set bits
@@ -514,9 +361,8 @@ struct nft_pipapo_elem {
  *
  * Return: -1 on no match, bit position on 'match_only', 0 otherwise.
  */
-static int pipapo_refill(unsigned long *map, int len, int rules,
-			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
-			 bool match_only)
+int pipapo_refill(unsigned long *map, int len, int rules, unsigned long *dst,
+		  union nft_pipapo_map_bucket *mt, bool match_only)
 {
 	unsigned long bitset;
 	int k, ret = -1;
@@ -635,7 +481,7 @@ static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 	nft_pipapo_for_each_field(f, i, m) {
 		bool last = i == m->field_count - 1;
 		unsigned long *lt = f->lt;
-		int b, group, j;
+		int b, j;
 
 		/* For each 4-bit group: select lookup table bucket depending on
 		 * packet bytes value, then AND bucket value. Unroll loops for
@@ -653,21 +499,8 @@ static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 		} else if (f->groups == 32) {
 			NFT_PIPAPO_MATCH_32(res_map, lt, f->bsize, rp, j);
 		} else {
-			for (group = 0; group < f->groups; group++) {
-				u8 v;
-
-				if (group % 2) {
-					v = *rp & 0x0f;
-					rp++;
-				} else {
-					v = *rp >> 4;
-				}
-				__bitmap_and(res_map, res_map,
-					     lt + v * f->bsize,
-					     f->bsize * BITS_PER_LONG);
-
-				lt += f->bsize * NFT_PIPAPO_BUCKETS;
-			}
+			pipapo_and_field_buckets(f, res_map, rp);
+			rp += f->groups / NFT_PIPAPO_GROUPS_PER_BYTE;
 		}
 
 		/* Now populate the bitmap for the next field, unless this is
@@ -2077,21 +1910,11 @@ static u64 nft_pipapo_privsize(const struct nlattr * const nla[],
 }
 
 /**
- * nft_pipapo_estimate() - Estimate set size, space and lookup complexity
+ * nft_pipapo_estimate() - Set size, space and lookup complexity
  * @desc:	Set description, initial element count used here
  * @features:	Flags: NFT_SET_SUBKEY needs to be there
  * @est:	Storage for estimation data
  *
- * The size for this set type can vary dramatically, as it depends on the number
- * of rules (composing netmasks) the entries expand to. We compute the worst
- * case here, in order to ensure that other types are used if concatenation of
- * ranges is not needed.
- *
- * In general, for a non-ranged entry or a single composing netmask, we need
- * one bit in each of the sixteen NFT_PIPAPO_BUCKETS, for each 4-bit group (that
- * is, each input bit needs four bits of matching data), plus a bucket in the
- * mapping table for each field.
- *
  * Return: true
  */
 static bool nft_pipapo_estimate(const struct nft_set_desc *desc, u32 features,
@@ -2100,26 +1923,7 @@ static bool nft_pipapo_estimate(const struct nft_set_desc *desc, u32 features,
 	if (!(features & NFT_SET_SUBKEY))
 		return false;
 
-	est->size = sizeof(struct nft_pipapo) + sizeof(struct nft_pipapo_match);
-
-	/* Worst-case with current amount of 32-bit VM registers (16 of them):
-	 * - 2 IPv6 addresses	8 registers
-	 * - 2 interface names	8 registers
-	 * that is, four 128-bit fields:
-	 */
-	est->size += sizeof(struct nft_pipapo_field) * 4;
-
-	/* expanding to worst-case ranges, 128 * 2 rules each, resulting in:
-	 * - 128 4-bit groups
-	 * - each set entry taking 256 bits in each bucket
-	 */
-	est->size += desc->size * NFT_PIPAPO_MAX_BITS / NFT_PIPAPO_GROUP_BITS *
-		     NFT_PIPAPO_BUCKETS * NFT_PIPAPO_MAX_BITS * 2 /
-		     BITS_PER_BYTE;
-
-	/* and we need mapping buckets, too */
-	est->size += desc->size * NFT_PIPAPO_MAP_NBITS *
-		     sizeof(union nft_pipapo_map_bucket);
+	est->size = pipapo_estimate_size(desc->size);
 
 	est->lookup = NFT_SET_CLASS_O_LOG_N;
 
diff --git a/net/netfilter/nft_set_pipapo.h b/net/netfilter/nft_set_pipapo.h
new file mode 100644
index 000000000000..0261f72b804f
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo.h
@@ -0,0 +1,236 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _NFT_SET_PIPAPO_H
+#define _NFT_SET_PIPAPO_H
+#include <linux/log2.h>
+#include <net/ipv6.h>			/* For the maximum length of a field */
+
+/* Count of concatenated fields depends on count of 32-bit nftables registers */
+#define NFT_PIPAPO_MAX_FIELDS		NFT_REG32_COUNT
+
+/* Largest supported field size */
+#define NFT_PIPAPO_MAX_BYTES		(sizeof(struct in6_addr))
+#define NFT_PIPAPO_MAX_BITS		(NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE)
+
+/* Number of bits to be grouped together in lookup table buckets, arbitrary */
+#define NFT_PIPAPO_GROUP_BITS		4
+#define NFT_PIPAPO_GROUPS_PER_BYTE	(BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS)
+
+/* Fields are padded to 32 bits in input registers */
+#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x)				\
+	(round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32)))
+#define NFT_PIPAPO_GROUPS_PADDING(x)					\
+	(NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE)
+
+/* Number of buckets, given by 2 ^ n, with n grouped bits */
+#define NFT_PIPAPO_BUCKETS		(1 << NFT_PIPAPO_GROUP_BITS)
+
+/* Each n-bit range maps to up to n * 2 rules */
+#define NFT_PIPAPO_MAP_NBITS		(const_ilog2(NFT_PIPAPO_MAX_BITS * 2))
+
+/* Use the rest of mapping table buckets for rule indices, but it makes no sense
+ * to exceed 32 bits
+ */
+#if BITS_PER_LONG == 64
+#define NFT_PIPAPO_MAP_TOBITS		32
+#else
+#define NFT_PIPAPO_MAP_TOBITS		(BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS)
+#endif
+
+/* ...which gives us the highest allowed index for a rule */
+#define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
+					- (1UL << NFT_PIPAPO_MAP_NBITS))
+
+/* Definitions for vectorised implementations */
+#ifdef NFT_PIPAPO_ALIGN
+#define NFT_PIPAPO_ALIGN_HEADROOM					\
+	(NFT_PIPAPO_ALIGN - ARCH_KMALLOC_MINALIGN)
+#define NFT_PIPAPO_LT_ALIGN(lt)		(PTR_ALIGN((lt), NFT_PIPAPO_ALIGN))
+#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
+	do {								\
+		(field)->lt_aligned = NFT_PIPAPO_LT_ALIGN(x);		\
+		(field)->lt = (x);					\
+	} while (0)
+#else
+#define NFT_PIPAPO_ALIGN_HEADROOM	0
+#define NFT_PIPAPO_LT_ALIGN(lt)		(lt)
+#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
+	do {								\
+		(field)->lt = (x);					\
+	} while (0)
+#endif /* NFT_PIPAPO_ALIGN */
+
+#define nft_pipapo_for_each_field(field, index, match)		\
+	for ((field) = (match)->f, (index) = 0;			\
+	     (index) < (match)->field_count;			\
+	     (index)++, (field)++)
+
+/**
+ * union nft_pipapo_map_bucket - Bucket of mapping table
+ * @to:		First rule number (in next field) this rule maps to
+ * @n:		Number of rules (in next field) this rule maps to
+ * @e:		If there's no next field, pointer to element this rule maps to
+ */
+union nft_pipapo_map_bucket {
+	struct {
+#if BITS_PER_LONG == 64
+		static_assert(NFT_PIPAPO_MAP_TOBITS <= 32);
+		u32 to;
+
+		static_assert(NFT_PIPAPO_MAP_NBITS <= 32);
+		u32 n;
+#else
+		unsigned long to:NFT_PIPAPO_MAP_TOBITS;
+		unsigned long  n:NFT_PIPAPO_MAP_NBITS;
+#endif
+	};
+	struct nft_pipapo_elem *e;
+};
+
+/**
+ * struct nft_pipapo_field - Lookup, mapping tables and related data for a field
+ * @groups:	Amount of 4-bit groups
+ * @rules:	Number of inserted rules
+ * @bsize:	Size of each bucket in lookup table, in longs
+ * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
+ * @lt_aligned:	Version of @lt aligned to NFT_PIPAPO_ALIGN bytes
+ * @mt:		Mapping table: one bucket per rule
+ */
+struct nft_pipapo_field {
+	int groups;
+	unsigned long rules;
+	size_t bsize;
+	unsigned long *lt;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long *lt_aligned;
+#endif
+	union nft_pipapo_map_bucket *mt;
+};
+
+/**
+ * struct nft_pipapo_match - Data used for lookup and matching
+ * @field_count:	Amount of fields in set
+ * @scratch:		Preallocated per-CPU maps for partial matching results
+ * @scratch_aligned:	Version of @scratch aligned to NFT_PIPAPO_ALIGN bytes
+ * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
+ * @rcu:		Matching data is swapped on commits
+ * @f:			Fields, with lookup and mapping tables
+ */
+struct nft_pipapo_match {
+	int field_count;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long * __percpu *scratch_aligned;
+#endif
+	unsigned long * __percpu *scratch;
+	size_t bsize_max;
+	struct rcu_head rcu;
+	struct nft_pipapo_field f[0];
+};
+
+/**
+ * struct nft_pipapo - Representation of a set
+ * @match:	Currently in-use matching data
+ * @clone:	Copy where pending insertions and deletions are kept
+ * @groups:	Total amount of 4-bit groups for fields in this set
+ * @width:	Total bytes to be matched for one packet, including padding
+ * @dirty:	Working copy has pending insertions or deletions
+ * @last_gc:	Timestamp of last garbage collection run, jiffies
+ * @start_data:	Key data of start element for insertion
+ * @start_elem:	Start element for insertion
+ */
+struct nft_pipapo {
+	struct nft_pipapo_match __rcu *match;
+	struct nft_pipapo_match *clone;
+	int groups;
+	int width;
+	bool dirty;
+	unsigned long last_gc;
+	u8 start_data[NFT_DATA_VALUE_MAXLEN * sizeof(u32)];
+	struct nft_pipapo_elem *start_elem;
+};
+
+struct nft_pipapo_elem;
+
+/**
+ * struct nft_pipapo_elem - API-facing representation of single set element
+ * @start:	Pointer to element that represents start of interval
+ * @ext:	nftables API extensions
+ */
+struct nft_pipapo_elem {
+	struct nft_pipapo_elem *start;
+	struct nft_set_ext ext;
+};
+
+int pipapo_refill(unsigned long *map, int len, int rules, unsigned long *dst,
+		  union nft_pipapo_map_bucket *mt, bool match_only);
+
+/**
+ * pipapo_and_field_buckets() - Select buckets from packet data and intersect
+ * @f:		Field including lookup table
+ * @dst:	Scratch map for partial matching result
+ * @rp:		Packet data register pointer
+ */
+static inline void pipapo_and_field_buckets(struct nft_pipapo_field *f,
+					    unsigned long *dst, const u8 *rp)
+{
+	unsigned long *lt = NFT_PIPAPO_LT_ALIGN(f->lt);
+	int group;
+
+	for (group = 0; group < f->groups; group++) {
+		u8 v;
+
+		if (group % 2) {
+			v = *rp & 0x0f;
+			rp++;
+		} else {
+			v = *rp >> 4;
+		}
+		__bitmap_and(dst, dst, lt + v * f->bsize,
+			     f->bsize * BITS_PER_LONG);
+
+		lt += f->bsize * NFT_PIPAPO_BUCKETS;
+	}
+}
+
+/**
+ * pipapo_estimate_size() - Estimate worst-case for set size
+ * @elem_count:		Count of initial set elements
+ *
+ * The size for this set type can vary dramatically, as it depends on the number
+ * of rules (composing netmasks) the entries expand to. We compute the worst
+ * case here, in order to ensure that other types are used if concatenation of
+ * ranges is not needed.
+ *
+ * In general, for a non-ranged entry or a single composing netmask, we need
+ * one bit in each of the sixteen NFT_PIPAPO_BUCKETS, for each 4-bit group (that
+ * is, each input bit needs four bits of matching data), plus a bucket in the
+ * mapping table for each field.
+ *
+ * Return: estimated worst-case set size, in bytes.
+ */
+static int pipapo_estimate_size(int elem_count)
+{
+	int size = sizeof(struct nft_pipapo) + sizeof(struct nft_pipapo_match);
+
+	/* Worst-case with current amount of 32-bit VM registers (16 of them):
+	 * - 2 IPv6 addresses	8 registers
+	 * - 2 interface names	8 registers
+	 * that is, four 128-bit fields:
+	 */
+	size += sizeof(struct nft_pipapo_field) * 4;
+
+	/* expanding to worst-case ranges, 128 * 2 rules each, resulting in:
+	 * - 128 4-bit groups
+	 * - each set entry taking 256 bits in each bucket
+	 */
+	size += elem_count * NFT_PIPAPO_MAX_BITS / NFT_PIPAPO_GROUP_BITS *
+		NFT_PIPAPO_BUCKETS * NFT_PIPAPO_MAX_BITS * 2 / BITS_PER_BYTE;
+
+	/* and we need mapping buckets, too */
+	size += elem_count * NFT_PIPAPO_MAP_NBITS *
+		sizeof(union nft_pipapo_map_bucket);
+
+	return size;
+}
+
+#endif /* _NFT_SET_PIPAPO_H */
-- 
2.20.1



* [PATCH nf-next v2 8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation
  2019-11-22 13:39 [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (6 preceding siblings ...)
  2019-11-22 13:40 ` [PATCH nf-next v2 7/8] nft_set_pipapo: Prepare for vectorised implementation: helpers Stefano Brivio
@ 2019-11-22 13:40 ` Stefano Brivio
  2019-11-26  6:36   ` kbuild test robot
  2019-11-23 20:05 ` [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
  8 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-22 13:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

If the AVX2 instruction set is available, we can exploit the
repetitive nature of this algorithm to provide a fast, vectorised
version by using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.
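
As a rough illustration of what gets vectorised, the scalar sketch
below intersects, for one field, the lookup table buckets selected by
each 4-bit group of packet data, one 256-bit chunk (four 64-bit words)
at a time. It is a stand-alone userspace approximation and not code
from this patch: the names intersect_chunk(), BUCKETS and CHUNK_WORDS
are hypothetical, and the table layout ('groups' rows of sixteen
buckets, each 'bsize' words long) is an assumption modelled on
pipapo_and_field_buckets(). In the AVX2 version, each inner loop over
four words is meant to correspond to a single 256-bit load and AND.

```c
#include <stdint.h>
#include <stddef.h>

#define BUCKETS		16	/* one bucket per 4-bit value (assumed) */
#define CHUNK_WORDS	4	/* 4 * 64 bits: one 256-bit chunk */

/* Intersect the buckets selected by packet nibbles, for one 256-bit chunk.
 * lt:    lookup table, groups * BUCKETS buckets of bsize words each
 * chunk: index of the 256-bit chunk inside each bucket
 * res:   in: previous partial result, or all ones for the first field
 *        out: intersection with all selected buckets
 * Returns non-zero if at least one bit is still set in res.
 */
static int intersect_chunk(const uint64_t *lt, size_t bsize, int groups,
			   const uint8_t *pkt, size_t chunk,
			   uint64_t res[CHUNK_WORDS])
{
	uint64_t any = 0;
	int g, w;

	for (g = 0; g < groups; g++) {
		/* Even groups use the high nibble, odd groups the low one */
		uint8_t v = (g % 2) ? (pkt[g / 2] & 0x0f) : (pkt[g / 2] >> 4);
		const uint64_t *bucket = lt + ((size_t)g * BUCKETS + v) * bsize
					    + chunk * CHUNK_WORDS;

		for (w = 0; w < CHUNK_WORDS; w++)
			res[w] &= bucket[w];
	}

	for (w = 0; w < CHUNK_WORDS; w++)
		any |= res[w];

	return any != 0;
}
```

A zero return value here plays the role of the "no match" branch in
the vectorised routines: the chunk can be cleared and skipped without
scanning it for set bits.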

In most cases, this implementation outperforms rbtree set instances,
even though those are configured to match on a single ranged field
only, chosen among the data types used for performance measurements
by the nft_concat_range.sh kselftest.

That script, injecting packets directly on the ingress device path
with pktgen, reports:

- for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB L2$):

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10195687pps
    baseline hash (non-ranged entries):             6166047pps
    baseline rbtree (match on first field only):    2648166pps
    set with  1000 full, ranged entries:            5013920pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10146446pps
    baseline hash (non-ranged entries):             5958857pps
    baseline rbtree (match on first field only):    3972543pps
    set with   100 full, ranged entries:            5032332pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9621089pps
    baseline hash (non-ranged entries):             4784304pps
    baseline rbtree (match on first field only):    1349369pps
    set with  1000 full, ranged entries:            2413250pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10821583pps
    baseline hash (non-ranged entries):             6809399pps
    baseline rbtree (match on first field only):    2799538pps
    set with 30000 full, ranged entries:            1921039pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9460996pps
    baseline hash (non-ranged entries):             3893325pps
    baseline rbtree (match on first field only):    2919418pps
    set with    10 full, ranged entries:            2898623pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9578663pps
    baseline hash (non-ranged entries):             3705263pps
    baseline rbtree (match on first field only):    1342876pps
    set with  1000 full, ranged entries:            2005030pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10153020pps
    baseline hash (non-ranged entries):             5145586pps
    baseline rbtree (match on first field only):    2664821pps
    set with  1000 full, ranged entries:            3922839pps

- for one Intel Core i7-6600U thread (3.4GHz, 64 KiB L1D$, 512 KiB L2$):

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10229859pps
    baseline hash (non-ranged entries):             6160930pps
    baseline rbtree (match on first field only):    2926966pps
    set with  1000 full, ranged entries:            4812717pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10234013pps
    baseline hash (non-ranged entries):             6164457pps
    baseline rbtree (match on first field only):    4019270pps
    set with   100 full, ranged entries:            5072830pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9603512pps
    baseline hash (non-ranged entries):             4771150pps
    baseline rbtree (match on first field only):    1610077pps
    set with  1000 full, ranged entries:            2942229pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10912230pps
    baseline hash (non-ranged entries):             6906587pps
    baseline rbtree (match on first field only):    3156167pps
    set with 30000 full, ranged entries:            2440219pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):              10020213pps
    baseline hash (non-ranged entries):             3415258pps
    baseline rbtree (match on first field only):    3167192pps
    set with    10 full, ranged entries:            2422204pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9860087pps
    baseline hash (non-ranged entries):             3883861pps
    baseline rbtree (match on first field only):    1626784pps
    set with  1000 full, ranged entries:            2318861pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10313285pps
    baseline hash (non-ranged entries):             5324213pps
    baseline rbtree (match on first field only):    2970661pps
    set with  1000 full, ranged entries:            4128105pps
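
To compare two rows of the reports above, a ratio of packet rates is
enough. The helper below is a hypothetical convenience, not part of
the kselftest: pps_ratio() is an assumed name, and it simply divides
one rate by a reference rate taken from the same test.

```c
#include <stdint.h>

/* Ratio between two packet rates: a value above 1.0 means that 'pps'
 * is faster than the reference 'ref'. Returns 0.0 for a zero reference.
 */
static double pps_ratio(uint64_t pps, uint64_t ref)
{
	return ref ? (double)pps / (double)ref : 0.0;
}
```

For instance, with the AMD figures for net,port,
pps_ratio(5013920, 2648166) is about 1.89, that is, the ranged set is
faster than the rbtree baseline, while for port,proto with 30000
entries, pps_ratio(1921039, 2799538) is about 0.69, one of the cases
where the rbtree baseline is still ahead.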

A similar strategy could be easily reused to implement specialised
versions for other SIMD instruction sets, and I plan to post at least
a NEON version at a later time.

The vectorised implementation is automatically selected whenever
the AVX2 feature is available, and this can be detected with the
following check:

	[ $(uname -m) = "x86_64" ] && grep -q avx2 /proc/cpuinfo
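
For a userspace tool, the flag part of this check could be sketched in
C as below. This is only an illustration under stated assumptions:
has_cpu_flag() is a hypothetical helper, it expects one line of text
in the format of the "flags" lines from /proc/cpuinfo, and it neither
checks the machine architecture nor whether the kernel was built with
AVX2 assembler support.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if 'flag' appears as a whole, whitespace-delimited word
 * in 'line', e.g. one "flags" line read from /proc/cpuinfo.
 */
static bool has_cpu_flag(const char *line, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = line;

	if (!len)
		return false;

	while ((p = strstr(p, flag))) {
		bool starts = p == line || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;

		/* Partial match, such as a longer flag name: keep looking */
		p += len;
	}

	return false;
}
```

Unlike the grep invocation above, which matches any substring, this
helper only accepts "avx2" as a complete word. A caller would
typically read /proc/cpuinfo line by line with fgets() and pass each
line starting with "flags" to it.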

In order to make set selection more explicit and visible, we might
at a later time export a different name, by introducing a new
attribute, e.g. NFTA_SET_OPS, as suggested by Phil Sutter on
netfilter-devel in <20180403211540.23700-3-phil@nwl.cc>.

v2:
 - extend scope of kernel_fpu_begin/end() to protect all accesses
   to scratch maps (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_avx2_lookup(), it's
   already implied (Florian Westphal)
 - mention in commit message how to check if this set is used

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/netfilter/nf_tables_core.h |   1 +
 net/netfilter/Makefile                 |   3 +
 net/netfilter/nf_tables_set_core.c     |   6 +
 net/netfilter/nft_set_pipapo.c         |  25 +
 net/netfilter/nft_set_pipapo_avx2.c    | 838 +++++++++++++++++++++++++
 net/netfilter/nft_set_pipapo_avx2.h    |  14 +
 6 files changed, 887 insertions(+)
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.c
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.h

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index 9759257ec8ec..6b7cdc0a592f 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -75,6 +75,7 @@ extern struct nft_set_type nft_set_hash_fast_type;
 extern struct nft_set_type nft_set_rbtree_type;
 extern struct nft_set_type nft_set_bitmap_type;
 extern struct nft_set_type nft_set_pipapo_type;
+extern struct nft_set_type nft_set_pipapo_avx2_type;
 
 struct nft_expr;
 struct nft_regs;
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 3f572e5a975e..847b759895fc 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -83,6 +83,9 @@ nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 nf_tables_set-objs := nf_tables_set_core.o \
 		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o \
 		      nft_set_pipapo.o
+ifneq (,$(findstring -DCONFIG_AS_AVX2=1,$(KBUILD_CFLAGS)))
+nf_tables_set-objs += nft_set_pipapo_avx2.o
+endif
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NF_TABLES_SET)	+= nf_tables_set.o
diff --git a/net/netfilter/nf_tables_set_core.c b/net/netfilter/nf_tables_set_core.c
index 586b621007eb..4fa8f610038c 100644
--- a/net/netfilter/nf_tables_set_core.c
+++ b/net/netfilter/nf_tables_set_core.c
@@ -9,6 +9,9 @@ static int __init nf_tables_set_module_init(void)
 	nft_register_set(&nft_set_rhash_type);
 	nft_register_set(&nft_set_bitmap_type);
 	nft_register_set(&nft_set_rbtree_type);
+#ifdef CONFIG_AS_AVX2
+	nft_register_set(&nft_set_pipapo_avx2_type);
+#endif
 	nft_register_set(&nft_set_pipapo_type);
 
 	return 0;
@@ -17,6 +20,9 @@ static int __init nf_tables_set_module_init(void)
 static void __exit nf_tables_set_module_exit(void)
 {
 	nft_unregister_set(&nft_set_pipapo_type);
+#ifdef CONFIG_AS_AVX2
+	nft_unregister_set(&nft_set_pipapo_avx2_type);
+#endif
 	nft_unregister_set(&nft_set_rbtree_type);
 	nft_unregister_set(&nft_set_bitmap_type);
 	nft_unregister_set(&nft_set_rhash_type);
diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index 5076325b8093..48a738639777 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -339,6 +339,7 @@
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 
+#include "nft_set_pipapo_avx2.h"
 #include "nft_set_pipapo.h"
 
 /* Current working bitmap index, toggled between field matches */
@@ -2138,3 +2139,27 @@ struct nft_set_type nft_set_pipapo_type __read_mostly = {
 		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
 	},
 };
+
+#ifdef CONFIG_AS_AVX2
+struct nft_set_type nft_set_pipapo_avx2_type __read_mostly = {
+	.owner		= THIS_MODULE,
+	.features	= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT |
+			  NFT_SET_TIMEOUT | NFT_SET_SUBKEY,
+	.ops		= {
+		.lookup		= nft_pipapo_avx2_lookup,
+		.insert		= nft_pipapo_insert,
+		.activate	= nft_pipapo_activate,
+		.deactivate	= nft_pipapo_deactivate,
+		.flush		= nft_pipapo_flush,
+		.remove		= nft_pipapo_remove,
+		.walk		= nft_pipapo_walk,
+		.get		= nft_pipapo_get,
+		.privsize	= nft_pipapo_privsize,
+		.estimate	= nft_pipapo_avx2_estimate,
+		.init		= nft_pipapo_init,
+		.destroy	= nft_pipapo_destroy,
+		.gc_init	= nft_pipapo_gc_init,
+		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
+	},
+};
+#endif
diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
new file mode 100644
index 000000000000..2d79673bf21e
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo_avx2.c
@@ -0,0 +1,838 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* PIPAPO: PIle PAcket POlicies: AVX2 packet lookup routines
+ *
+ * Copyright (c) 2019 Red Hat GmbH
+ *
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+#include <uapi/linux/netfilter/nf_tables.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+
+#include <linux/compiler.h>
+#include <asm/fpu/api.h>
+
+#include "nft_set_pipapo_avx2.h"
+#include "nft_set_pipapo.h"
+
+#define NFT_PIPAPO_LONGS_PER_M256	(XSAVE_YMM_SIZE / BITS_PER_LONG)
+
+/* Load from memory into YMM register with non-temporal hint ("stream load"),
+ * that is, don't fetch lines from memory into the cache. This avoids pushing
+ * precious packet data out of the cache hierarchy, and is appropriate when:
+ *
+ * - loading buckets from lookup tables, as they are not going to be used
+ *   again before packets are entirely classified
+ *
+ * - loading the result bitmap from the previous field, as it's never used
+ *   again
+ */
+#define NFT_PIPAPO_AVX2_LOAD(reg, loc)					\
+	asm volatile("vmovntdqa %0, %%ymm" #reg : : "m" (loc))
+
+/* Stream a single lookup table bucket into YMM register given lookup table,
+ * group index, value of packet bits, bucket size.
+ */
+#define NFT_PIPAPO_AVX2_BUCKET_LOAD(reg, lt, group, v, bsize)		\
+	NFT_PIPAPO_AVX2_LOAD(reg,					\
+			     lt[((group) * NFT_PIPAPO_BUCKETS + (v)) * (bsize)])
+
+/* Bitwise AND: the staple operation of this algorithm */
+#define NFT_PIPAPO_AVX2_AND(dst, a, b)					\
+	asm volatile("vpand %ymm" #a ", %ymm" #b ", %ymm" #dst)
+
+/* Jump to label if @reg is zero */
+#define NFT_PIPAPO_AVX2_NOMATCH_GOTO(reg, label)			\
+	asm_volatile_goto("vptest %%ymm" #reg ", %%ymm" #reg ";"	\
+			  "je %l[" #label "]" : : : : label)
+
+/* Store 256 bits from YMM register into memory. Contrary to bucket load
+ * operation, we don't bypass the cache here, as stored matching results
+ * are always used shortly after.
+ */
+#define NFT_PIPAPO_AVX2_STORE(loc, reg)					\
+	asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc))
+
+/* Zero out a complete YMM register, @reg */
+#define NFT_PIPAPO_AVX2_ZERO(reg)					\
+	asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
+
+/* Current working bitmap index, toggled between field matches */
+static DEFINE_PER_CPU(bool, nft_pipapo_avx2_scratch_index);
+
+/**
+ * nft_pipapo_avx2_prepare() - Prepare before main algorithm body
+ *
+ * This zeroes out ymm15, which is later used whenever we need to clear a
+ * memory location, by storing its content into memory.
+ */
+static void nft_pipapo_avx2_prepare(void)
+{
+	NFT_PIPAPO_AVX2_ZERO(15);
+}
+
+/**
+ * nft_pipapo_avx2_fill() - Fill a bitmap region with ones
+ * @data:	Base memory area
+ * @start:	First bit to set
+ * @len:	Count of bits to fill
+ *
+ * This is nothing more than a version of bitmap_set(), as used e.g. by
+ * pipapo_refill(), tailored for the microarchitectures using it and better
+ * suited for the specific usage: it's very likely that we'll set a small number
+ * of bits, not crossing a word boundary, and correct branch prediction is
+ * critical here.
+ *
+ * This function doesn't actually use any AVX2 instruction.
+ */
+static void nft_pipapo_avx2_fill(unsigned long *data, int start, int len)
+{
+	int offset = start % BITS_PER_LONG;
+	unsigned long mask;
+
+	data += start / BITS_PER_LONG;
+
+	if (likely(len == 1)) {
+		*data |= BIT(offset);
+		return;
+	}
+
+	if (likely(len < BITS_PER_LONG || offset)) {
+		if (likely(len + offset <= BITS_PER_LONG)) {
+			*data |= GENMASK(len - 1 + offset, offset);
+			return;
+		}
+
+		*data |= ~0UL << offset;
+		len -= BITS_PER_LONG - offset;
+		data++;
+
+		if (len <= BITS_PER_LONG) {
+			mask = ~0UL >> (BITS_PER_LONG - len);
+			*data |= mask;
+			return;
+		}
+	}
+
+	memset(data, 0xff, len / BITS_PER_BYTE);
+	data += len / BITS_PER_LONG;
+
+	len %= BITS_PER_LONG;
+	if (len)
+		*data |= ~0UL >> (BITS_PER_LONG - len);
+}
+
+/**
+ * nft_pipapo_avx2_refill() - Scan bitmap, select mapping table item, set bits
+ * @offset:	Start from given bitmap (equivalent to bucket) offset, in longs
+ * @map:	Bitmap to be scanned for set bits
+ * @dst:	Destination bitmap
+ * @mt:		Mapping table containing bit set specifiers
+ * @last:	If this is the last field, return index of first set bit
+ *		right away, without filling @dst
+ *
+ * This is an alternative implementation of pipapo_refill() suitable for usage
+ * with AVX2 lookup routines: we know there are four words to be scanned, at
+ * a given offset inside the map, for each matching iteration.
+ *
+ * This function doesn't actually use any AVX2 instruction.
+ *
+ * Return: first set bit index if @last, index of first filled word otherwise.
+ */
+static int nft_pipapo_avx2_refill(int offset, unsigned long *map,
+				  unsigned long *dst,
+				  union nft_pipapo_map_bucket *mt, bool last)
+{
+	int ret = -1;
+
+#define NFT_PIPAPO_AVX2_REFILL_ONE_WORD(x)				\
+	do {								\
+		while (map[(x)]) {					\
+			int r = __builtin_ctzl(map[(x)]);		\
+			int i = (offset + (x)) * BITS_PER_LONG + r;	\
+									\
+			if (last)					\
+				return i;				\
+									\
+			nft_pipapo_avx2_fill(dst, mt[i].to, mt[i].n);	\
+									\
+			if (ret == -1)					\
+				ret = mt[i].to;				\
+									\
+			map[(x)] &= ~(1UL << r);			\
+		}							\
+	} while (0)
+
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(0);
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(1);
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(2);
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(3);
+#undef NFT_PIPAPO_AVX2_REFILL_ONE_WORD
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup2() - AVX2-based lookup for 2 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * Load buckets from lookup table corresponding to the values of each 4-bit
+ * group of packet bytes, and perform a bitwise intersection between them. If
+ * this is the first field in the set, simply AND the buckets together
+ * (equivalent to using an all-ones starting bitmap), use the provided starting
+ * bitmap otherwise. Then call nft_pipapo_avx2_refill() to generate the next
+ * working bitmap, @fill.
+ *
+ * This is used for 8-bit fields (e.g. protocol numbers).
+ *
+ * Out-of-order (and superscalar) execution is vital here, so it's critical to
+ * avoid false data dependencies. CPU and compiler could (mostly) take care of
+ * this on their own, but the operation ordering is explicitly given here with
+ * a likely execution order in mind, to highlight possible stalls. That's why
+ * a number of logically distinct operations (i.e. loading buckets, intersecting
+ * buckets) are interleaved.
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup2(unsigned long *map, unsigned long *fill,
+				   unsigned long *lt,
+				   union nft_pipapo_map_bucket *mt,
+				   unsigned long bsize, const u8 *pkt,
+				   bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[2] = { pkt[0] >> 4, pkt[0] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (first) {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_AND(4, 0, 1);
+		} else {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_LOAD(2, map[i_ul]);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(2, nothing);
+			NFT_PIPAPO_AVX2_AND(3, 0, 1);
+			NFT_PIPAPO_AVX2_AND(4, 2, 3);
+		}
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(4, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 4);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup4() - AVX2-based lookup for 4 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 16-bit fields (e.g. ports).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup4(unsigned long *map, unsigned long *fill,
+				   unsigned long *lt,
+				   union nft_pipapo_map_bucket *mt,
+				   unsigned long bsize, const u8 *pkt,
+				   bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[4] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (first) {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2, lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3, lt, 3, pg[3], bsize);
+			NFT_PIPAPO_AVX2_AND(4, 0, 1);
+			NFT_PIPAPO_AVX2_AND(5, 2, 3);
+			NFT_PIPAPO_AVX2_AND(7, 4, 5);
+		} else {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+
+			NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]);
+
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3, lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(4, lt, 3, pg[3], bsize);
+			NFT_PIPAPO_AVX2_AND(5, 0, 1);
+
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing);
+
+			NFT_PIPAPO_AVX2_AND(6, 2, 3);
+			NFT_PIPAPO_AVX2_AND(7, 4, 5);
+			/* Stall */
+			NFT_PIPAPO_AVX2_AND(7, 6, 7);
+		}
+
+		/* Stall */
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(7, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 7);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup8() - AVX2-based lookup for 8 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 32-bit fields (e.g. IPv4 addresses).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup8(unsigned long *map, unsigned long *fill,
+				   unsigned long *lt,
+				   union nft_pipapo_map_bucket *mt,
+				   unsigned long bsize, const u8 *pkt,
+				   bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[8] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf,
+		     pkt[2] >> 4, pkt[2] & 0xf, pkt[3] >> 4, pkt[3] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (first) {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 3, pg[3], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt, 4, pg[4], bsize);
+			NFT_PIPAPO_AVX2_AND(5,   0,  1);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt, 5, pg[5], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt, 6, pg[6], bsize);
+			NFT_PIPAPO_AVX2_AND(8,   2,  3);
+			NFT_PIPAPO_AVX2_AND(9,   4,  5);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 7, pg[7], bsize);
+			NFT_PIPAPO_AVX2_AND(11,  6,  7);
+			NFT_PIPAPO_AVX2_AND(12,  8,  9);
+			NFT_PIPAPO_AVX2_AND(13, 10, 11);
+
+			/* Stall */
+			NFT_PIPAPO_AVX2_AND(1,  12, 13);
+		} else {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt, 3, pg[3], bsize);
+
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing);
+
+			NFT_PIPAPO_AVX2_AND(5,   0,  1);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt, 4, pg[4], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt, 5, pg[5], bsize);
+			NFT_PIPAPO_AVX2_AND(8,   2,  3);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(9,  lt, 6, pg[6], bsize);
+			NFT_PIPAPO_AVX2_AND(10,  4,  5);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(11, lt, 7, pg[7], bsize);
+			NFT_PIPAPO_AVX2_AND(12,  6,  7);
+			NFT_PIPAPO_AVX2_AND(13,  8,  9);
+			NFT_PIPAPO_AVX2_AND(14, 10, 11);
+
+			/* Stall */
+			NFT_PIPAPO_AVX2_AND(1,  12, 13);
+			NFT_PIPAPO_AVX2_AND(1,   1, 14);
+		}
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 1);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup12() - AVX2-based lookup for 12 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 48-bit fields (e.g. MAC addresses/EUI-48).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup12(unsigned long *map, unsigned long *fill,
+				    unsigned long *lt,
+				    union nft_pipapo_map_bucket *mt,
+				    unsigned long bsize, const u8 *pkt,
+				    bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[12] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf,
+		      pkt[2] >> 4, pkt[2] & 0xf, pkt[3] >> 4, pkt[3] & 0xf,
+		      pkt[4] >> 4, pkt[4] & 0xf, pkt[5] >> 4, pkt[5] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (!first)
+			NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt,  0,  pg[0], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt,  1,  pg[1], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt,  2,  pg[2], bsize);
+
+		if (!first) {
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing);
+			NFT_PIPAPO_AVX2_AND(1, 1, 0);
+		}
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt,  3,  pg[3], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt,  4,  pg[4], bsize);
+		NFT_PIPAPO_AVX2_AND(6,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt,  5,  pg[5], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt,  6,  pg[6], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   1,  4);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt,  7,  pg[7], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  5,  6);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt,  8,  pg[8], bsize);
+		NFT_PIPAPO_AVX2_AND(13,  7,  8);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(14, lt,  9,  pg[9], bsize);
+
+		NFT_PIPAPO_AVX2_AND(0,   9, 10);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt, 10,  pg[10], bsize);
+		NFT_PIPAPO_AVX2_AND(2,  11, 12);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 11,  pg[11], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  13, 14);
+		NFT_PIPAPO_AVX2_AND(5,   0,  1);
+
+		NFT_PIPAPO_AVX2_AND(6,   2,  3);
+
+		/* Stalls */
+		NFT_PIPAPO_AVX2_AND(7,   4,  5);
+		NFT_PIPAPO_AVX2_AND(8,   6,  7);
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(8, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 8);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup32() - AVX2-based lookup for 32 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 128-bit fields (i.e. IPv6 addresses).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup32(unsigned long *map, unsigned long *fill,
+				    unsigned long *lt,
+				    union nft_pipapo_map_bucket *mt,
+				    unsigned long bsize, const u8 *pkt,
+				    bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[32] = {  pkt[0] >> 4,  pkt[0] & 0xf,  pkt[1] >> 4,  pkt[1] & 0xf,
+		       pkt[2] >> 4,  pkt[2] & 0xf,  pkt[3] >> 4,  pkt[3] & 0xf,
+		       pkt[4] >> 4,  pkt[4] & 0xf,  pkt[5] >> 4,  pkt[5] & 0xf,
+		       pkt[6] >> 4,  pkt[6] & 0xf,  pkt[7] >> 4,  pkt[7] & 0xf,
+		       pkt[8] >> 4,  pkt[8] & 0xf,  pkt[9] >> 4,  pkt[9] & 0xf,
+		      pkt[10] >> 4, pkt[10] & 0xf, pkt[11] >> 4, pkt[11] & 0xf,
+		      pkt[12] >> 4, pkt[12] & 0xf, pkt[13] >> 4, pkt[13] & 0xf,
+		      pkt[14] >> 4, pkt[14] & 0xf, pkt[15] >> 4, pkt[15] & 0xf,
+		    };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (!first)
+			NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt,  0,  pg[0], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt,  1,  pg[1], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt,  2,  pg[2], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt,  3,  pg[3], bsize);
+		if (!first) {
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing);
+			NFT_PIPAPO_AVX2_AND(1, 1, 0);
+		}
+
+		NFT_PIPAPO_AVX2_AND(5,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt,  4,  pg[4], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt,  5,  pg[5], bsize);
+		NFT_PIPAPO_AVX2_AND(8,   1,  4);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(9,  lt,  6,  pg[6], bsize);
+		NFT_PIPAPO_AVX2_AND(10,  5,  6);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(11, lt,  7,  pg[7], bsize);
+		NFT_PIPAPO_AVX2_AND(12,  7,  8);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(13, lt,  8,  pg[8], bsize);
+		NFT_PIPAPO_AVX2_AND(14,  9, 10);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt,  9,  pg[9], bsize);
+		NFT_PIPAPO_AVX2_AND(1,  11, 12);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt, 10, pg[10], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 11, pg[11], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  13, 14);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt, 12, pg[12], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt, 13, pg[13], bsize);
+		NFT_PIPAPO_AVX2_AND(7,   0,  1);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt, 14, pg[14], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 15, pg[15], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  4,  5);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt, 16, pg[16], bsize);
+		NFT_PIPAPO_AVX2_AND(13,  6,  7);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(14, lt, 17, pg[17], bsize);
+
+		NFT_PIPAPO_AVX2_AND(0,   8,  9);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt, 18, pg[18], bsize);
+		NFT_PIPAPO_AVX2_AND(2,  10, 11);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 19, pg[19], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  12, 13);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt, 20, pg[20], bsize);
+		NFT_PIPAPO_AVX2_AND(6,  14,  0);
+		NFT_PIPAPO_AVX2_AND(7,   1,  2);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt, 21, pg[21], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   3,  4);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 22, pg[22], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  5,  6);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt, 23, pg[23], bsize);
+		NFT_PIPAPO_AVX2_AND(13,  7,  8);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(14, lt, 24, pg[24], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt, 25, pg[25], bsize);
+		NFT_PIPAPO_AVX2_AND(1,   9, 10);
+		NFT_PIPAPO_AVX2_AND(2,  11, 12);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 26, pg[26], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  13, 14);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt, 27, pg[27], bsize);
+		NFT_PIPAPO_AVX2_AND(6,   0,  1);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt, 28, pg[28], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt, 29, pg[29], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 30, pg[30], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  4,  5);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt, 31, pg[31], bsize);
+
+		NFT_PIPAPO_AVX2_AND(0,   6,  7);
+		NFT_PIPAPO_AVX2_AND(1,   8,  9);
+		NFT_PIPAPO_AVX2_AND(2,  10, 11);
+		NFT_PIPAPO_AVX2_AND(3,  12,  0);
+
+		/* Stalls */
+		NFT_PIPAPO_AVX2_AND(4,   1,  2);
+		NFT_PIPAPO_AVX2_AND(5,   3,  4);
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(5, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 5);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup_noavx2() - Fallback function for uncommon field sizes
+ * @f:		Field to be matched
+ * @res:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @rp:		Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * This function should never be called, but is provided for the case the field
+ * size doesn't match any of the known data types. Matching rate is
+ * substantially lower than AVX2 routines.
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup_noavx2(struct nft_pipapo_field *f,
+					 unsigned long *res,
+					 unsigned long *fill, unsigned long *lt,
+					 union nft_pipapo_map_bucket *mt,
+					 unsigned long bsize, const u8 *rp,
+					 bool first, bool last, int offset)
+{
+	int i, ret = -1, b;
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+
+	if (first)
+		memset(res, 0xff, bsize * sizeof(*res));
+
+	for (i = offset; i < bsize; i++) {
+		pipapo_and_field_buckets(f, res, rp);
+
+		b = pipapo_refill(res, bsize, f->rules, fill, mt, last);
+
+		if (last)
+			return b;
+
+		if (ret == -1)
+			ret = b / XSAVE_YMM_SIZE;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_estimate() - Set size, space and lookup complexity
+ * @desc:	Set description, initial element count used here
+ * @features:	Flags: NFT_SET_SUBKEY needs to be there
+ * @est:	Storage for estimation data
+ *
+ * Return: true if @features match and AVX2 is available, false otherwise.
+ */
+bool nft_pipapo_avx2_estimate(const struct nft_set_desc *desc, u32 features,
+			      struct nft_set_estimate *est)
+{
+	if (!(features & NFT_SET_SUBKEY))
+		return false;
+
+	if (!boot_cpu_has(X86_FEATURE_AVX2) || !boot_cpu_has(X86_FEATURE_AVX))
+		return false;
+
+	est->size = pipapo_estimate_size(desc->size);
+	est->lookup = NFT_SET_CLASS_O_LOG_N;
+	est->space = NFT_SET_CLASS_O_N;
+
+	return true;
+}
+
+/**
+ * nft_pipapo_avx2_lookup() - Lookup function for AVX2 implementation
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @key:	nftables API element representation containing key data
+ * @ext:	nftables API extension pointer, filled with matching reference
+ *
+ * For more details, see DOC: Theory of Operation in nft_set_pipapo.c.
+ *
+ * This implementation exploits the repetitive characteristic of the algorithm
+ * to provide a fast, vectorised version using the AVX2 SIMD instruction set.
+ *
+ * Return: true on match, false otherwise.
+ */
+bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+			    const u32 *key, const struct nft_set_ext **ext)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	unsigned long *res, *fill, *scratch;
+	u8 genmask = nft_genmask_cur(net);
+	const u8 *rp = (const u8 *)key;
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	bool map_index;
+	int i, ret = 0;
+
+	m = rcu_dereference(priv->match);
+
+	/* This also protects access to all data related to scratch maps */
+	kernel_fpu_begin();
+
+	if (unlikely(!m || !*raw_cpu_ptr(m->scratch))) {
+		kernel_fpu_end();
+		return false;
+	}
+
+	scratch = *raw_cpu_ptr(m->scratch_aligned);
+	map_index = raw_cpu_read(nft_pipapo_avx2_scratch_index);
+
+	res  = scratch + (map_index ? m->bsize_max : 0);
+	fill = scratch + (map_index ? 0 : m->bsize_max);
+
+	/* Starting map doesn't need to be set for this implementation */
+
+	nft_pipapo_avx2_prepare();
+
+next_match:
+	nft_pipapo_for_each_field(f, i, m) {
+		bool last = i == m->field_count - 1, first = !i;
+
+#define NFT_SET_PIPAPO_AVX2_LOOKUP(n)					\
+		(ret = nft_pipapo_avx2_lookup ##n(res, fill,		\
+						  f->lt_aligned, f->mt,	\
+						  f->bsize, rp,		\
+						  first, last, ret))
+
+		if (f->groups == 2) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(2);
+		} else if (f->groups == 4) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(4);
+		} else if (f->groups == 8) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(8);
+		} else if (f->groups == 12) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(12);
+		} else if (f->groups == 32) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(32);
+		} else {
+			ret = nft_pipapo_avx2_lookup_noavx2(f, res, fill,
+							    f->lt_aligned,
+							    f->mt, f->bsize,
+							    rp,
+							    first, last, ret);
+		}
+#undef NFT_SET_PIPAPO_AVX2_LOOKUP
+
+		if (ret < 0)
+			goto out;
+
+		if (last) {
+			*ext = &f->mt[ret].e->ext;
+			if (unlikely(nft_set_elem_expired(*ext) ||
+				     !nft_set_elem_active(*ext, genmask))) {
+				ret = 0;
+				goto next_match;
+			}
+
+			goto out;
+		}
+
+		map_index = !map_index;
+		swap(res, fill);
+		rp += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+	}
+
+out:
+	raw_cpu_write(nft_pipapo_avx2_scratch_index, map_index);
+	kernel_fpu_end();
+
+	return ret >= 0;
+}
diff --git a/net/netfilter/nft_set_pipapo_avx2.h b/net/netfilter/nft_set_pipapo_avx2.h
new file mode 100644
index 000000000000..396caf7bfca8
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo_avx2.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _NFT_SET_PIPAPO_AVX2_H
+#define _NFT_SET_PIPAPO_AVX2_H
+
+#ifdef CONFIG_AS_AVX2
+#include <asm/fpu/xstate.h>
+#define NFT_PIPAPO_ALIGN	(XSAVE_YMM_SIZE / BITS_PER_BYTE)
+
+bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+			    const u32 *key, const struct nft_set_ext **ext);
+bool nft_pipapo_avx2_estimate(const struct nft_set_desc *desc, u32 features,
+			      struct nft_set_estimate *est);
+#endif /* CONFIG_AS_AVX2 */
+
+#endif /* _NFT_SET_PIPAPO_AVX2_H */
-- 
2.20.1


* Re: [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-22 13:40 ` [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields Stefano Brivio
@ 2019-11-23 20:01   ` Pablo Neira Ayuso
  2019-11-25  9:30     ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-23 20:01 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

[-- Attachment #1: Type: text/plain, Size: 3622 bytes --]

Hi Stefano,

On Fri, Nov 22, 2019 at 02:40:00PM +0100, Stefano Brivio wrote:
[...]
> diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
> index bb9b049310df..f8dbeac14898 100644
> --- a/include/uapi/linux/netfilter/nf_tables.h
> +++ b/include/uapi/linux/netfilter/nf_tables.h
> @@ -48,6 +48,7 @@ enum nft_registers {
>  
>  #define NFT_REG_SIZE	16
>  #define NFT_REG32_SIZE	4
> +#define NFT_REG32_COUNT	(NFT_REG32_15 - NFT_REG32_00 + 1)
>  
>  /**
>   * enum nft_verdicts - nf_tables internal verdicts
> @@ -275,6 +276,7 @@ enum nft_rule_compat_attributes {
>   * @NFT_SET_TIMEOUT: set uses timeouts
>   * @NFT_SET_EVAL: set can be updated from the evaluation path
>   * @NFT_SET_OBJECT: set contains stateful objects
> + * @NFT_SET_SUBKEY: set uses subkeys to map intervals for multiple fields
>   */
>  enum nft_set_flags {
>  	NFT_SET_ANONYMOUS		= 0x1,
> @@ -284,6 +286,7 @@ enum nft_set_flags {
>  	NFT_SET_TIMEOUT			= 0x10,
>  	NFT_SET_EVAL			= 0x20,
>  	NFT_SET_OBJECT			= 0x40,
> +	NFT_SET_SUBKEY			= 0x80,
>  };
>  
>  /**
> @@ -309,6 +312,17 @@ enum nft_set_desc_attributes {
>  };
>  #define NFTA_SET_DESC_MAX	(__NFTA_SET_DESC_MAX - 1)
>  
> +/**
> + * enum nft_set_subkey_attributes - subkeys for multiple ranged fields
> + *
> + * @NFTA_SET_SUBKEY_LEN: length of single field, in bits (NLA_U32)
> + */
> +enum nft_set_subkey_attributes {

Missing NFTA_SET_SUBKEY_UNSPEC here.

This is not a problem as long as nla_parse_nested*() is not used, as
in your case, but it's probably good to have for consistency, in case
such a function needs to be used in the future.

> +	NFTA_SET_SUBKEY_LEN,
> +	__NFTA_SET_SUBKEY_MAX
> +};
> +#define NFTA_SET_SUBKEY_MAX	(__NFTA_SET_SUBKEY_MAX - 1)
> +
>  /**
>   * enum nft_set_attributes - nf_tables set netlink attributes
>   *
> @@ -327,6 +341,7 @@ enum nft_set_desc_attributes {
>   * @NFTA_SET_USERDATA: user data (NLA_BINARY)
>   * @NFTA_SET_OBJ_TYPE: stateful object type (NLA_U32: NFT_OBJECT_*)
>   * @NFTA_SET_HANDLE: set handle (NLA_U64)
> + * @NFTA_SET_SUBKEY: subkeys for multiple ranged fields (NLA_NESTED)
>   */
>  enum nft_set_attributes {
>  	NFTA_SET_UNSPEC,
> @@ -346,6 +361,7 @@ enum nft_set_attributes {
>  	NFTA_SET_PAD,
>  	NFTA_SET_OBJ_TYPE,
>  	NFTA_SET_HANDLE,
> +	NFTA_SET_SUBKEY,

Could you use NFTA_SET_DESC instead for this? The idea is to add the
missing front-end code to parse this new attribute and store the
subkeys length in set->desc.klen[], hence nft_pipapo_init() can just
use the already parsed data. I think this will simplify the code that
I'm seeing in nft_pipapo_init() a bit since no netlink parsing will
be required.

I'm attaching a sketch patch, including also the use of NFTA_LIST_ELEM:

NFTA_SET_DESC
  NFTA_SET_DESC_SIZE
  NFTA_SET_DESC_SUBKEY
     NFTA_LIST_ELEM
       NFTA_SET_SUBKEY_LEN
     NFTA_LIST_ELEM
       NFTA_SET_SUBKEY_LEN
     ...

Just in case there's a need for more fields to describe the subkey in
the future: it's just more boilerplate code for future extensibility.

Another suggestion is to rename NFT_SET_SUBKEY to NFT_SET_CONCAT, to
signal the kernel that userspace wants a data structure that knows how
to deal with concatenations. Although concatenations can be done by
hashtable already, this flag is just interpreted by the kernel as a
hint on what kind of data structure would fit better for what is
needed. The combination of NFT_SET_INTERVAL and NFT_SET_CONCAT
(if you're fine with the rename, of course) is what will make the
kernel select pipapo.

Attaching sketch for the netlink control plane with the changes I've
been describing above, compile-tested only.

Thanks.

[-- Attachment #2: subkeys.patch --]
[-- Type: text/x-diff, Size: 3899 bytes --]

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 2b3e6a2309aa..0b105264cc4f 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -259,11 +259,15 @@ struct nft_set_iter {
  *	@klen: key length
  *	@dlen: data length
  *	@size: number of set elements
+ *	@subkey_len: element subkey lengths
+ *	@num_subkeys: number of subkeys in element
  */
 struct nft_set_desc {
 	unsigned int		klen;
 	unsigned int		dlen;
 	unsigned int		size;
+	u8			subkey_len[NFT_REG32_COUNT];
+	u8			num_subkeys;
 };
 
 /**
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 79ab18b218be..d8ea2e72c960 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -48,6 +48,7 @@ enum nft_registers {
 
 #define NFT_REG_SIZE	16
 #define NFT_REG32_SIZE	4
+#define NFT_REG32_COUNT	(NFT_REG32_15 - NFT_REG32_00 + 1)
 
 /**
  * enum nft_verdicts - nf_tables internal verdicts
@@ -298,13 +299,27 @@ enum nft_set_policies {
 };
 
 /**
+ * enum nft_set_subkey_attributes - subkeys for multiple ranged fields
+ *
+ * @NFTA_SET_SUBKEY_LEN: length of single field, in bits (NLA_U32)
+ */
+enum nft_set_subkey_attributes {
+	NFTA_SET_SUBKEY_UNSPEC,
+	NFTA_SET_SUBKEY_LEN,
+	__NFTA_SET_SUBKEY_MAX
+};
+#define NFTA_SET_SUBKEY_MAX	(__NFTA_SET_SUBKEY_MAX - 1)
+
+/**
  * enum nft_set_desc_attributes - set element description
  *
  * @NFTA_SET_DESC_SIZE: number of elements in set (NLA_U32)
+ * @NFTA_SET_DESC_SUBKEYS: element subkeys in set (NLA_NESTED)
  */
 enum nft_set_desc_attributes {
 	NFTA_SET_DESC_UNSPEC,
 	NFTA_SET_DESC_SIZE,
+	NFTA_SET_DESC_SUBKEYS,
 	__NFTA_SET_DESC_MAX
 };
 #define NFTA_SET_DESC_MAX	(__NFTA_SET_DESC_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index b5051f4dbb26..1de97ec8d73d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3357,6 +3357,7 @@ static const struct nla_policy nft_set_policy[NFTA_SET_MAX + 1] = {
 
 static const struct nla_policy nft_set_desc_policy[NFTA_SET_DESC_MAX + 1] = {
 	[NFTA_SET_DESC_SIZE]		= { .type = NLA_U32 },
+	[NFTA_SET_DESC_SUBKEYS]		= { .type = NLA_NESTED },
 };
 
 static int nft_ctx_init_from_setattr(struct nft_ctx *ctx, struct net *net,
@@ -3763,6 +3764,51 @@ static int nf_tables_getset(struct net *net, struct sock *nlsk,
 	return err;
 }
 
+static const struct nla_policy nft_subkey_policy[NFTA_SET_SUBKEY_MAX + 1] = {
+	[NFTA_SET_SUBKEY_LEN]	= { .type = NLA_U32 },
+};
+
+static int nft_set_desc_subkey_parse(const struct nlattr *attr,
+				     struct nft_set_desc *desc)
+{
+	struct nlattr *tb[NFTA_SET_SUBKEY_MAX + 1];
+	int err;
+
+	err = nla_parse_nested_deprecated(tb, NFTA_SET_SUBKEY_MAX, attr,
+					  nft_subkey_policy, NULL);
+	if (err < 0)
+		return err;
+
+	if (!tb[NFTA_SET_SUBKEY_LEN])
+		return -EINVAL;
+
+	desc->subkey_len[desc->num_subkeys++] =
+		ntohl(nla_get_be32(tb[NFTA_SET_SUBKEY_LEN]));
+
+	return 0;
+}
+
+static int nft_set_desc_subkeys(struct nft_set_desc *desc,
+				const struct nlattr *nla)
+{
+	struct nlattr *attr;
+	int rem, err;
+
+	nla_for_each_nested(attr, nla, rem) {
+		if (nla_type(attr) != NFTA_LIST_ELEM)
+			return -EINVAL;
+
+		if (desc->num_subkeys >= NFT_REG32_COUNT)
+			return -E2BIG;
+
+		err = nft_set_desc_subkey_parse(attr, desc);
+		if (err < 0)
+			return err;
+	}
+
+	return 0;
+}
+
 static int nf_tables_set_desc_parse(struct nft_set_desc *desc,
 				    const struct nlattr *nla)
 {
@@ -3776,8 +3822,10 @@ static int nf_tables_set_desc_parse(struct nft_set_desc *desc,
 
 	if (da[NFTA_SET_DESC_SIZE] != NULL)
 		desc->size = ntohl(nla_get_be32(da[NFTA_SET_DESC_SIZE]));
+	if (da[NFTA_SET_DESC_SUBKEYS])
+		err = nft_set_desc_subkeys(desc, da[NFTA_SET_DESC_SUBKEYS]);
 
-	return 0;
+	return err;
 }
 
 static int nf_tables_newset(struct net *net, struct sock *nlsk,

* Re: [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges
  2019-11-22 13:39 [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (7 preceding siblings ...)
  2019-11-22 13:40 ` [PATCH nf-next v2 8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
@ 2019-11-23 20:05 ` Pablo Neira Ayuso
  2019-11-25  9:31   ` Stefano Brivio
  8 siblings, 1 reply; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-23 20:05 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Fri, Nov 22, 2019 at 02:39:59PM +0100, Stefano Brivio wrote:
[...]
> Patch 1/8 implements the needed UAPI bits: additions to the existing
> interface are kept to a minimum by recycling existing concepts for
> both ranging and concatenation, as suggested by Florian.
> 
> Patch 2/8 adds a new bitmap operation that copies the source bitmap
> onto the destination while removing a given region, and is needed to
> delete regions of arrays mapping between lookup tables.
> 
> Patch 3/8 is the actual set implementation.
> 
> Patch 4/8 introduces selftests for the new implementation.
[...]

After talking to Florian, I'm inclined to merge upstream up to patch
4/8 in this merge window, once the UAPI discussion is sorted out.

Thanks.

* Re: [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-23 20:01   ` Pablo Neira Ayuso
@ 2019-11-25  9:30     ` Stefano Brivio
  2019-11-25  9:58       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-25  9:30 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

Hi Pablo,

On Sat, 23 Nov 2019 21:01:08 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi Stefano,
> 
> On Fri, Nov 22, 2019 at 02:40:00PM +0100, Stefano Brivio wrote:
> [...]
> > diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
> > index bb9b049310df..f8dbeac14898 100644
> > --- a/include/uapi/linux/netfilter/nf_tables.h
> > +++ b/include/uapi/linux/netfilter/nf_tables.h
> > @@ -48,6 +48,7 @@ enum nft_registers {
> >  
> >  #define NFT_REG_SIZE	16
> >  #define NFT_REG32_SIZE	4
> > +#define NFT_REG32_COUNT	(NFT_REG32_15 - NFT_REG32_00 + 1)
> >  
> >  /**
> >   * enum nft_verdicts - nf_tables internal verdicts
> > @@ -275,6 +276,7 @@ enum nft_rule_compat_attributes {
> >   * @NFT_SET_TIMEOUT: set uses timeouts
> >   * @NFT_SET_EVAL: set can be updated from the evaluation path
> >   * @NFT_SET_OBJECT: set contains stateful objects
> > + * @NFT_SET_SUBKEY: set uses subkeys to map intervals for multiple fields
> >   */
> >  enum nft_set_flags {
> >  	NFT_SET_ANONYMOUS		= 0x1,
> > @@ -284,6 +286,7 @@ enum nft_set_flags {
> >  	NFT_SET_TIMEOUT			= 0x10,
> >  	NFT_SET_EVAL			= 0x20,
> >  	NFT_SET_OBJECT			= 0x40,
> > +	NFT_SET_SUBKEY			= 0x80,
> >  };
> >  
> >  /**
> > @@ -309,6 +312,17 @@ enum nft_set_desc_attributes {
> >  };
> >  #define NFTA_SET_DESC_MAX	(__NFTA_SET_DESC_MAX - 1)
> >  
> > +/**
> > + * enum nft_set_subkey_attributes - subkeys for multiple ranged fields
> > + *
> > + * @NFTA_SET_SUBKEY_LEN: length of single field, in bits (NLA_U32)
> > + */
> > +enum nft_set_subkey_attributes {  
> 
> Missing NFTA_SET_SUBKEY_UNSPEC here.
> 
> This is not a problem as long as nla_parse_nested*() is not used, as
> in your case, but it's probably good to have for consistency, in case
> such a function needs to be used in the future.
> 
> > +	NFTA_SET_SUBKEY_LEN,
> > +	__NFTA_SET_SUBKEY_MAX
> > +};
> > +#define NFTA_SET_SUBKEY_MAX	(__NFTA_SET_SUBKEY_MAX - 1)
> > +
> >  /**
> >   * enum nft_set_attributes - nf_tables set netlink attributes
> >   *
> > @@ -327,6 +341,7 @@ enum nft_set_desc_attributes {
> >   * @NFTA_SET_USERDATA: user data (NLA_BINARY)
> >   * @NFTA_SET_OBJ_TYPE: stateful object type (NLA_U32: NFT_OBJECT_*)
> >   * @NFTA_SET_HANDLE: set handle (NLA_U64)
> > + * @NFTA_SET_SUBKEY: subkeys for multiple ranged fields (NLA_NESTED)
> >   */
> >  enum nft_set_attributes {
> >  	NFTA_SET_UNSPEC,
> > @@ -346,6 +361,7 @@ enum nft_set_attributes {
> >  	NFTA_SET_PAD,
> >  	NFTA_SET_OBJ_TYPE,
> >  	NFTA_SET_HANDLE,
> > +	NFTA_SET_SUBKEY,  
> 
> Could you use NFTA_SET_DESC instead for this? The idea is to add the
> missing front-end code to parse this new attribute and store the
> subkeys length in set->desc.klen[], hence nft_pipapo_init() can just
> use the already parsed data.

Logically, I think it makes sense. I'll try to implement this in nft
and libnftnl and see if some fundamental issue pops up there.

> I think this will simplify the code that I'm seeing in
> nft_pipapo_init() a bit since no netlink parsing will be required.

I don't think it makes a real difference there, because the actual
parsing parts are rather limited:

	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
	[...]
		if (nla_len(attr) != sizeof(klen) ||
		    nla_type(attr) != NFTA_SET_SUBKEY_LEN)
			return -EINVAL;
	}

	[...]

	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
		klen = ntohl(nla_get_be32(attr));
	[...]
	}

the rest is validations (specific for this set type):

	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
		if (++field_count >= NFT_PIPAPO_MAX_FIELDS)
			return -EINVAL;
	[...]
	}

	[...]

	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
	[...]
		if (!klen || klen % NFT_PIPAPO_GROUP_BITS)
			goto out_free;

		if (klen > NFT_PIPAPO_MAX_BITS)
			goto out_free;
	[...]
	}

and calculations (also specific):

	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
		if (++field_count >= NFT_PIPAPO_MAX_FIELDS)
	[...]
	}

	nla_for_each_nested(attr, nla[NFTA_SET_SUBKEY], rem) {
	[...]
		priv->groups += f->groups = klen / NFT_PIPAPO_GROUP_BITS;
		priv->width += round_up(klen / BITS_PER_BYTE, sizeof(u32));
	[...]
	}

that we would still need.
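
To make the point concrete, the validations and calculations listed above can
be sketched as a stand-alone function working on field lengths that were
already parsed, which is roughly what would remain in the set type if parsing
moved to the front-end. This is an illustration only: the sketch_* names are
hypothetical, and the three limits below are assumed values standing in for
NFT_PIPAPO_GROUP_BITS, NFT_PIPAPO_MAX_BITS and NFT_PIPAPO_MAX_FIELDS.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only: check them against the series */
#define SKETCH_GROUP_BITS	4	/* bits per lookup group */
#define SKETCH_MAX_BITS		128	/* widest field, e.g. IPv6 address */
#define SKETCH_MAX_FIELDS	16	/* upper bound on concatenated fields */

struct sketch_layout {
	unsigned int groups;	/* total number of groups over all fields */
	unsigned int width;	/* key width in bytes, fields padded to 32 bits */
};

/* Return 0 and fill @out on success, -1 if the description is invalid */
static int sketch_layout_from_lens(const uint8_t *len_bits, size_t count,
				   struct sketch_layout *out)
{
	size_t i;

	if (!count || count > SKETCH_MAX_FIELDS)
		return -1;

	out->groups = 0;
	out->width = 0;

	for (i = 0; i < count; i++) {
		unsigned int klen = len_bits[i];
		unsigned int bytes = klen / 8;

		/* Same checks as in the excerpts: non-zero, whole groups */
		if (!klen || klen % SKETCH_GROUP_BITS)
			return -1;

		if (klen > SKETCH_MAX_BITS)
			return -1;

		out->groups += klen / SKETCH_GROUP_BITS;
		out->width += (bytes + 3) / 4 * 4;	/* round up to 32 bits */
	}

	return 0;
}
```

For a 32-bit address concatenated with a 16-bit port, this gives 12 groups
and a padded key width of 8 bytes.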

> I'm attaching a sketch patch, including also the use of NFTA_LIST_ELEM:
> 
> NFTA_SET_DESC
>   NFTA_SET_DESC_SIZE
>   NFTA_SET_DESC_SUBKEY
>      NFTA_LIST_ELEM
>        NFTA_SET_SUBKEY_LEN
>      NFTA_LIST_ELEM
>        NFTA_SET_SUBKEY_LEN
>      ...
> 
> Just in case there's a need for more fields to describe the subkey in
> the future: it's just more boilerplate code for future extensibility.

Thanks! I'll play with it and see if I can fit all the pieces.

> Another suggestion is to rename NFT_SET_SUBKEY to NFT_SET_CONCAT, to
> signal the kernel that userspace wants a data structure that knows how
> to deal with concatenations. Although concatenations can be done by
> hashtable already, this flag is just interpreted by the kernel as a
> hint on what kind of data structure would fit better for what is
> needed. The combination of NFT_SET_INTERVAL and NFT_SET_CONCAT
> (if you're fine with the rename, of course) is what will make the
> kernel select pipapo.

I think that NFT_SET_CONCAT as you propose is conceptually a better
fit, but I'm worried about the confusion this might generate for other
set implementations.

That is, a reasonable expectation is that userspace passes
NFT_SET_CONCAT whenever there's a concatenation, and hash
implementations support sets with that flag, too, so I would add it to
the supported feature flags of hash types, and it wouldn't be there for
rbtree.

Right now, that won't break anything: the flag might or might not be
present depending on userspace version, and selection of hash types
would proceed as usual. But I'm worried that we might miss this
subtlety in the future and break concatenation support for older
userspace versions.

Another idea could be that we get rid of this flag altogether: if we
move "subkeys" to set->desc, the ->estimate() functions of rbtree and
pipapo can check for those and refuse or allow set selection
accordingly. I have no idea yet if this introduces further complexity
for nft, because there we would need to decide how to create start/end
elements depending on the existing set description instead of using a
single flag. I can give it a try if it makes sense.

-- 
Stefano


* Re: [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges
  2019-11-23 20:05 ` [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
@ 2019-11-25  9:31   ` Stefano Brivio
  2019-11-25 10:02     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-25  9:31 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Sat, 23 Nov 2019 21:05:18 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Fri, Nov 22, 2019 at 02:39:59PM +0100, Stefano Brivio wrote:
> [...]
> > Patch 1/8 implements the needed UAPI bits: additions to the existing
> > interface are kept to a minimum by recycling existing concepts for
> > both ranging and concatenation, as suggested by Florian.
> > 
> > Patch 2/8 adds a new bitmap operation that copies the source bitmap
> > onto the destination while removing a given region, and is needed to
> > delete regions of arrays mapping between lookup tables.
> > 
> > Patch 3/8 is the actual set implementation.
> > 
> > Patch 4/8 introduces selftests for the new implementation.  
> [...]
> 
> After talking to Florian, I'm inclined to merge upstream up to patch
> 4/8 in this merge window, once the UAPI discussion is sorted out.

Thanks for the update. Let me know if there's some specific topic or
concern I can start addressing for patches 5/8 to 8/8.

-- 
Stefano


* Re: [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-25  9:30     ` Stefano Brivio
@ 2019-11-25  9:58       ` Pablo Neira Ayuso
  2019-11-25 13:26         ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-25  9:58 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, Nov 25, 2019 at 10:30:35AM +0100, Stefano Brivio wrote:
[...]
> Another idea could be that we get rid of this flag altogether: if we
> move "subkeys" to set->desc, the ->estimate() functions of rbtree and
> pipapo can check for those and refuse or allow set selection
> accordingly. I have no idea yet if this introduces further complexity
> for nft, because there we would need to decide how to create start/end
> elements depending on the existing set description instead of using a
> single flag. I can give it a try if it makes sense.

nft_set_desc can probably store a boolean 'concat' that is set if
the NFTA_SET_DESC_SUBKEY attribute is specified. Then, this flag is
not needed and you can just rely on ->estimate() as you describe.

The hashtable will just ignore this description: it does not need it,
even if userspace passes it on, since the interval flag is set.

You just have to update the rbtree to check for desc->concat, if this
is true, then rbtree->estimate() returns false.
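
A rough, stand-alone sketch of the selection logic described above follows.
It is not kernel code: the sk_* names and the reduced description structure
are hypothetical, feature matching is folded into the estimate functions for
brevity, and the fixed preference order in sk_select() is an arbitrary
simplification of comparing estimates.

```c
#include <stdbool.h>

#define SK_SET_INTERVAL	0x4	/* same value as NFT_SET_INTERVAL */

/* Hypothetical, reduced set description: only the proposed boolean */
struct sk_set_desc {
	bool concat;
};

enum sk_backend { SK_NONE, SK_HASH, SK_RBTREE, SK_PIPAPO };

/* Hash ignores the description, but can't do intervals */
static bool sk_hash_estimate(const struct sk_set_desc *desc,
			     unsigned int flags)
{
	(void)desc;

	return !(flags & SK_SET_INTERVAL);
}

/* rbtree refuses sets described as concatenations */
static bool sk_rbtree_estimate(const struct sk_set_desc *desc,
			       unsigned int flags)
{
	(void)flags;

	return !desc->concat;
}

/* pipapo wants both intervals and a concatenation description */
static bool sk_pipapo_estimate(const struct sk_set_desc *desc,
			       unsigned int flags)
{
	return desc->concat && (flags & SK_SET_INTERVAL);
}

static enum sk_backend sk_select(const struct sk_set_desc *desc,
				 unsigned int flags)
{
	if (sk_pipapo_estimate(desc, flags))
		return SK_PIPAPO;
	if (sk_hash_estimate(desc, flags))
		return SK_HASH;
	if (sk_rbtree_estimate(desc, flags))
		return SK_RBTREE;

	return SK_NONE;
}
```

With this arrangement, older userspace that never sends the description keeps
getting hash or rbtree exactly as before, and only an interval set carrying a
concatenation description ends up on pipapo.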

BTW, then probably you can rename this attribute to
NFT_SET_DESC_CONCAT?

Thanks!

* Re: [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges
  2019-11-25  9:31   ` Stefano Brivio
@ 2019-11-25 10:02     ` Pablo Neira Ayuso
  2019-11-25 13:36       ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-25 10:02 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, Nov 25, 2019 at 10:31:06AM +0100, Stefano Brivio wrote:
> On Sat, 23 Nov 2019 21:05:18 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > On Fri, Nov 22, 2019 at 02:39:59PM +0100, Stefano Brivio wrote:
> > [...]
> > > Patch 1/8 implements the needed UAPI bits: additions to the existing
> > > interface are kept to a minimum by recycling existing concepts for
> > > both ranging and concatenation, as suggested by Florian.
> > > 
> > > Patch 2/8 adds a new bitmap operation that copies the source bitmap
> > > onto the destination while removing a given region, and is needed to
> > > delete regions of arrays mapping between lookup tables.
> > > 
> > > Patch 3/8 is the actual set implementation.
> > > 
> > > Patch 4/8 introduces selftests for the new implementation.  
> > [...]
> > 
> > After talking to Florian, I'm inclined to merge upstream up to patch
> > 4/8 in this merge window, once the UAPI discussion is sorted out.
> 
> Thanks for the update. Let me know if there's some specific topic or
> concern I can start addressing for patches 5/8 to 8/8.

Merge window is now closed, I was trying to get the bare minimum in
this round. Now we have a bit more time to merge this upstream.

BTW, do you have numbers comparing the AVX2 version with the C code? I
had a quick look at your numbers, but it's not clear to me whether this
comparison is included there.

Thanks.

* Re: [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-25  9:58       ` Pablo Neira Ayuso
@ 2019-11-25 13:26         ` Stefano Brivio
  2019-11-25 14:30           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-25 13:26 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, 25 Nov 2019 10:58:17 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Mon, Nov 25, 2019 at 10:30:35AM +0100, Stefano Brivio wrote:
> [...]
> > Another idea could be that we get rid of this flag altogether: if we
> > move "subkeys" to set->desc, the ->estimate() functions of rbtree and
> > pipapo can check for those and refuse or allow set selection
> > accordingly. I have no idea yet if this introduces further complexity
> > for nft, because there we would need to decide how to create start/end
> > elements depending on the existing set description instead of using a
> > single flag. I can give it a try if it makes sense.  
> 
> nft_set_desc can probably store a boolean 'concat' that is set on if
> the NFTA_SET_DESC_SUBKEY attribute is specified. Then, this flag is
> not needed and you can just rely on ->estimate() as you describe.

I could even just check desc->num_subkeys from your patch then, without
adding another field to nft_set_desc. Too ugly?

> The hashtable will just ignore this description, it does not need the
> description even if userspace pass it on since the interval flag is
> set on.
> 
> You just have to update the rbtree to check for desc->concat, if this
> is true, then rbtree->estimate() returns false.

Yes, I think it all makes sense, thanks for detailing the idea. I'll get
to this in a few hours.

> BTW, then probably you can rename this attribute to
> NFT_SET_DESC_CONCAT?

It would include sizes, though. What about NFT_SET_DESC_SUBSIZE or
NFT_SET_DESC_FIELD_SIZE?

-- 
Stefano


* Re: [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges
  2019-11-25 10:02     ` Pablo Neira Ayuso
@ 2019-11-25 13:36       ` Stefano Brivio
  0 siblings, 0 replies; 24+ messages in thread
From: Stefano Brivio @ 2019-11-25 13:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, 25 Nov 2019 11:02:14 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> BTW, do you have numbers comparing the AVX2 version with the C code? I
> had a quick look at your numbers, but it's not clear to me whether this
> comparison is included there.

No, sorry, I didn't report that anywhere, I probably should have in
the commit messages for 4/8 and 5/8. This was from v1 at 4/8, single
thread on AMD Epyc 7351, C implementation without unrolled loops:

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):               9971887pps
    baseline hash (non-ranged entries):             5991032pps
    baseline rbtree (match on first field only):    2666255pps
    set with  1000 full, ranged entries:            2220404pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10004499pps
    baseline hash (non-ranged entries):             6011221pps
    baseline rbtree (match on first field only):    4035566pps
    set with   100 full, ranged entries:            4018240pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9497500pps
    baseline hash (non-ranged entries):             4685436pps
    baseline rbtree (match on first field only):    1354978pps
    set with  1000 full, ranged entries:            1052188pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10749256pps
    baseline hash (non-ranged entries):             6774103pps
    baseline rbtree (match on first field only):    2819211pps
    set with 30000 full, ranged entries:             283492pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9463935pps
    baseline hash (non-ranged entries):             3777039pps
    baseline rbtree (match on first field only):    2943527pps
    set with    10 full, ranged entries:            1927899pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9502200pps
    baseline hash (non-ranged entries):             3637739pps
    baseline rbtree (match on first field only):    1342323pps
    set with  1000 full, ranged entries:             753960pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10065715pps
    baseline hash (non-ranged entries):             5082895pps
    baseline rbtree (match on first field only):    2677391pps
    set with  1000 full, ranged entries:            1215104pps

I would re-run tests on v3 patches and include the comparisons in
commit messages. 

By the way, as you can see, even though the comparison with rbtree is
unfair (comparing > 1 fields adds substantial complexity), without AVX2
it doesn't scale as nicely. I plan to propose some optimisations that
should substantially improve the non-vectorised case, but what I have
in mind right now is a bit convoluted and I would skip it in this
initial submission.
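
To make the comparison with the rbtree baseline easier to read, the
ratios can be computed directly from the figures above. The sketch below
is only a convenience: the struct and function names are made up for
illustration, and the numbers are copied from the "baseline rbtree" and
"set with ... entries" lines of this report.

```c
#include <stdio.h>

/* One row per test case, taken from the report above (packets per second). */
struct result {
	const char *fields;
	unsigned int entries;
	double rbtree_pps;	/* baseline rbtree, first field only */
	double set_pps;		/* new set, full ranged entries */
};

static const struct result results[] = {
	{ "net,port",             1000, 2666255, 2220404 },
	{ "port,net",              100, 4035566, 4018240 },
	{ "net6,port",            1000, 1354978, 1052188 },
	{ "port,proto",          30000, 2819211,  283492 },
	{ "net6,port,mac",          10, 2943527, 1927899 },
	{ "net6,port,mac,proto",  1000, 1342323,  753960 },
	{ "net,mac",              1000, 2677391, 1215104 },
};

/* Throughput of the new set relative to the rbtree baseline. */
static double relative_throughput(const struct result *r)
{
	return r->set_pps / r->rbtree_pps;
}

/* Print one line per test case with the ratio as a percentage. */
static void print_ratios(void)
{
	size_t i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%-22s %6u entries: %5.1f%% of rbtree\n",
		       results[i].fields, results[i].entries,
		       100.0 * relative_throughput(&results[i]));
}
```

The port,proto case with 30000 entries is the one where the
non-vectorised lookup falls furthest behind the baseline.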

-- 
Stefano


* Re: [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-25 13:26         ` Stefano Brivio
@ 2019-11-25 14:30           ` Pablo Neira Ayuso
  2019-11-25 14:54             ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-25 14:30 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, Nov 25, 2019 at 02:26:16PM +0100, Stefano Brivio wrote:
> On Mon, 25 Nov 2019 10:58:17 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > On Mon, Nov 25, 2019 at 10:30:35AM +0100, Stefano Brivio wrote:
> > [...]
> > > Another idea could be that we get rid of this flag altogether: if we
> > > move "subkeys" to set->desc, the ->estimate() functions of rbtree and
> > > pipapo can check for those and refuse or allow set selection
> > > accordingly. I have no idea yet if this introduces further complexity
> > > for nft, because there we would need to decide how to create start/end
> > > elements depending on the existing set description instead of using a
> > > single flag. I can give it a try if it makes sense.  
> > 
> > nft_set_desc can probably store a boolean 'concat' that is set on if
> > the NFTA_SET_DESC_SUBKEY attribute is specified. Then, this flag is
> > not needed and you can just rely on ->estimate() as you describe.
> 
> I could even just check desc->num_subkeys from your patch then, without
> adding another field to nft_set_desc. Too ugly?

OK.

> > The hashtable will just ignore this description, it does not need the
> > description even if userspace pass it on since the interval flag is
> > set on.
> > 
> > You just have to update the rbtree to check for desc->concat, if this
> > is true, then rbtree->estimate() returns false.
> 
> Yes, I think it all makes sense, thanks for detailing the idea. I'll get
> to this in a few hours.
> 
> > BTW, then probably you can rename this attribute to
> > NFT_SET_DESC_CONCAT?
> 
> It would include sizes, though. What about NFT_SET_DESC_SUBSIZE or
> NFT_SET_DESC_FIELD_SIZE?

You mean this:

       NFT_SET_DESC_SUBSIZE
          NFT_SET_DESC_FIELD_SIZE
          NFT_SET_DESC_FIELD_SIZE

instead of this:

        NFT_SET_DESC_CONCAT
          NFT_LIST_ELEM
             NFT_SET_DESC_SUBKEY_LEN
          NFT_LIST_ELEM
             NFT_SET_DESC_SUBKEY_LEN

If I described this correctly, your approach is more simple indeed.

However, I don't really have specific requirements for the future
right now. The one below is leaving room to add more subkey fields (to
describe each subkey if that is ever required). My experience is that
leaving room to extend netlink in the future is usually a good idea,
that's all.

Instead of NFT_LIST_ELEM, something like NFT_SET_DESC_SUBKEY should be
fine too, ie.

        NFT_SET_DESC_CONCAT
          NFT_SET_DESC_SUBKEY
             NFT_SET_DESC_SUBKEY_LEN
          NFT_SET_DESC_SUBKEY
             NFT_SET_DESC_SUBKEY_LEN

This netlink stuff is tricky in that it's set in stone once exposed.
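
As a rough illustration of why the extra nesting level leaves room for
extension, here is a stand-alone model of the CONCAT / SUBKEY /
SUBKEY_LEN layout. It does not use the kernel's nla_*() helpers: struct
tlv, tlv_put() and parse_concat() are hypothetical, the attribute
numbers are arbitrary, and the 4-byte padding of real netlink attributes
is left out. The point is only that a parser can skip per-field
attributes it does not know about.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute numbers, named after the layout above. */
enum { DESC_SUBKEY = 1 };	/* inside CONCAT: one container per field */
enum { SUBKEY_LEN = 1 };	/* inside SUBKEY: more types can be added */

/* Simplified attribute header: no alignment padding, host byte order. */
struct tlv {
	uint16_t len;		/* header plus payload, in bytes */
	uint16_t type;
};

/* Append one attribute at offset off, return the offset after it. */
static size_t tlv_put(uint8_t *buf, size_t off, uint16_t type,
		      const void *data, uint16_t dlen)
{
	struct tlv h = { .len = (uint16_t)(sizeof(h) + dlen), .type = type };

	memcpy(buf + off, &h, sizeof(h));
	if (dlen)
		memcpy(buf + off + sizeof(h), data, dlen);

	return off + h.len;
}

/* Read the attributes of one SUBKEY container, return 0 or -1 if malformed. */
static int parse_subkey(const uint8_t *q, size_t qlen, uint8_t *len)
{
	size_t off = 0;
	struct tlv a;

	*len = 0;
	while (off + sizeof(a) <= qlen) {
		memcpy(&a, q + off, sizeof(a));
		if (a.len < sizeof(a) || off + a.len > qlen)
			return -1;
		if (a.type == SUBKEY_LEN && a.len == sizeof(a) + 1)
			*len = q[off + sizeof(a)];
		/* Unknown types are skipped: that is the extension point. */
		off += a.len;
	}

	return 0;
}

/* Walk a CONCAT payload, store one length per field, return field count. */
static int parse_concat(const uint8_t *p, size_t plen, uint8_t *lens, int max)
{
	size_t off = 0;
	struct tlv sub;
	int n = 0;

	while (off + sizeof(sub) <= plen) {
		memcpy(&sub, p + off, sizeof(sub));
		if (sub.len < sizeof(sub) || off + sub.len > plen)
			return -1;
		if (sub.type == DESC_SUBKEY && n < max) {
			if (parse_subkey(p + off + sizeof(sub),
					 sub.len - sizeof(sub), &lens[n]))
				return -1;
			n++;
		}
		off += sub.len;
	}

	return n;
}
```

In the flat variant there would be no container in which a second
per-field attribute could be placed later without changing the meaning
of the existing ones.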

Thanks.

* Re: [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-25 14:30           ` Pablo Neira Ayuso
@ 2019-11-25 14:54             ` Stefano Brivio
  2019-11-25 20:38               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-25 14:54 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, 25 Nov 2019 15:30:58 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Mon, Nov 25, 2019 at 02:26:16PM +0100, Stefano Brivio wrote:
> > On Mon, 25 Nov 2019 10:58:17 +0100
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >   
> > > On Mon, Nov 25, 2019 at 10:30:35AM +0100, Stefano Brivio wrote:
> > > [...]  
> > > > Another idea could be that we get rid of this flag altogether: if we
> > > > move "subkeys" to set->desc, the ->estimate() functions of rbtree and
> > > > pipapo can check for those and refuse or allow set selection
> > > > accordingly. I have no idea yet if this introduces further complexity
> > > > for nft, because there we would need to decide how to create start/end
> > > > elements depending on the existing set description instead of using a
> > > > single flag. I can give it a try if it makes sense.    
> > > 
> > > nft_set_desc can probably store a boolean 'concat' that is set on if
> > > the NFTA_SET_DESC_SUBKEY attribute is specified. Then, this flag is
> > > not needed and you can just rely on ->estimate() as you describe.  
> > 
> > I could even just check desc->num_subkeys from your patch then, without
> > adding another field to nft_set_desc. Too ugly?  
> 
> OK.
> 
> > > The hashtable will just ignore this description, it does not need the
> > > description even if userspace pass it on since the interval flag is
> > > set on.
> > > 
> > > You just have to update the rbtree to check for desc->concat, if this
> > > is true, then rbtree->estimate() returns false.  
> > 
> > Yes, I think it all makes sense, thanks for detailing the idea. I'll get
> > to this in a few hours.
> >   
> > > BTW, then probably you can rename this attribute to
> > > NFT_SET_DESC_CONCAT?  
> > 
> > It would include sizes, though. What about NFT_SET_DESC_SUBSIZE or
> > NFT_SET_DESC_FIELD_SIZE?  
> 
> You mean this:
> 
>        NFT_SET_DESC_SUBSIZE
>           NFT_SET_DESC_FIELD_SIZE
>           NFT_SET_DESC_FIELD_SIZE
> 
> instead of this:
> 
>         NFT_SET_DESC_CONCAT
>           NFT_LIST_ELEM
>              NFT_SET_DESC_SUBKEY_LEN
>           NFT_LIST_ELEM
>              NFT_SET_DESC_SUBKEY_LEN
> 
> If I described this correctly, your approach is more simple indeed.

Ah, yes, that's what I meant, but that's because I didn't understand
your intention in the first place. :) I see now.

> However, I don't really have specific requirements for the future
> right now. The one below is leaving room to add more subkey fields (to
> describe each subkey if that is ever required). My experience is that
> leaving room to extend netlink in the future is usually a good idea,
> that's all.
> 
> Instead of NFT_LIST_ELEM, something like NFT_SET_DESC_SUBKEY should be
> fine too, ie.
> 
>         NFT_SET_DESC_CONCAT
>           NFT_SET_DESC_SUBKEY
>              NFT_SET_DESC_SUBKEY_LEN
>           NFT_SET_DESC_SUBKEY
>              NFT_SET_DESC_SUBKEY_LEN

Actually:

>         NFT_SET_DESC_CONCAT
>           NFT_LIST_ELEM
>              NFT_SET_DESC_SUBKEY_LEN
>           NFT_LIST_ELEM
>              NFT_SET_DESC_SUBKEY_LEN

sounds better to me. Maybe "SUBKEY" starts looking a bit obscure here:
the "SUB" part is already there, the "KEY" part mostly refers to an
implementation detail. What about:

         NFT_SET_DESC_CONCAT
           NFT_LIST_ELEM
              NFT_SET_DESC_LEN
           NFT_LIST_ELEM
              NFT_SET_DESC_LEN

this?

-- 
Stefano


* Re: [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields
  2019-11-25 14:54             ` Stefano Brivio
@ 2019-11-25 20:38               ` Pablo Neira Ayuso
  0 siblings, 0 replies; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-25 20:38 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, Nov 25, 2019 at 03:54:22PM +0100, Stefano Brivio wrote:
> On Mon, 25 Nov 2019 15:30:58 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > On Mon, Nov 25, 2019 at 02:26:16PM +0100, Stefano Brivio wrote:
> > > On Mon, 25 Nov 2019 10:58:17 +0100
> > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > >   
> > > > On Mon, Nov 25, 2019 at 10:30:35AM +0100, Stefano Brivio wrote:
> > > > [...]  
> > > > > Another idea could be that we get rid of this flag altogether: if we
> > > > > move "subkeys" to set->desc, the ->estimate() functions of rbtree and
> > > > > pipapo can check for those and refuse or allow set selection
> > > > > accordingly. I have no idea yet if this introduces further complexity
> > > > > for nft, because there we would need to decide how to create start/end
> > > > > elements depending on the existing set description instead of using a
> > > > > single flag. I can give it a try if it makes sense.    
> > > > 
> > > > nft_set_desc can probably store a boolean 'concat' that is set on if
> > > > the NFTA_SET_DESC_SUBKEY attribute is specified. Then, this flag is
> > > > not needed and you can just rely on ->estimate() as you describe.  
> > > 
> > > I could even just check desc->num_subkeys from your patch then, without
> > > adding another field to nft_set_desc. Too ugly?  
> > 
> > OK.
> > 
> > > > The hashtable will just ignore this description, it does not need the
> > > > description even if userspace pass it on since the interval flag is
> > > > set on.
> > > > 
> > > > You just have to update the rbtree to check for desc->concat, if this
> > > > is true, then rbtree->estimate() returns false.  
> > > 
> > > Yes, I think it all makes sense, thanks for detailing the idea. I'll get
> > > to this in a few hours.
> > >   
> > > > BTW, then probably you can rename this attribute to
> > > > NFT_SET_DESC_CONCAT?  
> > > 
> > > It would include sizes, though. What about NFT_SET_DESC_SUBSIZE or
> > > NFT_SET_DESC_FIELD_SIZE?  
> > 
> > You mean this:
> > 
> >        NFT_SET_DESC_SUBSIZE
> >           NFT_SET_DESC_FIELD_SIZE
> >           NFT_SET_DESC_FIELD_SIZE
> > 
> > instead of this:
> > 
> >         NFT_SET_DESC_CONCAT
> >           NFT_LIST_ELEM
> >              NFT_SET_DESC_SUBKEY_LEN
> >           NFT_LIST_ELEM
> >              NFT_SET_DESC_SUBKEY_LEN
> > 
> > If I described this correctly, your approach is more simple indeed.
> 
> Ah, yes, that's what I meant, but that's because I didn't understand
> your intention in the first place. :) I see now.
> 
> > However, I don't really have specific requirements for the future
> > right now. The one below is leaving room to add more subkey fields (to
> > describe each subkey if that is ever required). My experience is that
> > leaving room to extend netlink in the future is usually a good idea,
> > that's all.
> > 
> > Instead of NFT_LIST_ELEM, something like NFT_SET_DESC_SUBKEY should be
> > fine too, ie.
> > 
> >         NFT_SET_DESC_CONCAT
> >           NFT_SET_DESC_SUBKEY
> >              NFT_SET_DESC_SUBKEY_LEN
> >           NFT_SET_DESC_SUBKEY
> >              NFT_SET_DESC_SUBKEY_LEN
> 
> Actually:
> 
> >         NFT_SET_DESC_CONCAT
> >           NFT_LIST_ELEM
> >              NFT_SET_DESC_SUBKEY_LEN
> >           NFT_LIST_ELEM
> >              NFT_SET_DESC_SUBKEY_LEN
> 
> sounds better to me. Maybe "SUBKEY" starts looking a bit obscure here:
> the "SUB" part is already there, the "KEY" part mostly refers to an
> implementation detail. What about:
> 
>          NFT_SET_DESC_CONCAT
>            NFT_LIST_ELEM
>               NFT_SET_DESC_LEN
>            NFT_LIST_ELEM
>               NFT_SET_DESC_LEN
> 
> this?

I think the _SUBKEY_ infix is fine.

Problem with using NFT_SET_DESC_* for inner (nested) attributes is
that this matches the same prefix as top level netlink attributes.

Note there is NFT_SET_DESC_SIZE. If there is a NFT_SET_DESC_LEN that
is wrapped around NFT_SET_DESC_CONCAT (using same prefix) it might
look a bit misleading.

Having said this, although I started this debate about naming, I don't
have a strong opinion on all these names. As long as the netlink
attribute scheme that is used allows for extensibility in the future,
such as allowing to add more descriptions for each subkey, I'll be fine.

Thanks.

* Re: [PATCH nf-next v2 8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation
  2019-11-22 13:40 ` [PATCH nf-next v2 8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
@ 2019-11-26  6:36   ` kbuild test robot
  0 siblings, 0 replies; 24+ messages in thread
From: kbuild test robot @ 2019-11-26  6:36 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: kbuild-all, Pablo Neira Ayuso, netfilter-devel, Florian Westphal,
	Kadlecsik József, Eric Garver, Phil Sutter

[-- Attachment #1: Type: text/plain, Size: 7225 bytes --]

Hi Stefano,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on nf-next/master]
[cannot apply to v5.4 next-20191125]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Stefano-Brivio/nftables-Set-implementation-for-arbitrary-concatenation-of-ranges/20191124-205742
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git master
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-14) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   net/netfilter/nft_set_pipapo_avx2.c: Assembler messages:
>> net/netfilter/nft_set_pipapo_avx2.c:80: Error: bad register name `%ymm15'
   net/netfilter/nft_set_pipapo_avx2.c:253: Error: bad register name `%ymm15'
   net/netfilter/nft_set_pipapo_avx2.c:332: Error: bad register name `%ymm15'
>> net/netfilter/nft_set_pipapo_avx2.c:382: Error: bad register name `%ymm8'
>> net/netfilter/nft_set_pipapo_avx2.c:383: Error: bad register name `%ymm9'
>> net/netfilter/nft_set_pipapo_avx2.c:384: Error: bad register name `%ymm10'
>> net/netfilter/nft_set_pipapo_avx2.c:385: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:386: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:387: Error: bad register name `%ymm10'
>> net/netfilter/nft_set_pipapo_avx2.c:390: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:403: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:404: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:405: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:406: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:407: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:408: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:409: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:412: Error: bad register name `%ymm12'
>> net/netfilter/nft_set_pipapo_avx2.c:413: Error: bad register name `%ymm14'
   net/netfilter/nft_set_pipapo_avx2.c:429: Error: bad register name `%ymm15'
   net/netfilter/nft_set_pipapo_avx2.c:487: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:488: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:489: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:490: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:491: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:492: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:493: Error: bad register name `%ymm14'
   net/netfilter/nft_set_pipapo_avx2.c:495: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:497: Error: bad register name `%ymm11'
>> net/netfilter/nft_set_pipapo_avx2.c:499: Error: bad register name `%ymm13'
   net/netfilter/nft_set_pipapo_avx2.c:506: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:508: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:509: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:520: Error: bad register name `%ymm15'
   net/netfilter/nft_set_pipapo_avx2.c:583: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:584: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:585: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:586: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:587: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:588: Error: bad register name `%ymm13'
   net/netfilter/nft_set_pipapo_avx2.c:589: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:592: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:595: Error: bad register name `%ymm13'
   net/netfilter/nft_set_pipapo_avx2.c:599: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:600: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:601: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:602: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:603: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:604: Error: bad register name `%ymm13'
   net/netfilter/nft_set_pipapo_avx2.c:605: Error: bad register name `%ymm14'
   net/netfilter/nft_set_pipapo_avx2.c:607: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:609: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:611: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:613: Error: bad register name `%ymm14'
   net/netfilter/nft_set_pipapo_avx2.c:615: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:616: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:617: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:618: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:619: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:620: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:622: Error: bad register name `%ymm14'
   net/netfilter/nft_set_pipapo_avx2.c:624: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:625: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:627: Error: bad register name `%ymm13'
   net/netfilter/nft_set_pipapo_avx2.c:631: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:632: Error: bad register name `%ymm9'
   net/netfilter/nft_set_pipapo_avx2.c:633: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:634: Error: bad register name `%ymm11'
   net/netfilter/nft_set_pipapo_avx2.c:635: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:638: Error: bad register name `%ymm8'
   net/netfilter/nft_set_pipapo_avx2.c:639: Error: bad register name `%ymm10'
   net/netfilter/nft_set_pipapo_avx2.c:640: Error: bad register name `%ymm12'
   net/netfilter/nft_set_pipapo_avx2.c:658: Error: bad register name `%ymm15'

vim +80 net/netfilter/nft_set_pipapo_avx2.c

    71	
    72	/**
    73	 * nft_pipapo_avx2_prepare() - Prepare before main algorithm body
    74	 *
    75	 * This zeroes out ymm15, which is later used whenever we need to clear a
    76	 * memory location, by storing its content into memory.
    77	 */
    78	static void nft_pipapo_avx2_prepare(void)
    79	{
  > 80		NFT_PIPAPO_AVX2_ZERO(15);
    81	}
    82	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 70281 bytes --]

* Re: [PATCH nf-next v2 3/8] nf_tables: Add set type for arbitrary concatenation of ranges
  2019-11-22 13:40 ` [PATCH nf-next v2 3/8] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
@ 2019-11-27  9:29   ` Pablo Neira Ayuso
  2019-11-27 11:02     ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-27  9:29 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

Hi Stefano,

Just started reading, a few initial questions.

On Fri, Nov 22, 2019 at 02:40:02PM +0100, Stefano Brivio wrote:
[...]
> diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
> index 7281895fa6d9..9759257ec8ec 100644
> --- a/include/net/netfilter/nf_tables_core.h
> +++ b/include/net/netfilter/nf_tables_core.h
> @@ -74,6 +74,7 @@ extern struct nft_set_type nft_set_hash_type;
>  extern struct nft_set_type nft_set_hash_fast_type;
>  extern struct nft_set_type nft_set_rbtree_type;
>  extern struct nft_set_type nft_set_bitmap_type;
> +extern struct nft_set_type nft_set_pipapo_type;
>  
>  struct nft_expr;
>  struct nft_regs;
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index 5e9b2eb24349..3f572e5a975e 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -81,7 +81,8 @@ nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
>  		  nft_chain_route.o nf_tables_offload.o
>  
>  nf_tables_set-objs := nf_tables_set_core.o \
> -		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o
> +		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o \
> +		      nft_set_pipapo.o
>  
>  obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
>  obj-$(CONFIG_NF_TABLES_SET)	+= nf_tables_set.o
> diff --git a/net/netfilter/nf_tables_set_core.c b/net/netfilter/nf_tables_set_core.c
> index a9fce8d10051..586b621007eb 100644
> --- a/net/netfilter/nf_tables_set_core.c
> +++ b/net/netfilter/nf_tables_set_core.c
> @@ -9,12 +9,14 @@ static int __init nf_tables_set_module_init(void)
>  	nft_register_set(&nft_set_rhash_type);
>  	nft_register_set(&nft_set_bitmap_type);
>  	nft_register_set(&nft_set_rbtree_type);
> +	nft_register_set(&nft_set_pipapo_type);
>  
>  	return 0;
>  }
>  
>  static void __exit nf_tables_set_module_exit(void)
>  {
> +	nft_unregister_set(&nft_set_pipapo_type);
>  	nft_unregister_set(&nft_set_rbtree_type);
>  	nft_unregister_set(&nft_set_bitmap_type);
>  	nft_unregister_set(&nft_set_rhash_type);
> diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
> new file mode 100644
> index 000000000000..3cad9aedc168
> --- /dev/null
> +++ b/net/netfilter/nft_set_pipapo.c
> @@ -0,0 +1,2197 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/* PIPAPO: PIle PAcket POlicies: set for arbitrary concatenations of ranges
> + *
> + * Copyright (c) 2019 Red Hat GmbH
> + *
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +/**
> + * DOC: Theory of Operation
> + *
> + *
> + * Problem
> + * -------
> + *
> + * Match packet bytes against entries composed of ranged or non-ranged packet
> + * field specifiers, mapping them to arbitrary references. For example:
> + *
> + * ::
> + *
> + *               --- fields --->
> + *      |    [net],[port],[net]... => [reference]
> + *   entries [net],[port],[net]... => [reference]
> + *      |    [net],[port],[net]... => [reference]
> + *      V    ...
> + *
> + * where [net] fields can be IP ranges or netmasks, and [port] fields are port
> + * ranges. Arbitrary packet fields can be matched.
> + *
> + *
> + * Algorithm Overview
> + * ------------------
> + *
> + * This algorithm is loosely inspired by [Ligatti 2010], and fundamentally
> + * relies on the consideration that every contiguous range in a space of b bits
> + * can be converted into b * 2 netmasks, from Theorem 3 in [Rottenstreich 2010],
> + * as also illustrated in Section 9 of [Kogan 2014].
> + *
> + * Classification against a number of entries, that require matching given bits
> + * of a packet field, is performed by grouping those bits in sets of arbitrary
> + * size, and classifying packet bits one group at a time.
> + *
> + * Example:
> + *   to match the source port (16 bits) of a packet, we can divide those 16 bits
> + *   in 4 groups of 4 bits each. Given the entry:
> + *      0000 0001 0101 1001
> + *   and a packet with source port:
> + *      0000 0001 1010 1001
> + *   first and second groups match, but the third doesn't. We conclude that the
> + *   packet doesn't match the given entry.
> + *
> + * Translate the set to a sequence of lookup tables, one per field. Each table
> + * has two dimensions: bit groups to be matched for a single packet field, and
> + * all the possible values of said groups (buckets). Input entries are
> + * represented as one or more rules, depending on the number of composing
> + * netmasks for the given field specifier, and a group match is indicated as a
> + * set bit, with number corresponding to the rule index, in all the buckets
> + * whose value matches the entry for a given group.
> + *
> + * Rules are mapped between fields through an array of x, n pairs, with each
> + * item mapping a matched rule to one or more rules. The position of the pair in
> + * the array indicates the matched rule to be mapped to the next field, x
> + * indicates the first rule index in the next field, and n the amount of
> + * next-field rules the current rule maps to.
> + *
> + * The mapping array for the last field maps to the desired references.
> + *
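
The x, n pairs can be pictured with a small user-space sketch. struct map_pair and map_to_next_field() are hypothetical names, and bitmaps are limited to 32 rules to keep it short; the patch does the equivalent on arrays of longs in pipapo_refill().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one mapping array item: first rule index in the
 * next field, and number of consecutive next-field rules.
 */
struct map_pair {
	unsigned int x;
	unsigned int n;
};

/* For each rule set in 'matched', set the n next-field rules starting at x */
static uint32_t map_to_next_field(uint32_t matched,
				  const struct map_pair *map,
				  unsigned int rules)
{
	uint32_t next = 0;

	assert(rules <= 32);

	for (unsigned int rule = 0; rule < rules; rule++) {
		if (!(matched & (1u << rule)))
			continue;

		assert(map[rule].x + map[rule].n <= 32);

		for (unsigned int i = 0; i < map[rule].n; i++)
			next |= 1u << (map[rule].x + i);
	}

	return next;
}
```

With the pairs { 0, 1 }, { 1, 1 }, { 1, 1 }, a matched bitmap of 0x2 (rule #1) maps to 0x2 in the next field, and 0x5 (rules #0 and #2) maps to 0x3.
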
> + * To match, we perform table lookups using the values of grouped packet bits,
> + * and use a sequence of bitwise operations to progressively evaluate rule
> + * matching.
> + *
> + * A stand-alone, reference implementation, also including notes about possible
> + * future optimisations, is available at:
> + *    https://pipapo.lameexcu.se/
> + *
> + * Insertion
> + * ---------
> + *
> + * - For each packet field:
> + *
> + *   - divide the b packet bits we want to classify into groups of size t,
> + *     obtaining ceil(b / t) groups
> + *
> + *      Example: match on destination IP address, with t = 4: 32 bits, 8 groups
> + *      of 4 bits each
> + *
> + *   - allocate a lookup table with one column ("bucket") for each possible
> + *     value of a group, and with one row for each group
> + *
> + *      Example: 8 groups, 2^4 buckets:
> + *
> + * ::
> + *
> + *                     bucket
> + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> + *        0
> + *        1
> + *        2
> + *        3
> + *        4
> + *        5
> + *        6
> + *        7
> + *
> + *   - map the bits we want to classify for the current field, for a given
> + *     entry, to a single rule for non-ranged and netmask set items, and to one
> + *     or multiple rules for ranges. Ranges are expanded to composing netmasks
> + *     by pipapo_expand().
> + *
> + *      Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048
> + *      - rule #0: 10.0.0.5
> + *      - rule #1: 192.168.1.0/24
> + *      - rule #2: 192.168.2.0/31
> + *
> + *   - insert references to the rules in the lookup table, selecting buckets
> + *     according to bit values of a rule in the given group. This is done by
> + *     pipapo_insert().
> + *
> + *      Example: given:
> + *      - rule #0: 10.0.0.5 mapping to buckets
> + *        < 0 10  0 0   0 0  0 5 >
> + *      - rule #1: 192.168.1.0/24 mapping to buckets
> + *        < 12 0  10 8  0 1  < 0..15 > < 0..15 > >
> + *      - rule #2: 192.168.2.0/31 mapping to buckets
> + *        < 12 0  10 8  0 2  0 < 0..1 > >
> + *
> + *      these bits are set in the lookup table:
> + *
> + * ::
> + *
> + *                     bucket
> + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> + *        0    0                                              1,2
> + *        1   1,2                                      0
> + *        2    0                                      1,2
> + *        3    0                              1,2
> + *        4  0,1,2
> + *        5    0   1   2
> + *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
> + *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
> + *
> + *   - if this is not the last field in the set, fill a mapping array that maps
> + *     rules from the lookup table to rules belonging to the same entry in
> + *     the next lookup table, done by pipapo_map().
> + *
> + *     Note that as rules map to contiguous ranges of rules, given how netmask
> + *     expansion and insertion are performed, &union nft_pipapo_map_bucket
> + *     stores this information as pairs of first rule index, rule count.
> + *
> + *      Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048,
> + *      given lookup table #0 for field 0 (see example above):
> + *
> + * ::
> + *
> + *                     bucket
> + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> + *        0    0                                              1,2
> + *        1   1,2                                      0
> + *        2    0                                      1,2
> + *        3    0                              1,2
> + *        4  0,1,2
> + *        5    0   1   2
> + *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
> + *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
> + *
> + *      and lookup table #1 for field 1 with:
> + *      - rule #0: 1024 (0x0400) mapping to buckets
> + *        < 0  4  0  0 >
> + *      - rule #1: 2048 (0x0800) mapping to buckets
> + *        < 0  8  0  0 >
> + *
> + * ::
> + *
> + *                     bucket
> + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> + *        0   0,1
> + *        1                    0               1
> + *        2   0,1
> + *        3   0,1
> + *
> + *      we need to map rules for 10.0.0.5 in lookup table #0 (rule #0) to 1024
> + *      in lookup table #1 (rule #0) and rules for 192.168.1.0-192.168.2.1
> + *      (rules #1, #2) to 2048 in lookup table #1 (rule #1):
> + *
> + * ::
> + *
> + *       rule indices in current field: 0    1    2
> + *       map to rules in next field:    0    1    1
> + *
> + *   - if this is the last field in the set, fill a mapping array that maps
> + *     rules from the last lookup table to element pointers, also done by
> + *     pipapo_map().
> + *
> + *     Note that, in this implementation, we have two elements (start, end) for
> + *     each entry. The pointer to the end element is stored in this array, and
> + *     the pointer to the start element is linked from it.
> + *
> + *      Example: entry 10.0.0.5:1024 has a corresponding &struct nft_pipapo_elem
> + *      pointer, 0x66, and element for 192.168.1.0-192.168.2.1:2048 is at 0x42.
> + *      From the rules of lookup table #1 as mapped above:
> + *
> + * ::
> + *
> + *       rule indices in last field:    0    1
> + *       map to elements:             0x66  0x42
> + *
> + *
> + * Matching
> + * --------
> + *
> + * We use a result bitmap, with the size of a single lookup table bucket, to
> + * represent the matching state that applies at every algorithm step. This is
> + * done by pipapo_lookup().
> + *
> + * - For each packet field:
> + *
> + *   - for the first field, start with an all-ones result bitmap (res_map in
> + *     pipapo_lookup())
> + *
> + *   - perform a lookup into the table corresponding to the current field,
> + *     for each group, and at every group, AND the current result bitmap with
> + *     the value from the lookup table bucket
> + *
> + * ::
> + *
> + *      Example: 192.168.1.5 < 12 0  10 8  0 1  0 5 >, with lookup table from
> + *      insertion examples.
> + *      Lookup table buckets are at least 3 bits wide; we'll assume 8 bits for
> + *      convenience in this example. Initial result bitmap is 0xff, the steps
> + *      below show the value of the result bitmap after each group is processed:
> + *
> + *                     bucket
> + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> + *        0    0                                              1,2
> + *        result bitmap is now: 0xff & 0x6 [bucket 12] = 0x6
> + *
> + *        1   1,2                                      0
> + *        result bitmap is now: 0x6 & 0x6 [bucket 0] = 0x6
> + *
> + *        2    0                                      1,2
> + *        result bitmap is now: 0x6 & 0x6 [bucket 10] = 0x6
> + *
> + *        3    0                              1,2
> + *        result bitmap is now: 0x6 & 0x6 [bucket 8] = 0x6
> + *
> + *        4  0,1,2
> + *        result bitmap is now: 0x6 & 0x7 [bucket 0] = 0x6
> + *
> + *        5    0   1   2
> + *        result bitmap is now: 0x6 & 0x2 [bucket 1] = 0x2
> + *
> + *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
> + *        result bitmap is now: 0x2 & 0x7 [bucket 0] = 0x2
> + *
> + *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
> + *        final result bitmap for this field is: 0x2 & 0x3 [bucket 5] = 0x2
> + *
> + *   - at the next field, start with a new, all-zeroes result bitmap. For each
> + *     bit set in the previous result bitmap, fill the new result bitmap
> + *     (fill_map in pipapo_lookup()) with the rule indices from the
> + *     corresponding buckets of the mapping table for the previous field, done
> + *     by pipapo_refill()
> + *
> + *      Example: with mapping table from insertion examples, with the current
> + *      result bitmap from the previous example, 0x02:
> + *
> + * ::
> + *
> + *       rule indices in current field: 0    1    2
> + *       map to rules in next field:    0    1    1
> + *
> + *      the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be
> + *      set.
> + *
> + *      We can now extend this example to cover the second iteration of the step
> + *      above (lookup and AND bitmap): assuming the port field is
> + *      2048 (0x0800) < 0  8  0  0 >, with starting result bitmap 0x2, and
> + *      lookup table for "port" field, where 1024 (0x0400) and 2048 (0x0800)
> + *      select buckets 4 and 8 of group 1:
> + *
> + * ::
> + *
> + *                     bucket
> + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> + *        0   0,1
> + *        1                    0               1
> + *        2   0,1
> + *        3   0,1
> + *
> + *       operations are: 0x2 & 0x3 [bucket 0] & 0x2 [bucket 8] & 0x3 [bucket 0]
> + *       & 0x3 [bucket 0], resulting bitmap is 0x2.
> + *
> + *   - if this is the last field in the set, look up the value from the mapping
> + *     array corresponding to the final result bitmap
> + *
> + *      Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for
> + *      last field from insertion example:
> + *
> + * ::
> + *
> + *       rule indices in last field:    0    1
> + *       map to elements:             0x66  0x42
> + *
> + *      the matching element is at 0x42.
> + *
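
The per-group AND steps for the first field can be replayed in user space. This is only a sketch: doc_table() hard-codes the lookup table printed in the examples above, with 8-bit buckets as assumed there, and field_lookup() is a hypothetical name. The patch operates on buckets made of longs and keeps the result in per-CPU scratch maps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUPS	8
#define BUCKETS	16

/* Lookup table for field 0 as printed in the examples: bit r of a bucket is
 * set if rule #r matches that bucket value for that group.
 */
static void doc_table(uint8_t lt[GROUPS][BUCKETS])
{
	memset(lt, 0, GROUPS * BUCKETS);

	lt[0][0] = 0x1;  lt[0][12] = 0x6;
	lt[1][0] = 0x6;  lt[1][10] = 0x1;
	lt[2][0] = 0x1;  lt[2][10] = 0x6;
	lt[3][0] = 0x1;  lt[3][8] = 0x6;
	lt[4][0] = 0x7;
	lt[5][0] = 0x1;  lt[5][1] = 0x2;  lt[5][2] = 0x4;

	/* Rule #1 is a /24: it matches any value of the last two groups */
	for (int b = 0; b < BUCKETS; b++) {
		lt[6][b] = 0x2;
		lt[7][b] = 0x2;
	}
	lt[6][0] = 0x7;
	lt[7][0] = 0x6;  lt[7][1] = 0x6;  lt[7][5] = 0x3;
}

/* AND the buckets selected by each 4-bit group of addr, starting from an
 * all-ones bitmap. The result has one bit set for each matching rule.
 */
static uint8_t field_lookup(uint32_t addr)
{
	uint8_t lt[GROUPS][BUCKETS];
	uint8_t res = 0xff;

	doc_table(lt);

	for (int group = 0; group < GROUPS; group++) {
		unsigned int v = (addr >> (28 - 4 * group)) & 0xf;

		assert(v < BUCKETS);
		res &= lt[group][v];
	}

	return res;
}
```

For 192.168.1.5 (0xc0a80105) the result is 0x2, as in the walk-through above; an address outside all three rules, such as 192.168.2.2, ends up with an empty bitmap.
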
> + *
> + * References
> + * ----------
> + *
> + * [Ligatti 2010]
> + *      A Packet-classification Algorithm for Arbitrary Bitmask Rules, with
> + *      Automatic Time-space Tradeoffs
> + *      Jay Ligatti, Josh Kuhn, and Chris Gage.
> + *      Proceedings of the IEEE International Conference on Computer
> + *      Communication Networks (ICCCN), August 2010.
> + *      http://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf
> + *
> + * [Rottenstreich 2010]
> + *      Worst-Case TCAM Rule Expansion
> + *      Ori Rottenstreich and Isaac Keslassy.
> + *      2010 Proceedings IEEE INFOCOM, San Diego, CA, 2010.
> + *      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.4592&rep=rep1&type=pdf
> + *
> + * [Kogan 2014]
> + *      SAX-PAC (Scalable And eXpressive PAcket Classification)
> + *      Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane,
> + *      and Patrick Eugster.
> + *      Proceedings of the 2014 ACM conference on SIGCOMM, August 2014.
> + *      http://www.sigcomm.org/sites/default/files/ccr/papers/2014/August/2619239-2626294.pdf
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/log2.h>
> +#include <linux/module.h>
> +#include <linux/netlink.h>
> +#include <linux/netfilter.h>
> +#include <linux/netfilter/nf_tables.h>
> +#include <net/netfilter/nf_tables_core.h>
> +#include <uapi/linux/netfilter/nf_tables.h>
> +#include <net/ipv6.h>			/* For the maximum length of a field */
> +#include <linux/bitmap.h>
> +#include <linux/bitops.h>
> +
> +/* Count of concatenated fields depends on count of 32-bit nftables registers */
> +#define NFT_PIPAPO_MAX_FIELDS		NFT_REG32_COUNT
> +
> +/* Largest supported field size */
> +#define NFT_PIPAPO_MAX_BYTES		(sizeof(struct in6_addr))
> +#define NFT_PIPAPO_MAX_BITS		(NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE)
> +
> +/* Number of bits to be grouped together in lookup table buckets, arbitrary */
> +#define NFT_PIPAPO_GROUP_BITS		4
> +#define NFT_PIPAPO_GROUPS_PER_BYTE	(BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS)
> +
> +/* Fields are padded to 32 bits in input registers */
> +#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x)				\
> +	(round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32)))
> +#define NFT_PIPAPO_GROUPS_PADDING(x)					\
> +	(NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE)
> +
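
To make the padding arithmetic concrete, a user-space rendition of the two macros above might look like this. round_up_pow2() stands in for the kernel's round_up(), and the function names are hypothetical; the group size of 4 bits is taken from NFT_PIPAPO_GROUP_BITS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUPS_PER_BYTE	2	/* two 4-bit groups per byte */

/* User-space stand-in for the kernel's round_up(), y being a power of two */
static size_t round_up_pow2(size_t x, size_t y)
{
	assert(y && !(y & (y - 1)));

	return (x + y - 1) & ~(y - 1);
}

/* Bytes taken by a field of 'groups' groups once padded to 32 bits */
static size_t groups_padded_size(size_t groups)
{
	return round_up_pow2(groups / GROUPS_PER_BYTE, sizeof(uint32_t));
}

/* Padding bytes to skip after the field's own bytes */
static size_t groups_padding(size_t groups)
{
	return groups_padded_size(groups) - groups / GROUPS_PER_BYTE;
}
```

A 16-bit port (4 groups) takes 2 bytes and is followed by 2 bytes of padding, an IPv4 address (8 groups) needs none, and a 6-byte field (12 groups) is padded by 2 bytes.
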
> +/* Number of buckets, given by 2 ^ n, with n grouped bits */
> +#define NFT_PIPAPO_BUCKETS		(1 << NFT_PIPAPO_GROUP_BITS)
> +
> +/* Each n-bit range maps to up to n * 2 rules */
> +#define NFT_PIPAPO_MAP_NBITS		(const_ilog2(NFT_PIPAPO_MAX_BITS * 2))
> +
> +/* Use the rest of mapping table buckets for rule indices, but it makes no sense
> + * to exceed 32 bits
> + */
> +#if BITS_PER_LONG == 64
> +#define NFT_PIPAPO_MAP_TOBITS		32
> +#else
> +#define NFT_PIPAPO_MAP_TOBITS		(BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS)
> +#endif
> +
> +/* ...which gives us the highest allowed index for a rule */
> +#define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
> +					- (1UL << NFT_PIPAPO_MAP_NBITS))
> +
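
The values these macros evaluate to can be checked with a small user-space sketch. The function names are hypothetical, ilog2_pow2() stands in for const_ilog2(), and bits_per_long is passed as a parameter so both the 32-bit and 64-bit cases can be computed on any host.

```c
#include <assert.h>

/* Integer log2 of a power of two, standing in for const_ilog2() */
static unsigned int ilog2_pow2(unsigned long v)
{
	unsigned int n = 0;

	assert(v && !(v & (v - 1)));

	while (v >>= 1)
		n++;

	return n;
}

/* Bits used for the rule count 'n' in a mapping bucket: a field of up to
 * max_bits bits expands to up to max_bits * 2 rules.
 */
static unsigned int map_nbits(unsigned int max_bits)
{
	return ilog2_pow2(max_bits * 2UL);
}

/* Bits left for the first rule index 'to', fixed at 32 on 64-bit longs */
static unsigned int map_tobits(unsigned int bits_per_long,
			       unsigned int max_bits)
{
	if (bits_per_long == 64)
		return 32;

	return bits_per_long - map_nbits(max_bits);
}

/* Highest allowed first rule index, following NFT_PIPAPO_RULE0_MAX */
static unsigned long long rule0_max(unsigned int bits_per_long,
				    unsigned int max_bits)
{
	return (1ULL << (map_tobits(bits_per_long, max_bits) - 1)) -
	       (1ULL << map_nbits(max_bits));
}
```

With the 128-bit maximum field size, this gives 8 bits for n, 24 or 32 bits for to, and a highest first rule index of 8388352 on 32-bit and 2147483392 on 64-bit.
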
> +#define nft_pipapo_for_each_field(field, index, match)		\
> +	for ((field) = (match)->f, (index) = 0;			\
> +	     (index) < (match)->field_count;			\
> +	     (index)++, (field)++)
> +
> +/**
> + * union nft_pipapo_map_bucket - Bucket of mapping table
> + * @to:		First rule number (in next field) this rule maps to
> + * @n:		Number of rules (in next field) this rule maps to
> + * @e:		If there's no next field, pointer to element this rule maps to
> + */
> +union nft_pipapo_map_bucket {
> +	struct {
> +#if BITS_PER_LONG == 64
> +		static_assert(NFT_PIPAPO_MAP_TOBITS <= 32);
> +		u32 to;
> +
> +		static_assert(NFT_PIPAPO_MAP_NBITS <= 32);
> +		u32 n;
> +#else
> +		unsigned long to:NFT_PIPAPO_MAP_TOBITS;
> +		unsigned long  n:NFT_PIPAPO_MAP_NBITS;
> +#endif
> +	};
> +	struct nft_pipapo_elem *e;
> +};
> +
> +/**
> + * struct nft_pipapo_field - Lookup, mapping tables and related data for a field
> + * @groups:	Amount of 4-bit groups
> + * @rules:	Number of inserted rules
> + * @bsize:	Size of each bucket in lookup table, in longs
> + * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
> + * @mt:		Mapping table: one bucket per rule
> + */
> +struct nft_pipapo_field {
> +	int groups;
> +	unsigned long rules;
> +	size_t bsize;
> +	unsigned long *lt;
> +	union nft_pipapo_map_bucket *mt;
> +};
> +
> +/**
> + * struct nft_pipapo_match - Data used for lookup and matching
> + * @field_count:	Amount of fields in set
> + * @scratch:		Preallocated per-CPU maps for partial matching results
> + * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
> + * @rcu:		Matching data is swapped on commits
> + * @f:			Fields, with lookup and mapping tables
> + */
> +struct nft_pipapo_match {
> +	int field_count;
> +	unsigned long * __percpu *scratch;
> +	size_t bsize_max;
> +	struct rcu_head rcu;
> +	struct nft_pipapo_field f[0];
> +};
> +
> +/* Current working bitmap index, toggled between field matches */
> +static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index);
> +
> +/**
> + * struct nft_pipapo - Representation of a set
> + * @match:	Currently in-use matching data
> + * @clone:	Copy where pending insertions and deletions are kept
> + * @groups:	Total amount of 4-bit groups for fields in this set
> + * @width:	Total bytes to be matched for one packet, including padding
> + * @dirty:	Working copy has pending insertions or deletions
> + * @last_gc:	Timestamp of last garbage collection run, jiffies
> + * @start_data:	Key data of start element for insertion
> + * @start_elem:	Start element for insertion
> + */
> +struct nft_pipapo {
> +	struct nft_pipapo_match __rcu *match;
> +	struct nft_pipapo_match *clone;
> +	int groups;
> +	int width;
> +	bool dirty;
> +	unsigned long last_gc;
> +	u8 start_data[NFT_DATA_VALUE_MAXLEN * sizeof(u32)];
> +	struct nft_pipapo_elem *start_elem;
> +};
> +
> +struct nft_pipapo_elem;
> +
> +/**
> + * struct nft_pipapo_elem - API-facing representation of single set element
> + * @start:	Pointer to element that represents start of interval
> + * @ext:	nftables API extensions
> + */
> +struct nft_pipapo_elem {
> +	struct nft_pipapo_elem *start;
> +	struct nft_set_ext ext;
> +};
> +
> +/**
> + * pipapo_refill() - For each set bit, set bits from selected mapping table item
> + * @map:	Bitmap to be scanned for set bits
> + * @len:	Length of bitmap in longs
> + * @rules:	Number of rules in field
> + * @dst:	Destination bitmap
> + * @mt:		Mapping table containing bit set specifiers
> + * @match_only:	Find a single bit and return, don't fill
> + *
> + * Iteration over set bits with __builtin_ctzl(): Daniel Lemire, public domain.
> + *
> + * For each bit set in map, select the bucket from mapping table with index
> + * corresponding to the position of the bit set. Use start bit and amount of
> + * bits specified in bucket to fill region in dst.
> + *
> + * Return: -1 on no match, bit position on 'match_only', 0 otherwise.
> + */
> +static int pipapo_refill(unsigned long *map, int len, int rules,
> +			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
> +			 bool match_only)
> +{
> +	unsigned long bitset;
> +	int k, ret = -1;
> +
> +	for (k = 0; k < len; k++) {
> +		bitset = map[k];
> +		while (bitset) {
> +			unsigned long t = bitset & -bitset;
> +			int r = __builtin_ctzl(bitset);
> +			int i = k * BITS_PER_LONG + r;
> +
> +			if (unlikely(i >= rules)) {
> +				map[k] = 0;
> +				return -1;
> +			}
> +
> +			if (unlikely(match_only)) {
> +				bitmap_clear(map, i, 1);
> +				return i;
> +			}
> +
> +			ret = 0;
> +
> +			bitmap_set(dst, mt[i].to, mt[i].n);
> +
> +			bitset ^= t;
> +		}
> +		map[k] = 0;
> +	}
> +
> +	return ret;
> +}
> +
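
The set-bit iteration used in pipapo_refill() can be tried in isolation. A minimal sketch, assuming a GCC or Clang toolchain for __builtin_ctzl(); set_bit_positions() is a hypothetical helper that only records bit positions instead of filling a destination bitmap.

```c
#include <assert.h>
#include <stddef.h>

/* Store the positions of the bits set in 'word' into pos[], lowest first,
 * and return how many were found.
 */
static size_t set_bit_positions(unsigned long word, int *pos, size_t max)
{
	size_t n = 0;

	while (word) {
		/* Isolate the lowest set bit */
		unsigned long lowest = word & -word;

		assert(n < max);
		pos[n++] = __builtin_ctzl(word);

		/* Clear it, so that the next iteration finds the next one */
		word ^= lowest;
	}

	return n;
}
```

Each iteration costs a count-trailing-zeroes operation and two bitwise operations, so the loop runs once per set bit rather than once per bit position.
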
> +/**
> + * nft_pipapo_lookup() - Lookup function
> + * @net:	Network namespace
> + * @set:	nftables API set representation
> + * @key:	Key data to be matched against existing elements
> + * @ext:	nftables API extension pointer, filled with matching reference
> + *
> + * For more details, see DOC: Theory of Operation.
> + *
> + * Return: true on match, false otherwise.
> + */
> +static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
> +			      const u32 *key, const struct nft_set_ext **ext)
> +{
> +	struct nft_pipapo *priv = nft_set_priv(set);
> +	unsigned long *res_map, *fill_map;
> +	u8 genmask = nft_genmask_cur(net);
> +	const u8 *rp = (const u8 *)key;
> +	struct nft_pipapo_match *m;
> +	struct nft_pipapo_field *f;
> +	bool map_index;
> +	int i;
> +
> +	local_bh_disable();
> +
> +	map_index = raw_cpu_read(nft_pipapo_scratch_index);
> +
> +	m = rcu_dereference(priv->match);
> +
> +	if (unlikely(!m || !*raw_cpu_ptr(m->scratch)))
> +		goto out;
> +
> +	res_map  = *raw_cpu_ptr(m->scratch) + (map_index ? m->bsize_max : 0);
> +	fill_map = *raw_cpu_ptr(m->scratch) + (map_index ? 0 : m->bsize_max);
> +
> +	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
> +
> +	nft_pipapo_for_each_field(f, i, m) {
> +		bool last = i == m->field_count - 1;
> +		unsigned long *lt = f->lt;
> +		int b, group;
> +
> +		/* For each 4-bit group: select lookup table bucket depending on
> +		 * packet bytes value, then AND bucket value
> +		 */
> +		for (group = 0; group < f->groups; group++) {
> +			u8 v;
> +
> +			if (group % 2) {
> +				v = *rp & 0x0f;
> +				rp++;
> +			} else {
> +				v = *rp >> 4;
> +			}
> +			__bitmap_and(res_map, res_map, lt + v * f->bsize,
> +				     f->bsize * BITS_PER_LONG);
> +
> +			lt += f->bsize * NFT_PIPAPO_BUCKETS;
> +		}
> +
> +		/* Now populate the bitmap for the next field, unless this is
> +		 * the last field, in which case return the matched 'ext'
> +		 * pointer if any.
> +		 *
> +		 * Now res_map contains the matching bitmap, and fill_map is the
> +		 * bitmap for the next field.
> +		 */
> +next_match:
> +		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
> +				  last);
> +		if (b < 0) {
> +			raw_cpu_write(nft_pipapo_scratch_index, map_index);
> +			local_bh_enable();
> +
> +			return false;
> +		}
> +
> +		if (last) {
> +			*ext = &f->mt[b].e->ext;
> +			if (unlikely(nft_set_elem_expired(*ext) ||
> +				     !nft_set_elem_active(*ext, genmask)))
> +				goto next_match;
> +
> +			/* Last field: we're just returning the key without
> +			 * filling the initial bitmap for the next field, so the
> +			 * current inactive bitmap is clean and can be reused as
> +			 * *next* bitmap (not initial) for the next packet.
> +			 */
> +			raw_cpu_write(nft_pipapo_scratch_index, map_index);
> +			local_bh_enable();
> +
> +			return true;
> +		}
> +
> +		/* Swap bitmap indices: res_map is the initial bitmap for the
> +		 * next field, and fill_map is guaranteed to be all-zeroes at
> +		 * this point.
> +		 */
> +		map_index = !map_index;
> +		swap(res_map, fill_map);
> +
> +		rp += NFT_PIPAPO_GROUPS_PADDING(f->groups);
> +	}
> +
> +out:
> +	local_bh_enable();
> +	return false;
> +}
> +
> +/**
> + * pipapo_get() - Get matching start or end element reference given key data
> + * @net:	Network namespace
> + * @set:	nftables API set representation
> + * @data:	Key data to be matched against existing elements
> + * @flags:	If NFT_SET_ELEM_INTERVAL_END is passed, return the end element
> + *
> + * This is essentially the same as the lookup function, except that it matches
> + * key data against the uncommitted copy and doesn't use preallocated maps for
> + * bitmap results.
> + *
> + * Return: pointer to &struct nft_pipapo_elem on match, error pointer otherwise.
> + */
> +static void *pipapo_get(const struct net *net, const struct nft_set *set,
> +			const u8 *data, unsigned int flags)
> +{
> +	struct nft_pipapo *priv = nft_set_priv(set);
> +	struct nft_pipapo_match *m = priv->clone;
> +	unsigned long *res_map, *fill_map = NULL;
> +	void *ret = ERR_PTR(-ENOENT);
> +	struct nft_pipapo_field *f;
> +	int i;
> +
> +	res_map = kmalloc_array(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
> +	if (!res_map) {
> +		ret = ERR_PTR(-ENOMEM);
> +		goto out;
> +	}
> +
> +	fill_map = kcalloc(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
> +	if (!fill_map) {
> +		ret = ERR_PTR(-ENOMEM);
> +		goto out;
> +	}
> +
> +	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
> +
> +	nft_pipapo_for_each_field(f, i, m) {
> +		bool last = i == m->field_count - 1;
> +		unsigned long *lt = f->lt;
> +		int b, group;
> +
> +		/* For each 4-bit group: select lookup table bucket depending on
> +		 * packet bytes value, then AND bucket value
> +		 */
> +		for (group = 0; group < f->groups; group++) {
> +			u8 v;
> +
> +			if (group % 2) {
> +				v = *data & 0x0f;
> +				data++;
> +			} else {
> +				v = *data >> 4;
> +			}
> +			__bitmap_and(res_map, res_map, lt + v * f->bsize,
> +				     f->bsize * BITS_PER_LONG);
> +
> +			lt += f->bsize * NFT_PIPAPO_BUCKETS;
> +		}
> +
> +		/* Now populate the bitmap for the next field, unless this is
> +		 * the last field, in which case return the matched 'ext'
> +		 * pointer if any.
> +		 *
> +		 * Now res_map contains the matching bitmap, and fill_map is the
> +		 * bitmap for the next field.
> +		 */
> +next_match:
> +		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
> +				  last);
> +		if (b < 0)
> +			goto out;
> +
> +		if (last) {
> +			if (nft_set_elem_expired(&f->mt[b].e->ext))
> +				goto next_match;
> +
> +			if (flags & NFT_SET_ELEM_INTERVAL_END)
> +				ret = f->mt[b].e;
> +			else
> +				ret = f->mt[b].e->start;
> +			goto out;
> +		}
> +
> +		data += NFT_PIPAPO_GROUPS_PADDING(f->groups);
> +
> +		/* Swap bitmap indices: fill_map will be the initial bitmap for
> +		 * the next field (i.e. the new res_map), and res_map is
> +		 * guaranteed to be all-zeroes at this point, ready to be filled
> +		 * according to the next mapping table.
> +		 */
> +		swap(res_map, fill_map);
> +	}
> +
> +out:
> +	kfree(fill_map);
> +	kfree(res_map);
> +	return ret;
> +}
> +
> +/**
> + * nft_pipapo_get() - Get matching element reference given key data
> + * @net:	Network namespace
> + * @set:	nftables API set representation
> + * @elem:	nftables API element representation containing key data
> + * @flags:	If NFT_SET_ELEM_INTERVAL_END is passed, return the end element
> + */
> +static void *nft_pipapo_get(const struct net *net, const struct nft_set *set,
> +			    const struct nft_set_elem *elem, unsigned int flags)
> +{
> +	return pipapo_get(net, set, (const u8 *)elem->key.val.data, flags);
> +}
> +
> +/**
> + * pipapo_resize() - Resize lookup or mapping table, or both
> + * @f:		Field containing lookup and mapping tables
> + * @old_rules:	Previous amount of rules in field
> + * @rules:	New amount of rules
> + *
> + * Increase, decrease or maintain tables size depending on new amount of rules,
> + * and copy data over. In case the new size is smaller, throw away data for
> + * highest-numbered rules.
> + *
> + * Return: 0 on success, -ENOMEM on allocation failure.
> + */
> +static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
> +{
> +	long *new_lt = NULL, *new_p, *old_lt = f->lt, *old_p;
> +	union nft_pipapo_map_bucket *new_mt, *old_mt = f->mt;
> +	size_t new_bucket_size, copy;
> +	int group, bucket;
> +
> +	new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG);
> +
> +	if (new_bucket_size == f->bsize)
> +		goto mt;
> +
> +	if (new_bucket_size > f->bsize)
> +		copy = f->bsize;
> +	else
> +		copy = new_bucket_size;
> +
> +	new_lt = kvzalloc(f->groups * NFT_PIPAPO_BUCKETS * new_bucket_size *
> +			  sizeof(*new_lt), GFP_KERNEL);
> +	if (!new_lt)
> +		return -ENOMEM;
> +
> +	new_p = new_lt;
> +	old_p = old_lt;
> +	for (group = 0; group < f->groups; group++) {
> +		for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS; bucket++) {
> +			memcpy(new_p, old_p, copy * sizeof(*new_p));
> +			new_p += copy;
> +			old_p += copy;
> +
> +			if (new_bucket_size > f->bsize)
> +				new_p += new_bucket_size - f->bsize;
> +			else
> +				old_p += f->bsize - new_bucket_size;
> +		}
> +	}
> +
> +mt:
> +	new_mt = kvmalloc(rules * sizeof(*new_mt), GFP_KERNEL);
> +	if (!new_mt) {
> +		kvfree(new_lt);
> +		return -ENOMEM;
> +	}
> +
> +	memcpy(new_mt, f->mt, min(old_rules, rules) * sizeof(*new_mt));
> +	if (rules > old_rules) {
> +		memset(new_mt + old_rules, 0,
> +		       (rules - old_rules) * sizeof(*new_mt));
> +	}
> +
> +	if (new_lt) {
> +		f->bsize = new_bucket_size;
> +		f->lt = new_lt;
> +		kvfree(old_lt);
> +	}
> +
> +	f->mt = new_mt;
> +	kvfree(old_mt);
> +
> +	return 0;
> +}
> +
> +/**
> + * pipapo_bucket_set() - Set rule bit in bucket given group and group value
> + * @f:		Field containing lookup table
> + * @rule:	Rule index
> + * @group:	Group index
> + * @v:		Value of bit group
> + */
> +static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group,
> +			      int v)
> +{
> +	unsigned long *pos;
> +
> +	pos = f->lt + f->bsize * NFT_PIPAPO_BUCKETS * group;
> +	pos += f->bsize * v;
> +
> +	__set_bit(rule, pos);
> +}
> +
> +/**
> + * pipapo_insert() - Insert new rule in field given input key and mask length
> + * @f:		Field containing lookup table
> + * @k:		Input key for classification, without nftables padding
> + * @mask_bits:	Length of mask; matches field length for non-ranged entry
> + *
> + * Insert a new rule reference in lookup buckets corresponding to k and
> + * mask_bits.
> + *
> + * Return: 1 on success (one rule inserted), negative error code on failure.
> + */
> +static int pipapo_insert(struct nft_pipapo_field *f, const uint8_t *k,
> +			 int mask_bits)
> +{
> +	int rule = f->rules++, group, ret;
> +
> +	ret = pipapo_resize(f, f->rules - 1, f->rules);
> +	if (ret)
> +		return ret;
> +
> +	for (group = 0; group < f->groups; group++) {
> +		int i, v;
> +		u8 mask;
> +
> +		if (group % 2)
> +			v = k[group / 2] & 0x0f;
> +		else
> +			v = k[group / 2] >> 4;
> +
> +		if (mask_bits >= (group + 1) * 4) {
> +			/* Not masked */
> +			pipapo_bucket_set(f, rule, group, v);
> +		} else if (mask_bits <= group * 4) {
> +			/* Completely masked */
> +			for (i = 0; i < NFT_PIPAPO_BUCKETS; i++)
> +				pipapo_bucket_set(f, rule, group, i);
> +		} else {
> +			/* The mask limit falls on this group */
> +			mask = 0x0f >> (mask_bits - group * 4);
> +			for (i = 0; i < NFT_PIPAPO_BUCKETS; i++) {
> +				if ((i & ~mask) == (v & ~mask))
> +					pipapo_bucket_set(f, rule, group, i);
> +			}
> +		}
> +	}
> +
> +	return 1;
> +}
> +
> +/**
> + * pipapo_step_diff() - Check if setting @step bit in netmask would change it
> + * @base:	Mask we are expanding
> + * @step:	Step bit for given expansion step
> + * @len:	Total length of mask space (set and unset bits), bytes
> + *
> + * Convenience function for mask expansion.
> + *
> + * Return: true if step bit changes mask (i.e. isn't set), false otherwise.
> + */
> +static bool pipapo_step_diff(u8 *base, int step, int len)
> +{
> +	/* Network order, byte-addressed */
> +#ifdef __BIG_ENDIAN__
> +	return !(BIT(step % BITS_PER_BYTE) & base[step / BITS_PER_BYTE]);
> +#else
> +	return !(BIT(step % BITS_PER_BYTE) &
> +		 base[len - 1 - step / BITS_PER_BYTE]);
> +#endif
> +}
> +
> +/**
> + * pipapo_step_after_end() - Check if mask exceeds range end with given step
> + * @base:	Mask we are expanding
> + * @end:	End of range
> + * @step:	Step bit for given expansion step, highest bit to be set
> + * @len:	Total length of mask space (set and unset bits), bytes
> + *
> + * Convenience function for mask expansion.
> + *
> + * Return: true if mask exceeds range setting step bits, false otherwise.
> + */
> +static bool pipapo_step_after_end(const u8 *base, const u8 *end, int step,
> +				  int len)
> +{
> +	u8 tmp[NFT_PIPAPO_MAX_BYTES];
> +	int i;
> +
> +	memcpy(tmp, base, len);
> +
> +	/* Network order, byte-addressed */
> +	for (i = 0; i <= step; i++)
> +#ifdef __BIG_ENDIAN__
> +		tmp[i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE);
> +#else
> +		tmp[len - 1 - i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE);
> +#endif
> +
> +	return memcmp(tmp, end, len) > 0;
> +}
> +
> +/**
> + * pipapo_base_sum() - Sum step bit to given len-sized netmask base with carry
> + * @base:	Netmask base
> + * @step:	Step bit to sum
> + * @len:	Netmask length, bytes
> + */
> +static void pipapo_base_sum(u8 *base, int step, int len)
> +{
> +	bool carry = false;
> +	int i;
> +
> +	/* Network order, byte-addressed */
> +#ifdef __BIG_ENDIAN__
> +	for (i = step / BITS_PER_BYTE; i < len; i++) {
> +#else
> +	for (i = len - 1 - step / BITS_PER_BYTE; i >= 0; i--) {
> +#endif
> +		if (carry)
> +			base[i]++;
> +		else
> +			base[i] += 1 << (step % BITS_PER_BYTE);
> +
> +		if (base[i])
> +			break;
> +
> +		carry = true;
> +	}
> +}
> +
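
For reference, the carry propagation can be sketched in user space as below. This assumes the value is stored most significant byte first (network order), which is the layout the non-big-endian branch above walks; be_add_bit() is a hypothetical name, and the sketch tracks the carry as a number instead of a flag.

```c
#include <assert.h>
#include <stdint.h>

/* Add (1 << step) to a len-byte unsigned value stored most significant byte
 * first, propagating the carry towards byte 0. Overflow past byte 0 wraps.
 */
static void be_add_bit(uint8_t *base, int step, int len)
{
	unsigned int carry = 1u << (step % 8);

	assert(step >= 0 && step < len * 8);

	for (int i = len - 1 - step / 8; i >= 0 && carry; i--) {
		unsigned int sum = base[i] + carry;

		base[i] = (uint8_t)sum;
		carry = sum >> 8;
	}
}
```

Adding step bit 8 to 192.168.1.0 gives 192.168.2.0, which is how the expansion moves from the end of one netmask to the base of the next.
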
> +/**
> + * pipapo_expand() - Expand to composing netmasks, insert into lookup table
> + * @f:		Field containing lookup table
> + * @start:	Start of range
> + * @end:	End of range
> + * @len:	Length of value in bits
> + *
> + * Expand range to composing netmasks and insert corresponding rule references
> + * in lookup buckets.
> + *
> + * Return: number of inserted rules on success, negative error code on failure.
> + */
> +static int pipapo_expand(struct nft_pipapo_field *f,
> +			 const u8 *start, const u8 *end, int len)
> +{
> +	int step, masks = 0, bytes = DIV_ROUND_UP(len, BITS_PER_BYTE);
> +	u8 base[NFT_PIPAPO_MAX_BYTES];
> +
> +	memcpy(base, start, bytes);
> +	while (memcmp(base, end, bytes) <= 0) {
> +		int err;
> +
> +		step = 0;
> +		while (pipapo_step_diff(base, step, bytes)) {
> +			if (pipapo_step_after_end(base, end, step, bytes))
> +				break;
> +
> +			step++;
> +			if (step >= len) {
> +				if (!masks) {
> +					pipapo_insert(f, base, 0);
> +					masks = 1;
> +				}
> +				goto out;
> +			}
> +		}
> +
> +		err = pipapo_insert(f, base, len - step);
> +
> +		if (err < 0)
> +			return err;
> +
> +		masks++;
> +		pipapo_base_sum(base, step, bytes);
> +	}
> +out:
> +	return masks;
> +}
> +
> +/**
> + * pipapo_map() - Insert rules in mapping tables, mapping them between fields
> + * @m:		Matching data, including mapping table
> + * @map:	Table of rule maps: array of first rule and amount of rules
> + *		in next field a given rule maps to, for each field
> + * @e:		For last field, element pointer matching rules map to
> + */
> +static void pipapo_map(struct nft_pipapo_match *m,
> +		       union nft_pipapo_map_bucket map[NFT_PIPAPO_MAX_FIELDS],
> +		       struct nft_pipapo_elem *e)
> +{
> +	struct nft_pipapo_field *f;
> +	int i, j;
> +
> +	for (i = 0, f = m->f; i < m->field_count - 1; i++, f++) {
> +		for (j = 0; j < map[i].n; j++) {
> +			f->mt[map[i].to + j].to = map[i + 1].to;
> +			f->mt[map[i].to + j].n = map[i + 1].n;
> +		}
> +	}
> +
> +	/* Last field: map to ext instead of mapping to next field */
> +	for (j = 0; j < map[i].n; j++)
> +		f->mt[map[i].to + j].e = e;
> +}
> +
> +/**
> + * pipapo_realloc_scratch() - Reallocate scratch maps for partial match results
> + * @clone:	Copy of matching data with pending insertions and deletions
> + * @bsize_max:	Maximum bucket size, scratch maps cover two buckets
> + *
> + * Return: 0 on success, -ENOMEM on failure.
> + */
> +static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
> +				  unsigned long bsize_max)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		unsigned long *scratch;
> +
> +		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2,
> +				       GFP_KERNEL, cpu_to_node(i));
> +		if (!scratch) {
> +			/* On failure, there's no need to undo previous
> +			 * allocations: this means that some scratch maps have
> +			 * a bigger allocated size now (this is only called on
> +			 * insertion), but the extra space won't be used by any
> +			 * CPU as new elements are not inserted and m->bsize_max
> +			 * is not updated.
> +			 */
> +			return -ENOMEM;
> +		}
> +
> +		kfree(*per_cpu_ptr(clone->scratch, i));
> +
> +		*per_cpu_ptr(clone->scratch, i) = scratch;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * nft_pipapo_insert() - Validate and insert ranged elements
> + * @net:	Network namespace
> + * @set:	nftables API set representation
> + * @elem:	nftables API element representation containing key data
> + * @flags:	If NFT_SET_ELEM_INTERVAL_END is passed, this is the end element
> + * @ext2:	Filled with pointer to &struct nft_set_ext in inserted element
> + *
> + * In this set implementation, this function needs to be called twice, with
> + * start and end element, to obtain a valid entry insertion.
> + *
> + * Calls to this function are serialised with each other, so we can store
> + * element and key data on the first call with start element, and use it on the
> + * second call once we get the end element too.
> + *
> + * However, userspace could send a single NFT_SET_ELEM_INTERVAL_END element,
> + * without a start element, so we need to check for it explicitly before
> + * inserting an entry, lest we end up in nft_pipapo_walk() with an empty start
> + * element.
> + *
> + * Also, we need to make sure that the start element hasn't been deactivated or
> + * destroyed between the two calls to this function, otherwise we might link an
> + * invalid start item to the end item triggering the insertion. Clear
> + * priv->start_elem on any operation that might render it invalid.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +static int nft_pipapo_insert(const struct net *net, const struct nft_set *set,
> +			     const struct nft_set_elem *elem,
> +			     struct nft_set_ext **ext2)
> +{
> +	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
> +	const u8 *data = (const u8 *)elem->key.val.data, *start, *end;
> +	union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
> +	struct nft_pipapo *priv = nft_set_priv(set);
> +	struct nft_pipapo_match *m = priv->clone;
> +	struct nft_pipapo_elem *e = elem->priv;
> +	struct nft_pipapo_field *f;
> +	int i, bsize_max, err = 0;
> +	void *dup;
> +
> +	dup = nft_pipapo_get(net, set, elem, 0);
> +	if (PTR_ERR(dup) != -ENOENT) {
> +		priv->start_elem = NULL;
> +		if (IS_ERR(dup))
> +			return PTR_ERR(dup);
> +		*ext2 = dup;

dup should be of nft_set_ext type. I just had a look at
nft_pipapo_get() and I think this returns nft_pipapo_elem, which is
almost good, since it contains nft_set_ext, right?

I think you also need to check if the object is active in the next
generation via nft_genmask_next() and nft_set_elem_active(), otherwise
ignore it.

Note that the data structure needs to temporarily deal with duplicates,
i.e. one inactive object (just deleted) and one active object (just
added) for the next generation.

> +		return -EEXIST;
> +	}
> +
> +	if (!nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) ||
> +	    !(*nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END)) {
> +		priv->start_elem = e;
> +		*ext2 = &e->ext;
> +		memcpy(priv->start_data, data, priv->width);
> +		return 0;
> +	}
> +
> +	if (!priv->start_elem)
> +		return -EINVAL;

I'm working on a sketch patch to extend the front-end code to make
this easier for you, will post it asap, so you don't need this special
handling to collect both ends of the interval.

So far, I've just spent a bit of time on this, I will get back to you
with more feedback.

Thanks for working on this!


* Re: [PATCH nf-next v2 3/8] nf_tables: Add set type for arbitrary concatenation of ranges
  2019-11-27  9:29   ` Pablo Neira Ayuso
@ 2019-11-27 11:02     ` Stefano Brivio
  2019-11-27 18:29       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2019-11-27 11:02 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Wed, 27 Nov 2019 10:29:45 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi Stefano,
> 
> Just started reading, a few initial questions.
> 
> On Fri, Nov 22, 2019 at 02:40:02PM +0100, Stefano Brivio wrote:
> [...]
>
> > +static int nft_pipapo_insert(const struct net *net, const struct nft_set *set,
> > +			     const struct nft_set_elem *elem,
> > +			     struct nft_set_ext **ext2)
> > +{
> > +	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
> > +	const u8 *data = (const u8 *)elem->key.val.data, *start, *end;
> > +	union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
> > +	struct nft_pipapo *priv = nft_set_priv(set);
> > +	struct nft_pipapo_match *m = priv->clone;
> > +	struct nft_pipapo_elem *e = elem->priv;
> > +	struct nft_pipapo_field *f;
> > +	int i, bsize_max, err = 0;
> > +	void *dup;
> > +
> > +	dup = nft_pipapo_get(net, set, elem, 0);
> > +	if (PTR_ERR(dup) != -ENOENT) {
> > +		priv->start_elem = NULL;
> > +		if (IS_ERR(dup))
> > +			return PTR_ERR(dup);
> > +		*ext2 = dup;  
> 
> dup should be of nft_set_ext type. I just had a look at
> nft_pipapo_get() and I think this returns nft_pipapo_elem, which is
> almost good, since it contains nft_set_ext, right?

Right, it returns nft_pipapo_elem which contains that.

> I think you also need to check if the object is active in the next
> generation via nft_genmask_next() and nft_set_elem_active(), otherwise
> ignore it.

I guess I should actually do this in nft_pipapo_get(), also because we
don't want to return inactive elements when userspace "gets" them.

I just noticed this is currently inconsistent with the lookup, because
nft_pipapo_lookup() correctly does:

--
next_match:
		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
				  last);
		if (b < 0) {
			raw_cpu_write(nft_pipapo_scratch_index, map_index);
			local_bh_enable();

			return false;
		}

		if (last) {
			*ext = &f->mt[b].e->ext;
			if (unlikely(nft_set_elem_expired(*ext) ||
				     !nft_set_elem_active(*ext, genmask)))
				goto next_match;
--

but I forgot to implement the same check in pipapo_get():

--
next_match:
		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
				  last);
		if (b < 0)
			goto out;

		if (last) {
			if (nft_set_elem_expired(&f->mt[b].e->ext))
				goto next_match;
--

this check should simply include || !nft_set_elem_active(...), and then
I wouldn't need any further check in nft_pipapo_insert(). I'd fix this
in v3.

I'm actually not sure if I need to report these elements to
nft_pipapo_remove(). If it's needed, I would add some kind of
"get_inactive" flag to pipapo_get(), which is true on the call from
nft_pipapo_remove(), and false on other paths. If the flag is true, the
nft_set_elem_active() check is then skipped.
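As a stand-alone illustration of the check and of the possible
"get_inactive" flag described above, here is a minimal sketch. It uses
mock types, not the kernel structures: struct mock_pipapo_elem,
mock_pipapo_elem_active() and mock_pipapo_get() are all hypothetical
names, and the convention that a set genmask bit means "inactive in that
generation" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a matched element, not the kernel layout */
struct mock_pipapo_elem {
	bool expired;
	unsigned int genmask;	/* assumed: set bit = inactive in that generation */
};

/* Assumed semantics: active if no bit of the requested genmask is set */
bool mock_pipapo_elem_active(const struct mock_pipapo_elem *e,
			     unsigned int genmask)
{
	return !(e->genmask & genmask);
}

/* Walk candidate matches in order, skipping expired elements and, unless
 * get_inactive is true, elements that are not active for genmask.
 */
const struct mock_pipapo_elem *
mock_pipapo_get(const struct mock_pipapo_elem *matches, size_t n,
		unsigned int genmask, bool get_inactive)
{
	size_t i;

	assert(matches || !n);

	for (i = 0; i < n; i++) {
		const struct mock_pipapo_elem *e = &matches[i];

		if (e->expired)
			continue;	/* same as the existing expiry check */

		if (!get_inactive && !mock_pipapo_elem_active(e, genmask))
			continue;	/* the check missing from the get path */

		return e;
	}

	return NULL;
}
```

With get_inactive set to false this behaves like the lookup path quoted
above; with it set to true, only expired elements are skipped.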

> Note that the data structure needs to temporarily deal with duplicates,
> i.e. one inactive object (just deleted) and one active object (just
> added) for the next generation.

Yes, this is taken care of (except for the problem described above),
specifically, there can be n inactive objects, and a single active
object that are entirely overlapping.

This makes some optimisations harder to implement, namely, step 5.2.1
from:
	https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n337

because we need to allow entirely overlapping entries and map them to
possibly distinct elements.

Now, I think this would all be easier if the API implemented
transactions and commit in a way that appears more natural to me.

When I started working on this, I initially thought activate() would be
called once per transaction, not per element, so that insert() and
remove() would add or remove elements pending for a given transaction,
and activate() would commit it. Same for flush().

At that point, we would have a copy of lookup data with pending
insertions and without pending deletions, and on transaction commit,
this copy would become active, with no inactive elements in it.
Hence, no overlapping elements in live data.

This way we could also make transactions atomic. If activate() is
called once for each element in the transaction, that can't be atomic.
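The model described above can be sketched in a few lines. This is only
an illustration under stated assumptions, not the nf_tables API: the
mock_* names are hypothetical, lookup data is reduced to a flat list of
integer keys, and the pointer swap stands in for whatever
synchronisation (RCU or similar) real code would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MOCK_MAX_KEYS 16

/* Hypothetical lookup data: a flat key list, nothing like pipapo's tables */
struct mock_match {
	int keys[MOCK_MAX_KEYS];
	size_t n;
};

struct mock_set {
	struct mock_match *live;	/* what lookups see */
	struct mock_match *clone;	/* copy collecting pending changes */
};

int mock_set_init(struct mock_set *s)
{
	s->live = calloc(1, sizeof(*s->live));
	s->clone = calloc(1, sizeof(*s->clone));
	if (!s->live || !s->clone) {
		free(s->live);
		free(s->clone);
		return -1;
	}
	return 0;
}

bool mock_lookup(const struct mock_set *s, int key)
{
	size_t i;

	for (i = 0; i < s->live->n; i++)
		if (s->live->keys[i] == key)
			return true;
	return false;
}

/* insert() and remove() only touch the clone, never live data */
int mock_insert(struct mock_set *s, int key)
{
	if (s->clone->n == MOCK_MAX_KEYS)
		return -1;
	s->clone->keys[s->clone->n++] = key;
	return 0;
}

void mock_remove(struct mock_set *s, int key)
{
	struct mock_match *c = s->clone;
	size_t i;

	for (i = 0; i < c->n; i++) {
		if (c->keys[i] == key) {
			c->keys[i] = c->keys[--c->n];
			return;
		}
	}
}

/* Called once per transaction: the clone becomes live in a single step,
 * then the old live copy is refreshed to serve as the next clone.
 */
void mock_commit(struct mock_set *s)
{
	struct mock_match *old = s->live;

	assert(s->live && s->clone);
	s->live = s->clone;
	*old = *s->live;
	s->clone = old;
}
```

Since live data never holds pending or deleted entries in this model,
lookups would not need to filter out inactive elements at all.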

I plan to work on this (if it makes sense), but it looks rather
complicated to match this with existing set implementations and
especially current UAPI, that's the main reason why I "worked around"
this aspect for the time being. I guess that having at least one set
implementation that can play along with this model would help later.

> > +		return -EEXIST;
> > +	}
> > +
> > +	if (!nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) ||
> > +	    !(*nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END)) {
> > +		priv->start_elem = e;
> > +		*ext2 = &e->ext;
> > +		memcpy(priv->start_data, data, priv->width);
> > +		return 0;
> > +	}
> > +
> > +	if (!priv->start_elem)
> > +		return -EINVAL;  
> 
> I'm working on a sketch patch to extend the front-end code to make
> this easier for you, will post it asap, so you don't need this special
> handling to collect both ends of the interval.

Nice, thanks. Mind that I think this is actually a bit ugly but fine.
As I was mentioning to Florian, it doesn't present any particular race
with bad consequences (at least in v2).

Right now I was trying to get the NFTA_SET_DESC_CONCAT >
NFTA_LIST_ELEM > NFTA_SET_FIELD_LEN nesting implemented in libnftnl in
a somewhat acceptable way. Let me know if the front-end changes would
affect this significantly, I'll wait for your patch in that case.

> So far, I've just spent a bit of time on this, I will get back to you
> with more feedback.
> 
> Thanks for working on this!

And thanks for reviewing it!

-- 
Stefano



* Re: [PATCH nf-next v2 3/8] nf_tables: Add set type for arbitrary concatenation of ranges
  2019-11-27 11:02     ` Stefano Brivio
@ 2019-11-27 18:29       ` Pablo Neira Ayuso
  0 siblings, 0 replies; 24+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-27 18:29 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

[-- Attachment #1: Type: text/plain, Size: 2129 bytes --]

Hi Stefano,

On Wed, Nov 27, 2019 at 12:02:49PM +0100, Stefano Brivio wrote:
> On Wed, 27 Nov 2019 10:29:45 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> > On Fri, Nov 22, 2019 at 02:40:02PM +0100, Stefano Brivio wrote:
[...]
> > I think you also need to check if the object is active in the next
> > generation via nft_genmask_next() and nft_set_elem_active(), otherwise
> > ignore it.
> 
> I guess I should actually do this in nft_pipapo_get(), also because we
> don't want to return inactive elements when userspace "gets" them.

OK. Just a side note: nft_pipapo_get() is also used to get an
interval, which needs the current generation. From the insert path, one
needs to check the next generation.
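To illustrate the difference between the two generations, here is a
small stand-alone sketch. It is not the kernel code: mock_genmask_cur(),
mock_genmask_next() and mock_gen_active() are hypothetical helpers, and
the two-bit cursor scheme with "set bit = inactive" is an assumption
made for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical element carrying only a generation mask */
struct mock_gen_elem {
	unsigned int genmask;	/* assumed: set bit = inactive in that generation */
};

/* The cursor is 0 or 1 and selects the bit of the current generation */
unsigned int mock_genmask_cur(unsigned int cursor)
{
	assert(cursor <= 1u);
	return 1u << cursor;
}

/* The next generation uses the other bit */
unsigned int mock_genmask_next(unsigned int cursor)
{
	assert(cursor <= 1u);
	return 1u << (cursor ^ 1u);
}

bool mock_gen_active(const struct mock_gen_elem *e, unsigned int genmask)
{
	return !(e->genmask & genmask);
}
```

Under these assumptions, an element added in the pending transaction is
inactive for mock_genmask_cur() but active for mock_genmask_next(), so
only a check against the next generation detects it as a duplicate on
insertion. An element deleted in the pending transaction is the
opposite: a "get" against the current generation still returns it, while
the insert path ignores it.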

[...]
> > > +		return -EEXIST;
> > > +	}
> > > +
> > > +	if (!nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) ||
> > > +	    !(*nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END)) {
> > > +		priv->start_elem = e;
> > > +		*ext2 = &e->ext;
> > > +		memcpy(priv->start_data, data, priv->width);
> > > +		return 0;
> > > +	}
> > > +
> > > +	if (!priv->start_elem)
> > > +		return -EINVAL;  
> > 
> > I'm working on a sketch patch to extend the front-end code to make
> > this easier for you, will post it asap, so you don't need this special
> > handling to collect both ends of the interval.
> 
> Nice, thanks. Mind that I think this is actually a bit ugly but fine.
> As I was mentioning to Florian, it doesn't present any particular race
> with bad consequences (at least in v2).
> 
> Right now I was trying to get the NFTA_SET_DESC_CONCAT >
> NFTA_LIST_ELEM > NFTA_SET_FIELD_LEN nesting implemented in libnftnl in
> a somewhat acceptable way. Let me know if the front-end changes would
> affect this significantly, I'll wait for your patch in that case.

I'm attaching a sketch patch, I need a bit more time to finish it. The
idea is to place the interval end in the same element as the start,
instead of using two elements. This should also simplify the rbtree
implementation. The main issue is that this needs a bit more work to
make it backward compatible.
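A minimal stand-alone sketch of what carrying both interval ends in one
element means for a back-end. It does not use the kernel structures from
the attached patch: struct mock_range_elem and its helpers are
hypothetical, and keys are assumed to be fixed-length byte strings in
network byte order, so that memcmp() orders them numerically.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MOCK_KLEN 4	/* key length in bytes, e.g. one IPv4 address */

/* Hypothetical element holding both ends of the interval */
struct mock_range_elem {
	unsigned char key[MOCK_KLEN];		/* start of interval */
	unsigned char key_end[MOCK_KLEN];	/* end of interval, inclusive */
};

/* A single call receives both ends, so no state has to be kept between
 * a "start" insertion and a later "end" insertion.
 */
bool mock_range_elem_init(struct mock_range_elem *e,
			  const unsigned char *key,
			  const unsigned char *key_end)
{
	assert(e && key && key_end);

	if (memcmp(key, key_end, MOCK_KLEN) > 0)
		return false;	/* start after end: reject */

	memcpy(e->key, key, MOCK_KLEN);
	memcpy(e->key_end, key_end, MOCK_KLEN);
	return true;
}

bool mock_range_elem_match(const struct mock_range_elem *e,
			   const unsigned char *data)
{
	return memcmp(e->key, data, MOCK_KLEN) <= 0 &&
	       memcmp(data, e->key_end, MOCK_KLEN) <= 0;
}
```

With this shape, the check for an end element arriving without a start
element, as done through priv->start_elem in the quoted code, would have
nothing left to guard against.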

P.S: I'm skipping the transaction discussion in this email, will come
back to it later.

[-- Attachment #2: 0001-netfilter-nf_tables-add-NFT_SET_EXT_KEY_END.patch --]
[-- Type: text/x-diff, Size: 5674 bytes --]

From b6d159e8b3e3f1c6e41e6101996df36e6977c3e3 Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Wed, 27 Nov 2019 19:01:10 +0100
Subject: [PATCH] netfilter: nf_tables: add NFT_SET_EXT_KEY_END

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h        |  7 +++++++
 include/uapi/linux/netfilter/nf_tables.h |  2 ++
 net/netfilter/nf_tables_api.c            | 35 ++++++++++++++++++++++++++------
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 2d0275f13bbf..51338b438e86 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -509,6 +509,7 @@ void nf_tables_destroy_set(const struct nft_ctx *ctx, struct nft_set *set);
  *	@NFT_SET_EXT_USERDATA: user data associated with the element
  *	@NFT_SET_EXT_EXPR: expression assiociated with the element
  *	@NFT_SET_EXT_OBJREF: stateful object reference associated with element
+ *	@NFT_SET_EXT_KEY_END: closing element key
  *	@NFT_SET_EXT_NUM: number of extension types
  */
 enum nft_set_extensions {
@@ -520,6 +521,7 @@ enum nft_set_extensions {
 	NFT_SET_EXT_USERDATA,
 	NFT_SET_EXT_EXPR,
 	NFT_SET_EXT_OBJREF,
+	NFT_SET_EXT_KEY_END,
 	NFT_SET_EXT_NUM
 };
 
@@ -606,6 +608,11 @@ static inline struct nft_data *nft_set_ext_key(const struct nft_set_ext *ext)
 	return nft_set_ext(ext, NFT_SET_EXT_KEY);
 }
 
+static inline struct nft_data *nft_set_ext_key_end(const struct nft_set_ext *ext)
+{
+	return nft_set_ext(ext, NFT_SET_EXT_KEY_END);
+}
+
 static inline struct nft_data *nft_set_ext_data(const struct nft_set_ext *ext)
 {
 	return nft_set_ext(ext, NFT_SET_EXT_DATA);
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index ed8881ad18ed..9e4f0a584c57 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -368,6 +368,7 @@ enum nft_set_elem_flags {
  * @NFTA_SET_ELEM_USERDATA: user data (NLA_BINARY)
  * @NFTA_SET_ELEM_EXPR: expression (NLA_NESTED: nft_expr_attributes)
  * @NFTA_SET_ELEM_OBJREF: stateful object reference (NLA_STRING)
+ * @NFTA_SET_ELEM_KEY_END: closing key value (NLA_NESTED)
  */
 enum nft_set_elem_attributes {
 	NFTA_SET_ELEM_UNSPEC,
@@ -380,6 +381,7 @@ enum nft_set_elem_attributes {
 	NFTA_SET_ELEM_EXPR,
 	NFTA_SET_ELEM_PAD,
 	NFTA_SET_ELEM_OBJREF,
+	NFTA_SET_ELEM_KEY_END,
 	__NFTA_SET_ELEM_MAX
 };
 #define NFTA_SET_ELEM_MAX	(__NFTA_SET_ELEM_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index ab325c6fcfb8..b8b2b918bd47 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3932,6 +3932,7 @@ static const struct nla_policy nft_set_elem_policy[NFTA_SET_ELEM_MAX + 1] = {
 					    .len = NFT_USERDATA_MAXLEN },
 	[NFTA_SET_ELEM_EXPR]		= { .type = NLA_NESTED },
 	[NFTA_SET_ELEM_OBJREF]		= { .type = NLA_STRING },
+	[NFTA_SET_ELEM_KEY_END]		= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy nft_set_elem_list_policy[NFTA_SET_ELEM_LIST_MAX + 1] = {
@@ -4399,10 +4400,11 @@ static struct nft_trans *nft_trans_elem_alloc(struct nft_ctx *ctx,
 	return trans;
 }
 
-void *nft_set_elem_init(const struct nft_set *set,
-			const struct nft_set_ext_tmpl *tmpl,
-			const u32 *key, const u32 *data,
-			u64 timeout, u64 expiration, gfp_t gfp)
+static void *__nft_set_elem_init(const struct nft_set *set,
+				 const struct nft_set_ext_tmpl *tmpl,
+				 const u32 *key, const u32 *key_end,
+				 const u32 *data, u64 timeout, u64 expiration,
+				 gfp_t gfp)
 {
 	struct nft_set_ext *ext;
 	void *elem;
@@ -4415,6 +4417,8 @@ void *nft_set_elem_init(const struct nft_set *set,
 	nft_set_ext_init(ext, tmpl);
 
 	memcpy(nft_set_ext_key(ext), key, set->klen);
+	if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END))
+		memcpy(nft_set_ext_key_end(ext), key_end, set->klen);
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA))
 		memcpy(nft_set_ext_data(ext), data, set->dlen);
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPIRATION)) {
@@ -4428,6 +4432,15 @@ void *nft_set_elem_init(const struct nft_set *set,
 	return elem;
 }
 
+void *nft_set_elem_init(const struct nft_set *set,
+			const struct nft_set_ext_tmpl *tmpl,
+			const u32 *key, const u32 *data, u64 timeout,
+			u64 expiration, gfp_t gfp)
+{
+	return __nft_set_elem_init(set, tmpl, key, NULL, data, timeout,
+				   expiration, gfp);
+}
+
 void nft_set_elem_destroy(const struct nft_set *set, void *elem,
 			  bool destroy_expr)
 {
@@ -4480,6 +4493,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	struct nft_set_binding *binding;
 	struct nft_object *obj = NULL;
 	struct nft_userdata *udata;
+	struct nft_data key_end;
 	struct nft_data_desc d2;
 	struct nft_data data;
 	enum nft_registers dreg;
@@ -4551,6 +4565,14 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	if (err < 0)
 		return err;
 
+	if (nla[NFTA_SET_ELEM_KEY_END]) {
+		err = nft_set_elem_key_ext(ctx, set, &key_end, &tmpl,
+					   nla[NFTA_SET_ELEM_KEY_END],
+					   NFT_SET_EXT_KEY_END);
+		if (err < 0)
+			return err;
+	}
+
 	if (timeout > 0) {
 		nft_set_ext_add(&tmpl, NFT_SET_EXT_EXPIRATION);
 		if (timeout != set->timeout)
@@ -4623,8 +4645,9 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	}
 
 	err = -ENOMEM;
-	elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data, data.data,
-				      timeout, expiration, GFP_KERNEL);
+	elem.priv = __nft_set_elem_init(set, &tmpl, elem.key.val.data,
+					key_end.data, data.data, timeout,
+					expiration, GFP_KERNEL);
 	if (elem.priv == NULL)
 		goto err3;
 
-- 
2.11.0



Thread overview: 24+ messages
2019-11-22 13:39 [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
2019-11-22 13:40 ` [PATCH nf-next v2 1/8] netfilter: nf_tables: Support for subkeys, set with multiple ranged fields Stefano Brivio
2019-11-23 20:01   ` Pablo Neira Ayuso
2019-11-25  9:30     ` Stefano Brivio
2019-11-25  9:58       ` Pablo Neira Ayuso
2019-11-25 13:26         ` Stefano Brivio
2019-11-25 14:30           ` Pablo Neira Ayuso
2019-11-25 14:54             ` Stefano Brivio
2019-11-25 20:38               ` Pablo Neira Ayuso
2019-11-22 13:40 ` [PATCH nf-next v2 2/8] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
2019-11-22 13:40 ` [PATCH nf-next v2 3/8] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
2019-11-27  9:29   ` Pablo Neira Ayuso
2019-11-27 11:02     ` Stefano Brivio
2019-11-27 18:29       ` Pablo Neira Ayuso
2019-11-22 13:40 ` [PATCH nf-next v2 4/8] selftests: netfilter: Introduce tests for sets with range concatenation Stefano Brivio
2019-11-22 13:40 ` [PATCH nf-next v2 5/8] nft_set_pipapo: Provide unrolled lookup loops for common field sizes Stefano Brivio
2019-11-22 13:40 ` [PATCH nf-next v2 6/8] nft_set_pipapo: Prepare for vectorised implementation: alignment Stefano Brivio
2019-11-22 13:40 ` [PATCH nf-next v2 7/8] nft_set_pipapo: Prepare for vectorised implementation: helpers Stefano Brivio
2019-11-22 13:40 ` [PATCH nf-next v2 8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
2019-11-26  6:36   ` kbuild test robot
2019-11-23 20:05 ` [PATCH nf-next v2 0/8] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
2019-11-25  9:31   ` Stefano Brivio
2019-11-25 10:02     ` Pablo Neira Ayuso
2019-11-25 13:36       ` Stefano Brivio
