Netfilter-Devel Archive on lore.kernel.org
* [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges
@ 2020-01-21 23:17 Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 1/9] netfilter: nf_tables: add nft_setelem_parse_key() Stefano Brivio
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

Existing nftables set implementations allow matching entries with
interval expressions (rbtree), e.g. 192.0.2.1-192.0.2.4, or entries
specifying field concatenation (hash, rhash), e.g. 192.0.2.1:22,
but not both.

In other words, none of the set types allows matching on range
expressions for more than one packet field at a time, such as ipset
does with types bitmap:ip,mac, and, to a more limited extent
(netmasks, not arbitrary ranges), with types hash:net,net,
hash:net,port, hash:ip,port,net, and hash:net,port,net.
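
With the matching nftables userspace counterpart (mentioned below as
depending on the UAPI changes in this series), the intended usage
would read roughly as follows; this is a sketch of the ruleset
syntax, not part of this series:

	table inet filter {
		set example {
			type ipv4_addr . inet_service
			flags interval
			elements = { 192.0.2.1-192.0.2.4 . 22 }
		}
	}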

As a pure hash-based approach is unsuitable for matching on ranges,
and "proxying" the existing red-black tree type looks impractical as
elements would need to be shared and managed across all employed
trees, this new set implementation intends to fill the functionality
gap by employing a relatively novel approach.

The fundamental idea, illustrated in deeper detail in patch 5/9, is to
use lookup tables classifying a small number of grouped bits from each
field, and map the lookup results in a way that yields a verdict for
the full set of specified fields.

The bit grouping aspect is loosely inspired by the Grouper algorithm
by Jay Ligatti, Josh Kuhn, and Chris Gage (see patch 5/9 for the full
reference).
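
To illustrate the idea on a toy scale (a simplified sketch, not the
kernel implementation: a single 8-bit field and prefix-style rules
only, whereas the actual set type expands arbitrary ranges into such
rules and maps results across multiple fields):

```c
#include <assert.h>
#include <stdint.h>

#define GROUPS	2	/* 4-bit groups in an 8-bit field */
#define BUCKETS	16	/* one bucket per possible group value */

/* lt[g][v] is a bitmap of the rules accepting value v in group g */
static uint8_t lt[GROUPS][BUCKETS];

/* Insert 'rule' matching 'value' on its first 'prefix_nibbles' 4-bit
 * groups, with any remaining group wildcarded (netmask-style rule).
 */
static void add_rule(int rule, uint8_t value, int prefix_nibbles)
{
	for (int g = 0; g < GROUPS; g++) {
		int nibble = g ? value & 0xf : value >> 4;

		for (int b = 0; b < BUCKETS; b++)
			if (g >= prefix_nibbles || b == nibble)
				lt[g][b] |= 1 << rule;
	}
}

/* Classify each group of 'key' via its lookup table and AND the
 * resulting bitmaps: the surviving bits are the matching rules.
 */
static uint8_t lookup(uint8_t key)
{
	return lt[0][key >> 4] & lt[1][key & 0xf];
}
```

The matching cost is a fixed sequence of table loads and bitwise
ANDs, independent of the number of rules up to the bitmap width,
which is what makes the approach amenable to SIMD.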

A reference, stand-alone implementation of the algorithm itself is
available at:
	https://pipapo.lameexcu.se

Some notes about possible future optimisations are also mentioned
there. This algorithm reduces the matching problem to, essentially,
a repetitive sequence of simple bitwise operations, and is
particularly well suited to optimisation with SIMD instruction
sets. An AVX2-based implementation is also presented in this series.

I plan to post the adaptation of the existing AVX2 vectorised
implementation for (at least) NEON at a later time.

Patches 1/9 to 3/9 implement the needed infrastructure: new
attributes are used to describe the length of single ranged fields in
concatenations and to denote the upper bound for ranges.

Patch 4/9 adds a new bitmap operation that copies the source bitmap
onto the destination while removing a given region, and is needed to
delete regions of arrays mapping between lookup tables.
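
Conceptually, the new operation behaves like this single-word sketch
(an illustration only: the actual bitmap_cut() added in patch 4/9
operates on arbitrary-length bitmaps of unsigned longs):

```c
#include <assert.h>

/* Single-word sketch of cut-and-shift: return 'src' with 'cut' bits
 * removed starting at bit position 'first'; higher bits shift down
 * to fill the gap. Assumes first + cut stays within the word.
 */
static unsigned long cut_bits(unsigned long src, unsigned int first,
			      unsigned int cut)
{
	unsigned long keep_mask = (1UL << first) - 1;

	return (src & keep_mask) | ((src >> cut) & ~keep_mask);
}
```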

Patch 5/9 is the actual set implementation.

Patch 6/9 introduces selftests for the new implementation.

Patches 7/9 and 8/9 are preparatory work to add an alternative,
vectorised lookup implementation.

Patch 9/9 contains the AVX2-based implementation of the lookup
routines.

The nftables and libnftnl counterparts depend on changes to the UAPI
header file included in patches 2/9 and 3/9.

Credits go to Jay Ligatti, Josh Kuhn, and Chris Gage for their
original Grouper implementation and article from ICCCN proceedings
(see reference in patch 5/9), and to Daniel Lemire for his public
domain implementation of a fast iterator on set bits using built-in
implementations of the CTZL operation, also included in patch 5/9.
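
The iteration technique reads roughly as follows (a sketch of the
idea, assuming a compiler that provides __builtin_ctzl(), as GCC and
Clang do):

```c
#include <assert.h>

/* Visit every set bit of 'word', lowest position first, without
 * scanning over clear bits: __builtin_ctzl() (count trailing zeroes)
 * yields the lowest set bit, and word &= word - 1 clears it.
 */
static int collect_set_bits(unsigned long word, int *pos)
{
	int n = 0;

	while (word) {
		pos[n++] = __builtin_ctzl(word);
		word &= word - 1;
	}

	return n;
}
```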

Special thanks go to Florian Westphal for all the nftables consulting
and the original interface idea, to Sabrina Dubroca for support with
RCU and bit manipulation topics, to Eric Garver for an early review,
to Phil Sutter for reaffirming the need for the use case covered
here, and to Pablo Neira Ayuso for proposing a dramatic
simplification of the infrastructure.

v4:
 - fix build for 32-bit architectures: 64-bit division needs div_u64()
   in 5/9 (kbuild test robot <lkp@intel.com>)
v3:
 - add patches 1/9 and 2/9
 - drop patch 5/8 (unrolled lookup loops), as it actually decreases
   matching rates in some cases
 - other changes listed in single patches
v2: Changes listed in messages for 3/8 and 8/8

Pablo Neira Ayuso (2):
  netfilter: nf_tables: add nft_setelem_parse_key()
  netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute

Stefano Brivio (7):
  netfilter: nf_tables: Support for sets with multiple ranged fields
  bitmap: Introduce bitmap_cut(): cut bits and shift remaining
  nf_tables: Add set type for arbitrary concatenation of ranges
  selftests: netfilter: Introduce tests for sets with range
    concatenation
  nft_set_pipapo: Prepare for vectorised implementation: alignment
  nft_set_pipapo: Prepare for vectorised implementation: helpers
  nft_set_pipapo: Introduce AVX2-based lookup implementation

 include/linux/bitmap.h                        |    4 +
 include/net/netfilter/nf_tables.h             |   22 +-
 include/net/netfilter/nf_tables_core.h        |    2 +
 include/uapi/linux/netfilter/nf_tables.h      |   17 +
 lib/bitmap.c                                  |   66 +
 net/netfilter/Makefile                        |    8 +-
 net/netfilter/nf_tables_api.c                 |  260 ++-
 net/netfilter/nf_tables_set_core.c            |    8 +
 net/netfilter/nft_dynset.c                    |    2 +-
 net/netfilter/nft_set_pipapo.c                | 2012 +++++++++++++++++
 net/netfilter/nft_set_pipapo.h                |  237 ++
 net/netfilter/nft_set_pipapo_avx2.c           |  842 +++++++
 net/netfilter/nft_set_pipapo_avx2.h           |   14 +
 net/netfilter/nft_set_rbtree.c                |    3 +
 tools/testing/selftests/netfilter/Makefile    |    3 +-
 .../selftests/netfilter/nft_concat_range.sh   | 1481 ++++++++++++
 16 files changed, 4911 insertions(+), 70 deletions(-)
 create mode 100644 net/netfilter/nft_set_pipapo.c
 create mode 100644 net/netfilter/nft_set_pipapo.h
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.c
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.h
 create mode 100755 tools/testing/selftests/netfilter/nft_concat_range.sh

-- 
2.24.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH nf-next v4 1/9] netfilter: nf_tables: add nft_setelem_parse_key()
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 2/9] netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute Stefano Brivio
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

From: Pablo Neira Ayuso <pablo@netfilter.org>

Add a helper function to parse the set element key netlink attribute.

v4: No changes
v3: New patch

[sbrivio: refactor error paths and labels; use NFT_DATA_VALUE_MAXLEN
  instead of sizeof(*key) in helper, value can be longer than that;
  rebase]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c | 91 +++++++++++++++++------------------
 1 file changed, 45 insertions(+), 46 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 43f05b3acd60..0628b9ad7aa4 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4490,11 +4490,28 @@ static int nft_setelem_parse_flags(const struct nft_set *set,
 	return 0;
 }
 
+static int nft_setelem_parse_key(struct nft_ctx *ctx, struct nft_set *set,
+				 struct nft_data *key, struct nlattr *attr)
+{
+	struct nft_data_desc desc;
+	int err;
+
+	err = nft_data_init(ctx, key, NFT_DATA_VALUE_MAXLEN, &desc, attr);
+	if (err < 0)
+		return err;
+
+	if (desc.type != NFT_DATA_VALUE || desc.len != set->klen) {
+		nft_data_release(key, desc.type);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int nft_get_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 			    const struct nlattr *attr)
 {
 	struct nlattr *nla[NFTA_SET_ELEM_MAX + 1];
-	struct nft_data_desc desc;
 	struct nft_set_elem elem;
 	struct sk_buff *skb;
 	uint32_t flags = 0;
@@ -4513,17 +4530,11 @@ static int nft_get_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	if (err < 0)
 		return err;
 
-	err = nft_data_init(ctx, &elem.key.val, sizeof(elem.key), &desc,
-			    nla[NFTA_SET_ELEM_KEY]);
+	err = nft_setelem_parse_key(ctx, set, &elem.key.val,
+				    nla[NFTA_SET_ELEM_KEY]);
 	if (err < 0)
 		return err;
 
-	err = -EINVAL;
-	if (desc.type != NFT_DATA_VALUE || desc.len != set->klen) {
-		nft_data_release(&elem.key.val, desc.type);
-		return err;
-	}
-
 	priv = set->ops->get(ctx->net, set, &elem, flags);
 	if (IS_ERR(priv))
 		return PTR_ERR(priv);
@@ -4722,13 +4733,13 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 {
 	struct nlattr *nla[NFTA_SET_ELEM_MAX + 1];
 	u8 genmask = nft_genmask_next(ctx->net);
-	struct nft_data_desc d1, d2;
 	struct nft_set_ext_tmpl tmpl;
 	struct nft_set_ext *ext, *ext2;
 	struct nft_set_elem elem;
 	struct nft_set_binding *binding;
 	struct nft_object *obj = NULL;
 	struct nft_userdata *udata;
+	struct nft_data_desc desc;
 	struct nft_data data;
 	enum nft_registers dreg;
 	struct nft_trans *trans;
@@ -4794,15 +4805,12 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 			return err;
 	}
 
-	err = nft_data_init(ctx, &elem.key.val, sizeof(elem.key), &d1,
-			    nla[NFTA_SET_ELEM_KEY]);
+	err = nft_setelem_parse_key(ctx, set, &elem.key.val,
+				    nla[NFTA_SET_ELEM_KEY]);
 	if (err < 0)
 		goto err1;
-	err = -EINVAL;
-	if (d1.type != NFT_DATA_VALUE || d1.len != set->klen)
-		goto err2;
 
-	nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, d1.len);
+	nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, set->klen);
 	if (timeout > 0) {
 		nft_set_ext_add(&tmpl, NFT_SET_EXT_EXPIRATION);
 		if (timeout != set->timeout)
@@ -4825,13 +4833,13 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	}
 
 	if (nla[NFTA_SET_ELEM_DATA] != NULL) {
-		err = nft_data_init(ctx, &data, sizeof(data), &d2,
+		err = nft_data_init(ctx, &data, sizeof(data), &desc,
 				    nla[NFTA_SET_ELEM_DATA]);
 		if (err < 0)
 			goto err2;
 
 		err = -EINVAL;
-		if (set->dtype != NFT_DATA_VERDICT && d2.len != set->dlen)
+		if (set->dtype != NFT_DATA_VERDICT && desc.len != set->dlen)
 			goto err3;
 
 		dreg = nft_type_to_reg(set->dtype);
@@ -4848,18 +4856,18 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 
 			err = nft_validate_register_store(&bind_ctx, dreg,
 							  &data,
-							  d2.type, d2.len);
+							  desc.type, desc.len);
 			if (err < 0)
 				goto err3;
 
-			if (d2.type == NFT_DATA_VERDICT &&
+			if (desc.type == NFT_DATA_VERDICT &&
 			    (data.verdict.code == NFT_GOTO ||
 			     data.verdict.code == NFT_JUMP))
 				nft_validate_state_update(ctx->net,
 							  NFT_VALIDATE_NEED);
 		}
 
-		nft_set_ext_add_length(&tmpl, NFT_SET_EXT_DATA, d2.len);
+		nft_set_ext_add_length(&tmpl, NFT_SET_EXT_DATA, desc.len);
 	}
 
 	/* The full maximum length of userdata can exceed the maximum
@@ -4942,9 +4950,9 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	kfree(elem.priv);
 err3:
 	if (nla[NFTA_SET_ELEM_DATA] != NULL)
-		nft_data_release(&data, d2.type);
+		nft_data_release(&data, desc.type);
 err2:
-	nft_data_release(&elem.key.val, d1.type);
+	nft_data_release(&elem.key.val, NFT_DATA_VALUE);
 err1:
 	return err;
 }
@@ -5040,7 +5048,6 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 {
 	struct nlattr *nla[NFTA_SET_ELEM_MAX + 1];
 	struct nft_set_ext_tmpl tmpl;
-	struct nft_data_desc desc;
 	struct nft_set_elem elem;
 	struct nft_set_ext *ext;
 	struct nft_trans *trans;
@@ -5051,11 +5058,10 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 	err = nla_parse_nested_deprecated(nla, NFTA_SET_ELEM_MAX, attr,
 					  nft_set_elem_policy, NULL);
 	if (err < 0)
-		goto err1;
+		return err;
 
-	err = -EINVAL;
 	if (nla[NFTA_SET_ELEM_KEY] == NULL)
-		goto err1;
+		return -EINVAL;
 
 	nft_set_ext_prepare(&tmpl);
 
@@ -5065,37 +5071,31 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 	if (flags != 0)
 		nft_set_ext_add(&tmpl, NFT_SET_EXT_FLAGS);
 
-	err = nft_data_init(ctx, &elem.key.val, sizeof(elem.key), &desc,
-			    nla[NFTA_SET_ELEM_KEY]);
+	err = nft_setelem_parse_key(ctx, set, &elem.key.val,
+				    nla[NFTA_SET_ELEM_KEY]);
 	if (err < 0)
-		goto err1;
-
-	err = -EINVAL;
-	if (desc.type != NFT_DATA_VALUE || desc.len != set->klen)
-		goto err2;
+		return err;
 
-	nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, desc.len);
+	nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, set->klen);
 
 	err = -ENOMEM;
 	elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data, NULL, 0,
 				      0, GFP_KERNEL);
 	if (elem.priv == NULL)
-		goto err2;
+		goto fail_elem;
 
 	ext = nft_set_elem_ext(set, elem.priv);
 	if (flags)
 		*nft_set_ext_flags(ext) = flags;
 
 	trans = nft_trans_elem_alloc(ctx, NFT_MSG_DELSETELEM, set);
-	if (trans == NULL) {
-		err = -ENOMEM;
-		goto err3;
-	}
+	if (trans == NULL)
+		goto fail_trans;
 
 	priv = set->ops->deactivate(ctx->net, set, &elem);
 	if (priv == NULL) {
 		err = -ENOENT;
-		goto err4;
+		goto fail_ops;
 	}
 	kfree(elem.priv);
 	elem.priv = priv;
@@ -5106,13 +5106,12 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 	list_add_tail(&trans->list, &ctx->net->nft.commit_list);
 	return 0;
 
-err4:
+fail_ops:
 	kfree(trans);
-err3:
+fail_trans:
 	kfree(elem.priv);
-err2:
-	nft_data_release(&elem.key.val, desc.type);
-err1:
+fail_elem:
+	nft_data_release(&elem.key.val, NFT_DATA_VALUE);
 	return err;
 }
 
-- 
2.24.1



* [PATCH nf-next v4 2/9] netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 1/9] netfilter: nf_tables: add nft_setelem_parse_key() Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 3/9] netfilter: nf_tables: Support for sets with multiple ranged fields Stefano Brivio
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

From: Pablo Neira Ayuso <pablo@netfilter.org>

Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
interval between kernel and userspace.

This patch also adds the NFT_SET_EXT_KEY_END extension to store the
closing element value in this interval.

v4: No changes
v3: New patch

[sbrivio: refactor error paths and labels; add corresponding
  nft_set_ext_type for new key; rebase]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h        | 14 +++-
 include/uapi/linux/netfilter/nf_tables.h |  2 +
 net/netfilter/nf_tables_api.c            | 85 ++++++++++++++++++------
 net/netfilter/nft_dynset.c               |  2 +-
 4 files changed, 79 insertions(+), 24 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index fe7c50acc681..504c0aa93805 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -231,6 +231,7 @@ struct nft_userdata {
  *	struct nft_set_elem - generic representation of set elements
  *
  *	@key: element key
+ *	@key_end: closing element key
  *	@priv: element private data and extensions
  */
 struct nft_set_elem {
@@ -238,6 +239,10 @@ struct nft_set_elem {
 		u32		buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)];
 		struct nft_data	val;
 	} key;
+	union {
+		u32		buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)];
+		struct nft_data	val;
+	} key_end;
 	void			*priv;
 };
 
@@ -502,6 +507,7 @@ void nf_tables_destroy_set(const struct nft_ctx *ctx, struct nft_set *set);
  *	enum nft_set_extensions - set extension type IDs
  *
  *	@NFT_SET_EXT_KEY: element key
+ *	@NFT_SET_EXT_KEY_END: upper bound element key, for ranges
  *	@NFT_SET_EXT_DATA: mapping data
  *	@NFT_SET_EXT_FLAGS: element flags
  *	@NFT_SET_EXT_TIMEOUT: element timeout
@@ -513,6 +519,7 @@ void nf_tables_destroy_set(const struct nft_ctx *ctx, struct nft_set *set);
  */
 enum nft_set_extensions {
 	NFT_SET_EXT_KEY,
+	NFT_SET_EXT_KEY_END,
 	NFT_SET_EXT_DATA,
 	NFT_SET_EXT_FLAGS,
 	NFT_SET_EXT_TIMEOUT,
@@ -606,6 +613,11 @@ static inline struct nft_data *nft_set_ext_key(const struct nft_set_ext *ext)
 	return nft_set_ext(ext, NFT_SET_EXT_KEY);
 }
 
+static inline struct nft_data *nft_set_ext_key_end(const struct nft_set_ext *ext)
+{
+	return nft_set_ext(ext, NFT_SET_EXT_KEY_END);
+}
+
 static inline struct nft_data *nft_set_ext_data(const struct nft_set_ext *ext)
 {
 	return nft_set_ext(ext, NFT_SET_EXT_DATA);
@@ -655,7 +667,7 @@ static inline struct nft_object **nft_set_ext_obj(const struct nft_set_ext *ext)
 
 void *nft_set_elem_init(const struct nft_set *set,
 			const struct nft_set_ext_tmpl *tmpl,
-			const u32 *key, const u32 *data,
+			const u32 *key, const u32 *key_end, const u32 *data,
 			u64 timeout, u64 expiration, gfp_t gfp);
 void nft_set_elem_destroy(const struct nft_set *set, void *elem,
 			  bool destroy_expr);
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 261864736b26..c13106496bd2 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -370,6 +370,7 @@ enum nft_set_elem_flags {
  * @NFTA_SET_ELEM_USERDATA: user data (NLA_BINARY)
  * @NFTA_SET_ELEM_EXPR: expression (NLA_NESTED: nft_expr_attributes)
  * @NFTA_SET_ELEM_OBJREF: stateful object reference (NLA_STRING)
+ * @NFTA_SET_ELEM_KEY_END: closing key value (NLA_NESTED: nft_data)
  */
 enum nft_set_elem_attributes {
 	NFTA_SET_ELEM_UNSPEC,
@@ -382,6 +383,7 @@ enum nft_set_elem_attributes {
 	NFTA_SET_ELEM_EXPR,
 	NFTA_SET_ELEM_PAD,
 	NFTA_SET_ELEM_OBJREF,
+	NFTA_SET_ELEM_KEY_END,
 	__NFTA_SET_ELEM_MAX
 };
 #define NFTA_SET_ELEM_MAX	(__NFTA_SET_ELEM_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0628b9ad7aa4..6cfd65348958 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4182,6 +4182,9 @@ const struct nft_set_ext_type nft_set_ext_types[] = {
 		.len	= sizeof(struct nft_userdata),
 		.align	= __alignof__(struct nft_userdata),
 	},
+	[NFT_SET_EXT_KEY_END]		= {
+		.align	= __alignof__(u32),
+	},
 };
 EXPORT_SYMBOL_GPL(nft_set_ext_types);
 
@@ -4199,6 +4202,7 @@ static const struct nla_policy nft_set_elem_policy[NFTA_SET_ELEM_MAX + 1] = {
 					    .len = NFT_USERDATA_MAXLEN },
 	[NFTA_SET_ELEM_EXPR]		= { .type = NLA_NESTED },
 	[NFTA_SET_ELEM_OBJREF]		= { .type = NLA_STRING },
+	[NFTA_SET_ELEM_KEY_END]		= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy nft_set_elem_list_policy[NFTA_SET_ELEM_LIST_MAX + 1] = {
@@ -4248,6 +4252,11 @@ static int nf_tables_fill_setelem(struct sk_buff *skb,
 			  NFT_DATA_VALUE, set->klen) < 0)
 		goto nla_put_failure;
 
+	if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END) &&
+	    nft_data_dump(skb, NFTA_SET_ELEM_KEY_END, nft_set_ext_key_end(ext),
+			  NFT_DATA_VALUE, set->klen) < 0)
+		goto nla_put_failure;
+
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA) &&
 	    nft_data_dump(skb, NFTA_SET_ELEM_DATA, nft_set_ext_data(ext),
 			  set->dtype == NFT_DATA_VERDICT ? NFT_DATA_VERDICT : NFT_DATA_VALUE,
@@ -4535,6 +4544,13 @@ static int nft_get_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	if (err < 0)
 		return err;
 
+	if (nla[NFTA_SET_ELEM_KEY_END]) {
+		err = nft_setelem_parse_key(ctx, set, &elem.key_end.val,
+					    nla[NFTA_SET_ELEM_KEY_END]);
+		if (err < 0)
+			return err;
+	}
+
 	priv = set->ops->get(ctx->net, set, &elem, flags);
 	if (IS_ERR(priv))
 		return PTR_ERR(priv);
@@ -4660,8 +4676,8 @@ static struct nft_trans *nft_trans_elem_alloc(struct nft_ctx *ctx,
 
 void *nft_set_elem_init(const struct nft_set *set,
 			const struct nft_set_ext_tmpl *tmpl,
-			const u32 *key, const u32 *data,
-			u64 timeout, u64 expiration, gfp_t gfp)
+			const u32 *key, const u32 *key_end,
+			const u32 *data, u64 timeout, u64 expiration, gfp_t gfp)
 {
 	struct nft_set_ext *ext;
 	void *elem;
@@ -4674,6 +4690,8 @@ void *nft_set_elem_init(const struct nft_set *set,
 	nft_set_ext_init(ext, tmpl);
 
 	memcpy(nft_set_ext_key(ext), key, set->klen);
+	if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END))
+		memcpy(nft_set_ext_key_end(ext), key_end, set->klen);
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA))
 		memcpy(nft_set_ext_data(ext), data, set->dlen);
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPIRATION)) {
@@ -4808,9 +4826,19 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	err = nft_setelem_parse_key(ctx, set, &elem.key.val,
 				    nla[NFTA_SET_ELEM_KEY]);
 	if (err < 0)
-		goto err1;
+		return err;
 
 	nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, set->klen);
+
+	if (nla[NFTA_SET_ELEM_KEY_END]) {
+		err = nft_setelem_parse_key(ctx, set, &elem.key_end.val,
+					    nla[NFTA_SET_ELEM_KEY_END]);
+		if (err < 0)
+			goto err_parse_key;
+
+		nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY_END, set->klen);
+	}
+
 	if (timeout > 0) {
 		nft_set_ext_add(&tmpl, NFT_SET_EXT_EXPIRATION);
 		if (timeout != set->timeout)
@@ -4820,14 +4848,14 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	if (nla[NFTA_SET_ELEM_OBJREF] != NULL) {
 		if (!(set->flags & NFT_SET_OBJECT)) {
 			err = -EINVAL;
-			goto err2;
+			goto err_parse_key_end;
 		}
 		obj = nft_obj_lookup(ctx->net, ctx->table,
 				     nla[NFTA_SET_ELEM_OBJREF],
 				     set->objtype, genmask);
 		if (IS_ERR(obj)) {
 			err = PTR_ERR(obj);
-			goto err2;
+			goto err_parse_key_end;
 		}
 		nft_set_ext_add(&tmpl, NFT_SET_EXT_OBJREF);
 	}
@@ -4836,11 +4864,11 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 		err = nft_data_init(ctx, &data, sizeof(data), &desc,
 				    nla[NFTA_SET_ELEM_DATA]);
 		if (err < 0)
-			goto err2;
+			goto err_parse_key_end;
 
 		err = -EINVAL;
 		if (set->dtype != NFT_DATA_VERDICT && desc.len != set->dlen)
-			goto err3;
+			goto err_parse_data;
 
 		dreg = nft_type_to_reg(set->dtype);
 		list_for_each_entry(binding, &set->bindings, list) {
@@ -4858,7 +4886,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 							  &data,
 							  desc.type, desc.len);
 			if (err < 0)
-				goto err3;
+				goto err_parse_data;
 
 			if (desc.type == NFT_DATA_VERDICT &&
 			    (data.verdict.code == NFT_GOTO ||
@@ -4883,10 +4911,11 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	}
 
 	err = -ENOMEM;
-	elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data, data.data,
+	elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data,
+				      elem.key_end.val.data, data.data,
 				      timeout, expiration, GFP_KERNEL);
 	if (elem.priv == NULL)
-		goto err3;
+		goto err_parse_data;
 
 	ext = nft_set_elem_ext(set, elem.priv);
 	if (flags)
@@ -4903,7 +4932,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 
 	trans = nft_trans_elem_alloc(ctx, NFT_MSG_NEWSETELEM, set);
 	if (trans == NULL)
-		goto err4;
+		goto err_trans;
 
 	ext->genmask = nft_genmask_cur(ctx->net) | NFT_SET_ELEM_BUSY_MASK;
 	err = set->ops->insert(ctx->net, set, &elem, &ext2);
@@ -4914,7 +4943,7 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 			    nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF) ^
 			    nft_set_ext_exists(ext2, NFT_SET_EXT_OBJREF)) {
 				err = -EBUSY;
-				goto err5;
+				goto err_element_clash;
 			}
 			if ((nft_set_ext_exists(ext, NFT_SET_EXT_DATA) &&
 			     nft_set_ext_exists(ext2, NFT_SET_EXT_DATA) &&
@@ -4927,33 +4956,35 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 			else if (!(nlmsg_flags & NLM_F_EXCL))
 				err = 0;
 		}
-		goto err5;
+		goto err_element_clash;
 	}
 
 	if (set->size &&
 	    !atomic_add_unless(&set->nelems, 1, set->size + set->ndeact)) {
 		err = -ENFILE;
-		goto err6;
+		goto err_set_full;
 	}
 
 	nft_trans_elem(trans) = elem;
 	list_add_tail(&trans->list, &ctx->net->nft.commit_list);
 	return 0;
 
-err6:
+err_set_full:
 	set->ops->remove(ctx->net, set, &elem);
-err5:
+err_element_clash:
 	kfree(trans);
-err4:
+err_trans:
 	if (obj)
 		obj->use--;
 	kfree(elem.priv);
-err3:
+err_parse_data:
 	if (nla[NFTA_SET_ELEM_DATA] != NULL)
 		nft_data_release(&data, desc.type);
-err2:
+err_parse_key_end:
+	nft_data_release(&elem.key_end.val, NFT_DATA_VALUE);
+err_parse_key:
 	nft_data_release(&elem.key.val, NFT_DATA_VALUE);
-err1:
+
 	return err;
 }
 
@@ -5078,9 +5109,19 @@ static int nft_del_setelem(struct nft_ctx *ctx, struct nft_set *set,
 
 	nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY, set->klen);
 
+	if (nla[NFTA_SET_ELEM_KEY_END]) {
+		err = nft_setelem_parse_key(ctx, set, &elem.key_end.val,
+					    nla[NFTA_SET_ELEM_KEY_END]);
+		if (err < 0)
+			return err;
+
+		nft_set_ext_add_length(&tmpl, NFT_SET_EXT_KEY_END, set->klen);
+	}
+
 	err = -ENOMEM;
-	elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data, NULL, 0,
-				      0, GFP_KERNEL);
+	elem.priv = nft_set_elem_init(set, &tmpl, elem.key.val.data,
+				      elem.key_end.val.data, NULL, 0, 0,
+				      GFP_KERNEL);
 	if (elem.priv == NULL)
 		goto fail_elem;
 
diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c
index 8887295414dc..683785225a3e 100644
--- a/net/netfilter/nft_dynset.c
+++ b/net/netfilter/nft_dynset.c
@@ -54,7 +54,7 @@ static void *nft_dynset_new(struct nft_set *set, const struct nft_expr *expr,
 
 	timeout = priv->timeout ? : set->timeout;
 	elem = nft_set_elem_init(set, &priv->tmpl,
-				 &regs->data[priv->sreg_key],
+				 &regs->data[priv->sreg_key], NULL,
 				 &regs->data[priv->sreg_data],
 				 timeout, 0, GFP_ATOMIC);
 	if (elem == NULL)
-- 
2.24.1



* [PATCH nf-next v4 3/9] netfilter: nf_tables: Support for sets with multiple ranged fields
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 1/9] netfilter: nf_tables: add nft_setelem_parse_key() Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 2/9] netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 4/9] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
to specify the length of each field in a set concatenation.

This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would be selected as
they specify support for NFT_SET_INTERVAL and allow desc->field_count
to be greater than one. Explicitly disallow this for nft_set_rbtree.

In order to specify the interval for a set entry, userspace would
include the field lengths in NFTA_SET_DESC_CONCAT attributes, and
pass the range endpoints as two separate keys, represented by the
NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END attributes.

While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.

For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements:

  NFTA_SET_ELEM_KEY:            192.0.2.0 . 22
  NFTA_SET_ELEM_KEY_END:        192.0.2.42 . 25

and the NFTA_SET_DESC_CONCAT attribute would contain:

  NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN:		4
  NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN:		2
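
The two keys above concatenate each field padded to a 32-bit register
boundary. A sketch of that layout (pack_key() here is a made-up
illustration of the padding convention, not a kernel helper):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REG32_SIZE 4	/* matches NFT_REG32_SIZE */

/* Concatenate fields into a key buffer, padding each field with
 * zeroes up to a 32-bit register boundary: a 4-byte IPv4 address
 * fills one register, a 2-byte port takes a register with two bytes
 * of padding. Returns the resulting key length.
 */
static size_t pack_key(uint8_t *key, const uint8_t *const fields[],
		       const size_t lens[], size_t count)
{
	size_t off = 0;

	for (size_t i = 0; i < count; i++) {
		size_t padded = (lens[i] + REG32_SIZE - 1) &
				~(size_t)(REG32_SIZE - 1);

		memset(key + off, 0, padded);
		memcpy(key + off, fields[i], lens[i]);
		off += padded;
	}

	return off;
}
```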

v4: No changes
v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
v2: No changes

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/netfilter/nf_tables.h        |  8 +++
 include/uapi/linux/netfilter/nf_tables.h | 15 ++++
 net/netfilter/nf_tables_api.c            | 90 +++++++++++++++++++++++-
 net/netfilter/nft_set_rbtree.c           |  3 +
 4 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 504c0aa93805..4170c033d461 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -264,11 +264,15 @@ struct nft_set_iter {
  *	@klen: key length
  *	@dlen: data length
  *	@size: number of set elements
+ *	@field_len: length of each field in concatenation, bytes
+ *	@field_count: number of concatenated fields in element
  */
 struct nft_set_desc {
 	unsigned int		klen;
 	unsigned int		dlen;
 	unsigned int		size;
+	u8			field_len[NFT_REG32_COUNT];
+	u8			field_count;
 };
 
 /**
@@ -409,6 +413,8 @@ void nft_unregister_set(struct nft_set_type *type);
  * 	@dtype: data type (verdict or numeric type defined by userspace)
  * 	@objtype: object type (see NFT_OBJECT_* definitions)
  * 	@size: maximum set size
+ *	@field_len: length of each field in concatenation, bytes
+ *	@field_count: number of concatenated fields in element
  *	@use: number of rules references to this set
  * 	@nelems: number of elements
  * 	@ndeact: number of deactivated elements queued for removal
@@ -435,6 +441,8 @@ struct nft_set {
 	u32				dtype;
 	u32				objtype;
 	u32				size;
+	u8				field_len[NFT_REG32_COUNT];
+	u8				field_count;
 	u32				use;
 	atomic_t			nelems;
 	u32				ndeact;
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index c13106496bd2..065218a20bb7 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -48,6 +48,7 @@ enum nft_registers {
 
 #define NFT_REG_SIZE	16
 #define NFT_REG32_SIZE	4
+#define NFT_REG32_COUNT	(NFT_REG32_15 - NFT_REG32_00 + 1)
 
 /**
  * enum nft_verdicts - nf_tables internal verdicts
@@ -301,14 +302,28 @@ enum nft_set_policies {
  * enum nft_set_desc_attributes - set element description
  *
  * @NFTA_SET_DESC_SIZE: number of elements in set (NLA_U32)
+ * @NFTA_SET_DESC_CONCAT: description of field concatenation (NLA_NESTED)
  */
 enum nft_set_desc_attributes {
 	NFTA_SET_DESC_UNSPEC,
 	NFTA_SET_DESC_SIZE,
+	NFTA_SET_DESC_CONCAT,
 	__NFTA_SET_DESC_MAX
 };
 #define NFTA_SET_DESC_MAX	(__NFTA_SET_DESC_MAX - 1)
 
+/**
+ * enum nft_set_field_attributes - attributes of concatenated fields
+ *
+ * @NFTA_SET_FIELD_LEN: length of single field, in bits (NLA_U32)
+ */
+enum nft_set_field_attributes {
+	NFTA_SET_FIELD_UNSPEC,
+	NFTA_SET_FIELD_LEN,
+	__NFTA_SET_FIELD_MAX
+};
+#define NFTA_SET_FIELD_MAX	(__NFTA_SET_FIELD_MAX - 1)
+
 /**
  * enum nft_set_attributes - nf_tables set netlink attributes
  *
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 6cfd65348958..767688e5673c 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3358,6 +3358,7 @@ static const struct nla_policy nft_set_policy[NFTA_SET_MAX + 1] = {
 
 static const struct nla_policy nft_set_desc_policy[NFTA_SET_DESC_MAX + 1] = {
 	[NFTA_SET_DESC_SIZE]		= { .type = NLA_U32 },
+	[NFTA_SET_DESC_CONCAT]		= { .type = NLA_NESTED },
 };
 
 static int nft_ctx_init_from_setattr(struct nft_ctx *ctx, struct net *net,
@@ -3524,6 +3525,33 @@ static __be64 nf_jiffies64_to_msecs(u64 input)
 	return cpu_to_be64(jiffies64_to_msecs(input));
 }
 
+static int nf_tables_fill_set_concat(struct sk_buff *skb,
+				     const struct nft_set *set)
+{
+	struct nlattr *concat, *field;
+	int i;
+
+	concat = nla_nest_start_noflag(skb, NFTA_SET_DESC_CONCAT);
+	if (!concat)
+		return -ENOMEM;
+
+	for (i = 0; i < set->field_count; i++) {
+		field = nla_nest_start_noflag(skb, NFTA_LIST_ELEM);
+		if (!field)
+			return -ENOMEM;
+
+		if (nla_put_be32(skb, NFTA_SET_FIELD_LEN,
+				 htonl(set->field_len[i])))
+			return -ENOMEM;
+
+		nla_nest_end(skb, field);
+	}
+
+	nla_nest_end(skb, concat);
+
+	return 0;
+}
+
 static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
 			      const struct nft_set *set, u16 event, u16 flags)
 {
@@ -3587,11 +3615,17 @@ static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
 		goto nla_put_failure;
 
 	desc = nla_nest_start_noflag(skb, NFTA_SET_DESC);
+
 	if (desc == NULL)
 		goto nla_put_failure;
 	if (set->size &&
 	    nla_put_be32(skb, NFTA_SET_DESC_SIZE, htonl(set->size)))
 		goto nla_put_failure;
+
+	if (set->field_count > 1 &&
+	    nf_tables_fill_set_concat(skb, set))
+		goto nla_put_failure;
+
 	nla_nest_end(skb, desc);
 
 	nlmsg_end(skb, nlh);
@@ -3764,6 +3798,53 @@ static int nf_tables_getset(struct net *net, struct sock *nlsk,
 	return err;
 }
 
+static const struct nla_policy nft_concat_policy[NFTA_SET_FIELD_MAX + 1] = {
+	[NFTA_SET_FIELD_LEN]	= { .type = NLA_U32 },
+};
+
+static int nft_set_desc_concat_parse(const struct nlattr *attr,
+				     struct nft_set_desc *desc)
+{
+	struct nlattr *tb[NFTA_SET_FIELD_MAX + 1];
+	u32 len;
+	int err;
+
+	err = nla_parse_nested_deprecated(tb, NFTA_SET_FIELD_MAX, attr,
+					  nft_concat_policy, NULL);
+	if (err < 0)
+		return err;
+
+	if (!tb[NFTA_SET_FIELD_LEN])
+		return -EINVAL;
+
+	len = ntohl(nla_get_be32(tb[NFTA_SET_FIELD_LEN]));
+
+	if (len * BITS_PER_BYTE / 32 > NFT_REG32_COUNT)
+		return -E2BIG;
+
+	desc->field_len[desc->field_count++] = len;
+
+	return 0;
+}
+
+static int nft_set_desc_concat(struct nft_set_desc *desc,
+			       const struct nlattr *nla)
+{
+	struct nlattr *attr;
+	int rem, err;
+
+	nla_for_each_nested(attr, nla, rem) {
+		if (nla_type(attr) != NFTA_LIST_ELEM)
+			return -EINVAL;
+
+		err = nft_set_desc_concat_parse(attr, desc);
+		if (err < 0)
+			return err;
+	}
+
+	return 0;
+}
+
 static int nf_tables_set_desc_parse(struct nft_set_desc *desc,
 				    const struct nlattr *nla)
 {
@@ -3777,8 +3858,10 @@ static int nf_tables_set_desc_parse(struct nft_set_desc *desc,
 
 	if (da[NFTA_SET_DESC_SIZE] != NULL)
 		desc->size = ntohl(nla_get_be32(da[NFTA_SET_DESC_SIZE]));
+	if (da[NFTA_SET_DESC_CONCAT])
+		err = nft_set_desc_concat(desc, da[NFTA_SET_DESC_CONCAT]);
 
-	return 0;
+	return err;
 }
 
 static int nf_tables_newset(struct net *net, struct sock *nlsk,
@@ -3801,6 +3884,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
 	unsigned char *udata;
 	u16 udlen;
 	int err;
+	int i;
 
 	if (nla[NFTA_SET_TABLE] == NULL ||
 	    nla[NFTA_SET_NAME] == NULL ||
@@ -3979,6 +4063,10 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
 	set->gc_int = gc_int;
 	set->handle = nf_tables_alloc_handle(table);
 
+	set->field_count = desc.field_count;
+	for (i = 0; i < desc.field_count; i++)
+		set->field_len[i] = desc.field_len[i];
+
 	err = ops->init(set, &desc, nla);
 	if (err < 0)
 		goto err3;
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index a9f804f7a04a..5000b938ab1e 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -466,6 +466,9 @@ static void nft_rbtree_destroy(const struct nft_set *set)
 static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,
 				struct nft_set_estimate *est)
 {
+	if (desc->field_count > 1)
+		return false;
+
 	if (desc->size)
 		est->size = sizeof(struct nft_rbtree) +
 			    desc->size * sizeof(struct nft_rbtree_elem);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH nf-next v4 4/9] bitmap: Introduce bitmap_cut(): cut bits and shift remaining
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (2 preceding siblings ...)
  2020-01-21 23:17 ` [PATCH nf-next v4 3/9] netfilter: nf_tables: Support for sets with multiple ranged fields Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

The new bitmap function bitmap_cut() copies bits from source to
destination by removing the region specified by parameters first
and cut, and remapping the bits above the cut region by right
shifting them.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: No changes
v3: No changes
v2: No changes

 include/linux/bitmap.h |  4 +++
 lib/bitmap.c           | 66 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index ff335b22f23c..f0f3a9fffa6a 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -53,6 +53,7 @@
  *  bitmap_find_next_zero_area_off(buf, len, pos, n, mask)  as above
  *  bitmap_shift_right(dst, src, n, nbits)      *dst = *src >> n
  *  bitmap_shift_left(dst, src, n, nbits)       *dst = *src << n
+ *  bitmap_cut(dst, src, first, n, nbits)       Cut n bits from first, copy rest
  *  bitmap_replace(dst, old, new, mask, nbits)  *dst = (*old & ~(*mask)) | (*new & *mask)
  *  bitmap_remap(dst, src, old, new, nbits)     *dst = map(old, new)(src)
  *  bitmap_bitremap(oldbit, old, new, nbits)    newbit = map(old, new)(oldbit)
@@ -133,6 +134,9 @@ extern void __bitmap_shift_right(unsigned long *dst, const unsigned long *src,
 				unsigned int shift, unsigned int nbits);
 extern void __bitmap_shift_left(unsigned long *dst, const unsigned long *src,
 				unsigned int shift, unsigned int nbits);
+extern void bitmap_cut(unsigned long *dst, const unsigned long *src,
+		       unsigned int first, unsigned int cut,
+		       unsigned int nbits);
 extern int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
 			const unsigned long *bitmap2, unsigned int nbits);
 extern void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 4250519d7d1c..6e175fbd69a9 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -168,6 +168,72 @@ void __bitmap_shift_left(unsigned long *dst, const unsigned long *src,
 }
 EXPORT_SYMBOL(__bitmap_shift_left);
 
+/**
+ * bitmap_cut() - remove bit region from bitmap and right shift remaining bits
+ * @dst: destination bitmap, might overlap with src
+ * @src: source bitmap
+ * @first: start bit of region to be removed
+ * @cut: number of bits to remove
+ * @nbits: bitmap size, in bits
+ *
+ * Set the n-th bit of @dst iff the n-th bit of @src is set and
+ * n is less than @first, or iff the (n + @cut)-th bit of @src is
+ * set and @first <= n < @nbits - @cut.
+ *
+ * In pictures, example for a big-endian 32-bit architecture:
+ *
+ * @src:
+ * 31                                   63
+ * |                                    |
+ * 10000000 11000001 11110010 00010101  10000000 11000001 01110010 00010101
+ *                 |  |              |                                    |
+ *                16  14             0                                   32
+ *
+ * if @cut is 3, and @first is 14, bits 14-16 in @src are cut and @dst is:
+ *
+ * 31                                   63
+ * |                                    |
+ * 10110000 00011000 00110010 00010101  00010000 00011000 00101110 01000010
+ *                    |              |                                    |
+ *                    14 (bit 17     0                                   32
+ *                        from @src)
+ *
+ * Note that @dst and @src might overlap partially or entirely.
+ *
+ * This is implemented in the obvious way, with a shift and carry
+ * step for each moved bit. Optimisation is left as an exercise
+ * for the compiler.
+ */
+void bitmap_cut(unsigned long *dst, const unsigned long *src,
+		unsigned int first, unsigned int cut, unsigned int nbits)
+{
+	unsigned int len = BITS_TO_LONGS(nbits);
+	unsigned long keep = 0, carry;
+	int i;
+
+	memmove(dst, src, len * sizeof(*dst));
+
+	if (first % BITS_PER_LONG) {
+		keep = src[first / BITS_PER_LONG] &
+		       (~0UL >> (BITS_PER_LONG - first % BITS_PER_LONG));
+	}
+
+	while (cut--) {
+		for (i = first / BITS_PER_LONG; i < len; i++) {
+			if (i < len - 1)
+				carry = dst[i + 1] & 1UL;
+			else
+				carry = 0;
+
+			dst[i] = (dst[i] >> 1) | (carry << (BITS_PER_LONG - 1));
+		}
+	}
+
+	dst[first / BITS_PER_LONG] &= ~0UL << (first % BITS_PER_LONG);
+	dst[first / BITS_PER_LONG] |= keep;
+}
+EXPORT_SYMBOL(bitmap_cut);
+
 int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
 				const unsigned long *bitmap2, unsigned int bits)
 {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (3 preceding siblings ...)
  2020-01-21 23:17 ` [PATCH nf-next v4 4/9] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-02-07 11:23   ` Pablo Neira Ayuso
  2020-01-21 23:17 ` [PATCH nf-next v4 6/9] selftests: netfilter: Introduce tests for sets with range concatenation Stefano Brivio
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with single fields padded to 32 bits. Ranges are
given by specifying start and end elements, each containing the
full concatenation of start and end values for the single
fields.
Ranges are expanded to their composing netmasks, for each field:
these are inserted as rules in per-field lookup tables. Bits to be
classified are divided into 4-bit groups, and for each group, the
lookup table contains 2^4 buckets, one for each possible value of
a bit group. This approach was inspired by the Grouper
algorithm:
	http://www.cse.usf.edu/~ligatti/projects/grouper/

Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.

In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.

The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.

A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical because elements
would need to be shared between trees, at least as long as we
want to keep UAPI changes to a minimum.

A stand-alone implementation of this algorithm is available at:
	https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).

This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.

At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10190076pps
    baseline hash (non-ranged entries):             6179564pps
    baseline rbtree (match on first field only):    2950341pps
    set with  1000 full, ranged entries:            2304165pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10143615pps
    baseline hash (non-ranged entries):             6135776pps
    baseline rbtree (match on first field only):    4311934pps
    set with   100 full, ranged entries:            4131471pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9730404pps
    baseline hash (non-ranged entries):             4809557pps
    baseline rbtree (match on first field only):    1501699pps
    set with  1000 full, ranged entries:            1092557pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10812426pps
    baseline hash (non-ranged entries):             6929353pps
    baseline rbtree (match on first field only):    3027105pps
    set with 30000 full, ranged entries:             284147pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9660114pps
    baseline hash (non-ranged entries):             3778877pps
    baseline rbtree (match on first field only):    3179379pps
    set with    10 full, ranged entries:            2082880pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9718324pps
    baseline hash (non-ranged entries):             3799021pps
    baseline rbtree (match on first field only):    1506689pps
    set with  1000 full, ranged entries:             783810pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10190029pps
    baseline hash (non-ranged entries):             5172218pps
    baseline rbtree (match on first field only):    2946863pps
    set with  1000 full, ranged entries:            1279122pps

v4:
 - fix build for 32-bit architectures: 64-bit division needs
   div_u64() (kbuild test robot <lkp@intel.com>)
v3:
 - rework interface for field length specification,
   NFT_SET_SUBKEY disappears and information is stored in
   description
 - remove scratch area to store closing element of ranges,
   as elements now come with an actual attribute to specify
   the upper range limit (Pablo Neira Ayuso)
 - also remove pointer to 'start' element from mapping table,
   closing key is now accessible via extension data
 - use bytes right away instead of bits for field lengths,
   this way we can also double the inner loop of the lookup
   function to take care of upper and lower bits in a single
   iteration (minor performance improvement)
 - make it clearer that set operations are actually atomic
   API-wise, but we can't e.g. implement flush() as one-shot
   action
 - fix type for 'dup' in nft_pipapo_insert(), check for
   duplicates only in the next generation, and in general take
   care of differentiating generation mask cases depending on
   the operation (Pablo Neira Ayuso)
 - report C implementation matching rate in commit message, so
   that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
 - protect access to scratch maps in nft_pipapo_lookup() with
   local_bh_disable/enable() (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
   already implied (Florian Westphal)
 - explain why partial allocation failures don't need handling
   in pipapo_realloc_scratch(), rename 'm' to clone and update
   related kerneldoc to make it clear we're not operating on
   the live copy (Florian Westphal)
 - add explicit check for priv->start_elem in
   nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
   with a NULL start element, and also zero it out in every
   operation that might make it invalid, so that insertion
   doesn't proceed with an invalid element (Florian Westphal)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/netfilter/nf_tables_core.h |    1 +
 net/netfilter/Makefile                 |    3 +-
 net/netfilter/nf_tables_set_core.c     |    2 +
 net/netfilter/nft_set_pipapo.c         | 2102 ++++++++++++++++++++++++
 4 files changed, 2107 insertions(+), 1 deletion(-)
 create mode 100644 net/netfilter/nft_set_pipapo.c

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index 2656155b4069..29e7e1021267 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -74,6 +74,7 @@ extern struct nft_set_type nft_set_hash_type;
 extern struct nft_set_type nft_set_hash_fast_type;
 extern struct nft_set_type nft_set_rbtree_type;
 extern struct nft_set_type nft_set_bitmap_type;
+extern struct nft_set_type nft_set_pipapo_type;
 
 struct nft_expr;
 struct nft_regs;
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 5e9b2eb24349..3f572e5a975e 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -81,7 +81,8 @@ nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 		  nft_chain_route.o nf_tables_offload.o
 
 nf_tables_set-objs := nf_tables_set_core.o \
-		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o
+		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o \
+		      nft_set_pipapo.o
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NF_TABLES_SET)	+= nf_tables_set.o
diff --git a/net/netfilter/nf_tables_set_core.c b/net/netfilter/nf_tables_set_core.c
index a9fce8d10051..586b621007eb 100644
--- a/net/netfilter/nf_tables_set_core.c
+++ b/net/netfilter/nf_tables_set_core.c
@@ -9,12 +9,14 @@ static int __init nf_tables_set_module_init(void)
 	nft_register_set(&nft_set_rhash_type);
 	nft_register_set(&nft_set_bitmap_type);
 	nft_register_set(&nft_set_rbtree_type);
+	nft_register_set(&nft_set_pipapo_type);
 
 	return 0;
 }
 
 static void __exit nf_tables_set_module_exit(void)
 {
+	nft_unregister_set(&nft_set_pipapo_type);
 	nft_unregister_set(&nft_set_rbtree_type);
 	nft_unregister_set(&nft_set_bitmap_type);
 	nft_unregister_set(&nft_set_rhash_type);
diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
new file mode 100644
index 000000000000..f0cb1e13af50
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo.c
@@ -0,0 +1,2102 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* PIPAPO: PIle PAcket POlicies: set for arbitrary concatenations of ranges
+ *
+ * Copyright (c) 2019-2020 Red Hat GmbH
+ *
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+/**
+ * DOC: Theory of Operation
+ *
+ *
+ * Problem
+ * -------
+ *
+ * Match packet bytes against entries composed of ranged or non-ranged packet
+ * field specifiers, mapping them to arbitrary references. For example:
+ *
+ * ::
+ *
+ *               --- fields --->
+ *      |    [net],[port],[net]... => [reference]
+ *   entries [net],[port],[net]... => [reference]
+ *      |    [net],[port],[net]... => [reference]
+ *      V    ...
+ *
+ * where [net] fields can be IP ranges or netmasks, and [port] fields are port
+ * ranges. Arbitrary packet fields can be matched.
+ *
+ *
+ * Algorithm Overview
+ * ------------------
+ *
+ * This algorithm is loosely inspired by [Ligatti 2010], and fundamentally
+ * relies on the consideration that every contiguous range in a space of b bits
+ * can be converted into b * 2 netmasks, from Theorem 3 in [Rottenstreich 2010],
+ * as also illustrated in Section 9 of [Kogan 2014].
+ *
+ * Classification against a number of entries, that require matching given bits
+ * of a packet field, is performed by grouping those bits in sets of arbitrary
+ * size, and classifying packet bits one group at a time.
+ *
+ * Example:
+ *   to match the source port (16 bits) of a packet, we can divide those 16 bits
+ *   in 4 groups of 4 bits each. Given the entry:
+ *      0000 0001 0101 1001
+ *   and a packet with source port:
+ *      0000 0001 1010 1001
+ *   first and second groups match, but the third doesn't. We conclude that the
+ *   packet doesn't match the given entry.
+ *
+ * Translate the set to a sequence of lookup tables, one per field. Each table
+ * has two dimensions: bit groups to be matched for a single packet field, and
+ * all the possible values of said groups (buckets). Input entries are
+ * represented as one or more rules, depending on the number of composing
+ * netmasks for the given field specifier, and a group match is indicated as a
+ * set bit, with number corresponding to the rule index, in all the buckets
+ * whose value matches the entry for a given group.
+ *
+ * Rules are mapped between fields through an array of x, n pairs, with each
+ * item mapping a matched rule to one or more rules. The position of the pair in
+ * the array indicates the matched rule to be mapped to the next field, x
+ * indicates the first rule index in the next field, and n the amount of
+ * next-field rules the current rule maps to.
+ *
+ * The mapping array for the last field maps to the desired references.
+ *
+ * To match, we perform table lookups using the values of grouped packet bits,
+ * and use a sequence of bitwise operations to progressively evaluate rule
+ * matching.
+ *
+ * A stand-alone, reference implementation, also including notes about possible
+ * future optimisations, is available at:
+ *    https://pipapo.lameexcu.se/
+ *
+ * Insertion
+ * ---------
+ *
+ * - For each packet field:
+ *
+ *   - divide the b packet bits we want to classify into groups of size t,
+ *     obtaining ceil(b / t) groups
+ *
+ *      Example: match on destination IP address, with t = 4: 32 bits, 8 groups
+ *      of 4 bits each
+ *
+ *   - allocate a lookup table with one column ("bucket") for each possible
+ *     value of a group, and with one row for each group
+ *
+ *      Example: 8 groups, 2^4 buckets:
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0
+ *        1
+ *        2
+ *        3
+ *        4
+ *        5
+ *        6
+ *        7
+ *
+ *   - map the bits we want to classify for the current field, for a given
+ *     entry, to a single rule for non-ranged and netmask set items, and to one
+ *     or multiple rules for ranges. Ranges are expanded to composing netmasks
+ *     by pipapo_expand().
+ *
+ *      Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048
+ *      - rule #0: 10.0.0.5
+ *      - rule #1: 192.168.1.0/24
+ *      - rule #2: 192.168.2.0/31
+ *
+ *   - insert references to the rules in the lookup table, selecting buckets
+ *     according to bit values of a rule in the given group. This is done by
+ *     pipapo_insert().
+ *
+ *      Example: given:
+ *      - rule #0: 10.0.0.5 mapping to buckets
+ *        < 0 10  0 0   0 0  0 5 >
+ *      - rule #1: 192.168.1.0/24 mapping to buckets
+ *        < 12 0  10 8  0 1  < 0..15 > < 0..15 > >
+ *      - rule #2: 192.168.2.0/31 mapping to buckets
+ *        < 12 0  10 8  0 2  0 < 0..1 > >
+ *
+ *      these bits are set in the lookup table:
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        1   1,2                                      0
+ *        2    0                                      1,2
+ *        3    0                              1,2
+ *        4  0,1,2
+ *        5    0   1   2
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *
+ *   - if this is not the last field in the set, fill a mapping array that maps
+ *     rules from the lookup table to rules belonging to the same entry in
+ *     the next lookup table, done by pipapo_map().
+ *
+ *     Note that as rules map to contiguous ranges of rules, given how netmask
+ *     expansion and insertion is performed, &union nft_pipapo_map_bucket stores
+ *     this information as pairs of first rule index, rule count.
+ *
+ *      Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048,
+ *      given lookup table #0 for field 0 (see example above):
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        1   1,2                                      0
+ *        2    0                                      1,2
+ *        3    0                              1,2
+ *        4  0,1,2
+ *        5    0   1   2
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *
+ *      and lookup table #1 for field 1 with:
+ *      - rule #0: 1024 mapping to buckets
+ *        < 0  0  4  0 >
+ *      - rule #1: 2048 mapping to buckets
+ *        < 0  0  5  0 >
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0   0,1
+ *        1   0,1
+ *        2                    0   1
+ *        3   0,1
+ *
+ *      we need to map rules for 10.0.0.5 in lookup table #0 (rule #0) to 1024
+ *      in lookup table #1 (rule #0) and rules for 192.168.1.0-192.168.2.1
+ *      (rules #1, #2) to 2048 in lookup table #2 (rule #1):
+ *
+ * ::
+ *
+ *       rule indices in current field: 0    1    2
+ *       map to rules in next field:    0    1    1
+ *
+ *   - if this is the last field in the set, fill a mapping array that maps
+ *     rules from the last lookup table to element pointers, also done by
+ *     pipapo_map().
+ *
+ *     Note that, in this implementation, we have two elements (start, end) for
+ *     each entry. The pointer to the end element is stored in this array, and
+ *     the pointer to the start element is linked from it.
+ *
+ *      Example: entry 10.0.0.5:1024 has a corresponding &struct nft_pipapo_elem
+ *      pointer, 0x66, and element for 192.168.1.0-192.168.2.1:2048 is at 0x42.
+ *      From the rules of lookup table #1 as mapped above:
+ *
+ * ::
+ *
+ *       rule indices in last field:    0    1
+ *       map to elements:             0x42  0x66
+ *
+ *
+ * Matching
+ * --------
+ *
+ * We use a result bitmap, with the size of a single lookup table bucket, to
+ * represent the matching state that applies at every algorithm step. This is
+ * done by pipapo_lookup().
+ *
+ * - For each packet field:
+ *
+ *   - start with an all-ones result bitmap (res_map in pipapo_lookup())
+ *
+ *   - perform a lookup into the table corresponding to the current field,
+ *     for each group, and at every group, AND the current result bitmap with
+ *     the value from the lookup table bucket
+ *
+ * ::
+ *
+ *      Example: 192.168.1.5 < 12 0  10 8  0 1  0 5 >, with lookup table from
+ *      insertion examples.
+ *      Lookup table buckets are at least 3 bits wide, we'll assume 8 bits for
+ *      convenience in this example. Initial result bitmap is 0xff, the steps
+ *      below show the value of the result bitmap after each group is processed:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        result bitmap is now: 0xff & 0x6 [bucket 12] = 0x6
+ *
+ *        1   1,2                                      0
+ *        result bitmap is now: 0x6 & 0x6 [bucket 0] = 0x6
+ *
+ *        2    0                                      1,2
+ *        result bitmap is now: 0x6 & 0x6 [bucket 10] = 0x6
+ *
+ *        3    0                              1,2
+ *        result bitmap is now: 0x6 & 0x6 [bucket 8] = 0x6
+ *
+ *        4  0,1,2
+ *        result bitmap is now: 0x6 & 0x7 [bucket 0] = 0x6
+ *
+ *        5    0   1   2
+ *        result bitmap is now: 0x6 & 0x2 [bucket 1] = 0x2
+ *
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        result bitmap is now: 0x2 & 0x7 [bucket 0] = 0x2
+ *
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *        final result bitmap for this field is: 0x2 & 0x3 [bucket 5] = 0x2
+ *
+ *   - at the next field, start with a new, all-zeroes result bitmap. For each
+ *     bit set in the previous result bitmap, fill the new result bitmap
+ *     (fill_map in pipapo_lookup()) with the rule indices from the
+ *     corresponding buckets of the mapping field for this field, done by
+ *     pipapo_refill()
+ *
+ *      Example: with mapping table from insertion examples, with the current
+ *      result bitmap from the previous example, 0x02:
+ *
+ * ::
+ *
+ *       rule indices in current field: 0    1    2
+ *       map to rules in next field:    0    1    1
+ *
+ *      the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be
+ *      set.
+ *
+ *      We can now extend this example to cover the second iteration of the step
+ *      above (lookup and AND bitmap): assuming the port field is
+ *      2048 < 0  0  5  0 >, with starting result bitmap 0x2, and lookup table
+ *      for "port" field from pre-computation example:
+ *
+ * ::
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0   0,1
+ *        1   0,1
+ *        2                    0   1
+ *        3   0,1
+ *
+ *       operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5]
+ *       & 0x3 [bucket 0], resulting bitmap is 0x2.
+ *
+ *   - if this is the last field in the set, look up the value from the mapping
+ *     array corresponding to the final result bitmap
+ *
+ *      Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for
+ *      last field from insertion example:
+ *
+ * ::
+ *
+ *       rule indices in last field:    0    1
+ *       map to elements:             0x42  0x66
+ *
+ *      the matching element is at 0x42.
+ *
+ *
+ * References
+ * ----------
+ *
+ * [Ligatti 2010]
+ *      A Packet-classification Algorithm for Arbitrary Bitmask Rules, with
+ *      Automatic Time-space Tradeoffs
+ *      Jay Ligatti, Josh Kuhn, and Chris Gage.
+ *      Proceedings of the IEEE International Conference on Computer
+ *      Communication Networks (ICCCN), August 2010.
+ *      http://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf
+ *
+ * [Rottenstreich 2010]
+ *      Worst-Case TCAM Rule Expansion
+ *      Ori Rottenstreich and Isaac Keslassy.
+ *      2010 Proceedings IEEE INFOCOM, San Diego, CA, 2010.
+ *      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.4592&rep=rep1&type=pdf
+ *
+ * [Kogan 2014]
+ *      SAX-PAC (Scalable And eXpressive PAcket Classification)
+ *      Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane,
+ *      and Patrick Eugster.
+ *      Proceedings of the 2014 ACM conference on SIGCOMM, August 2014.
+ *      http://www.sigcomm.org/sites/default/files/ccr/papers/2014/August/2619239-2626294.pdf
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/log2.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+#include <uapi/linux/netfilter/nf_tables.h>
+#include <net/ipv6.h>			/* For the maximum length of a field */
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+
+/* Count of concatenated fields depends on count of 32-bit nftables registers */
+#define NFT_PIPAPO_MAX_FIELDS		NFT_REG32_COUNT
+
+/* Largest supported field size */
+#define NFT_PIPAPO_MAX_BYTES		(sizeof(struct in6_addr))
+#define NFT_PIPAPO_MAX_BITS		(NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE)
+
+/* Number of bits to be grouped together in lookup table buckets, arbitrary */
+#define NFT_PIPAPO_GROUP_BITS		4
+#define NFT_PIPAPO_GROUPS_PER_BYTE	(BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS)
+
+/* Fields are padded to 32 bits in input registers */
+#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x)				\
+	(round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32)))
+#define NFT_PIPAPO_GROUPS_PADDING(x)					\
+	(NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE)
+
+/* Number of buckets, given by 2 ^ n, with n grouped bits */
+#define NFT_PIPAPO_BUCKETS		(1 << NFT_PIPAPO_GROUP_BITS)
+
+/* Each n-bit range maps to up to n * 2 rules */
+#define NFT_PIPAPO_MAP_NBITS		(const_ilog2(NFT_PIPAPO_MAX_BITS * 2))
+
+/* Use the rest of mapping table buckets for rule indices, but it makes no sense
+ * to exceed 32 bits
+ */
+#if BITS_PER_LONG == 64
+#define NFT_PIPAPO_MAP_TOBITS		32
+#else
+#define NFT_PIPAPO_MAP_TOBITS		(BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS)
+#endif
+
+/* ...which gives us the highest allowed index for a rule */
+#define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
+					- (1UL << NFT_PIPAPO_MAP_NBITS))
+
+#define nft_pipapo_for_each_field(field, index, match)		\
+	for ((field) = (match)->f, (index) = 0;			\
+	     (index) < (match)->field_count;			\
+	     (index)++, (field)++)
+
+/**
+ * union nft_pipapo_map_bucket - Bucket of mapping table
+ * @to:		First rule number (in next field) this rule maps to
+ * @n:		Number of rules (in next field) this rule maps to
+ * @e:		If there's no next field, pointer to element this rule maps to
+ */
+union nft_pipapo_map_bucket {
+	struct {
+#if BITS_PER_LONG == 64
+		static_assert(NFT_PIPAPO_MAP_TOBITS <= 32);
+		u32 to;
+
+		static_assert(NFT_PIPAPO_MAP_NBITS <= 32);
+		u32 n;
+#else
+		unsigned long to:NFT_PIPAPO_MAP_TOBITS;
+		unsigned long  n:NFT_PIPAPO_MAP_NBITS;
+#endif
+	};
+	struct nft_pipapo_elem *e;
+};
+
+/**
+ * struct nft_pipapo_field - Lookup, mapping tables and related data for a field
+ * @groups:	Number of 4-bit groups
+ * @rules:	Number of inserted rules
+ * @bsize:	Size of each bucket in lookup table, in longs
+ * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
+ * @mt:		Mapping table: one bucket per rule
+ */
+struct nft_pipapo_field {
+	int groups;
+	unsigned long rules;
+	size_t bsize;
+	unsigned long *lt;
+	union nft_pipapo_map_bucket *mt;
+};
+
+/**
+ * struct nft_pipapo_match - Data used for lookup and matching
+ * @field_count:	Number of fields in set
+ * @scratch:		Preallocated per-CPU maps for partial matching results
+ * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
+ * @rcu:		Matching data is swapped on commits
+ * @f:			Fields, with lookup and mapping tables
+ */
+struct nft_pipapo_match {
+	int field_count;
+	unsigned long * __percpu *scratch;
+	size_t bsize_max;
+	struct rcu_head rcu;
+	struct nft_pipapo_field f[0];
+};
+
+/* Current working bitmap index, toggled between field matches */
+static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index);
+
+/**
+ * struct nft_pipapo - Representation of a set
+ * @match:	Currently in-use matching data
+ * @clone:	Copy where pending insertions and deletions are kept
+ * @groups:	Total number of 4-bit groups for fields in this set
+ * @width:	Total bytes to be matched for one packet, including padding
+ * @dirty:	Working copy has pending insertions or deletions
+ * @last_gc:	Timestamp of last garbage collection run, jiffies
+ */
+struct nft_pipapo {
+	struct nft_pipapo_match __rcu *match;
+	struct nft_pipapo_match *clone;
+	int groups;
+	int width;
+	bool dirty;
+	unsigned long last_gc;
+};
+
+struct nft_pipapo_elem;
+
+/**
+ * struct nft_pipapo_elem - API-facing representation of single set element
+ * @ext:	nftables API extensions
+ */
+struct nft_pipapo_elem {
+	struct nft_set_ext ext;
+};
+
+/**
+ * pipapo_refill() - For each set bit, set bits from selected mapping table item
+ * @map:	Bitmap to be scanned for set bits
+ * @len:	Length of bitmap in longs
+ * @rules:	Number of rules in field
+ * @dst:	Destination bitmap
+ * @mt:		Mapping table containing bit set specifiers
+ * @match_only:	Find a single bit and return, don't fill
+ *
+ * Iteration over set bits with __builtin_ctzl(): Daniel Lemire, public domain.
+ *
+ * For each bit set in map, select the bucket from mapping table with index
+ * corresponding to the position of the bit set. Use start bit and amount of
+ * bits specified in bucket to fill region in dst.
+ *
+ * Return: -1 on no match, bit position on 'match_only', 0 otherwise.
+ */
+static int pipapo_refill(unsigned long *map, int len, int rules,
+			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
+			 bool match_only)
+{
+	unsigned long bitset;
+	int k, ret = -1;
+
+	for (k = 0; k < len; k++) {
+		bitset = map[k];
+		while (bitset) {
+			unsigned long t = bitset & -bitset;
+			int r = __builtin_ctzl(bitset);
+			int i = k * BITS_PER_LONG + r;
+
+			if (unlikely(i >= rules)) {
+				map[k] = 0;
+				return -1;
+			}
+
+			if (unlikely(match_only)) {
+				bitmap_clear(map, i, 1);
+				return i;
+			}
+
+			ret = 0;
+
+			bitmap_set(dst, mt[i].to, mt[i].n);
+
+			bitset ^= t;
+		}
+		map[k] = 0;
+	}
+
+	return ret;
+}
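[ The bit iteration used by pipapo_refill() can be isolated in a small userspace sketch (hypothetical helper, not part of the patch): the lowest set bit is isolated with bitset & -bitset, indexed with __builtin_ctzl(), then cleared, so only set bits are visited. ]

```c
#include <assert.h>

/* Collect positions of set bits in 'map' ('len' longs) into 'out' using
 * the same clear-lowest-set-bit loop as pipapo_refill(). Returns count.
 */
static int collect_set_bits(const unsigned long *map, int len, int *out)
{
	int k, n = 0;

	for (k = 0; k < len; k++) {
		unsigned long bitset = map[k];

		while (bitset) {
			unsigned long t = bitset & -bitset; /* lowest set bit */

			out[n++] = k * (int)(sizeof(unsigned long) * 8) +
				   __builtin_ctzl(bitset);
			bitset ^= t;			    /* clear that bit */
		}
	}

	return n;
}
```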
+
+/**
+ * nft_pipapo_lookup() - Lookup function
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @key:	Pointer to nft registers containing key data
+ * @ext:	nftables API extension pointer, filled with matching reference
+ *
+ * For more details, see DOC: Theory of Operation.
+ *
+ * Return: true on match, false otherwise.
+ */
+static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+			      const u32 *key, const struct nft_set_ext **ext)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	unsigned long *res_map, *fill_map;
+	u8 genmask = nft_genmask_cur(net);
+	const u8 *rp = (const u8 *)key;
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	bool map_index;
+	int i;
+
+	local_bh_disable();
+
+	map_index = raw_cpu_read(nft_pipapo_scratch_index);
+
+	m = rcu_dereference(priv->match);
+
+	if (unlikely(!m || !*raw_cpu_ptr(m->scratch)))
+		goto out;
+
+	res_map  = *raw_cpu_ptr(m->scratch) + (map_index ? m->bsize_max : 0);
+	fill_map = *raw_cpu_ptr(m->scratch) + (map_index ? 0 : m->bsize_max);
+
+	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
+
+	nft_pipapo_for_each_field(f, i, m) {
+		bool last = i == m->field_count - 1;
+		unsigned long *lt = f->lt;
+		int b, group;
+
+		/* For each 4-bit group (two per byte): select the lookup table
+		 * bucket given by the group's value, then AND the bucket's
+		 * bitmap into the result
+		 */
+		for (group = 0; group < f->groups; group += 2) {
+			u8 v;
+
+			v = *rp >> 4;
+			__bitmap_and(res_map, res_map, lt + v * f->bsize,
+				     f->bsize * BITS_PER_LONG);
+			lt += f->bsize * NFT_PIPAPO_BUCKETS;
+
+			v = *rp & 0x0f;
+			rp++;
+			__bitmap_and(res_map, res_map, lt + v * f->bsize,
+				     f->bsize * BITS_PER_LONG);
+			lt += f->bsize * NFT_PIPAPO_BUCKETS;
+		}
+
+		/* At this point, res_map contains the matching bitmap for this
+		 * field. Use it to populate fill_map, the bitmap for the next
+		 * field, unless this is the last field, in which case return
+		 * the matched 'ext' pointer, if any.
+		 */
+next_match:
+		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
+				  last);
+		if (b < 0) {
+			raw_cpu_write(nft_pipapo_scratch_index, map_index);
+			local_bh_enable();
+
+			return false;
+		}
+
+		if (last) {
+			*ext = &f->mt[b].e->ext;
+			if (unlikely(nft_set_elem_expired(*ext) ||
+				     !nft_set_elem_active(*ext, genmask)))
+				goto next_match;
+
+			/* Last field: we're just returning the key without
+			 * filling the initial bitmap for the next field, so the
+			 * current inactive bitmap is clean and can be reused as
+			 * *next* bitmap (not initial) for the next packet.
+			 */
+			raw_cpu_write(nft_pipapo_scratch_index, map_index);
+			local_bh_enable();
+
+			return true;
+		}
+
+		/* Swap bitmap indices: res_map is the initial bitmap for the
+		 * next field, and fill_map is guaranteed to be all-zeroes at
+		 * this point.
+		 */
+		map_index = !map_index;
+		swap(res_map, fill_map);
+
+		rp += NFT_PIPAPO_GROUPS_PADDING(f->groups);
+	}
+
+out:
+	local_bh_enable();
+	return false;
+}
+
+/**
+ * pipapo_get() - Get matching element reference given key data
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @data:	Key data to be matched against existing elements
+ * @genmask:	If set, check that element is active in given genmask
+ *
+ * This is essentially the same as the lookup function, except that it matches
+ * key data against the uncommitted copy and doesn't use preallocated maps for
+ * bitmap results.
+ *
+ * Return: pointer to &struct nft_pipapo_elem on match, error pointer otherwise.
+ */
+static struct nft_pipapo_elem *pipapo_get(const struct net *net,
+					  const struct nft_set *set,
+					  const u8 *data, u8 genmask)
+{
+	struct nft_pipapo_elem *ret = ERR_PTR(-ENOENT);
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m = priv->clone;
+	unsigned long *res_map, *fill_map = NULL;
+	struct nft_pipapo_field *f;
+	int i;
+
+	res_map = kmalloc_array(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
+	if (!res_map) {
+		ret = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	fill_map = kcalloc(m->bsize_max, sizeof(*res_map), GFP_ATOMIC);
+	if (!fill_map) {
+		ret = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
+
+	nft_pipapo_for_each_field(f, i, m) {
+		bool last = i == m->field_count - 1;
+		unsigned long *lt = f->lt;
+		int b, group;
+
+		/* For each 4-bit group: select lookup table bucket depending on
+		 * packet bytes value, then AND bucket value
+		 */
+		for (group = 0; group < f->groups; group++) {
+			u8 v;
+
+			if (group % 2) {
+				v = *data & 0x0f;
+				data++;
+			} else {
+				v = *data >> 4;
+			}
+			__bitmap_and(res_map, res_map, lt + v * f->bsize,
+				     f->bsize * BITS_PER_LONG);
+
+			lt += f->bsize * NFT_PIPAPO_BUCKETS;
+		}
+
+		/* At this point, res_map contains the matching bitmap for this
+		 * field. Use it to populate fill_map, the bitmap for the next
+		 * field, unless this is the last field, in which case return
+		 * the matched 'ext' pointer, if any.
+		 */
+next_match:
+		b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt,
+				  last);
+		if (b < 0)
+			goto out;
+
+		if (last) {
+			if (nft_set_elem_expired(&f->mt[b].e->ext) ||
+			    (genmask &&
+			     !nft_set_elem_active(&f->mt[b].e->ext, genmask)))
+				goto next_match;
+
+			ret = f->mt[b].e;
+			goto out;
+		}
+
+		data += NFT_PIPAPO_GROUPS_PADDING(f->groups);
+
+		/* Swap bitmap indices: fill_map will be the initial bitmap for
+		 * the next field (i.e. the new res_map), and res_map is
+		 * guaranteed to be all-zeroes at this point, ready to be filled
+		 * according to the next mapping table.
+		 */
+		swap(res_map, fill_map);
+	}
+
+out:
+	kfree(fill_map);
+	kfree(res_map);
+	return ret;
+}
+
+/**
+ * nft_pipapo_get() - Get matching element reference given key data
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ * @flags:	Unused
+ *
+ * Return: pointer to element on match, error pointer otherwise.
+ */
+static void *nft_pipapo_get(const struct net *net, const struct nft_set *set,
+			    const struct nft_set_elem *elem, unsigned int flags)
+{
+	return pipapo_get(net, set, (const u8 *)elem->key.val.data,
+			  nft_genmask_cur(net));
+}
+
+/**
+ * pipapo_resize() - Resize lookup or mapping table, or both
+ * @f:		Field containing lookup and mapping tables
+ * @old_rules:	Previous number of rules in field
+ * @rules:	New number of rules
+ *
+ * Increase, decrease or maintain tables size depending on new number of rules,
+ * and copy data over. In case the new size is smaller, throw away data for
+ * highest-numbered rules.
+ *
+ * Return: 0 on success, -ENOMEM on allocation failure.
+ */
+static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
+{
+	long *new_lt = NULL, *new_p, *old_lt = f->lt, *old_p;
+	union nft_pipapo_map_bucket *new_mt, *old_mt = f->mt;
+	size_t new_bucket_size, copy;
+	int group, bucket;
+
+	new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG);
+
+	if (new_bucket_size == f->bsize)
+		goto mt;
+
+	if (new_bucket_size > f->bsize)
+		copy = f->bsize;
+	else
+		copy = new_bucket_size;
+
+	new_lt = kvzalloc(f->groups * NFT_PIPAPO_BUCKETS * new_bucket_size *
+			  sizeof(*new_lt), GFP_KERNEL);
+	if (!new_lt)
+		return -ENOMEM;
+
+	new_p = new_lt;
+	old_p = old_lt;
+	for (group = 0; group < f->groups; group++) {
+		for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS; bucket++) {
+			memcpy(new_p, old_p, copy * sizeof(*new_p));
+			new_p += copy;
+			old_p += copy;
+
+			if (new_bucket_size > f->bsize)
+				new_p += new_bucket_size - f->bsize;
+			else
+				old_p += f->bsize - new_bucket_size;
+		}
+	}
+
+mt:
+	new_mt = kvmalloc(rules * sizeof(*new_mt), GFP_KERNEL);
+	if (!new_mt) {
+		kvfree(new_lt);
+		return -ENOMEM;
+	}
+
+	memcpy(new_mt, f->mt, min(old_rules, rules) * sizeof(*new_mt));
+	if (rules > old_rules) {
+		memset(new_mt + old_rules, 0,
+		       (rules - old_rules) * sizeof(*new_mt));
+	}
+
+	if (new_lt) {
+		f->bsize = new_bucket_size;
+		f->lt = new_lt;
+		kvfree(old_lt);
+	}
+
+	f->mt = new_mt;
+	kvfree(old_mt);
+
+	return 0;
+}
+
+/**
+ * pipapo_bucket_set() - Set rule bit in bucket given group and group value
+ * @f:		Field containing lookup table
+ * @rule:	Rule index
+ * @group:	Group index
+ * @v:		Value of bit group
+ */
+static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group,
+			      int v)
+{
+	unsigned long *pos;
+
+	pos = f->lt + f->bsize * NFT_PIPAPO_BUCKETS * group;
+	pos += f->bsize * v;
+
+	__set_bit(rule, pos);
+}
+
+/**
+ * pipapo_insert() - Insert new rule in field given input key and mask length
+ * @f:		Field containing lookup table
+ * @k:		Input key for classification, without nftables padding
+ * @mask_bits:	Length of mask; matches field length for non-ranged entry
+ *
+ * Insert a new rule reference in lookup buckets corresponding to k and
+ * mask_bits.
+ *
+ * Return: 1 on success (one rule inserted), negative error code on failure.
+ */
+static int pipapo_insert(struct nft_pipapo_field *f, const uint8_t *k,
+			 int mask_bits)
+{
+	int rule = f->rules++, group, ret;
+
+	ret = pipapo_resize(f, f->rules - 1, f->rules);
+	if (ret)
+		return ret;
+
+	for (group = 0; group < f->groups; group++) {
+		int i, v;
+		u8 mask;
+
+		if (group % 2)
+			v = k[group / 2] & 0x0f;
+		else
+			v = k[group / 2] >> 4;
+
+		if (mask_bits >= (group + 1) * 4) {
+			/* Not masked */
+			pipapo_bucket_set(f, rule, group, v);
+		} else if (mask_bits <= group * 4) {
+			/* Completely masked */
+			for (i = 0; i < NFT_PIPAPO_BUCKETS; i++)
+				pipapo_bucket_set(f, rule, group, i);
+		} else {
+			/* The mask limit falls on this group */
+			mask = 0x0f >> (mask_bits - group * 4);
+			for (i = 0; i < NFT_PIPAPO_BUCKETS; i++) {
+				if ((i & ~mask) == (v & ~mask))
+					pipapo_bucket_set(f, rule, group, i);
+			}
+		}
+	}
+
+	return 1;
+}
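[ The three masking cases above can be checked in isolation. This hypothetical userspace helper (not part of the patch) counts, for a single 4-bit group, how many bucket values pipapo_insert() would set, given the number of significant mask bits falling in the group: ]

```c
#include <assert.h>

/* For a group value 'v' with 'bits' significant mask bits in the group
 * (0: fully masked, 4: not masked), count buckets i satisfying
 * (i & ~mask) == (v & ~mask), with mask = 0x0f >> bits, as in the
 * partially-masked case of pipapo_insert().
 */
static int buckets_matched(int v, int bits)
{
	unsigned int mask = 0x0f >> bits;
	int i, n = 0;

	for (i = 0; i < 16; i++)
		if ((i & ~mask) == (v & ~mask))
			n++;

	return n;
}
```

[ A fully masked group matches all 16 buckets, an unmasked one exactly 1, and each don't-care bit in between doubles the count. ]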
+
+/**
+ * pipapo_step_diff() - Check if setting @step bit in netmask would change it
+ * @base:	Mask we are expanding
+ * @step:	Step bit for given expansion step
+ * @len:	Total length of mask space (set and unset bits), bytes
+ *
+ * Convenience function for mask expansion.
+ *
+ * Return: true if step bit changes mask (i.e. isn't set), false otherwise.
+ */
+static bool pipapo_step_diff(u8 *base, int step, int len)
+{
+	/* Network order, byte-addressed */
+#ifdef __BIG_ENDIAN__
+	return !(BIT(step % BITS_PER_BYTE) & base[step / BITS_PER_BYTE]);
+#else
+	return !(BIT(step % BITS_PER_BYTE) &
+		 base[len - 1 - step / BITS_PER_BYTE]);
+#endif
+}
+
+/**
+ * pipapo_step_after_end() - Check if mask exceeds range end with given step
+ * @base:	Mask we are expanding
+ * @end:	End of range
+ * @step:	Step bit for given expansion step, highest bit to be set
+ * @len:	Total length of mask space (set and unset bits), bytes
+ *
+ * Convenience function for mask expansion.
+ *
+ * Return: true if mask exceeds range setting step bits, false otherwise.
+ */
+static bool pipapo_step_after_end(const u8 *base, const u8 *end, int step,
+				  int len)
+{
+	u8 tmp[NFT_PIPAPO_MAX_BYTES];
+	int i;
+
+	memcpy(tmp, base, len);
+
+	/* Network order, byte-addressed */
+	for (i = 0; i <= step; i++)
+#ifdef __BIG_ENDIAN__
+		tmp[i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE);
+#else
+		tmp[len - 1 - i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE);
+#endif
+
+	return memcmp(tmp, end, len) > 0;
+}
+
+/**
+ * pipapo_base_sum() - Sum step bit to given len-sized netmask base with carry
+ * @base:	Netmask base
+ * @step:	Step bit to sum
+ * @len:	Netmask length, bytes
+ */
+static void pipapo_base_sum(u8 *base, int step, int len)
+{
+	bool carry = false;
+	int i;
+
+	/* Network order, byte-addressed */
+#ifdef __BIG_ENDIAN__
+	for (i = step / BITS_PER_BYTE; i < len; i++) {
+#else
+	for (i = len - 1 - step / BITS_PER_BYTE; i >= 0; i--) {
+#endif
+		if (carry)
+			base[i]++;
+		else
+			base[i] += 1 << (step % BITS_PER_BYTE);
+
+		if (base[i])
+			break;
+
+		carry = true;
+	}
+}
+
+/**
+ * pipapo_expand() - Expand to composing netmasks, insert into lookup table
+ * @f:		Field containing lookup table
+ * @start:	Start of range
+ * @end:	End of range
+ * @len:	Length of value in bits
+ *
+ * Expand range to composing netmasks and insert corresponding rule references
+ * in lookup buckets.
+ *
+ * Return: number of inserted rules on success, negative error code on failure.
+ */
+static int pipapo_expand(struct nft_pipapo_field *f,
+			 const u8 *start, const u8 *end, int len)
+{
+	int step, masks = 0, bytes = DIV_ROUND_UP(len, BITS_PER_BYTE);
+	u8 base[NFT_PIPAPO_MAX_BYTES];
+
+	memcpy(base, start, bytes);
+	while (memcmp(base, end, bytes) <= 0) {
+		int err;
+
+		step = 0;
+		while (pipapo_step_diff(base, step, bytes)) {
+			if (pipapo_step_after_end(base, end, step, bytes))
+				break;
+
+			step++;
+			if (step >= len) {
+				if (!masks) {
+					pipapo_insert(f, base, 0);
+					masks = 1;
+				}
+				goto out;
+			}
+		}
+
+		err = pipapo_insert(f, base, len - step);
+
+		if (err < 0)
+			return err;
+
+		masks++;
+		pipapo_base_sum(base, step, bytes);
+	}
+out:
+	return masks;
+}
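[ The expansion loop is easier to follow on plain integers. This hypothetical userspace sketch (not part of the patch) applies the same idea: grow a power-of-two block from 'base' while it stays aligned and within the range, emit one mask, advance, and return the number of masks the range expands to. Overflow at the very top of the value space is not handled; it is illustrative only. ]

```c
#include <assert.h>

/* Split [start, end] into maximal aligned power-of-two blocks, i.e.
 * netmasks. Each iteration doubles 'step' while 'base' stays aligned to
 * the doubled block and the block still fits within 'end'.
 */
static int expand_range(unsigned long long start, unsigned long long end)
{
	unsigned long long base = start;
	int masks = 0;

	while (base <= end) {
		unsigned long long step = 1;

		while ((base & ((step << 1) - 1)) == 0 &&
		       base + (step << 1) - 1 <= end)
			step <<= 1;

		masks++;		/* one mask: [base, base + step - 1] */
		base += step;
	}

	return masks;
}
```

[ For example, the range 1-10 expands to the five blocks [1], [2-3], [4-7], [8-9], [10]. ]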
+
+/**
+ * pipapo_map() - Insert rules in mapping tables, mapping them between fields
+ * @m:		Matching data, including mapping table
+ * @map:	Table of rule maps: array of first rule and number of rules
+ *		in next field a given rule maps to, for each field
+ * @e:		For last field, pointer to element the rules map to
+ */
+static void pipapo_map(struct nft_pipapo_match *m,
+		       union nft_pipapo_map_bucket map[NFT_PIPAPO_MAX_FIELDS],
+		       struct nft_pipapo_elem *e)
+{
+	struct nft_pipapo_field *f;
+	int i, j;
+
+	for (i = 0, f = m->f; i < m->field_count - 1; i++, f++) {
+		for (j = 0; j < map[i].n; j++) {
+			f->mt[map[i].to + j].to = map[i + 1].to;
+			f->mt[map[i].to + j].n = map[i + 1].n;
+		}
+	}
+
+	/* Last field: map to ext instead of mapping to next field */
+	for (j = 0; j < map[i].n; j++)
+		f->mt[map[i].to + j].e = e;
+}
+
+/**
+ * pipapo_realloc_scratch() - Reallocate scratch maps for partial match results
+ * @clone:	Copy of matching data with pending insertions and deletions
+ * @bsize_max:	Maximum bucket size, scratch maps cover two buckets
+ *
+ * Return: 0 on success, -ENOMEM on failure.
+ */
+static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
+				  unsigned long bsize_max)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		unsigned long *scratch;
+
+		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2,
+				       GFP_KERNEL, cpu_to_node(i));
+		if (!scratch) {
+			/* On failure, there's no need to undo previous
+			 * allocations: this means that some scratch maps have
+			 * a bigger allocated size now (this is only called on
+			 * insertion), but the extra space won't be used by any
+			 * CPU as new elements are not inserted and m->bsize_max
+			 * is not updated.
+			 */
+			return -ENOMEM;
+		}
+
+		kfree(*per_cpu_ptr(clone->scratch, i));
+
+		*per_cpu_ptr(clone->scratch, i) = scratch;
+	}
+
+	return 0;
+}
+
+/**
+ * nft_pipapo_insert() - Validate and insert ranged elements
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ * @ext2:	Filled with pointer to &struct nft_set_ext in inserted element
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int nft_pipapo_insert(const struct net *net, const struct nft_set *set,
+			     const struct nft_set_elem *elem,
+			     struct nft_set_ext **ext2)
+{
+	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
+	union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
+	const u8 *start = (const u8 *)elem->key.val.data, *end;
+	struct nft_pipapo_elem *e = elem->priv, *dup;
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m = priv->clone;
+	u8 genmask = nft_genmask_next(net);
+	struct nft_pipapo_field *f;
+	int i, bsize_max, err = 0;
+
+	dup = pipapo_get(net, set, start, genmask);
+	if (PTR_ERR(dup) == -ENOENT) {
+		if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END)) {
+			end = (const u8 *)nft_set_ext_key_end(ext)->data;
+			dup = pipapo_get(net, set, end, nft_genmask_next(net));
+		} else {
+			end = start;
+		}
+	}
+
+	if (PTR_ERR(dup) != -ENOENT) {
+		if (IS_ERR(dup))
+			return PTR_ERR(dup);
+		*ext2 = &dup->ext;
+		return -EEXIST;
+	}
+
+	/* Validate */
+	nft_pipapo_for_each_field(f, i, m) {
+		const u8 *start_p = start, *end_p = end;
+
+		if (f->rules >= (unsigned long)NFT_PIPAPO_RULE0_MAX)
+			return -ENOSPC;
+
+		if (memcmp(start_p, end_p,
+			   f->groups / NFT_PIPAPO_GROUPS_PER_BYTE) > 0)
+			return -EINVAL;
+
+		start_p += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+		end_p += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+	}
+
+	/* Insert */
+	priv->dirty = true;
+
+	bsize_max = m->bsize_max;
+
+	nft_pipapo_for_each_field(f, i, m) {
+		int ret;
+
+		rulemap[i].to = f->rules;
+
+		ret = memcmp(start, end,
+			     f->groups / NFT_PIPAPO_GROUPS_PER_BYTE);
+		if (!ret) {
+			ret = pipapo_insert(f, start,
+					    f->groups * NFT_PIPAPO_GROUP_BITS);
+		} else {
+			ret = pipapo_expand(f, start, end,
+					    f->groups * NFT_PIPAPO_GROUP_BITS);
+		}
+
+		if (f->bsize > bsize_max)
+			bsize_max = f->bsize;
+
+		rulemap[i].n = ret;
+
+		start += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+		end += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+	}
+
+	if (!*this_cpu_ptr(m->scratch) || bsize_max > m->bsize_max) {
+		err = pipapo_realloc_scratch(m, bsize_max);
+		if (err)
+			return err;
+
+		this_cpu_write(nft_pipapo_scratch_index, false);
+
+		m->bsize_max = bsize_max;
+	}
+
+	*ext2 = &e->ext;
+
+	pipapo_map(m, rulemap, e);
+
+	return 0;
+}
+
+/**
+ * pipapo_clone() - Clone matching data to create new working copy
+ * @old:	Existing matching data
+ *
+ * Return: copy of matching data passed as 'old', error pointer on failure
+ */
+static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
+{
+	struct nft_pipapo_field *dst, *src;
+	struct nft_pipapo_match *new;
+	int i;
+
+	new = kmalloc(sizeof(*new) + sizeof(*dst) * old->field_count,
+		      GFP_KERNEL);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	new->field_count = old->field_count;
+	new->bsize_max = old->bsize_max;
+
+	new->scratch = alloc_percpu(*new->scratch);
+	if (!new->scratch)
+		goto out_scratch;
+
+	rcu_head_init(&new->rcu);
+
+	src = old->f;
+	dst = new->f;
+
+	for (i = 0; i < old->field_count; i++) {
+		memcpy(dst, src, offsetof(struct nft_pipapo_field, lt));
+
+		dst->lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS *
+				   src->bsize * sizeof(*dst->lt),
+				   GFP_KERNEL);
+		if (!dst->lt)
+			goto out_lt;
+
+		memcpy(dst->lt, src->lt,
+		       src->bsize * sizeof(*dst->lt) *
+		       src->groups * NFT_PIPAPO_BUCKETS);
+
+		dst->mt = kvmalloc(src->rules * sizeof(*src->mt), GFP_KERNEL);
+		if (!dst->mt)
+			goto out_mt;
+
+		memcpy(dst->mt, src->mt, src->rules * sizeof(*src->mt));
+		src++;
+		dst++;
+	}
+
+	return new;
+
+out_mt:
+	kvfree(dst->lt);
+out_lt:
+	for (dst--; i > 0; i--) {
+		kvfree(dst->mt);
+		kvfree(dst->lt);
+		dst--;
+	}
+	free_percpu(new->scratch);
+out_scratch:
+	kfree(new);
+
+	return ERR_PTR(-ENOMEM);
+}
+
+/**
+ * pipapo_rules_same_key() - Get number of rules originated from the same entry
+ * @f:		Field containing mapping table
+ * @first:	Index of first rule in set of rules mapping to same entry
+ *
+ * Using the fact that all rules in a field that originated from the same entry
+ * will map to the same set of rules in the next field, or to the same element
+ * reference, return the cardinality of the set of rules that originated from
+ * the same entry as the rule with index @first, @first rule included.
+ *
+ * In pictures:
+ *				rules
+ *	field #0		0    1    2    3    4
+ *		map to:		0    1   2-4  2-4  5-9
+ *				.    .    .......   . ...
+ *				|    |    |    | \   \
+ *				|    |    |    |  \   \
+ *				|    |    |    |   \   \
+ *				'    '    '    '    '   \
+ *	in field #1		0    1    2    3    4    5 ...
+ *
+ * if this is called for rule 2 on field #0, it will return 2, as rules 2 and
+ * 3 in field 0 both map to the same set of rules (2, 3, 4) in the next field.
+ *
+ * For the last field in a set, we can rely on associated entries to map to the
+ * same element references.
+ *
+ * Return: Number of rules that originated from the same entry as @first.
+ */
+static int pipapo_rules_same_key(struct nft_pipapo_field *f, int first)
+{
+	struct nft_pipapo_elem *e = NULL; /* Keep gcc happy */
+	int r;
+
+	for (r = first; r < f->rules; r++) {
+		if (r != first && e != f->mt[r].e)
+			return r - first;
+
+		e = f->mt[r].e;
+	}
+
+	if (r != first)
+		return r - first;
+
+	return 0;
+}
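[ Stripped of the mapping table, the function reduces to a run-length count over element pointers. A hypothetical userspace equivalent (not part of the patch): ]

```c
#include <assert.h>

/* Count how many consecutive entries starting at 'first' share the same
 * element pointer, as pipapo_rules_same_key() does over mt[].e.
 */
static int rules_same_key(void * const *e, int rules, int first)
{
	int r;

	for (r = first; r < rules; r++)
		if (e[r] != e[first])
			break;

	return r - first;
}
```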
+
+/**
+ * pipapo_unmap() - Remove rules from mapping tables, renumber remaining ones
+ * @mt:		Mapping array
+ * @rules:	Original number of rules in mapping table
+ * @start:	First rule index to be removed
+ * @n:		Number of rules to be removed
+ * @to_offset:	Amount to subtract from the 'to' index of remaining rules,
+ *		i.e. number of rules deleted from the next field
+ * @is_last:	If this is the last field, delete reference from mapping array
+ *
+ * This is used to unmap rules from the mapping table for a single field,
+ * maintaining consistency and compactness for the existing ones.
+ *
+ * In pictures: let's assume that we want to delete rules 2 and 3 from the
+ * following mapping array:
+ *
+ *                 rules
+ *               0      1      2      3      4
+ *      map to:  4-10   4-10   11-15  11-15  16-18
+ *
+ * the result will be:
+ *
+ *                 rules
+ *               0      1      2
+ *      map to:  4-10   4-10   11-13
+ *
+ * for fields before the last one. In case this is the mapping table for the
+ * last field in a set, and rules map to pointers to &struct nft_pipapo_elem:
+ *
+ *                      rules
+ *                        0      1      2      3      4
+ *  element pointers:  0x42   0x42   0x33   0x33   0x44
+ *
+ * the result will be:
+ *
+ *                      rules
+ *                        0      1      2
+ *  element pointers:  0x42   0x42   0x44
+ */
+static void pipapo_unmap(union nft_pipapo_map_bucket *mt, int rules,
+			 int start, int n, int to_offset, bool is_last)
+{
+	int i;
+
+	memmove(mt + start, mt + start + n, (rules - start - n) * sizeof(*mt));
+	memset(mt + rules - n, 0, n * sizeof(*mt));
+
+	if (is_last)
+		return;
+
+	for (i = start; i < rules - n; i++)
+		mt[i].to -= to_offset;
+}
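[ The renumbering above can be reproduced on the documented example with a hypothetical userspace helper (not part of the patch) operating on plain 'to' indices: ]

```c
#include <assert.h>
#include <string.h>

/* Delete 'n' entries at 'start' from a 'rules'-sized array of 'to'
 * indices, zero the freed tail, and renumber later targets by
 * 'to_offset', mirroring pipapo_unmap() for a non-last field.
 */
static void unmap_demo(int *to, int rules, int start, int n, int to_offset)
{
	int i;

	memmove(to + start, to + start + n, (rules - start - n) * sizeof(*to));
	memset(to + rules - n, 0, n * sizeof(*to));

	for (i = start; i < rules - n; i++)
		to[i] -= to_offset;
}
```

[ Starting from the mapping in the example above, { 4, 4, 11, 11, 16 }, removing rules 2 and 3 (whose five next-field rules, 11-15, also go away) leaves { 4, 4, 11 }. ]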
+
+/**
+ * pipapo_drop() - Delete entry from lookup and mapping tables, given rule map
+ * @m:		Matching data
+ * @rulemap:	Table of rule maps: arrays of first rule and number of rules
+ *		in next field a given entry maps to, for each field
+ *
+ * For each rule in lookup table buckets mapping to this set of rules, drop
+ * all bits set in lookup table mapping. In pictures, assuming we want to drop
+ * rules 0 and 1 from this lookup table:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0    0                                              1,2
+ *        1   1,2                                      0
+ *        2    0                                      1,2
+ *        3    0                              1,2
+ *        4  0,1,2
+ *        5    0   1   2
+ *        6  0,1,2 1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1  0,1  1   1   1   1   1   1   1   1   1   1
+ *
+ * rule 2 becomes rule 0, and the result will be:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0                                                    0
+ *        1    0
+ *        2                                            0
+ *        3                                    0
+ *        4    0
+ *        5            0
+ *        6    0
+ *        7    0   0
+ *
+ * once this is done, call pipapo_unmap() to drop all the corresponding rule
+ * references from mapping tables.
+ */
+static void pipapo_drop(struct nft_pipapo_match *m,
+			union nft_pipapo_map_bucket rulemap[])
+{
+	struct nft_pipapo_field *f;
+	int i;
+
+	nft_pipapo_for_each_field(f, i, m) {
+		int g;
+
+		for (g = 0; g < f->groups; g++) {
+			unsigned long *pos;
+			int b;
+
+			pos = f->lt + g * NFT_PIPAPO_BUCKETS * f->bsize;
+
+			for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
+				bitmap_cut(pos, pos, rulemap[i].to,
+					   rulemap[i].n,
+					   f->bsize * BITS_PER_LONG);
+
+				pos += f->bsize;
+			}
+		}
+
+		pipapo_unmap(f->mt, f->rules, rulemap[i].to, rulemap[i].n,
+			     rulemap[i + 1].n, i == m->field_count - 1);
+		if (pipapo_resize(f, f->rules, f->rules - rulemap[i].n)) {
+			/* We can ignore this, a failure to shrink tables down
+			 * doesn't make tables invalid.
+			 */
+			;
+		}
+		f->rules -= rulemap[i].n;
+	}
+}
+
+/**
+ * pipapo_gc() - Drop expired entries from set, destroy start and end elements
+ * @set:	nftables API set representation
+ * @m:		Matching data
+ */
+static void pipapo_gc(const struct nft_set *set, struct nft_pipapo_match *m)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	int rules_f0, first_rule = 0;
+
+	while ((rules_f0 = pipapo_rules_same_key(m->f, first_rule))) {
+		union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
+		struct nft_pipapo_field *f;
+		struct nft_pipapo_elem *e;
+		int i, start, rules_fx;
+
+		start = first_rule;
+		rules_fx = rules_f0;
+
+		nft_pipapo_for_each_field(f, i, m) {
+			rulemap[i].to = start;
+			rulemap[i].n = rules_fx;
+
+			if (i < m->field_count - 1) {
+				rules_fx = f->mt[start].n;
+				start = f->mt[start].to;
+			}
+		}
+
+		/* Pick the last field, and its last index */
+		f--;
+		i--;
+		e = f->mt[rulemap[i].to].e;
+		if (nft_set_elem_expired(&e->ext) &&
+		    !nft_set_elem_mark_busy(&e->ext)) {
+			priv->dirty = true;
+			pipapo_drop(m, rulemap);
+
+			rcu_barrier();
+			nft_set_elem_destroy(set, e, true);
+
+			/* And check again current first rule, which is now the
+			 * first we haven't checked.
+			 */
+		} else {
+			first_rule += rules_f0;
+		}
+	}
+
+	priv->last_gc = jiffies;
+}
+
+/**
+ * pipapo_free_fields() - Free per-field tables contained in matching data
+ * @m:		Matching data
+ */
+static void pipapo_free_fields(struct nft_pipapo_match *m)
+{
+	struct nft_pipapo_field *f;
+	int i;
+
+	nft_pipapo_for_each_field(f, i, m) {
+		kvfree(f->lt);
+		kvfree(f->mt);
+	}
+}
+
+/**
+ * pipapo_reclaim_match - RCU callback to free fields from old matching data
+ * @rcu:	RCU head
+ */
+static void pipapo_reclaim_match(struct rcu_head *rcu)
+{
+	struct nft_pipapo_match *m;
+	int i;
+
+	m = container_of(rcu, struct nft_pipapo_match, rcu);
+
+	for_each_possible_cpu(i)
+		kfree(*per_cpu_ptr(m->scratch, i));
+
+	free_percpu(m->scratch);
+
+	pipapo_free_fields(m);
+
+	kfree(m);
+}
+
+/**
+ * pipapo_commit() - Replace lookup data with current working copy
+ * @set:	nftables API set representation
+ *
+ * While at it, check if we should perform garbage collection on the working
+ * copy before committing it for lookup, and don't replace the table if the
+ * working copy doesn't have pending changes.
+ *
+ * We also need to create a new working copy for subsequent insertions and
+ * deletions.
+ */
+static void pipapo_commit(const struct nft_set *set)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *new_clone, *old;
+
+	if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set)))
+		pipapo_gc(set, priv->clone);
+
+	if (!priv->dirty)
+		return;
+
+	new_clone = pipapo_clone(priv->clone);
+	if (IS_ERR(new_clone))
+		return;
+
+	priv->dirty = false;
+
+	old = rcu_access_pointer(priv->match);
+	rcu_assign_pointer(priv->match, priv->clone);
+	if (old)
+		call_rcu(&old->rcu, pipapo_reclaim_match);
+
+	priv->clone = new_clone;
+}
+
+/**
+ * nft_pipapo_activate() - Mark element reference as active given key, commit
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * On insertion, elements are added to a copy of the matching data currently
+ * in use for lookups, not directly to the current lookup data, so we take
+ * care of swapping the copies by calling pipapo_commit() here. Both
+ * nft_pipapo_insert() and nft_pipapo_activate() are called once for each
+ * element, hence neither one can serve as a single, real commit operation.
+ */
+static void nft_pipapo_activate(const struct net *net,
+				const struct nft_set *set,
+				const struct nft_set_elem *elem)
+{
+	struct nft_pipapo_elem *e;
+
+	e = pipapo_get(net, set, (const u8 *)elem->key.val.data, 0);
+	if (IS_ERR(e))
+		return;
+
+	nft_set_elem_change_active(net, set, &e->ext);
+	nft_set_elem_clear_busy(&e->ext);
+
+	pipapo_commit(set);
+}
+
+/**
+ * pipapo_deactivate() - Check that element is in set, mark as inactive
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @data:	Input key data
+ * @ext:	nftables API extension pointer, used to check for end element
+ *
+ * This is a convenience function that can be called from both
+ * nft_pipapo_deactivate() and nft_pipapo_flush(), as they are in fact the same
+ * operation.
+ *
+ * Return: deactivated element if found, NULL otherwise.
+ */
+static void *pipapo_deactivate(const struct net *net, const struct nft_set *set,
+			       const u8 *data, const struct nft_set_ext *ext)
+{
+	struct nft_pipapo_elem *e;
+
+	e = pipapo_get(net, set, data, nft_genmask_next(net));
+	if (IS_ERR(e))
+		return NULL;
+
+	nft_set_elem_change_active(net, set, &e->ext);
+
+	return e;
+}
+
+/**
+ * nft_pipapo_deactivate() - Call pipapo_deactivate() to make element inactive
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * Return: deactivated element if found, NULL otherwise.
+ */
+static void *nft_pipapo_deactivate(const struct net *net,
+				   const struct nft_set *set,
+				   const struct nft_set_elem *elem)
+{
+	const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv);
+
+	return pipapo_deactivate(net, set, (const u8 *)elem->key.val.data, ext);
+}
+
+/**
+ * nft_pipapo_flush() - Call pipapo_deactivate() to make element inactive
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * This is functionally the same as nft_pipapo_deactivate(), with a slightly
+ * different interface, and it's also called once for each element in a set
+ * being flushed, so, strictly speaking, we can't implement a single flush
+ * operation, which would otherwise be as simple as allocating an empty copy
+ * of the matching data.
+ *
+ * Note that we could in theory do that, mark the set as flushed, and ignore
+ * subsequent calls, but we would leak all the elements after the first one,
+ * because they wouldn't then be freed as result of API calls.
+ *
+ * Return: true if element was found and deactivated.
+ */
+static bool nft_pipapo_flush(const struct net *net, const struct nft_set *set,
+			     void *elem)
+{
+	struct nft_pipapo_elem *e = elem;
+
+	return pipapo_deactivate(net, set, (const u8 *)nft_set_ext_key(&e->ext),
+				 &e->ext);
+}
+
+/**
+ * pipapo_get_boundaries() - Get byte interval for associated rules
+ * @f:		Field including lookup table
+ * @first_rule:	First rule (lowest index)
+ * @rule_count:	Number of associated rules
+ * @left:	Byte expression for left boundary (start of range)
+ * @right:	Byte expression for right boundary (end of range)
+ *
+ * Given the first rule and number of rules that originated from the same entry,
+ * build the original range associated with the entry, and calculate the length
+ * of the originating netmask.
+ *
+ * In pictures:
+ *
+ *                     bucket
+ *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+ *        0                                                   1,2
+ *        1   1,2
+ *        2                                           1,2
+ *        3                                   1,2
+ *        4   1,2
+ *        5        1   2
+ *        6   1,2  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *        7   1,2 1,2  1   1   1   1   1   1   1   1   1   1   1   1   1   1
+ *
+ * this is the lookup table corresponding to the IPv4 range
+ * 192.168.1.0-192.168.2.1, which was expanded to the two composing netmasks,
+ * rule #1: 192.168.1.0/24, and rule #2: 192.168.2.0/31.
+ *
+ * This function fills @left and @right with the byte values of the leftmost
+ * and rightmost bucket indices for the lowest and highest rule indices,
+ * respectively. If @first_rule is 1 and @rule_count is 2, we obtain, in
+ * nibbles:
+ *   left:  < 12, 0, 10, 8, 0, 1, 0, 0 >
+ *   right: < 12, 0, 10, 8, 0, 2, 2, 1 >
+ * corresponding to bytes:
+ *   left:  < 192, 168, 1, 0 >
+ *   right: < 192, 168, 2, 1 >
+ * with mask length irrelevant here, unused on return, as the range is already
+ * defined by its start and end points. The mask length is relevant for a single
+ * ranged entry instead: if @first_rule is 1 and @rule_count is 1, we ignore
+ * rule 2 above: @left becomes < 192, 168, 1, 0 >, @right becomes
+ * < 192, 168, 1, 255 >, and the mask length, calculated from the distances
+ * between leftmost and rightmost bucket indices for each group, would be 24.
+ *
+ * Return: mask length, in bits.
+ */
+static int pipapo_get_boundaries(struct nft_pipapo_field *f, int first_rule,
+				 int rule_count, u8 *left, u8 *right)
+{
+	u8 *l = left, *r = right;
+	int g, mask_len = 0;
+
+	for (g = 0; g < f->groups; g++) {
+		int b, x0, x1;
+
+		x0 = -1;
+		x1 = -1;
+		for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
+			unsigned long *pos;
+
+			pos = f->lt + (g * NFT_PIPAPO_BUCKETS + b) * f->bsize;
+			if (test_bit(first_rule, pos) && x0 == -1)
+				x0 = b;
+			if (test_bit(first_rule + rule_count - 1, pos))
+				x1 = b;
+		}
+
+		if (g % 2) {
+			*(l++) |= x0 & 0x0f;
+			*(r++) |= x1 & 0x0f;
+		} else {
+			*l |= x0 << 4;
+			*r |= x1 << 4;
+		}
+
+		if (x1 - x0 == 0)
+			mask_len += 4;
+		else if (x1 - x0 == 1)
+			mask_len += 3;
+		else if (x1 - x0 == 3)
+			mask_len += 2;
+		else if (x1 - x0 == 7)
+			mask_len += 1;
+	}
+
+	return mask_len;
+}
+
+/**
+ * pipapo_match_field() - Match rules against byte ranges
+ * @f:		Field including the lookup table
+ * @first_rule:	First of associated rules originating from same entry
+ * @rule_count:	Number of associated rules
+ * @start:	Start of range to be matched
+ * @end:	End of range to be matched
+ *
+ * Return: true on match, false otherwise.
+ */
+static bool pipapo_match_field(struct nft_pipapo_field *f,
+			       int first_rule, int rule_count,
+			       const u8 *start, const u8 *end)
+{
+	u8 right[NFT_PIPAPO_MAX_BYTES] = { 0 };
+	u8 left[NFT_PIPAPO_MAX_BYTES] = { 0 };
+
+	pipapo_get_boundaries(f, first_rule, rule_count, left, right);
+
+	return !memcmp(start, left, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE) &&
+	       !memcmp(end, right, f->groups / NFT_PIPAPO_GROUPS_PER_BYTE);
+}
+
+/**
+ * nft_pipapo_remove() - Remove element given key, commit
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @elem:	nftables API element representation containing key data
+ *
+ * Similarly to nft_pipapo_activate(), this is used as a commit operation by
+ * the API, but it's called once per element in the pending transaction, so
+ * we can't implement it as a single commit. The closest we can get is to
+ * remove the matched element here, if any, and commit the updated data.
+ */
+static void nft_pipapo_remove(const struct net *net, const struct nft_set *set,
+			      const struct nft_set_elem *elem)
+{
+	const u8 *data = (const u8 *)elem->key.val.data;
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m = priv->clone;
+	int rules_f0, first_rule = 0;
+	struct nft_pipapo_elem *e;
+
+	e = pipapo_get(net, set, data, 0);
+	if (IS_ERR(e))
+		return;
+
+	while ((rules_f0 = pipapo_rules_same_key(m->f, first_rule))) {
+		union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
+		const u8 *match_start, *match_end;
+		struct nft_pipapo_field *f;
+		int i, start, rules_fx;
+
+		match_start = data;
+		match_end = (const u8 *)nft_set_ext_key_end(&e->ext)->data;
+
+		start = first_rule;
+		rules_fx = rules_f0;
+
+		nft_pipapo_for_each_field(f, i, m) {
+			if (!pipapo_match_field(f, start, rules_fx,
+						match_start, match_end))
+				break;
+
+			rulemap[i].to = start;
+			rulemap[i].n = rules_fx;
+
+			rules_fx = f->mt[start].n;
+			start = f->mt[start].to;
+
+			match_start += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+			match_end += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+		}
+
+		if (i == m->field_count) {
+			priv->dirty = true;
+			pipapo_drop(m, rulemap);
+			pipapo_commit(set);
+			return;
+		}
+
+		first_rule += rules_f0;
+	}
+}
+
+/**
+ * nft_pipapo_walk() - Walk over elements
+ * @ctx:	nftables API context
+ * @set:	nftables API set representation
+ * @iter:	Iterator
+ *
+ * As elements are referenced in the mapping array for the last field, directly
+ * scan that array: there's no need to follow rule mappings from the first
+ * field.
+ */
+static void nft_pipapo_walk(const struct nft_ctx *ctx, struct nft_set *set,
+			    struct nft_set_iter *iter)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	int i, r;
+
+	rcu_read_lock();
+	m = rcu_dereference(priv->match);
+
+	if (unlikely(!m))
+		goto out;
+
+	for (i = 0, f = m->f; i < m->field_count - 1; i++, f++)
+		;
+
+	for (r = 0; r < f->rules; r++) {
+		struct nft_pipapo_elem *e;
+		struct nft_set_elem elem;
+
+		if (r < f->rules - 1 && f->mt[r + 1].e == f->mt[r].e)
+			continue;
+
+		if (iter->count < iter->skip)
+			goto cont;
+
+		e = f->mt[r].e;
+		if (nft_set_elem_expired(&e->ext))
+			goto cont;
+
+		elem.priv = e;
+
+		iter->err = iter->fn(ctx, set, iter, &elem);
+		if (iter->err < 0)
+			goto out;
+
+cont:
+		iter->count++;
+	}
+
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * nft_pipapo_privsize() - Return the size of private data for the set
+ * @nla:	netlink attributes, ignored as size doesn't depend on them
+ * @desc:	Set description, ignored as size doesn't depend on it
+ *
+ * Return: size of private data for this set implementation, in bytes
+ */
+static u64 nft_pipapo_privsize(const struct nlattr * const nla[],
+			       const struct nft_set_desc *desc)
+{
+	return sizeof(struct nft_pipapo);
+}
+
+/**
+ * nft_pipapo_estimate() - Estimate set size, space and lookup complexity
+ * @desc:	Set description, element count and field description used here
+ * @features:	Flags: NFT_SET_INTERVAL needs to be there
+ * @est:	Storage for estimation data
+ *
+ * The size for this set type can vary dramatically, as it depends on the number
+ * of rules (composing netmasks) the entries expand to. We compute the worst
+ * case here.
+ *
+ * In general, for a non-ranged entry or a single composing netmask, we need
+ * one bit in each of the sixteen NFT_PIPAPO_BUCKETS, for each 4-bit group (that
+ * is, each input bit needs four bits of matching data), plus a bucket in the
+ * mapping table for each field.
+ *
+ * Return: true only for compatible range concatenations
+ */
+static bool nft_pipapo_estimate(const struct nft_set_desc *desc, u32 features,
+				struct nft_set_estimate *est)
+{
+	unsigned long entry_size;
+	int i;
+
+	if (!(features & NFT_SET_INTERVAL) || desc->field_count <= 1)
+		return false;
+
+	for (i = 0, entry_size = 0; i < desc->field_count; i++) {
+		unsigned long rules;
+
+		if (desc->field_len[i] > NFT_PIPAPO_MAX_BYTES)
+			return false;
+
+		/* Worst-case ranges for each concatenated field: each n-bit
+		 * field can expand to up to n * 2 rules in each bucket, and
+		 * each rule also needs a mapping bucket.
+		 */
+		rules = ilog2(desc->field_len[i] * BITS_PER_BYTE) * 2;
+		entry_size += rules * NFT_PIPAPO_BUCKETS / BITS_PER_BYTE;
+		entry_size += rules * sizeof(union nft_pipapo_map_bucket);
+	}
+
+	/* Rules in lookup and mapping tables are needed for each entry */
+	est->size = desc->size * entry_size;
+	if (est->size && div_u64(est->size, desc->size) != entry_size)
+		return false;
+
+	est->size += sizeof(struct nft_pipapo) +
+		     sizeof(struct nft_pipapo_match) * 2;
+
+	est->size += sizeof(struct nft_pipapo_field) * desc->field_count;
+
+	est->lookup = NFT_SET_CLASS_O_LOG_N;
+
+	est->space = NFT_SET_CLASS_O_N;
+
+	return true;
+}
+
+/**
+ * nft_pipapo_init() - Initialise data for a set instance
+ * @set:	nftables API set representation
+ * @desc:	Set description
+ * @nla:	netlink attributes
+ *
+ * Validate the number and size of fields passed as NFTA_SET_DESC_CONCAT
+ * netlink attributes, and initialise internal set parameters, the current
+ * instance of matching data and a copy for subsequent insertions.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int nft_pipapo_init(const struct nft_set *set,
+			   const struct nft_set_desc *desc,
+			   const struct nlattr * const nla[])
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	int err, i;
+
+	if (desc->field_count > NFT_PIPAPO_MAX_FIELDS)
+		return -EINVAL;
+
+	m = kmalloc(sizeof(*priv->match) + sizeof(*f) * desc->field_count,
+		    GFP_KERNEL);
+	if (!m)
+		return -ENOMEM;
+
+	m->field_count = desc->field_count;
+	m->bsize_max = 0;
+
+	m->scratch = alloc_percpu(unsigned long *);
+	if (!m->scratch) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(m->scratch, i) = NULL;
+
+	rcu_head_init(&m->rcu);
+
+	nft_pipapo_for_each_field(f, i, m) {
+		f->groups = desc->field_len[i] * NFT_PIPAPO_GROUPS_PER_BYTE;
+		priv->groups += f->groups;
+
+		priv->width += round_up(desc->field_len[i], sizeof(u32));
+
+		f->bsize = 0;
+		f->rules = 0;
+		f->lt = NULL;
+		f->mt = NULL;
+	}
+
+	/* Create an initial clone of matching data for next insertion */
+	priv->clone = pipapo_clone(m);
+	if (IS_ERR(priv->clone)) {
+		err = PTR_ERR(priv->clone);
+		goto out_free;
+	}
+
+	priv->dirty = false;
+
+	rcu_assign_pointer(priv->match, m);
+
+	return 0;
+
+out_free:
+	free_percpu(m->scratch);
+	kfree(m);
+
+	return err;
+}
+
+/**
+ * nft_pipapo_destroy() - Free private data for set and all committed elements
+ * @set:	nftables API set representation
+ */
+static void nft_pipapo_destroy(const struct nft_set *set)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	int i, r, cpu;
+
+	m = rcu_dereference_protected(priv->match, true);
+	if (m) {
+		rcu_barrier();
+
+		for (i = 0, f = m->f; i < m->field_count - 1; i++, f++)
+			;
+
+		for (r = 0; r < f->rules; r++) {
+			struct nft_pipapo_elem *e;
+
+			if (r < f->rules - 1 && f->mt[r + 1].e == f->mt[r].e)
+				continue;
+
+			e = f->mt[r].e;
+
+			nft_set_elem_destroy(set, e, true);
+		}
+
+		for_each_possible_cpu(cpu)
+			kfree(*per_cpu_ptr(m->scratch, cpu));
+		free_percpu(m->scratch);
+
+		pipapo_free_fields(m);
+		kfree(m);
+		priv->match = NULL;
+	}
+
+	if (priv->clone) {
+		for_each_possible_cpu(cpu)
+			kfree(*per_cpu_ptr(priv->clone->scratch, cpu));
+		free_percpu(priv->clone->scratch);
+
+		pipapo_free_fields(priv->clone);
+		kfree(priv->clone);
+		priv->clone = NULL;
+	}
+}
+
+/**
+ * nft_pipapo_gc_init() - Initialise garbage collection
+ * @set:	nftables API set representation
+ *
+ * Instead of actually setting up periodic work for garbage collection, as
+ * this operation requires a swap of matching data with the working copy, we
+ * do that opportunistically with other commit operations if the interval
+ * has elapsed, so we just need to set the current jiffies timestamp here.
+ */
+static void nft_pipapo_gc_init(const struct nft_set *set)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+
+	priv->last_gc = jiffies;
+}
+
+struct nft_set_type nft_set_pipapo_type __read_mostly = {
+	.owner		= THIS_MODULE,
+	.features	= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT |
+			  NFT_SET_TIMEOUT,
+	.ops		= {
+		.lookup		= nft_pipapo_lookup,
+		.insert		= nft_pipapo_insert,
+		.activate	= nft_pipapo_activate,
+		.deactivate	= nft_pipapo_deactivate,
+		.flush		= nft_pipapo_flush,
+		.remove		= nft_pipapo_remove,
+		.walk		= nft_pipapo_walk,
+		.get		= nft_pipapo_get,
+		.privsize	= nft_pipapo_privsize,
+		.estimate	= nft_pipapo_estimate,
+		.init		= nft_pipapo_init,
+		.destroy	= nft_pipapo_destroy,
+		.gc_init	= nft_pipapo_gc_init,
+		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
+	},
+};
-- 
2.24.1



* [PATCH nf-next v4 6/9] selftests: netfilter: Introduce tests for sets with range concatenation
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

This test covers functionality and stability of the newly added
nftables set implementation supporting concatenation of ranged
fields.

For some selected set expression types, test:
- correctness, by checking that packets match or don't
- concurrency, by attempting races between insertion, deletion, lookup
- timeout feature, checking that packets don't match expired entries

and (roughly) estimate matching rates, comparing to baselines for
simple drop on netdev ingress hook and for hash and rbtrees sets.

In order to send packets, this needs one of sendip, netcat or bash.
To flood with traffic, iperf3, iperf and netperf are supported. For
performance measurements, this relies on the sample pktgen script
pktgen_bench_xmit_mode_netif_receive.sh.

If none of the tools suitable for a given test are available, specific
tests will be skipped.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: No changes
v3: No changes
v2: No changes

 tools/testing/selftests/netfilter/Makefile    |    3 +-
 .../selftests/netfilter/nft_concat_range.sh   | 1481 +++++++++++++++++
 2 files changed, 1483 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/netfilter/nft_concat_range.sh

diff --git a/tools/testing/selftests/netfilter/Makefile b/tools/testing/selftests/netfilter/Makefile
index de1032b5ddea..08194aa44006 100644
--- a/tools/testing/selftests/netfilter/Makefile
+++ b/tools/testing/selftests/netfilter/Makefile
@@ -2,6 +2,7 @@
 # Makefile for netfilter selftests
 
 TEST_PROGS := nft_trans_stress.sh nft_nat.sh bridge_brouter.sh \
-	conntrack_icmp_related.sh nft_flowtable.sh ipvs.sh
+	conntrack_icmp_related.sh nft_flowtable.sh ipvs.sh \
+	nft_concat_range.sh
 
 include ../lib.mk
diff --git a/tools/testing/selftests/netfilter/nft_concat_range.sh b/tools/testing/selftests/netfilter/nft_concat_range.sh
new file mode 100755
index 000000000000..aca21dde102a
--- /dev/null
+++ b/tools/testing/selftests/netfilter/nft_concat_range.sh
@@ -0,0 +1,1481 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# nft_concat_range.sh - Tests for sets with concatenation of ranged fields
+#
+# Copyright (c) 2019 Red Hat GmbH
+#
+# Author: Stefano Brivio <sbrivio@redhat.com>
+#
+# shellcheck disable=SC2154,SC2034,SC2016,SC2030,SC2031
+# ^ Configuration and templates sourced with eval, counters reused in subshells
+
+KSELFTEST_SKIP=4
+
+# Available test groups:
+# - correctness: check that packets match given entries, and only those
+# - concurrency: attempt races between insertion, deletion and lookup
+# - timeout: check that packets match entries until they expire
+# - performance: estimate matching rate, compare with rbtree and hash baselines
+TESTS="correctness concurrency timeout"
+[ "${quicktest}" != "1" ] && TESTS="${TESTS} performance"
+
+# Set types, defined by TYPE_ variables below
+TYPES="net_port port_net net6_port port_proto net6_port_mac net6_port_mac_proto
+       net_port_net net_mac net_mac_icmp net6_mac_icmp net6_port_net6_port
+       net_port_mac_proto_net"
+
+# List of possible paths to pktgen script from kernel tree for performance tests
+PKTGEN_SCRIPT_PATHS="
+	../../../samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
+	pktgen/pktgen_bench_xmit_mode_netif_receive.sh"
+
+# Definition of set types:
+# display	display text for test report
+# type_spec	nftables set type specifier
+# chain_spec	nftables type specifier for rules mapping to set
+# dst		call sequence of format_*() functions for destination fields
+# src		call sequence of format_*() functions for source fields
+# start		initial integer used to generate addresses and ports
+# count		count of entries to generate and match
+# src_delta	number added to the destination generator for source fields
+# tools		list of tools for correctness and timeout tests, any can be used
+# proto		L4 protocol of test packets
+#
+# race_repeat	race attempts per thread, 0 disables concurrency test for type
+# flood_tools	list of tools for concurrency tests, any can be used
+# flood_proto	L4 protocol of test packets for concurrency tests
+# flood_spec	nftables type specifier for concurrency tests
+#
+# perf_duration	duration of single pktgen injection test
+# perf_spec	nftables type specifier for performance tests
+# perf_dst	format_*() functions for destination fields in performance test
+# perf_src	format_*() functions for source fields in performance test
+# perf_entries	number of set entries for performance test
+# perf_proto	L3 protocol of test packets
+TYPE_net_port="
+display		net,port
+type_spec	ipv4_addr . inet_service
+chain_spec	ip daddr . udp dport
+dst		addr4 port
+src		 
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	udp
+flood_spec	ip daddr . udp dport
+
+perf_duration	5
+perf_spec	ip daddr . udp dport
+perf_dst	addr4 port
+perf_src	 
+perf_entries	1000
+perf_proto	ipv4
+"
+
+TYPE_port_net="
+display		port,net
+type_spec	inet_service . ipv4_addr
+chain_spec	udp dport . ip daddr
+dst		port addr4
+src		 
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	udp
+flood_spec	udp dport . ip daddr
+
+perf_duration	5
+perf_spec	udp dport . ip daddr
+perf_dst	port addr4
+perf_src	 
+perf_entries	100
+perf_proto	ipv4
+"
+
+TYPE_net6_port="
+display		net6,port
+type_spec	ipv6_addr . inet_service
+chain_spec	ip6 daddr . udp dport
+dst		addr6 port
+src		 
+start		10
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp6
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp6
+flood_spec	ip6 daddr . udp dport
+
+perf_duration	5
+perf_spec	ip6 daddr . udp dport
+perf_dst	addr6 port
+perf_src	 
+perf_entries	1000
+perf_proto	ipv6
+"
+
+TYPE_port_proto="
+display		port,proto
+type_spec	inet_service . inet_proto
+chain_spec	udp dport . meta l4proto
+dst		port proto
+src		 
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	udp dport . meta l4proto
+perf_dst	port proto
+perf_src	 
+perf_entries	30000
+perf_proto	ipv4
+"
+
+TYPE_net6_port_mac="
+display		net6,port,mac
+type_spec	ipv6_addr . inet_service . ether_addr
+chain_spec	ip6 daddr . udp dport . ether saddr
+dst		addr6 port
+src		mac
+start		10
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp6
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	ip6 daddr . udp dport . ether daddr
+perf_dst	addr6 port mac
+perf_src	 
+perf_entries	10
+perf_proto	ipv6
+"
+
+TYPE_net6_port_mac_proto="
+display		net6,port,mac,proto
+type_spec	ipv6_addr . inet_service . ether_addr . inet_proto
+chain_spec	ip6 daddr . udp dport . ether saddr . meta l4proto
+dst		addr6 port
+src		mac proto
+start		10
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp6
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	ip6 daddr . udp dport . ether daddr . meta l4proto
+perf_dst	addr6 port mac proto
+perf_src	 
+perf_entries	1000
+perf_proto	ipv6
+"
+
+TYPE_net_port_net="
+display		net,port,net
+type_spec	ipv4_addr . inet_service . ipv4_addr
+chain_spec	ip daddr . udp dport . ip saddr
+dst		addr4 port
+src		addr4
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp
+flood_spec	ip daddr . udp dport . ip saddr
+
+perf_duration	0
+"
+
+TYPE_net6_port_net6_port="
+display		net6,port,net6,port
+type_spec	ipv6_addr . inet_service . ipv6_addr . inet_service
+chain_spec	ip6 daddr . udp dport . ip6 saddr . udp sport
+dst		addr6 port
+src		addr6 port
+start		10
+count		5
+src_delta	2000
+tools		sendip nc
+proto		udp6
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp6
+flood_spec	ip6 daddr . tcp dport . ip6 saddr . tcp sport
+
+perf_duration	0
+"
+
+TYPE_net_port_mac_proto_net="
+display		net,port,mac,proto,net
+type_spec	ipv4_addr . inet_service . ether_addr . inet_proto . ipv4_addr
+chain_spec	ip daddr . udp dport . ether saddr . meta l4proto . ip saddr
+dst		addr4 port
+src		mac proto addr4
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	0
+
+perf_duration	0
+"
+
+TYPE_net_mac="
+display		net,mac
+type_spec	ipv4_addr . ether_addr
+chain_spec	ip daddr . ether saddr
+dst		addr4
+src		mac
+start		1
+count		5
+src_delta	2000
+tools		sendip nc bash
+proto		udp
+
+race_repeat	0
+
+perf_duration	5
+perf_spec	ip daddr . ether daddr
+perf_dst	addr4 mac
+perf_src	 
+perf_entries	1000
+perf_proto	ipv4
+"
+
+TYPE_net_mac_icmp="
+display		net,mac - ICMP
+type_spec	ipv4_addr . ether_addr
+chain_spec	ip daddr . ether saddr
+dst		addr4
+src		mac
+start		1
+count		5
+src_delta	2000
+tools		ping
+proto		icmp
+
+race_repeat	0
+
+perf_duration	0
+"
+
+TYPE_net6_mac_icmp="
+display		net6,mac - ICMPv6
+type_spec	ipv6_addr . ether_addr
+chain_spec	ip6 daddr . ether saddr
+dst		addr6
+src		mac
+start		10
+count		50
+src_delta	2000
+tools		ping
+proto		icmp6
+
+race_repeat	0
+
+perf_duration	0
+"
+
+TYPE_net_port_proto_net="
+display		net,port,proto,net
+type_spec	ipv4_addr . inet_service . inet_proto . ipv4_addr
+chain_spec	ip daddr . udp dport . meta l4proto . ip saddr
+dst		addr4 port proto
+src		addr4
+start		1
+count		5
+src_delta	2000
+tools		sendip nc
+proto		udp
+
+race_repeat	3
+flood_tools	iperf3 iperf netperf
+flood_proto	tcp
+flood_spec	ip daddr . tcp dport . meta l4proto . ip saddr
+
+perf_duration	0
+"
+
+# Set template for all tests; types and rules are filled in depending on test
+set_template='
+flush ruleset
+
+table inet filter {
+	counter test {
+		packets 0 bytes 0
+	}
+
+	set test {
+		type ${type_spec}
+		flags interval,timeout
+	}
+
+	chain input {
+		type filter hook prerouting priority 0; policy accept;
+		${chain_spec} @test counter name \"test\"
+	}
+}
+
+table netdev perf {
+	counter test {
+		packets 0 bytes 0
+	}
+
+	counter match {
+		packets 0 bytes 0
+	}
+
+	set test {
+		type ${type_spec}
+		flags interval
+	}
+
+	set norange {
+		type ${type_spec}
+	}
+
+	set noconcat {
+		type ${type_spec%% *}
+		flags interval
+	}
+
+	chain test {
+		type filter hook ingress device veth_a priority 0;
+	}
+}
+'
+
+err_buf=
+info_buf=
+
+# Append string to error buffer
+err() {
+	err_buf="${err_buf}${1}
+"
+}
+
+# Append string to information buffer
+info() {
+	info_buf="${info_buf}${1}
+"
+}
+
+# Flush error buffer to stdout
+err_flush() {
+	printf "%s" "${err_buf}"
+	err_buf=
+}
+
+# Flush information buffer to stdout
+info_flush() {
+	printf "%s" "${info_buf}"
+	info_buf=
+}
+
+# Set up veth pair: this namespace receives traffic, namespace B generates it
+setup_veth() {
+	ip netns add B
+	ip link add veth_a type veth peer name veth_b || return 1
+
+	ip link set veth_a up
+	ip link set veth_b netns B
+
+	ip -n B link set veth_b up
+
+	ip addr add dev veth_a 10.0.0.1
+	ip route add default dev veth_a
+
+	ip -6 addr add fe80::1/64 dev veth_a nodad
+	ip -6 addr add 2001:db8::1/64 dev veth_a nodad
+	ip -6 route add default dev veth_a
+
+	ip -n B route add default dev veth_b
+
+	ip -6 -n B addr add fe80::2/64 dev veth_b nodad
+	ip -6 -n B addr add 2001:db8::2/64 dev veth_b nodad
+	ip -6 -n B route add default dev veth_b
+
+	B() {
+		ip netns exec B "$@" >/dev/null 2>&1
+	}
+
+	sleep 2
+}
+
+# Fill in set template and initialise set
+setup_set() {
+	eval "echo \"${set_template}\"" | nft -f -
+}
+
+# Check that at least one of the needed tools is available
+check_tools() {
+	__tools=
+	for tool in ${tools}; do
+		if [ "${tool}" = "nc" ] && [ "${proto}" = "udp6" ] && \
+		   ! nc -u -w0 1.1.1.1 1 2>/dev/null; then
+			# Some GNU netcat builds might not support IPv6
+			__tools="${__tools} netcat-openbsd"
+			continue
+		fi
+		__tools="${__tools} ${tool}"
+
+		command -v "${tool}" >/dev/null && return 0
+	done
+	err "need one of:${__tools}, skipping" && return 1
+}
+
+# Set up function to send ICMP packets
+setup_send_icmp() {
+	send_icmp() {
+		B ping -c1 -W1 "${dst_addr4}" >/dev/null 2>&1
+	}
+}
+
+# Set up function to send ICMPv6 packets
+setup_send_icmp6() {
+	if command -v ping6 >/dev/null; then
+		send_icmp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			B ping6 -q -c1 -W1 "${dst_addr6}"
+		}
+	else
+		send_icmp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			B ping -q -6 -c1 -W1 "${dst_addr6}"
+		}
+	fi
+}
+
+# Set up function to send single UDP packets on IPv4
+setup_send_udp() {
+	if command -v sendip >/dev/null; then
+		send_udp() {
+			[ -n "${src_port}" ] && src_port="-us ${src_port}"
+			[ -n "${dst_port}" ] && dst_port="-ud ${dst_port}"
+			[ -n "${src_addr4}" ] && src_addr4="-is ${src_addr4}"
+
+			# shellcheck disable=SC2086 # sendip needs split options
+			B sendip -p ipv4 -p udp ${src_addr4} ${src_port} \
+						${dst_port} "${dst_addr4}"
+
+			src_port=
+			dst_port=
+			src_addr4=
+		}
+	elif command -v nc >/dev/null; then
+		if nc -u -w0 1.1.1.1 1 2>/dev/null; then
+			# OpenBSD netcat
+			nc_opt="-w0"
+		else
+			# GNU netcat
+			nc_opt="-q0"
+		fi
+
+		send_udp() {
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}" dev veth_b
+				__src_addr4="-s ${src_addr4}"
+			fi
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+			[ -n "${src_port}" ] && src_port="-p ${src_port}"
+
+			echo "" | B nc -u "${nc_opt}" "${__src_addr4}" \
+				  "${src_port}" "${dst_addr4}" "${dst_port}"
+
+			src_addr4=
+			src_port=
+		}
+	elif [ -z "$(bash -c 'type -p')" ]; then
+		send_udp() {
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				B ip route add default dev veth_b
+			fi
+
+			B bash -c "echo > /dev/udp/${dst_addr4}/${dst_port}"
+
+			if [ -n "${src_addr4}" ]; then
+				B ip addr del "${src_addr4}/16" dev veth_b
+			fi
+			src_addr4=
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send single UDP packets on IPv6
+setup_send_udp6() {
+	if command -v sendip >/dev/null; then
+		send_udp6() {
+			[ -n "${src_port}" ] && src_port="-us ${src_port}"
+			[ -n "${dst_port}" ] && dst_port="-ud ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				src_addr6="-6s ${src_addr6}"
+			else
+				src_addr6="-6s 2001:db8::2"
+			fi
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			B sendip -p ipv6 -p udp ${src_addr6} ${src_port} \
+						${dst_port} "${dst_addr6}"
+
+			src_port=
+			dst_port=
+			src_addr6=
+		}
+	elif command -v nc >/dev/null && nc -u -w0 1.1.1.1 1 2>/dev/null; then
+		# GNU netcat might not work with IPv6, try next tool
+		send_udp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+			else
+				src_addr6="2001:db8::2"
+			fi
+			[ -n "${src_port}" ] && src_port="-p ${src_port}"
+
+			# shellcheck disable=SC2086 # this needs split options
+			echo "" | B nc -u -w0 "-s${src_addr6}" ${src_port} \
+					       ${dst_addr6} ${dst_port}
+
+			src_addr6=
+			src_port=
+		}
+	elif [ -z "$(bash -c 'type -p')" ]; then
+		send_udp6() {
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+			B ip addr add "${src_addr6}" dev veth_b nodad
+			B bash -c "echo > /dev/udp/${dst_addr6}/${dst_port}"
+			ip -6 addr del "${dst_addr6}" dev veth_a 2>/dev/null
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send TCP traffic on IPv4
+setup_flood_tcp() {
+	if command -v iperf3 >/dev/null; then
+		flood_tcp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_port="--cport ${src_port}"
+			fi
+			B ip route add default dev veth_b 2>/dev/null
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf3 -s -DB "${dst_addr4}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf3 -c "${dst_addr4}" ${dst_port} ${src_port} \
+				${src_addr4} -l16 -t 1000
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v iperf >/dev/null; then
+		flood_tcp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2 2>/dev/null
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_addr4="${src_addr4}:${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf -s -DB "${dst_addr4}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf -c "${dst_addr4}" ${dst_port} ${src_addr4} \
+				-l20 -t 1000
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v netperf >/dev/null; then
+		flood_tcp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				dst_port="${dst_port},${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			netserver -4 ${dst_port} -L "${dst_addr4}" \
+				>/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B netperf -4 -H "${dst_addr4}" ${dst_port} \
+				-L "${src_addr4}" -l 1000 -t TCP_STREAM
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send TCP traffic on IPv6
+setup_flood_tcp6() {
+	if command -v iperf3 >/dev/null; then
+		flood_tcp6() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+				src_addr6="-B ${src_addr6}"
+			else
+				src_addr6="-B 2001:db8::2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_port="--cport ${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf3 -s -DB "${dst_addr6}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf3 -c "${dst_addr6}" ${dst_port} \
+				${src_port} ${src_addr6} -l16 -t 1000
+
+			src_addr6=
+			src_port=
+			dst_port=
+		}
+	elif command -v iperf >/dev/null; then
+		flood_tcp6() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+				src_addr6="-B ${src_addr6}"
+			else
+				src_addr6="-B 2001:db8::2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_addr6="${src_addr6}:${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf -s -VDB "${dst_addr6}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf -c "${dst_addr6}" -V ${dst_port} \
+				${src_addr6} -l1 -t 1000
+
+			src_addr6=
+			src_port=
+			dst_port=
+		}
+	elif command -v netperf >/dev/null; then
+		flood_tcp6() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr6}" ]; then
+				B ip addr add "${src_addr6}" dev veth_b nodad
+			else
+				src_addr6="2001:db8::2"
+			fi
+			if [ -n "${src_port}" ]; then
+				dst_port="${dst_port},${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip -6 addr add "${dst_addr6}" dev veth_a nodad \
+				2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			netserver -6 ${dst_port} -L "${dst_addr6}" \
+				>/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B netperf -6 -H "${dst_addr6}" ${dst_port} \
+				-L "${src_addr6}" -l 1000 -t TCP_STREAM
+
+			src_addr6=
+			src_port=
+			dst_port=
+		}
+	else
+		return 1
+	fi
+}
+
+# Set up function to send UDP traffic on IPv4
+setup_flood_udp() {
+	if command -v iperf3 >/dev/null; then
+		flood_udp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2 2>/dev/null
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_port="--cport ${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf3 -s -DB "${dst_addr4}" ${dst_port}
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf3 -u -c "${dst_addr4}" -Z -b 100M -l16 -t1000 \
+				${dst_port} ${src_port} ${src_addr4}
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v iperf >/dev/null; then
+		flood_udp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+				src_addr4="-B ${src_addr4}"
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="-B 10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				src_addr4="${src_addr4}:${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			iperf -u -sDB "${dst_addr4}" ${dst_port} >/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B iperf -u -c "${dst_addr4}" -b 100M -l1 -t1000 \
+				${dst_port} ${src_addr4}
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	elif command -v netperf >/dev/null; then
+		flood_udp() {
+			[ -n "${dst_port}" ] && dst_port="-p ${dst_port}"
+			if [ -n "${src_addr4}" ]; then
+				B ip addr add "${src_addr4}/16" dev veth_b
+			else
+				B ip addr add dev veth_b 10.0.0.2
+				src_addr4="10.0.0.2"
+			fi
+			if [ -n "${src_port}" ]; then
+				dst_port="${dst_port},${src_port}"
+			fi
+			B ip route add default dev veth_b
+			ip addr add "${dst_addr4}" dev veth_a 2>/dev/null
+
+			# shellcheck disable=SC2086 # this needs split options
+			netserver -4 ${dst_port} -L "${dst_addr4}" \
+				>/dev/null 2>&1
+			sleep 2
+
+			# shellcheck disable=SC2086 # this needs split options
+			B netperf -4 -H "${dst_addr4}" ${dst_port} \
+				-L "${src_addr4}" -l 1000 -t UDP_STREAM
+
+			src_addr4=
+			src_port=
+			dst_port=
+		}
+	else
+		return 1
+	fi
+}
+
+# Find pktgen script and set up function to start pktgen injection
+setup_perf() {
+	for pktgen_script_path in ${PKTGEN_SCRIPT_PATHS} __notfound; do
+		command -v "${pktgen_script_path}" >/dev/null && break
+	done
+	[ "${pktgen_script_path}" = "__notfound" ] && return 1
+
+	perf_ipv4() {
+		${pktgen_script_path} -s80 \
+			-i veth_a -d "${dst_addr4}" -p "${dst_port}" \
+			-m "${dst_mac}" \
+			-t $(($(nproc) / 5 + 1)) -b10000 -n0 2>/dev/null &
+		perf_pid=$!
+	}
+	perf_ipv6() {
+		IP6=6 ${pktgen_script_path} -s100 \
+			-i veth_a -d "${dst_addr6}" -p "${dst_port}" \
+			-m "${dst_mac}" \
+			-t $(($(nproc) / 5 + 1)) -b10000 -n0 2>/dev/null &
+		perf_pid=$!
+	}
+}
+
+# Clean up before each test
+cleanup() {
+	nft reset counter inet filter test	>/dev/null 2>&1
+	nft flush ruleset			>/dev/null 2>&1
+	ip link del dummy0			2>/dev/null
+	ip route del default			2>/dev/null
+	ip -6 route del default			2>/dev/null
+	ip netns del B				2>/dev/null
+	ip link del veth_a			2>/dev/null
+	timeout=
+	killall iperf3				2>/dev/null
+	killall iperf				2>/dev/null
+	killall netperf				2>/dev/null
+	killall netserver			2>/dev/null
+	rm -f ${tmp}
+	sleep 2
+}
+
+# Entry point for setup functions
+setup() {
+	if [ "$(id -u)" -ne 0 ]; then
+		echo "  need to run as root"
+		exit ${KSELFTEST_SKIP}
+	fi
+
+	cleanup
+	check_tools || return 1
+	for arg do
+		if ! eval setup_"${arg}"; then
+			err "  ${arg} not supported"
+			return 1
+		fi
+	done
+}
+
+# Format integer into IPv4 address, adding 10.0.0.5 (arbitrary) to it
+format_addr4() {
+	a=$((${1} + 16777216 * 10 + 5))
+	printf "%i.%i.%i.%i"						\
+	       "$((a / 16777216))" "$((a % 16777216 / 65536))"	\
+	       "$((a % 65536 / 256))" "$((a % 256))"
+}
+
+# Format integer into IPv6 address, adding 2001:db8:: to it
+format_addr6() {
+	printf "2001:db8::%04x:%04x" "$((${1} / 65536))" "$((${1} % 65536))"
+}
+
+# Format integer into EUI-48 address, adding 00:01:00:00:00:00 to it
+format_mac() {
+	printf "00:01:%02x:%02x:%02x:%02x" \
+	       "$((${1} / 16777216))" "$((${1} % 16777216 / 65536))"	\
+	       "$((${1} % 65536 / 256))" "$((${1} % 256))"
+}
+
+# Format integer into port number, avoiding port 0
+format_port() {
+	printf "%i" "$((${1} % 65534 + 1))"
+}
+
+# Drop suffixed '6' from L4 protocol, if any
+format_proto() {
+	printf "%s" "${proto}" | tr -d 6
+}
+
+# Format destination and source fields into nft concatenated type
+format() {
+	__start=
+	__end=
+	__expr="{ "
+
+	for f in ${dst}; do
+		[ "${__expr}" != "{ " ] && __expr="${__expr} . "
+
+		__start="$(eval format_"${f}" "${start}")"
+		__end="$(eval format_"${f}" "${end}")"
+
+		if [ "${f}" = "proto" ]; then
+			__expr="${__expr}${__start}"
+		else
+			__expr="${__expr}${__start}-${__end}"
+		fi
+	done
+	for f in ${src}; do
+		__expr="${__expr} . "
+		__start="$(eval format_"${f}" "${srcstart}")"
+		__end="$(eval format_"${f}" "${srcend}")"
+
+		if [ "${f}" = "proto" ]; then
+			__expr="${__expr}${__start}"
+		else
+			__expr="${__expr}${__start}-${__end}"
+		fi
+	done
+
+	if [ -n "${timeout}" ]; then
+		echo "${__expr} timeout ${timeout}s }"
+	else
+		echo "${__expr} }"
+	fi
+}
+
+# Format destination and source fields into nft type, start element only
+format_norange() {
+	__expr="{ "
+
+	for f in ${dst}; do
+		[ "${__expr}" != "{ " ] && __expr="${__expr} . "
+
+		__expr="${__expr}$(eval format_"${f}" "${start}")"
+	done
+	for f in ${src}; do
+		__expr="${__expr} . $(eval format_"${f}" "${start}")"
+	done
+
+	echo "${__expr} }"
+}
+
+# Format first destination field into nft type
+format_noconcat() {
+	for f in ${dst}; do
+		__start="$(eval format_"${f}" "${start}")"
+		__end="$(eval format_"${f}" "${end}")"
+
+		if [ "${f}" = "proto" ]; then
+			echo "{ ${__start} }"
+		else
+			echo "{ ${__start}-${__end} }"
+		fi
+		return
+	done
+}
+
+# Add single entry to 'test' set in 'inet filter' table
+add() {
+	if ! nft add element inet filter test "${1}"; then
+		err "Failed to add ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Format and output entries for sets in 'netdev perf' table
+add_perf() {
+	if [ "${1}" = "test" ]; then
+		echo "add element netdev perf test $(format)"
+	elif [ "${1}" = "norange" ]; then
+		echo "add element netdev perf norange $(format_norange)"
+	elif [ "${1}" = "noconcat" ]; then
+		echo "add element netdev perf noconcat $(format_noconcat)"
+	fi
+}
+
+# Add single entry to 'norange' set in 'netdev perf' table
+add_perf_norange() {
+	if ! nft add element netdev perf norange "${1}"; then
+		err "Failed to add ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Add single entry to 'noconcat' set in 'netdev perf' table
+add_perf_noconcat() {
+	if ! nft add element netdev perf noconcat "${1}"; then
+		err "Failed to add ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Delete single entry from set
+del() {
+	if ! nft delete element inet filter test "${1}"; then
+		err "Failed to delete ${1} given ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Return packet count from 'test' counter in 'inet filter' table
+count_packets() {
+	found=0
+	for token in $(nft list counter inet filter test); do
+		[ ${found} -eq 1 ] && echo "${token}" && return
+		[ "${token}" = "packets" ] && found=1
+	done
+}
+
+# Return packet count from 'test' counter in 'netdev perf' table
+count_perf_packets() {
+	found=0
+	for token in $(nft list counter netdev perf test); do
+		[ ${found} -eq 1 ] && echo "${token}" && return
+		[ "${token}" = "packets" ] && found=1
+	done
+}
+
+# Set MAC addresses, send traffic according to specifier
+flood() {
+	ip link set veth_a address "$(format_mac "${1}")"
+	ip -n B link set veth_b address "$(format_mac "${2}")"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval flood_\$proto
+}
+
+# Set MAC addresses, start pktgen injection
+perf() {
+	dst_mac="$(format_mac "${1}")"
+	ip link set veth_a address "${dst_mac}"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval perf_\$perf_proto
+}
+
+# Set MAC addresses, send single packet, check that it matches, reset counter
+send_match() {
+	ip link set veth_a address "$(format_mac "${1}")"
+	ip -n B link set veth_b address "$(format_mac "${2}")"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval send_\$proto
+	if [ "$(count_packets)" != "1" ]; then
+		err "${proto} packet to:"
+		err "  $(for f in ${dst}; do
+			 eval format_\$f "${1}"; printf ' '; done)"
+		err "from:"
+		err "  $(for f in ${src}; do
+			 eval format_\$f "${2}"; printf ' '; done)"
+		err "should have matched ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+	nft reset counter inet filter test >/dev/null
+}
+
+# Set MAC addresses, send single packet, check that it doesn't match
+send_nomatch() {
+	ip link set veth_a address "$(format_mac "${1}")"
+	ip -n B link set veth_b address "$(format_mac "${2}")"
+
+	for f in ${dst}; do
+		eval dst_"$f"=\$\(format_\$f "${1}"\)
+	done
+	for f in ${src}; do
+		eval src_"$f"=\$\(format_\$f "${2}"\)
+	done
+	eval send_\$proto
+	if [ "$(count_packets)" != "0" ]; then
+		err "${proto} packet to:"
+		err "  $(for f in ${dst}; do
+			 eval format_\$f "${1}"; printf ' '; done)"
+		err "from:"
+		err "  $(for f in ${src}; do
+			 eval format_\$f "${2}"; printf ' '; done)"
+		err "should not have matched ruleset:"
+		err "$(nft list ruleset -a)"
+		return 1
+	fi
+}
+
+# Correctness test template:
+# - add ranged element, check that packets match it
+# - check that packets outside range don't match it
+# - remove some elements, check that packets don't match anymore
+test_correctness() {
+	setup veth send_"${proto}" set || return ${KSELFTEST_SKIP}
+
+	range_size=1
+	for i in $(seq "${start}" $((start + count))); do
+		end=$((start + range_size))
+
+		# Avoid negative or zero-sized port ranges
+		if [ $((end / 65534)) -gt $((start / 65534)) ]; then
+			start=${end}
+			end=$((end + 1))
+		fi
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		add "$(format)" || return 1
+		for j in $(seq ${start} $((range_size / 2 + 1)) ${end}); do
+			send_match "${j}" $((j + src_delta)) || return 1
+		done
+		send_nomatch $((end + 1)) $((end + 1 + src_delta)) || return 1
+
+		# Delete elements now and then
+		if [ $((i % 3)) -eq 0 ]; then
+			del "$(format)" || return 1
+			for j in $(seq ${start} \
+				   $((range_size / 2 + 1)) ${end}); do
+				send_nomatch "${j}" $((j + src_delta)) \
+					|| return 1
+			done
+		fi
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+}
+
+# Concurrency test template:
+# - add all the elements
+# - start a subshell for each available CPU thread that:
+#   - adds all the elements
+#   - flushes the set
+#   - adds all the elements
+#   - flushes the entire ruleset
+#   - adds the set back
+#   - adds all the elements
+#   - deletes all the elements
+test_concurrency() {
+	proto=${flood_proto}
+	tools=${flood_tools}
+	chain_spec=${flood_spec}
+	setup veth flood_"${proto}" set || return ${KSELFTEST_SKIP}
+
+	range_size=1
+	cstart=${start}
+	flood_pids=
+	for i in $(seq ${start} $((start + count))); do
+		end=$((start + range_size))
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		add "$(format)" || return 1
+
+		flood "${i}" $((i + src_delta)) & flood_pids="${flood_pids} $!"
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+
+	sleep 10
+
+	pids=
+	for c in $(seq 1 "$(nproc)"); do (
+		for r in $(seq 1 "${race_repeat}"); do
+			range_size=1
+
+			# $start needs to be local to this subshell
+			# shellcheck disable=SC2030
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				add "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+
+			nft flush inet filter test 2>/dev/null
+
+			range_size=1
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				add "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+
+			nft flush ruleset
+			setup set 2>/dev/null
+
+			range_size=1
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				add "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+
+			range_size=1
+			start=${cstart}
+			for i in $(seq ${start} $((start + count))); do
+				end=$((start + range_size))
+				srcstart=$((start + src_delta))
+				srcend=$((end + src_delta))
+
+				del "$(format)" 2>/dev/null
+
+				range_size=$((range_size + 1))
+				start=$((end + range_size))
+			done
+		done
+	) & pids="${pids} $!"
+	done
+
+	# shellcheck disable=SC2046,SC2086 # word splitting wanted here
+	wait $(for pid in ${pids}; do echo ${pid}; done)
+	# shellcheck disable=SC2046,SC2086
+	kill $(for pid in ${flood_pids}; do echo ${pid}; done) 2>/dev/null
+	# shellcheck disable=SC2046,SC2086
+	wait $(for pid in ${flood_pids}; do echo ${pid}; done) 2>/dev/null
+
+	return 0
+}
+
+# Timeout test template:
+# - add all the elements with 3s timeout while checking that packets match
+# - wait 3s after the last insertion, check that packets don't match any entry
+test_timeout() {
+	setup veth send_"${proto}" set || return ${KSELFTEST_SKIP}
+
+	timeout=3
+	range_size=1
+	for i in $(seq "${start}" $((start + count))); do
+		end=$((start + range_size))
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		add "$(format)" || return 1
+
+		for j in $(seq ${start} $((range_size / 2 + 1)) ${end}); do
+			send_match "${j}" $((j + src_delta)) || return 1
+		done
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+	sleep 3
+	for i in $(seq ${start} $((start + count))); do
+		end=$((start + range_size))
+		srcstart=$((start + src_delta))
+		srcend=$((end + src_delta))
+
+		for j in $(seq ${start} $((range_size / 2 + 1)) ${end}); do
+			send_nomatch "${j}" $((j + src_delta)) || return 1
+		done
+
+		range_size=$((range_size + 1))
+		start=$((end + range_size))
+	done
+}
+
+# Performance test template:
+# - add concatenated ranged entries
+# - add non-ranged concatenated entries (for hash set matching rate baseline)
+# - add ranged entries with first field only (for rbtree baseline)
+# - start pktgen injection directly on device rx path of this namespace
+# - measure drop-only rate, hash and rbtree baselines, then matching rate
+test_performance() {
+	chain_spec=${perf_spec}
+	dst="${perf_dst}"
+	src="${perf_src}"
+	setup veth perf set || return ${KSELFTEST_SKIP}
+
+	first=${start}
+	range_size=1
+	for set in test norange noconcat; do
+		start=${first}
+		for i in $(seq ${start} $((start + perf_entries))); do
+			end=$((start + range_size))
+			srcstart=$((start + src_delta))
+			srcend=$((end + src_delta))
+
+			if [ $((end / 65534)) -gt $((start / 65534)) ]; then
+				start=${end}
+				end=$((end + 1))
+			elif [ ${start} -eq ${end} ]; then
+				end=$((start + 1))
+			fi
+
+			add_perf ${set}
+
+			start=$((end + range_size))
+		done > "${tmp}"
+		nft -f "${tmp}"
+	done
+
+	perf $((end - 1)) ${srcstart}
+
+	sleep 2
+
+	nft add rule netdev perf test counter name \"test\" drop
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	info "    baseline (drop from netdev hook):            ${pps}pps"
+	handle="$(nft -a list chain netdev perf test | grep counter)"
+	handle="${handle##* }"
+	nft delete rule netdev perf test handle "${handle}"
+
+	nft add rule "netdev perf test ${chain_spec} @norange \
+		counter name \"test\" drop"
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	info "    baseline hash (non-ranged entries):          ${pps}pps"
+	handle="$(nft -a list chain netdev perf test | grep counter)"
+	handle="${handle##* }"
+	nft delete rule netdev perf test handle "${handle}"
+
+	nft add rule "netdev perf test ${chain_spec%%. *} @noconcat \
+		counter name \"test\" drop"
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	info "    baseline rbtree (match on first field only): ${pps}pps"
+	handle="$(nft -a list chain netdev perf test | grep counter)"
+	handle="${handle##* }"
+	nft delete rule netdev perf test handle "${handle}"
+
+	nft add rule "netdev perf test ${chain_spec} @test \
+		counter name \"test\" drop"
+	nft reset counter netdev perf test >/dev/null 2>&1
+	sleep "${perf_duration}"
+	pps="$(printf %10s $(($(count_perf_packets) / perf_duration)))"
+	p5="$(printf %5s "${perf_entries}")"
+	info "    set with ${p5} full, ranged entries:         ${pps}pps"
+	kill "${perf_pid}"
+}
+
+# Run everything in a separate network namespace
+[ "${1}" != "run" ] && { unshare -n "${0}" run; exit $?; }
+tmp="$(mktemp)"
+trap cleanup EXIT
+
+# Entry point for test runs
+passed=0
+for name in ${TESTS}; do
+	printf "TEST: %s\n" "${name}"
+	for type in ${TYPES}; do
+		eval desc=\$TYPE_"${type}"
+		IFS='
+'
+		for __line in ${desc}; do
+			# shellcheck disable=SC2086
+			eval ${__line%%	*}=\"${__line##*	}\";
+		done
+		IFS=' 	
+'
+
+		if [ "${name}" = "concurrency" ] && \
+		   [ "${race_repeat}" = "0" ]; then
+			continue
+		fi
+		if [ "${name}" = "performance" ] && \
+		   [ "${perf_duration}" = "0" ]; then
+			continue
+		fi
+
+		printf "  %-60s  " "${display}"
+		eval test_"${name}"
+		ret=$?
+
+		if [ $ret -eq 0 ]; then
+			printf "[ OK ]\n"
+			info_flush
+			passed=$((passed + 1))
+		elif [ $ret -eq 1 ]; then
+			printf "[FAIL]\n"
+			err_flush
+			exit 1
+		elif [ $ret -eq ${KSELFTEST_SKIP} ]; then
+			printf "[SKIP]\n"
+			err_flush
+		fi
+	done
+done
+
+[ ${passed} -eq 0 ] && exit ${KSELFTEST_SKIP}
-- 
2.24.1


* [PATCH nf-next v4 7/9] nft_set_pipapo: Prepare for vectorised implementation: alignment
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (5 preceding siblings ...)
  2020-01-21 23:17 ` [PATCH nf-next v4 6/9] selftests: netfilter: Introduce tests for sets with range concatenation Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 8/9] nft_set_pipapo: Prepare for vectorised implementation: helpers Stefano Brivio
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

SIMD vector extension sets require stricter alignment than native
instruction sets to operate efficiently (AVX, NEON) or for some
instructions to work at all (AltiVec).

Provide facilities to define arbitrary alignment for lookup tables
and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN,
lt_aligned and scratch_aligned pointers become available.

Additional headroom is allocated, and pointers to the possibly
unaligned, originally allocated areas are kept so that they can
be freed.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v4: No changes
v3: No changes
v2: No changes

 net/netfilter/nft_set_pipapo.c | 115 +++++++++++++++++++++++++++------
 1 file changed, 97 insertions(+), 18 deletions(-)

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index f0cb1e13af50..a9665d44e4f8 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -377,6 +377,22 @@
 #define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
 					- (1UL << NFT_PIPAPO_MAP_NBITS))
 
+/* Definitions for vectorised implementations */
+#ifdef NFT_PIPAPO_ALIGN
+#define NFT_PIPAPO_ALIGN_HEADROOM					\
+	(NFT_PIPAPO_ALIGN - ARCH_KMALLOC_MINALIGN)
+#define NFT_PIPAPO_LT_ALIGN(lt)		(PTR_ALIGN((lt), NFT_PIPAPO_ALIGN))
+#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
+	do {								\
+		(field)->lt_aligned = NFT_PIPAPO_LT_ALIGN(x);		\
+		(field)->lt = (x);					\
+	} while (0)
+#else
+#define NFT_PIPAPO_ALIGN_HEADROOM	0
+#define NFT_PIPAPO_LT_ALIGN(lt)		(lt)
+#define NFT_PIPAPO_LT_ASSIGN(field, x)	((field)->lt = (x))
+#endif /* NFT_PIPAPO_ALIGN */
+
 #define nft_pipapo_for_each_field(field, index, match)		\
 	for ((field) = (match)->f, (index) = 0;			\
 	     (index) < (match)->field_count;			\
@@ -410,12 +426,16 @@ union nft_pipapo_map_bucket {
  * @rules:	Number of inserted rules
  * @bsize:	Size of each bucket in lookup table, in longs
  * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
+ * @lt_aligned:	Version of @lt aligned to NFT_PIPAPO_ALIGN bytes
  * @mt:		Mapping table: one bucket per rule
  */
 struct nft_pipapo_field {
 	int groups;
 	unsigned long rules;
 	size_t bsize;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long *lt_aligned;
+#endif
 	unsigned long *lt;
 	union nft_pipapo_map_bucket *mt;
 };
@@ -424,12 +444,16 @@ struct nft_pipapo_field {
  * struct nft_pipapo_match - Data used for lookup and matching
  * @field_count		Amount of fields in set
  * @scratch:		Preallocated per-CPU maps for partial matching results
+ * @scratch_aligned:	Version of @scratch aligned to NFT_PIPAPO_ALIGN bytes
  * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
  * @rcu			Matching data is swapped on commits
  * @f:			Fields, with lookup and mapping tables
  */
 struct nft_pipapo_match {
 	int field_count;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long * __percpu *scratch_aligned;
+#endif
 	unsigned long * __percpu *scratch;
 	size_t bsize_max;
 	struct rcu_head rcu;
@@ -668,8 +692,8 @@ static struct nft_pipapo_elem *pipapo_get(const struct net *net,
 	memset(res_map, 0xff, m->bsize_max * sizeof(*res_map));
 
 	nft_pipapo_for_each_field(f, i, m) {
+		unsigned long *lt = NFT_PIPAPO_LT_ALIGN(f->lt);
 		bool last = i == m->field_count - 1;
-		unsigned long *lt = f->lt;
 		int b, group;
 
 		/* For each 4-bit group: select lookup table bucket depending on
@@ -763,6 +787,10 @@ static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 	int group, bucket;
 
 	new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG);
+#ifdef NFT_PIPAPO_ALIGN
+	new_bucket_size = roundup(new_bucket_size,
+				  NFT_PIPAPO_ALIGN / sizeof(*new_lt));
+#endif
 
 	if (new_bucket_size == f->bsize)
 		goto mt;
@@ -773,12 +801,14 @@ static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 		copy = new_bucket_size;
 
 	new_lt = kvzalloc(f->groups * NFT_PIPAPO_BUCKETS * new_bucket_size *
-			  sizeof(*new_lt), GFP_KERNEL);
+			  sizeof(*new_lt) + NFT_PIPAPO_ALIGN_HEADROOM,
+			  GFP_KERNEL);
 	if (!new_lt)
 		return -ENOMEM;
 
-	new_p = new_lt;
-	old_p = old_lt;
+	new_p = NFT_PIPAPO_LT_ALIGN(new_lt);
+	old_p = NFT_PIPAPO_LT_ALIGN(old_lt);
+
 	for (group = 0; group < f->groups; group++) {
 		for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS; bucket++) {
 			memcpy(new_p, old_p, copy * sizeof(*new_p));
@@ -807,7 +837,7 @@ static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules)
 
 	if (new_lt) {
 		f->bsize = new_bucket_size;
-		f->lt = new_lt;
+		NFT_PIPAPO_LT_ASSIGN(f, new_lt);
 		kvfree(old_lt);
 	}
 
@@ -829,7 +859,8 @@ static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group,
 {
 	unsigned long *pos;
 
-	pos = f->lt + f->bsize * NFT_PIPAPO_BUCKETS * group;
+	pos = NFT_PIPAPO_LT_ALIGN(f->lt);
+	pos += f->bsize * NFT_PIPAPO_BUCKETS * group;
 	pos += f->bsize * v;
 
 	__set_bit(rule, pos);
@@ -1053,8 +1084,12 @@ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 
 	for_each_possible_cpu(i) {
 		unsigned long *scratch;
+#ifdef NFT_PIPAPO_ALIGN
+		unsigned long *scratch_aligned;
+#endif
 
-		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2,
+		scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2 +
+				       NFT_PIPAPO_ALIGN_HEADROOM,
 				       GFP_KERNEL, cpu_to_node(i));
 		if (!scratch) {
 			/* On failure, there's no need to undo previous
@@ -1070,6 +1105,11 @@ static int pipapo_realloc_scratch(struct nft_pipapo_match *clone,
 		kfree(*per_cpu_ptr(clone->scratch, i));
 
 		*per_cpu_ptr(clone->scratch, i) = scratch;
+
+#ifdef NFT_PIPAPO_ALIGN
+		scratch_aligned = NFT_PIPAPO_LT_ALIGN(scratch);
+		*per_cpu_ptr(clone->scratch_aligned, i) = scratch_aligned;
+#endif
 	}
 
 	return 0;
@@ -1200,21 +1240,33 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 	if (!new->scratch)
 		goto out_scratch;
 
+#ifdef NFT_PIPAPO_ALIGN
+	new->scratch_aligned = alloc_percpu(*new->scratch_aligned);
+	if (!new->scratch_aligned)
+		goto out_scratch;
+#endif
+
 	rcu_head_init(&new->rcu);
 
 	src = old->f;
 	dst = new->f;
 
 	for (i = 0; i < old->field_count; i++) {
+		unsigned long *new_lt;
+
 		memcpy(dst, src, offsetof(struct nft_pipapo_field, lt));
 
-		dst->lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS *
-				   src->bsize * sizeof(*dst->lt),
-				   GFP_KERNEL);
-		if (!dst->lt)
+		new_lt = kvzalloc(src->groups * NFT_PIPAPO_BUCKETS *
+				  src->bsize * sizeof(*dst->lt) +
+				  NFT_PIPAPO_ALIGN_HEADROOM,
+				  GFP_KERNEL);
+		if (!new_lt)
 			goto out_lt;
 
-		memcpy(dst->lt, src->lt,
+		NFT_PIPAPO_LT_ASSIGN(dst, new_lt);
+
+		memcpy(NFT_PIPAPO_LT_ALIGN(new_lt),
+		       NFT_PIPAPO_LT_ALIGN(src->lt),
 		       src->bsize * sizeof(*dst->lt) *
 		       src->groups * NFT_PIPAPO_BUCKETS);
 
@@ -1237,8 +1289,11 @@ static struct nft_pipapo_match *pipapo_clone(struct nft_pipapo_match *old)
 		kvfree(dst->lt);
 		dst--;
 	}
-	free_percpu(new->scratch);
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(new->scratch_aligned);
+#endif
 out_scratch:
+	free_percpu(new->scratch);
 	kfree(new);
 
 	return ERR_PTR(-ENOMEM);
@@ -1394,7 +1449,8 @@ static void pipapo_drop(struct nft_pipapo_match *m,
 			unsigned long *pos;
 			int b;
 
-			pos = f->lt + g * NFT_PIPAPO_BUCKETS * f->bsize;
+			pos = NFT_PIPAPO_LT_ALIGN(f->lt) + g *
+			      NFT_PIPAPO_BUCKETS * f->bsize;
 
 			for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
 				bitmap_cut(pos, pos, rulemap[i].to,
@@ -1498,6 +1554,9 @@ static void pipapo_reclaim_match(struct rcu_head *rcu)
 	for_each_possible_cpu(i)
 		kfree(*per_cpu_ptr(m->scratch, i));
 
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(m->scratch_aligned);
+#endif
 	free_percpu(m->scratch);
 
 	pipapo_free_fields(m);
@@ -1701,7 +1760,8 @@ static int pipapo_get_boundaries(struct nft_pipapo_field *f, int first_rule,
 		for (b = 0; b < NFT_PIPAPO_BUCKETS; b++) {
 			unsigned long *pos;
 
-			pos = f->lt + (g * NFT_PIPAPO_BUCKETS + b) * f->bsize;
+			pos = NFT_PIPAPO_LT_ALIGN(f->lt) +
+			      (g * NFT_PIPAPO_BUCKETS + b) * f->bsize;
 			if (test_bit(first_rule, pos) && x0 == -1)
 				x0 = b;
 			if (test_bit(first_rule + rule_count - 1, pos))
@@ -1975,11 +2035,21 @@ static int nft_pipapo_init(const struct nft_set *set,
 	m->scratch = alloc_percpu(unsigned long *);
 	if (!m->scratch) {
 		err = -ENOMEM;
-		goto out_free;
+		goto out_scratch;
 	}
 	for_each_possible_cpu(i)
 		*per_cpu_ptr(m->scratch, i) = NULL;
 
+#ifdef NFT_PIPAPO_ALIGN
+	m->scratch_aligned = alloc_percpu(unsigned long *);
+	if (!m->scratch_aligned) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(m->scratch_aligned, i) = NULL;
+#endif
+
 	rcu_head_init(&m->rcu);
 
 	nft_pipapo_for_each_field(f, i, m) {
@@ -1990,7 +2060,7 @@ static int nft_pipapo_init(const struct nft_set *set,
 
 		f->bsize = 0;
 		f->rules = 0;
-		f->lt = NULL;
+		NFT_PIPAPO_LT_ASSIGN(f, NULL);
 		f->mt = NULL;
 	}
 
@@ -2008,7 +2078,11 @@ static int nft_pipapo_init(const struct nft_set *set,
 	return 0;
 
 out_free:
+#ifdef NFT_PIPAPO_ALIGN
+	free_percpu(m->scratch_aligned);
+#endif
 	free_percpu(m->scratch);
+out_scratch:
 	kfree(m);
 
 	return err;
@@ -2043,16 +2117,21 @@ static void nft_pipapo_destroy(const struct nft_set *set)
 			nft_set_elem_destroy(set, e, true);
 		}
 
+#ifdef NFT_PIPAPO_ALIGN
+		free_percpu(m->scratch_aligned);
+#endif
 		for_each_possible_cpu(cpu)
 			kfree(*per_cpu_ptr(m->scratch, cpu));
 		free_percpu(m->scratch);
-
 		pipapo_free_fields(m);
 		kfree(m);
 		priv->match = NULL;
 	}
 
 	if (priv->clone) {
+#ifdef NFT_PIPAPO_ALIGN
+		free_percpu(priv->clone->scratch_aligned);
+#endif
 		for_each_possible_cpu(cpu)
 			kfree(*per_cpu_ptr(priv->clone->scratch, cpu));
 		free_percpu(priv->clone->scratch);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH nf-next v4 8/9] nft_set_pipapo: Prepare for vectorised implementation: helpers
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (6 preceding siblings ...)
  2020-01-21 23:17 ` [PATCH nf-next v4 7/9] nft_set_pipapo: Prepare for vectorised implementation: alignment Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-01-21 23:17 ` [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

Move most macros and helpers to a header file, so that they can be
conveniently used by related implementations.

No functional changes are intended here.
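
As a quick illustration of what the shared helpers do, the central
one is pipapo_refill(), which maps matching rules of one field to the
candidate rules of the next field. The following is a simplified,
stand-alone userspace model of that step, not the kernel function
(it assumes 64-bit words, omits the 'match_only' mode, and all names
here are illustrative):

```c
#include <stdint.h>

struct map_bucket {
	unsigned int to;	/* first rule (in next field) mapped to */
	unsigned int n;		/* number of next-field rules mapped to */
};

/* For each bit set in map[] (a matching rule in the current field),
 * clear it and set the bits of the next-field rules it maps to in
 * dst[]. Return -1 if no rule matched, 0 otherwise.
 */
static int refill(uint64_t *map, int len,
		  const struct map_bucket *mt, uint64_t *dst)
{
	int k, ret = -1;

	for (k = 0; k < len; k++) {
		while (map[k]) {
			int bit = __builtin_ctzll(map[k]);
			unsigned int i, rule = k * 64 + bit;

			map[k] &= ~(1ULL << bit);	/* consume match */
			for (i = 0; i < mt[rule].n; i++) {
				unsigned int r = mt[rule].to + i;

				dst[r / 64] |= 1ULL << (r % 64);
			}
			ret = 0;
		}
	}
	return ret;
}
```

For instance, if rule 0 of the current field maps to rules 3 and 4 of
the next field, a match on rule 0 seeds bits 3 and 4 of the next
result bitmap, and matching continues there.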

v4: No changes
v3: Fix comment for pipapo_estimate_size(), as we return 0 on overflow
v2: No changes

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 net/netfilter/nft_set_pipapo.c | 216 ++----------------------------
 net/netfilter/nft_set_pipapo.h | 237 +++++++++++++++++++++++++++++++++
 2 files changed, 248 insertions(+), 205 deletions(-)
 create mode 100644 net/netfilter/nft_set_pipapo.h

diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index a9665d44e4f8..e7f4cecea7d6 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -330,167 +330,20 @@
 
 #include <linux/kernel.h>
 #include <linux/init.h>
-#include <linux/log2.h>
 #include <linux/module.h>
 #include <linux/netlink.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
 #include <uapi/linux/netfilter/nf_tables.h>
-#include <net/ipv6.h>			/* For the maximum length of a field */
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 
-/* Count of concatenated fields depends on count of 32-bit nftables registers */
-#define NFT_PIPAPO_MAX_FIELDS		NFT_REG32_COUNT
-
-/* Largest supported field size */
-#define NFT_PIPAPO_MAX_BYTES		(sizeof(struct in6_addr))
-#define NFT_PIPAPO_MAX_BITS		(NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE)
-
-/* Number of bits to be grouped together in lookup table buckets, arbitrary */
-#define NFT_PIPAPO_GROUP_BITS		4
-#define NFT_PIPAPO_GROUPS_PER_BYTE	(BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS)
-
-/* Fields are padded to 32 bits in input registers */
-#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x)				\
-	(round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32)))
-#define NFT_PIPAPO_GROUPS_PADDING(x)					\
-	(NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE)
-
-/* Number of buckets, given by 2 ^ n, with n grouped bits */
-#define NFT_PIPAPO_BUCKETS		(1 << NFT_PIPAPO_GROUP_BITS)
-
-/* Each n-bit range maps to up to n * 2 rules */
-#define NFT_PIPAPO_MAP_NBITS		(const_ilog2(NFT_PIPAPO_MAX_BITS * 2))
-
-/* Use the rest of mapping table buckets for rule indices, but it makes no sense
- * to exceed 32 bits
- */
-#if BITS_PER_LONG == 64
-#define NFT_PIPAPO_MAP_TOBITS		32
-#else
-#define NFT_PIPAPO_MAP_TOBITS		(BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS)
-#endif
-
-/* ...which gives us the highest allowed index for a rule */
-#define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
-					- (1UL << NFT_PIPAPO_MAP_NBITS))
-
-/* Definitions for vectorised implementations */
-#ifdef NFT_PIPAPO_ALIGN
-#define NFT_PIPAPO_ALIGN_HEADROOM					\
-	(NFT_PIPAPO_ALIGN - ARCH_KMALLOC_MINALIGN)
-#define NFT_PIPAPO_LT_ALIGN(lt)		(PTR_ALIGN((lt), NFT_PIPAPO_ALIGN))
-#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
-	do {								\
-		(field)->lt_aligned = NFT_PIPAPO_LT_ALIGN(x);		\
-		(field)->lt = (x);					\
-	} while (0)
-#else
-#define NFT_PIPAPO_ALIGN_HEADROOM	0
-#define NFT_PIPAPO_LT_ALIGN(lt)		(lt)
-#define NFT_PIPAPO_LT_ASSIGN(field, x)	((field)->lt = (x))
-#endif /* NFT_PIPAPO_ALIGN */
-
-#define nft_pipapo_for_each_field(field, index, match)		\
-	for ((field) = (match)->f, (index) = 0;			\
-	     (index) < (match)->field_count;			\
-	     (index)++, (field)++)
-
-/**
- * union nft_pipapo_map_bucket - Bucket of mapping table
- * @to:		First rule number (in next field) this rule maps to
- * @n:		Number of rules (in next field) this rule maps to
- * @e:		If there's no next field, pointer to element this rule maps to
- */
-union nft_pipapo_map_bucket {
-	struct {
-#if BITS_PER_LONG == 64
-		static_assert(NFT_PIPAPO_MAP_TOBITS <= 32);
-		u32 to;
-
-		static_assert(NFT_PIPAPO_MAP_NBITS <= 32);
-		u32 n;
-#else
-		unsigned long to:NFT_PIPAPO_MAP_TOBITS;
-		unsigned long  n:NFT_PIPAPO_MAP_NBITS;
-#endif
-	};
-	struct nft_pipapo_elem *e;
-};
-
-/**
- * struct nft_pipapo_field - Lookup, mapping tables and related data for a field
- * @groups:	Amount of 4-bit groups
- * @rules:	Number of inserted rules
- * @bsize:	Size of each bucket in lookup table, in longs
- * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
- * @lt_aligned:	Version of @lt aligned to NFT_PIPAPO_ALIGN bytes
- * @mt:		Mapping table: one bucket per rule
- */
-struct nft_pipapo_field {
-	int groups;
-	unsigned long rules;
-	size_t bsize;
-#ifdef NFT_PIPAPO_ALIGN
-	unsigned long *lt_aligned;
-#endif
-	unsigned long *lt;
-	union nft_pipapo_map_bucket *mt;
-};
-
-/**
- * struct nft_pipapo_match - Data used for lookup and matching
- * @field_count		Amount of fields in set
- * @scratch:		Preallocated per-CPU maps for partial matching results
- * @scratch_aligned:	Version of @scratch aligned to NFT_PIPAPO_ALIGN bytes
- * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
- * @rcu			Matching data is swapped on commits
- * @f:			Fields, with lookup and mapping tables
- */
-struct nft_pipapo_match {
-	int field_count;
-#ifdef NFT_PIPAPO_ALIGN
-	unsigned long * __percpu *scratch_aligned;
-#endif
-	unsigned long * __percpu *scratch;
-	size_t bsize_max;
-	struct rcu_head rcu;
-	struct nft_pipapo_field f[0];
-};
+#include "nft_set_pipapo.h"
 
 /* Current working bitmap index, toggled between field matches */
 static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index);
 
-/**
- * struct nft_pipapo - Representation of a set
- * @match:	Currently in-use matching data
- * @clone:	Copy where pending insertions and deletions are kept
- * @groups:	Total amount of 4-bit groups for fields in this set
- * @width:	Total bytes to be matched for one packet, including padding
- * @dirty:	Working copy has pending insertions or deletions
- * @last_gc:	Timestamp of last garbage collection run, jiffies
- */
-struct nft_pipapo {
-	struct nft_pipapo_match __rcu *match;
-	struct nft_pipapo_match *clone;
-	int groups;
-	int width;
-	bool dirty;
-	unsigned long last_gc;
-};
-
-struct nft_pipapo_elem;
-
-/**
- * struct nft_pipapo_elem - API-facing representation of single set element
- * @ext:	nftables API extensions
- */
-struct nft_pipapo_elem {
-	struct nft_set_ext ext;
-};
-
 /**
  * pipapo_refill() - For each set bit, set bits from selected mapping table item
  * @map:	Bitmap to be scanned for set bits
@@ -508,9 +361,8 @@ struct nft_pipapo_elem {
  *
  * Return: -1 on no match, bit position on 'match_only', 0 otherwise.
  */
-static int pipapo_refill(unsigned long *map, int len, int rules,
-			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
-			 bool match_only)
+int pipapo_refill(unsigned long *map, int len, int rules, unsigned long *dst,
+		  union nft_pipapo_map_bucket *mt, bool match_only)
 {
 	unsigned long bitset;
 	int k, ret = -1;
@@ -583,26 +435,13 @@ static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
 
 	nft_pipapo_for_each_field(f, i, m) {
 		bool last = i == m->field_count - 1;
-		unsigned long *lt = f->lt;
-		int b, group;
+		int b;
 
 		/* For each 4-bit group: select lookup table bucket depending on
 		 * packet bytes value, then AND bucket value
 		 */
-		for (group = 0; group < f->groups; group += 2) {
-			u8 v;
-
-			v = *rp >> 4;
-			__bitmap_and(res_map, res_map, lt + v * f->bsize,
-				     f->bsize * BITS_PER_LONG);
-			lt += f->bsize * NFT_PIPAPO_BUCKETS;
-
-			v = *rp & 0x0f;
-			rp++;
-			__bitmap_and(res_map, res_map, lt + v * f->bsize,
-				     f->bsize * BITS_PER_LONG);
-			lt += f->bsize * NFT_PIPAPO_BUCKETS;
-		}
+		pipapo_and_field_buckets(f, res_map, rp);
+		rp += f->groups / NFT_PIPAPO_GROUPS_PER_BYTE;
 
 		/* Now populate the bitmap for the next field, unless this is
 		 * the last field, in which case return the matched 'ext'
@@ -1943,56 +1782,23 @@ static u64 nft_pipapo_privsize(const struct nlattr * const nla[],
 }
 
 /**
- * nft_pipapo_estimate() - Estimate set size, space and lookup complexity
- * @desc:	Set description, element count and field description used here
+ * nft_pipapo_estimate() - Set size, space and lookup complexity
+ * @desc:	Set description, element count and field description used
  * @features:	Flags: NFT_SET_INTERVAL needs to be there
  * @est:	Storage for estimation data
  *
- * The size for this set type can vary dramatically, as it depends on the number
- * of rules (composing netmasks) the entries expand to. We compute the worst
- * case here.
- *
- * In general, for a non-ranged entry or a single composing netmask, we need
- * one bit in each of the sixteen NFT_PIPAPO_BUCKETS, for each 4-bit group (that
- * is, each input bit needs four bits of matching data), plus a bucket in the
- * mapping table for each field.
- *
- * Return: true only for compatible range concatenations
+ * Return: true if set description is compatible, false otherwise
  */
 static bool nft_pipapo_estimate(const struct nft_set_desc *desc, u32 features,
 				struct nft_set_estimate *est)
 {
-	unsigned long entry_size;
-	int i;
-
 	if (!(features & NFT_SET_INTERVAL) || desc->field_count <= 1)
 		return false;
 
-	for (i = 0, entry_size = 0; i < desc->field_count; i++) {
-		unsigned long rules;
-
-		if (desc->field_len[i] > NFT_PIPAPO_MAX_BYTES)
-			return false;
-
-		/* Worst-case ranges for each concatenated field: each n-bit
-		 * field can expand to up to n * 2 rules in each bucket, and
-		 * each rule also needs a mapping bucket.
-		 */
-		rules = ilog2(desc->field_len[i] * BITS_PER_BYTE) * 2;
-		entry_size += rules * NFT_PIPAPO_BUCKETS / BITS_PER_BYTE;
-		entry_size += rules * sizeof(union nft_pipapo_map_bucket);
-	}
-
-	/* Rules in lookup and mapping tables are needed for each entry */
-	est->size = desc->size * entry_size;
-	if (est->size && div_u64(est->size, desc->size) != entry_size)
+	est->size = pipapo_estimate_size(desc);
+	if (!est->size)
 		return false;
 
-	est->size += sizeof(struct nft_pipapo) +
-		     sizeof(struct nft_pipapo_match) * 2;
-
-	est->size += sizeof(struct nft_pipapo_field) * desc->field_count;
-
 	est->lookup = NFT_SET_CLASS_O_LOG_N;
 
 	est->space = NFT_SET_CLASS_O_N;
diff --git a/net/netfilter/nft_set_pipapo.h b/net/netfilter/nft_set_pipapo.h
new file mode 100644
index 000000000000..f11fbfa2c9aa
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo.h
@@ -0,0 +1,237 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#ifndef _NFT_SET_PIPAPO_H
+#define _NFT_SET_PIPAPO_H
+
+#include <linux/log2.h>
+#include <net/ipv6.h>			/* For the maximum length of a field */
+
+/* Count of concatenated fields depends on count of 32-bit nftables registers */
+#define NFT_PIPAPO_MAX_FIELDS		NFT_REG32_COUNT
+
+/* Largest supported field size */
+#define NFT_PIPAPO_MAX_BYTES		(sizeof(struct in6_addr))
+#define NFT_PIPAPO_MAX_BITS		(NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE)
+
+/* Number of bits to be grouped together in lookup table buckets, arbitrary */
+#define NFT_PIPAPO_GROUP_BITS		4
+#define NFT_PIPAPO_GROUPS_PER_BYTE	(BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS)
+
+/* Fields are padded to 32 bits in input registers */
+#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x)				\
+	(round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32)))
+#define NFT_PIPAPO_GROUPS_PADDING(x)					\
+	(NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE)
+
+/* Number of buckets, given by 2 ^ n, with n grouped bits */
+#define NFT_PIPAPO_BUCKETS		(1 << NFT_PIPAPO_GROUP_BITS)
+
+/* Each n-bit range maps to up to n * 2 rules */
+#define NFT_PIPAPO_MAP_NBITS		(const_ilog2(NFT_PIPAPO_MAX_BITS * 2))
+
+/* Use the rest of mapping table buckets for rule indices, but it makes no sense
+ * to exceed 32 bits
+ */
+#if BITS_PER_LONG == 64
+#define NFT_PIPAPO_MAP_TOBITS		32
+#else
+#define NFT_PIPAPO_MAP_TOBITS		(BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS)
+#endif
+
+/* ...which gives us the highest allowed index for a rule */
+#define NFT_PIPAPO_RULE0_MAX		((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \
+					- (1UL << NFT_PIPAPO_MAP_NBITS))
+
+/* Definitions for vectorised implementations */
+#ifdef NFT_PIPAPO_ALIGN
+#define NFT_PIPAPO_ALIGN_HEADROOM					\
+	(NFT_PIPAPO_ALIGN - ARCH_KMALLOC_MINALIGN)
+#define NFT_PIPAPO_LT_ALIGN(lt)		(PTR_ALIGN((lt), NFT_PIPAPO_ALIGN))
+#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
+	do {								\
+		(field)->lt_aligned = NFT_PIPAPO_LT_ALIGN(x);		\
+		(field)->lt = (x);					\
+	} while (0)
+#else
+#define NFT_PIPAPO_ALIGN_HEADROOM	0
+#define NFT_PIPAPO_LT_ALIGN(lt)		(lt)
+#define NFT_PIPAPO_LT_ASSIGN(field, x)					\
+	do {								\
+		(field)->lt = (x);					\
+	} while (0)
+#endif /* NFT_PIPAPO_ALIGN */
+
+#define nft_pipapo_for_each_field(field, index, match)		\
+	for ((field) = (match)->f, (index) = 0;			\
+	     (index) < (match)->field_count;			\
+	     (index)++, (field)++)
+
+/**
+ * union nft_pipapo_map_bucket - Bucket of mapping table
+ * @to:		First rule number (in next field) this rule maps to
+ * @n:		Number of rules (in next field) this rule maps to
+ * @e:		If there's no next field, pointer to element this rule maps to
+ */
+union nft_pipapo_map_bucket {
+	struct {
+#if BITS_PER_LONG == 64
+		static_assert(NFT_PIPAPO_MAP_TOBITS <= 32);
+		u32 to;
+
+		static_assert(NFT_PIPAPO_MAP_NBITS <= 32);
+		u32 n;
+#else
+		unsigned long to:NFT_PIPAPO_MAP_TOBITS;
+		unsigned long  n:NFT_PIPAPO_MAP_NBITS;
+#endif
+	};
+	struct nft_pipapo_elem *e;
+};
+
+/**
+ * struct nft_pipapo_field - Lookup, mapping tables and related data for a field
+ * @groups:	Amount of 4-bit groups
+ * @rules:	Number of inserted rules
+ * @bsize:	Size of each bucket in lookup table, in longs
+ * @lt:		Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets
+ * @lt_aligned:	Version of @lt aligned to NFT_PIPAPO_ALIGN bytes
+ * @mt:		Mapping table: one bucket per rule
+ */
+struct nft_pipapo_field {
+	int groups;
+	unsigned long rules;
+	size_t bsize;
+	unsigned long *lt;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long *lt_aligned;
+#endif
+	union nft_pipapo_map_bucket *mt;
+};
+
+/**
+ * struct nft_pipapo_match - Data used for lookup and matching
+ * @field_count:	Amount of fields in set
+ * @scratch:		Preallocated per-CPU maps for partial matching results
+ * @scratch_aligned:	Version of @scratch aligned to NFT_PIPAPO_ALIGN bytes
+ * @bsize_max:		Maximum lookup table bucket size of all fields, in longs
+ * @rcu:		Matching data is swapped on commits
+ * @f:			Fields, with lookup and mapping tables
+ */
+struct nft_pipapo_match {
+	int field_count;
+#ifdef NFT_PIPAPO_ALIGN
+	unsigned long * __percpu *scratch_aligned;
+#endif
+	unsigned long * __percpu *scratch;
+	size_t bsize_max;
+	struct rcu_head rcu;
+	struct nft_pipapo_field f[0];
+};
+
+/**
+ * struct nft_pipapo - Representation of a set
+ * @match:	Currently in-use matching data
+ * @clone:	Copy where pending insertions and deletions are kept
+ * @groups:	Total amount of 4-bit groups for fields in this set
+ * @width:	Total bytes to be matched for one packet, including padding
+ * @dirty:	Working copy has pending insertions or deletions
+ * @last_gc:	Timestamp of last garbage collection run, jiffies
+ */
+struct nft_pipapo {
+	struct nft_pipapo_match __rcu *match;
+	struct nft_pipapo_match *clone;
+	int groups;
+	int width;
+	bool dirty;
+	unsigned long last_gc;
+};
+
+struct nft_pipapo_elem;
+
+/**
+ * struct nft_pipapo_elem - API-facing representation of single set element
+ * @ext:	nftables API extensions
+ */
+struct nft_pipapo_elem {
+	struct nft_set_ext ext;
+};
+
+int pipapo_refill(unsigned long *map, int len, int rules, unsigned long *dst,
+		  union nft_pipapo_map_bucket *mt, bool match_only);
+
+/**
+ * pipapo_and_field_buckets() - Select buckets from packet data and intersect
+ * @f:		Field including lookup table
+ * @dst:	Scratch map for partial matching result
+ * @rp:		Packet data register pointer
+ */
+static inline void pipapo_and_field_buckets(struct nft_pipapo_field *f,
+					    unsigned long *dst, const u8 *rp)
+{
+	unsigned long *lt = NFT_PIPAPO_LT_ALIGN(f->lt);
+	int group;
+
+	for (group = 0; group < f->groups; group += 2) {
+		u8 v;
+
+		v = *rp >> 4;
+		__bitmap_and(dst, dst, lt + v * f->bsize,
+			     f->bsize * BITS_PER_LONG);
+		lt += f->bsize * NFT_PIPAPO_BUCKETS;
+
+		v = *rp & 0x0f;
+		rp++;
+		__bitmap_and(dst, dst, lt + v * f->bsize,
+			     f->bsize * BITS_PER_LONG);
+		lt += f->bsize * NFT_PIPAPO_BUCKETS;
+	}
+}
+
+/**
+ * pipapo_estimate_size() - Estimate worst-case for set size
+ * @desc:	Set description, element count and field description used here
+ *
+ * The size for this set type can vary dramatically, as it depends on the number
+ * of rules (composing netmasks) the entries expand to. We compute the worst
+ * case here.
+ *
+ * In general, for a non-ranged entry or a single composing netmask, we need
+ * one bit in each of the sixteen NFT_PIPAPO_BUCKETS, for each 4-bit group (that
+ * is, each input bit needs four bits of matching data), plus a bucket in the
+ * mapping table for each field.
+ *
+ * Return: worst-case set size in bytes, 0 on any overflow
+ */
+static inline u64 pipapo_estimate_size(const struct nft_set_desc *desc)
+{
+	unsigned long entry_size;
+	u64 size;
+	int i;
+
+	for (i = 0, entry_size = 0; i < desc->field_count; i++) {
+		unsigned long rules;
+
+		if (desc->field_len[i] > NFT_PIPAPO_MAX_BYTES)
+			return 0;
+
+		/* Worst-case ranges for each concatenated field: each n-bit
+		 * field can expand to up to n * 2 rules in each bucket, and
+		 * each rule also needs a mapping bucket.
+		 */
+		rules = ilog2(desc->field_len[i] * BITS_PER_BYTE) * 2;
+		entry_size += rules * NFT_PIPAPO_BUCKETS / BITS_PER_BYTE;
+		entry_size += rules * sizeof(union nft_pipapo_map_bucket);
+	}
+
+	/* Rules in lookup and mapping tables are needed for each entry */
+	size = desc->size * entry_size;
+	if (size && div_u64(size, desc->size) != entry_size)
+		return 0;
+
+	size += sizeof(struct nft_pipapo) + sizeof(struct nft_pipapo_match) * 2;
+
+	size += sizeof(struct nft_pipapo_field) * desc->field_count;
+
+	return size;
+}
+
+#endif /* _NFT_SET_PIPAPO_H */
-- 
2.24.1



* [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (7 preceding siblings ...)
  2020-01-21 23:17 ` [PATCH nf-next v4 8/9] nft_set_pipapo: Prepare for vectorised implementation: helpers Stefano Brivio
@ 2020-01-21 23:17 ` Stefano Brivio
  2020-01-27  6:41   ` kbuild test robot
  2020-01-27  8:20 ` [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
  2020-02-20 10:52 ` Phil Sutter
  10 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2020-01-21 23:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel
  Cc: Florian Westphal, Kadlecsik József, Eric Garver, Phil Sutter

If the AVX2 instruction set is available, we can exploit the
repetitive nature of this algorithm to provide a fast, vectorised
version, using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.
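
For context, the scalar operation being vectorised is the per-field
sequence of bucket selections and bitmap intersections, as done by
pipapo_and_field_buckets(). A stand-alone userspace model of it
(a sketch with illustrative names and flat uint64_t tables, not the
kernel code) looks like:

```c
#include <stddef.h>
#include <stdint.h>

#define GROUP_BITS	4
#define BUCKETS		(1 << GROUP_BITS)	/* 16 buckets per group */

/* For each 4-bit group of the input, select the lookup table bucket
 * given by the group's value and AND it into the result bitmap. High
 * nibble first, then low nibble, as in the kernel lookup.
 */
static void and_field_buckets(const uint64_t *lt, size_t bsize,
			      int groups, const uint8_t *data,
			      uint64_t *res)
{
	int group;

	for (group = 0; group < groups; group++) {
		uint8_t v = (group & 1) ? (data[group / 2] & 0x0f)
					: (data[group / 2] >> 4);
		const uint64_t *bucket = lt + (group * BUCKETS + v) * bsize;
		size_t i;

		for (i = 0; i < bsize; i++)
			res[i] &= bucket[i];
	}
}
```

The AVX2 implementation keeps this exact structure, but loads bucket
rows and performs the intersections 256 bits at a time with vector
operations, instead of one long at a time.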

In most cases, this implementation consistently outperforms rbtree
set instances, despite the fact that those are configured to match
on just one single, ranged data type out of the ones used for
performance measurements by the nft_concat_range.sh kselftest.

That script, injecting packets directly on the ingress device path
with pktgen, reports:

- for one AMD Epyc 7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$):
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              13816909pps
    baseline hash (non-ranged entries):             7706821pps
    baseline rbtree (match on first field only):    3719979pps
    set with  1000 full, ranged entries:            5843256pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              13440355pps
    baseline hash (non-ranged entries):             7755855pps
    baseline rbtree (match on first field only):    5404151pps
    set with   100 full, ranged entries:            6274637pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):              12695318pps
    baseline hash (non-ranged entries):             5998414pps
    baseline rbtree (match on first field only):    1704466pps
    set with  1000 full, ranged entries:            3258636pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              14045198pps
    baseline hash (non-ranged entries):             8586447pps
    baseline rbtree (match on first field only):    3811115pps
    set with 30000 full, ranged entries:            2200493pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):              12644024pps
    baseline hash (non-ranged entries):             4834372pps
    baseline rbtree (match on first field only):    3654772pps
    set with    10 full, ranged entries:            3655568pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):              12545632pps
    baseline hash (non-ranged entries):             4656663pps
    baseline rbtree (match on first field only):    1713780pps
    set with  1000 full, ranged entries:            2529071pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              13766991pps
    baseline hash (non-ranged entries):             6440069pps
    baseline rbtree (match on first field only):    3739526pps
    set with  1000 full, ranged entries:            4818210pps

- for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB L2$):
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10170346pps
    baseline hash (non-ranged entries):             6214729pps
    baseline rbtree (match on first field only):    2589686pps
    set with  1000 full, ranged entries:            4695300pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10162240pps
    baseline hash (non-ranged entries):             6199651pps
    baseline rbtree (match on first field only):    4176819pps
    set with   100 full, ranged entries:            4884376pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9732630pps
    baseline hash (non-ranged entries):             4747333pps
    baseline rbtree (match on first field only):    1376541pps
    set with  1000 full, ranged entries:            2486028pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10682224pps
    baseline hash (non-ranged entries):             6872565pps
    baseline rbtree (match on first field only):    2793442pps
    set with 30000 full, ranged entries:            1876571pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9718917pps
    baseline hash (non-ranged entries):             3969930pps
    baseline rbtree (match on first field only):    3082588pps
    set with    10 full, ranged entries:            2988231pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9754800pps
    baseline hash (non-ranged entries):             3810961pps
    baseline rbtree (match on first field only):    1365740pps
    set with  1000 full, ranged entries:            1967771pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10206690pps
    baseline hash (non-ranged entries):             5237175pps
    baseline rbtree (match on first field only):    2975866pps
    set with  1000 full, ranged entries:            3896154pps

- for one Intel Core i7-6600U thread (3.4GHz, 64 KiB L1D$, 512 KiB L2$):
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10021039pps
    baseline hash (non-ranged entries):             6061766pps
    baseline rbtree (match on first field only):    3304312pps
    set with  1000 full, ranged entries:            4844887pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10865207pps
    baseline hash (non-ranged entries):             6435691pps
    baseline rbtree (match on first field only):    4861128pps
    set with   100 full, ranged entries:            5246583pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):              10173990pps
    baseline hash (non-ranged entries):             4992955pps
    baseline rbtree (match on first field only):    1769058pps
    set with  1000 full, ranged entries:            3186628pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10680118pps
    baseline hash (non-ranged entries):             7127671pps
    baseline rbtree (match on first field only):    4001820pps
    set with 30000 full, ranged entries:            2591677pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9932346pps
    baseline hash (non-ranged entries):             4216648pps
    baseline rbtree (match on first field only):    3414029pps
    set with    10 full, ranged entries:            3164909pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9967056pps
    baseline hash (non-ranged entries):             4024868pps
    baseline rbtree (match on first field only):    1777420pps
    set with  1000 full, ranged entries:            2457853pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10702441pps
    baseline hash (non-ranged entries):             5615505pps
    baseline rbtree (match on first field only):    3488851pps
    set with  1000 full, ranged entries:            4257577pps

A similar strategy could easily be reused to implement specialised
versions for other SIMD instruction sets, and I plan to post at least a
NEON version at a later time.

The vectorised implementation is selected automatically whenever the
AVX2 feature is available; availability can be verified with the
following check:

	[ $(uname -m) = "x86_64" ] && grep -q avx2 /proc/cpuinfo
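For scripting, the same check can be wrapped in a small function; this is an
illustrative sketch only (the function name and the parametrised cpuinfo path
are not part of the patch), useful mainly so the logic can be exercised
against an arbitrary cpuinfo file:

```shell
# Sketch: the AVX2 availability check above, with the architecture string
# and cpuinfo path passed in as parameters instead of hard-coded.
has_avx2() {
	arch="$1"
	cpuinfo="$2"
	[ "$arch" = "x86_64" ] && grep -q avx2 "$cpuinfo"
}
```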

To make set selection more explicit and visible, we might at a later
time export a different name, by introducing a new attribute, e.g.
NFTA_SET_OPS, as suggested by Phil Sutter on netfilter-devel in
<20180403211540.23700-3-phil@nwl.cc>.

v4: No changes
v3:
 - update matching rate data in commit message
 - skip AVX2 check in Makefile for i386
   (kbuild test robot <lkp@intel.com>)
v2:
 - extend scope of kernel_fpu_begin/end() to protect all accesses
   to scratch maps (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_avx2_lookup(), it's
   already implied (Florian Westphal)
 - mention in commit message how to check if this set is used

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/netfilter/nf_tables_core.h |   1 +
 net/netfilter/Makefile                 |   5 +
 net/netfilter/nf_tables_set_core.c     |   6 +
 net/netfilter/nft_set_pipapo.c         |  25 +
 net/netfilter/nft_set_pipapo_avx2.c    | 842 +++++++++++++++++++++++++
 net/netfilter/nft_set_pipapo_avx2.h    |  14 +
 6 files changed, 893 insertions(+)
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.c
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.h

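As a reading aid for the diff below, here is a compact, stand-alone model of
the per-field lookup step that the AVX2 routines vectorise: split the field
into 4-bit groups (as the pg[] arrays do), fetch one bucket bitmap per group,
and AND the bitmaps together. All names are illustrative, and Python integers
stand in for the kernel's arrays of longs; this is a sketch of the idea, not
the kernel implementation.

```python
def nibbles(data: bytes):
    """Split bytes into 4-bit groups, high nibble first (like pg[])."""
    out = []
    for b in data:
        out.append(b >> 4)
        out.append(b & 0xf)
    return out

def match_field(buckets, data: bytes) -> int:
    """AND together the bucket bitmaps selected by each 4-bit group.

    buckets[g][v] is an int bitmap: bit r is set iff rule r admits value
    v in group g. A zero result means no rule matches this field.
    """
    res = ~0
    for g, v in enumerate(nibbles(data)):
        res &= buckets[g][v]
    return res
```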
diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index 29e7e1021267..549d5f9ea8c3 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -75,6 +75,7 @@ extern struct nft_set_type nft_set_hash_fast_type;
 extern struct nft_set_type nft_set_rbtree_type;
 extern struct nft_set_type nft_set_bitmap_type;
 extern struct nft_set_type nft_set_pipapo_type;
+extern struct nft_set_type nft_set_pipapo_avx2_type;
 
 struct nft_expr;
 struct nft_regs;
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 3f572e5a975e..4c1896943f6e 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -83,6 +83,11 @@ nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 nf_tables_set-objs := nf_tables_set_core.o \
 		      nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o \
 		      nft_set_pipapo.o
+ifeq ($(ARCH),x86_64)
+ifneq (,$(findstring -DCONFIG_AS_AVX2=1,$(KBUILD_CFLAGS)))
+nf_tables_set-objs += nft_set_pipapo_avx2.o
+endif
+endif
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NF_TABLES_SET)	+= nf_tables_set.o
diff --git a/net/netfilter/nf_tables_set_core.c b/net/netfilter/nf_tables_set_core.c
index 586b621007eb..4fa8f610038c 100644
--- a/net/netfilter/nf_tables_set_core.c
+++ b/net/netfilter/nf_tables_set_core.c
@@ -9,6 +9,9 @@ static int __init nf_tables_set_module_init(void)
 	nft_register_set(&nft_set_rhash_type);
 	nft_register_set(&nft_set_bitmap_type);
 	nft_register_set(&nft_set_rbtree_type);
+#ifdef CONFIG_AS_AVX2
+	nft_register_set(&nft_set_pipapo_avx2_type);
+#endif
 	nft_register_set(&nft_set_pipapo_type);
 
 	return 0;
@@ -17,6 +20,9 @@ static int __init nf_tables_set_module_init(void)
 static void __exit nf_tables_set_module_exit(void)
 {
 	nft_unregister_set(&nft_set_pipapo_type);
+#ifdef CONFIG_AS_AVX2
+	nft_unregister_set(&nft_set_pipapo_avx2_type);
+#endif
 	nft_unregister_set(&nft_set_rbtree_type);
 	nft_unregister_set(&nft_set_bitmap_type);
 	nft_unregister_set(&nft_set_rhash_type);
diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
index e7f4cecea7d6..396eb434aa75 100644
--- a/net/netfilter/nft_set_pipapo.c
+++ b/net/netfilter/nft_set_pipapo.c
@@ -339,6 +339,7 @@
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 
+#include "nft_set_pipapo_avx2.h"
 #include "nft_set_pipapo.h"
 
 /* Current working bitmap index, toggled between field matches */
@@ -1985,3 +1986,27 @@ struct nft_set_type nft_set_pipapo_type __read_mostly = {
 		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
 	},
 };
+
+#ifdef CONFIG_AS_AVX2
+struct nft_set_type nft_set_pipapo_avx2_type __read_mostly = {
+	.owner		= THIS_MODULE,
+	.features	= NFT_SET_INTERVAL | NFT_SET_MAP | NFT_SET_OBJECT |
+			  NFT_SET_TIMEOUT,
+	.ops		= {
+		.lookup		= nft_pipapo_avx2_lookup,
+		.insert		= nft_pipapo_insert,
+		.activate	= nft_pipapo_activate,
+		.deactivate	= nft_pipapo_deactivate,
+		.flush		= nft_pipapo_flush,
+		.remove		= nft_pipapo_remove,
+		.walk		= nft_pipapo_walk,
+		.get		= nft_pipapo_get,
+		.privsize	= nft_pipapo_privsize,
+		.estimate	= nft_pipapo_avx2_estimate,
+		.init		= nft_pipapo_init,
+		.destroy	= nft_pipapo_destroy,
+		.gc_init	= nft_pipapo_gc_init,
+		.elemsize	= offsetof(struct nft_pipapo_elem, ext),
+	},
+};
+#endif
diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
new file mode 100644
index 000000000000..b33e2a05b5e8
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo_avx2.c
@@ -0,0 +1,842 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* PIPAPO: PIle PAcket POlicies: AVX2 packet lookup routines
+ *
+ * Copyright (c) 2019-2020 Red Hat GmbH
+ *
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+#include <uapi/linux/netfilter/nf_tables.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+
+#include <linux/compiler.h>
+#include <asm/fpu/api.h>
+
+#include "nft_set_pipapo_avx2.h"
+#include "nft_set_pipapo.h"
+
+#define NFT_PIPAPO_LONGS_PER_M256	(XSAVE_YMM_SIZE / BITS_PER_LONG)
+
+/* Load from memory into YMM register with non-temporal hint ("stream load"),
+ * that is, don't fetch lines from memory into the cache. This avoids pushing
+ * precious packet data out of the cache hierarchy, and is appropriate when:
+ *
+ * - loading buckets from lookup tables, as they are not going to be used
+ *   again before packets are entirely classified
+ *
+ * - loading the result bitmap from the previous field, as it's never used
+ *   again
+ */
+#define NFT_PIPAPO_AVX2_LOAD(reg, loc)					\
+	asm volatile("vmovntdqa %0, %%ymm" #reg : : "m" (loc))
+
+/* Stream a single lookup table bucket into a YMM register, given the lookup
+ * table, group index, value of packet bits, and bucket size.
+ */
+#define NFT_PIPAPO_AVX2_BUCKET_LOAD(reg, lt, group, v, bsize)		\
+	NFT_PIPAPO_AVX2_LOAD(reg,					\
+			     lt[((group) * NFT_PIPAPO_BUCKETS + (v)) * (bsize)])
+
+/* Bitwise AND: the staple operation of this algorithm */
+#define NFT_PIPAPO_AVX2_AND(dst, a, b)					\
+	asm volatile("vpand %ymm" #a ", %ymm" #b ", %ymm" #dst)
+
+/* Jump to label if @reg is zero */
+#define NFT_PIPAPO_AVX2_NOMATCH_GOTO(reg, label)			\
+	asm_volatile_goto("vptest %%ymm" #reg ", %%ymm" #reg ";"	\
+			  "je %l[" #label "]" : : : : label)
+
+/* Store 256 bits from YMM register into memory. Contrary to the bucket load
+ * operation, we don't bypass the cache here, as stored matching results
+ * are always used shortly after.
+ */
+#define NFT_PIPAPO_AVX2_STORE(loc, reg)					\
+	asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc))
+
+/* Zero out a complete YMM register, @reg */
+#define NFT_PIPAPO_AVX2_ZERO(reg)					\
+	asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
+
+/* Current working bitmap index, toggled between field matches */
+static DEFINE_PER_CPU(bool, nft_pipapo_avx2_scratch_index);
+
+/**
+ * nft_pipapo_avx2_prepare() - Prepare before main algorithm body
+ *
+ * This zeroes out ymm15, which is later used whenever we need to clear a
+ * memory location, by storing its content into memory.
+ */
+static void nft_pipapo_avx2_prepare(void)
+{
+	NFT_PIPAPO_AVX2_ZERO(15);
+}
+
+/**
+ * nft_pipapo_avx2_fill() - Fill a bitmap region with ones
+ * @data:	Base memory area
+ * @start:	First bit to set
+ * @len:	Count of bits to fill
+ *
+ * This is nothing other than a version of bitmap_set(), as used e.g. by
+ * pipapo_refill(), tailored for the microarchitectures using it, and better
+ * suited to the specific usage: it's very likely that we'll set a small number
+ * of bits, not crossing a word boundary, and correct branch prediction is
+ * critical here.
+ *
+ * This function doesn't actually use any AVX2 instruction.
+ */
+static void nft_pipapo_avx2_fill(unsigned long *data, int start, int len)
+{
+	int offset = start % BITS_PER_LONG;
+	unsigned long mask;
+
+	data += start / BITS_PER_LONG;
+
+	if (likely(len == 1)) {
+		*data |= BIT(offset);
+		return;
+	}
+
+	if (likely(len < BITS_PER_LONG || offset)) {
+		if (likely(len + offset <= BITS_PER_LONG)) {
+			*data |= GENMASK(len - 1 + offset, offset);
+			return;
+		}
+
+		*data |= ~0UL << offset;
+		len -= BITS_PER_LONG - offset;
+		data++;
+
+		if (len <= BITS_PER_LONG) {
+			mask = ~0UL >> (BITS_PER_LONG - len);
+			*data |= mask;
+			return;
+		}
+	}
+
+	memset(data, 0xff, len / BITS_PER_BYTE);
+	data += len / BITS_PER_LONG;
+
+	len %= BITS_PER_LONG;
+	if (len)
+		*data |= ~0UL >> (BITS_PER_LONG - len);
+}
+
+/**
+ * nft_pipapo_avx2_refill() - Scan bitmap, select mapping table item, set bits
+ * @offset:	Start from given bitmap (equivalent to bucket) offset, in longs
+ * @map:	Bitmap to be scanned for set bits
+ * @dst:	Destination bitmap
+ * @mt:		Mapping table containing bit set specifiers
+ * @len:	Length of bitmap in longs
+ * @last:	Return index of first set bit, if this is the last field
+ *
+ * This is an alternative implementation of pipapo_refill() suitable for usage
+ * with AVX2 lookup routines: we know there are four words to be scanned, at
+ * a given offset inside the map, for each matching iteration.
+ *
+ * This function doesn't actually use any AVX2 instruction.
+ *
+ * Return: first set bit index if @last, index of first filled word otherwise.
+ */
+static int nft_pipapo_avx2_refill(int offset, unsigned long *map,
+				  unsigned long *dst,
+				  union nft_pipapo_map_bucket *mt, bool last)
+{
+	int ret = -1;
+
+#define NFT_PIPAPO_AVX2_REFILL_ONE_WORD(x)				\
+	do {								\
+		while (map[(x)]) {					\
+			int r = __builtin_ctzl(map[(x)]);		\
+			int i = (offset + (x)) * BITS_PER_LONG + r;	\
+									\
+			if (last)					\
+				return i;				\
+									\
+			nft_pipapo_avx2_fill(dst, mt[i].to, mt[i].n);	\
+									\
+			if (ret == -1)					\
+				ret = mt[i].to;				\
+									\
+			map[(x)] &= ~(1UL << r);			\
+		}							\
+	} while (0)
+
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(0);
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(1);
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(2);
+	NFT_PIPAPO_AVX2_REFILL_ONE_WORD(3);
+#undef NFT_PIPAPO_AVX2_REFILL_ONE_WORD
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup2() - AVX2-based lookup for 2 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * Load buckets from lookup table corresponding to the values of each 4-bit
+ * group of packet bytes, and perform a bitwise intersection between them. If
+ * this is the first field in the set, simply AND the buckets together
+ * (equivalent to using an all-ones starting bitmap), use the provided starting
+ * bitmap otherwise. Then call nft_pipapo_avx2_refill() to generate the next
+ * working bitmap, @fill.
+ *
+ * This is used for 8-bit fields (e.g. protocol numbers).
+ *
+ * Out-of-order (and superscalar) execution is vital here, so it's critical to
+ * avoid false data dependencies. CPU and compiler could (mostly) take care of
+ * this on their own, but the operation ordering is explicitly given here with
+ * a likely execution order in mind, to highlight possible stalls. That's why
+ * a number of logically distinct operations (i.e. loading buckets, intersecting
+ * buckets) are interleaved.
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup2(unsigned long *map, unsigned long *fill,
+				   unsigned long *lt,
+				   union nft_pipapo_map_bucket *mt,
+				   unsigned long bsize, const u8 *pkt,
+				   bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[2] = { pkt[0] >> 4, pkt[0] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (first) {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_AND(4, 0, 1);
+		} else {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_LOAD(2, map[i_ul]);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(2, nothing);
+			NFT_PIPAPO_AVX2_AND(3, 0, 1);
+			NFT_PIPAPO_AVX2_AND(4, 2, 3);
+		}
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(4, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 4);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup4() - AVX2-based lookup for 4 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 16-bit fields (e.g. ports).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup4(unsigned long *map, unsigned long *fill,
+				   unsigned long *lt,
+				   union nft_pipapo_map_bucket *mt,
+				   unsigned long bsize, const u8 *pkt,
+				   bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[4] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (first) {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2, lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3, lt, 3, pg[3], bsize);
+			NFT_PIPAPO_AVX2_AND(4, 0, 1);
+			NFT_PIPAPO_AVX2_AND(5, 2, 3);
+			NFT_PIPAPO_AVX2_AND(7, 4, 5);
+		} else {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0, lt, 0, pg[0], bsize);
+
+			NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]);
+
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2, lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3, lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(4, lt, 3, pg[3], bsize);
+			NFT_PIPAPO_AVX2_AND(5, 0, 1);
+
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing);
+
+			NFT_PIPAPO_AVX2_AND(6, 2, 3);
+			NFT_PIPAPO_AVX2_AND(7, 4, 5);
+			/* Stall */
+			NFT_PIPAPO_AVX2_AND(7, 6, 7);
+		}
+
+		/* Stall */
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(7, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 7);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup8() - AVX2-based lookup for 8 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 32-bit fields (e.g. IPv4 addresses).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup8(unsigned long *map, unsigned long *fill,
+				   unsigned long *lt,
+				   union nft_pipapo_map_bucket *mt,
+				   unsigned long bsize, const u8 *pkt,
+				   bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[8] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf,
+		     pkt[2] >> 4, pkt[2] & 0xf, pkt[3] >> 4, pkt[3] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (first) {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 3, pg[3], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt, 4, pg[4], bsize);
+			NFT_PIPAPO_AVX2_AND(5,   0,  1);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt, 5, pg[5], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt, 6, pg[6], bsize);
+			NFT_PIPAPO_AVX2_AND(8,   2,  3);
+			NFT_PIPAPO_AVX2_AND(9,   4,  5);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 7, pg[7], bsize);
+			NFT_PIPAPO_AVX2_AND(11,  6,  7);
+			NFT_PIPAPO_AVX2_AND(12,  8,  9);
+			NFT_PIPAPO_AVX2_AND(13, 10, 11);
+
+			/* Stall */
+			NFT_PIPAPO_AVX2_AND(1,  12, 13);
+		} else {
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt, 0, pg[0], bsize);
+			NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt, 1, pg[1], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 2, pg[2], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt, 3, pg[3], bsize);
+
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing);
+
+			NFT_PIPAPO_AVX2_AND(5,   0,  1);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt, 4, pg[4], bsize);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt, 5, pg[5], bsize);
+			NFT_PIPAPO_AVX2_AND(8,   2,  3);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(9,  lt, 6, pg[6], bsize);
+			NFT_PIPAPO_AVX2_AND(10,  4,  5);
+			NFT_PIPAPO_AVX2_BUCKET_LOAD(11, lt, 7, pg[7], bsize);
+			NFT_PIPAPO_AVX2_AND(12,  6,  7);
+			NFT_PIPAPO_AVX2_AND(13,  8,  9);
+			NFT_PIPAPO_AVX2_AND(14, 10, 11);
+
+			/* Stall */
+			NFT_PIPAPO_AVX2_AND(1,  12, 13);
+			NFT_PIPAPO_AVX2_AND(1,   1, 14);
+		}
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 1);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup12() - AVX2-based lookup for 12 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 48-bit fields (e.g. MAC addresses/EUI-48).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup12(unsigned long *map, unsigned long *fill,
+				    unsigned long *lt,
+				    union nft_pipapo_map_bucket *mt,
+				    unsigned long bsize, const u8 *pkt,
+				    bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[12] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf,
+		      pkt[2] >> 4, pkt[2] & 0xf, pkt[3] >> 4, pkt[3] & 0xf,
+		      pkt[4] >> 4, pkt[4] & 0xf, pkt[5] >> 4, pkt[5] & 0xf };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (!first)
+			NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt,  0,  pg[0], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt,  1,  pg[1], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt,  2,  pg[2], bsize);
+
+		if (!first) {
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing);
+			NFT_PIPAPO_AVX2_AND(1, 1, 0);
+		}
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt,  3,  pg[3], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt,  4,  pg[4], bsize);
+		NFT_PIPAPO_AVX2_AND(6,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt,  5,  pg[5], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt,  6,  pg[6], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   1,  4);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt,  7,  pg[7], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  5,  6);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt,  8,  pg[8], bsize);
+		NFT_PIPAPO_AVX2_AND(13,  7,  8);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(14, lt,  9,  pg[9], bsize);
+
+		NFT_PIPAPO_AVX2_AND(0,   9, 10);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt, 10,  pg[10], bsize);
+		NFT_PIPAPO_AVX2_AND(2,  11, 12);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 11,  pg[11], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  13, 14);
+		NFT_PIPAPO_AVX2_AND(5,   0,  1);
+
+		NFT_PIPAPO_AVX2_AND(6,   2,  3);
+
+		/* Stalls */
+		NFT_PIPAPO_AVX2_AND(7,   4,  5);
+		NFT_PIPAPO_AVX2_AND(8,   6,  7);
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(8, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 8);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup32() - AVX2-based lookup for 32 four-bit groups
+ * @map:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @pkt:	Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * See nft_pipapo_avx2_lookup2().
+ *
+ * This is used for 128-bit fields (e.g. IPv6 addresses).
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup32(unsigned long *map, unsigned long *fill,
+				    unsigned long *lt,
+				    union nft_pipapo_map_bucket *mt,
+				    unsigned long bsize, const u8 *pkt,
+				    bool first, bool last, int offset)
+{
+	int i, ret = -1, m256_size = bsize / NFT_PIPAPO_LONGS_PER_M256, b;
+	u8 pg[32] = {  pkt[0] >> 4,  pkt[0] & 0xf,  pkt[1] >> 4,  pkt[1] & 0xf,
+		       pkt[2] >> 4,  pkt[2] & 0xf,  pkt[3] >> 4,  pkt[3] & 0xf,
+		       pkt[4] >> 4,  pkt[4] & 0xf,  pkt[5] >> 4,  pkt[5] & 0xf,
+		       pkt[6] >> 4,  pkt[6] & 0xf,  pkt[7] >> 4,  pkt[7] & 0xf,
+		       pkt[8] >> 4,  pkt[8] & 0xf,  pkt[9] >> 4,  pkt[9] & 0xf,
+		      pkt[10] >> 4, pkt[10] & 0xf, pkt[11] >> 4, pkt[11] & 0xf,
+		      pkt[12] >> 4, pkt[12] & 0xf, pkt[13] >> 4, pkt[13] & 0xf,
+		      pkt[14] >> 4, pkt[14] & 0xf, pkt[15] >> 4, pkt[15] & 0xf,
+		    };
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+	for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) {
+		int i_ul = i * NFT_PIPAPO_LONGS_PER_M256;
+
+		if (!first)
+			NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt,  0,  pg[0], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt,  1,  pg[1], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt,  2,  pg[2], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(4,  lt,  3,  pg[3], bsize);
+		if (!first) {
+			NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing);
+			NFT_PIPAPO_AVX2_AND(1, 1, 0);
+		}
+
+		NFT_PIPAPO_AVX2_AND(5,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt,  4,  pg[4], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt,  5,  pg[5], bsize);
+		NFT_PIPAPO_AVX2_AND(8,   1,  4);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(9,  lt,  6,  pg[6], bsize);
+		NFT_PIPAPO_AVX2_AND(10,  5,  6);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(11, lt,  7,  pg[7], bsize);
+		NFT_PIPAPO_AVX2_AND(12,  7,  8);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(13, lt,  8,  pg[8], bsize);
+		NFT_PIPAPO_AVX2_AND(14,  9, 10);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt,  9,  pg[9], bsize);
+		NFT_PIPAPO_AVX2_AND(1,  11, 12);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(2,  lt, 10, pg[10], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 11, pg[11], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  13, 14);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt, 12, pg[12], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(6,  lt, 13, pg[13], bsize);
+		NFT_PIPAPO_AVX2_AND(7,   0,  1);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt, 14, pg[14], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 15, pg[15], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  4,  5);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt, 16, pg[16], bsize);
+		NFT_PIPAPO_AVX2_AND(13,  6,  7);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(14, lt, 17, pg[17], bsize);
+
+		NFT_PIPAPO_AVX2_AND(0,   8,  9);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(1,  lt, 18, pg[18], bsize);
+		NFT_PIPAPO_AVX2_AND(2,  10, 11);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 19, pg[19], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  12, 13);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt, 20, pg[20], bsize);
+		NFT_PIPAPO_AVX2_AND(6,  14,  0);
+		NFT_PIPAPO_AVX2_AND(7,   1,  2);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt, 21, pg[21], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   3,  4);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 22, pg[22], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  5,  6);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt, 23, pg[23], bsize);
+		NFT_PIPAPO_AVX2_AND(13,  7,  8);
+
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(14, lt, 24, pg[24], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(0,  lt, 25, pg[25], bsize);
+		NFT_PIPAPO_AVX2_AND(1,   9, 10);
+		NFT_PIPAPO_AVX2_AND(2,  11, 12);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(3,  lt, 26, pg[26], bsize);
+		NFT_PIPAPO_AVX2_AND(4,  13, 14);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(5,  lt, 27, pg[27], bsize);
+		NFT_PIPAPO_AVX2_AND(6,   0,  1);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(7,  lt, 28, pg[28], bsize);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(8,  lt, 29, pg[29], bsize);
+		NFT_PIPAPO_AVX2_AND(9,   2,  3);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(10, lt, 30, pg[30], bsize);
+		NFT_PIPAPO_AVX2_AND(11,  4,  5);
+		NFT_PIPAPO_AVX2_BUCKET_LOAD(12, lt, 31, pg[31], bsize);
+
+		NFT_PIPAPO_AVX2_AND(0,   6,  7);
+		NFT_PIPAPO_AVX2_AND(1,   8,  9);
+		NFT_PIPAPO_AVX2_AND(2,  10, 11);
+		NFT_PIPAPO_AVX2_AND(3,  12,  0);
+
+		/* Stalls */
+		NFT_PIPAPO_AVX2_AND(4,   1,  2);
+		NFT_PIPAPO_AVX2_AND(5,   3,  4);
+
+		NFT_PIPAPO_AVX2_NOMATCH_GOTO(5, nomatch);
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 5);
+
+		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, mt, last);
+		if (last)
+			return b;
+
+		if (unlikely(ret == -1))
+			ret = b / XSAVE_YMM_SIZE;
+
+		continue;
+nomatch:
+		NFT_PIPAPO_AVX2_STORE(map[i_ul], 15);
+nothing:
+		;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_lookup_noavx2() - Fallback function for uncommon field sizes
+ * @f:		Field to be matched
+ * @res:	Previous match result, used as initial bitmap
+ * @fill:	Destination bitmap to be filled with current match result
+ * @lt:		Lookup table for this field
+ * @mt:		Mapping table for this field
+ * @bsize:	Bucket size for this lookup table, in longs
+ * @rp:		Packet data, pointer to input nftables register
+ * @first:	If this is the first field, don't source previous result
+ * @last:	Last field: stop at the first match and return bit index
+ * @offset:	Ignore buckets before the given index, no bits are filled there
+ *
+ * This function should never be called in practice: it is provided only for
+ * the case where the field size doesn't match any of the known data types.
+ * The matching rate is substantially lower than that of the AVX2 routines.
+ *
+ * Return: -1 on no match, rule index of match if @last, otherwise first long
+ * word index to be checked next (i.e. first filled word).
+ */
+static int nft_pipapo_avx2_lookup_noavx2(struct nft_pipapo_field *f,
+					 unsigned long *res,
+					 unsigned long *fill, unsigned long *lt,
+					 union nft_pipapo_map_bucket *mt,
+					 unsigned long bsize, const u8 *rp,
+					 bool first, bool last, int offset)
+{
+	int i, ret = -1, b;
+
+	lt += offset * NFT_PIPAPO_LONGS_PER_M256;
+
+	if (first)
+		memset(res, 0xff, bsize * sizeof(*res));
+
+	for (i = offset; i < bsize; i++) {
+		pipapo_and_field_buckets(f, res, rp);
+
+		b = pipapo_refill(res, bsize, f->rules, fill, mt, last);
+
+		if (last)
+			return b;
+
+		if (ret == -1)
+			ret = b / XSAVE_YMM_SIZE;
+	}
+
+	return ret;
+}
+
+/**
+ * nft_pipapo_avx2_estimate() - Set size, space and lookup complexity
+ * @desc:	Set description, element count and field description used
+ * @features:	Flags: NFT_SET_INTERVAL needs to be there
+ * @est:	Storage for estimation data
+ *
+ * Return: true if set is compatible and AVX2 available, false otherwise.
+ */
+bool nft_pipapo_avx2_estimate(const struct nft_set_desc *desc, u32 features,
+			      struct nft_set_estimate *est)
+{
+	if (!(features & NFT_SET_INTERVAL) || desc->field_count <= 1)
+		return false;
+
+	if (!boot_cpu_has(X86_FEATURE_AVX2) || !boot_cpu_has(X86_FEATURE_AVX))
+		return false;
+
+	est->size = pipapo_estimate_size(desc);
+	if (!est->size)
+		return false;
+
+	est->lookup = NFT_SET_CLASS_O_LOG_N;
+
+	est->space = NFT_SET_CLASS_O_N;
+
+	return true;
+}
+
+/**
+ * nft_pipapo_avx2_lookup() - Lookup function for AVX2 implementation
+ * @net:	Network namespace
+ * @set:	nftables API set representation
+ * @key:	nftables API element representation containing key data
+ * @ext:	nftables API extension pointer, filled with matching reference
+ *
+ * For more details, see DOC: Theory of Operation in nft_set_pipapo.c.
+ *
+ * This implementation exploits the repetitive characteristic of the algorithm
+ * to provide a fast, vectorised version using the AVX2 SIMD instruction set.
+ *
+ * Return: true on match, false otherwise.
+ */
+bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+			    const u32 *key, const struct nft_set_ext **ext)
+{
+	struct nft_pipapo *priv = nft_set_priv(set);
+	unsigned long *res, *fill, *scratch;
+	u8 genmask = nft_genmask_cur(net);
+	const u8 *rp = (const u8 *)key;
+	struct nft_pipapo_match *m;
+	struct nft_pipapo_field *f;
+	bool map_index;
+	int i, ret = 0;
+
+	m = rcu_dereference(priv->match);
+
+	/* This also protects access to all data related to scratch maps */
+	kernel_fpu_begin();
+
+	if (unlikely(!m || !*raw_cpu_ptr(m->scratch))) {
+		kernel_fpu_end();
+		return false;
+	}
+
+	scratch = *raw_cpu_ptr(m->scratch_aligned);
+	map_index = raw_cpu_read(nft_pipapo_avx2_scratch_index);
+
+	res  = scratch + (map_index ? m->bsize_max : 0);
+	fill = scratch + (map_index ? 0 : m->bsize_max);
+
+	/* Starting map doesn't need to be set for this implementation */
+
+	nft_pipapo_avx2_prepare();
+
+next_match:
+	nft_pipapo_for_each_field(f, i, m) {
+		bool last = i == m->field_count - 1, first = !i;
+
+#define NFT_SET_PIPAPO_AVX2_LOOKUP(n)					\
+		(ret = nft_pipapo_avx2_lookup ##n(res, fill,		\
+						  f->lt_aligned, f->mt,	\
+						  f->bsize, rp,		\
+						  first, last, ret))
+
+		if (f->groups == 2) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(2);
+		} else if (f->groups == 4) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(4);
+		} else if (f->groups == 8) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(8);
+		} else if (f->groups == 12) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(12);
+		} else if (f->groups == 32) {
+			NFT_SET_PIPAPO_AVX2_LOOKUP(32);
+		} else {
+			ret = nft_pipapo_avx2_lookup_noavx2(f, res, fill,
+							    f->lt_aligned,
+							    f->mt, f->bsize,
+							    rp,
+							    first, last, ret);
+		}
+#undef NFT_SET_PIPAPO_AVX2_LOOKUP
+
+		if (ret < 0)
+			goto out;
+
+		if (last) {
+			*ext = &f->mt[ret].e->ext;
+			if (unlikely(nft_set_elem_expired(*ext) ||
+				     !nft_set_elem_active(*ext, genmask))) {
+				ret = 0;
+				goto next_match;
+			}
+
+			goto out;
+		}
+
+		map_index = !map_index;
+		swap(res, fill);
+		rp += NFT_PIPAPO_GROUPS_PADDED_SIZE(f->groups);
+	}
+
+out:
+	raw_cpu_write(nft_pipapo_avx2_scratch_index, map_index);
+	kernel_fpu_end();
+
+	return ret >= 0;
+}
diff --git a/net/netfilter/nft_set_pipapo_avx2.h b/net/netfilter/nft_set_pipapo_avx2.h
new file mode 100644
index 000000000000..396caf7bfca8
--- /dev/null
+++ b/net/netfilter/nft_set_pipapo_avx2.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _NFT_SET_PIPAPO_AVX2_H
+#define _NFT_SET_PIPAPO_AVX2_H
+
+#ifdef CONFIG_AS_AVX2
+#include <asm/fpu/xstate.h>
+#define NFT_PIPAPO_ALIGN	(XSAVE_YMM_SIZE / BITS_PER_BYTE)
+
+bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+			    const u32 *key, const struct nft_set_ext **ext);
+bool nft_pipapo_avx2_estimate(const struct nft_set_desc *desc, u32 features,
+			      struct nft_set_estimate *est);
+#endif /* CONFIG_AS_AVX2 */
+
+#endif /* _NFT_SET_PIPAPO_AVX2_H */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation
  2020-01-21 23:17 ` [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
@ 2020-01-27  6:41   ` kbuild test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2020-01-27  6:41 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: kbuild-all, Pablo Neira Ayuso, netfilter-devel, Florian Westphal,
	Kadlecsik József, Eric Garver, Phil Sutter

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

Hi Stefano,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on nf-next/master]
[also build test ERROR on kselftest/next]
[cannot apply to nf/master linus/master ipvs/master v5.5 next-20200124]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Stefano-Brivio/nftables-Set-implementation-for-arbitrary-concatenation-of-ranges/20200124-112940
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git master
config: i386-randconfig-d001-20200127 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-3) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ERROR: "nft_pipapo_avx2_lookup" [net/netfilter/nf_tables_set.ko] undefined!
>> ERROR: "nft_pipapo_avx2_estimate" [net/netfilter/nf_tables_set.ko] undefined!

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33880 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (8 preceding siblings ...)
  2020-01-21 23:17 ` [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
@ 2020-01-27  8:20 ` Pablo Neira Ayuso
  2020-02-20 10:52 ` Phil Sutter
  10 siblings, 0 replies; 20+ messages in thread
From: Pablo Neira Ayuso @ 2020-01-27  8:20 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Wed, Jan 22, 2020 at 12:17:50AM +0100, Stefano Brivio wrote:
> Existing nftables set implementations allow matching entries with
> interval expressions (rbtree), e.g. 192.0.2.1-192.0.2.4, entries
> specifying field concatenation (hash, rhash), e.g. 192.0.2.1:22,
> but not both.
> 
> In other words, none of the set types allows matching on range
> expressions for more than one packet field at a time, such as ipset
> does with types bitmap:ip,mac, and, to a more limited extent
> (netmasks, not arbitrary ranges), with types hash:net,net,
> hash:net,port, hash:ip,port,net, and hash:net,port,net.
> 
> As a pure hash-based approach is unsuitable for matching on ranges,
> and "proxying" the existing red-black tree type looks impractical as
> elements would need to be shared and managed across all employed
> trees, this new set implementation intends to fill the functionality
> gap by employing a relatively novel approach.
> 
> The fundamental idea, illustrated in deeper detail in patch 5/9, is to
> use lookup tables classifying a small number of grouped bits from each
> field, and map the lookup results in a way that yields a verdict for
> the full set of specified fields.
> 
> The grouping bit aspect is loosely inspired by the Grouper algorithm,
> by Jay Ligatti, Josh Kuhn, and Chris Gage (see patch 5/9 for the full
> reference).
> 
> A reference, stand-alone implementation of the algorithm itself is
> available at:
> 	https://pipapo.lameexcu.se
> 
> Some notes about possible future optimisations are also mentioned
> there. This algorithm reduces the matching problem to, essentially,
> a repetitive sequence of simple bitwise operations, and is
> particularly suitable to be optimised by leveraging SIMD instruction
> sets. An AVX2-based implementation is also presented in this series.
> 
> I plan to post the adaptation of the existing AVX2 vectorised
> implementation for (at least) NEON at a later time.
> 
> Patches 1/9 to 3/9 implement the needed infrastructure: new
> attributes are used to describe length of single ranged fields in
> concatenations and to denote the upper bound for ranges.
> 
> Patch 4/9 adds a new bitmap operation that copies the source bitmap
> onto the destination while removing a given region, and is needed to
> delete regions of arrays mapping between lookup tables.
> 
> Patch 5/9 is the actual set implementation.
> 
> Patch 6/9 introduces selftests for the new implementation.

Applied up to 6/9.

The merge window will close soon, so I'm going to be a bit defensive
and take only the batch that includes the initial implementation. I
would prefer that we all use this round to start using the C
implementation upstream and report bugs. While I have meanwhile
received positive feedback from other fellows privately, this batch is
large and I'm inclined to follow this approach.

Please, don't be disappointed, and just follow up with more patches
once merge window opens up again.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
  2020-01-21 23:17 ` [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
@ 2020-02-07 11:23   ` Pablo Neira Ayuso
  2020-02-10 15:10     ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: Pablo Neira Ayuso @ 2020-02-07 11:23 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

Hi Stefano,

A bit of late feedback on your new datastructure.

Did you test with 8-bit grouping instead of 4-bit?

Assuming a concatenation of 12 bytes (each field 4 bytes, hence 3
fields):

* Using 4-bits groups: the number of buckets is 2^4 = 16 multiplied
  by the bucket word (assuming one long word, 8 bytes, 64 pipapo
  rules) is 16 * 8 = 128 bytes per group-row in the lookup table. Then,
  the number of group-rows is 8 given that 32 bits, then 32 / 4 = 8
  group-rows.

  8 * 128 bytes = 1024 bytes per lookup table.

  Assuming 3 fields, then this is 1024 * 3 = 3072 bytes.

* Using 8-bits groups: 2^8 = 256, then 256 * 8 = 2048 bytes per
  group-row. Then, 32 / 8 = 4 group-rows in total.

  4 * 2048 bytes = 8192 bytes per lookup table.

  Therefore, 3 * 8192 = 24576 bytes. Still rather small.

This is missing the mapping table that links the lookup tables in the
memory counting. And I'm assuming that the number of pipapo rules in
the lookup table fits into 64-bits bucket long word.

Anyway, my understanding is that the more bits you use for grouping,
the larger the lookup table becomes.

Still, both look very small in terms of memory consumption for these
days.
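The arithmetic above can be reproduced with a short stand-alone sketch; the helper name and parameters are illustrative, not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one lookup table, following the arithmetic in the
 * mail above: group_bits is the number of bits classified per group
 * (4 or 8), field_bits is the packet field width in bits, and
 * bucket_bytes is the width of one bucket row (one long word, 8 bytes,
 * i.e. up to 64 pipapo rules).
 */
static uint64_t lt_bytes(unsigned int group_bits, unsigned int field_bits,
			 unsigned int bucket_bytes)
{
	uint64_t buckets = 1ULL << group_bits;		/* 2^group_bits buckets */
	uint64_t groups = field_bits / group_bits;	/* group-rows per field */

	return buckets * bucket_bytes * groups;
}
```

With 4-bit groups this yields the 1024 bytes per table and 3072 bytes for three fields computed above; with 8-bit groups, 8192 and 24576 bytes.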

I'm just telling this because the C implementation can probably get
better numbers at the cost of consuming more memory? Probably do this
at some point?

BTW, with a few more knobs it should be possible to integrate better
this datastructure into the transaction infrastructure, this can be
done incrementally.

More questions below.

On Wed, Jan 22, 2020 at 12:17:55AM +0100, Stefano Brivio wrote:
[...]
> diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
> new file mode 100644
> index 000000000000..f0cb1e13af50
> --- /dev/null
> +++ b/net/netfilter/nft_set_pipapo.c
[...]
> + *
> + *       rule indices in current field: 0    1    2
> + *       map to rules in next field:    0    1    1
> + *
> + *      the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be
> + *      set.
> + *
> + *      We can now extend this example to cover the second iteration of the step
> + *      above (lookup and AND bitmap): assuming the port field is
> + *      2048 < 0  0  5  0 >, with starting result bitmap 0x2, and lookup table
> + *      for "port" field from pre-computation example:
> + *
> + * ::
> + *
> + *                     bucket
> + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> + *        0   0,1
> + *        1   0,1
> + *        2                    0   1
> + *        3   0,1
> + *
> + *       operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5]
> + *       & 0x3 [bucket 0], resulting bitmap is 0x2.
> + *
> + *   - if this is the last field in the set, look up the value from the mapping
> + *     array corresponding to the final result bitmap
> + *
> + *      Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for
> + *      last field from insertion example:
> + *
> + * ::
> + *
> + *       rule indices in last field:    0    1
> + *       map to elements:             0x42  0x66

Should this be instead?

            rule indices in last field:    0    1
            map to elements:             0x66  0x42

If the resulting bitmap is 0x2, then this is actually pointing to
rule index 1 in this lookup table, that is the 2048.

> + *      the matching element is at 0x42.

Hence, the matching 0x42 element.

Otherwise, I don't understand how to interpret the "result bitmap". I
thought this contains the matching pipapo rule index that is expressed
as a bitmask.
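That interpretation -- the result bitmap is a bitmask over rule indices in the last field, and the index of the set bit selects the slot in the mapping array -- can be sketched as follows. The array contents use the corrected order proposed above, and the helper name is made up:

```c
#include <assert.h>

/* Hypothetical mapping array for the last field, in the corrected
 * order: rule index 0 maps to element 0x66, rule index 1 to 0x42.
 */
static const unsigned int map_to_elem[] = { 0x66, 0x42 };

/* For a single match, the final result bitmap has exactly one bit set;
 * the index of that bit is the rule index into the mapping array.
 */
static unsigned int resolve(unsigned long res)
{
	return map_to_elem[__builtin_ctzl(res)];
}
```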

[...]
> +static int pipapo_refill(unsigned long *map, int len, int rules,
> +			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
> +			 bool match_only)
> +{
> +	unsigned long bitset;
> +	int k, ret = -1;
> +
> +	for (k = 0; k < len; k++) {
> +		bitset = map[k];
> +		while (bitset) {
> +			unsigned long t = bitset & -bitset;
> +			int r = __builtin_ctzl(bitset);
> +			int i = k * BITS_PER_LONG + r;
> +
> +			if (unlikely(i >= rules)) {
> +				map[k] = 0;
> +				return -1;
> +			}
> +
> +			if (unlikely(match_only)) {

Not a big issue, but this branch holds true for the last field; my
understanding of unlikely() is that it should be used for paths that
are almost never taken (99.99...% not going to happen). Just that,
when reading the code, I got confused because this is actually likely
to happen.

> +				bitmap_clear(map, i, 1);
> +				return i;
> +			}
> +
> +			ret = 0;
> +
> +			bitmap_set(dst, mt[i].to, mt[i].n);
> +
> +			bitset ^= t;
> +		}
> +		map[k] = 0;
> +	}
> +
> +	return ret;
> +}
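As an aside, the two idioms in the inner loop above -- `bitset & -bitset` to isolate the lowest set bit and `__builtin_ctzl()` to find its index -- combine into the usual visit-every-set-bit pattern. A stand-alone sketch (not the kernel code; the function name is made up):

```c
#include <assert.h>

/* Visit every set bit of a word, lowest first, recording bit indices
 * into out[]. Returns the number of bits visited. Mirrors the inner
 * loop of pipapo_refill(): t isolates the lowest set bit,
 * __builtin_ctzl() gives its position, bitset ^= t clears it.
 */
static int for_each_set_bit_demo(unsigned long bitset, int *out)
{
	int n = 0;

	while (bitset) {
		unsigned long t = bitset & -bitset;	/* lowest set bit only */
		int r = __builtin_ctzl(bitset);		/* its index */

		out[n++] = r;
		bitset ^= t;				/* clear it */
	}
	return n;
}
```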

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
  2020-02-07 11:23   ` Pablo Neira Ayuso
@ 2020-02-10 15:10     ` Stefano Brivio
  2020-02-14 18:16       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2020-02-10 15:10 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Fri, 7 Feb 2020 12:23:08 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi Stefano,
> 
> A bit of late feedback on your new datastructure.
> 
> Did you test with 8-bit grouping instead of 4-bit?

Yes, at the very beginning, not with the final implementation. It was
somewhat faster (on x86_64, I don't remember how much) for small
numbers of rules, then I thought we would use too much memory, because:

> Assuming a concatenation of 12 bytes (each field 4 bytes, hence 3
> fields):
> 
> * Using 4-bits groups: the number of buckets is 2^4 = 16 multiplied
>   by the bucket word (assuming one long word, 8 bytes, 64 pipapo
> >   rules) is 16 * 8 = 128 bytes per group-row in the lookup table. Then,
>   the number of group-rows is 8 given that 32 bits, then 32 / 4 = 8
>   group-rows.
> 
>   8 * 128 bytes = 1024 bytes per lookup table.
> 
>   Assuming 3 fields, then this is 1024 * 3 = 3072 bytes.
> 
> * Using 8-bits groups: 2^8 = 256, then 256 * 8 = 2048 bytes per
>   group-row. Then, 32 / 8 = 4 group-rows in total.
> 
>   4 * 2048 bytes = 8192 bytes per lookup table.
> 
>   Therefore, 3 * 8192 = 24576 bytes. Still rather small.
> 
> This is missing the mapping table that links the lookup tables in the
> memory counting. And I'm assuming that the number of pipapo rules in
> the lookup table fits into 64-bits bucket long word.

...the (reasonable?) worst case I wanted to cover was two IPv6
addresses, one port, one MAC address (in ipset terms
"net,net,port,mac"), with 2 ^ 16 unique, non-overlapping entries each
(or ranges expanding to that amount of rules), because that's what
(single, non-concatenated) ipset "bitmap" types can do.

Also ignoring the mapping table (it's "small"), with 4-bit buckets:

- for the IPv6 addresses, we have 16 buckets, each 2 ^ 16
  bits wide, and 32 groups (128 bits / 4 bits), that is, 8MiB in
  total

- for the MAC address, 16 buckets, each 2 ^ 16 bits wide, and 12
  groups, 1.5MiB

- for the port, 16 buckets, each 2 ^ 12 bits wide, 2 groups, 0.25MiB

that is, 9.75MiB.

With 8-bit buckets: we can just multiply everything by 8 (that is,
2 ^ 8 / 2 ^ 4 / 2, because we have 2 ^ (8 - 4) times the buckets, with
half the groups), 78MiB.

And that started feeling like "a lot". However, I'm probably overdoing
with the worst case -- this was just to explain what brought me to the
4-bit choice, now I start doubting about it.
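The x8 factor used above can be double-checked: going from 4-bit to 8-bit groups multiplies the bucket count by 2^(8-4) = 16 while halving the number of groups, so per-field table size grows by 16 / 2 = 8. A trivial sketch (helper name is made up):

```c
#include <assert.h>

/* Growth factor of a per-field lookup table when widening the groups:
 * 2^bits buckets per group, with the group count shrinking in inverse
 * proportion to the group width. Illustrative only, not patch code.
 */
static double group_scale(unsigned int old_bits, unsigned int new_bits)
{
	double old_sz = (double)(1UL << old_bits) / old_bits;
	double new_sz = (double)(1UL << new_bits) / new_bits;

	return new_sz / old_sz;
}
```

This matches 9.75MiB scaling to 78MiB in the worst case above.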

> Anyway, my understanding is that the more bits you use for grouping,
> the larger the lookup table becomes.
> 
> Still, both look very small in terms of memory consumption for these
> days.
> 
> I'm just telling this because the C implementation can probably get
> better numbers at the cost of consuming more memory? Probably do this
> at some point?

Another topic is the additional amount of cachelines we would use. I
don't expect that effect to be visible, but I might be wrong.

So yes, I think it's definitely worth a try, thanks for the hint! I'll
try to look into this soon and test it on a few archs (something with
small cachelines, say MIPS r2k, would be worth checking, too).

We could even consider to dynamically adjust group size depending on
the set size, I don't know yet if that gets too convoluted.

> BTW, with a few more knobs it should be possible to integrate better
> this datastructure into the transaction infrastructure, this can be
> done incrementally.
> 
> More questions below.
> 
> On Wed, Jan 22, 2020 at 12:17:55AM +0100, Stefano Brivio wrote:
> [...]
> > diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
> > new file mode 100644
> > index 000000000000..f0cb1e13af50
> > --- /dev/null
> > +++ b/net/netfilter/nft_set_pipapo.c  
> [...]
> > + *
> > + *       rule indices in current field: 0    1    2
> > + *       map to rules in next field:    0    1    1
> > + *
> > + *      the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be
> > + *      set.
> > + *
> > + *      We can now extend this example to cover the second iteration of the step
> > + *      above (lookup and AND bitmap): assuming the port field is
> > + *      2048 < 0  0  5  0 >, with starting result bitmap 0x2, and lookup table
> > + *      for "port" field from pre-computation example:
> > + *
> > + * ::
> > + *
> > + *                     bucket
> > + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> > + *        0   0,1
> > + *        1   0,1
> > + *        2                    0   1
> > + *        3   0,1
> > + *
> > + *       operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5]
> > + *       & 0x3 [bucket 0], resulting bitmap is 0x2.
> > + *
> > + *   - if this is the last field in the set, look up the value from the mapping
> > + *     array corresponding to the final result bitmap
> > + *
> > + *      Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for
> > + *      last field from insertion example:
> > + *
> > + * ::
> > + *
> > + *       rule indices in last field:    0    1
> > + *       map to elements:             0x42  0x66  
> 
> Should this be instead?
> 
>             rule indices in last field:    0    1
>             map to elements:             0x66  0x42
> 
> If the resulting bitmap is 0x2, then this is actually pointing to
> rule index 1 in this lookup table, that is the 2048.

Right! Good catch, thanks.

I swapped the values also in the "insertion" example above. For some
reason, this was correct in the equivalent example of the stand-alone
implementation:
	https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n162

> > + *      the matching element is at 0x42.  
> 
> Hence, the matching 0x42 element.
> 
> Otherwise, I don't understand how to interpret the "result bitmap". I
> thought this contains the matching pipapo rule index that is expressed
> as a bitmask.

Yes, that's correct. Let me know if you want me to send a patch or if
you'd rather fix it.

> [...]
> > +static int pipapo_refill(unsigned long *map, int len, int rules,
> > +			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
> > +			 bool match_only)
> > +{
> > +	unsigned long bitset;
> > +	int k, ret = -1;
> > +
> > +	for (k = 0; k < len; k++) {
> > +		bitset = map[k];
> > +		while (bitset) {
> > +			unsigned long t = bitset & -bitset;
> > +			int r = __builtin_ctzl(bitset);
> > +			int i = k * BITS_PER_LONG + r;
> > +
> > +			if (unlikely(i >= rules)) {
> > +				map[k] = 0;
> > +				return -1;
> > +			}
> > +
> > +			if (unlikely(match_only)) {  
> 
> Not a big issue, but this branch holds true for the last field, my
> understanding for unlikely() is that it should be used for paths where
> 99.99...% is not going to happen. Not a big issue, just that when
> reading the code I got confused this is actually likely to happen.

You're right, I wanted to make sure we avoid branching for the "common"
case (this early return happens just once), but this is probably an
abuse. I'll look into a more acceptable way to achieve this.

> > +				bitmap_clear(map, i, 1);
> > +				return i;
> > +			}
> > +
> > +			ret = 0;
> > +
> > +			bitmap_set(dst, mt[i].to, mt[i].n);
> > +
> > +			bitset ^= t;
> > +		}
> > +		map[k] = 0;
> > +	}
> > +
> > +	return ret;  
> > +}

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
  2020-02-10 15:10     ` Stefano Brivio
@ 2020-02-14 18:16       ` Pablo Neira Ayuso
  2020-02-14 19:42         ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: Pablo Neira Ayuso @ 2020-02-14 18:16 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Mon, Feb 10, 2020 at 04:10:47PM +0100, Stefano Brivio wrote:
> On Fri, 7 Feb 2020 12:23:08 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> > Did you test with 8-bit grouping instead of 4-bit?
> 
> Yes, at the very beginning, not with the final implementation. It was
> somewhat faster (on x86_64, I don't remember how much) for small
> numbers of rules, then I thought we would use too much memory, because:
> 
> > Assuming a concatenation of 12 bytes (each field 4 bytes, hence 3
> > fields):
> > 
> > * Using 4-bits groups: the number of buckets is 2^4 = 16 multiplied
> >   by the bucket word (assuming one long word, 8 bytes, 64 pipapo
> >   rules) is 16 * 8 = 128 bytes per group-row in the lookup table. Then,
> >   the number of group-rows is 8 given that 32 bits, then 32 / 4 = 8
> >   group-rows.
> > 
> >   8 * 128 bytes = 1024 bytes per lookup table.
> > 
> >   Assuming 3 fields, then this is 1024 * 3 = 3072 bytes.
> > 
> > * Using 8-bits groups: 2^8 = 256, then 256 * 8 = 2048 bytes per
> >   group-row. Then, 32 / 8 = 4 group-rows in total.
> > 
> >   4 * 2048 bytes = 8192 bytes per lookup table.
> > 
> >   Therefore, 3 * 8192 = 24576 bytes. Still rather small.
> > 
> > This is missing the mapping table that links the lookup tables in the
> > memory counting. And I'm assuming that the number of pipapo rules in
> > the lookup table fits into 64-bits bucket long word.
> 
> ...the (reasonable?) worst case I wanted to cover was two IPv6
> addresses, one port, one MAC address (in ipset terms
> "net,net,port,mac"), with 2 ^ 16 unique, non-overlapping entries each
> (or ranges expanding to that amount of rules), because that's what
> (single, non-concatenated) ipset "bitmap" types can do.

I see, so you were considering the worst case. You're assuming each
element takes exactly one pipapo rule, so it's 2^16 elements, correct?

You refer to a property that says that you can split a range into a
2*n netmasks IIRC. Do you know what is the worst case when splitting
ranges?

There is no ipset set like this, but I agree the use case might come up.

> Also ignoring the mapping table (it's "small"), with 4-bit buckets:
> 
> - for the IPv6 addresses, we have 16 buckets, each 2 ^ 16
>   bits wide, and 32 groups (128 bits / 4 bits), that is, 8MiB in
>   total
>
> - for the MAC address, 16 buckets, each 2 ^ 16 bits wide, and 12
>   groups, 1.5MiB
>
> - for the port, 16 buckets, each 2 ^ 12 bits wide, 2 groups, 0.25MiB
>
> that is, 9.75MiB.
> 
> With 8-bit buckets: we can just multiply everything by 8 (that is,
> 2 ^ 8 / 2 ^ 4 / 2, because we have 2 ^ (8 - 4) times the buckets, with
> half the groups), 78MiB.

Yes, this is large. Compared to a hashtable with 2^16 entries, then
it's 2^17 hashtable buckets and using struct hlist_head, this is 2
MBytes. Then, each hlist_node is 16 bytes, so 2^16 * 16 ~= 1 MByte.
That is 3 MBytes if my maths are fine.

Just mentioning this to find what could be considered a reasonable
amount of memory to consume. ~10 MBytes is slightly more than that, but
I agree you selected a reasonable worst case through this "complex tuple".

> And that started feeling like "a lot". However, I'm probably overdoing
> with the worst case -- this was just to explain what brought me to the
> 4-bit choice, now I start doubting about it.
> 
> > Anyway, my understanding is that the more bits you use for grouping,
> > the larger the lookup table becomes.
> > 
> > Still, both look very small in terms of memory consumption for these
> > days.
> > 
> > I'm just telling this because the C implementation can probably get
> > better numbers at the cost of consuming more memory? Probably do this
> > at some point?
> 
> Another topic is the additional amount of cachelines we would use. I
> don't expect that effect to be visible, but I might be wrong.
> 
> So yes, I think it's definitely worth a try, thanks for the hint! I'll
> try to look into this soon and test it on a few archs (something with
> small cachelines, say MIPS r2k, would be worth checking, too).
> 
> We could even consider to dynamically adjust group size depending on
> the set size, I don't know yet if that gets too convoluted.

Yes, keeping this maintainable is a good idea.

The per-cpu scratch index is only required if we cannot fit the
"result bitmap" into the stack, right?

Probably up to 256 bytes result bitmap in the stack is reasonable?
That makes 8192 pipapo rules. There will be no need to disable bh and
make use of the percpu scratchpad area in that case.

If adjusting the code to deal with variable length "pipapo word" size
is not too convoluted, then you could just deal with the variable word
size from the insert / delete / get (slow) path and register one
lookup function for the version that is optimized for this pipapo word
size.

Probably adding helper functions to deal with pipapo words would help
to prepare for such an update in the future. There is the ->estimate
function that allows calculating the best word size depending on
all the information it gets from the set definition.

Just to keep this in the radar for future work.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
  2020-02-14 18:16       ` Pablo Neira Ayuso
@ 2020-02-14 19:42         ` Stefano Brivio
  2020-02-14 20:42           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2020-02-14 19:42 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Fri, 14 Feb 2020 19:16:34 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Mon, Feb 10, 2020 at 04:10:47PM +0100, Stefano Brivio wrote:
> > On Fri, 7 Feb 2020 12:23:08 +0100
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> [...]
> > > Did you test with 8-bit grouping instead of 4-bit?
> > 
> > Yes, at the very beginning, not with the final implementation. It was
> > somewhat faster (on x86_64, I don't remember how much) for small
> > numbers of rules, then I thought we would use too much memory, because:
> >   
> > > Assuming a concatenation of 12 bytes (each field 4 bytes, hence 3
> > > fields):
> > > 
> > > * Using 4-bits groups: the number of buckets is 2^4 = 16 multiplied
> > >   by the bucket word (assuming one long word, 8 bytes, 64 pipapo
> > >   rules) is 16 * 8 = 128 bytes per group-row in the lookup table. Then,
> > >   the number of group-rows is 8 given that 32 bits, then 32 / 4 = 8
> > >   group-rows.
> > > 
> > >   8 * 128 bytes = 1024 bytes per lookup table.
> > > 
> > >   Assuming 3 fields, then this is 1024 * 3 = 3072 bytes.
> > > 
> > > * Using 8-bits groups: 2^8 = 256, then 256 * 8 = 2048 bytes per
> > >   group-row. Then, 32 / 8 = 4 group-rows in total.
> > > 
> > >   4 * 2048 bytes = 8192 bytes per lookup table.
> > > 
> > >   Therefore, 3 * 8192 = 24576 bytes. Still rather small.
> > > 
> > > This is missing the mapping table that links the lookup tables in the
> > > memory counting. And I'm assuming that the number of pipapo rules in
> > > the lookup table fits into 64-bits bucket long word.  
> > 
> > ...the (reasonable?) worst case I wanted to cover was two IPv6
> > addresses, one port, one MAC address (in ipset terms
> > "net,net,port,mac"), with 2 ^ 16 unique, non-overlapping entries each
> > (or ranges expanding to that amount of rules), because that's what
> > (single, non-concatenated) ipset "bitmap" types can do.  
> 
> I see, so you were considering the worst case. You're assuming each
> element takes exactly one pipapo rule, so it's 2^16 elements, correct?

Yes, right: the ranges I considered would be disjoint and of size one
(single, non-overlapping addresses). I start thinking my worst case is
nowhere close to being reasonable, actually. :)

> You refer to a property that says that you can split a range into
> 2*n netmasks IIRC. Do you know what the worst case is when splitting
> ranges?

I'm not sure I got your question: that is exactly the worst case, i.e.
we can have _up to_ 2 * n netmasks (hence rules) given a range of n
bits. There's an additional upper bound on this, given by the address
space, but single fields in a concatenation can overlap.

For example, we can have up to 128 rules for an IPv6 range where at
least 64 bits differ between the endpoints, and which would contain
2 ^ 64 addresses. Or, say, the IPv4 range 1.2.3.4 - 255.255.0.2 is
expressed by 42 rules.

By the way, 0.0.0.1 - 255.255.255.254 takes 62 rules, so we can
*probably* say it's 2 * n - 2, but I don't have a formal proof for that.
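The figures above match the usual greedy decomposition of an inclusive range into maximal power-of-two aligned blocks (netmasks). A stand-alone sketch for IPv4 (not code from the set implementation; the function name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Count the netmasks produced by greedily covering the inclusive range
 * [start, end] with the largest power-of-two block that is both aligned
 * at the current start and contained in the remaining range -- i.e. the
 * number of rules a range expands to under the 2*n bound discussed
 * above. 64-bit arithmetic avoids overflow at the top of the space.
 */
static int range_prefix_count(uint32_t start, uint32_t end)
{
	uint64_t s = start, e = end;
	int count = 0;

	while (s <= e) {
		/* Largest block aligned at s; 2^32 when s == 0 */
		uint64_t size = s ? (s & (0ULL - s)) : (1ULL << 32);
		uint64_t span = e - s + 1;

		while (size > span)	/* shrink until it fits the range */
			size >>= 1;

		s += size;
		count++;
	}
	return count;
}
```

This reproduces the 42 rules for 1.2.3.4 - 255.255.0.2 and the 62 rules for 0.0.0.1 - 255.255.255.254 mentioned above.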

I have a couple of ways in mind to get that down to n / 2, but it's not
straightforward and it will take me some time (assuming it makes
sense). For the n bound, we can introduce negations (proof in
literature), and I have some kind of ugly prototype. For the n / 2
bound, I'd need some auxiliary data structure to keep insertion
invertible.

In practice, the "average" case is much less, but to define it we would
first need to agree on what are the actual components of the
multivariate distribution... size and start? Is it a Poisson
distribution then? After spending some time on this and disagreeing
with myself I'd shyly recommend to skip the topic. :)

> There is no ipset set like this, but I agree the use case might come up.

Actually, for ipset, a "net,port,net,port" type was proposed
(netfilter-devel <20181216213039.399-1-oliver@uptheinter.net>), but when
József enquired about the intended use case, none was given. So maybe
this whole "net,net,port,mac" story makes even less sense.

> > Also ignoring the mapping table (it's "small"), with 4-bit buckets:
> > 
> > - for the IPv6 addresses, we have 16 buckets, each 2 ^ 16
> >   bits wide, and 32 groups (128 bits / 4 bits), that is, 8MiB in
> >   total
> >
> > - for the MAC address, 16 buckets, each 2 ^ 16 bits wide, and 12
> >   groups, 1.5MiB
> >
> > - for the port, 16 buckets, each 2 ^ 12 bits wide, 2 groups, 0.25MiB
> >
> > that is, 9.75MiB.
> > 
> > With 8-bit buckets: we can just multiply everything by 8 (that is,
> > 2 ^ 8 / 2 ^ 4 / 2, because we have 2 ^ (8 - 4) times the buckets, with
> > half the groups), 78MiB.  
> 
> Yes, this is large. Compared to a hashtable with 2^16 entries, then
> it's 2^17 hashtable buckets and using struct hlist_head, this is 2
> MBytes. Then, each hlist_node is 16 bytes, so 2^16 * 16 ~= 1 MByte.
> That is 3 MBytes if my maths are fine.

Sounds correct to me as well.
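As an aside, the bucket arithmetic quoted above can be sketched in a few
lines (assuming, as in those figures, one bitmap of max_rules bits per
bucket, and ignoring the mapping table); the MAC case and the "multiply
everything by 8" factor come out as quoted:

```python
def table_bytes(field_bits, group_bits, max_rules):
    # One field's lookup table: field_bits / group_bits groups,
    # 2^group_bits buckets per group, one max_rules-bit bitmap per bucket
    groups = field_bits // group_bits
    return groups * (1 << group_bits) * max_rules // 8

mac_4bit = table_bytes(48, 4, 1 << 16)  # 12 groups of 16 buckets
mac_8bit = table_bytes(48, 8, 1 << 16)  #  6 groups of 256 buckets

print(mac_4bit / 2 ** 20)    # 1.5 (MiB, as in the worst case above)
print(mac_8bit // mac_4bit)  # 8, i.e. 2^(8 - 4) times the buckets, half the groups
```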

> Just telling this to find what could be considered a reasonable
> amount of memory to consume. ~10 MBytes is slightly more than that, but
> I agree you selected a reasonable worst case through this "complex tuple".
> 
> > And that started feeling like "a lot". However, I'm probably overdoing
> > with the worst case -- this was just to explain what brought me to the
> > 4-bit choice, now I start doubting about it.
> >   
> > > Anyway, my understanding is that the more bits you use for grouping,
> > > the larger the lookup table becomes.
> > > 
> > > Still, both look very small in terms of memory consumption for these
> > > days.
> > > 
> > > I'm just telling this because the C implementation can probably get
> > > better numbers at the cost of consuming more memory? Probably do this
> > > at some point?  
> > 
> > Another topic is the additional amount of cachelines we would use. I
> > don't expect that effect to be visible, but I might be wrong.
> > 
> > So yes, I think it's definitely worth a try, thanks for the hint! I'll
> > try to look into this soon and test it on a few archs (something with
> > small cachelines, say MIPS r2k, would be worth checking, too).
> > 
> > We could even consider to dynamically adjust group size depending on
> > the set size, I don't know yet if that gets too convoluted.  
> 
> Yes, keeping this maintainable is a good idea.
> 
> The per-cpu scratch index is only required if we cannot fit in the
> "result bitmap" into the stack, right?

Right.

> Probably up to 256 bytes result bitmap in the stack is reasonable?
> That makes 8192 pipapo rules. There will be no need to disable bh and
> make use of the percpu scratchpad area in that case.

Right -- the question is whether that would mean yet another
implementation for the lookup function.

> If adjusting the code to deal with variable length "pipapo word" size
> is not too convoluted, then you could just deal with the variable word
> size from the insert / delete / get (slow) path and register one
> lookup function for the version that is optimized for this pipapo word
> size.

Yes, I like this a lot -- we would also need one function to rebuild
tables when the word size changes, but that sounds almost trivial.
Changes for the slow path are actually rather simple.

Still, I start doubting quite heavily that my original worst case is
reasonable. If we stick to the one you mentioned, or even something in
between, it makes no sense to keep 4-bit buckets.

By the way, I went ahead and tried the 8-bit bucket version of the C
implementation only, on my usual x86_64 box (one thread, AMD Epyc 7351).
I think it's worth it:

                4-bit       8-bit
net,port
 1000 entries   2304165pps  2901299pps
port,net
 100 entries    4131471pps  4751247pps
net6,port
 1000 entries   1092557pps  1651037pps
port,proto
 30000 entries   284147pps   449665pps
net6,port,mac
 10 entries     2082880pps  2762291pps
net6,port,mac,proto
 1000 entries    783810pps  1195823pps
net,mac
 1000 entries   1279122pps  1934003pps

I would now proceed to extend this to the AVX2 implementation and (once
I finish it) to the NEON one; I actually expect bigger gains there.

> Probably adding a helper function to deal with pipapo words would help
> to prepare for such an update in the future. There is the ->estimate
> function that allows calculating the best word size depending on
> all the information it gets from the set definition.

Hm, I really think it should be kind of painless to make this dynamic
on insertion/deletion.

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
  2020-02-14 19:42         ` Stefano Brivio
@ 2020-02-14 20:42           ` Pablo Neira Ayuso
  2020-02-14 23:06             ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: Pablo Neira Ayuso @ 2020-02-14 20:42 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Fri, Feb 14, 2020 at 08:42:13PM +0100, Stefano Brivio wrote:
> On Fri, 14 Feb 2020 19:16:34 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> > You refer to a property that says that you can split a range into a
> > 2*n netmasks IIRC. Do you know what is the worst case when splitting
> > ranges?
> 
> I'm not sure I got your question: that is exactly the worst case, i.e.
> we can have _up to_ 2 * n netmasks (hence rules) given a range of n
> bits. There's an additional upper bound on this, given by the address
> space, but single fields in a concatenation can overlap.
> 
> For example, we can have up to 128 rules for an IPv6 range where at
> least 64 bits differ between the endpoints, and which would contain
> 2 ^ 64 addresses. Or, say, the IPv4 range 1.2.3.4 - 255.255.0.2 is
> expressed by 42 rules.
> 
> By the way, 0.0.0.1 - 255.255.255.254 takes 62 rules, so we can
> *probably* say it's 2 * n - 2, but I don't have a formal proof for that.

By "splitting" I was actually referring to "expanding", so you're
replying here to my worst-case range-to-rules expansion question.

> I have a couple of ways in mind to get that down to n / 2, but it's not
> straightforward and it will take me some time (assuming it makes
> sense). For the n bound, we can introduce negations (proof in
> literature), and I have some kind of ugly prototype. For the n / 2
> bound, I'd need some auxiliary data structure to keep insertion
> invertible.

OK, so there is room to improve the "rule expansion" logic. I didn't
spend much time on that front yet.

> In practice, the "average" case is much less, but to define it we would
> first need to agree on what are the actual components of the
> multivariate distribution... size and start? Is it a Poisson
> distribution then? After spending some time on this and disagreeing
> with myself I'd shyly recommend to skip the topic. :)

Yes, I agree; sticking to something relatively simple and good is just
fine.

> > There is no ipset set like this, but I agree a use case might come up.  
> 
> Actually, for ipset, a "net,port,net,port" type was proposed
> (netfilter-devel <20181216213039.399-1-oliver@uptheinter.net>), but when
> József enquired about the intended use case, none was given. So maybe
> this whole "net,net,port,mac" story makes even less sense.

Would it make sense to you to restrict pipapo to 3 fields until there
is someone with a usecase for this?

[...]
> > The per-cpu scratch index is only required if we cannot fit in the
> > "result bitmap" into the stack, right?
> 
> Right.
> 
> > Probably up to 256 bytes result bitmap in the stack is reasonable?
> > That makes 8192 pipapo rules. There will be no need to disable bh and
> > make use of the percpu scratchpad area in that case.
> 
> Right -- the question is whether that would mean yet another
> implementation for the lookup function.

This would need another lookup function that can be selected from the
control plane path. The set size and the worst-case range-to-rule
expansion can tell us whether it would fit into the stack. It would be
just one extra lookup function for this case, ~80-100 LOC.

> > If adjusting the code to deal with variable length "pipapo word" size
> > is not too convoluted, then you could just deal with the variable word
> > size from the insert / delete / get (slow) path and register one
> > lookup function for the version that is optimized for this pipapo word
> > size.
> 
> Yes, I like this a lot -- we would also need one function to rebuild
> tables when the word size changes, but that sounds almost trivial.
> Changes for the slow path are actually rather simple.
>
> Still, I start doubting quite heavily that my original worst case is
> reasonable. If we stick to the one you mentioned, or even something in
> between, it makes no sense to keep 4-bit buckets.

OK, then moving to 8-bits will probably remove a bit of code which is
dealing with "nibbles".

> By the way, I went ahead and tried the 8-bit bucket version of the C
> implementation only, on my usual x86_64 box (one thread, AMD Epyc 7351).
> I think it's worth it:
> 
>                 4-bit       8-bit
> net,port
>  1000 entries   2304165pps  2901299pps
> port,net
>  100 entries    4131471pps  4751247pps
> net6,port
>  1000 entries   1092557pps  1651037pps
> port,proto
>  30000 entries   284147pps   449665pps
> net6,port,mac
>  10 entries     2082880pps  2762291pps
> net6,port,mac,proto
>  1000 entries    783810pps  1195823pps
> net,mac
>  1000 entries   1279122pps  1934003pps

Assuming the same concatenation type, larger bucket size makes pps
drop in the C implementation?

> I would now proceed extending this to the AVX2 implementation and (once
> I finish it) to the NEON one, I actually expect bigger gains there.

Good. BTW, probably you can add a new NFT_SET_CLASS_JIT class that
comes before NFT_SET_CLASS_O_1 to make the routine that selects the
set pick the jit version instead.

> > Probably adding a helper function to deal with pipapo words would help
> > to prepare for such an update in the future. There is the ->estimate
> > function that allows calculating the best word size depending on
> > all the information it gets from the set definition.
> 
> Hm, I really think it should be kind of painless to make this dynamic
> on insertion/deletion.

OK, good. How would you like to proceed?

Thanks!


* Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
  2020-02-14 20:42           ` Pablo Neira Ayuso
@ 2020-02-14 23:06             ` Stefano Brivio
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-02-14 23:06 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Florian Westphal, Kadlecsik József,
	Eric Garver, Phil Sutter

On Fri, 14 Feb 2020 21:42:25 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Fri, Feb 14, 2020 at 08:42:13PM +0100, Stefano Brivio wrote:
> > On Fri, 14 Feb 2020 19:16:34 +0100
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> [...]
> > > You refer to a property that says that you can split a range into a
> > > 2*n netmasks IIRC. Do you know what is the worst case when splitting
> > > ranges?  
> > 
> > I'm not sure I got your question: that is exactly the worst case, i.e.
> > we can have _up to_ 2 * n netmasks (hence rules) given a range of n
> > bits. There's an additional upper bound on this, given by the address
> > space, but single fields in a concatenation can overlap.
> > 
> > For example, we can have up to 128 rules for an IPv6 range where at
> > least 64 bits differ between the endpoints, and which would contain
> > 2 ^ 64 addresses. Or, say, the IPv4 range 1.2.3.4 - 255.255.0.2 is
> > expressed by 42 rules.
> > 
> > By the way, 0.0.0.1 - 255.255.255.254 takes 62 rules, so we can
> > *probably* say it's 2 * n - 2, but I don't have a formal proof for that.  
> 
> By "splitting" I was actually referring to "expanding", so you're
> replying here to my worst-case range-to-rules expansion question.
> 
> > I have a couple of ways in mind to get that down to n / 2, but it's not
> > straightforward and it will take me some time (assuming it makes
> > sense). For the n bound, we can introduce negations (proof in
> > literature), and I have some kind of ugly prototype. For the n / 2
> > bound, I'd need some auxiliary data structure to keep insertion
> > invertible.  
> 
> OK, so there is room to improve the "rule expansion" logic. I didn't
> spend much time on that front yet.

I wrote some (barely understandable if at all) notes here:
	https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n271

but I can't guarantee at this stage that it can all be implemented, or
at least not necessarily in that way. Should you ever (try to) read it,
take it as a rough, buggy draft.

> > In practice, the "average" case is much less, but to define it we would
> > first need to agree on what are the actual components of the
> > multivariate distribution... size and start? Is it a Poisson
> > distribution then? After spending some time on this and disagreeing
> > with myself I'd shyly recommend to skip the topic. :)  
> 
> Yes, I agree; sticking to something relatively simple and good is just
> fine.
> 
> > > There is no ipset set like this, but I agree a use case might come up.  
> > 
> > Actually, for ipset, a "net,port,net,port" type was proposed
> > (netfilter-devel <20181216213039.399-1-oliver@uptheinter.net>), but when
> > József enquired about the intended use case, none was given. So maybe
> > this whole "net,net,port,mac" story makes even less sense.  
> 
> Would it make sense to you to restrict pipapo to 3 fields until there
> is someone with a usecase for this?

I wouldn't. I'd rather implement the dynamic switch of bucket size if
this is the alternative. I think the current version is rather generic,
doesn't use too much memory, and is also reasonably tested, so I don't
feel like taking working functionality away.

> [...]
> > > The per-cpu scratch index is only required if we cannot fit in the
> > > "result bitmap" into the stack, right?  
> > 
> > Right.
> >   
> > > Probably up to 256 bytes result bitmap in the stack is reasonable?
> > > That makes 8192 pipapo rules. There will be no need to disable bh and
> > > make use of the percpu scratchpad area in that case.  
> > 
> > Right -- the question is whether that would mean yet another
> > implementation for the lookup function.  
> 
> This would need another lookup function that can be selected from the
> control plane path. The set size and the worst-case range-to-rule
> expansion can tell us whether it would fit into the stack. It would be
> just one extra lookup function for this case, ~80-100 LOC.

Ah, you're right, it's not much. On the other hand, the advantages seem
to be:

- dropping the local_bh_{disable,enable}() pair: I can't find numbers
  right now, but the difference in pps between v2 (when Florian pointed
  out it was needed) and v1 wasn't statistically significant

- avoiding the access to per-cpu allocated areas, here I have
  before/after numbers just for net,mac with 1000 entries on the AVX2
  implementation (AMD 7351, one thread), it didn't look significant,
  3.99Mpps instead of 3.90Mpps. Maybe it would make a difference when
  tested on more threads though (as a reference, same CPU, 10 threads,
  as of v2: 38.02Mpps) or with smaller L2

- memory usage (and how that affects locality on some archs), but if we
  do this for cases with small numbers of entries, it's probably not a
  big difference either

Anyway, I guess it's worth a try -- I'll try to find some time for it.

> > > If adjusting the code to deal with variable length "pipapo word" size
> > > is not too convoluted, then you could just deal with the variable word
> > > size from the insert / delete / get (slow) path and register one
> > > lookup function for the version that is optimized for this pipapo word
> > > size.  
> > 
> > Yes, I like this a lot -- we would also need one function to rebuild
> > tables when the word size changes, but that sounds almost trivial.
> > Changes for the slow path are actually rather simple.
> >
> > Still, I start doubting quite heavily that my original worst case is
> > reasonable. If we stick to the one you mentioned, or even something in
> > between, it makes no sense to keep 4-bit buckets.  
> 
> OK, then moving to 8-bits will probably remove a bit of code which is
> dealing with "nibbles".

Right now I have:
	 1 file changed, 16 insertions(+), 37 deletions(-)

but okay, I'll give the dynamic switch a try, first.

> > By the way, I went ahead and tried the 8-bit bucket version of the C
> > implementation only, on my usual x86_64 box (one thread, AMD Epyc 7351).
> > I think it's worth it:
> > 
> >                 4-bit       8-bit
> > net,port
> >  1000 entries   2304165pps  2901299pps
> > port,net
> >  100 entries    4131471pps  4751247pps
> > net6,port
> >  1000 entries   1092557pps  1651037pps
> > port,proto
> >  30000 entries   284147pps   449665pps
> > net6,port,mac
> >  10 entries     2082880pps  2762291pps
> > net6,port,mac,proto
> >  1000 entries    783810pps  1195823pps
> > net,mac
> >  1000 entries   1279122pps  1934003pps  
> 
> Assuming the same concatenation type, larger bucket size makes pps
> drop in the C implementation?

Uh, no, they go up with larger buckets; in the figures above we have
+26%, +15%, +51%, +58%, +33%, +53%, +51%.
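For reference, those relative gains come straight out of the (4-bit,
8-bit) pps figures in the table quoted above:

```python
# (4-bit, 8-bit) pps figures from the table quoted above
results = {
    "net,port":            (2304165, 2901299),
    "port,net":            (4131471, 4751247),
    "net6,port":           (1092557, 1651037),
    "port,proto":          (284147, 449665),
    "net6,port,mac":       (2082880, 2762291),
    "net6,port,mac,proto": (783810, 1195823),
    "net,mac":             (1279122, 1934003),
}

# Percentage gain of the 8-bit version over the 4-bit one, per set type
gains = {name: round((pps8 / pps4 - 1) * 100)
         for name, (pps4, pps8) in results.items()}

for name, gain in gains.items():
    print(f"+{gain}%  {name}")
```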

> > I would now proceed extending this to the AVX2 implementation and (once
> > I finish it) to the NEON one, I actually expect bigger gains there.  
> 
> Good. BTW, probably you can add a new NFT_SET_CLASS_JIT class that
> comes before NFT_SET_CLASS_O_1 to make the routine that selects the
> set pick the jit version instead.

It's not exactly JIT though :) it's hand-written assembly. It's coming
up a bit nicer for NEON as I can use compiler intrinsics there.

Maybe I didn't understand your point, but I don't think this is needed.
In 9/9 of v4 I simply had, in nft_pipapo_avx2_estimate():

	if (!boot_cpu_has(X86_FEATURE_AVX2) || !boot_cpu_has(X86_FEATURE_AVX))
		return false;

> > > Probably adding a helper function to deal with pipapo words would help
> > > to prepare for such an update in the future. There is the ->estimate
> > > function that allows calculating the best word size depending on
> > > all the information it gets from the set definition.  
> > 
> > Hm, I really think it should be kind of painless to make this dynamic
> > on insertion/deletion.  
> 
> OK, good. How would you like to proceed?

Let me give a shot at the dynamic bucket size switching. If that looks
reasonable, I'd proceed extending that to the AVX2 implementation,
finish the NEON implementation including it right away, and prepare a
series. If it doesn't, I'll report back :)

-- 
Stefano



* Re: [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges
  2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
                   ` (9 preceding siblings ...)
  2020-01-27  8:20 ` [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
@ 2020-02-20 10:52 ` Phil Sutter
  2020-02-20 11:04   ` Stefano Brivio
  10 siblings, 1 reply; 20+ messages in thread
From: Phil Sutter @ 2020-02-20 10:52 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Pablo Neira Ayuso, netfilter-devel, Florian Westphal,
	Kadlecsik József, Eric Garver

Hi Stefano,

When playing with adding multiple elements, I suddenly noticed a
disturbance in the force (general protection fault). Here's a
reproducer:

| $NFT -f - <<EOF
| table t {
|         set s {
|                 type ipv4_addr . inet_service
|                 flags interval
|         }
| }
| EOF
| 
| $NFT add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }'
| $NFT flush set t s
| $NFT add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }'

It is pretty reliable, though it sometimes needs a second call. Looks like
some things are going on in parallel which shouldn't. Here's a typical
last breath:

[   71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI
[   71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192
[   71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
[   71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[   71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
[   71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
[   71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
[   71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
[   71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
[   71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
[   71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
[   71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
[   71.334596] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[   71.335780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
[   71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   71.339718] Call Trace:
[   71.340093]  nft_pipapo_destroy+0x7a/0x170 [nf_tables_set]
[   71.340973]  nft_set_destroy+0x20/0x50 [nf_tables]
[   71.341879]  nf_tables_trans_destroy_work+0x246/0x260 [nf_tables]
[   71.342916]  process_one_work+0x1d5/0x3c0
[   71.343601]  worker_thread+0x4a/0x3c0
[   71.344229]  kthread+0xfb/0x130
[   71.344780]  ? process_one_work+0x3c0/0x3c0
[   71.345477]  ? kthread_park+0x90/0x90
[   71.346129]  ret_from_fork+0x35/0x40
[   71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink]
[   71.348153] ---[ end trace 2eaa8149ca759bcc ]---
[   71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
[   71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
[   71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
[   71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
[   71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
[   71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
[   71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
[   71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
[   71.350025] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[   71.350026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
[   71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   71.350030] Kernel panic - not syncing: Fatal exception
[   71.350412] Kernel Offset: disabled
[   71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]---

Cheers, Phil


* Re: [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges
  2020-02-20 10:52 ` Phil Sutter
@ 2020-02-20 11:04   ` Stefano Brivio
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2020-02-20 11:04 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Pablo Neira Ayuso, netfilter-devel, Florian Westphal,
	Kadlecsik József, Eric Garver

Hi Phil,

On Thu, 20 Feb 2020 11:52:41 +0100
Phil Sutter <phil@nwl.cc> wrote:

> Hi Stefano,
> 
> When playing with adding multiple elements, I suddenly noticed a
> disturbance in the force (general protection fault). Here's a
> reproducer:
> 
> | $NFT -f - <<EOF
> | table t {
> |         set s {
> |                 type ipv4_addr . inet_service
> |                 flags interval
> |         }
> | }
> | EOF
> | 
> | $NFT add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }'
> | $NFT flush set t s
> | $NFT add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }'
> 
> It is pretty reliable, though it sometimes needs a second call. Looks like
> some things are going on in parallel which shouldn't. Here's a typical
> last breath:
> 
> [   71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI
> [   71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192
> [   71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
> [   71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
> [   71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]

Ouch, thanks for reporting, I'll check in a few hours.

-- 
Stefano



end of thread, back to index

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 1/9] netfilter: nf_tables: add nft_setelem_parse_key() Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 2/9] netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 3/9] netfilter: nf_tables: Support for sets with multiple ranged fields Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 4/9] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
2020-02-07 11:23   ` Pablo Neira Ayuso
2020-02-10 15:10     ` Stefano Brivio
2020-02-14 18:16       ` Pablo Neira Ayuso
2020-02-14 19:42         ` Stefano Brivio
2020-02-14 20:42           ` Pablo Neira Ayuso
2020-02-14 23:06             ` Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 6/9] selftests: netfilter: Introduce tests for sets with range concatenation Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 7/9] nft_set_pipapo: Prepare for vectorised implementation: alignment Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 8/9] nft_set_pipapo: Prepare for vectorised implementation: helpers Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
2020-01-27  6:41   ` kbuild test robot
2020-01-27  8:20 ` [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
2020-02-20 10:52 ` Phil Sutter
2020-02-20 11:04   ` Stefano Brivio
