All of lore.kernel.org
* [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
@ 2023-10-19 20:25 Florian Westphal
  2023-10-23 10:33 ` Lorenzo Bianconi
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Florian Westphal @ 2023-10-19 20:25 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal, Lorenzo Bianconi

This adds a small internal mapping table so that a new bpf (xdp) kfunc
can perform lookups in a flowtable.

I have no intent to push this without nft integration of the xdp program,
this RFC is just to get comments on the general direction because there
is a chicken/egg issue:

As-is, an xdp program has access to the device pointer, but no way to do a
lookup in a flowtable -- there is no way to obtain the needed struct
without wacky stunts.

This would allow such a lookup just from the device address: the bpf
kfunc would call nf_flowtable_by_dev() internally.

Limitation:

A device cannot be added to multiple flowtables, the mapping needs
to be unique.

As for integration with the kernel, there are several options:

1. Auto-add to the dev-xdp table whenever HW offload is requested.

2. Add to the dev-xdp table, but only if the HW offload request fails.
   (softfallback).

3. Add a dedicated 'xdp offload' flag to the UAPI.

3) should not be needed, because userspace already controls this:
   to make it work, userspace needs to attach the xdp program to the
   network device in the first place.

My thinking is to add an xdp-offload flag to the nft grammar only.
It's not needed on the nf uapi side; it would tell nft to attach the xdp
flowtable forward program to the devices listed in the flowtable.

Also, packet flow is altered (qdiscs are bypassed), which is a strong
argument against default usage.

The xdp prog source would be included with nftables.git and nft
would either attach/detach them or ship an extra prog that does this (TBD).

Open questions:

Do we need to support a device in multiple flowtables?  I would like to
avoid this; it likely means the future "xdp" flag in nftables would
be restricted to the "inet" family.  The alternative would be to change the
key to 'device address plus protocol family'; the xdp prog could derive the
family from the packet data.

Timeout handling.  Should the XDP program even bother to refresh the
flowtable timeout?

It might make more sense to intentionally have packets
flow through the normal path periodically so neigh entries are up to date.

Also note that flow_offload_xdp struct likely needs to have a refcount
or genmask so that it integrates with the two-phase commit protocol on
netfilter side
(i.e., we should allow 're-add' because it's needed to make flush+reload
 work).
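
For flush+reload, one way to allow 're-add' would be to make the insert
idempotent per (device, flowtable) pair, e.g. via a small use count rather
than a genmask.  A rough userspace sketch of that idea (plain C, no locking
or RCU; all names are hypothetical, this is not the patch's code):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* stand-in for the real struct nf_flowtable */
struct nf_flowtable { int id; };

struct flow_offload_xdp_model {
	unsigned long net_device_addr;
	struct nf_flowtable *ft;
	unsigned int use;	/* how many times this mapping was bound */
	struct flow_offload_xdp_model *next;
};

static struct flow_offload_xdp_model *map_head;

/* Idempotent insert: re-binding the same dev/ft pair bumps the use
 * count, so a flush+reload of the same ruleset does not fail.  A
 * different flowtable for the same device is still rejected. */
static int xdp_map_insert(unsigned long dev_addr, struct nf_flowtable *ft)
{
	struct flow_offload_xdp_model *cur;

	for (cur = map_head; cur; cur = cur->next) {
		if (cur->net_device_addr != dev_addr)
			continue;
		if (cur->ft != ft)
			return -EEXIST;
		cur->use++;
		return 0;
	}

	cur = calloc(1, sizeof(*cur));
	if (!cur)
		return -ENOMEM;
	cur->net_device_addr = dev_addr;
	cur->ft = ft;
	cur->use = 1;
	cur->next = map_head;
	map_head = cur;
	return 0;
}

/* Remove drops one use; the entry only disappears on the last unbind. */
static void xdp_map_remove(unsigned long dev_addr)
{
	struct flow_offload_xdp_model **pp = &map_head, *cur;

	while ((cur = *pp)) {
		if (cur->net_device_addr == dev_addr) {
			if (--cur->use == 0) {
				*pp = cur->next;
				free(cur);
			}
			return;
		}
		pp = &cur->next;
	}
}

static struct nf_flowtable *xdp_map_lookup(unsigned long dev_addr)
{
	struct flow_offload_xdp_model *cur;

	for (cur = map_head; cur; cur = cur->next)
		if (cur->net_device_addr == dev_addr)
			return cur->ft;
	return NULL;
}
```

A genmask-based variant would instead tag entries with the generation and
drop stale ones at commit time, closer to how nf_tables handles its other
objects.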

Not SoB, too raw for my taste.

CC: Lorenzo Bianconi <lorenzo@kernel.org>
---
 net/netfilter/nf_flow_table_offload.c | 131 +++++++++++++++++++++++++-
 1 file changed, 130 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index a010b25076ca..10313d296a8a 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -17,6 +17,92 @@ static struct workqueue_struct *nf_flow_offload_add_wq;
 static struct workqueue_struct *nf_flow_offload_del_wq;
 static struct workqueue_struct *nf_flow_offload_stats_wq;
 
+struct flow_offload_xdp {
+	struct hlist_node hnode;
+
+	unsigned long net_device_addr;
+	struct nf_flowtable *ft;
+
+	struct rcu_head	rcuhead;
+};
+
+#define NF_XDP_HT_BITS	4
+static DEFINE_HASHTABLE(nf_xdp_hashtable, NF_XDP_HT_BITS);
+static DEFINE_MUTEX(nf_xdp_hashtable_lock);
+
+/* caller must hold rcu read lock */
+struct nf_flowtable *nf_flowtable_by_dev(const struct net_device *dev)
+{
+	unsigned long key = (unsigned long)dev;
+	const struct flow_offload_xdp *cur;
+
+	hash_for_each_possible_rcu(nf_xdp_hashtable, cur, hnode, key) {
+		if (key == cur->net_device_addr)
+			return cur->ft;
+	}
+
+	return NULL;
+}
+
+static int nf_flowtable_by_dev_insert(struct nf_flowtable *ft,
+				      const struct net_device *dev)
+{
+	unsigned long key = (unsigned long)dev;
+	struct flow_offload_xdp *cur;
+	int err = 0;
+
+	mutex_lock(&nf_xdp_hashtable_lock);
+	hash_for_each_possible(nf_xdp_hashtable, cur, hnode, key) {
+		if (key != cur->net_device_addr)
+			continue;
+		err = -EEXIST;
+		break;
+	}
+
+	if (err == 0) {
+		struct flow_offload_xdp *new;
+
+		new = kzalloc(sizeof(*new), GFP_KERNEL);
+		if (new) {
+			new->net_device_addr = key;
+			new->ft = ft;
+
+			hash_add_rcu(nf_xdp_hashtable, &new->hnode, key);
+		} else {
+			err = -ENOMEM;
+		}
+	}
+
+	mutex_unlock(&nf_xdp_hashtable_lock);
+
+	DEBUG_NET_WARN_ON_ONCE(err == 0 && nf_flowtable_by_dev(dev) != ft);
+
+	return err;
+}
+
+static void nf_flowtable_by_dev_remove(const struct net_device *dev)
+{
+	unsigned long key = (unsigned long)dev;
+	struct flow_offload_xdp *cur;
+	bool found = false;
+
+	mutex_lock(&nf_xdp_hashtable_lock);
+
+	hash_for_each_possible(nf_xdp_hashtable, cur, hnode, key) {
+		if (key != cur->net_device_addr)
+			continue;
+
+		hash_del_rcu(&cur->hnode);
+		kfree_rcu(cur, rcuhead);
+		found = true;
+		break;
+	}
+
+	mutex_unlock(&nf_xdp_hashtable_lock);
+
+	WARN_ON_ONCE(!found);
+}
+
 struct flow_offload_work {
 	struct list_head	list;
 	enum flow_cls_command	cmd;
@@ -1183,6 +1269,38 @@ static int nf_flow_table_offload_cmd(struct flow_block_offload *bo,
 	return 0;
 }
 
+static int nf_flow_offload_xdp_setup(struct nf_flowtable *flowtable,
+				     struct net_device *dev,
+				     enum flow_block_command cmd)
+{
+	switch (cmd) {
+	case FLOW_BLOCK_BIND:
+		return nf_flowtable_by_dev_insert(flowtable, dev);
+	case FLOW_BLOCK_UNBIND:
+		nf_flowtable_by_dev_remove(dev);
+		return 0;
+	}
+
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+static void nf_flow_offload_xdp_cancel(struct nf_flowtable *flowtable,
+				       struct net_device *dev,
+				       enum flow_block_command cmd)
+{
+	switch (cmd) {
+	case FLOW_BLOCK_BIND:
+		nf_flowtable_by_dev_remove(dev);
+		return;
+	case FLOW_BLOCK_UNBIND:
+		/* We do not re-bind in case hw offload would report error
+		 * on *unregister*.
+		 */
+		break;
+	}
+}
+
 int nf_flow_table_offload_setup(struct nf_flowtable *flowtable,
 				struct net_device *dev,
 				enum flow_block_command cmd)
@@ -1191,6 +1309,15 @@ int nf_flow_table_offload_setup(struct nf_flowtable *flowtable,
 	struct flow_block_offload bo;
 	int err;
 
+	/* XXX:
+	 *
+	 * XDP offload could be made 'never fails', as xdp
+	 * frames that don't match are simply passed up to
+	 * normal nf hooks (skb sw flowtable), or to stack.
+	 */
+	if (nf_flow_offload_xdp_setup(flowtable, dev, cmd))
+		return -EBUSY;
+
 	if (!nf_flowtable_hw_offload(flowtable))
 		return 0;
 
@@ -1200,8 +1327,10 @@ int nf_flow_table_offload_setup(struct nf_flowtable *flowtable,
 	else
 		err = nf_flow_table_indr_offload_cmd(&bo, flowtable, dev, cmd,
 						     &extack);
-	if (err < 0)
+	if (err < 0) {
+		nf_flow_offload_xdp_cancel(flowtable, dev, cmd);
 		return err;
+	}
 
 	return nf_flow_table_block_setup(flowtable, &bo, cmd);
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-10-19 20:25 [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload Florian Westphal
@ 2023-10-23 10:33 ` Lorenzo Bianconi
  2023-10-23 11:16   ` Florian Westphal
  2023-11-02  8:30 ` Florian Westphal
  2023-11-02 10:49 ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 11+ messages in thread
From: Lorenzo Bianconi @ 2023-10-23 10:33 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

> This adds a small internal mapping table so that a new bpf (xdp) kfunc
> can perform lookups in a flowtable.
> 
> I have no intent to push this without nft integration of the xdp program,
> this RFC is just to get comments on the general direction because there
> is a chicken/egg issue:
> 
> As-is, an xdp program has access to the device pointer, but no way to do a
> lookup in a flowtable -- there is no way to obtain the needed struct
> without wacky stunts.
> 
> This would allow such a lookup just from the device address: the bpf
> kfunc would call nf_flowtable_by_dev() internally.
> 
> Limitation:
> 
> A device cannot be added to multiple flowtables, the mapping needs
> to be unique.
> 
> As for integration with the kernel, there are several options:
> 
> 1. Auto-add to the dev-xdp table whenever HW offload is requested.
> 
> 2. Add to the dev-xdp table, but only if the HW offload request fails.
>    (softfallback).
> 
> 3. add a dedicated 'xdp offload' flag to UAPI.
> 
> 3) should not be needed, because userspace already controls this:
>    to make it work userspace needs to attach the xdp program to the
>    network device in the first place.
> 
> My thinking is to add an xdp-offload flag to the nft grammar only.
> It's not needed on the nf uapi side; it would tell nft to attach the xdp
> flowtable forward program to the devices listed in the flowtable.
> 
> Also, packet flow is altered (qdiscs are bypassed), which is a strong
> argument against default usage.
> 
> The xdp prog source would be included with nftables.git and nft
> would either attach/detach them or ship an extra prog that does this (TBD).

Hi Florian,

thx for working on this, I tested this patch with the flowtable lookup kfunc
I am working on (code is available here [0]) and it works properly.

Regards,
Lorenzo

[0] https://github.com/LorenzoBianconi/bpf-next/tree/xdp-flowtable-kfunc

> 
> Open questions:
> 
> Do we need to support a device in multiple flowtables?  I would like to
> avoid this; it likely means the future "xdp" flag in nftables would
> be restricted to the "inet" family.  The alternative would be to change the
> key to 'device address plus protocol family'; the xdp prog could derive the
> family from the packet data.
> 
> Timeout handling.  Should the XDP program even bother to refresh the
> flowtable timeout?

I was assuming the flowtable lookup kfunc can take care of it.

> 
> It might make more sense to intentionally have packets
> flow through the normal path periodically so neigh entries are up to date.
> 
> Also note that flow_offload_xdp struct likely needs to have a refcount
> or genmask so that it integrates with the two-phase commit protocol on
> netfilter side
> (i.e., we should allow 're-add' because it's needed to make flush+reload
>  work).
> 
> Not SoB, too raw for my taste.
> 
> CC: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  net/netfilter/nf_flow_table_offload.c | 131 +++++++++++++++++++++++++-
>  1 file changed, 130 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
> index a010b25076ca..10313d296a8a 100644
> --- a/net/netfilter/nf_flow_table_offload.c
> +++ b/net/netfilter/nf_flow_table_offload.c
> @@ -17,6 +17,92 @@ static struct workqueue_struct *nf_flow_offload_add_wq;
>  static struct workqueue_struct *nf_flow_offload_del_wq;
>  static struct workqueue_struct *nf_flow_offload_stats_wq;
>  
> +struct flow_offload_xdp {
> +	struct hlist_node hnode;
> +
> +	unsigned long net_device_addr;
> +	struct nf_flowtable *ft;
> +
> +	struct rcu_head	rcuhead;
> +};
> +
> +#define NF_XDP_HT_BITS	4
> +static DEFINE_HASHTABLE(nf_xdp_hashtable, NF_XDP_HT_BITS);
> +static DEFINE_MUTEX(nf_xdp_hashtable_lock);
> +
> +/* caller must hold rcu read lock */
> +struct nf_flowtable *nf_flowtable_by_dev(const struct net_device *dev)
> +{

I think this routine needs to be added to some include file (e.g.
include/net/netfilter/nf_flow_table.h)

> +	unsigned long key = (unsigned long)dev;
> +	const struct flow_offload_xdp *cur;
> +
> +	hash_for_each_possible_rcu(nf_xdp_hashtable, cur, hnode, key) {
> +		if (key == cur->net_device_addr)
> +			return cur->ft;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int nf_flowtable_by_dev_insert(struct nf_flowtable *ft,
> +				      const struct net_device *dev)
> +{
> +	unsigned long key = (unsigned long)dev;
> +	struct flow_offload_xdp *cur;
> +	int err = 0;
> +
> +	mutex_lock(&nf_xdp_hashtable_lock);
> +	hash_for_each_possible(nf_xdp_hashtable, cur, hnode, key) {
> +		if (key != cur->net_device_addr)
> +			continue;
> +		err = -EEXIST;
> +		break;
> +	}
> +
> +	if (err == 0) {
> +		struct flow_offload_xdp *new;
> +
> +		new = kzalloc(sizeof(*new), GFP_KERNEL);
> +		if (new) {
> +			new->net_device_addr = key;
> +			new->ft = ft;
> +
> +			hash_add_rcu(nf_xdp_hashtable, &new->hnode, key);
> +		} else {
> +			err = -ENOMEM;
> +		}
> +	}
> +
> +	mutex_unlock(&nf_xdp_hashtable_lock);
> +
> +	DEBUG_NET_WARN_ON_ONCE(err == 0 && nf_flowtable_by_dev(dev) != ft);
> +
> +	return err;
> +}
> +
> +static void nf_flowtable_by_dev_remove(const struct net_device *dev)
> +{
> +	unsigned long key = (unsigned long)dev;
> +	struct flow_offload_xdp *cur;
> +	bool found = false;
> +
> +	mutex_lock(&nf_xdp_hashtable_lock);
> +
> +	hash_for_each_possible(nf_xdp_hashtable, cur, hnode, key) {
> +		if (key != cur->net_device_addr)
> +			continue;
> +
> +		hash_del_rcu(&cur->hnode);
> +		kfree_rcu(cur, rcuhead);
> +		found = true;
> +		break;
> +	}
> +
> +	mutex_unlock(&nf_xdp_hashtable_lock);
> +
> +	WARN_ON_ONCE(!found);
> +}
> +
>  struct flow_offload_work {
>  	struct list_head	list;
>  	enum flow_cls_command	cmd;
> @@ -1183,6 +1269,38 @@ static int nf_flow_table_offload_cmd(struct flow_block_offload *bo,
>  	return 0;
>  }
>  
> +static int nf_flow_offload_xdp_setup(struct nf_flowtable *flowtable,
> +				     struct net_device *dev,
> +				     enum flow_block_command cmd)
> +{
> +	switch (cmd) {
> +	case FLOW_BLOCK_BIND:
> +		return nf_flowtable_by_dev_insert(flowtable, dev);
> +	case FLOW_BLOCK_UNBIND:
> +		nf_flowtable_by_dev_remove(dev);
> +		return 0;
> +	}
> +
> +	WARN_ON_ONCE(1);
> +	return 0;
> +}
> +
> +static void nf_flow_offload_xdp_cancel(struct nf_flowtable *flowtable,
> +				       struct net_device *dev,
> +				       enum flow_block_command cmd)
> +{
> +	switch (cmd) {
> +	case FLOW_BLOCK_BIND:
> +		nf_flowtable_by_dev_remove(dev);
> +		return;
> +	case FLOW_BLOCK_UNBIND:
> +		/* We do not re-bind in case hw offload would report error
> +		 * on *unregister*.
> +		 */
> +		break;
> +	}
> +}
> +
>  int nf_flow_table_offload_setup(struct nf_flowtable *flowtable,
>  				struct net_device *dev,
>  				enum flow_block_command cmd)
> @@ -1191,6 +1309,15 @@ int nf_flow_table_offload_setup(struct nf_flowtable *flowtable,
>  	struct flow_block_offload bo;
>  	int err;
>  
> +	/* XXX:
> +	 *
> +	 * XDP offload could be made 'never fails', as xdp
> +	 * frames that don't match are simply passed up to
> +	 * normal nf hooks (skb sw flowtable), or to stack.
> +	 */
> +	if (nf_flow_offload_xdp_setup(flowtable, dev, cmd))
> +		return -EBUSY;
> +
>  	if (!nf_flowtable_hw_offload(flowtable))
>  		return 0;
>  
> @@ -1200,8 +1327,10 @@ int nf_flow_table_offload_setup(struct nf_flowtable *flowtable,
>  	else
>  		err = nf_flow_table_indr_offload_cmd(&bo, flowtable, dev, cmd,
>  						     &extack);
> -	if (err < 0)
> +	if (err < 0) {
> +		nf_flow_offload_xdp_cancel(flowtable, dev, cmd);
>  		return err;
> +	}
>  
>  	return nf_flow_table_block_setup(flowtable, &bo, cmd);
>  }
> -- 
> 2.41.0
> 


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-10-23 10:33 ` Lorenzo Bianconi
@ 2023-10-23 11:16   ` Florian Westphal
  0 siblings, 0 replies; 11+ messages in thread
From: Florian Westphal @ 2023-10-23 11:16 UTC (permalink / raw)
  To: Lorenzo Bianconi; +Cc: Florian Westphal, netfilter-devel

Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> thx for working on this, I tested this patch with the flowtable lookup kfunc
> I am working on (code is available here [0]) and it works properly.

Thanks!

> > 
> > Do we need to support dev-in-multiple flowtables?  I would like to
> > avoid this, this likely means the future "xdp" flag in nftables would
> > be restricted to "inet" family.  Alternative would be to change the key to
> > 'device address plus protocol family', the xdp prog could derive that from the
> > packet data.
> > 
> > Timeout handling.  Should the XDP program even bother to refresh the
> > flowtable timeout?
> 
> I was assuming the flowtable lookup kfunc can take care of it.

I'm worried about the neigh cache going stale, or rather about making
sure that it gets renewed.

> > +struct nf_flowtable *nf_flowtable_by_dev(const struct net_device *dev)
> > +{
> 
> I think this routine needs to be added to some include file (e.g.
> include/net/netfilter/nf_flow_table.h)

Right.


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-10-19 20:25 [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload Florian Westphal
  2023-10-23 10:33 ` Lorenzo Bianconi
@ 2023-11-02  8:30 ` Florian Westphal
  2023-11-09 11:09   ` Florian Westphal
  2023-11-02 10:49 ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2023-11-02  8:30 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, Lorenzo Bianconi

Florian Westphal <fw@strlen.de> wrote:
> This adds a small internal mapping table so that a new bpf (xdp) kfunc
> A device cannot be added to multiple flowtables, the mapping needs
> to be unique.

This breaks two cases:
1.  Two-phase commit protocol:
nft -f - <<EOF
flush ruleset
table t {
	flowtable ...
EOF

fails when called a second time.  This problem also exists
for at least the mlx hw offload.

It would be good to fix this generically, but I do not see
how, given this problem is nftables-specific and not really
flowtable-related.

2. currently nftables supports
table ip t {
	flowtable f {
		devices = { eth0 ...

table ip6 t {
	flowtable f {
		devices = { eth0 ...

table inet t {
	flowtable f {
		devices = { eth0 ...

... and this works, i.e. the same device can be part of
up to 6 flowtables.

This one is easier to fix: the program can guess ip/ip6
based on the packet data and can pass a family to the kfunc
as a function argument.

inet would be shadowed / hidden when an ip/ip6 flowtable
mapping exists as well.

This is not nice, but the ip(6)-plus-inet combination should
not occur in practice, and nothing breaks here because the
existing sw path is still going to work.
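
The family-keyed alternative could behave like this: an exact
(device, family) entry wins, and an inet entry is only used when no
family-specific mapping exists.  A userspace model of just the lookup
(the per-device slot layout and all names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* stand-in for the real struct nf_flowtable */
struct nf_flowtable { int id; };

enum model_family { FAM_IP, FAM_IP6, FAM_INET, FAM_MAX };

/* one slot per family for a single device, for illustration only */
struct dev_ft_map_model {
	unsigned long dev_addr;
	struct nf_flowtable *ft[FAM_MAX];
};

/* Exact (dev, family) match wins; an ip/ip6 entry shadows inet, and
 * inet is only returned when no family-specific table exists. */
static struct nf_flowtable *
ft_lookup_model(const struct dev_ft_map_model *m,
		unsigned long dev_addr, enum model_family fam)
{
	if (!m || m->dev_addr != dev_addr)
		return NULL;
	if (m->ft[fam])
		return m->ft[fam];
	return m->ft[FAM_INET];
}
```

The xdp program would pick FAM_IP or FAM_IP6 from the packet's ethertype
and pass it as the kfunc argument.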

> +static int nf_flow_offload_xdp_setup(struct nf_flowtable *flowtable,
> +				     struct net_device *dev,
> +				     enum flow_block_command cmd)
> +{
> +	switch (cmd) {
> +	case FLOW_BLOCK_BIND:
> +		return nf_flowtable_by_dev_insert(flowtable, dev);

This is fine or at least can be made to work.

> +	case FLOW_BLOCK_UNBIND:
> +		nf_flowtable_by_dev_remove(dev);

This is broken.  UNBIND comes too late when things are torn down.

I only see two solutions:

1. add a new nf_flow_offload_unbind_prepare() that does this
2. Decouple nf_flowtable from nft_flowtable and make nf_flowtable
   refcounted.  As-is, the UNBIND will result in UAF because the
   underlying structures will be free'd immediately after this,
   without any synchronize_rcu().


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-10-19 20:25 [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload Florian Westphal
  2023-10-23 10:33 ` Lorenzo Bianconi
  2023-11-02  8:30 ` Florian Westphal
@ 2023-11-02 10:49 ` Toke Høiland-Jørgensen
  2023-11-02 10:54   ` Florian Westphal
  2023-11-02 11:07   ` Florian Westphal
  2 siblings, 2 replies; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-11-02 10:49 UTC (permalink / raw)
  To: Florian Westphal, netfilter-devel; +Cc: Florian Westphal, Lorenzo Bianconi

Florian Westphal <fw@strlen.de> writes:

> This adds a small internal mapping table so that a new bpf (xdp) kfunc
> can perform lookups in a flowtable.
>
> I have no intent to push this without nft integration of the xdp program,
> this RFC is just to get comments on the general direction because there
> is a chicken/egg issue:
>
> As-is, xdp program has access to the device pointer, but no way to do a
> lookup in a flowtable -- there is no way to obtain the needed struct
> without whacky stunts.

So IIUC, this would all be controlled by userspace anyway (by
the nft binary), right? In which case, couldn't userspace also provide
the reference to the right flowtable instance, by sticking it into a bpf
map? We'd probably need some special handling on the UAPI side to insert
a flowtable pointer, but from the BPF side it could just look like a
kptr in a map that the program pulls out and passes to the lookup kfunc.
And the map would take a refcnt, making sure the table doesn't disappear
underneath the XDP program. It could even improve performance since
there would be one less hashtable lookup.

The drawback would be that this would make it harder to integrate into
other XDP data planes, as you'd need to coordinate with nft to keep the
right flowtable references alive even if nft doesn't control the XDP
program. But maybe that's doable, somehow?

[...]

> My thinking is to add a xdp-offload flag to the nft grammer only.
> Its not needed on nf uapi side and it would tell nft to attach the xdp
> flowtable forward program to the devices listed in the flowtable.
>
> Also, packet flow is altered (qdiscs are bypassed), which is a strong
> argument against default-usage.

I agree that at this point XDP has too many quirks to be something we
can turn on by default. However, I think we should support XDP data
planes that are not necessarily under the control of nft itself.
Specifically, I am planning to add an 'xdp-forward' utility to xdp-tools
which would enable a semi-automatic XDP fast path using both this and
other hooks like the fib lookup helper. So it would be nice to make the
different pieces as loosely coupled as is practical (cf what I wrote
above).

> Open questions:
>
> Do we need to support dev-in-multiple flowtables?  I would like to
> avoid this, this likely means the future "xdp" flag in nftables would
> be restricted to "inet" family.  Alternative would be to change the key to
> 'device address plus protocol family', the xdp prog could derive that from the
> packet data.

We can always start with the simple case and add more options later if
it turns out to be useful? With kfuncs we do have some flexibility in
terms of adjusting the API (although I think we should strive for
keeping it as stable as we can).

> Timeout handling.  Should the XDP program even bother to refresh the
> flowtable timeout?
>
> It might make more sense to intentionally have packets
> flow through the normal path periodically so neigh entries are up to
> date.

Hmm, I see what you mean, but I worry that this would lead to some nasty
latency blips when a flow transitions back and forth between kernel and
XDP paths. Also, there's a reordering problem as the state is changed:
the first packet goes through the stack, sets the flow state to active, then
gets transmitted. But while that sits in the qdisc waiting to go out on
the wire, the next packet arrives, gets handled by the XDP fastpath and
ends up overtaking the first packet on the TX side. Not sure we have a
good solution for this in general :(

-Toke


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-11-02 10:49 ` Toke Høiland-Jørgensen
@ 2023-11-02 10:54   ` Florian Westphal
  2023-11-02 11:07     ` Toke Høiland-Jørgensen
  2023-11-02 11:07   ` Florian Westphal
  1 sibling, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2023-11-02 10:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, netfilter-devel, Lorenzo Bianconi

Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > It might make more sense to intentionally have packets
> > flow through the normal path periodically so neigh entries are up to
> > date.
> 
> Hmm, I see what you mean, but I worry that this would lead to some nasty
> latency blips when a flow transitions back and forth between kernel and
> XDP paths. Also, there's a reordering problem as the state is changed:
> the first goes through the stack, sets the flow state to active, then
> gets transmitted. But while that sits in the qdisc waiting to go out on
> the wire, the next packet arrives, gets handled by the XDP fastpath and
> ends up overtaking the first packet on the TX side. Not sure we have a
> good solution for this in general :(

From nft-based flowtable offload we already had a feature request to
bounce flows back to the normal path periodically; this was because people
wanted to make sure that long-living flows get revalidated vs. the current
netfilter ruleset and not the one that was active at flow offload time.

There was a patch for it, using a new sysctl, but the author never came
back with an updated patch to handle this via the ruleset instead.


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-11-02 10:49 ` Toke Høiland-Jørgensen
  2023-11-02 10:54   ` Florian Westphal
@ 2023-11-02 11:07   ` Florian Westphal
  2023-11-02 11:25     ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2023-11-02 11:07 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, netfilter-devel, Lorenzo Bianconi

Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> So IIUC correctly, this would all be controlled by userspace anyway (by
> the nft binary), right? In which case, couldn't userspace also provide
> the reference to the right flowtable instance, by sticking it into a bpf
> map? We'd probably need some special handling on the UAPI side to insert
> a flowtable pointer, but from the BPF side it could just look like a
> kptr in a map that the program pulls out and passes to the lookup kfunc.
> And the map would take a refcnt, making sure the table doesn't disappear
> underneath the XDP program. It could even improve performance since
> there would be one less hashtable lookup.

That requires kernel changes.  Not only are flowtables not refcounted
at this time, we also have no unique identifier in the uapi; only a
combination (table name, family, flowtable name, OR table name and
handle id).

Also all of netfilter userland is network namespaced, so same keys
can exist in different net namespaces.


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-11-02 10:54   ` Florian Westphal
@ 2023-11-02 11:07     ` Toke Høiland-Jørgensen
  2023-11-02 11:11       ` Florian Westphal
  0 siblings, 1 reply; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-11-02 11:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Florian Westphal, netfilter-devel, Lorenzo Bianconi

Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> > It might make more sense to intentionally have packets
>> > flow through the normal path periodically so neigh entries are up to
>> > date.
>> 
>> Hmm, I see what you mean, but I worry that this would lead to some nasty
>> latency blips when a flow transitions back and forth between kernel and
>> XDP paths. Also, there's a reordering problem as the state is changed:
>> the first goes through the stack, sets the flow state to active, then
>> gets transmitted. But while that sits in the qdisc waiting to go out on
>> the wire, the next packet arrives, gets handled by the XDP fastpath and
>> ends up overtaking the first packet on the TX side. Not sure we have a
>> good solution for this in general :(
>
> From nft based flowtable offload we already had a feature request to
> bounce flows back to normal path periodically, this was because people
> wanted to make sure that long-living flows get revalidated vs. current
> netfilter ruleset and not the one that was active at flow offload time.
>
> There was a patch for it, using a new sysctl, and author never came
> back with an updated patch to handle this via the ruleset instead.

Right, if there's an existing policy knob for this it makes sense to
support it in the XDP case as well, of course.

Does HW flow offload deal with that reordering case at all, BTW? I
assume it could happen for that as well?

-Toke


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-11-02 11:07     ` Toke Høiland-Jørgensen
@ 2023-11-02 11:11       ` Florian Westphal
  0 siblings, 0 replies; 11+ messages in thread
From: Florian Westphal @ 2023-11-02 11:11 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, netfilter-devel, Lorenzo Bianconi

Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > From nft based flowtable offload we already had a feature request to
> > bounce flows back to normal path periodically, this was because people
> > wanted to make sure that long-living flows get revalidated vs. current
> > netfilter ruleset and not the one that was active at flow offload time.
> >
> > There was a patch for it, using a new sysctl, and author never came
> > back with an updated patch to handle this via the ruleset instead.
> 
> Right, if there's an existing policy knob for this it makes sense to
> support it in the XDP case as well, of course.
> 
> Does HW flow offload deal with that reordering case at all, BTW? I
> assume it could happen for that as well?

We do not have such a feature, only a request.  The patch was never
applied because we believe a sysctl-based approach is not very flexible.

HW can bounce offloaded packets at any time.
SW fallback bounces packets that it cannot handle, currently those
are ip options or ip fragments.

FIN or RST moves flow back to slowpath permanently.


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-11-02 11:07   ` Florian Westphal
@ 2023-11-02 11:25     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-11-02 11:25 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Florian Westphal, netfilter-devel, Lorenzo Bianconi

Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> So IIUC correctly, this would all be controlled by userspace anyway (by
>> the nft binary), right? In which case, couldn't userspace also provide
>> the reference to the right flowtable instance, by sticking it into a bpf
>> map? We'd probably need some special handling on the UAPI side to insert
>> a flowtable pointer, but from the BPF side it could just look like a
>> kptr in a map that the program pulls out and passes to the lookup kfunc.
>> And the map would take a refcnt, making sure the table doesn't disappear
>> underneath the XDP program. It could even improve performance since
>> there would be one less hashtable lookup.
>
> That requires kernel changes.

Well, you started this thread with a kernel patch, so presumably we need
kernel changes in any case? ;)

> Not only are flowtables not refcounted at this time, we also have no
> unique identifier in the uapi; only a combination (table name, family,
> flowtable name, OR table name and handle id).

But presumably that combination is enough to uniquely identify a table,
right? So we could just use the same tuple in the map insert API.
Doesn't *have* to be a numeric unique ID. And you were talking about
adding refcnts anyway in your follow-up message?

I'm not saying it's entirely a slam dunk, but having the reference to
flowtable entries be managed out of band does add a lot of flexibility
on the BPF side; in that sense it's analogous to how BPF map references
are handled.

> Also all of netfilter userland is network namespaced, so same keys
> can exist in different net namespaces.

XDP is namespaced as well, conceptually. I.e., ifindexes are used to
refer to interfaces, and a device will be removed from a devmap if it is
moved to a different namespace, that sort of thing.

-Toke


* Re: [PATCH RFC] netfilter: nf_tables: add flowtable map for xdp offload
  2023-11-02  8:30 ` Florian Westphal
@ 2023-11-09 11:09   ` Florian Westphal
  0 siblings, 0 replies; 11+ messages in thread
From: Florian Westphal @ 2023-11-09 11:09 UTC (permalink / raw)
  To: Florian Westphal; +Cc: pablo, netfilter-devel, Lorenzo Bianconi

Florian Westphal <fw@strlen.de> wrote:

Pablo, I am going to make changes to the flowtable infra, it would
be nice if you could nack/ack the following approach before I start to
spend cycles on this:

> Florian Westphal <fw@strlen.de> wrote:
> This is fine or at least can be made to work.
> 
> > +	case FLOW_BLOCK_UNBIND:
> > +		nf_flowtable_by_dev_remove(dev);
> 
> This is broken.  UNBIND comes too late when things are torn down.
> 
> I only see two solutions:
> 
> 1. add a new nf_flow_offload_unbind_prepare() that does this
> 2. Decouple nf_flowtable from nft_flowtable and make nf_flowtable
>    refcounted.  As-is, the UNBIND will result in UAF because the
>    underlying structures will be free'd immediately after this,
>    without any synchronize_rcu().

I'll go with 2).  The rough plan is:

1. Do not embed nf_flowtable into nft_flowtable or the act_ct struct,
   make this a pointer, which is allocated/freed via kmalloc/kfree.

2. add a refcount_t to nf_flowtable, so the nf_flowtable can have
   its reference incremented for as long as an XDP program might
   be using this.

3. Change nf_flowtable_free so that it will honor the reference count.
   Last _put will queue destruction to rcu worker, so the teardown of
   the rhashtable and other items passes through another
   synchronize_rcu().  This will also move kfree() of nf_flowtable into
   nf_flowtable_free.

4. Refactor the nf_flowtable init function so it will allocate and
   return the flowtable structure.
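
The refcounting in steps 2 and 3 could be modeled roughly like this, with
a plain counter standing in for refcount_t and a flag standing in for the
rcu-deferred teardown (userspace sketch; names hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

struct nf_flowtable_model {
	int refcount;
	int freed;	/* stands in for the rcu-deferred teardown */
};

/* Step 4: allocate and return the flowtable; the owner (nft_flowtable)
 * holds the initial reference. */
static struct nf_flowtable_model *ft_alloc_model(void)
{
	struct nf_flowtable_model *ft = calloc(1, sizeof(*ft));

	if (ft)
		ft->refcount = 1;
	return ft;
}

/* Step 2: an XDP user takes an extra reference for as long as it may
 * dereference the table. */
static void ft_get_model(struct nf_flowtable_model *ft)
{
	ft->refcount++;
}

/* Step 3: the last put triggers destruction; in the kernel this would
 * pass through an rcu worker / synchronize_rcu() before the rhashtable
 * is torn down.  The model just marks the object so the effect is
 * observable. */
static void ft_put_model(struct nf_flowtable_model *ft)
{
	if (--ft->refcount == 0)
		ft->freed = 1;	/* a real version would free here */
}
```

The point of step 3 is visible in the model: the nft side dropping its
reference does not destroy the table while an XDP user still holds one.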

Only alternative I see is to rework both act_ct and nf_tables_api.c
to perform UNBINDs *before* synchronize_rcu (and the flowtable
teardown).

Doing that means that the xdp prog will not be able to 'pin' the
underlying nf_flowtable (as it's embedded in another data structure,
either tcf_ct_flow_table or nft_flowtable), and we would have to rely
on the callers to prevent the kfunc from ever returning an nf_flowtable
that has already been free'd.

Not a problem now, but would be a problem in case bpf would gain
the ability to expose struct nf_flowtable as a refcounted kptr.

