All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 00/11] Netfilter updates for net-next
@ 2022-05-19 22:01 Pablo Neira Ayuso
  2022-05-19 22:01 ` [PATCH net-next 01/11] netfilter: Use l3mdev flow key when re-routing mangled packets Pablo Neira Ayuso
                   ` (10 more replies)
  0 siblings, 11 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:01 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

Hi,

The following patchset contains Netfilter updates for net-next, misc
updates and fallout fixes from recent Florian's code rewritting (from
last pull request):

1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi.

2) Avoid unnecessary GC with a timestamp in conncount, from William Tu
   and Yifeng Sun.

3) Remove TCP conntrack debugging, from Florian Westphal.

4) Fix compilation warning in ctnetlink, from Florian.

5) Add flowtable entry count and limit hw entries toggles, from
   Vlad Buslov and Oz Shlomo.

6) Add flowtable in-flight workqueue objects count, also from Vlad and Oz.

7) syzbot warning in nfnetlink bind, from Florian.

8) Refetch conntrack after __nf_conntrack_confirm(), from Florian Westphal.

9) Move struct nf_ct_timeout back at the bottom of the ctnl_time, to
   where it before recent update, also from Florian.

10) A few NL_SET_BAD_ATTR() for nf_tables netlink set element commands.

Please, pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next.git

Thanks.

----------------------------------------------------------------

The following changes since commit 5cf15ce3c8f1ef431dc9fa845c6d1674f630ecd1:

  Merge branch 'Renesas-RSZ-V2M-support' (2022-05-16 10:14:27 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next.git HEAD

for you to fetch changes up to eb6fb4d6ecbcfd69dfc36fbedbafc9860aeef1e4:

  netfilter: nf_tables: set element extended ACK reporting support (2022-05-19 22:39:50 +0200)

----------------------------------------------------------------
Florian Westphal (4):
      netfilter: conntrack: remove pr_debug callsites from tcp tracker
      netfilter: nfnetlink: fix warn in nfnetlink_unbind
      netfilter: conntrack: re-fetch conntrack after insertion
      netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exit

Martin Willi (1):
      netfilter: Use l3mdev flow key when re-routing mangled packets

Pablo Neira Ayuso (1):
      netfilter: nf_tables: set element extended ACK reporting support

Stephen Rothwell (1):
      netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"

Vlad Buslov (3):
      net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
      netfilter: nf_flow_table: count and limit hw offloaded entries
      netfilter: nf_flow_table: count pending offload workqueue tasks

William Tu (1):
      netfilter: nf_conncount: reduce unnecessary GC

 Documentation/networking/nf_conntrack-sysctl.rst |   9 ++
 include/net/net_namespace.h                      |   6 +
 include/net/netfilter/nf_conntrack_core.h        |   7 +-
 include/net/netfilter/nf_conntrack_count.h       |   1 +
 include/net/netfilter/nf_flow_table.h            |  57 +++++++++
 include/net/netns/flow_table.h                   |  14 +++
 net/ipv4/netfilter.c                             |   3 +-
 net/ipv6/netfilter.c                             |   3 +-
 net/netfilter/Kconfig                            |   9 ++
 net/netfilter/Makefile                           |   1 +
 net/netfilter/nf_conncount.c                     |  11 ++
 net/netfilter/nf_conntrack_netlink.c             |   2 +
 net/netfilter/nf_conntrack_proto_tcp.c           |  52 +-------
 net/netfilter/nf_flow_table_core.c               |  89 +++++++++++++-
 net/netfilter/nf_flow_table_offload.c            |  55 +++++++--
 net/netfilter/nf_flow_table_sysctl.c             | 148 +++++++++++++++++++++++
 net/netfilter/nf_tables_api.c                    |  12 +-
 net/netfilter/nfnetlink.c                        |  24 +---
 net/netfilter/nfnetlink_cttimeout.c              |   5 +-
 net/sched/act_ct.c                               |   5 +-
 20 files changed, 423 insertions(+), 90 deletions(-)
 create mode 100644 include/net/netns/flow_table.h
 create mode 100644 net/netfilter/nf_flow_table_sysctl.c

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH net-next 01/11] netfilter: Use l3mdev flow key when re-routing mangled packets
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
@ 2022-05-19 22:01 ` Pablo Neira Ayuso
  2022-05-20  5:00   ` patchwork-bot+netdevbpf
  2022-05-19 22:01 ` [PATCH net-next 02/11] netfilter: nf_conncount: reduce unnecessary GC Pablo Neira Ayuso
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:01 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Martin Willi <martin@strongswan.org>

Commit 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif
reset for port devices") introduces a flow key specific for layer 3
domains, such as a VRF master device. This allows for explicit VRF domain
selection instead of abusing the oif flow key.

Update ip[6]_route_me_harder() to make use of that new key when re-routing
mangled packets within VRFs instead of setting the flow oif, making it
consistent with other users.

Signed-off-by: Martin Willi <martin@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter.c | 3 +--
 net/ipv6/netfilter.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index aff707988e23..bd135165482a 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -45,8 +45,7 @@ int ip_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb, un
 	fl4.saddr = saddr;
 	fl4.flowi4_tos = RT_TOS(iph->tos);
 	fl4.flowi4_oif = sk ? sk->sk_bound_dev_if : 0;
-	if (!fl4.flowi4_oif)
-		fl4.flowi4_oif = l3mdev_master_ifindex(dev);
+	fl4.flowi4_l3mdev = l3mdev_master_ifindex(dev);
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_flags = flags;
 	fib4_rules_early_flow_dissect(net, skb, &fl4, &flkeys);
diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index 8ce60ab89015..857713d7a38a 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -31,6 +31,7 @@ int ip6_route_me_harder(struct net *net, struct sock *sk_partial, struct sk_buff
 	int strict = (ipv6_addr_type(&iph->daddr) &
 		      (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL));
 	struct flowi6 fl6 = {
+		.flowi6_l3mdev = l3mdev_master_ifindex(dev),
 		.flowi6_mark = skb->mark,
 		.flowi6_uid = sock_net_uid(net, sk),
 		.daddr = iph->daddr,
@@ -42,8 +43,6 @@ int ip6_route_me_harder(struct net *net, struct sock *sk_partial, struct sk_buff
 		fl6.flowi6_oif = sk->sk_bound_dev_if;
 	else if (strict)
 		fl6.flowi6_oif = dev->ifindex;
-	else
-		fl6.flowi6_oif = l3mdev_master_ifindex(dev);
 
 	fib6_rules_early_flow_dissect(net, skb, &fl6, &flkeys);
 	dst = ip6_route_output(net, sk, &fl6);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 02/11] netfilter: nf_conncount: reduce unnecessary GC
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
  2022-05-19 22:01 ` [PATCH net-next 01/11] netfilter: Use l3mdev flow key when re-routing mangled packets Pablo Neira Ayuso
@ 2022-05-19 22:01 ` Pablo Neira Ayuso
  2022-05-19 22:01 ` [PATCH net-next 03/11] netfilter: conntrack: remove pr_debug callsites from tcp tracker Pablo Neira Ayuso
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:01 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: William Tu <u9012063@gmail.com>

Currently nf_conncount can trigger garbage collection (GC)
at multiple places. Each GC process takes a spin_lock_bh
to traverse the nf_conncount_list. We found that when testing
port scanning use two parallel nmap, because the number of
connection increase fast, the nf_conncount_count and its
subsequent call to __nf_conncount_add take too much time,
causing several CPU lockup. This happens when user set the
conntrack limit to +20,000, because the larger the limit,
the longer the list that GC has to traverse.

The patch mitigate the performance issue by avoiding unnecessary
GC with a timestamp. Whenever nf_conncount has done a GC,
a timestamp is updated, and beforce the next time GC is
triggered, we make sure it's more than a jiffies.
By doin this we can greatly reduce the CPU cycles and
avoid the softirq lockup.

To reproduce it in OVS,
$ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000
$ ovs-appctl dpctl/ct-get-limits

At another machine, runs two nmap
$ nmap -p1- <IP>
$ nmap -p1- <IP>

Signed-off-by: William Tu <u9012063@gmail.com>
Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reported-by: Greg Rose <gvrose8192@gmail.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_count.h |  1 +
 net/netfilter/nf_conncount.c               | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index 9645b47fa7e4..e227d997fc71 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -10,6 +10,7 @@ struct nf_conncount_data;
 
 struct nf_conncount_list {
 	spinlock_t list_lock;
+	u32 last_gc;		/* jiffies at most recent gc */
 	struct list_head head;	/* connections with the same filtering key */
 	unsigned int count;	/* length of list */
 };
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 82f36beb2e76..5d8ed6c90b7e 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -132,6 +132,9 @@ static int __nf_conncount_add(struct net *net,
 	struct nf_conn *found_ct;
 	unsigned int collect = 0;
 
+	if (time_is_after_eq_jiffies((unsigned long)list->last_gc))
+		goto add_new_node;
+
 	/* check the saved connections */
 	list_for_each_entry_safe(conn, conn_n, &list->head, node) {
 		if (collect > CONNCOUNT_GC_MAX_NODES)
@@ -177,6 +180,7 @@ static int __nf_conncount_add(struct net *net,
 		nf_ct_put(found_ct);
 	}
 
+add_new_node:
 	if (WARN_ON_ONCE(list->count > INT_MAX))
 		return -EOVERFLOW;
 
@@ -190,6 +194,7 @@ static int __nf_conncount_add(struct net *net,
 	conn->jiffies32 = (u32)jiffies;
 	list_add_tail(&conn->node, &list->head);
 	list->count++;
+	list->last_gc = (u32)jiffies;
 	return 0;
 }
 
@@ -214,6 +219,7 @@ void nf_conncount_list_init(struct nf_conncount_list *list)
 	spin_lock_init(&list->list_lock);
 	INIT_LIST_HEAD(&list->head);
 	list->count = 0;
+	list->last_gc = (u32)jiffies;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_list_init);
 
@@ -227,6 +233,10 @@ bool nf_conncount_gc_list(struct net *net,
 	unsigned int collected = 0;
 	bool ret = false;
 
+	/* don't bother if we just did GC */
+	if (time_is_after_eq_jiffies((unsigned long)READ_ONCE(list->last_gc)))
+		return false;
+
 	/* don't bother if other cpu is already doing GC */
 	if (!spin_trylock(&list->list_lock))
 		return false;
@@ -258,6 +268,7 @@ bool nf_conncount_gc_list(struct net *net,
 
 	if (!list->count)
 		ret = true;
+	list->last_gc = (u32)jiffies;
 	spin_unlock(&list->list_lock);
 
 	return ret;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 03/11] netfilter: conntrack: remove pr_debug callsites from tcp tracker
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
  2022-05-19 22:01 ` [PATCH net-next 01/11] netfilter: Use l3mdev flow key when re-routing mangled packets Pablo Neira Ayuso
  2022-05-19 22:01 ` [PATCH net-next 02/11] netfilter: nf_conncount: reduce unnecessary GC Pablo Neira Ayuso
@ 2022-05-19 22:01 ` Pablo Neira Ayuso
  2022-05-19 22:01 ` [PATCH net-next 04/11] netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list" Pablo Neira Ayuso
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:01 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Florian Westphal <fw@strlen.de>

They are either obsolete or useless.

Those in the normal processing path cannot be enabled on a production
system; they generate too much noise.

One pr_debug call resides in an error path and does provide useful info,
merge it with the existing nf_log_invalid().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_proto_tcp.c | 52 ++------------------------
 1 file changed, 4 insertions(+), 48 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 204a5cdff5b1..a63b51dceaf2 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -485,7 +485,6 @@ static bool tcp_in_window(struct nf_conn *ct,
 	struct nf_tcp_net *tn = nf_tcp_pernet(net);
 	struct ip_ct_tcp_state *sender = &state->seen[dir];
 	struct ip_ct_tcp_state *receiver = &state->seen[!dir];
-	const struct nf_conntrack_tuple *tuple = &ct->tuplehash[dir].tuple;
 	__u32 seq, ack, sack, end, win, swin;
 	u16 win_raw;
 	s32 receiver_offset;
@@ -508,18 +507,6 @@ static bool tcp_in_window(struct nf_conn *ct,
 	ack -= receiver_offset;
 	sack -= receiver_offset;
 
-	pr_debug("tcp_in_window: START\n");
-	pr_debug("tcp_in_window: ");
-	nf_ct_dump_tuple(tuple);
-	pr_debug("seq=%u ack=%u+(%d) sack=%u+(%d) win=%u end=%u\n",
-		 seq, ack, receiver_offset, sack, receiver_offset, win, end);
-	pr_debug("tcp_in_window: sender end=%u maxend=%u maxwin=%u scale=%i "
-		 "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
-		 sender->td_end, sender->td_maxend, sender->td_maxwin,
-		 sender->td_scale,
-		 receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
-		 receiver->td_scale);
-
 	if (sender->td_maxwin == 0) {
 		/*
 		 * Initialize sender data.
@@ -597,27 +584,10 @@ static bool tcp_in_window(struct nf_conn *ct,
 		 */
 		seq = end = sender->td_end;
 
-	pr_debug("tcp_in_window: ");
-	nf_ct_dump_tuple(tuple);
-	pr_debug("seq=%u ack=%u+(%d) sack=%u+(%d) win=%u end=%u\n",
-		 seq, ack, receiver_offset, sack, receiver_offset, win, end);
-	pr_debug("tcp_in_window: sender end=%u maxend=%u maxwin=%u scale=%i "
-		 "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
-		 sender->td_end, sender->td_maxend, sender->td_maxwin,
-		 sender->td_scale,
-		 receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
-		 receiver->td_scale);
-
 	/* Is the ending sequence in the receive window (if available)? */
 	in_recv_win = !receiver->td_maxwin ||
 		      after(end, sender->td_end - receiver->td_maxwin - 1);
 
-	pr_debug("tcp_in_window: I=%i II=%i III=%i IV=%i\n",
-		 before(seq, sender->td_maxend + 1),
-		 (in_recv_win ? 1 : 0),
-		 before(sack, receiver->td_end + 1),
-		 after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1));
-
 	if (before(seq, sender->td_maxend + 1) &&
 	    in_recv_win &&
 	    before(sack, receiver->td_end + 1) &&
@@ -698,11 +668,6 @@ static bool tcp_in_window(struct nf_conn *ct,
 		}
 	}
 
-	pr_debug("tcp_in_window: res=%u sender end=%u maxend=%u maxwin=%u "
-		 "receiver end=%u maxend=%u maxwin=%u\n",
-		 res, sender->td_end, sender->td_maxend, sender->td_maxwin,
-		 receiver->td_end, receiver->td_maxend, receiver->td_maxwin);
-
 	return res;
 }
 
@@ -772,8 +737,6 @@ static noinline bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
 	enum tcp_conntrack new_state;
 	struct net *net = nf_ct_net(ct);
 	const struct nf_tcp_net *tn = nf_tcp_pernet(net);
-	const struct ip_ct_tcp_state *sender = &ct->proto.tcp.seen[0];
-	const struct ip_ct_tcp_state *receiver = &ct->proto.tcp.seen[1];
 
 	/* Don't need lock here: this conntrack not in circulation yet */
 	new_state = tcp_conntracks[0][get_conntrack_index(th)][TCP_CONNTRACK_NONE];
@@ -826,14 +789,6 @@ static noinline bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
 
 	/* tcp_packet will set them */
 	ct->proto.tcp.last_index = TCP_NONE_SET;
-
-	pr_debug("%s: sender end=%u maxend=%u maxwin=%u scale=%i "
-		 "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
-		 __func__,
-		 sender->td_end, sender->td_maxend, sender->td_maxwin,
-		 sender->td_scale,
-		 receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
-		 receiver->td_scale);
 	return true;
 }
 
@@ -1032,10 +987,11 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
 		}
 
 		/* Invalid packet */
-		pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n",
-			 dir, get_conntrack_index(th), old_state);
 		spin_unlock_bh(&ct->lock);
-		nf_ct_l4proto_log_invalid(skb, ct, state, "invalid state");
+		nf_ct_l4proto_log_invalid(skb, ct, state,
+					  "packet (index %d) in dir %d invalid, state %s",
+					  index, dir,
+					  tcp_conntrack_names[old_state]);
 		return -NF_ACCEPT;
 	case TCP_CONNTRACK_TIME_WAIT:
 		/* RFC5961 compliance cause stack to send "challenge-ACK"
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 04/11] netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (2 preceding siblings ...)
  2022-05-19 22:01 ` [PATCH net-next 03/11] netfilter: conntrack: remove pr_debug callsites from tcp tracker Pablo Neira Ayuso
@ 2022-05-19 22:01 ` Pablo Neira Ayuso
  2022-05-19 22:02 ` [PATCH net-next 05/11] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Pablo Neira Ayuso
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:01 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Stephen Rothwell <sfr@canb.auug.org.au>

After merging the net-next tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:

nf_conntrack_netlink.c:1717 warning: 'ctnetlink_dump_one_entry' defined but not used

Fixes: 8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_netlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index e768f59741a6..722af5e309ba 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1714,6 +1714,7 @@ static int ctnetlink_done_list(struct netlink_callback *cb)
 	return 0;
 }
 
+#ifdef CONFIG_NF_CONNTRACK_EVENTS
 static int ctnetlink_dump_one_entry(struct sk_buff *skb,
 				    struct netlink_callback *cb,
 				    struct nf_conn *ct,
@@ -1754,6 +1755,7 @@ static int ctnetlink_dump_one_entry(struct sk_buff *skb,
 
 	return res;
 }
+#endif
 
 static int
 ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 05/11] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (3 preceding siblings ...)
  2022-05-19 22:01 ` [PATCH net-next 04/11] netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list" Pablo Neira Ayuso
@ 2022-05-19 22:02 ` Pablo Neira Ayuso
  2022-05-19 22:02 ` [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries Pablo Neira Ayuso
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:02 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Vlad Buslov <vladbu@nvidia.com>

Following patches in series use the pointer to access flow table offload
debug variables.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/sched/act_ct.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 8af9d6e5ba61..58e0dfed22c6 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -277,7 +277,7 @@ static struct nf_flowtable_type flowtable_ct = {
 	.owner		= THIS_MODULE,
 };
 
-static int tcf_ct_flow_table_get(struct tcf_ct_params *params)
+static int tcf_ct_flow_table_get(struct net *net, struct tcf_ct_params *params)
 {
 	struct tcf_ct_flow_table *ct_ft;
 	int err = -ENOMEM;
@@ -303,6 +303,7 @@ static int tcf_ct_flow_table_get(struct tcf_ct_params *params)
 	err = nf_flow_table_init(&ct_ft->nf_ft);
 	if (err)
 		goto err_init;
+	write_pnet(&ct_ft->nf_ft.net, net);
 
 	__module_get(THIS_MODULE);
 out_unlock:
@@ -1391,7 +1392,7 @@ static int tcf_ct_init(struct net *net, struct nlattr *nla,
 	if (err)
 		goto cleanup;
 
-	err = tcf_ct_flow_table_get(params);
+	err = tcf_ct_flow_table_get(net, params);
 	if (err)
 		goto cleanup;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (4 preceding siblings ...)
  2022-05-19 22:02 ` [PATCH net-next 05/11] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Pablo Neira Ayuso
@ 2022-05-19 22:02 ` Pablo Neira Ayuso
  2022-05-19 23:11   ` Jakub Kicinski
  2022-05-19 22:02 ` [PATCH net-next 07/11] netfilter: nf_flow_table: count pending offload workqueue tasks Pablo Neira Ayuso
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:02 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Vlad Buslov <vladbu@nvidia.com>

To improve hardware offload debuggability and scalability introduce
'nf_flowtable_count_hw' and 'nf_flowtable_max_hw' sysctl entries in new
dedicated 'net/netfilter/ft' namespace. Add new pernet struct nf_ft_net in
order to store the counter and sysctl header of new sysctl table.

Count the offloaded flows in workqueue add task handler. Verify that
offloaded flow total is lower than allowed maximum before calling the
driver callbacks. To prevent spamming the 'add' workqueue with tasks when
flows can't be offloaded anymore also check that count is below limit
before queuing offload work. This doesn't prevent all redundant workqueue
task since counter can be taken by concurrent work handler after the check
had been performed but before the offload job is executed but it still
greatly reduces such occurrences. Note that flows that were not offloaded
due to counter being larger than the cap can still be offloaded via refresh
function.

Ensure that flows are accounted correctly by verifying IPS_HW_OFFLOAD_BIT
value before counting them. This ensures that add/refresh code path
increments the counter exactly once per flow when setting the bit and
decrements it only for accounted flows when deleting the flow with the bit
set.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 .../networking/nf_conntrack-sysctl.rst        |  9 +++
 include/net/netfilter/nf_flow_table.h         | 36 ++++++++++
 net/netfilter/Makefile                        |  1 +
 net/netfilter/nf_flow_table_core.c            | 55 ++++++++++++++-
 net/netfilter/nf_flow_table_offload.c         | 38 ++++++++--
 net/netfilter/nf_flow_table_sysctl.c          | 69 +++++++++++++++++++
 6 files changed, 200 insertions(+), 8 deletions(-)
 create mode 100644 net/netfilter/nf_flow_table_sysctl.c

diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
index 834945ebc4cd..116c27f46c24 100644
--- a/Documentation/networking/nf_conntrack-sysctl.rst
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -210,3 +210,12 @@ nf_flowtable_udp_timeout - INTEGER (seconds)
         Control offload timeout for udp connections.
         UDP connections may be offloaded from nf conntrack to nf flow table.
         Once aged, the connection is returned to nf conntrack with udp pickup timeout.
+
+nf_flowtable_count_hw - INTEGER (read-only)
+	Number of flowtable entries that are currently offloaded to hardware.
+
+nf_flowtable_max_hw - INTEGER
+	- 0 - disabled (default)
+	- not 0 - enabled
+
+	Cap on maximum total amount of flowtable entries offloaded to hardware.
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 64daafd1fc41..9709f984fba2 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -335,4 +335,40 @@ static inline __be16 nf_flow_pppoe_proto(const struct sk_buff *skb)
 	return 0;
 }
 
+struct nf_ft_net {
+	atomic_t count_hw;
+#ifdef CONFIG_SYSCTL
+	struct ctl_table_header *sysctl_header;
+#endif
+};
+
+extern unsigned int nf_ft_hw_max;
+extern unsigned int nf_ft_net_id;
+
+#include <net/netns/generic.h>
+
+static inline struct nf_ft_net *nf_ft_pernet(const struct net *net)
+{
+	return net_generic(net, nf_ft_net_id);
+}
+
+static inline struct nf_ft_net *nf_ft_pernet_get(struct nf_flowtable *flow_table)
+{
+	return nf_ft_pernet(read_pnet(&flow_table->net));
+}
+
+#ifdef CONFIG_SYSCTL
+int nf_flow_table_init_sysctl(struct net *net);
+void nf_flow_table_fini_sysctl(struct net *net);
+#else
+static inline int nf_flow_table_init_sysctl(struct net *net)
+{
+	return 0;
+}
+
+static inline void nf_flow_table_fini_sysctl(struct net *net)
+{
+}
+#endif /* CONFIG_SYSCTL */
+
 #endif /* _NF_FLOW_TABLE_H */
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 238b6a620e88..9e3c1f9c6d07 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_NFT_FWD_NETDEV)	+= nft_fwd_netdev.o
 obj-$(CONFIG_NF_FLOW_TABLE)	+= nf_flow_table.o
 nf_flow_table-objs		:= nf_flow_table_core.o nf_flow_table_ip.o \
 				   nf_flow_table_offload.o
+nf_flow_table-$(CONFIG_SYSCTL)	+= nf_flow_table_sysctl.o
 
 obj-$(CONFIG_NF_FLOW_TABLE_INET) += nf_flow_table_inet.o
 
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 3db256da919b..e2598f98017c 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -277,6 +277,13 @@ static const struct rhashtable_params nf_flow_offload_rhash_params = {
 	.automatic_shrinking	= true,
 };
 
+static bool flow_max_hw_entries(struct nf_flowtable *flow_table)
+{
+	struct nf_ft_net *fnet = nf_ft_pernet_get(flow_table);
+
+	return nf_ft_hw_max && atomic_read(&fnet->count_hw) >= nf_ft_hw_max;
+}
+
 unsigned long flow_offload_get_timeout(struct flow_offload *flow)
 {
 	unsigned long timeout = NF_FLOW_TIMEOUT;
@@ -320,7 +327,8 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 
 	nf_ct_offload_timeout(flow->ct);
 
-	if (nf_flowtable_hw_offload(flow_table)) {
+	if (nf_flowtable_hw_offload(flow_table) &&
+	    !flow_max_hw_entries(flow_table)) {
 		__set_bit(NF_FLOW_HW, &flow->flags);
 		nf_flow_offload_add(flow_table, flow);
 	}
@@ -338,9 +346,11 @@ void flow_offload_refresh(struct nf_flowtable *flow_table,
 	if (READ_ONCE(flow->timeout) != timeout)
 		WRITE_ONCE(flow->timeout, timeout);
 
-	if (likely(!nf_flowtable_hw_offload(flow_table)))
+	if (likely(!nf_flowtable_hw_offload(flow_table) ||
+		   flow_max_hw_entries(flow_table)))
 		return;
 
+	set_bit(NF_FLOW_HW, &flow->flags);
 	nf_flow_offload_add(flow_table, flow);
 }
 EXPORT_SYMBOL_GPL(flow_offload_refresh);
@@ -652,14 +662,53 @@ void nf_flow_table_free(struct nf_flowtable *flow_table)
 }
 EXPORT_SYMBOL_GPL(nf_flow_table_free);
 
+static int nf_flow_table_pernet_init(struct net *net)
+{
+	return nf_flow_table_init_sysctl(net);
+}
+
+static void nf_flow_table_pernet_exit(struct list_head *net_exit_list)
+{
+	struct net *net;
+
+	list_for_each_entry(net, net_exit_list, exit_list)
+		nf_flow_table_fini_sysctl(net);
+}
+
+unsigned int nf_ft_net_id __read_mostly;
+
+static struct pernet_operations nf_flow_table_net_ops = {
+	.init = nf_flow_table_pernet_init,
+	.exit_batch = nf_flow_table_pernet_exit,
+	.id = &nf_ft_net_id,
+	.size = sizeof(struct nf_ft_net),
+};
+
 static int __init nf_flow_table_module_init(void)
 {
-	return nf_flow_table_offload_init();
+	int ret;
+
+	nf_ft_hw_max = 0;
+
+	ret = register_pernet_subsys(&nf_flow_table_net_ops);
+	if (ret < 0)
+		return ret;
+
+	ret = nf_flow_table_offload_init();
+	if (ret)
+		goto out_offload;
+
+	return 0;
+
+out_offload:
+	unregister_pernet_subsys(&nf_flow_table_net_ops);
+	return ret;
 }
 
 static void __exit nf_flow_table_module_exit(void)
 {
 	nf_flow_table_offload_exit();
+	unregister_pernet_subsys(&nf_flow_table_net_ops);
 }
 
 module_init(nf_flow_table_module_init);
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index 11b6e1942092..5c7146eb646a 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -13,6 +13,8 @@
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_tuple.h>
 
+unsigned int nf_ft_hw_max __read_mostly;
+
 static struct workqueue_struct *nf_flow_offload_add_wq;
 static struct workqueue_struct *nf_flow_offload_del_wq;
 static struct workqueue_struct *nf_flow_offload_stats_wq;
@@ -903,30 +905,56 @@ static int flow_offload_rule_add(struct flow_offload_work *offload,
 	return 0;
 }
 
+static bool flow_get_max_hw_entries(struct nf_flowtable *flow_table)
+{
+	struct nf_ft_net *fnet = nf_ft_pernet_get(flow_table);
+	int count_hw = atomic_inc_return(&fnet->count_hw);
+
+	if (nf_ft_hw_max && count_hw > nf_ft_hw_max) {
+		atomic_dec(&fnet->count_hw);
+		return false;
+	}
+	return true;
+}
+
 static void flow_offload_work_add(struct flow_offload_work *offload)
 {
+	struct nf_ft_net *fnet = nf_ft_pernet_get(offload->flowtable);
 	struct nf_flow_rule *flow_rule[FLOW_OFFLOAD_DIR_MAX];
 	int err;
 
+	if (!flow_get_max_hw_entries(offload->flowtable))
+		return;
+
 	err = nf_flow_offload_alloc(offload, flow_rule);
 	if (err < 0)
-		return;
+		goto out_alloc;
 
 	err = flow_offload_rule_add(offload, flow_rule);
 	if (err < 0)
-		goto out;
+		goto out_add;
 
-	set_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status);
+	if (test_and_set_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status))
+		atomic_dec(&fnet->count_hw);
+	nf_flow_offload_destroy(flow_rule);
+	return;
 
-out:
+out_add:
 	nf_flow_offload_destroy(flow_rule);
+out_alloc:
+	atomic_dec(&fnet->count_hw);
 }
 
 static void flow_offload_work_del(struct flow_offload_work *offload)
 {
-	clear_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status);
+	bool offloaded = test_and_clear_bit(IPS_HW_OFFLOAD_BIT,
+					    &offload->flow->ct->status);
+	struct nf_ft_net *fnet = nf_ft_pernet_get(offload->flowtable);
+
 	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_ORIGINAL);
 	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_REPLY);
+	if (offloaded)
+		atomic_dec(&fnet->count_hw);
 	set_bit(NF_FLOW_HW_DEAD, &offload->flow->flags);
 }
 
diff --git a/net/netfilter/nf_flow_table_sysctl.c b/net/netfilter/nf_flow_table_sysctl.c
new file mode 100644
index 000000000000..2e7539be8f88
--- /dev/null
+++ b/net/netfilter/nf_flow_table_sysctl.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/kernel.h>
+#include <net/netfilter/nf_flow_table.h>
+
+enum nf_ct_sysctl_index {
+	NF_SYSCTL_FLOW_TABLE_MAX_HW,
+	NF_SYSCTL_FLOW_TABLE_COUNT_HW,
+
+	__NF_SYSCTL_FLOW_TABLE_LAST_SYSCTL,
+};
+
+#define NF_SYSCTL_FLOW_TABLE_LAST_SYSCTL                                       \
+	(__NF_SYSCTL_FLOW_TABLE_LAST_SYSCTL + 1)
+
+static struct ctl_table nf_flow_table_sysctl_table[] = {
+	[NF_SYSCTL_FLOW_TABLE_MAX_HW] = {
+		.procname	= "nf_flowtable_max_hw",
+		.data		= &nf_ft_hw_max,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	[NF_SYSCTL_FLOW_TABLE_COUNT_HW] = {
+		.procname	= "nf_flowtable_count_hw",
+		.maxlen		= sizeof(int),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
+	{}
+};
+
+int nf_flow_table_init_sysctl(struct net *net)
+{
+	struct nf_ft_net *fnet = nf_ft_pernet(net);
+	struct ctl_table *table;
+
+	BUILD_BUG_ON(ARRAY_SIZE(nf_flow_table_sysctl_table) != NF_SYSCTL_FLOW_TABLE_LAST_SYSCTL);
+
+	table = kmemdup(nf_flow_table_sysctl_table, sizeof(nf_flow_table_sysctl_table),
+			GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	table[NF_SYSCTL_FLOW_TABLE_COUNT_HW].data = &fnet->count_hw;
+
+	/* Don't allow non-init_net ns to alter global sysctls */
+	if (!net_eq(&init_net, net))
+		table[NF_SYSCTL_FLOW_TABLE_MAX_HW].mode = 0444;
+
+	fnet->sysctl_header = register_net_sysctl(net, "net/netfilter/ft", table);
+	if (!fnet->sysctl_header)
+		goto out_unregister_netfilter;
+
+	return 0;
+
+out_unregister_netfilter:
+	kfree(table);
+	return -ENOMEM;
+}
+
+void nf_flow_table_fini_sysctl(struct net *net)
+{
+	struct nf_ft_net *fnet = nf_ft_pernet(net);
+	struct ctl_table *table;
+
+	table = fnet->sysctl_header->ctl_table_arg;
+	unregister_net_sysctl_table(fnet->sysctl_header);
+	kfree(table);
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 07/11] netfilter: nf_flow_table: count pending offload workqueue tasks
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (5 preceding siblings ...)
  2022-05-19 22:02 ` [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries Pablo Neira Ayuso
@ 2022-05-19 22:02 ` Pablo Neira Ayuso
  2022-05-19 22:02 ` [PATCH net-next 08/11] netfilter: nfnetlink: fix warn in nfnetlink_unbind Pablo Neira Ayuso
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:02 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Vlad Buslov <vladbu@nvidia.com>

To improve hardware offload debuggability count pending 'add', 'del' and
'stats' flow_table offload workqueue tasks. Counters are incremented before
scheduling new task and decremented when workqueue handler finishes
executing. These counters allow user to diagnose congestion on hardware
offload workqueues that can happen when either CPU is starved and workqueue
jobs are executed at lower rate than new ones are added or when
hardware/driver can't keep up with the rate.

Implement the described counters as percpu counters inside new struct
netns_ft which is stored inside struct net. Expose them via new procfs file
'/proc/net/stats/nf_flowtable' that is similar to existing 'nf_conntrack'
file.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/net_namespace.h           |  6 ++
 include/net/netfilter/nf_flow_table.h | 21 +++++++
 include/net/netns/flow_table.h        | 14 +++++
 net/netfilter/Kconfig                 |  9 +++
 net/netfilter/nf_flow_table_core.c    | 38 ++++++++++++-
 net/netfilter/nf_flow_table_offload.c | 17 +++++-
 net/netfilter/nf_flow_table_sysctl.c  | 79 +++++++++++++++++++++++++++
 7 files changed, 179 insertions(+), 5 deletions(-)
 create mode 100644 include/net/netns/flow_table.h

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index c4f5601f6e32..bf770c13e19b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -26,6 +26,9 @@
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 #include <net/netns/conntrack.h>
 #endif
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+#include <net/netns/flow_table.h>
+#endif
 #include <net/netns/nftables.h>
 #include <net/netns/xfrm.h>
 #include <net/netns/mpls.h>
@@ -140,6 +143,9 @@ struct net {
 #if defined(CONFIG_NF_TABLES) || defined(CONFIG_NF_TABLES_MODULE)
 	struct netns_nftables	nft;
 #endif
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+	struct netns_ft ft;
+#endif
 #endif
 #ifdef CONFIG_WEXT_CORE
 	struct sk_buff_head	wext_nlevents;
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 9709f984fba2..e6d63228efd4 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -371,4 +371,25 @@ static inline void nf_flow_table_fini_sysctl(struct net *net)
 }
 #endif /* CONFIG_SYSCTL */
 
+#define NF_FLOW_TABLE_STAT_INC(net, count) __this_cpu_inc((net)->ft.stat->count)
+#define NF_FLOW_TABLE_STAT_DEC(net, count) __this_cpu_dec((net)->ft.stat->count)
+#define NF_FLOW_TABLE_STAT_INC_ATOMIC(net, count)	\
+	this_cpu_inc((net)->ft.stat->count)
+#define NF_FLOW_TABLE_STAT_DEC_ATOMIC(net, count)	\
+	this_cpu_dec((net)->ft.stat->count)
+
+#ifdef CONFIG_NF_FLOW_TABLE_PROCFS
+int nf_flow_table_init_proc(struct net *net);
+void nf_flow_table_fini_proc(struct net *net);
+#else
+static inline int nf_flow_table_init_proc(struct net *net)
+{
+	return 0;
+}
+
+static inline void nf_flow_table_fini_proc(struct net *net)
+{
+}
+#endif /* CONFIG_NF_FLOW_TABLE_PROCFS */
+
 #endif /* _NF_FLOW_TABLE_H */
diff --git a/include/net/netns/flow_table.h b/include/net/netns/flow_table.h
new file mode 100644
index 000000000000..1c5fc657e267
--- /dev/null
+++ b/include/net/netns/flow_table.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NETNS_FLOW_TABLE_H
+#define __NETNS_FLOW_TABLE_H
+
+struct nf_flow_table_stat {
+	unsigned int count_wq_add;
+	unsigned int count_wq_del;
+	unsigned int count_wq_stats;
+};
+
+struct netns_ft {
+	struct nf_flow_table_stat __percpu *stat;
+};
+#endif
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index ddc54b6d18ee..df6abbfe0079 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -734,6 +734,15 @@ config NF_FLOW_TABLE
 
 	  To compile it as a module, choose M here.
 
+config NF_FLOW_TABLE_PROCFS
+	bool "Supply flow table statistics in procfs"
+	default y
+	depends on PROC_FS
+	depends on SYSCTL
+	help
+	  This option enables for the flow table offload statistics
+	  to be shown in procfs under net/netfilter/nf_flowtable.
+
 config NETFILTER_XTABLES
 	tristate "Netfilter Xtables support (required for ip_tables)"
 	default m if NETFILTER_ADVANCED=n
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index e2598f98017c..c86dd627ef42 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -662,17 +662,51 @@ void nf_flow_table_free(struct nf_flowtable *flow_table)
 }
 EXPORT_SYMBOL_GPL(nf_flow_table_free);
 
+static int nf_flow_table_init_net(struct net *net)
+{
+	net->ft.stat = alloc_percpu(struct nf_flow_table_stat);
+	return net->ft.stat ? 0 : -ENOMEM;
+}
+
+static void nf_flow_table_fini_net(struct net *net)
+{
+	free_percpu(net->ft.stat);
+}
+
 static int nf_flow_table_pernet_init(struct net *net)
 {
-	return nf_flow_table_init_sysctl(net);
+	int ret;
+
+	ret = nf_flow_table_init_net(net);
+	if (ret < 0)
+		return ret;
+
+	ret = nf_flow_table_init_sysctl(net);
+	if (ret < 0)
+		goto out_sysctl;
+
+	ret = nf_flow_table_init_proc(net);
+	if (ret < 0)
+		goto out_proc;
+
+	return 0;
+
+out_proc:
+	nf_flow_table_fini_sysctl(net);
+out_sysctl:
+	nf_flow_table_fini_net(net);
+	return ret;
 }
 
 static void nf_flow_table_pernet_exit(struct list_head *net_exit_list)
 {
 	struct net *net;
 
-	list_for_each_entry(net, net_exit_list, exit_list)
+	list_for_each_entry(net, net_exit_list, exit_list) {
+		nf_flow_table_fini_proc(net);
 		nf_flow_table_fini_sysctl(net);
+		nf_flow_table_fini_net(net);
+	}
 }
 
 unsigned int nf_ft_net_id __read_mostly;
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index 5c7146eb646a..074322c1176f 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -995,17 +995,22 @@ static void flow_offload_work_stats(struct flow_offload_work *offload)
 static void flow_offload_work_handler(struct work_struct *work)
 {
 	struct flow_offload_work *offload;
+	struct net *net;
 
 	offload = container_of(work, struct flow_offload_work, work);
+	net = read_pnet(&offload->flowtable->net);
 	switch (offload->cmd) {
 		case FLOW_CLS_REPLACE:
 			flow_offload_work_add(offload);
+			NF_FLOW_TABLE_STAT_DEC_ATOMIC(net, count_wq_add);
 			break;
 		case FLOW_CLS_DESTROY:
 			flow_offload_work_del(offload);
+			NF_FLOW_TABLE_STAT_DEC_ATOMIC(net, count_wq_del);
 			break;
 		case FLOW_CLS_STATS:
 			flow_offload_work_stats(offload);
+			NF_FLOW_TABLE_STAT_DEC_ATOMIC(net, count_wq_stats);
 			break;
 		default:
 			WARN_ON_ONCE(1);
@@ -1017,12 +1022,18 @@ static void flow_offload_work_handler(struct work_struct *work)
 
 static void flow_offload_queue_work(struct flow_offload_work *offload)
 {
-	if (offload->cmd == FLOW_CLS_REPLACE)
+	struct net *net = read_pnet(&offload->flowtable->net);
+
+	if (offload->cmd == FLOW_CLS_REPLACE) {
+		NF_FLOW_TABLE_STAT_INC(net, count_wq_add);
 		queue_work(nf_flow_offload_add_wq, &offload->work);
-	else if (offload->cmd == FLOW_CLS_DESTROY)
+	} else if (offload->cmd == FLOW_CLS_DESTROY) {
+		NF_FLOW_TABLE_STAT_INC(net, count_wq_del);
 		queue_work(nf_flow_offload_del_wq, &offload->work);
-	else
+	} else {
+		NF_FLOW_TABLE_STAT_INC(net, count_wq_stats);
 		queue_work(nf_flow_offload_stats_wq, &offload->work);
+	}
 }
 
 static struct flow_offload_work *
diff --git a/net/netfilter/nf_flow_table_sysctl.c b/net/netfilter/nf_flow_table_sysctl.c
index 2e7539be8f88..edbdcc8731d9 100644
--- a/net/netfilter/nf_flow_table_sysctl.c
+++ b/net/netfilter/nf_flow_table_sysctl.c
@@ -1,7 +1,86 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include <linux/kernel.h>
+#include <linux/proc_fs.h>
 #include <net/netfilter/nf_flow_table.h>
 
+#ifdef CONFIG_NF_FLOW_TABLE_PROCFS
+static void *nf_flow_table_cpu_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct net *net = seq_file_net(seq);
+	int cpu;
+
+	if (*pos == 0)
+		return SEQ_START_TOKEN;
+
+	for (cpu = *pos - 1; cpu < nr_cpu_ids; ++cpu) {
+		if (!cpu_possible(cpu))
+			continue;
+		*pos = cpu + 1;
+		return per_cpu_ptr(net->ft.stat, cpu);
+	}
+
+	return NULL;
+}
+
+static void *nf_flow_table_cpu_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct net *net = seq_file_net(seq);
+	int cpu;
+
+	for (cpu = *pos; cpu < nr_cpu_ids; ++cpu) {
+		if (!cpu_possible(cpu))
+			continue;
+		*pos = cpu + 1;
+		return per_cpu_ptr(net->ft.stat, cpu);
+	}
+	(*pos)++;
+	return NULL;
+}
+
+static void nf_flow_table_cpu_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static int nf_flow_table_cpu_seq_show(struct seq_file *seq, void *v)
+{
+	const struct nf_flow_table_stat *st = v;
+
+	if (v == SEQ_START_TOKEN) {
+		seq_puts(seq, "wq_add   wq_del   wq_stats\n");
+		return 0;
+	}
+
+	seq_printf(seq, "%8d %8d %8d\n",
+		   st->count_wq_add,
+		   st->count_wq_del,
+		   st->count_wq_stats
+		);
+	return 0;
+}
+
+static const struct seq_operations nf_flow_table_cpu_seq_ops = {
+	.start	= nf_flow_table_cpu_seq_start,
+	.next	= nf_flow_table_cpu_seq_next,
+	.stop	= nf_flow_table_cpu_seq_stop,
+	.show	= nf_flow_table_cpu_seq_show,
+};
+
+int nf_flow_table_init_proc(struct net *net)
+{
+	struct proc_dir_entry *pde;
+
+	pde = proc_create_net("nf_flowtable", 0444, net->proc_net_stat,
+			      &nf_flow_table_cpu_seq_ops,
+			      sizeof(struct seq_net_private));
+	return pde ? 0 : -ENOMEM;
+}
+
+void nf_flow_table_fini_proc(struct net *net)
+{
+	remove_proc_entry("nf_flowtable", net->proc_net_stat);
+}
+#endif /* CONFIG_NF_FLOW_TABLE_PROCFS */
+
 enum nf_ct_sysctl_index {
 	NF_SYSCTL_FLOW_TABLE_MAX_HW,
 	NF_SYSCTL_FLOW_TABLE_COUNT_HW,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 08/11] netfilter: nfnetlink: fix warn in nfnetlink_unbind
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (6 preceding siblings ...)
  2022-05-19 22:02 ` [PATCH net-next 07/11] netfilter: nf_flow_table: count pending offload workqueue tasks Pablo Neira Ayuso
@ 2022-05-19 22:02 ` Pablo Neira Ayuso
  2022-05-19 22:02 ` [PATCH net-next 09/11] netfilter: conntrack: re-fetch conntrack after insertion Pablo Neira Ayuso
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:02 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Florian Westphal <fw@strlen.de>

syzbot reports following warn:
WARNING: CPU: 0 PID: 3600 at net/netfilter/nfnetlink.c:703 nfnetlink_unbind+0x357/0x3b0 net/netfilter/nfnetlink.c:694

The syzbot generated program does this:

socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER) = 3
setsockopt(3, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, [1], 4) = 0

... which triggers 'WARN_ON_ONCE(nfnlnet->ctnetlink_listeners == 0)' check.

Instead of counting, just enable reporting for every bind request
and check if we still have listeners on unbind.

While at it, also add the needed bounds check on nfnl_group2type[]
access.

Reported-by: <syzbot+4903218f7fba0a2d6226@syzkaller.appspotmail.com>
Reported-by: <syzbot+afd2d80e495f96049571@syzkaller.appspotmail.com>
Fixes: 2794cdb0b97b ("netfilter: nfnetlink: allow to detect if ctnetlink listeners exist")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink.c | 24 +++++-------------------
 1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index ad3bbe34ca88..2f7c477fc9e7 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -45,7 +45,6 @@ MODULE_DESCRIPTION("Netfilter messages via netlink socket");
 static unsigned int nfnetlink_pernet_id __read_mostly;
 
 struct nfnl_net {
-	unsigned int ctnetlink_listeners;
 	struct sock *nfnl;
 };
 
@@ -673,18 +672,8 @@ static int nfnetlink_bind(struct net *net, int group)
 
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
 	if (type == NFNL_SUBSYS_CTNETLINK) {
-		struct nfnl_net *nfnlnet = nfnl_pernet(net);
-
 		nfnl_lock(NFNL_SUBSYS_CTNETLINK);
-
-		if (WARN_ON_ONCE(nfnlnet->ctnetlink_listeners == UINT_MAX)) {
-			nfnl_unlock(NFNL_SUBSYS_CTNETLINK);
-			return -EOVERFLOW;
-		}
-
-		nfnlnet->ctnetlink_listeners++;
-		if (nfnlnet->ctnetlink_listeners == 1)
-			WRITE_ONCE(net->ct.ctnetlink_has_listener, true);
+		WRITE_ONCE(net->ct.ctnetlink_has_listener, true);
 		nfnl_unlock(NFNL_SUBSYS_CTNETLINK);
 	}
 #endif
@@ -694,15 +683,12 @@ static int nfnetlink_bind(struct net *net, int group)
 static void nfnetlink_unbind(struct net *net, int group)
 {
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
-	int type = nfnl_group2type[group];
-
-	if (type == NFNL_SUBSYS_CTNETLINK) {
-		struct nfnl_net *nfnlnet = nfnl_pernet(net);
+	if (group <= NFNLGRP_NONE || group > NFNLGRP_MAX)
+		return;
 
+	if (nfnl_group2type[group] == NFNL_SUBSYS_CTNETLINK) {
 		nfnl_lock(NFNL_SUBSYS_CTNETLINK);
-		WARN_ON_ONCE(nfnlnet->ctnetlink_listeners == 0);
-		nfnlnet->ctnetlink_listeners--;
-		if (nfnlnet->ctnetlink_listeners == 0)
+		if (!nfnetlink_has_listeners(net, group))
 			WRITE_ONCE(net->ct.ctnetlink_has_listener, false);
 		nfnl_unlock(NFNL_SUBSYS_CTNETLINK);
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 09/11] netfilter: conntrack: re-fetch conntrack after insertion
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (7 preceding siblings ...)
  2022-05-19 22:02 ` [PATCH net-next 08/11] netfilter: nfnetlink: fix warn in nfnetlink_unbind Pablo Neira Ayuso
@ 2022-05-19 22:02 ` Pablo Neira Ayuso
  2022-05-19 22:02 ` [PATCH net-next 10/11] netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exit Pablo Neira Ayuso
  2022-05-19 22:02 ` [PATCH net-next 11/11] netfilter: nf_tables: set element extended ACK reporting support Pablo Neira Ayuso
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:02 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Florian Westphal <fw@strlen.de>

In case the conntrack is clashing, insertion can free skb->_nfct and
set skb->_nfct to the already-confirmed entry.

This wasn't found before because the conntrack entry and the extension
space used to free'd after an rcu grace period, plus the race needs
events enabled to trigger.

Reported-by: <syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com>
Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
Fixes: 2ad9d7747c10 ("netfilter: conntrack: free extension area immediately")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_core.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 6406cfee34c2..37866c8386e2 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -58,8 +58,13 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	int ret = NF_ACCEPT;
 
 	if (ct) {
-		if (!nf_ct_is_confirmed(ct))
+		if (!nf_ct_is_confirmed(ct)) {
 			ret = __nf_conntrack_confirm(skb);
+
+			if (ret == NF_ACCEPT)
+				ct = (struct nf_conn *)skb_nfct(skb);
+		}
+
 		if (ret == NF_ACCEPT && nf_ct_ecache_exist(ct))
 			nf_ct_deliver_cached_events(ct);
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 10/11] netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exit
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (8 preceding siblings ...)
  2022-05-19 22:02 ` [PATCH net-next 09/11] netfilter: conntrack: re-fetch conntrack after insertion Pablo Neira Ayuso
@ 2022-05-19 22:02 ` Pablo Neira Ayuso
  2022-05-19 22:02 ` [PATCH net-next 11/11] netfilter: nf_tables: set element extended ACK reporting support Pablo Neira Ayuso
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:02 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

From: Florian Westphal <fw@strlen.de>

syzbot reports:
BUG: KASAN: slab-out-of-bounds in __list_del_entry_valid+0xcc/0xf0 lib/list_debug.c:42
[..]
 list_del include/linux/list.h:148 [inline]
 cttimeout_net_exit+0x211/0x540 net/netfilter/nfnetlink_cttimeout.c:617

No reproducer so far. Looking at recent changes in this area
its clear that the free_head must not be at the end of the
structure because nf_ct_timeout structure has variable size.

Reported-by: <syzbot+92968395eedbdbd3617d@syzkaller.appspotmail.com>
Fixes: 78222bacfca9 ("netfilter: cttimeout: decouple unlink and free on netns destruction")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_cttimeout.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index f069c24c6146..af15102bc696 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -35,12 +35,13 @@ static unsigned int nfct_timeout_id __read_mostly;
 
 struct ctnl_timeout {
 	struct list_head	head;
+	struct list_head	free_head;
 	struct rcu_head		rcu_head;
 	refcount_t		refcnt;
 	char			name[CTNL_TIMEOUT_NAME_MAX];
-	struct nf_ct_timeout	timeout;
 
-	struct list_head	free_head;
+	/* must be at the end */
+	struct nf_ct_timeout	timeout;
 };
 
 struct nfct_timeout_pernet {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next 11/11] netfilter: nf_tables: set element extended ACK reporting support
  2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
                   ` (9 preceding siblings ...)
  2022-05-19 22:02 ` [PATCH net-next 10/11] netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exit Pablo Neira Ayuso
@ 2022-05-19 22:02 ` Pablo Neira Ayuso
  10 siblings, 0 replies; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-19 22:02 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni

Report the element that causes problems via netlink extended ACK for set
element commands.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index f3ad02a399f8..bd248a5ee68b 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5344,8 +5344,10 @@ static int nf_tables_getsetelem(struct sk_buff *skb,
 
 	nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) {
 		err = nft_get_set_elem(&ctx, set, attr);
-		if (err < 0)
+		if (err < 0) {
+			NL_SET_BAD_ATTR(extack, attr);
 			break;
+		}
 	}
 
 	return err;
@@ -6125,8 +6127,10 @@ static int nf_tables_newsetelem(struct sk_buff *skb,
 
 	nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) {
 		err = nft_add_set_elem(&ctx, set, attr, info->nlh->nlmsg_flags);
-		if (err < 0)
+		if (err < 0) {
+			NL_SET_BAD_ATTR(extack, attr);
 			return err;
+		}
 	}
 
 	if (nft_net->validate_state == NFT_VALIDATE_DO)
@@ -6396,8 +6400,10 @@ static int nf_tables_delsetelem(struct sk_buff *skb,
 
 	nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) {
 		err = nft_del_setelem(&ctx, set, attr);
-		if (err < 0)
+		if (err < 0) {
+			NL_SET_BAD_ATTR(extack, attr);
 			break;
+		}
 	}
 	return err;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
  2022-05-19 22:02 ` [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries Pablo Neira Ayuso
@ 2022-05-19 23:11   ` Jakub Kicinski
  2022-05-20  4:55     ` Jakub Kicinski
  2022-05-20  7:44     ` Pablo Neira Ayuso
  0 siblings, 2 replies; 19+ messages in thread
From: Jakub Kicinski @ 2022-05-19 23:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, pabeni, Felix Fietkau

On Fri, 20 May 2022 00:02:01 +0200 Pablo Neira Ayuso wrote:
> To improve hardware offload debuggability and scalability introduce
> 'nf_flowtable_count_hw' and 'nf_flowtable_max_hw' sysctl entries in new
> dedicated 'net/netfilter/ft' namespace. Add new pernet struct nf_ft_net in
> order to store the counter and sysctl header of new sysctl table.
> 
> Count the offloaded flows in workqueue add task handler. Verify that
> offloaded flow total is lower than allowed maximum before calling the
> driver callbacks. To prevent spamming the 'add' workqueue with tasks when
> flows can't be offloaded anymore also check that count is below limit
> before queuing offload work. This doesn't prevent all redundant workqueue
> task since counter can be taken by concurrent work handler after the check
> had been performed but before the offload job is executed but it still
> greatly reduces such occurrences. Note that flows that were not offloaded
> due to counter being larger than the cap can still be offloaded via refresh
> function.
> 
> Ensure that flows are accounted correctly by verifying IPS_HW_OFFLOAD_BIT
> value before counting them. This ensures that add/refresh code path
> increments the counter exactly once per flow when setting the bit and
> decrements it only for accounted flows when deleting the flow with the bit
> set.

Why a sysctl and not a netlink attr per table or per device?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
  2022-05-19 23:11   ` Jakub Kicinski
@ 2022-05-20  4:55     ` Jakub Kicinski
  2022-05-20  7:44     ` Pablo Neira Ayuso
  1 sibling, 0 replies; 19+ messages in thread
From: Jakub Kicinski @ 2022-05-20  4:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, pabeni, Felix Fietkau

On Thu, 19 May 2022 16:11:36 -0700 Jakub Kicinski wrote:
> On Fri, 20 May 2022 00:02:01 +0200 Pablo Neira Ayuso wrote:
> > To improve hardware offload debuggability and scalability introduce
> > 'nf_flowtable_count_hw' and 'nf_flowtable_max_hw' sysctl entries in new
> > dedicated 'net/netfilter/ft' namespace. Add new pernet struct nf_ft_net in
> > order to store the counter and sysctl header of new sysctl table.
> > 
> > Count the offloaded flows in workqueue add task handler. Verify that
> > offloaded flow total is lower than allowed maximum before calling the
> > driver callbacks. To prevent spamming the 'add' workqueue with tasks when
> > flows can't be offloaded anymore also check that count is below limit
> > before queuing offload work. This doesn't prevent all redundant workqueue
> > task since counter can be taken by concurrent work handler after the check
> > had been performed but before the offload job is executed but it still
> > greatly reduces such occurrences. Note that flows that were not offloaded
> > due to counter being larger than the cap can still be offloaded via refresh
> > function.
> > 
> > Ensure that flows are accounted correctly by verifying IPS_HW_OFFLOAD_BIT
> > value before counting them. This ensures that add/refresh code path
> > increments the counter exactly once per flow when setting the bit and
> > decrements it only for accounted flows when deleting the flow with the bit
> > set.  
> 
> Why a sysctl and not a netlink attr per table or per device?

Let me do something unorthodox and pull just the first 4 patches 
for now so the warning goes away...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next 01/11] netfilter: Use l3mdev flow key when re-routing mangled packets
  2022-05-19 22:01 ` [PATCH net-next 01/11] netfilter: Use l3mdev flow key when re-routing mangled packets Pablo Neira Ayuso
@ 2022-05-20  5:00   ` patchwork-bot+netdevbpf
  0 siblings, 0 replies; 19+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-05-20  5:00 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, kuba, pabeni

Hello:

This series was applied to netdev/net-next.git (master)
by Florian Westphal <fw@strlen.de>:

On Fri, 20 May 2022 00:01:56 +0200 you wrote:
> From: Martin Willi <martin@strongswan.org>
> 
> Commit 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif
> reset for port devices") introduces a flow key specific for layer 3
> domains, such as a VRF master device. This allows for explicit VRF domain
> selection instead of abusing the oif flow key.
> 
> [...]

Here is the summary with links:
  - [net-next,01/11] netfilter: Use l3mdev flow key when re-routing mangled packets
    https://git.kernel.org/netdev/net-next/c/2c50fc04757f
  - [net-next,02/11] netfilter: nf_conncount: reduce unnecessary GC
    https://git.kernel.org/netdev/net-next/c/d265929930e2
  - [net-next,03/11] netfilter: conntrack: remove pr_debug callsites from tcp tracker
    https://git.kernel.org/netdev/net-next/c/f74360d3440c
  - [net-next,04/11] netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"
    https://git.kernel.org/netdev/net-next/c/58a94a62a53f
  - [net-next,05/11] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
    (no matching commit)
  - [net-next,06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
    (no matching commit)
  - [net-next,07/11] netfilter: nf_flow_table: count pending offload workqueue tasks
    (no matching commit)
  - [net-next,08/11] netfilter: nfnetlink: fix warn in nfnetlink_unbind
    (no matching commit)
  - [net-next,09/11] netfilter: conntrack: re-fetch conntrack after insertion
    (no matching commit)
  - [net-next,10/11] netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exit
    (no matching commit)
  - [net-next,11/11] netfilter: nf_tables: set element extended ACK reporting support
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
  2022-05-19 23:11   ` Jakub Kicinski
  2022-05-20  4:55     ` Jakub Kicinski
@ 2022-05-20  7:44     ` Pablo Neira Ayuso
  2022-05-20 17:56       ` Jakub Kicinski
  1 sibling, 1 reply; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-20  7:44 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netfilter-devel, davem, netdev, pabeni, Felix Fietkau

On Thu, May 19, 2022 at 04:11:36PM -0700, Jakub Kicinski wrote:
> On Fri, 20 May 2022 00:02:01 +0200 Pablo Neira Ayuso wrote:
> > To improve hardware offload debuggability and scalability introduce
> > 'nf_flowtable_count_hw' and 'nf_flowtable_max_hw' sysctl entries in new
> > dedicated 'net/netfilter/ft' namespace. Add new pernet struct nf_ft_net in
> > order to store the counter and sysctl header of new sysctl table.
> > 
> > Count the offloaded flows in workqueue add task handler. Verify that
> > offloaded flow total is lower than allowed maximum before calling the
> > driver callbacks. To prevent spamming the 'add' workqueue with tasks when
> > flows can't be offloaded anymore also check that count is below limit
> > before queuing offload work. This doesn't prevent all redundant workqueue
> > task since counter can be taken by concurrent work handler after the check
> > had been performed but before the offload job is executed but it still
> > greatly reduces such occurrences. Note that flows that were not offloaded
> > due to counter being larger than the cap can still be offloaded via refresh
> > function.
> > 
> > Ensure that flows are accounted correctly by verifying IPS_HW_OFFLOAD_BIT
> > value before counting them. This ensures that add/refresh code path
> > increments the counter exactly once per flow when setting the bit and
> > decrements it only for accounted flows when deleting the flow with the bit
> > set.
> 
> Why a sysctl and not a netlink attr per table or per device?

Per-device is not an option, because the flowtable represents a
compound of devices.

Moreover, in tc ct act the flowtable is not bound to a device, while
in netfilter/nf_tables it is.

tc ct act does not expose flowtables to userspace in any way, they
internally allocate one flowtable per zone. I assume there os no good
netlink interface for them.

For netfilter/nftables, it should be possible to add per-flowtable
netlink attributes, my plan is to extend the flowtable netlink
attribute to add a flowtable maximum size.

This sysctl count and limit hw will just work as a global limit (which
is optional), my plan is that the upcoming per-flowtable limit will
just override this global limit.

I think it is a reasonable tradeoff for the different requirements of
the flowtable infrastructure users given there are two clients
currently for this code.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
  2022-05-20  7:44     ` Pablo Neira Ayuso
@ 2022-05-20 17:56       ` Jakub Kicinski
  2022-05-20 22:17         ` Pablo Neira Ayuso
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2022-05-20 17:56 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, pabeni, Felix Fietkau

On Fri, 20 May 2022 09:44:57 +0200 Pablo Neira Ayuso wrote:
> > Why a sysctl and not a netlink attr per table or per device?  
> 
> Per-device is not an option, because the flowtable represents a
> compound of devices.
> 
> Moreover, in tc ct act the flowtable is not bound to a device, while
> in netfilter/nf_tables it is.
> 
> tc ct act does not expose flowtables to userspace in any way, they
> internally allocate one flowtable per zone. I assume there os no good
> netlink interface for them.
> 
> For netfilter/nftables, it should be possible to add per-flowtable
> netlink attributes, my plan is to extend the flowtable netlink
> attribute to add a flowtable maximum size.
> 
> This sysctl count and limit hw will just work as a global limit (which
> is optional), my plan is that the upcoming per-flowtable limit will
> just override this global limit.
> 
> I think it is a reasonable tradeoff for the different requirements of
> the flowtable infrastructure users given there are two clients
> currently for this code.

net namespace is a software administrative unit, setting HW offload
limits on it does not compute for me. It's worse than a module param.

Can we go back to the problem statement? It sounds like the device
has limited but unknown capacity and the sysctl is supposed to be set
by the user magically to the right size, preventing HW flow table from
filling up? Did I get it right? If so some form of request flow control
seems like a better idea...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
  2022-05-20 17:56       ` Jakub Kicinski
@ 2022-05-20 22:17         ` Pablo Neira Ayuso
  2022-05-20 23:16           ` Jakub Kicinski
  0 siblings, 1 reply; 19+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-20 22:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netfilter-devel, davem, netdev, pabeni, Felix Fietkau, Oz Shlomo,
	paulb, vladbu

On Fri, May 20, 2022 at 10:56:06AM -0700, Jakub Kicinski wrote:
> On Fri, 20 May 2022 09:44:57 +0200 Pablo Neira Ayuso wrote:
> > > Why a sysctl and not a netlink attr per table or per device?  
> > 
> > Per-device is not an option, because the flowtable represents a
> > compound of devices.
> > 
> > Moreover, in tc ct act the flowtable is not bound to a device, while
> > in netfilter/nf_tables it is.
> > 
> > tc ct act does not expose flowtables to userspace in any way, they
> > internally allocate one flowtable per zone. I assume there os no good
> > netlink interface for them.
> > 
> > For netfilter/nftables, it should be possible to add per-flowtable
> > netlink attributes, my plan is to extend the flowtable netlink
> > attribute to add a flowtable maximum size.
> > 
> > This sysctl count and limit hw will just work as a global limit (which
> > is optional), my plan is that the upcoming per-flowtable limit will
> > just override this global limit.
> > 
> > I think it is a reasonable tradeoff for the different requirements of
> > the flowtable infrastructure users given there are two clients
> > currently for this code.
> 
> net namespace is a software administrative unit, setting HW offload
> limits on it does not compute for me. It's worse than a module param.
> 
> Can we go back to the problem statement? It sounds like the device
> has limited but unknown capacity and the sysctl is supposed to be set
> by the user magically to the right size, preventing HW flow table from
> filling up? Did I get it right? If so some form of request flow control
> seems like a better idea...

Policy can also throttle down the maximum number of entries in the
hardware, but policy is complementary to the hard cap.

Once the hw cap is reached, the implementation falls back to the
software flowtable datapath.

Regarding the "magic number", it would be good if devices can expose
these properties through interface, maybe FLOW_BLOCK_PROBE to fetch
device properties and capabilities.

In general, I would also prefer a netlink interface for this, but for
tc ct, this would need to expose the existing flowtable objects via a
new netlink command. Then, I assume such cap would be per ct zone
(there is internally one flowtable per conntrack zone).

BTW, Cc'ing Oz, Paul and Vlad.

Meanwhile, what do you want me to do, toss this patchset?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries
  2022-05-20 22:17         ` Pablo Neira Ayuso
@ 2022-05-20 23:16           ` Jakub Kicinski
  0 siblings, 0 replies; 19+ messages in thread
From: Jakub Kicinski @ 2022-05-20 23:16 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, davem, netdev, pabeni, Felix Fietkau, Oz Shlomo,
	paulb, vladbu

On Sat, 21 May 2022 00:17:32 +0200 Pablo Neira Ayuso wrote:
> Policy can also throttle down the maximum number of entries in the
> hardware, but policy is complementary to the hard cap.
> 
> Once the hw cap is reached, the implementation falls back to the
> software flowtable datapath.

Understood.

> Regarding the "magic number", it would be good if devices can expose
> these properties through interface, maybe FLOW_BLOCK_PROBE to fetch
> device properties and capabilities.

Fingers crossed, however, if the device is multi-user getting exact
cap may be pretty much impossible. Then again the user is supposed
to be able to pull the cap for sysfs out of the hat so I'm confused.

What I was thinking of was pausing offload requests for a jiffy if we
get ENOSPC 3 times in a row, or some such.

> In general, I would also prefer a netlink interface for this, but for
> tc ct, this would need to expose the existing flowtable objects via a
> new netlink command. Then, I assume such cap would be per ct zone
> (there is internally one flowtable per conntrack zone).
> 
> BTW, Cc'ing Oz, Paul and Vlad.

Ah, thanks, I added Felix just in case but didn't check if authors 
are already on CC :S

> Meanwhile, what do you want me to do, toss this patchset?

Yeah, if you don't mind.. We're too close to the merge window to
tentatively take stuff that's under discussion.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-05-20 23:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19 22:01 [PATCH net-next 00/11] Netfilter updates for net-next Pablo Neira Ayuso
2022-05-19 22:01 ` [PATCH net-next 01/11] netfilter: Use l3mdev flow key when re-routing mangled packets Pablo Neira Ayuso
2022-05-20  5:00   ` patchwork-bot+netdevbpf
2022-05-19 22:01 ` [PATCH net-next 02/11] netfilter: nf_conncount: reduce unnecessary GC Pablo Neira Ayuso
2022-05-19 22:01 ` [PATCH net-next 03/11] netfilter: conntrack: remove pr_debug callsites from tcp tracker Pablo Neira Ayuso
2022-05-19 22:01 ` [PATCH net-next 04/11] netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list" Pablo Neira Ayuso
2022-05-19 22:02 ` [PATCH net-next 05/11] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Pablo Neira Ayuso
2022-05-19 22:02 ` [PATCH net-next 06/11] netfilter: nf_flow_table: count and limit hw offloaded entries Pablo Neira Ayuso
2022-05-19 23:11   ` Jakub Kicinski
2022-05-20  4:55     ` Jakub Kicinski
2022-05-20  7:44     ` Pablo Neira Ayuso
2022-05-20 17:56       ` Jakub Kicinski
2022-05-20 22:17         ` Pablo Neira Ayuso
2022-05-20 23:16           ` Jakub Kicinski
2022-05-19 22:02 ` [PATCH net-next 07/11] netfilter: nf_flow_table: count pending offload workqueue tasks Pablo Neira Ayuso
2022-05-19 22:02 ` [PATCH net-next 08/11] netfilter: nfnetlink: fix warn in nfnetlink_unbind Pablo Neira Ayuso
2022-05-19 22:02 ` [PATCH net-next 09/11] netfilter: conntrack: re-fetch conntrack after insertion Pablo Neira Ayuso
2022-05-19 22:02 ` [PATCH net-next 10/11] netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exit Pablo Neira Ayuso
2022-05-19 22:02 ` [PATCH net-next 11/11] netfilter: nf_tables: set element extended ACK reporting support Pablo Neira Ayuso

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.