* [PATCH net-next 0/8] Conntrack offload debuggability improvements
@ 2022-02-22 15:09 Vlad Buslov
  2022-02-22 15:09 ` [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Vlad Buslov
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

The current conntrack offload implementation doesn't provide much
visibility into, or control over, the offload code. The code just
schedules asynchronous offload tasks on dedicated workqueues without
regard to the current queue size, even when a scheduled task will only
be processed after a significant delay and might no longer be needed by
then.

Improve the debuggability situation by implementing the following new
functionality:

- Sysctls for the current total count of offloaded flows and a
  user-configurable maximum. Capping the number of offloaded flows can
  be useful for managing the allocation of hardware resources (see the
  usage sketch after this list).

- Sysctls for the current totals of workqueue tasks on the
  nf_ft_offload_add, nf_ft_offload_del and nf_ft_offload_stats queues.
  Also allow setting a maximum on the total number of concurrent 'add'
  tasks. This provides visibility into flow offload delays caused by
  the system scheduling offload tasks faster than the driver/hardware
  can process them, and allows bounding that delay (for example, with
  short-lived connections a user might prefer to skip offloading a flow
  that would only be offloaded 10 seconds from now). Note that the flow
  can still be offloaded afterwards via the 'refresh' mechanism if both
  the total hardware count and the workqueue count drop back below
  their limits.

- Tracepoints in the offload code. These are primarily targeted at
  facilitating BPF helpers for common debug scenarios (building a
  histogram of the latency between scheduling a flow offload and
  processing the task, tracking the running difference between new
  offloads and deletions, etc.).
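
For illustration, a minimal userspace sketch for polling the proposed
counters (assumes the sysctls land under net.netfilter as described
above; not part of the series):

#include <stdio.h>

/* read a single integer-valued sysctl from procfs */
static long read_counter(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	printf("offloaded flows:     %ld\n",
	       read_counter("/proc/sys/net/netfilter/nf_flowtable_count_hw"));
	printf("pending 'add' tasks: %ld\n",
	       read_counter("/proc/sys/net/netfilter/nf_flowtable_count_wq_add"));
	return 0;
}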

Vlad Buslov (8):
  net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
  netfilter: introduce total count of hw offloaded flow table entries
  netfilter: introduce max count of hw offloaded flow table entries
  netfilter: introduce total count of hw offload 'add' workqueue tasks
  netfilter: introduce max count of hw offload 'add' workqueue tasks
  netfilter: introduce total count of hw offload 'del' workqueue tasks
  netfilter: introduce total count of hw offload 'stats' wq tasks
  netfilter: flowtable: add hardware offload tracepoints

 include/net/netfilter/nf_flow_table.h       |  9 ++++
 include/net/netns/nftables.h                |  6 +++
 net/netfilter/nf_conntrack_standalone.c     | 56 +++++++++++++++++++++
 net/netfilter/nf_flow_table_core.c          | 33 +++++++++++-
 net/netfilter/nf_flow_table_offload.c       | 43 ++++++++++++----
 net/netfilter/nf_flow_table_offload_trace.h | 48 ++++++++++++++++++
 net/sched/act_ct.c                          |  5 +-
 7 files changed, 186 insertions(+), 14 deletions(-)
 create mode 100644 net/netfilter/nf_flow_table_offload_trace.h

-- 
2.31.1



* [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
@ 2022-02-22 15:09 ` Vlad Buslov
  2022-03-07 21:09   ` Pablo Neira Ayuso
  2022-02-22 15:09 ` [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries Vlad Buslov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

Following patches in the series use this pointer to access flow table
offload debug variables.
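
For illustration, the namespace stored here is expected to be retrieved
roughly like this in later patches (sketch only; the helper name below
is hypothetical, not from the series):

/* illustrative counterpart of the write_pnet() call added below */
static inline struct net *tcf_ct_ft_net(struct tcf_ct_flow_table *ct_ft)
{
	return read_pnet(&ct_ft->nf_ft.net);
}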

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 net/sched/act_ct.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 7108e71ce4db..a2624f6c4d92 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -277,7 +277,7 @@ static struct nf_flowtable_type flowtable_ct = {
 	.owner		= THIS_MODULE,
 };
 
-static int tcf_ct_flow_table_get(struct tcf_ct_params *params)
+static int tcf_ct_flow_table_get(struct net *net, struct tcf_ct_params *params)
 {
 	struct tcf_ct_flow_table *ct_ft;
 	int err = -ENOMEM;
@@ -303,6 +303,7 @@ static int tcf_ct_flow_table_get(struct tcf_ct_params *params)
 	err = nf_flow_table_init(&ct_ft->nf_ft);
 	if (err)
 		goto err_init;
+	write_pnet(&ct_ft->nf_ft.net, net);
 
 	__module_get(THIS_MODULE);
 out_unlock:
@@ -1321,7 +1322,7 @@ static int tcf_ct_init(struct net *net, struct nlattr *nla,
 	if (err)
 		goto cleanup;
 
-	err = tcf_ct_flow_table_get(params);
+	err = tcf_ct_flow_table_get(net, params);
 	if (err)
 		goto cleanup;
 
-- 
2.31.1



* [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
  2022-02-22 15:09 ` [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Vlad Buslov
@ 2022-02-22 15:09 ` Vlad Buslov
  2022-03-07 21:47   ` Pablo Neira Ayuso
  2022-03-07 21:56   ` Pablo Neira Ayuso
  2022-02-22 15:09 ` [PATCH net-next 3/8] netfilter: introduce max " Vlad Buslov
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

To improve hardware offload debuggability, and to allow the following
patch to cap the total amount of offloaded entries, extend struct
netns_nftables with a 'count_hw' counter and expose it to userspace as
the 'nf_flowtable_count_hw' sysctl entry. Increment the counter together
with setting the NF_FLOW_HW flag when scheduling an offload add task on
the workqueue, and decrement it after successfully scheduling an offload
del task.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 include/net/netns/nftables.h            |  1 +
 net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
 net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index 8c77832d0240..262b8b3213cb 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -6,6 +6,7 @@
 
 struct netns_nftables {
 	u8			gencursor;
+	atomic_t		count_hw;
 };
 
 #endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 3e1afd10a9b6..8cd55d502ffd 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -618,6 +618,9 @@ enum nf_ct_sysctl_index {
 #ifdef CONFIG_LWTUNNEL
 	NF_SYSCTL_CT_LWTUNNEL,
 #endif
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+	NF_SYSCTL_CT_COUNT_HW,
+#endif
 
 	__NF_SYSCTL_CT_LAST_SYSCTL,
 };
@@ -973,6 +976,14 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= nf_hooks_lwtunnel_sysctl_handler,
 	},
+#endif
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+	[NF_SYSCTL_CT_COUNT_HW] = {
+		.procname	= "nf_flowtable_count_hw",
+		.maxlen		= sizeof(int),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{}
 };
@@ -1111,6 +1122,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_STREAM].data = &un->timeouts[UDP_CT_REPLIED];
 #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
 	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
+	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
 #endif
 
 	nf_conntrack_standalone_init_tcp_sysctl(net, table);
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index b90eca7a2f22..9f2b68bc6f80 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -315,6 +315,9 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 	nf_ct_offload_timeout(flow->ct);
 
 	if (nf_flowtable_hw_offload(flow_table)) {
+		struct net *net = read_pnet(&flow_table->net);
+
+		atomic_inc(&net->nft.count_hw);
 		__set_bit(NF_FLOW_HW, &flow->flags);
 		nf_flow_offload_add(flow_table, flow);
 	}
@@ -462,10 +465,15 @@ static void nf_flow_offload_gc_step(struct flow_offload *flow, void *data)
 
 	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
 		if (test_bit(NF_FLOW_HW, &flow->flags)) {
-			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags))
+			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags)) {
+				struct net *net = read_pnet(&flow_table->net);
+
 				nf_flow_offload_del(flow_table, flow);
-			else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags))
+				if (test_bit(NF_FLOW_HW_DYING, &flow->flags))
+					atomic_dec(&net->nft.count_hw);
+			} else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags)) {
 				flow_offload_del(flow_table, flow);
+			}
 		} else {
 			flow_offload_del(flow_table, flow);
 		}
-- 
2.31.1



* [PATCH net-next 3/8] netfilter: introduce max count of hw offloaded flow table entries
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
  2022-02-22 15:09 ` [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Vlad Buslov
  2022-02-22 15:09 ` [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries Vlad Buslov
@ 2022-02-22 15:09 ` Vlad Buslov
  2022-03-07 22:13   ` Pablo Neira Ayuso
  2022-02-22 15:09 ` [PATCH net-next 4/8] netfilter: introduce total count of hw offload 'add' workqueue tasks Vlad Buslov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

To improve hardware offload debuggability, extend struct netns_nftables
with a 'max_hw' field and expose it to userspace as the
'nf_flowtable_max_hw' sysctl entry. Verify that the count_hw value is
less than max_hw and don't offload a new flow table entry to hardware if
that is not the case. Flows that were not offloaded because count_hw
exceeded the configured maximum can still be offloaded via the refresh
function. Mark such flows with the NF_FLOW_HW bit and only count them
once by checking that the bit was not previously set.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 include/net/netns/nftables.h            |  1 +
 net/netfilter/nf_conntrack_standalone.c | 11 +++++++++++
 net/netfilter/nf_flow_table_core.c      | 25 +++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index 262b8b3213cb..5677f21fdd4c 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -7,6 +7,7 @@
 struct netns_nftables {
 	u8			gencursor;
 	atomic_t		count_hw;
+	int			max_hw;
 };
 
 #endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 8cd55d502ffd..af0dea471119 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -620,6 +620,7 @@ enum nf_ct_sysctl_index {
 #endif
 #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
 	NF_SYSCTL_CT_COUNT_HW,
+	NF_SYSCTL_CT_MAX_HW,
 #endif
 
 	__NF_SYSCTL_CT_LAST_SYSCTL,
@@ -984,6 +985,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_dointvec,
 	},
+	[NF_SYSCTL_CT_MAX_HW] = {
+		.procname	= "nf_flowtable_max_hw",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{}
 };
@@ -1123,6 +1130,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
 	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
 	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
+	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
 #endif
 
 	nf_conntrack_standalone_init_tcp_sysctl(net, table);
@@ -1135,6 +1143,9 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 		table[NF_SYSCTL_CT_MAX].mode = 0444;
 		table[NF_SYSCTL_CT_EXPECT_MAX].mode = 0444;
 		table[NF_SYSCTL_CT_BUCKETS].mode = 0444;
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+		table[NF_SYSCTL_CT_MAX_HW].mode = 0444;
+#endif
 	}
 
 	cnet->sysctl_header = register_net_sysctl(net, "net/netfilter", table);
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 9f2b68bc6f80..2631bd0ae9ae 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -290,6 +290,20 @@ unsigned long flow_offload_get_timeout(struct flow_offload *flow)
 	return timeout;
 }
 
+static bool flow_offload_inc_count_hw(struct nf_flowtable *flow_table)
+{
+	struct net *net = read_pnet(&flow_table->net);
+	int max_hw = net->nft.max_hw, count_hw;
+
+	count_hw = atomic_inc_return(&net->nft.count_hw);
+	if (max_hw && count_hw > max_hw) {
+		atomic_dec(&net->nft.count_hw);
+		return false;
+	}
+
+	return true;
+}
+
 int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 {
 	int err;
@@ -315,9 +329,9 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 	nf_ct_offload_timeout(flow->ct);
 
 	if (nf_flowtable_hw_offload(flow_table)) {
-		struct net *net = read_pnet(&flow_table->net);
+		if (!flow_offload_inc_count_hw(flow_table))
+			return 0;
 
-		atomic_inc(&net->nft.count_hw);
 		__set_bit(NF_FLOW_HW, &flow->flags);
 		nf_flow_offload_add(flow_table, flow);
 	}
@@ -329,6 +343,7 @@ EXPORT_SYMBOL_GPL(flow_offload_add);
 void flow_offload_refresh(struct nf_flowtable *flow_table,
 			  struct flow_offload *flow)
 {
+	struct net *net = read_pnet(&flow_table->net);
 	u32 timeout;
 
 	timeout = nf_flowtable_time_stamp + flow_offload_get_timeout(flow);
@@ -338,6 +353,12 @@ void flow_offload_refresh(struct nf_flowtable *flow_table,
 	if (likely(!nf_flowtable_hw_offload(flow_table)))
 		return;
 
+	if (!flow_offload_inc_count_hw(flow_table))
+		return;
+	/* only count each flow once when setting NF_FLOW_HW bit */
+	if (test_and_set_bit(NF_FLOW_HW, &flow->flags))
+		atomic_dec(&net->nft.count_hw);
+
 	nf_flow_offload_add(flow_table, flow);
 }
 EXPORT_SYMBOL_GPL(flow_offload_refresh);
-- 
2.31.1



* [PATCH net-next 4/8] netfilter: introduce total count of hw offload 'add' workqueue tasks
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
                   ` (2 preceding siblings ...)
  2022-02-22 15:09 ` [PATCH net-next 3/8] netfilter: introduce max " Vlad Buslov
@ 2022-02-22 15:09 ` Vlad Buslov
  2022-03-07 22:46   ` Pablo Neira Ayuso
  2022-02-22 15:10 ` [PATCH net-next 5/8] netfilter: introduce max " Vlad Buslov
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

To improve hardware offload debuggability, and to allow the following
patch to cap the total amount of in-flight offload 'add' entries on the
workqueue, extend struct netns_nftables with a 'count_wq_add' counter
and expose it to userspace as the 'nf_flowtable_count_wq_add' sysctl
entry. Increment the counter when allocating a new workqueue task and
decrement it after flow_offload_work_add() has finished.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 include/net/netns/nftables.h            |  1 +
 net/netfilter/nf_conntrack_standalone.c |  8 ++++++++
 net/netfilter/nf_flow_table_offload.c   | 10 +++++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index 5677f21fdd4c..a971d986a75b 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -8,6 +8,7 @@ struct netns_nftables {
 	u8			gencursor;
 	atomic_t		count_hw;
 	int			max_hw;
+	atomic_t		count_wq_add;
 };
 
 #endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index af0dea471119..fe2327823f7a 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -621,6 +621,7 @@ enum nf_ct_sysctl_index {
 #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
 	NF_SYSCTL_CT_COUNT_HW,
 	NF_SYSCTL_CT_MAX_HW,
+	NF_SYSCTL_CT_COUNT_WQ_ADD,
 #endif
 
 	__NF_SYSCTL_CT_LAST_SYSCTL,
@@ -991,6 +992,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	[NF_SYSCTL_CT_COUNT_WQ_ADD] = {
+		.procname	= "nf_flowtable_count_wq_add",
+		.maxlen		= sizeof(int),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{}
 };
@@ -1131,6 +1138,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
 	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
 	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
+	table[NF_SYSCTL_CT_COUNT_WQ_ADD].data = &net->nft.count_wq_add;
 #endif
 
 	nf_conntrack_standalone_init_tcp_sysctl(net, table);
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index b561e0a44a45..ffbcf0cfefeb 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -953,11 +953,15 @@ static void flow_offload_work_stats(struct flow_offload_work *offload)
 static void flow_offload_work_handler(struct work_struct *work)
 {
 	struct flow_offload_work *offload;
+	struct net *net;
 
 	offload = container_of(work, struct flow_offload_work, work);
+	net = read_pnet(&offload->flowtable->net);
+
 	switch (offload->cmd) {
 		case FLOW_CLS_REPLACE:
 			flow_offload_work_add(offload);
+			atomic_dec(&net->nft.count_wq_add);
 			break;
 		case FLOW_CLS_DESTROY:
 			flow_offload_work_del(offload);
@@ -1011,11 +1015,15 @@ nf_flow_offload_work_alloc(struct nf_flowtable *flowtable,
 void nf_flow_offload_add(struct nf_flowtable *flowtable,
 			 struct flow_offload *flow)
 {
+	struct net *net = read_pnet(&flowtable->net);
 	struct flow_offload_work *offload;
 
+	atomic_inc(&net->nft.count_wq_add);
 	offload = nf_flow_offload_work_alloc(flowtable, flow, FLOW_CLS_REPLACE);
-	if (!offload)
+	if (!offload) {
+		atomic_dec(&net->nft.count_wq_add);
 		return;
+	}
 
 	flow_offload_queue_work(offload);
 }
-- 
2.31.1



* [PATCH net-next 5/8] netfilter: introduce max count of hw offload 'add' workqueue tasks
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
                   ` (3 preceding siblings ...)
  2022-02-22 15:09 ` [PATCH net-next 4/8] netfilter: introduce total count of hw offload 'add' workqueue tasks Vlad Buslov
@ 2022-02-22 15:10 ` Vlad Buslov
  2022-03-07 22:43   ` Pablo Neira Ayuso
  2022-02-22 15:10 ` [PATCH net-next 6/8] netfilter: introduce total count of hw offload 'del' " Vlad Buslov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:10 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

To improve hardware offload debuggability, extend struct netns_nftables
with a 'max_wq_add' field and expose it to userspace as the
'nf_flowtable_max_wq_add' sysctl entry. Verify that the count_wq_add
value is less than max_wq_add and don't schedule a new flow table entry
'add' task on the workqueue if that is not the case.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 include/net/netns/nftables.h            | 1 +
 net/netfilter/nf_conntrack_standalone.c | 9 +++++++++
 net/netfilter/nf_flow_table_offload.c   | 9 ++++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index a971d986a75b..ce270803ef27 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -9,6 +9,7 @@ struct netns_nftables {
 	atomic_t		count_hw;
 	int			max_hw;
 	atomic_t		count_wq_add;
+	int			max_wq_add;
 };
 
 #endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index fe2327823f7a..26e2b86eb060 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -622,6 +622,7 @@ enum nf_ct_sysctl_index {
 	NF_SYSCTL_CT_COUNT_HW,
 	NF_SYSCTL_CT_MAX_HW,
 	NF_SYSCTL_CT_COUNT_WQ_ADD,
+	NF_SYSCTL_CT_MAX_WQ_ADD,
 #endif
 
 	__NF_SYSCTL_CT_LAST_SYSCTL,
@@ -998,6 +999,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_dointvec,
 	},
+	[NF_SYSCTL_CT_MAX_WQ_ADD] = {
+		.procname	= "nf_flowtable_max_wq_add",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{}
 };
@@ -1139,6 +1146,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
 	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
 	table[NF_SYSCTL_CT_COUNT_WQ_ADD].data = &net->nft.count_wq_add;
+	table[NF_SYSCTL_CT_MAX_WQ_ADD].data = &net->nft.max_wq_add;
 #endif
 
 	nf_conntrack_standalone_init_tcp_sysctl(net, table);
@@ -1153,6 +1161,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 		table[NF_SYSCTL_CT_BUCKETS].mode = 0444;
 #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
 		table[NF_SYSCTL_CT_MAX_HW].mode = 0444;
+		table[NF_SYSCTL_CT_MAX_WQ_ADD].mode = 0444;
 #endif
 	}
 
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index ffbcf0cfefeb..e29aa51696f5 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -1016,9 +1016,16 @@ void nf_flow_offload_add(struct nf_flowtable *flowtable,
 			 struct flow_offload *flow)
 {
 	struct net *net = read_pnet(&flowtable->net);
+	int max_wq_add = net->nft.max_wq_add;
 	struct flow_offload_work *offload;
+	int count_wq_add;
+
+	count_wq_add = atomic_inc_return(&net->nft.count_wq_add);
+	if (max_wq_add && count_wq_add > max_wq_add) {
+		atomic_dec(&net->nft.count_wq_add);
+		return;
+	}
 
-	atomic_inc(&net->nft.count_wq_add);
 	offload = nf_flow_offload_work_alloc(flowtable, flow, FLOW_CLS_REPLACE);
 	if (!offload) {
 		atomic_dec(&net->nft.count_wq_add);
-- 
2.31.1



* [PATCH net-next 6/8] netfilter: introduce total count of hw offload 'del' workqueue tasks
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
                   ` (4 preceding siblings ...)
  2022-02-22 15:10 ` [PATCH net-next 5/8] netfilter: introduce max " Vlad Buslov
@ 2022-02-22 15:10 ` Vlad Buslov
  2022-02-22 15:10 ` [PATCH net-next 7/8] netfilter: introduce total count of hw offload 'stats' wq tasks Vlad Buslov
  2022-02-22 15:10 ` [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints Vlad Buslov
  7 siblings, 0 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:10 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

To improve hardware offload debuggability, extend struct netns_nftables
with a 'count_wq_del' counter and expose it to userspace as the
'nf_flowtable_count_wq_del' sysctl entry. Increment the counter when
allocating a new workqueue task and decrement it after
flow_offload_work_del() has finished.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 include/net/netns/nftables.h            | 1 +
 net/netfilter/nf_conntrack_standalone.c | 8 ++++++++
 net/netfilter/nf_flow_table_offload.c   | 3 +++
 3 files changed, 12 insertions(+)

diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index ce270803ef27..ad6058d2b5f9 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -10,6 +10,7 @@ struct netns_nftables {
 	int			max_hw;
 	atomic_t		count_wq_add;
 	int			max_wq_add;
+	atomic_t		count_wq_del;
 };
 
 #endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 26e2b86eb060..cebdf389f7eb 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -623,6 +623,7 @@ enum nf_ct_sysctl_index {
 	NF_SYSCTL_CT_MAX_HW,
 	NF_SYSCTL_CT_COUNT_WQ_ADD,
 	NF_SYSCTL_CT_MAX_WQ_ADD,
+	NF_SYSCTL_CT_COUNT_WQ_DEL,
 #endif
 
 	__NF_SYSCTL_CT_LAST_SYSCTL,
@@ -1005,6 +1006,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	[NF_SYSCTL_CT_COUNT_WQ_DEL] = {
+		.procname	= "nf_flowtable_count_wq_del",
+		.maxlen		= sizeof(int),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{}
 };
@@ -1147,6 +1154,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
 	table[NF_SYSCTL_CT_COUNT_WQ_ADD].data = &net->nft.count_wq_add;
 	table[NF_SYSCTL_CT_MAX_WQ_ADD].data = &net->nft.max_wq_add;
+	table[NF_SYSCTL_CT_COUNT_WQ_DEL].data = &net->nft.count_wq_del;
 #endif
 
 	nf_conntrack_standalone_init_tcp_sysctl(net, table);
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index e29aa51696f5..82cd78c8eb25 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -965,6 +965,7 @@ static void flow_offload_work_handler(struct work_struct *work)
 			break;
 		case FLOW_CLS_DESTROY:
 			flow_offload_work_del(offload);
+			atomic_dec(&net->nft.count_wq_del);
 			break;
 		case FLOW_CLS_STATS:
 			flow_offload_work_stats(offload);
@@ -1038,12 +1039,14 @@ void nf_flow_offload_add(struct nf_flowtable *flowtable,
 void nf_flow_offload_del(struct nf_flowtable *flowtable,
 			 struct flow_offload *flow)
 {
+	struct net *net = read_pnet(&flowtable->net);
 	struct flow_offload_work *offload;
 
 	offload = nf_flow_offload_work_alloc(flowtable, flow, FLOW_CLS_DESTROY);
 	if (!offload)
 		return;
 
+	atomic_inc(&net->nft.count_wq_del);
 	set_bit(NF_FLOW_HW_DYING, &flow->flags);
 	flow_offload_queue_work(offload);
 }
-- 
2.31.1



* [PATCH net-next 7/8] netfilter: introduce total count of hw offload 'stats' wq tasks
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
                   ` (5 preceding siblings ...)
  2022-02-22 15:10 ` [PATCH net-next 6/8] netfilter: introduce total count of hw offload 'del' " Vlad Buslov
@ 2022-02-22 15:10 ` Vlad Buslov
  2022-02-22 15:10 ` [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints Vlad Buslov
  7 siblings, 0 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:10 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

To improve hardware offload debuggability, extend struct netns_nftables
with a 'count_wq_stats' counter and expose it to userspace as the
'nf_flowtable_count_wq_stats' sysctl entry. Increment the counter when
allocating a new workqueue task and decrement it after
flow_offload_work_stats() has finished.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 include/net/netns/nftables.h            | 1 +
 net/netfilter/nf_conntrack_standalone.c | 8 ++++++++
 net/netfilter/nf_flow_table_offload.c   | 3 +++
 3 files changed, 12 insertions(+)

diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index ad6058d2b5f9..60b3905c0ae9 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -11,6 +11,7 @@ struct netns_nftables {
 	atomic_t		count_wq_add;
 	int			max_wq_add;
 	atomic_t		count_wq_del;
+	atomic_t		count_wq_stats;
 };
 
 #endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index cebdf389f7eb..4ec49ff6593c 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -624,6 +624,7 @@ enum nf_ct_sysctl_index {
 	NF_SYSCTL_CT_COUNT_WQ_ADD,
 	NF_SYSCTL_CT_MAX_WQ_ADD,
 	NF_SYSCTL_CT_COUNT_WQ_DEL,
+	NF_SYSCTL_CT_COUNT_WQ_STATS,
 #endif
 
 	__NF_SYSCTL_CT_LAST_SYSCTL,
@@ -1012,6 +1013,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_dointvec,
 	},
+	[NF_SYSCTL_CT_COUNT_WQ_STATS] = {
+		.procname	= "nf_flowtable_count_wq_stats",
+		.maxlen		= sizeof(int),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{}
 };
@@ -1155,6 +1162,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 	table[NF_SYSCTL_CT_COUNT_WQ_ADD].data = &net->nft.count_wq_add;
 	table[NF_SYSCTL_CT_MAX_WQ_ADD].data = &net->nft.max_wq_add;
 	table[NF_SYSCTL_CT_COUNT_WQ_DEL].data = &net->nft.count_wq_del;
+	table[NF_SYSCTL_CT_COUNT_WQ_STATS].data = &net->nft.count_wq_stats;
 #endif
 
 	nf_conntrack_standalone_init_tcp_sysctl(net, table);
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index 82cd78c8eb25..ff52d903aad9 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -969,6 +969,7 @@ static void flow_offload_work_handler(struct work_struct *work)
 			break;
 		case FLOW_CLS_STATS:
 			flow_offload_work_stats(offload);
+			atomic_dec(&net->nft.count_wq_stats);
 			break;
 		default:
 			WARN_ON_ONCE(1);
@@ -1054,6 +1055,7 @@ void nf_flow_offload_del(struct nf_flowtable *flowtable,
 void nf_flow_offload_stats(struct nf_flowtable *flowtable,
 			   struct flow_offload *flow)
 {
+	struct net *net = read_pnet(&flowtable->net);
 	struct flow_offload_work *offload;
 	__s32 delta;
 
@@ -1065,6 +1067,7 @@ void nf_flow_offload_stats(struct nf_flowtable *flowtable,
 	if (!offload)
 		return;
 
+	atomic_inc(&net->nft.count_wq_stats);
 	flow_offload_queue_work(offload);
 }
 
-- 
2.31.1



* [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints
  2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
                   ` (6 preceding siblings ...)
  2022-02-22 15:10 ` [PATCH net-next 7/8] netfilter: introduce total count of hw offload 'stats' wq tasks Vlad Buslov
@ 2022-02-22 15:10 ` Vlad Buslov
  2022-03-07 22:49   ` Pablo Neira Ayuso
  7 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-02-22 15:10 UTC (permalink / raw)
  To: netfilter-devel; +Cc: pablo, kadlec, fw, ozsh, paulb, Vlad Buslov

Add tracepoints to trace the creation and start of execution of
flowtable hardware offload 'add', 'del' and 'stats' tasks. Move struct
flow_offload_work from the source file into the header file to allow
access to its fields from the tracepoint code.
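
Once applied, the events are expected to show up under the 'nf' trace
system (per TRACE_SYSTEM below). A minimal userspace sketch for enabling
them via tracefs (assumes tracefs is mounted at /sys/kernel/tracing):

#include <stdio.h>

/* enable a tracepoint by writing '1' to its tracefs 'enable' file */
static int enable_event(const char *name)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/tracing/events/nf/%s/enable", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs("1", f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* event names as defined in this patch */
	enable_event("flow_offload_add");
	enable_event("flow_offload_work_add");
	return 0;
}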

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
---
 include/net/netfilter/nf_flow_table.h       |  9 ++++
 net/netfilter/nf_flow_table_offload.c       | 20 +++++----
 net/netfilter/nf_flow_table_offload_trace.h | 48 +++++++++++++++++++++
 3 files changed, 68 insertions(+), 9 deletions(-)
 create mode 100644 net/netfilter/nf_flow_table_offload_trace.h

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index a3647fadf1cc..5e2aef34acaa 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -174,6 +174,15 @@ struct flow_offload {
 	struct rcu_head				rcu_head;
 };
 
+struct flow_offload_work {
+	struct list_head list;
+	enum flow_cls_command cmd;
+	int priority;
+	struct nf_flowtable *flowtable;
+	struct flow_offload *flow;
+	struct work_struct work;
+};
+
 #define NF_FLOW_TIMEOUT (30 * HZ)
 #define nf_flowtable_time_stamp	(u32)jiffies
 
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index ff52d903aad9..bf94050d5b54 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -12,20 +12,13 @@
 #include <net/netfilter/nf_conntrack_acct.h>
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_tuple.h>
+#define CREATE_TRACE_POINTS
+#include "nf_flow_table_offload_trace.h"
 
 static struct workqueue_struct *nf_flow_offload_add_wq;
 static struct workqueue_struct *nf_flow_offload_del_wq;
 static struct workqueue_struct *nf_flow_offload_stats_wq;
 
-struct flow_offload_work {
-	struct list_head	list;
-	enum flow_cls_command	cmd;
-	int			priority;
-	struct nf_flowtable	*flowtable;
-	struct flow_offload	*flow;
-	struct work_struct	work;
-};
-
 #define NF_FLOW_DISSECTOR(__match, __type, __field)	\
 	(__match)->dissector.offset[__type] =		\
 		offsetof(struct nf_flow_key, __field)
@@ -895,6 +888,8 @@ static void flow_offload_work_add(struct flow_offload_work *offload)
 	struct nf_flow_rule *flow_rule[FLOW_OFFLOAD_DIR_MAX];
 	int err;
 
+	trace_flow_offload_work_add(offload);
+
 	err = nf_flow_offload_alloc(offload, flow_rule);
 	if (err < 0)
 		return;
@@ -911,6 +906,8 @@ static void flow_offload_work_add(struct flow_offload_work *offload)
 
 static void flow_offload_work_del(struct flow_offload_work *offload)
 {
+	trace_flow_offload_work_del(offload);
+
 	clear_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status);
 	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_ORIGINAL);
 	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_REPLY);
@@ -931,6 +928,8 @@ static void flow_offload_work_stats(struct flow_offload_work *offload)
 	struct flow_stats stats[FLOW_OFFLOAD_DIR_MAX] = {};
 	u64 lastused;
 
+	trace_flow_offload_work_stats(offload);
+
 	flow_offload_tuple_stats(offload, FLOW_OFFLOAD_DIR_ORIGINAL, &stats[0]);
 	flow_offload_tuple_stats(offload, FLOW_OFFLOAD_DIR_REPLY, &stats[1]);
 
@@ -1034,6 +1033,7 @@ void nf_flow_offload_add(struct nf_flowtable *flowtable,
 		return;
 	}
 
+	trace_flow_offload_add(offload);
 	flow_offload_queue_work(offload);
 }
 
@@ -1048,6 +1048,7 @@ void nf_flow_offload_del(struct nf_flowtable *flowtable,
 		return;
 
 	atomic_inc(&net->nft.count_wq_del);
+	trace_flow_offload_del(offload);
 	set_bit(NF_FLOW_HW_DYING, &flow->flags);
 	flow_offload_queue_work(offload);
 }
@@ -1068,6 +1069,7 @@ void nf_flow_offload_stats(struct nf_flowtable *flowtable,
 		return;
 
 	atomic_inc(&net->nft.count_wq_stats);
+	trace_flow_offload_stats(offload);
 	flow_offload_queue_work(offload);
 }
 
diff --git a/net/netfilter/nf_flow_table_offload_trace.h b/net/netfilter/nf_flow_table_offload_trace.h
new file mode 100644
index 000000000000..49cfbc2ec35d
--- /dev/null
+++ b/net/netfilter/nf_flow_table_offload_trace.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nf
+
+#if !defined(_NF_FLOW_TABLE_OFFLOAD_TRACE_) || defined(TRACE_HEADER_MULTI_READ)
+#define _NF_FLOW_TABLE_OFFLOAD_TRACE_
+
+#include <linux/tracepoint.h>
+#include <net/netfilter/nf_tables.h>
+
+DECLARE_EVENT_CLASS(
+	nf_flow_offload_work_template,
+	TP_PROTO(struct flow_offload_work *w),
+	TP_ARGS(w),
+	TP_STRUCT__entry(
+		__field(void *, work)
+		__field(void *, flowtable)
+		__field(void *, flow)
+	),
+	TP_fast_assign(
+		__entry->work = w;
+		__entry->flowtable = w->flowtable;
+		__entry->flow = w->flow;
+	),
+	TP_printk("work=%p flowtable=%p flow=%p",
+		  __entry->work, __entry->flowtable, __entry->flow)
+);
+
+#define DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(name)				\
+	DEFINE_EVENT(nf_flow_offload_work_template, name,		\
+		     TP_PROTO(struct flow_offload_work *w), TP_ARGS(w))
+
+DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_add);
+DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_add);
+DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_del);
+DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_del);
+DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_stats);
+DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_stats);
+
+#endif
+
+/* This part must be outside protection */
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../net/netfilter
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE nf_flow_table_offload_trace
+#include <trace/define_trace.h>
-- 
2.31.1



* Re: [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
  2022-02-22 15:09 ` [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Vlad Buslov
@ 2022-03-07 21:09   ` Pablo Neira Ayuso
  0 siblings, 0 replies; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-07 21:09 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue, Feb 22, 2022 at 05:09:56PM +0200, Vlad Buslov wrote:
> Following patches in the series use this pointer to access flow table
> offload debug variables.
> 
> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
> Reviewed-by: Paul Blakey <paulb@nvidia.com>

Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>

> ---
>  net/sched/act_ct.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 7108e71ce4db..a2624f6c4d92 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -277,7 +277,7 @@ static struct nf_flowtable_type flowtable_ct = {
>  	.owner		= THIS_MODULE,
>  };
>  
> -static int tcf_ct_flow_table_get(struct tcf_ct_params *params)
> +static int tcf_ct_flow_table_get(struct net *net, struct tcf_ct_params *params)
>  {
>  	struct tcf_ct_flow_table *ct_ft;
>  	int err = -ENOMEM;
> @@ -303,6 +303,7 @@ static int tcf_ct_flow_table_get(struct tcf_ct_params *params)
>  	err = nf_flow_table_init(&ct_ft->nf_ft);
>  	if (err)
>  		goto err_init;
> +	write_pnet(&ct_ft->nf_ft.net, net);
>  
>  	__module_get(THIS_MODULE);
>  out_unlock:
> @@ -1321,7 +1322,7 @@ static int tcf_ct_init(struct net *net, struct nlattr *nla,
>  	if (err)
>  		goto cleanup;
>  
> -	err = tcf_ct_flow_table_get(params);
> +	err = tcf_ct_flow_table_get(net, params);
>  	if (err)
>  		goto cleanup;
>  
> -- 
> 2.31.1
> 


* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-02-22 15:09 ` [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries Vlad Buslov
@ 2022-03-07 21:47   ` Pablo Neira Ayuso
  2022-03-12 18:56     ` Vlad Buslov
  2022-03-07 21:56   ` Pablo Neira Ayuso
  1 sibling, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-07 21:47 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue, Feb 22, 2022 at 05:09:57PM +0200, Vlad Buslov wrote:
> To improve hardware offload debuggability, and to allow the following
> patch to cap the total amount of offloaded entries, extend struct
> netns_nftables with a 'count_hw' counter and expose it to userspace as
> the 'nf_flowtable_count_hw' sysctl entry. Increment the counter
> together with setting the NF_FLOW_HW flag when scheduling an offload
> add task on the workqueue, and decrement it after successfully
> scheduling an offload del task.
> 
> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
> Reviewed-by: Paul Blakey <paulb@nvidia.com>
> ---
>  include/net/netns/nftables.h            |  1 +
>  net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
>  net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
> index 8c77832d0240..262b8b3213cb 100644
> --- a/include/net/netns/nftables.h
> +++ b/include/net/netns/nftables.h
> @@ -6,6 +6,7 @@
>  
>  struct netns_nftables {
>  	u8			gencursor;
> +	atomic_t		count_hw;
>  };
>  
>  #endif
> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
> index 3e1afd10a9b6..8cd55d502ffd 100644
> --- a/net/netfilter/nf_conntrack_standalone.c
> +++ b/net/netfilter/nf_conntrack_standalone.c
> @@ -618,6 +618,9 @@ enum nf_ct_sysctl_index {
>  #ifdef CONFIG_LWTUNNEL
>  	NF_SYSCTL_CT_LWTUNNEL,
>  #endif
> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
> +	NF_SYSCTL_CT_COUNT_HW,
> +#endif
>  
>  	__NF_SYSCTL_CT_LAST_SYSCTL,
>  };
> @@ -973,6 +976,14 @@ static struct ctl_table nf_ct_sysctl_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= nf_hooks_lwtunnel_sysctl_handler,
>  	},
> +#endif
> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
> +	[NF_SYSCTL_CT_COUNT_HW] = {
> +		.procname	= "nf_flowtable_count_hw",
> +		.maxlen		= sizeof(int),
> +		.mode		= 0444,
> +		.proc_handler	= proc_dointvec,
> +	},
>  #endif
>  	{}
>  };
> @@ -1111,6 +1122,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_STREAM].data = &un->timeouts[UDP_CT_REPLIED];
>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
> +	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
>  #endif
>  
>  	nf_conntrack_standalone_init_tcp_sysctl(net, table);
> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> index b90eca7a2f22..9f2b68bc6f80 100644
> --- a/net/netfilter/nf_flow_table_core.c
> +++ b/net/netfilter/nf_flow_table_core.c
> @@ -315,6 +315,9 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
>  	nf_ct_offload_timeout(flow->ct);
>  
>  	if (nf_flowtable_hw_offload(flow_table)) {
> +		struct net *net = read_pnet(&flow_table->net);
> +
> +		atomic_inc(&net->nft.count_hw);
>  		__set_bit(NF_FLOW_HW, &flow->flags);
>  		nf_flow_offload_add(flow_table, flow);

Better to increment this new counter from flow_offload_work_add()?
There is offload->flowtable->net, you could use it from there to bump
the counter once the flow is actually placed in hardware (i.e. when
IPS_HW_OFFLOAD_BIT is set).

That would also move the atomic operation away from the packet path.
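
Something along these lines (rough sketch of the suggestion; the body
of flow_offload_work_add() is abridged from memory and untested):

static void flow_offload_work_add(struct flow_offload_work *offload)
{
	struct net *net = read_pnet(&offload->flowtable->net);
	struct nf_flow_rule *flow_rule[FLOW_OFFLOAD_DIR_MAX];
	int err;

	err = nf_flow_offload_alloc(offload, flow_rule);
	if (err < 0)
		return;

	err = flow_offload_rule_add(offload, flow_rule);
	if (err < 0)
		goto out;

	set_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status);
	/* count the entry only once it is actually placed in hardware */
	atomic_inc(&net->nft.count_hw);
out:
	nf_flow_offload_destroy(flow_rule);
}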

>  	}
> @@ -462,10 +465,15 @@ static void nf_flow_offload_gc_step(struct flow_offload *flow, void *data)
>  
>  	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
>  		if (test_bit(NF_FLOW_HW, &flow->flags)) {
> -			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags))
> +			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags)) {
> +				struct net *net = read_pnet(&flow_table->net);
> +
>  				nf_flow_offload_del(flow_table, flow);
> -			else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags))
> +				if (test_bit(NF_FLOW_HW_DYING, &flow->flags))
> +					atomic_dec(&net->nft.count_hw);

Same here: I'd suggest decrementing it from flow_offload_work_del().

> +			} else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags)) {
>  				flow_offload_del(flow_table, flow);
> +			}
>  		} else {
>  			flow_offload_del(flow_table, flow);
>  		}
> -- 
> 2.31.1
> 


* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-02-22 15:09 ` [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries Vlad Buslov
  2022-03-07 21:47   ` Pablo Neira Ayuso
@ 2022-03-07 21:56   ` Pablo Neira Ayuso
  2022-03-12 19:51     ` Vlad Buslov
  1 sibling, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-07 21:56 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue, Feb 22, 2022 at 05:09:57PM +0200, Vlad Buslov wrote:
> To improve hardware offload debuggability, and to allow the following
> patch to cap the total amount of offloaded entries, extend struct
> netns_nftables with a 'count_hw' counter and expose it to userspace as
> the 'nf_flowtable_count_hw' sysctl entry. Increment the counter
> together with setting the NF_FLOW_HW flag when scheduling an offload
> add task on the workqueue, and decrement it after successfully
> scheduling an offload del task.
> 
> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
> Reviewed-by: Paul Blakey <paulb@nvidia.com>
> ---
>  include/net/netns/nftables.h            |  1 +
>  net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
>  net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
> index 8c77832d0240..262b8b3213cb 100644
> --- a/include/net/netns/nftables.h
> +++ b/include/net/netns/nftables.h
> @@ -6,6 +6,7 @@
>  
>  struct netns_nftables {
>  	u8			gencursor;
> +	atomic_t		count_hw;

In addition to the previous comments: I'd suggest using
register_pernet_subsys() and registering the sysctl from the
nf_flow_table_offload.c file, through nf_flow_table_offload_init(),
instead of using the conntrack nf_ct_sysctl_table[].

That would require a bit more work though.
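
Roughly like this (shape only, untested; the 'ft_sysctl_header' field
and the table below are made up for the sketch):

static struct ctl_table nf_flow_table_sysctl_table[] = {
	{
		.procname	= "nf_flowtable_count_hw",
		.maxlen		= sizeof(int),
		.mode		= 0444,
		.proc_handler	= proc_dointvec,
	},
	{}
};

static int __net_init nf_flow_table_offload_net_init(struct net *net)
{
	struct ctl_table *table;

	table = kmemdup(nf_flow_table_sysctl_table,
			sizeof(nf_flow_table_sysctl_table), GFP_KERNEL);
	if (!table)
		return -ENOMEM;

	table[0].data = &net->nft.count_hw;
	/* 'ft_sysctl_header' would be a new field in struct netns_nftables */
	net->nft.ft_sysctl_header = register_net_sysctl(net, "net/netfilter",
							table);
	if (!net->nft.ft_sysctl_header) {
		kfree(table);
		return -ENOMEM;
	}
	return 0;
}

static void __net_exit nf_flow_table_offload_net_exit(struct net *net)
{
	struct ctl_table *table = net->nft.ft_sysctl_header->ctl_table_arg;

	unregister_net_sysctl_table(net->nft.ft_sysctl_header);
	kfree(table);
}

static struct pernet_operations nf_flow_table_offload_net_ops = {
	.init	= nf_flow_table_offload_net_init,
	.exit	= nf_flow_table_offload_net_exit,
};

with register_pernet_subsys(&nf_flow_table_offload_net_ops) called from
nf_flow_table_offload_init().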


* Re: [PATCH net-next 3/8] netfilter: introduce max count of hw offloaded flow table entries
  2022-02-22 15:09 ` [PATCH net-next 3/8] netfilter: introduce max " Vlad Buslov
@ 2022-03-07 22:13   ` Pablo Neira Ayuso
  2022-03-12 19:32     ` Vlad Buslov
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-07 22:13 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue, Feb 22, 2022 at 05:09:58PM +0200, Vlad Buslov wrote:
> To improve hardware offload debuggability, extend struct
> netns_nftables with a 'max_hw' field and expose it to userspace as the
> 'nf_flowtable_max_hw' sysctl entry. Verify that the count_hw value is
> less than max_hw and don't offload a new flow table entry to hardware
> if that is not the case. Flows that were not offloaded because
> count_hw exceeded the configured maximum can still be offloaded via
> the refresh function. Mark such flows with the NF_FLOW_HW bit and only
> count them once by checking that the bit was not previously set.

Fine with me. Keep in mind there is also an implicit cap with the
maximum number of conntrack entries (nf_conntrack_max).

> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
> Reviewed-by: Paul Blakey <paulb@nvidia.com>
> ---
>  include/net/netns/nftables.h            |  1 +
>  net/netfilter/nf_conntrack_standalone.c | 11 +++++++++++
>  net/netfilter/nf_flow_table_core.c      | 25 +++++++++++++++++++++++--
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
> index 262b8b3213cb..5677f21fdd4c 100644
> --- a/include/net/netns/nftables.h
> +++ b/include/net/netns/nftables.h
> @@ -7,6 +7,7 @@
>  struct netns_nftables {
>  	u8			gencursor;
>  	atomic_t		count_hw;
> +	int			max_hw;
>  };
>  
>  #endif
> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
> index 8cd55d502ffd..af0dea471119 100644
> --- a/net/netfilter/nf_conntrack_standalone.c
> +++ b/net/netfilter/nf_conntrack_standalone.c
> @@ -620,6 +620,7 @@ enum nf_ct_sysctl_index {
>  #endif
>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>  	NF_SYSCTL_CT_COUNT_HW,
> +	NF_SYSCTL_CT_MAX_HW,
>  #endif
>  
>  	__NF_SYSCTL_CT_LAST_SYSCTL,
> @@ -984,6 +985,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
>  		.mode		= 0444,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	[NF_SYSCTL_CT_MAX_HW] = {
> +		.procname	= "nf_flowtable_max_hw",
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  #endif
>  	{}
>  };
> @@ -1123,6 +1130,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
>  	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
> +	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
>  #endif
>  
>  	nf_conntrack_standalone_init_tcp_sysctl(net, table);
> @@ -1135,6 +1143,9 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>  		table[NF_SYSCTL_CT_MAX].mode = 0444;
>  		table[NF_SYSCTL_CT_EXPECT_MAX].mode = 0444;
>  		table[NF_SYSCTL_CT_BUCKETS].mode = 0444;
> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
> +		table[NF_SYSCTL_CT_MAX_HW].mode = 0444;
> +#endif
>  	}
>  
>  	cnet->sysctl_header = register_net_sysctl(net, "net/netfilter", table);
> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> index 9f2b68bc6f80..2631bd0ae9ae 100644
> --- a/net/netfilter/nf_flow_table_core.c
> +++ b/net/netfilter/nf_flow_table_core.c
> @@ -290,6 +290,20 @@ unsigned long flow_offload_get_timeout(struct flow_offload *flow)
>  	return timeout;
>  }
>  
> +static bool flow_offload_inc_count_hw(struct nf_flowtable *flow_table)
> +{
> +	struct net *net = read_pnet(&flow_table->net);
> +	int max_hw = net->nft.max_hw, count_hw;
> +
> +	count_hw = atomic_inc_return(&net->nft.count_hw);
> +	if (max_hw && count_hw > max_hw) {
> +		atomic_dec(&net->nft.count_hw);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
>  int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
>  {
>  	int err;
> @@ -315,9 +329,9 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
>  	nf_ct_offload_timeout(flow->ct);
>  
>  	if (nf_flowtable_hw_offload(flow_table)) {
> -		struct net *net = read_pnet(&flow_table->net);
> +		if (!flow_offload_inc_count_hw(flow_table))
> +			return 0;

Maybe here something like:

                if (atomic_read(count_hw) > max_hw)
                        return 0;

to catch the overlimit early.

Then use the logic I suggested for patch 2/8, i.e. increment/decrement
this counter from the work handlers?
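
For instance (sketch of the early check; the helper name is made up
here and this is untested):

static bool nf_flow_table_hw_overlimit(struct nf_flowtable *flow_table)
{
	struct net *net = read_pnet(&flow_table->net);
	int max_hw = READ_ONCE(net->nft.max_hw);

	/* cheap check for the packet path; the authoritative
	 * increment/decrement would live in the work handlers as
	 * suggested for patch 2/8
	 */
	return max_hw && atomic_read(&net->nft.count_hw) >= max_hw;
}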

> -		atomic_inc(&net->nft.count_hw);
>  		__set_bit(NF_FLOW_HW, &flow->flags);
>  		nf_flow_offload_add(flow_table, flow);
>  	}
> @@ -329,6 +343,7 @@ EXPORT_SYMBOL_GPL(flow_offload_add);
>  void flow_offload_refresh(struct nf_flowtable *flow_table,
>  			  struct flow_offload *flow)
>  {
> +	struct net *net = read_pnet(&flow_table->net);
>  	u32 timeout;
>  
>  	timeout = nf_flowtable_time_stamp + flow_offload_get_timeout(flow);
> @@ -338,6 +353,12 @@ void flow_offload_refresh(struct nf_flowtable *flow_table,
>  	if (likely(!nf_flowtable_hw_offload(flow_table)))
>  		return;
>  
> +	if (!flow_offload_inc_count_hw(flow_table))

This is the packet path; better to avoid this atomic operation if
possible.
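
For example, testing the bit first would keep the common refresh case
free of atomic read-modify-write operations (fragment only, untested):

	if (!test_bit(NF_FLOW_HW, &flow->flags)) {
		if (!flow_offload_inc_count_hw(flow_table))
			return;
		/* lost the race: another CPU set the bit and counted */
		if (test_and_set_bit(NF_FLOW_HW, &flow->flags))
			atomic_dec(&net->nft.count_hw);
	}
	nf_flow_offload_add(flow_table, flow);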

> +		return;
> +	/* only count each flow once when setting NF_FLOW_HW bit */
> +	if (test_and_set_bit(NF_FLOW_HW, &flow->flags))
> +		atomic_dec(&net->nft.count_hw);
> +
>  	nf_flow_offload_add(flow_table, flow);
>  }
>  EXPORT_SYMBOL_GPL(flow_offload_refresh);
> -- 
> 2.31.1
> 


* Re: [PATCH net-next 5/8] netfilter: introduce max count of hw offload 'add' workqueue tasks
  2022-02-22 15:10 ` [PATCH net-next 5/8] netfilter: introduce max " Vlad Buslov
@ 2022-03-07 22:43   ` Pablo Neira Ayuso
  2022-03-12 19:59     ` Vlad Buslov
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-07 22:43 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue, Feb 22, 2022 at 05:10:00PM +0200, Vlad Buslov wrote:
> To improve hardware offload debuggability extend struct netns_nftables with
> 'max_wq_add' counter and expose it to userspace as
> 'nf_flowtable_max_wq_add' sysctl entry. Verify that count_wq_add value is
> less than max_wq_add and don't schedule new flow table entry 'add' task to
> workqueue, if this is not the case.

For this toggle, I'm not sure what value I would select; this maximum
number of in-flight work objects might be a difficult guess for users.
It is disabled by default, but what reasonable limit could be set?

Moreover, there are also tc-police and ratelimit, which could be
combined in the ruleset to throttle the number of entries that are
created in the flowtable, i.e. the packet is accepted but the tc ct
action is skipped. I agree this would not specifically restrict the
number of in-flight pending work items though, since that depends on
how fast the worker empties the queue.

My understanding is that the goal of this toggle is to tackle a
scenario where the network creates a traffic pattern that puts high
load on the workqueue (e.g. lots of small flows per second). In that
case, it is likely safer to add an explicit ratelimit that skips the
tc ct action in case of stress, and probably easier to reason about?

> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
> Reviewed-by: Paul Blakey <paulb@nvidia.com>
> ---
>  include/net/netns/nftables.h            | 1 +
>  net/netfilter/nf_conntrack_standalone.c | 9 +++++++++
>  net/netfilter/nf_flow_table_offload.c   | 9 ++++++++-
>  3 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
> index a971d986a75b..ce270803ef27 100644
> --- a/include/net/netns/nftables.h
> +++ b/include/net/netns/nftables.h
> @@ -9,6 +9,7 @@ struct netns_nftables {
>  	atomic_t		count_hw;
>  	int			max_hw;
>  	atomic_t		count_wq_add;
> +	int			max_wq_add;
>  };
>
>  #endif
> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
> index fe2327823f7a..26e2b86eb060 100644
> --- a/net/netfilter/nf_conntrack_standalone.c
> +++ b/net/netfilter/nf_conntrack_standalone.c
> @@ -622,6 +622,7 @@ enum nf_ct_sysctl_index {
>  	NF_SYSCTL_CT_COUNT_HW,
>  	NF_SYSCTL_CT_MAX_HW,
>  	NF_SYSCTL_CT_COUNT_WQ_ADD,
> +	NF_SYSCTL_CT_MAX_WQ_ADD,
>  #endif
>
>  	__NF_SYSCTL_CT_LAST_SYSCTL,
> @@ -998,6 +999,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
>  		.mode		= 0444,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	[NF_SYSCTL_CT_MAX_WQ_ADD] = {
> +		.procname	= "nf_flowtable_max_wq_add",
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  #endif
>  	{}
>  };
> @@ -1139,6 +1146,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>  	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
>  	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
>  	table[NF_SYSCTL_CT_COUNT_WQ_ADD].data = &net->nft.count_wq_add;
> +	table[NF_SYSCTL_CT_MAX_WQ_ADD].data = &net->nft.max_wq_add;
>  #endif
>
>  	nf_conntrack_standalone_init_tcp_sysctl(net, table);
> @@ -1153,6 +1161,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>  		table[NF_SYSCTL_CT_BUCKETS].mode = 0444;
>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>  		table[NF_SYSCTL_CT_MAX_HW].mode = 0444;
> +		table[NF_SYSCTL_CT_MAX_WQ_ADD].mode = 0444;
>  #endif
>  	}
>
> diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
> index ffbcf0cfefeb..e29aa51696f5 100644
> --- a/net/netfilter/nf_flow_table_offload.c
> +++ b/net/netfilter/nf_flow_table_offload.c
> @@ -1016,9 +1016,16 @@ void nf_flow_offload_add(struct nf_flowtable *flowtable,
>  			 struct flow_offload *flow)
>  {
>  	struct net *net = read_pnet(&flowtable->net);
> +	int max_wq_add = net->nft.max_wq_add;
>  	struct flow_offload_work *offload;
> +	int count_wq_add;
> +
> +	count_wq_add = atomic_inc_return(&net->nft.count_wq_add);
> +	if (max_wq_add && count_wq_add > max_wq_add) {
> +		atomic_dec(&net->nft.count_wq_add);
> +		return;
> +	}
>
> -	atomic_inc(&net->nft.count_wq_add);
>  	offload = nf_flow_offload_work_alloc(flowtable, flow, FLOW_CLS_REPLACE);
>  	if (!offload) {
>  		atomic_dec(&net->nft.count_wq_add);
> --
> 2.31.1
>


* Re: [PATCH net-next 4/8] netfilter: introduce total count of hw offload 'add' workqueue tasks
  2022-02-22 15:09 ` [PATCH net-next 4/8] netfilter: introduce total count of hw offload 'add' workqueue tasks Vlad Buslov
@ 2022-03-07 22:46   ` Pablo Neira Ayuso
  0 siblings, 0 replies; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-07 22:46 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue, Feb 22, 2022 at 05:09:59PM +0200, Vlad Buslov wrote:
> To improve hardware offload debuggability, and to allow the following
> patch to cap the total amount of in-flight offload 'add' entries on
> the workqueue, extend struct netns_nftables with a 'count_wq_add'
> counter and expose it to userspace as the 'nf_flowtable_count_wq_add'
> sysctl entry. Increment the counter when allocating a new workqueue
> task and decrement it after flow_offload_work_add() has finished.

No objections to exposing stats as described in patches 4/8, 6/8 and 7/8.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints
  2022-02-22 15:10 ` [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints Vlad Buslov
@ 2022-03-07 22:49   ` Pablo Neira Ayuso
  2022-03-12 20:05     ` Vlad Buslov
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-07 22:49 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue, Feb 22, 2022 at 05:10:03PM +0200, Vlad Buslov wrote:
> Add tracepoints to trace creation and start of execution of flowtable
> hardware offload 'add', 'del' and 'stats' tasks. Move struct
> flow_offload_work from source into header file to allow access to structure
> fields from tracepoint code.

This patch, I would prefer to keep it back and explore exposing trace
infrastructure for the flowtable through netlink.

> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
> Reviewed-by: Paul Blakey <paulb@nvidia.com>
> ---
>  include/net/netfilter/nf_flow_table.h       |  9 ++++
>  net/netfilter/nf_flow_table_offload.c       | 20 +++++----
>  net/netfilter/nf_flow_table_offload_trace.h | 48 +++++++++++++++++++++
>  3 files changed, 68 insertions(+), 9 deletions(-)
>  create mode 100644 net/netfilter/nf_flow_table_offload_trace.h
> 
> diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
> index a3647fadf1cc..5e2aef34acaa 100644
> --- a/include/net/netfilter/nf_flow_table.h
> +++ b/include/net/netfilter/nf_flow_table.h
> @@ -174,6 +174,15 @@ struct flow_offload {
>  	struct rcu_head				rcu_head;
>  };
>  
> +struct flow_offload_work {
> +	struct list_head list;
> +	enum flow_cls_command cmd;
> +	int priority;
> +	struct nf_flowtable *flowtable;
> +	struct flow_offload *flow;
> +	struct work_struct work;
> +};
> +
>  #define NF_FLOW_TIMEOUT (30 * HZ)
>  #define nf_flowtable_time_stamp	(u32)jiffies
>  
> diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
> index ff52d903aad9..bf94050d5b54 100644
> --- a/net/netfilter/nf_flow_table_offload.c
> +++ b/net/netfilter/nf_flow_table_offload.c
> @@ -12,20 +12,13 @@
>  #include <net/netfilter/nf_conntrack_acct.h>
>  #include <net/netfilter/nf_conntrack_core.h>
>  #include <net/netfilter/nf_conntrack_tuple.h>
> +#define CREATE_TRACE_POINTS
> +#include "nf_flow_table_offload_trace.h"
>  
>  static struct workqueue_struct *nf_flow_offload_add_wq;
>  static struct workqueue_struct *nf_flow_offload_del_wq;
>  static struct workqueue_struct *nf_flow_offload_stats_wq;
>  
> -struct flow_offload_work {
> -	struct list_head	list;
> -	enum flow_cls_command	cmd;
> -	int			priority;
> -	struct nf_flowtable	*flowtable;
> -	struct flow_offload	*flow;
> -	struct work_struct	work;
> -};
> -
>  #define NF_FLOW_DISSECTOR(__match, __type, __field)	\
>  	(__match)->dissector.offset[__type] =		\
>  		offsetof(struct nf_flow_key, __field)
> @@ -895,6 +888,8 @@ static void flow_offload_work_add(struct flow_offload_work *offload)
>  	struct nf_flow_rule *flow_rule[FLOW_OFFLOAD_DIR_MAX];
>  	int err;
>  
> +	trace_flow_offload_work_add(offload);
> +
>  	err = nf_flow_offload_alloc(offload, flow_rule);
>  	if (err < 0)
>  		return;
> @@ -911,6 +906,8 @@ static void flow_offload_work_add(struct flow_offload_work *offload)
>  
>  static void flow_offload_work_del(struct flow_offload_work *offload)
>  {
> +	trace_flow_offload_work_del(offload);
> +
>  	clear_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status);
>  	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_ORIGINAL);
>  	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_REPLY);
> @@ -931,6 +928,8 @@ static void flow_offload_work_stats(struct flow_offload_work *offload)
>  	struct flow_stats stats[FLOW_OFFLOAD_DIR_MAX] = {};
>  	u64 lastused;
>  
> +	trace_flow_offload_work_stats(offload);
> +
>  	flow_offload_tuple_stats(offload, FLOW_OFFLOAD_DIR_ORIGINAL, &stats[0]);
>  	flow_offload_tuple_stats(offload, FLOW_OFFLOAD_DIR_REPLY, &stats[1]);
>  
> @@ -1034,6 +1033,7 @@ void nf_flow_offload_add(struct nf_flowtable *flowtable,
>  		return;
>  	}
>  
> +	trace_flow_offload_add(offload);
>  	flow_offload_queue_work(offload);
>  }
>  
> @@ -1048,6 +1048,7 @@ void nf_flow_offload_del(struct nf_flowtable *flowtable,
>  		return;
>  
>  	atomic_inc(&net->nft.count_wq_del);
> +	trace_flow_offload_del(offload);
>  	set_bit(NF_FLOW_HW_DYING, &flow->flags);
>  	flow_offload_queue_work(offload);
>  }
> @@ -1068,6 +1069,7 @@ void nf_flow_offload_stats(struct nf_flowtable *flowtable,
>  		return;
>  
>  	atomic_inc(&net->nft.count_wq_stats);
> +	trace_flow_offload_stats(offload);
>  	flow_offload_queue_work(offload);
>  }
>  
> diff --git a/net/netfilter/nf_flow_table_offload_trace.h b/net/netfilter/nf_flow_table_offload_trace.h
> new file mode 100644
> index 000000000000..49cfbc2ec35d
> --- /dev/null
> +++ b/net/netfilter/nf_flow_table_offload_trace.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM nf
> +
> +#if !defined(_NF_FLOW_TABLE_OFFLOAD_TRACE_) || defined(TRACE_HEADER_MULTI_READ)
> +#define _NF_FLOW_TABLE_OFFLOAD_TRACE_
> +
> +#include <linux/tracepoint.h>
> +#include <net/netfilter/nf_tables.h>
> +
> +DECLARE_EVENT_CLASS(
> +	nf_flow_offload_work_template,
> +	TP_PROTO(struct flow_offload_work *w),
> +	TP_ARGS(w),
> +	TP_STRUCT__entry(
> +		__field(void *, work)
> +		__field(void *, flowtable)
> +		__field(void *, flow)
> +	),
> +	TP_fast_assign(
> +		__entry->work = w;
> +		__entry->flowtable = w->flowtable;
> +		__entry->flow = w->flow;
> +	),
> +	TP_printk("work=%p flowtable=%p flow=%p",
> +		  __entry->work, __entry->flowtable, __entry->flow)
> +);
> +
> +#define DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(name)				\
> +	DEFINE_EVENT(nf_flow_offload_work_template, name,		\
> +		     TP_PROTO(struct flow_offload_work *w), TP_ARGS(w))
> +
> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_add);
> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_add);
> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_del);
> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_del);
> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_stats);
> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_stats);
> +
> +#endif
> +
> +/* This part must be outside protection */
> +#undef TRACE_INCLUDE_PATH
> +#define TRACE_INCLUDE_PATH ../../net/netfilter
> +#undef TRACE_INCLUDE_FILE
> +#define TRACE_INCLUDE_FILE nf_flow_table_offload_trace
> +#include <trace/define_trace.h>
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-03-07 21:47   ` Pablo Neira Ayuso
@ 2022-03-12 18:56     ` Vlad Buslov
  2022-03-15 10:23       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-03-12 18:56 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Mon 07 Mar 2022 at 22:47, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Tue, Feb 22, 2022 at 05:09:57PM +0200, Vlad Buslov wrote:
>> To improve hardware offload debuggability, and to allow capping the total
>> amount of offloaded entries in the following patch, extend struct
>> netns_nftables with a 'count_hw' counter and expose it to userspace as the
>> 'nf_flowtable_count_hw' sysctl entry. Increment the counter together with
>> setting the NF_FLOW_HW flag when scheduling the offload add task on the
>> workqueue, and decrement it after successfully scheduling the offload del
>> task.
>> 
>> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
>> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
>> Reviewed-by: Paul Blakey <paulb@nvidia.com>
>> ---
>>  include/net/netns/nftables.h            |  1 +
>>  net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
>>  net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
>>  3 files changed, 23 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
>> index 8c77832d0240..262b8b3213cb 100644
>> --- a/include/net/netns/nftables.h
>> +++ b/include/net/netns/nftables.h
>> @@ -6,6 +6,7 @@
>>  
>>  struct netns_nftables {
>>  	u8			gencursor;
>> +	atomic_t		count_hw;
>>  };
>>  
>>  #endif
>> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
>> index 3e1afd10a9b6..8cd55d502ffd 100644
>> --- a/net/netfilter/nf_conntrack_standalone.c
>> +++ b/net/netfilter/nf_conntrack_standalone.c
>> @@ -618,6 +618,9 @@ enum nf_ct_sysctl_index {
>>  #ifdef CONFIG_LWTUNNEL
>>  	NF_SYSCTL_CT_LWTUNNEL,
>>  #endif
>> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>> +	NF_SYSCTL_CT_COUNT_HW,
>> +#endif
>>  
>>  	__NF_SYSCTL_CT_LAST_SYSCTL,
>>  };
>> @@ -973,6 +976,14 @@ static struct ctl_table nf_ct_sysctl_table[] = {
>>  		.mode		= 0644,
>>  		.proc_handler	= nf_hooks_lwtunnel_sysctl_handler,
>>  	},
>> +#endif
>> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>> +	[NF_SYSCTL_CT_COUNT_HW] = {
>> +		.procname	= "nf_flowtable_count_hw",
>> +		.maxlen		= sizeof(int),
>> +		.mode		= 0444,
>> +		.proc_handler	= proc_dointvec,
>> +	},
>>  #endif
>>  	{}
>>  };
>> @@ -1111,6 +1122,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_STREAM].data = &un->timeouts[UDP_CT_REPLIED];
>>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
>> +	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
>>  #endif
>>  
>>  	nf_conntrack_standalone_init_tcp_sysctl(net, table);
>> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
>> index b90eca7a2f22..9f2b68bc6f80 100644
>> --- a/net/netfilter/nf_flow_table_core.c
>> +++ b/net/netfilter/nf_flow_table_core.c
>> @@ -315,6 +315,9 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
>>  	nf_ct_offload_timeout(flow->ct);
>>  
>>  	if (nf_flowtable_hw_offload(flow_table)) {
>> +		struct net *net = read_pnet(&flow_table->net);
>> +
>> +		atomic_inc(&net->nft.count_hw);
>>  		__set_bit(NF_FLOW_HW, &flow->flags);
>>  		nf_flow_offload_add(flow_table, flow);
>
> Better increment this new counter from flow_offload_work_add()? There
> is offload->flowtable->net; you could use it from there to bump the
> counter once the flow is placed in hardware (by the time
> IPS_HW_OFFLOAD_BIT is set).

Hi Pablo,

Thanks for reviewing my code and sorry for the late reply.

We explored the approach you propose and found several issues with it.
First, the nice benefit of the implementation in this patch is that
incrementing the counter in flow_offload_add() (and testing it in the
following patch) completely avoids spamming the workqueue when the limit
is reached, which is an important concern for slower embedded DPU cores.
Second, the counter cannot be changed where IPS_HW_OFFLOAD_BIT is set,
at the very end of flow_offload_work_add(), because in the following
patch we need to verify that the counter is within the user-specified
limit before attempting offload. Third, changing the counter in the wq
tasks makes it hard to balance correctly. Consider the following cases:

- flow_offload_work_add() can be called an arbitrary number of times per
  flow due to the refresh logic. However, any such flow is still deleted
  only once.

- flow_offload_work_del() can be called for flows that were never
  actually offloaded (it is called for any flow that has the NF_FLOW_HW
  bit, which is unconditionally set before attempting to schedule the
  offload task on the wq).

The counter balancing issues could maybe be solved by carefully and
conditionally changing the counter based on the current value of
IPS_HW_OFFLOAD_BIT, but spamming the workqueue can't be prevented with
such a design.
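
For reference, the kind of conditional balancing I mean would look
roughly like the following. This is an untested sketch only (reusing
the helper names from nf_flow_table_offload.c), not something we have
measured:

static void flow_offload_work_add(struct flow_offload_work *offload)
{
	struct nf_flow_rule *flow_rule[FLOW_OFFLOAD_DIR_MAX];
	struct net *net = read_pnet(&offload->flowtable->net);

	if (nf_flow_offload_alloc(offload, flow_rule) < 0)
		return;

	/* count only the first successful offload of the flow, no
	 * matter how many times 'add' work is scheduled via refresh
	 */
	if (!flow_offload_rule_add(offload, flow_rule) &&
	    !test_and_set_bit(IPS_HW_OFFLOAD_BIT,
			      &offload->flow->ct->status))
		atomic_inc(&net->nft.count_hw);

	nf_flow_offload_destroy(flow_rule);
}

static void flow_offload_work_del(struct flow_offload_work *offload)
{
	struct net *net = read_pnet(&offload->flowtable->net);

	/* skip flows that never actually reached the hardware */
	if (test_and_clear_bit(IPS_HW_OFFLOAD_BIT,
			       &offload->flow->ct->status))
		atomic_dec(&net->nft.count_hw);

	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_ORIGINAL);
	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_REPLY);
	set_bit(NF_FLOW_HW_DEAD, &offload->flow->flags);
}

But even with something like this in place, the workqueue spam problem
described above remains.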

>
> That would also move the atomic away from the packet path.

I understand your concern. However, note that this atomic is normally
changed once when adding an offloaded flow and once when removing it.
The code path is only executed per packet in error cases, where the
flow has failed to offload and refresh is called repeatedly for the
same flow.

>
>>  	}
>> @@ -462,10 +465,15 @@ static void nf_flow_offload_gc_step(struct flow_offload *flow, void *data)
>>  
>>  	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
>>  		if (test_bit(NF_FLOW_HW, &flow->flags)) {
>> -			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags))
>> +			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags)) {
>> +				struct net *net = read_pnet(&flow_table->net);
>> +
>>  				nf_flow_offload_del(flow_table, flow);
>> -			else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags))
>> +				if (test_bit(NF_FLOW_HW_DYING, &flow->flags))
>> +					atomic_dec(&net->nft.count_hw);
>
> Same with this, I'd suggest to decrement it from flow_offload_work_del().
>
>> +			} else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags)) {
>>  				flow_offload_del(flow_table, flow);
>> +			}
>>  		} else {
>>  			flow_offload_del(flow_table, flow);
>>  		}
>> -- 
>> 2.31.1
>> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 3/8] netfilter: introduce max count of hw offloaded flow table entries
  2022-03-07 22:13   ` Pablo Neira Ayuso
@ 2022-03-12 19:32     ` Vlad Buslov
  0 siblings, 0 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-03-12 19:32 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb


On Mon 07 Mar 2022 at 23:13, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Tue, Feb 22, 2022 at 05:09:58PM +0200, Vlad Buslov wrote:
>> To improve hardware offload debuggability, extend struct netns_nftables
>> with a 'max_hw' counter and expose it to userspace as the
>> 'nf_flowtable_max_hw' sysctl entry. Verify that the count_hw value is less
>> than max_hw and don't offload a new flow table entry to hardware if this
>> is not the case. Flows that were not offloaded due to count_hw being
>> larger than the set maximum can still be offloaded via the refresh
>> function. Mark such flows with the NF_FLOW_HW bit and only count them
>> once by checking that the bit was not previously set.
>
> Fine with me. Keep in mind there is also an implicit cap with the
> maximum number of conntrack entries (nf_conntrack_max).

I know. With this counter we have more graceful behavior compared to
just dropping the packets outright - the flow is still processed in
software and can eventually be offloaded by the refresh mechanism when
the counter is decremented due to deletion of an existing offloaded
flow.

>
>> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
>> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
>> Reviewed-by: Paul Blakey <paulb@nvidia.com>
>> ---
>>  include/net/netns/nftables.h            |  1 +
>>  net/netfilter/nf_conntrack_standalone.c | 11 +++++++++++
>>  net/netfilter/nf_flow_table_core.c      | 25 +++++++++++++++++++++++--
>>  3 files changed, 35 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
>> index 262b8b3213cb..5677f21fdd4c 100644
>> --- a/include/net/netns/nftables.h
>> +++ b/include/net/netns/nftables.h
>> @@ -7,6 +7,7 @@
>>  struct netns_nftables {
>>  	u8			gencursor;
>>  	atomic_t		count_hw;
>> +	int			max_hw;
>>  };
>>  
>>  #endif
>> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
>> index 8cd55d502ffd..af0dea471119 100644
>> --- a/net/netfilter/nf_conntrack_standalone.c
>> +++ b/net/netfilter/nf_conntrack_standalone.c
>> @@ -620,6 +620,7 @@ enum nf_ct_sysctl_index {
>>  #endif
>>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>>  	NF_SYSCTL_CT_COUNT_HW,
>> +	NF_SYSCTL_CT_MAX_HW,
>>  #endif
>>  
>>  	__NF_SYSCTL_CT_LAST_SYSCTL,
>> @@ -984,6 +985,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
>>  		.mode		= 0444,
>>  		.proc_handler	= proc_dointvec,
>>  	},
>> +	[NF_SYSCTL_CT_MAX_HW] = {
>> +		.procname	= "nf_flowtable_max_hw",
>> +		.maxlen		= sizeof(int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec,
>> +	},
>>  #endif
>>  	{}
>>  };
>> @@ -1123,6 +1130,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>>  	table[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD].data = &un->offload_timeout;
>>  	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
>> +	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
>>  #endif
>>  
>>  	nf_conntrack_standalone_init_tcp_sysctl(net, table);
>> @@ -1135,6 +1143,9 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>>  		table[NF_SYSCTL_CT_MAX].mode = 0444;
>>  		table[NF_SYSCTL_CT_EXPECT_MAX].mode = 0444;
>>  		table[NF_SYSCTL_CT_BUCKETS].mode = 0444;
>> +#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>> +		table[NF_SYSCTL_CT_MAX_HW].mode = 0444;
>> +#endif
>>  	}
>>  
>>  	cnet->sysctl_header = register_net_sysctl(net, "net/netfilter", table);
>> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
>> index 9f2b68bc6f80..2631bd0ae9ae 100644
>> --- a/net/netfilter/nf_flow_table_core.c
>> +++ b/net/netfilter/nf_flow_table_core.c
>> @@ -290,6 +290,20 @@ unsigned long flow_offload_get_timeout(struct flow_offload *flow)
>>  	return timeout;
>>  }
>>  
>> +static bool flow_offload_inc_count_hw(struct nf_flowtable *flow_table)
>> +{
>> +	struct net *net = read_pnet(&flow_table->net);
>> +	int max_hw = net->nft.max_hw, count_hw;
>> +
>> +	count_hw = atomic_inc_return(&net->nft.count_hw);
>> +	if (max_hw && count_hw > max_hw) {
>> +		atomic_dec(&net->nft.count_hw);
>> +		return false;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>>  int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
>>  {
>>  	int err;
>> @@ -315,9 +329,9 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
>>  	nf_ct_offload_timeout(flow->ct);
>>  
>>  	if (nf_flowtable_hw_offload(flow_table)) {
>> -		struct net *net = read_pnet(&flow_table->net);
>> +		if (!flow_offload_inc_count_hw(flow_table))
>> +			return 0;
>
> Maybe, here something like:
>
>                 if (atomic_read(count_hw) > max_hw)
>                         return 0;
>
> to catch overlimit early.
>
> Then, use the logic I suggested for patch 2/8, i.e. increment/decrement
> this counter from the work handlers?

Hmm, so you are suggesting to just read the counter value on the data
path and only change it in the wq tasks? That could prevent the wq
spam, assuming the counter balancing issue is solved.
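
On the data path that would presumably reduce to something like the
following in flow_offload_add() (sketch only, reusing the fields from
this series):

	/* no atomic read-modify-write on the packet path; count_hw
	 * would then only be modified by the work handlers
	 */
	if (net->nft.max_hw &&
	    atomic_read(&net->nft.count_hw) >= net->nft.max_hw)
		return 0;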

>
>> -		atomic_inc(&net->nft.count_hw);
>>  		__set_bit(NF_FLOW_HW, &flow->flags);
>>  		nf_flow_offload_add(flow_table, flow);
>>  	}
>> @@ -329,6 +343,7 @@ EXPORT_SYMBOL_GPL(flow_offload_add);
>>  void flow_offload_refresh(struct nf_flowtable *flow_table,
>>  			  struct flow_offload *flow)
>>  {
>> +	struct net *net = read_pnet(&flow_table->net);
>>  	u32 timeout;
>>  
>>  	timeout = nf_flowtable_time_stamp + flow_offload_get_timeout(flow);
>> @@ -338,6 +353,12 @@ void flow_offload_refresh(struct nf_flowtable *flow_table,
>>  	if (likely(!nf_flowtable_hw_offload(flow_table)))
>>  		return;
>>  
>> +	if (!flow_offload_inc_count_hw(flow_table))
>
> This is the packet path; better to avoid this atomic operation if possible.
>
>> +		return;
>> +	/* only count each flow once when setting NF_FLOW_HW bit */
>> +	if (test_and_set_bit(NF_FLOW_HW, &flow->flags))
>> +		atomic_dec(&net->nft.count_hw);
>> +
>>  	nf_flow_offload_add(flow_table, flow);
>>  }
>>  EXPORT_SYMBOL_GPL(flow_offload_refresh);
>> -- 
>> 2.31.1
>> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-03-07 21:56   ` Pablo Neira Ayuso
@ 2022-03-12 19:51     ` Vlad Buslov
  2022-03-15 10:41       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-03-12 19:51 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb


On Mon 07 Mar 2022 at 22:56, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Tue, Feb 22, 2022 at 05:09:57PM +0200, Vlad Buslov wrote:
>> To improve hardware offload debuggability, and to allow capping the total
>> amount of offloaded entries in the following patch, extend struct
>> netns_nftables with a 'count_hw' counter and expose it to userspace as the
>> 'nf_flowtable_count_hw' sysctl entry. Increment the counter together with
>> setting the NF_FLOW_HW flag when scheduling the offload add task on the
>> workqueue, and decrement it after successfully scheduling the offload del
>> task.
>> 
>> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
>> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
>> Reviewed-by: Paul Blakey <paulb@nvidia.com>
>> ---
>>  include/net/netns/nftables.h            |  1 +
>>  net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
>>  net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
>>  3 files changed, 23 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
>> index 8c77832d0240..262b8b3213cb 100644
>> --- a/include/net/netns/nftables.h
>> +++ b/include/net/netns/nftables.h
>> @@ -6,6 +6,7 @@
>>  
>>  struct netns_nftables {
>>  	u8			gencursor;
>> +	atomic_t		count_hw;
>
> In addition to the previous comments: I'd suggest to use
> register_pernet_subsys() and register the sysctl from the
> nf_flow_table_offload.c file through nf_flow_table_offload_init(),
> instead of using the conntrack nf_ct_sysctl_table[].
>
> That would require a bit more work though.

I added the new sysctl in ct because there is already the similar-ish
NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD, which is also part of the ct
sysctl table but is actually used by flow table code. I'll implement a
dedicated sysctl table for the nf_flow_table_* code, if you suggest it
is warranted for this change.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 5/8] netfilter: introduce max count of hw offload 'add' workqueue tasks
  2022-03-07 22:43   ` Pablo Neira Ayuso
@ 2022-03-12 19:59     ` Vlad Buslov
  0 siblings, 0 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-03-12 19:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb


On Mon 07 Mar 2022 at 23:43, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Tue, Feb 22, 2022 at 05:10:00PM +0200, Vlad Buslov wrote:
>> To improve hardware offload debuggability, extend struct netns_nftables
>> with a 'max_wq_add' counter and expose it to userspace as the
>> 'nf_flowtable_max_wq_add' sysctl entry. Verify that the count_wq_add
>> value is less than max_wq_add and don't schedule a new flow table entry
>> 'add' task to the workqueue if this is not the case.
>
> For this toggle, I'm not sure what value I would select; this maximum
> number of in-flight work objects might be a difficult guess for users.
> This is disabled by default, but what reasonable limit could be set?
>
> Moreover, there are also tc-police and ratelimit that could be combined
> from the ruleset to throttle the number of entries that are created in
> the flowtable, i.e. the packet is accepted but the tc ct action is
> skipped. I agree this would not specifically restrict the amount of
> in-flight pending work though, since this depends on how fast the
> worker empties the queue.
>
> My understanding is that the goal for this toggle is to tackle a
> scenario where the network creates a traffic pattern that puts high
> load on the workqueue (e.g. lots of small flows per second). In that
> case, it is likely safer to add an explicit ratelimit to skip the tc ct
> action in case of stress, and probably easier to infer?
>

I'm afraid such an approach would be a very crude substitute for a
dedicated counter, because the offload rate depends both on the current
system load (the CPU can be busy with something else) and on the number
of existing offloaded flows (offloading the 1000000th flow is probably
slower than the 1st one). The current workqueue size, however, directly
reflects the system (over)load. With a rate limit it is probably
impossible to come up with a good value, but it is quite
straightforward to infer something like "even in perfect conditions
there is no use in scheduling another offload task on a workqueue that
already has 1m tasks pending".

>> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
>> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
>> Reviewed-by: Paul Blakey <paulb@nvidia.com>
>> ---
>>  include/net/netns/nftables.h            | 1 +
>>  net/netfilter/nf_conntrack_standalone.c | 9 +++++++++
>>  net/netfilter/nf_flow_table_offload.c   | 9 ++++++++-
>>  3 files changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
>> index a971d986a75b..ce270803ef27 100644
>> --- a/include/net/netns/nftables.h
>> +++ b/include/net/netns/nftables.h
>> @@ -9,6 +9,7 @@ struct netns_nftables {
>>  	atomic_t		count_hw;
>>  	int			max_hw;
>>  	atomic_t		count_wq_add;
>> +	int			max_wq_add;
>>  };
>>
>>  #endif
>> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
>> index fe2327823f7a..26e2b86eb060 100644
>> --- a/net/netfilter/nf_conntrack_standalone.c
>> +++ b/net/netfilter/nf_conntrack_standalone.c
>> @@ -622,6 +622,7 @@ enum nf_ct_sysctl_index {
>>  	NF_SYSCTL_CT_COUNT_HW,
>>  	NF_SYSCTL_CT_MAX_HW,
>>  	NF_SYSCTL_CT_COUNT_WQ_ADD,
>> +	NF_SYSCTL_CT_MAX_WQ_ADD,
>>  #endif
>>
>>  	__NF_SYSCTL_CT_LAST_SYSCTL,
>> @@ -998,6 +999,12 @@ static struct ctl_table nf_ct_sysctl_table[] = {
>>  		.mode		= 0444,
>>  		.proc_handler	= proc_dointvec,
>>  	},
>> +	[NF_SYSCTL_CT_MAX_WQ_ADD] = {
>> +		.procname	= "nf_flowtable_max_wq_add",
>> +		.maxlen		= sizeof(int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec,
>> +	},
>>  #endif
>>  	{}
>>  };
>> @@ -1139,6 +1146,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>>  	table[NF_SYSCTL_CT_COUNT_HW].data = &net->nft.count_hw;
>>  	table[NF_SYSCTL_CT_MAX_HW].data = &net->nft.max_hw;
>>  	table[NF_SYSCTL_CT_COUNT_WQ_ADD].data = &net->nft.count_wq_add;
>> +	table[NF_SYSCTL_CT_MAX_WQ_ADD].data = &net->nft.max_wq_add;
>>  #endif
>>
>>  	nf_conntrack_standalone_init_tcp_sysctl(net, table);
>> @@ -1153,6 +1161,7 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
>>  		table[NF_SYSCTL_CT_BUCKETS].mode = 0444;
>>  #if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
>>  		table[NF_SYSCTL_CT_MAX_HW].mode = 0444;
>> +		table[NF_SYSCTL_CT_MAX_WQ_ADD].mode = 0444;
>>  #endif
>>  	}
>>
>> diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
>> index ffbcf0cfefeb..e29aa51696f5 100644
>> --- a/net/netfilter/nf_flow_table_offload.c
>> +++ b/net/netfilter/nf_flow_table_offload.c
>> @@ -1016,9 +1016,16 @@ void nf_flow_offload_add(struct nf_flowtable *flowtable,
>>  			 struct flow_offload *flow)
>>  {
>>  	struct net *net = read_pnet(&flowtable->net);
>> +	int max_wq_add = net->nft.max_wq_add;
>>  	struct flow_offload_work *offload;
>> +	int count_wq_add;
>> +
>> +	count_wq_add = atomic_inc_return(&net->nft.count_wq_add);
>> +	if (max_wq_add && count_wq_add > max_wq_add) {
>> +		atomic_dec(&net->nft.count_wq_add);
>> +		return;
>> +	}
>>
>> -	atomic_inc(&net->nft.count_wq_add);
>>  	offload = nf_flow_offload_work_alloc(flowtable, flow, FLOW_CLS_REPLACE);
>>  	if (!offload) {
>>  		atomic_dec(&net->nft.count_wq_add);
>> --
>> 2.31.1
>>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints
  2022-03-07 22:49   ` Pablo Neira Ayuso
@ 2022-03-12 20:05     ` Vlad Buslov
  2022-03-15 10:29       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Vlad Buslov @ 2022-03-12 20:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb


On Mon 07 Mar 2022 at 23:49, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Tue, Feb 22, 2022 at 05:10:03PM +0200, Vlad Buslov wrote:
>> Add tracepoints to trace creation and start of execution of flowtable
>> hardware offload 'add', 'del' and 'stats' tasks. Move struct
>> flow_offload_work from source into header file to allow access to structure
>> fields from tracepoint code.
>
> This patch, I would prefer to keep it back and explore exposing trace
> infrastructure for the flowtable through netlink.
>

What approach do you have in mind with netlink? I used tracepoints here
because they:

- Incur no performance penalty when disabled.

- Are handy to attach BPF programs to.

From my experience with optimizing the TC control path, Netlink parsing
is CPU-intensive. I am also not aware of mechanisms to leverage it to
attach BPF.

>> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
>> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
>> Reviewed-by: Paul Blakey <paulb@nvidia.com>
>> ---
>>  include/net/netfilter/nf_flow_table.h       |  9 ++++
>>  net/netfilter/nf_flow_table_offload.c       | 20 +++++----
>>  net/netfilter/nf_flow_table_offload_trace.h | 48 +++++++++++++++++++++
>>  3 files changed, 68 insertions(+), 9 deletions(-)
>>  create mode 100644 net/netfilter/nf_flow_table_offload_trace.h
>> 
>> diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
>> index a3647fadf1cc..5e2aef34acaa 100644
>> --- a/include/net/netfilter/nf_flow_table.h
>> +++ b/include/net/netfilter/nf_flow_table.h
>> @@ -174,6 +174,15 @@ struct flow_offload {
>>  	struct rcu_head				rcu_head;
>>  };
>>  
>> +struct flow_offload_work {
>> +	struct list_head list;
>> +	enum flow_cls_command cmd;
>> +	int priority;
>> +	struct nf_flowtable *flowtable;
>> +	struct flow_offload *flow;
>> +	struct work_struct work;
>> +};
>> +
>>  #define NF_FLOW_TIMEOUT (30 * HZ)
>>  #define nf_flowtable_time_stamp	(u32)jiffies
>>  
>> diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
>> index ff52d903aad9..bf94050d5b54 100644
>> --- a/net/netfilter/nf_flow_table_offload.c
>> +++ b/net/netfilter/nf_flow_table_offload.c
>> @@ -12,20 +12,13 @@
>>  #include <net/netfilter/nf_conntrack_acct.h>
>>  #include <net/netfilter/nf_conntrack_core.h>
>>  #include <net/netfilter/nf_conntrack_tuple.h>
>> +#define CREATE_TRACE_POINTS
>> +#include "nf_flow_table_offload_trace.h"
>>  
>>  static struct workqueue_struct *nf_flow_offload_add_wq;
>>  static struct workqueue_struct *nf_flow_offload_del_wq;
>>  static struct workqueue_struct *nf_flow_offload_stats_wq;
>>  
>> -struct flow_offload_work {
>> -	struct list_head	list;
>> -	enum flow_cls_command	cmd;
>> -	int			priority;
>> -	struct nf_flowtable	*flowtable;
>> -	struct flow_offload	*flow;
>> -	struct work_struct	work;
>> -};
>> -
>>  #define NF_FLOW_DISSECTOR(__match, __type, __field)	\
>>  	(__match)->dissector.offset[__type] =		\
>>  		offsetof(struct nf_flow_key, __field)
>> @@ -895,6 +888,8 @@ static void flow_offload_work_add(struct flow_offload_work *offload)
>>  	struct nf_flow_rule *flow_rule[FLOW_OFFLOAD_DIR_MAX];
>>  	int err;
>>  
>> +	trace_flow_offload_work_add(offload);
>> +
>>  	err = nf_flow_offload_alloc(offload, flow_rule);
>>  	if (err < 0)
>>  		return;
>> @@ -911,6 +906,8 @@ static void flow_offload_work_add(struct flow_offload_work *offload)
>>  
>>  static void flow_offload_work_del(struct flow_offload_work *offload)
>>  {
>> +	trace_flow_offload_work_del(offload);
>> +
>>  	clear_bit(IPS_HW_OFFLOAD_BIT, &offload->flow->ct->status);
>>  	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_ORIGINAL);
>>  	flow_offload_tuple_del(offload, FLOW_OFFLOAD_DIR_REPLY);
>> @@ -931,6 +928,8 @@ static void flow_offload_work_stats(struct flow_offload_work *offload)
>>  	struct flow_stats stats[FLOW_OFFLOAD_DIR_MAX] = {};
>>  	u64 lastused;
>>  
>> +	trace_flow_offload_work_stats(offload);
>> +
>>  	flow_offload_tuple_stats(offload, FLOW_OFFLOAD_DIR_ORIGINAL, &stats[0]);
>>  	flow_offload_tuple_stats(offload, FLOW_OFFLOAD_DIR_REPLY, &stats[1]);
>>  
>> @@ -1034,6 +1033,7 @@ void nf_flow_offload_add(struct nf_flowtable *flowtable,
>>  		return;
>>  	}
>>  
>> +	trace_flow_offload_add(offload);
>>  	flow_offload_queue_work(offload);
>>  }
>>  
>> @@ -1048,6 +1048,7 @@ void nf_flow_offload_del(struct nf_flowtable *flowtable,
>>  		return;
>>  
>>  	atomic_inc(&net->nft.count_wq_del);
>> +	trace_flow_offload_del(offload);
>>  	set_bit(NF_FLOW_HW_DYING, &flow->flags);
>>  	flow_offload_queue_work(offload);
>>  }
>> @@ -1068,6 +1069,7 @@ void nf_flow_offload_stats(struct nf_flowtable *flowtable,
>>  		return;
>>  
>>  	atomic_inc(&net->nft.count_wq_stats);
>> +	trace_flow_offload_stats(offload);
>>  	flow_offload_queue_work(offload);
>>  }
>>  
>> diff --git a/net/netfilter/nf_flow_table_offload_trace.h b/net/netfilter/nf_flow_table_offload_trace.h
>> new file mode 100644
>> index 000000000000..49cfbc2ec35d
>> --- /dev/null
>> +++ b/net/netfilter/nf_flow_table_offload_trace.h
>> @@ -0,0 +1,48 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM nf
>> +
>> +#if !defined(_NF_FLOW_TABLE_OFFLOAD_TRACE_) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _NF_FLOW_TABLE_OFFLOAD_TRACE_
>> +
>> +#include <linux/tracepoint.h>
>> +#include <net/netfilter/nf_tables.h>
>> +
>> +DECLARE_EVENT_CLASS(
>> +	nf_flow_offload_work_template,
>> +	TP_PROTO(struct flow_offload_work *w),
>> +	TP_ARGS(w),
>> +	TP_STRUCT__entry(
>> +		__field(void *, work)
>> +		__field(void *, flowtable)
>> +		__field(void *, flow)
>> +	),
>> +	TP_fast_assign(
>> +		__entry->work = w;
>> +		__entry->flowtable = w->flowtable;
>> +		__entry->flow = w->flow;
>> +	),
>> +	TP_printk("work=%p flowtable=%p flow=%p",
>> +		  __entry->work, __entry->flowtable, __entry->flow)
>> +);
>> +
>> +#define DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(name)				\
>> +	DEFINE_EVENT(nf_flow_offload_work_template, name,		\
>> +		     TP_PROTO(struct flow_offload_work *w), TP_ARGS(w))
>> +
>> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_add);
>> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_add);
>> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_del);
>> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_del);
>> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_stats);
>> +DEFINE_NF_FLOW_OFFLOAD_WORK_EVENT(flow_offload_work_stats);
>> +
>> +#endif
>> +
>> +/* This part must be outside protection */
>> +#undef TRACE_INCLUDE_PATH
>> +#define TRACE_INCLUDE_PATH ../../net/netfilter
>> +#undef TRACE_INCLUDE_FILE
>> +#define TRACE_INCLUDE_FILE nf_flow_table_offload_trace
>> +#include <trace/define_trace.h>
>> -- 
>> 2.31.1
>> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-03-12 18:56     ` Vlad Buslov
@ 2022-03-15 10:23       ` Pablo Neira Ayuso
  2022-03-15 16:18         ` Vlad Buslov
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-15 10:23 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Sat, Mar 12, 2022 at 08:56:49PM +0200, Vlad Buslov wrote:
[...]
> Hi Pablo,
> 
> Thanks for reviewing my code and sorry for the late reply.
> 
> We explored the approach you propose and found several issues with it.
> First, the nice benefit of the implementation in this patch is that
> incrementing the counter in flow_offload_add() (and testing it in the
> following patch) completely avoids spamming the workqueue when the limit
> is reached, which is an important concern for slower embedded DPU cores.
> Second, the counter cannot be changed where IPS_HW_OFFLOAD_BIT is set,
> at the very end of flow_offload_work_add(), because in the following
> patch we need to verify that the counter is within the user-specified
> limit before attempting offload. Third, changing the counter in the wq
> tasks makes it hard to balance correctly. Consider the following cases:
> 
> - flow_offload_work_add() can be called an arbitrary number of times per
>   flow due to the refresh logic. However, any such flow is still deleted
>   only once.
> 
> - flow_offload_work_del() can be called for flows that were never
>   actually offloaded (it is called for any flow that has the NF_FLOW_HW
>   bit, which is unconditionally set before attempting to schedule the
>   offload task on the wq).
>
> The counter balancing issues could maybe be solved by carefully and
> conditionally changing the counter based on the current value of
> IPS_HW_OFFLOAD_BIT, but spamming the workqueue can't be prevented with
> such a design.
>
> > That would also move the atomic away from the packet path.
> 
> I understand your concern. However, note that this atomic is normally
> changed once when adding an offloaded flow and once when removing it.
> The code path is only executed per packet in error cases, where the
> flow has failed to offload and refresh is called repeatedly for the
> same flow.

Thanks for explaining.

There used to be a list of pending flows to be offloaded in the code.

I think it would be possible to restore such a list and make it
per-cpu; the idea is to add a new field to the flow_offload structure
to annotate the cpu that needs to deal with this flow (the same cpu
deals with add/del/stats). The cpu field is set at flow creation time.

Once there is at least one item in the list, queue work on that cpu.
While the work has not had a chance to run yet, we keep adding more
items to the list.

The workqueue handler then zaps the list of pending flows to be
offloaded; it might find more than a single item in the list.

So instead of three workqueues, we only have one. Scalability is
achieved by fanning out flows over CPUs.
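
Roughly, and only to illustrate the idea (untested; the flow->cpu
field, the pcpu struct and the queue/handler names below are all made
up):

struct nf_flow_offload_pcpu {
	spinlock_t		lock;
	struct list_head	pending;
	struct work_struct	work;
};

static DEFINE_PER_CPU(struct nf_flow_offload_pcpu, nf_flow_offload_pcpu);
static struct workqueue_struct *nf_flow_offload_wq;

static void nf_flow_offload_queue(struct flow_offload_work *offload)
{
	int cpu = offload->flow->cpu; /* assigned at flow creation time */
	struct nf_flow_offload_pcpu *p;
	bool first;

	p = per_cpu_ptr(&nf_flow_offload_pcpu, cpu);

	spin_lock_bh(&p->lock);
	first = list_empty(&p->pending);
	list_add_tail(&offload->list, &p->pending);
	spin_unlock_bh(&p->lock);

	/* kick the worker only when the list transitions from empty */
	if (first)
		queue_work_on(cpu, nf_flow_offload_wq, &p->work);
}

static void nf_flow_offload_pcpu_work(struct work_struct *work)
{
	struct nf_flow_offload_pcpu *p =
		container_of(work, struct nf_flow_offload_pcpu, work);
	struct flow_offload_work *offload, *next;
	LIST_HEAD(batch);

	/* zap the whole pending list in one go */
	spin_lock_bh(&p->lock);
	list_splice_init(&p->pending, &batch);
	spin_unlock_bh(&p->lock);

	/* dispatch each item through the existing per-work handler,
	 * which also frees the flow_offload_work object
	 */
	list_for_each_entry_safe(offload, next, &batch, list)
		flow_offload_work_handler(&offload->work);
}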

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints
  2022-03-12 20:05     ` Vlad Buslov
@ 2022-03-15 10:29       ` Pablo Neira Ayuso
  2022-03-15 16:36         ` Vlad Buslov
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-15 10:29 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Sat, Mar 12, 2022 at 10:05:55PM +0200, Vlad Buslov wrote:
> 
> On Mon 07 Mar 2022 at 23:49, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Tue, Feb 22, 2022 at 05:10:03PM +0200, Vlad Buslov wrote:
> >> Add tracepoints to trace creation and start of execution of flowtable
> >> hardware offload 'add', 'del' and 'stats' tasks. Move struct
> >> flow_offload_work from source into header file to allow access to structure
> >> fields from tracepoint code.
> >
> > This patch, I would prefer to keep it back and explore exposing trace
> > infrastructure for the flowtable through netlink.
> >
> 
> What approach do you have in mind with netlink? I used tracepoints here
> because they:
> 
> - Incur no performance penalty when disabled.
> 
> - Are handy to attach BPF programs to.
> 
> From my experience with optimizing the TC control path, Netlink parsing
> is CPU-intensive. I am also not aware of mechanisms to leverage it to
> attach BPF.

Sure, no question that tracing and introspection are useful.

But could you use the generic workqueue tracepoints instead?

This is adding tracing infrastructure for a very specific purpose: to
inspect the workqueue behaviour for the flowtable.

And I am not sure how you would use this yet, other than observing
whether the workqueue is coping with the workload?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-03-12 19:51     ` Vlad Buslov
@ 2022-03-15 10:41       ` Pablo Neira Ayuso
  2022-03-15 16:34         ` Vlad Buslov
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-15 10:41 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Sat, Mar 12, 2022 at 09:51:45PM +0200, Vlad Buslov wrote:
> 
> On Mon 07 Mar 2022 at 22:56, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Tue, Feb 22, 2022 at 05:09:57PM +0200, Vlad Buslov wrote:
> >> To improve hardware offload debuggability, and to allow capping the total
> >> amount of offloaded entries in the following patch, extend struct
> >> netns_nftables with a 'count_hw' counter and expose it to userspace as the
> >> 'nf_flowtable_count_hw' sysctl entry. Increment the counter together with
> >> setting the NF_FLOW_HW flag when scheduling the offload add task on the
> >> workqueue, and decrement it after successfully scheduling the offload del
> >> task.
> >> 
> >> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
> >> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
> >> Reviewed-by: Paul Blakey <paulb@nvidia.com>
> >> ---
> >>  include/net/netns/nftables.h            |  1 +
> >>  net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
> >>  net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
> >>  3 files changed, 23 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
> >> index 8c77832d0240..262b8b3213cb 100644
> >> --- a/include/net/netns/nftables.h
> >> +++ b/include/net/netns/nftables.h
> >> @@ -6,6 +6,7 @@
> >>  
> >>  struct netns_nftables {
> >>  	u8			gencursor;
> >> +	atomic_t		count_hw;
> >
> > In addition to the previous comments: I'd suggest to use
> > register_pernet_subsys() and register the sysctl from the
> > nf_flow_table_offload.c file through nf_flow_table_offload_init(),
> > instead of using the conntrack nf_ct_sysctl_table[].
> >
> > That would require a bit more work though.
> 
> I added the new sysctl in ct because there is already the similar-ish
> NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD, which is also part of the ct
> sysctl table but is actually used by flow table code. I'll implement a
> dedicated sysctl table for the nf_flow_table_* code, if you suggest it
> is warranted for this change.

IIRC, that was removed.

commit 4592ee7f525c4683ec9e290381601fdee50ae110
Author: Florian Westphal <fw@strlen.de>
Date:   Wed Aug 4 15:02:15 2021 +0200

    netfilter: conntrack: remove offload_pickup sysctl again

I think it's good if we start having a dedicated sysctl for the
flowtable, yes.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-03-15 10:23       ` Pablo Neira Ayuso
@ 2022-03-15 16:18         ` Vlad Buslov
  0 siblings, 0 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-03-15 16:18 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue 15 Mar 2022 at 11:23, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sat, Mar 12, 2022 at 08:56:49PM +0200, Vlad Buslov wrote:
> [...]
>> Hi Pablo,
>> 
>> Thanks for reviewing my code and sorry for the late reply.
>> 
>> We explored the approach you propose and found several issues with it.
>> First, the nice benefit of the implementation in this patch is that
>> incrementing the counter in flow_offload_add() (and testing it in the
>> following patch) completely avoids spamming the workqueue when the limit
>> is reached, which is an important concern for slower embedded DPU cores.
>> Second, the counter cannot be changed where IPS_HW_OFFLOAD_BIT is set,
>> at the very end of flow_offload_work_add(), because in the following
>> patch we need to verify that the counter is within the user-specified
>> limit before attempting offload. Third, changing the counter in the wq
>> tasks makes it hard to balance correctly. Consider the following cases:
>> 
>> - flow_offload_work_add() can be called an arbitrary number of times per
>>   flow due to the refresh logic. However, any such flow is still deleted
>>   only once.
>> 
>> - flow_offload_work_del() can be called for flows that were never
>>   actually offloaded (it is called for any flow that has the NF_FLOW_HW
>>   bit, which is unconditionally set before attempting to schedule the
>>   offload task on the wq).
>>
>> The counter balancing issues could maybe be solved by carefully and
>> conditionally changing the counter based on the current value of
>> IPS_HW_OFFLOAD_BIT, but spamming the workqueue can't be prevented with
>> such a design.
>>
>> > That would also move the atomic away from the packet path.
>> 
>> I understand your concern. However, note that this atomic is normally
>> changed once for adding offloaded flow and once for removing it. The
>> code path is only executed per-packet in error cases where flow has
>> failed to offload and refresh is called repeatedly for same flow.
>
> Thanks for explaining.
>
> There used to be a list of pending flows to be offloaded in the code.
>
> I think it would be possible to restore such a list and make it
> per-cpu; the idea is to add a new field to the flow_offload structure
> to annotate the cpu that needs to deal with this flow (the same cpu
> deals with add/del/stats). The cpu field is set at flow creation time.

What would be the algorithm to assign the cpu field? A simple
algorithm like round-robin will not take into account the CPU load of
unrelated tasks (for example, OvS, which is also CPU-intensive), so
offload tasks on contested cores will get less CPU time, which will
result in unbalanced occupancy where some cores are idle and others
have a long list of offload tasks. Any advanced algorithm will be hard
to implement since we don't have access to the scheduler's internal
data. Also, in my experience not all offload tasks take the same
amount of CPU time (for example, offloading complex flows with tunnels
takes longer than simple flows, and deletes take less time than adds),
so the current list sizes don't directly translate to list processing
time.

>
> Once there is at least one item in the list, queue work on that cpu.
> While the work has not had a chance to run yet, we keep adding more
> items to the list.
>
> The workqueue handler then zaps the list of pending flows to be
> offloaded; it might find more than a single item in the list.

I understand the proposal, but I'm missing the benefit it provides over
the existing workqueue approach. Both the standard kernel linked list
and the workqueue are unbounded and don't count their elements, which
means we would still have to implement an approach similar to what is
proposed in the existing series: add an atomic to manually count the
size and reject new elements over some maximum (including, in the case
of a unified list, flow deletions that we don't really want to skip).

>
> So instead of three workqueues, we only have one. Scalability is
> achieved by fanning out flows over CPUs.

But the existing nf_ft_offload_* workqueues are already parallel and
unbound, so they already fan out tasks over CPU cores, and probably do
it better than any custom algorithm we could come up with, since their
threads are scheduled by the system scheduler, which takes the current
CPU load into account.
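
For reference, this is roughly how the existing queues are created in
nf_flow_table_offload_init() (quoting from memory, so the flags may
not be exact):

	nf_flow_offload_add_wq = alloc_workqueue("nf_ft_offload_add",
						 WQ_UNBOUND | WQ_SYSFS, 0);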

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
  2022-03-15 10:41       ` Pablo Neira Ayuso
@ 2022-03-15 16:34         ` Vlad Buslov
  0 siblings, 0 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-03-15 16:34 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb

On Tue 15 Mar 2022 at 11:41, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sat, Mar 12, 2022 at 09:51:45PM +0200, Vlad Buslov wrote:
>> 
>> On Mon 07 Mar 2022 at 22:56, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > On Tue, Feb 22, 2022 at 05:09:57PM +0200, Vlad Buslov wrote:
>> >> To improve hardware offload debuggability, and to allow capping the total
>> >> amount of offloaded entries in the following patch, extend struct
>> >> netns_nftables with a 'count_hw' counter and expose it to userspace as the
>> >> 'nf_flowtable_count_hw' sysctl entry. Increment the counter together with
>> >> setting the NF_FLOW_HW flag when scheduling the offload add task on the
>> >> workqueue, and decrement it after successfully scheduling the offload del
>> >> task.
>> >> 
>> >> Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
>> >> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
>> >> Reviewed-by: Paul Blakey <paulb@nvidia.com>
>> >> ---
>> >>  include/net/netns/nftables.h            |  1 +
>> >>  net/netfilter/nf_conntrack_standalone.c | 12 ++++++++++++
>> >>  net/netfilter/nf_flow_table_core.c      | 12 ++++++++++--
>> >>  3 files changed, 23 insertions(+), 2 deletions(-)
>> >> 
>> >> diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
>> >> index 8c77832d0240..262b8b3213cb 100644
>> >> --- a/include/net/netns/nftables.h
>> >> +++ b/include/net/netns/nftables.h
>> >> @@ -6,6 +6,7 @@
>> >>  
>> >>  struct netns_nftables {
>> >>  	u8			gencursor;
>> >> +	atomic_t		count_hw;
>> >
>> > In addition to the previous comments: I'd suggest to use
>> > register_pernet_subsys() and register the sysctl from the
>> > nf_flow_table_offload.c file through nf_flow_table_offload_init(),
>> > instead of using the conntrack nf_ct_sysctl_table[].
>> >
>> > That would require a bit more work though.
>> 
>> I added the new sysctl in ct because there is already the similar-ish
>> NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD, which is also part of the ct
>> sysctl table but is actually used by flow table code. I'll implement a
>> dedicated sysctl table for the nf_flow_table_* code, if you suggest it
>> is warranted for this change.
>
> IIRC, that was removed.
>
> commit 4592ee7f525c4683ec9e290381601fdee50ae110
> Author: Florian Westphal <fw@strlen.de>
> Date:   Wed Aug 4 15:02:15 2021 +0200
>
>     netfilter: conntrack: remove offload_pickup sysctl again
>
> I think it's good if we start having a dedicated sysctl for the
> flowtable, yes.
>
> Thanks.

Got it. Will move these sysctls into a dedicated namespace in v2.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints
  2022-03-15 10:29       ` Pablo Neira Ayuso
@ 2022-03-15 16:36         ` Vlad Buslov
  0 siblings, 0 replies; 27+ messages in thread
From: Vlad Buslov @ 2022-03-15 16:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kadlec, fw, ozsh, paulb


On Tue 15 Mar 2022 at 11:29, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sat, Mar 12, 2022 at 10:05:55PM +0200, Vlad Buslov wrote:
>> 
>> On Mon 07 Mar 2022 at 23:49, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > On Tue, Feb 22, 2022 at 05:10:03PM +0200, Vlad Buslov wrote:
>> >> Add tracepoints to trace creation and start of execution of flowtable
>> >> hardware offload 'add', 'del' and 'stats' tasks. Move struct
>> >> flow_offload_work from source into header file to allow access to structure
>> >> fields from tracepoint code.
>> >
>> > This patch, I would prefer to keep it back and explore exposing trace
>> > infrastructure for the flowtable through netlink.
>> >
>> 
>> What approach do you have in mind with netlink? I used tracepoints here
>> because they:
>> 
>> - Incur no performance penalty when disabled.
>> 
>> - Are handy to attach BPF programs to.
>> 
>> From my experience with optimizing the TC control path, Netlink parsing
>> is CPU-intensive. I am also not aware of mechanisms to leverage it to
>> attach BPF.
>
> Sure, no question that tracing and introspection are useful.
>
> But could you use the generic workqueue tracepoints instead?

I can. In fact, this is exactly what I use to implement such scripts
against the current upstream:

tracepoint:workqueue:workqueue_queue_work
/ str(args->workqueue) == "nf_ft_offload_add" /
{
    ...
}


However, note that such an approach:

1. Requires knowledge of kernel infrastructure internals. We would like
to make this accessible to more users than just kernel hackers.

2. Is probably slower due to the string comparison. I didn't benchmark
the CPU usage of scripts that rely on the workqueue tracepoints vs.
their re-implementation using the new dedicated tracepoints from this
patch, though.

>
> This is adding tracing infrastructure for a very specific purpose: to
> inspect the workqueue behaviour for the flowtable.
>
> And I am not sure how you would use this yet, other than observing
> whether the workqueue is coping with the workload?

Well, there are multiple different metrics that can constitute "coping".
Besides measuring the workqueue size, we are also interested in a
histogram of task processing latency, the current rate of workqueue
task creation vs. the task processing rate, etc. We could probably
implement all of these without any tracepoints at all by using just
kprobes, but such programs would be much more complicated and fragile.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-03-15 16:52 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
2022-02-22 15:09 ` [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Vlad Buslov
2022-03-07 21:09   ` Pablo Neira Ayuso
2022-02-22 15:09 ` [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries Vlad Buslov
2022-03-07 21:47   ` Pablo Neira Ayuso
2022-03-12 18:56     ` Vlad Buslov
2022-03-15 10:23       ` Pablo Neira Ayuso
2022-03-15 16:18         ` Vlad Buslov
2022-03-07 21:56   ` Pablo Neira Ayuso
2022-03-12 19:51     ` Vlad Buslov
2022-03-15 10:41       ` Pablo Neira Ayuso
2022-03-15 16:34         ` Vlad Buslov
2022-02-22 15:09 ` [PATCH net-next 3/8] netfilter: introduce max " Vlad Buslov
2022-03-07 22:13   ` Pablo Neira Ayuso
2022-03-12 19:32     ` Vlad Buslov
2022-02-22 15:09 ` [PATCH net-next 4/8] netfilter: introduce total count of hw offload 'add' workqueue tasks Vlad Buslov
2022-03-07 22:46   ` Pablo Neira Ayuso
2022-02-22 15:10 ` [PATCH net-next 5/8] netfilter: introduce max " Vlad Buslov
2022-03-07 22:43   ` Pablo Neira Ayuso
2022-03-12 19:59     ` Vlad Buslov
2022-02-22 15:10 ` [PATCH net-next 6/8] netfilter: introduce total count of hw offload 'del' " Vlad Buslov
2022-02-22 15:10 ` [PATCH net-next 7/8] netfilter: introduce total count of hw offload 'stats' wq tasks Vlad Buslov
2022-02-22 15:10 ` [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints Vlad Buslov
2022-03-07 22:49   ` Pablo Neira Ayuso
2022-03-12 20:05     ` Vlad Buslov
2022-03-15 10:29       ` Pablo Neira Ayuso
2022-03-15 16:36         ` Vlad Buslov
