All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list
@ 2022-02-24 16:44 Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 1/7] nfnetlink: handle already-released nl socket Florian Westphal
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

v2: fix EXPORT_SYMBOL_GPL related build failure in patch 6.
No other changes.

This is part 1 of a series that aims to remove both the unconfirmed
and dying lists.

This moves the dying list into the ecache infrastructure.
Entries are placed on this list only if the delivery of the destroy
event has failed (which implies that at least one userspace listener
did request redelivery).

The percpu dying list is removed in the last patch as it has no
functionality anymore.  This avoids the extra spinlock for conntrack
removal.

Florian Westphal (7):
  nfnetlink: handle already-released nl socket
  netfilter: ctnetlink: make ecache event cb global again
  netfilter: ecache: move to separate structure
  netfilter: ecache: use dedicated list for event redelivery
  netfilter: conntrack: split inner loop of list dumping to own function
  netfilter: conntrack: include ecache dying list in dumps
  netfilter: conntrack: remove the percpu dying list

 include/net/netfilter/nf_conntrack.h        |   9 +-
 include/net/netfilter/nf_conntrack_ecache.h |  21 +--
 include/net/netns/conntrack.h               |   2 -
 net/netfilter/nf_conntrack_core.c           |  60 ++++---
 net/netfilter/nf_conntrack_ecache.c         | 172 +++++++++-----------
 net/netfilter/nf_conntrack_netlink.c        | 150 +++++++++--------
 net/netfilter/nfnetlink.c                   |  62 +++++--
 7 files changed, 256 insertions(+), 220 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH nf-next v2 1/7] nfnetlink: handle already-released nl socket
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
@ 2022-02-24 16:44 ` Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 2/7] netfilter: ctnetlink: make ecache event cb global again Florian Westphal
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

At this time upper layer, e.g. nf_conntrack_event, has to make sure that
its pernet exit handler runs before the nfnetlink one, otherwise we get a
crash if kernel tries to send a conntrack event after the nfnetlink netns
exit handler did close the socket already.

In order to move nf_conntrack_ecache to global (not pernet) netns event
pointer again the nfnetlink apis need to survive attempts to send a netlink
message after the socket has been destroyed in nfnetlink netns exit
function.

Set the pernet socket to null in the pre_exit handler and close it in the
exit_batch handler via a 'stash' pointer.

All functions now check nlsk for NULL before using it.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nfnetlink.c | 62 +++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 7e2c8dd01408..6105dc9b8f8e 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -46,6 +46,7 @@ static unsigned int nfnetlink_pernet_id __read_mostly;
 
 struct nfnl_net {
 	struct sock *nfnl;
+	struct sock *nfnl_stash;
 };
 
 static struct {
@@ -160,37 +161,56 @@ nfnetlink_find_client(u16 type, const struct nfnetlink_subsystem *ss)
 	return &ss->cb[cb_id];
 }
 
-int nfnetlink_has_listeners(struct net *net, unsigned int group)
+static struct sock *nfnl_pernet_sk(struct net *net)
 {
 	struct nfnl_net *nfnlnet = nfnl_pernet(net);
 
-	return netlink_has_listeners(nfnlnet->nfnl, group);
+	return READ_ONCE(nfnlnet->nfnl);
+}
+
+int nfnetlink_has_listeners(struct net *net, unsigned int group)
+{
+	struct sock *nlsk = nfnl_pernet_sk(net);
+
+	return nlsk ? netlink_has_listeners(nlsk, group) : 0;
 }
 EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
 
 int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 portid,
 		   unsigned int group, int echo, gfp_t flags)
 {
-	struct nfnl_net *nfnlnet = nfnl_pernet(net);
+	struct sock *nlsk = nfnl_pernet_sk(net);
+
+	if (nlsk)
+		return nlmsg_notify(nlsk, skb, portid, group, echo, flags);
 
-	return nlmsg_notify(nfnlnet->nfnl, skb, portid, group, echo, flags);
+	/* nlsk already gone? This happens when .pre_exit was already called,
+	 * return 0, we can't retry.
+	 */
+	kfree_skb(skb);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(nfnetlink_send);
 
 int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error)
 {
-	struct nfnl_net *nfnlnet = nfnl_pernet(net);
+	struct sock *nlsk = nfnl_pernet_sk(net);
 
-	return netlink_set_err(nfnlnet->nfnl, portid, group, error);
+	return nlsk ? netlink_set_err(nlsk, portid, group, error) : 0;
 }
 EXPORT_SYMBOL_GPL(nfnetlink_set_err);
 
 int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid)
 {
-	struct nfnl_net *nfnlnet = nfnl_pernet(net);
+	struct sock *nlsk = nfnl_pernet_sk(net);
 	int err;
 
-	err = nlmsg_unicast(nfnlnet->nfnl, skb, portid);
+	if (!nlsk) {
+		kfree_skb(skb);
+		return 0;
+	}
+
+	err = nlmsg_unicast(nlsk, skb, portid);
 	if (err == -EAGAIN)
 		err = -ENOBUFS;
 
@@ -201,9 +221,12 @@ EXPORT_SYMBOL_GPL(nfnetlink_unicast);
 void nfnetlink_broadcast(struct net *net, struct sk_buff *skb, __u32 portid,
 			 __u32 group, gfp_t allocation)
 {
-	struct nfnl_net *nfnlnet = nfnl_pernet(net);
+	struct sock *nlsk = nfnl_pernet_sk(net);
 
-	netlink_broadcast(nfnlnet->nfnl, skb, portid, group, allocation);
+	if (nlsk)
+		netlink_broadcast(nlsk, skb, portid, group, allocation);
+	else
+		kfree_skb(skb);
 }
 EXPORT_SYMBOL_GPL(nfnetlink_broadcast);
 
@@ -247,7 +270,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	{
 		int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
-		struct nfnl_net *nfnlnet = nfnl_pernet(net);
+		struct sock *nlsk = nfnl_pernet_sk(net);
 		u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type);
 		struct nlattr *cda[NFNL_MAX_ATTR_COUNT + 1];
 		struct nlattr *attr = (void *)nlh + min_len;
@@ -255,7 +278,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 		__u8 subsys_id = NFNL_SUBSYS_ID(type);
 		struct nfnl_info info = {
 			.net	= net,
-			.sk	= nfnlnet->nfnl,
+			.sk	= nlsk,
 			.nlh	= nlh,
 			.nfmsg	= nlmsg_data(nlh),
 			.extack	= extack,
@@ -484,14 +507,14 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 		{
 			int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
-			struct nfnl_net *nfnlnet = nfnl_pernet(net);
+			struct sock *nlsk = nfnl_pernet_sk(net);
 			struct nlattr *cda[NFNL_MAX_ATTR_COUNT + 1];
 			struct nlattr *attr = (void *)nlh + min_len;
 			u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type);
 			int attrlen = nlh->nlmsg_len - min_len;
 			struct nfnl_info info = {
 				.net	= net,
-				.sk	= nfnlnet->nfnl,
+				.sk	= nlsk,
 				.nlh	= nlh,
 				.nfmsg	= nlmsg_data(nlh),
 				.extack	= &extack,
@@ -699,12 +722,21 @@ static void __net_exit nfnetlink_net_exit_batch(struct list_head *net_exit_list)
 	list_for_each_entry(net, net_exit_list, exit_list) {
 		nfnlnet = nfnl_pernet(net);
 
-		netlink_kernel_release(nfnlnet->nfnl);
+		netlink_kernel_release(nfnlnet->nfnl_stash);
 	}
 }
 
+static void __net_exit nfnetlink_net_pre_exit(struct net *net)
+{
+	struct nfnl_net *nfnlnet = nfnl_pernet(net);
+
+	nfnlnet->nfnl_stash = nfnlnet->nfnl;
+	WRITE_ONCE(nfnlnet->nfnl, NULL);
+}
+
 static struct pernet_operations nfnetlink_net_ops = {
 	.init		= nfnetlink_net_init,
+	.pre_exit	= nfnetlink_net_pre_exit,
 	.exit_batch	= nfnetlink_net_exit_batch,
 	.id		= &nfnetlink_pernet_id,
 	.size		= sizeof(struct nfnl_net),
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH nf-next v2 2/7] netfilter: ctnetlink: make ecache event cb global again
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 1/7] nfnetlink: handle already-released nl socket Florian Westphal
@ 2022-02-24 16:44 ` Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 3/7] netfilter: ecache: move to separate structure Florian Westphal
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

This was pernet to make sure we do not trip over already closed nfnl sk.

After moving nfnl validity checks to nfnetlink core this can be global
again, it only needs to be set to NULL when the module is removed.

This also avoids the need for pernet ops in ctnetlink and register mutex.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack_ecache.h | 18 +++++------
 include/net/netns/conntrack.h               |  1 -
 net/netfilter/nf_conntrack_ecache.c         | 32 ++++++-------------
 net/netfilter/nf_conntrack_netlink.c        | 35 ++++-----------------
 4 files changed, 23 insertions(+), 63 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index 6c4c490a3e34..ebd816af9c40 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -83,9 +83,8 @@ struct nf_ct_event_notifier {
 	int (*exp_event)(unsigned int events, const struct nf_exp_event *item);
 };
 
-void nf_conntrack_register_notifier(struct net *net,
-				   const struct nf_ct_event_notifier *nb);
-void nf_conntrack_unregister_notifier(struct net *net);
+void nf_conntrack_register_notifier(const struct nf_ct_event_notifier *nb);
+void nf_conntrack_unregister_notifier(void);
 
 void nf_ct_deliver_cached_events(struct nf_conn *ct);
 int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct,
@@ -107,14 +106,15 @@ static inline int nf_conntrack_eventmask_report(unsigned int eventmask,
 
 #endif
 
+extern const struct nf_ct_event_notifier __rcu *nf_conntrack_event_cb __read_mostly;
+
 static inline void
 nf_conntrack_event_cache(enum ip_conntrack_events event, struct nf_conn *ct)
 {
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
-	struct net *net = nf_ct_net(ct);
 	struct nf_conntrack_ecache *e;
 
-	if (!rcu_access_pointer(net->ct.nf_conntrack_event_cb))
+	if (!rcu_access_pointer(nf_conntrack_event_cb))
 		return;
 
 	e = nf_ct_ecache_find(ct);
@@ -130,9 +130,7 @@ nf_conntrack_event_report(enum ip_conntrack_events event, struct nf_conn *ct,
 			  u32 portid, int report)
 {
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
-	const struct net *net = nf_ct_net(ct);
-
-	if (!rcu_access_pointer(net->ct.nf_conntrack_event_cb))
+	if (!rcu_access_pointer(nf_conntrack_event_cb))
 		return 0;
 
 	return nf_conntrack_eventmask_report(1 << event, ct, portid, report);
@@ -145,9 +143,7 @@ static inline int
 nf_conntrack_event(enum ip_conntrack_events event, struct nf_conn *ct)
 {
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
-	const struct net *net = nf_ct_net(ct);
-
-	if (!rcu_access_pointer(net->ct.nf_conntrack_event_cb))
+	if (!rcu_access_pointer(nf_conntrack_event_cb))
 		return 0;
 
 	return nf_conntrack_eventmask_report(1 << event, ct, 0, 0);
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index 0294f3d473af..3bb62e938fa9 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -112,7 +112,6 @@ struct netns_ct {
 
 	struct ct_pcpu __percpu *pcpu_lists;
 	struct ip_conntrack_stat __percpu *stat;
-	struct nf_ct_event_notifier __rcu *nf_conntrack_event_cb;
 	struct nf_ip_net	nf_ct_proto;
 #if defined(CONFIG_NF_CONNTRACK_LABELS)
 	unsigned int		labels_used;
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index 07e65b4e92f8..13ba6aa82ec1 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -27,7 +27,7 @@
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_conntrack_extend.h>
 
-static DEFINE_MUTEX(nf_ct_ecache_mutex);
+const struct nf_ct_event_notifier __rcu *nf_conntrack_event_cb __read_mostly;
 
 #define ECACHE_RETRY_WAIT (HZ/10)
 #define ECACHE_STACK_ALLOC (256 / sizeof(void *))
@@ -135,8 +135,7 @@ static int __nf_conntrack_eventmask_report(struct nf_conntrack_ecache *e,
 					   const u32 missed,
 					   const struct nf_ct_event *item)
 {
-	struct net *net = nf_ct_net(item->ct);
-	struct nf_ct_event_notifier *notify;
+	const struct nf_ct_event_notifier *notify;
 	u32 old, want;
 	int ret;
 
@@ -145,7 +144,7 @@ static int __nf_conntrack_eventmask_report(struct nf_conntrack_ecache *e,
 
 	rcu_read_lock();
 
-	notify = rcu_dereference(net->ct.nf_conntrack_event_cb);
+	notify = rcu_dereference(nf_conntrack_event_cb);
 	if (!notify) {
 		rcu_read_unlock();
 		return 0;
@@ -240,12 +239,11 @@ void nf_ct_expect_event_report(enum ip_conntrack_expect_events event,
 			       u32 portid, int report)
 
 {
-	struct net *net = nf_ct_exp_net(exp);
-	struct nf_ct_event_notifier *notify;
+	const struct nf_ct_event_notifier *notify;
 	struct nf_conntrack_ecache *e;
 
 	rcu_read_lock();
-	notify = rcu_dereference(net->ct.nf_conntrack_event_cb);
+	notify = rcu_dereference(nf_conntrack_event_cb);
 	if (!notify)
 		goto out_unlock;
 
@@ -265,26 +263,16 @@ void nf_ct_expect_event_report(enum ip_conntrack_expect_events event,
 	rcu_read_unlock();
 }
 
-void nf_conntrack_register_notifier(struct net *net,
-				    const struct nf_ct_event_notifier *new)
+void nf_conntrack_register_notifier(const struct nf_ct_event_notifier *new)
 {
-	struct nf_ct_event_notifier *notify;
-
-	mutex_lock(&nf_ct_ecache_mutex);
-	notify = rcu_dereference_protected(net->ct.nf_conntrack_event_cb,
-					   lockdep_is_held(&nf_ct_ecache_mutex));
-	WARN_ON_ONCE(notify);
-	rcu_assign_pointer(net->ct.nf_conntrack_event_cb, new);
-	mutex_unlock(&nf_ct_ecache_mutex);
+	WARN_ON_ONCE(rcu_access_pointer(nf_conntrack_event_cb));
+	rcu_assign_pointer(nf_conntrack_event_cb, new);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_register_notifier);
 
-void nf_conntrack_unregister_notifier(struct net *net)
+void nf_conntrack_unregister_notifier(void)
 {
-	mutex_lock(&nf_ct_ecache_mutex);
-	RCU_INIT_POINTER(net->ct.nf_conntrack_event_cb, NULL);
-	mutex_unlock(&nf_ct_ecache_mutex);
-	/* synchronize_rcu() is called after netns pre_exit */
+	RCU_INIT_POINTER(nf_conntrack_event_cb, NULL);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_unregister_notifier);
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 1ea2ad732d57..a37a988b2b36 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3772,7 +3772,7 @@ static int ctnetlink_stat_exp_cpu(struct sk_buff *skb,
 }
 
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
-static struct nf_ct_event_notifier ctnl_notifier = {
+static const struct nf_ct_event_notifier ctnl_notifier = {
 	.ct_event = ctnetlink_conntrack_event,
 	.exp_event = ctnetlink_expect_event,
 };
@@ -3864,26 +3864,6 @@ MODULE_ALIAS("ip_conntrack_netlink");
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK);
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK_EXP);
 
-static int __net_init ctnetlink_net_init(struct net *net)
-{
-#ifdef CONFIG_NF_CONNTRACK_EVENTS
-	nf_conntrack_register_notifier(net, &ctnl_notifier);
-#endif
-	return 0;
-}
-
-static void ctnetlink_net_pre_exit(struct net *net)
-{
-#ifdef CONFIG_NF_CONNTRACK_EVENTS
-	nf_conntrack_unregister_notifier(net);
-#endif
-}
-
-static struct pernet_operations ctnetlink_net_ops = {
-	.init		= ctnetlink_net_init,
-	.pre_exit	= ctnetlink_net_pre_exit,
-};
-
 static int __init ctnetlink_init(void)
 {
 	int ret;
@@ -3902,19 +3882,14 @@ static int __init ctnetlink_init(void)
 		goto err_unreg_subsys;
 	}
 
-	ret = register_pernet_subsys(&ctnetlink_net_ops);
-	if (ret < 0) {
-		pr_err("ctnetlink_init: cannot register pernet operations\n");
-		goto err_unreg_exp_subsys;
-	}
+	nf_conntrack_register_notifier(&ctnl_notifier);
+
 #ifdef CONFIG_NETFILTER_NETLINK_GLUE_CT
 	/* setup interaction between nf_queue and nf_conntrack_netlink. */
 	RCU_INIT_POINTER(nfnl_ct_hook, &ctnetlink_glue_hook);
 #endif
 	return 0;
 
-err_unreg_exp_subsys:
-	nfnetlink_subsys_unregister(&ctnl_exp_subsys);
 err_unreg_subsys:
 	nfnetlink_subsys_unregister(&ctnl_subsys);
 err_out:
@@ -3923,7 +3898,9 @@ static int __init ctnetlink_init(void)
 
 static void __exit ctnetlink_exit(void)
 {
-	unregister_pernet_subsys(&ctnetlink_net_ops);
+#ifdef CONFIG_NF_CONNTRACK_EVENTS
+	nf_conntrack_unregister_notifier();
+#endif
 	nfnetlink_subsys_unregister(&ctnl_exp_subsys);
 	nfnetlink_subsys_unregister(&ctnl_subsys);
 #ifdef CONFIG_NETFILTER_NETLINK_GLUE_CT
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH nf-next v2 3/7] netfilter: ecache: move to separate structure
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 1/7] nfnetlink: handle already-released nl socket Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 2/7] netfilter: ctnetlink: make ecache event cb global again Florian Westphal
@ 2022-02-24 16:44 ` Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 4/7] netfilter: ecache: use dedicated list for event redelivery Florian Westphal
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

This makes it easier for a followup patch to only expose ecache
related parts of nf_conntrack_net structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h |  8 ++++++--
 net/netfilter/nf_conntrack_ecache.c  | 19 ++++++++++---------
 2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 8731d5bcb47d..20fefe043850 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -43,6 +43,11 @@ union nf_conntrack_expect_proto {
 	/* insert expect proto private data here */
 };
 
+struct nf_conntrack_net_ecache {
+	struct delayed_work dwork;
+	struct netns_ct *ct_net;
+};
+
 struct nf_conntrack_net {
 	/* only used when new connection is allocated: */
 	atomic_t count;
@@ -58,8 +63,7 @@ struct nf_conntrack_net {
 	struct ctl_table_header	*sysctl_header;
 #endif
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
-	struct delayed_work ecache_dwork;
-	struct netns_ct *ct_net;
+	struct nf_conntrack_net_ecache ecache;
 #endif
 };
 
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index 13ba6aa82ec1..d7a07873e605 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -96,8 +96,8 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
 
 static void ecache_work(struct work_struct *work)
 {
-	struct nf_conntrack_net *cnet = container_of(work, struct nf_conntrack_net, ecache_dwork.work);
-	struct netns_ct *ctnet = cnet->ct_net;
+	struct nf_conntrack_net *cnet = container_of(work, struct nf_conntrack_net, ecache.dwork.work);
+	struct netns_ct *ctnet = cnet->ecache.ct_net;
 	int cpu, delay = -1;
 	struct ct_pcpu *pcpu;
 
@@ -127,7 +127,7 @@ static void ecache_work(struct work_struct *work)
 
 	ctnet->ecache_dwork_pending = delay > 0;
 	if (delay >= 0)
-		schedule_delayed_work(&cnet->ecache_dwork, delay);
+		schedule_delayed_work(&cnet->ecache.dwork, delay);
 }
 
 static int __nf_conntrack_eventmask_report(struct nf_conntrack_ecache *e,
@@ -281,12 +281,12 @@ void nf_conntrack_ecache_work(struct net *net, enum nf_ct_ecache_state state)
 	struct nf_conntrack_net *cnet = nf_ct_pernet(net);
 
 	if (state == NFCT_ECACHE_DESTROY_FAIL &&
-	    !delayed_work_pending(&cnet->ecache_dwork)) {
-		schedule_delayed_work(&cnet->ecache_dwork, HZ);
+	    !delayed_work_pending(&cnet->ecache.dwork)) {
+		schedule_delayed_work(&cnet->ecache.dwork, HZ);
 		net->ct.ecache_dwork_pending = true;
 	} else if (state == NFCT_ECACHE_DESTROY_SENT) {
 		net->ct.ecache_dwork_pending = false;
-		mod_delayed_work(system_wq, &cnet->ecache_dwork, 0);
+		mod_delayed_work(system_wq, &cnet->ecache.dwork, 0);
 	}
 }
 
@@ -298,8 +298,9 @@ void nf_conntrack_ecache_pernet_init(struct net *net)
 	struct nf_conntrack_net *cnet = nf_ct_pernet(net);
 
 	net->ct.sysctl_events = nf_ct_events;
-	cnet->ct_net = &net->ct;
-	INIT_DELAYED_WORK(&cnet->ecache_dwork, ecache_work);
+
+	cnet->ecache.ct_net = &net->ct;
+	INIT_DELAYED_WORK(&cnet->ecache.dwork, ecache_work);
 
 	BUILD_BUG_ON(__IPCT_MAX >= 16);	/* e->ctmask is u16 */
 }
@@ -308,5 +309,5 @@ void nf_conntrack_ecache_pernet_fini(struct net *net)
 {
 	struct nf_conntrack_net *cnet = nf_ct_pernet(net);
 
-	cancel_delayed_work_sync(&cnet->ecache_dwork);
+	cancel_delayed_work_sync(&cnet->ecache.dwork);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH nf-next v2 4/7] netfilter: ecache: use dedicated list for event redelivery
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
                   ` (2 preceding siblings ...)
  2022-02-24 16:44 ` [PATCH nf-next v2 3/7] netfilter: ecache: move to separate structure Florian Westphal
@ 2022-02-24 16:44 ` Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 5/7] netfilter: conntrack: split inner loop of list dumping to own function Florian Westphal
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

This disentangles event redelivery and the percpu dying list.

Because entries are now stored on a dedicated list, all
entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries
still have confirmed bit set -- the reference count is at least 1.

The 'struct net' back-pointer can be removed as well.

The pcpu dying list will be removed eventually, it has no functionality.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h        |   3 +-
 include/net/netfilter/nf_conntrack_ecache.h |   1 -
 net/netfilter/nf_conntrack_core.c           |  33 +++++-
 net/netfilter/nf_conntrack_ecache.c         | 118 +++++++++-----------
 4 files changed, 83 insertions(+), 72 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 20fefe043850..dbbb0e206901 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -45,7 +45,8 @@ union nf_conntrack_expect_proto {
 
 struct nf_conntrack_net_ecache {
 	struct delayed_work dwork;
-	struct netns_ct *ct_net;
+	spinlock_t dying_lock;
+	struct hlist_nulls_head dying_list;
 };
 
 struct nf_conntrack_net {
diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index ebd816af9c40..c63a8fc3225e 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -14,7 +14,6 @@
 #include <net/netfilter/nf_conntrack_extend.h>
 
 enum nf_ct_ecache_state {
-	NFCT_ECACHE_UNKNOWN,		/* destroy event not sent */
 	NFCT_ECACHE_DESTROY_FAIL,	/* tried but failed to send destroy event */
 	NFCT_ECACHE_DESTROY_SENT,	/* sent destroy event after failure */
 };
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9b7f9c966f73..7eefcfa55fc2 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -647,15 +647,12 @@ void nf_ct_destroy(struct nf_conntrack *nfct)
 }
 EXPORT_SYMBOL(nf_ct_destroy);
 
-static void nf_ct_delete_from_lists(struct nf_conn *ct)
+static void __nf_ct_delete_from_lists(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 	unsigned int hash, reply_hash;
 	unsigned int sequence;
 
-	nf_ct_helper_destroy(ct);
-
-	local_bh_disable();
 	do {
 		sequence = read_seqcount_begin(&nf_conntrack_generation);
 		hash = hash_conntrack(net,
@@ -668,12 +665,33 @@ static void nf_ct_delete_from_lists(struct nf_conn *ct)
 
 	clean_from_lists(ct);
 	nf_conntrack_double_unlock(hash, reply_hash);
+}
 
+static void nf_ct_delete_from_lists(struct nf_conn *ct)
+{
+	nf_ct_helper_destroy(ct);
+	local_bh_disable();
+
+	__nf_ct_delete_from_lists(ct);
 	nf_ct_add_to_dying_list(ct);
 
 	local_bh_enable();
 }
 
+static void nf_ct_add_to_ecache_list(struct nf_conn *ct)
+{
+#ifdef CONFIG_NF_CONNTRACK_EVENTS
+	struct nf_conntrack_net *cnet = nf_ct_pernet(nf_ct_net(ct));
+
+	spin_lock(&cnet->ecache.dying_lock);
+	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
+				 &cnet->ecache.dying_list);
+	spin_unlock(&cnet->ecache.dying_lock);
+#else
+	nf_ct_add_to_dying_list(ct);
+#endif
+}
+
 bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report)
 {
 	struct nf_conn_tstamp *tstamp;
@@ -696,7 +714,12 @@ bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report)
 		/* destroy event was not delivered. nf_ct_put will
 		 * be done by event cache worker on redelivery.
 		 */
-		nf_ct_delete_from_lists(ct);
+		nf_ct_helper_destroy(ct);
+		local_bh_disable();
+		__nf_ct_delete_from_lists(ct);
+		nf_ct_add_to_ecache_list(ct);
+		local_bh_enable();
+
 		nf_conntrack_ecache_work(nf_ct_net(ct), NFCT_ECACHE_DESTROY_FAIL);
 		return false;
 	}
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index d7a07873e605..be111218899d 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -16,7 +16,6 @@
 #include <linux/vmalloc.h>
 #include <linux/stddef.h>
 #include <linux/err.h>
-#include <linux/percpu.h>
 #include <linux/kernel.h>
 #include <linux/netdevice.h>
 #include <linux/slab.h>
@@ -29,8 +28,9 @@
 
 const struct nf_ct_event_notifier __rcu *nf_conntrack_event_cb __read_mostly;
 
-#define ECACHE_RETRY_WAIT (HZ/10)
-#define ECACHE_STACK_ALLOC (256 / sizeof(void *))
+#define DYING_NULLS_VAL			((1 << 30) + 1)
+#define ECACHE_MAX_JIFFIES		msecs_to_jiffies(10)
+#define ECACHE_RETRY_JIFFIES		msecs_to_jiffies(10)
 
 enum retry_state {
 	STATE_CONGESTED,
@@ -38,58 +38,59 @@ enum retry_state {
 	STATE_DONE,
 };
 
-static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
+static enum retry_state ecache_work_evict_list(struct nf_conntrack_net *cnet)
 {
-	struct nf_conn *refs[ECACHE_STACK_ALLOC];
+	unsigned long stop = jiffies + ECACHE_MAX_JIFFIES;
+	struct hlist_nulls_head evicted_list;
 	enum retry_state ret = STATE_DONE;
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
-	unsigned int evicted = 0;
+	unsigned int sent;
 
-	spin_lock(&pcpu->lock);
+	INIT_HLIST_NULLS_HEAD(&evicted_list, DYING_NULLS_VAL);
 
-	hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) {
+next:
+	sent = 0;
+	spin_lock_bh(&cnet->ecache.dying_lock);
+
+	hlist_nulls_for_each_entry_safe(h, n, &cnet->ecache.dying_list, hnnode) {
 		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
-		struct nf_conntrack_ecache *e;
-
-		if (!nf_ct_is_confirmed(ct))
-			continue;
-
-		/* This ecache access is safe because the ct is on the
-		 * pcpu dying list and we hold the spinlock -- the entry
-		 * cannot be free'd until after the lock is released.
-		 *
-		 * This is true even if ct has a refcount of 0: the
-		 * cpu that is about to free the entry must remove it
-		 * from the dying list and needs the lock to do so.
-		 */
-		e = nf_ct_ecache_find(ct);
-		if (!e || e->state != NFCT_ECACHE_DESTROY_FAIL)
-			continue;
 
-		/* ct is in NFCT_ECACHE_DESTROY_FAIL state, this means
-		 * the worker owns this entry: the ct will remain valid
-		 * until the worker puts its ct reference.
+		/* The worker owns all entries, ct remains valid until nf_ct_put
+		 * in the loop below.
 		 */
 		if (nf_conntrack_event(IPCT_DESTROY, ct)) {
 			ret = STATE_CONGESTED;
 			break;
 		}
 
-		e->state = NFCT_ECACHE_DESTROY_SENT;
-		refs[evicted] = ct;
+		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
+		hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode, &evicted_list);
 
-		if (++evicted >= ARRAY_SIZE(refs)) {
+		if (time_after(stop, jiffies)) {
 			ret = STATE_RESTART;
 			break;
 		}
+
+		if (sent++ > 16) {
+			spin_unlock_bh(&cnet->ecache.dying_lock);
+			cond_resched();
+			spin_lock_bh(&cnet->ecache.dying_lock);
+			goto next;
+		}
 	}
 
-	spin_unlock(&pcpu->lock);
+	spin_unlock_bh(&cnet->ecache.dying_lock);
 
-	/* can't _put while holding lock */
-	while (evicted)
-		nf_ct_put(refs[--evicted]);
+	hlist_nulls_for_each_entry_safe(h, n, &evicted_list, hnnode) {
+		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+		hlist_nulls_add_fake(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
+		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode);
+		nf_ct_put(ct);
+
+		cond_resched();
+	}
 
 	return ret;
 }
@@ -97,35 +98,20 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
 static void ecache_work(struct work_struct *work)
 {
 	struct nf_conntrack_net *cnet = container_of(work, struct nf_conntrack_net, ecache.dwork.work);
-	struct netns_ct *ctnet = cnet->ecache.ct_net;
-	int cpu, delay = -1;
-	struct ct_pcpu *pcpu;
-
-	local_bh_disable();
-
-	for_each_possible_cpu(cpu) {
-		enum retry_state ret;
-
-		pcpu = per_cpu_ptr(ctnet->pcpu_lists, cpu);
-
-		ret = ecache_work_evict_list(pcpu);
-
-		switch (ret) {
-		case STATE_CONGESTED:
-			delay = ECACHE_RETRY_WAIT;
-			goto out;
-		case STATE_RESTART:
-			delay = 0;
-			break;
-		case STATE_DONE:
-			break;
-		}
+	int ret, delay = -1;
+
+	ret = ecache_work_evict_list(cnet);
+	switch (ret) {
+	case STATE_CONGESTED:
+		delay = ECACHE_RETRY_JIFFIES;
+		break;
+	case STATE_RESTART:
+		delay = 0;
+		break;
+	case STATE_DONE:
+		break;
 	}
 
- out:
-	local_bh_enable();
-
-	ctnet->ecache_dwork_pending = delay > 0;
 	if (delay >= 0)
 		schedule_delayed_work(&cnet->ecache.dwork, delay);
 }
@@ -198,7 +184,6 @@ int nf_conntrack_eventmask_report(unsigned int events, struct nf_conn *ct,
 		 */
 		if (e->portid == 0 && portid != 0)
 			e->portid = portid;
-		e->state = NFCT_ECACHE_DESTROY_FAIL;
 	}
 
 	return ret;
@@ -285,8 +270,10 @@ void nf_conntrack_ecache_work(struct net *net, enum nf_ct_ecache_state state)
 		schedule_delayed_work(&cnet->ecache.dwork, HZ);
 		net->ct.ecache_dwork_pending = true;
 	} else if (state == NFCT_ECACHE_DESTROY_SENT) {
-		net->ct.ecache_dwork_pending = false;
-		mod_delayed_work(system_wq, &cnet->ecache.dwork, 0);
+		if (!hlist_nulls_empty(&cnet->ecache.dying_list))
+			mod_delayed_work(system_wq, &cnet->ecache.dwork, 0);
+		else
+			net->ct.ecache_dwork_pending = false;
 	}
 }
 
@@ -299,8 +286,9 @@ void nf_conntrack_ecache_pernet_init(struct net *net)
 
 	net->ct.sysctl_events = nf_ct_events;
 
-	cnet->ecache.ct_net = &net->ct;
 	INIT_DELAYED_WORK(&cnet->ecache.dwork, ecache_work);
+	INIT_HLIST_NULLS_HEAD(&cnet->ecache.dying_list, DYING_NULLS_VAL);
+	spin_lock_init(&cnet->ecache.dying_lock);
 
 	BUILD_BUG_ON(__IPCT_MAX >= 16);	/* e->ctmask is u16 */
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH nf-next v2 5/7] netfilter: conntrack: split inner loop of list dumping to own function
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
                   ` (3 preceding siblings ...)
  2022-02-24 16:44 ` [PATCH nf-next v2 4/7] netfilter: ecache: use dedicated list for event redelivery Florian Westphal
@ 2022-02-24 16:44 ` Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 6/7] netfilter: conntrack: include ecache dying list in dumps Florian Westphal
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

This allows code re-use in the followup patch.
No functional changes intended.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_netlink.c | 68 ++++++++++++++++++----------
 1 file changed, 43 insertions(+), 25 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index a37a988b2b36..940bd13a7fca 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1708,6 +1708,47 @@ static int ctnetlink_done_list(struct netlink_callback *cb)
 	return 0;
 }
 
+static int ctnetlink_dump_one_entry(struct sk_buff *skb,
+				    struct netlink_callback *cb,
+				    struct nf_conn *ct,
+				    bool dying)
+{
+	struct ctnetlink_list_dump_ctx *ctx = (void *)cb->ctx;
+	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
+	u8 l3proto = nfmsg->nfgen_family;
+	int res;
+
+	if (l3proto && nf_ct_l3num(ct) != l3proto)
+		return 0;
+
+	if (ctx->last) {
+		if (ct != ctx->last)
+			return 0;
+
+		ctx->last = NULL;
+	}
+
+	/* We can't dump extension info for the unconfirmed
+	 * list because unconfirmed conntracks can have
+	 * ct->ext reallocated (and thus freed).
+	 *
+	 * In the dying list case ct->ext can't be free'd
+	 * until after we drop pcpu->lock.
+	 */
+	res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid,
+				  cb->nlh->nlmsg_seq,
+				  NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
+				  ct, dying, 0);
+	if (res < 0) {
+		if (!refcount_inc_not_zero(&ct->ct_general.use))
+			return 0;
+
+		ctx->last = ct;
+	}
+
+	return res;
+}
+
 static int
 ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying)
 {
@@ -1715,12 +1756,9 @@ ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying
 	struct nf_conn *ct, *last;
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
-	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
-	u_int8_t l3proto = nfmsg->nfgen_family;
-	int res;
-	int cpu;
 	struct hlist_nulls_head *list;
 	struct net *net = sock_net(skb->sk);
+	int res, cpu;
 
 	if (ctx->done)
 		return 0;
@@ -1739,30 +1777,10 @@ ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying
 restart:
 		hlist_nulls_for_each_entry(h, n, list, hnnode) {
 			ct = nf_ct_tuplehash_to_ctrack(h);
-			if (l3proto && nf_ct_l3num(ct) != l3proto)
-				continue;
-			if (ctx->last) {
-				if (ct != last)
-					continue;
-				ctx->last = NULL;
-			}
 
-			/* We can't dump extension info for the unconfirmed
-			 * list because unconfirmed conntracks can have
-			 * ct->ext reallocated (and thus freed).
-			 *
-			 * In the dying list case ct->ext can't be free'd
-			 * until after we drop pcpu->lock.
-			 */
-			res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid,
-						  cb->nlh->nlmsg_seq,
-						  NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
-						  ct, dying, 0);
+			res = ctnetlink_dump_one_entry(skb, cb, ct, dying);
 			if (res < 0) {
-				if (!refcount_inc_not_zero(&ct->ct_general.use))
-					continue;
 				ctx->cpu = cpu;
-				ctx->last = ct;
 				spin_unlock_bh(&pcpu->lock);
 				goto out;
 			}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH nf-next v2 6/7] netfilter: conntrack: include ecache dying list in dumps
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
                   ` (4 preceding siblings ...)
  2022-02-24 16:44 ` [PATCH nf-next v2 5/7] netfilter: conntrack: split inner loop of list dumping to own function Florian Westphal
@ 2022-02-24 16:44 ` Florian Westphal
  2022-02-24 16:44 ` [PATCH nf-next v2 7/7] netfilter: conntrack: remove the percpu dying list Florian Westphal
  2022-03-02 23:26 ` [PATCH nf-next v2 0/7] netfilter: remove pcpu " Pablo Neira Ayuso
  7 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

The new pernet dying list includes conntrack entries that await
delivery of the 'destroy' event via ctnetlink.

The old percpu dying list will be removed soon.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack_ecache.h |  2 ++
 net/netfilter/nf_conntrack_ecache.c         | 10 ++++++
 net/netfilter/nf_conntrack_netlink.c        | 38 +++++++++++++++++++++
 3 files changed, 50 insertions(+)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index c63a8fc3225e..54051e663ff4 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -161,6 +161,8 @@ void nf_conntrack_ecache_work(struct net *net, enum nf_ct_ecache_state state);
 void nf_conntrack_ecache_pernet_init(struct net *net);
 void nf_conntrack_ecache_pernet_fini(struct net *net);
 
+struct nf_conntrack_net_ecache *nf_conn_pernet_ecache(const struct net *net);
+
 static inline bool nf_conntrack_ecache_dwork_pending(const struct net *net)
 {
 	return net->ct.ecache_dwork_pending;
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index be111218899d..ce3ebd420585 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -38,6 +38,16 @@ enum retry_state {
 	STATE_DONE,
 };
 
+struct nf_conntrack_net_ecache *nf_conn_pernet_ecache(const struct net *net)
+{
+	struct nf_conntrack_net *cnet = nf_ct_pernet(net);
+
+	return &cnet->ecache;
+}
+#if IS_MODULE(CONFIG_NF_CT_NETLINK)
+EXPORT_SYMBOL_GPL(nf_conn_pernet_ecache);
+#endif
+
 static enum retry_state ecache_work_evict_list(struct nf_conntrack_net *cnet)
 {
 	unsigned long stop = jiffies + ECACHE_MAX_JIFFIES;
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 940bd13a7fca..44cad50ef67f 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -62,6 +62,7 @@ struct ctnetlink_list_dump_ctx {
 	struct nf_conn *last;
 	unsigned int cpu;
 	bool done;
+	bool retrans_done;
 };
 
 static int ctnetlink_dump_tuples_proto(struct sk_buff *skb,
@@ -1802,6 +1803,43 @@ ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying
 static int
 ctnetlink_dump_dying(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct ctnetlink_list_dump_ctx *ctx = (void *)cb->ctx;
+	struct nf_conntrack_net_ecache *ecache_net;
+	const struct net *net = sock_net(skb->sk);
+	struct nf_conn *last = ctx->last;
+	struct nf_conntrack_tuple_hash *h;
+	struct hlist_nulls_node *n;
+
+	if (ctx->retrans_done)
+		return ctnetlink_dump_list(skb, cb, true);
+
+	ctx->last = NULL;
+	ecache_net = nf_conn_pernet_ecache(net);
+	spin_lock_bh(&ecache_net->dying_lock);
+
+	hlist_nulls_for_each_entry(h, n, &ecache_net->dying_list, hnnode) {
+		struct nf_conn *ct;
+		int res;
+
+		ct = nf_ct_tuplehash_to_ctrack(h);
+		if (last && last != ct)
+			continue;
+
+		res = ctnetlink_dump_one_entry(skb, cb, ct, true);
+		if (res < 0) {
+			spin_unlock_bh(&ecache_net->dying_lock);
+			nf_ct_put(last);
+			return skb->len;
+		}
+
+		nf_ct_put(last);
+		last = NULL;
+	}
+
+	ctx->retrans_done = true;
+	spin_unlock_bh(&ecache_net->dying_lock);
+	nf_ct_put(last);
+
 	return ctnetlink_dump_list(skb, cb, true);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH nf-next v2 7/7] netfilter: conntrack: remove the percpu dying list
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
                   ` (5 preceding siblings ...)
  2022-02-24 16:44 ` [PATCH nf-next v2 6/7] netfilter: conntrack: include ecache dying list in dumps Florian Westphal
@ 2022-02-24 16:44 ` Florian Westphal
  2022-03-02 23:26 ` [PATCH nf-next v2 0/7] netfilter: remove pcpu " Pablo Neira Ayuso
  7 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2022-02-24 16:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Its no longer needed. Entries that need event redelivery are placed
on the new pernet dying list.

The advantage is that there is no need to take additional spinlock on
conntrack removal unless event redelivery failed or the conntrack entry
was never added to the table in the first place (confirmed bit not set).

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netns/conntrack.h        |  1 -
 net/netfilter/nf_conntrack_core.c    | 35 ++++------------------------
 net/netfilter/nf_conntrack_ecache.c  |  1 -
 net/netfilter/nf_conntrack_netlink.c | 21 ++++-------------
 4 files changed, 9 insertions(+), 49 deletions(-)

diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index 3bb62e938fa9..dd1d096b2ada 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -96,7 +96,6 @@ struct nf_ip_net {
 struct ct_pcpu {
 	spinlock_t		lock;
 	struct hlist_nulls_head unconfirmed;
-	struct hlist_nulls_head dying;
 };
 
 struct netns_ct {
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7eefcfa55fc2..9ca862bc5d7d 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -512,21 +512,6 @@ clean_from_lists(struct nf_conn *ct)
 	nf_ct_remove_expectations(ct);
 }
 
-/* must be called with local_bh_disable */
-static void nf_ct_add_to_dying_list(struct nf_conn *ct)
-{
-	struct ct_pcpu *pcpu;
-
-	/* add this conntrack to the (per cpu) dying list */
-	ct->cpu = smp_processor_id();
-	pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu);
-
-	spin_lock(&pcpu->lock);
-	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
-			     &pcpu->dying);
-	spin_unlock(&pcpu->lock);
-}
-
 /* must be called with local_bh_disable */
 static void nf_ct_add_to_unconfirmed_list(struct nf_conn *ct)
 {
@@ -543,11 +528,11 @@ static void nf_ct_add_to_unconfirmed_list(struct nf_conn *ct)
 }
 
 /* must be called with local_bh_disable */
-static void nf_ct_del_from_dying_or_unconfirmed_list(struct nf_conn *ct)
+static void nf_ct_del_from_unconfirmed_list(struct nf_conn *ct)
 {
 	struct ct_pcpu *pcpu;
 
-	/* We overload first tuple to link into unconfirmed or dying list.*/
+	/* We overload first tuple to link into unconfirmed list.*/
 	pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu);
 
 	spin_lock(&pcpu->lock);
@@ -635,7 +620,8 @@ void nf_ct_destroy(struct nf_conntrack *nfct)
 	 */
 	nf_ct_remove_expectations(ct);
 
-	nf_ct_del_from_dying_or_unconfirmed_list(ct);
+	if (unlikely(!nf_ct_is_confirmed(ct)))
+		nf_ct_del_from_unconfirmed_list(ct);
 
 	local_bh_enable();
 
@@ -673,23 +659,18 @@ static void nf_ct_delete_from_lists(struct nf_conn *ct)
 	local_bh_disable();
 
 	__nf_ct_delete_from_lists(ct);
-	nf_ct_add_to_dying_list(ct);
 
 	local_bh_enable();
 }
 
 static void nf_ct_add_to_ecache_list(struct nf_conn *ct)
 {
-#ifdef CONFIG_NF_CONNTRACK_EVENTS
 	struct nf_conntrack_net *cnet = nf_ct_pernet(nf_ct_net(ct));
 
 	spin_lock(&cnet->ecache.dying_lock);
 	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
 				 &cnet->ecache.dying_list);
 	spin_unlock(&cnet->ecache.dying_lock);
-#else
-	nf_ct_add_to_dying_list(ct);
-#endif
 }
 
 bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report)
@@ -1011,7 +992,6 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb,
 		nf_conntrack_get(&ct->ct_general);
 
 		nf_ct_acct_merge(ct, ctinfo, loser_ct);
-		nf_ct_add_to_dying_list(loser_ct);
 		nf_ct_put(loser_ct);
 		nf_ct_set(skb, ct, ctinfo);
 
@@ -1144,7 +1124,6 @@ nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h,
 		return ret;
 
 drop:
-	nf_ct_add_to_dying_list(loser_ct);
 	NF_CT_STAT_INC(net, drop);
 	NF_CT_STAT_INC(net, insert_failed);
 	return NF_DROP;
@@ -1211,10 +1190,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	 * user context, else we insert an already 'dead' hash, blocking
 	 * further use of that particular connection -JM.
 	 */
-	nf_ct_del_from_dying_or_unconfirmed_list(ct);
+	nf_ct_del_from_unconfirmed_list(ct);
 
 	if (unlikely(nf_ct_is_dying(ct))) {
-		nf_ct_add_to_dying_list(ct);
 		NF_CT_STAT_INC(net, insert_failed);
 		goto dying;
 	}
@@ -1238,7 +1216,6 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 			goto out;
 		if (chainlen++ > max_chainlen) {
 chaintoolong:
-			nf_ct_add_to_dying_list(ct);
 			NF_CT_STAT_INC(net, chaintoolong);
 			NF_CT_STAT_INC(net, insert_failed);
 			ret = NF_DROP;
@@ -2752,7 +2729,6 @@ void nf_conntrack_init_end(void)
  * We need to use special "null" values, not used in hash table
  */
 #define UNCONFIRMED_NULLS_VAL	((1<<30)+0)
-#define DYING_NULLS_VAL		((1<<30)+1)
 
 int nf_conntrack_init_net(struct net *net)
 {
@@ -2773,7 +2749,6 @@ int nf_conntrack_init_net(struct net *net)
 
 		spin_lock_init(&pcpu->lock);
 		INIT_HLIST_NULLS_HEAD(&pcpu->unconfirmed, UNCONFIRMED_NULLS_VAL);
-		INIT_HLIST_NULLS_HEAD(&pcpu->dying, DYING_NULLS_VAL);
 	}
 
 	net->ct.stat = alloc_percpu(struct ip_conntrack_stat);
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index ce3ebd420585..a6ae8fd0c00e 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -95,7 +95,6 @@ static enum retry_state ecache_work_evict_list(struct nf_conntrack_net *cnet)
 	hlist_nulls_for_each_entry_safe(h, n, &evicted_list, hnnode) {
 		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
 
-		hlist_nulls_add_fake(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode);
 		nf_ct_put(ct);
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 44cad50ef67f..8b57855154f5 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -62,7 +62,6 @@ struct ctnetlink_list_dump_ctx {
 	struct nf_conn *last;
 	unsigned int cpu;
 	bool done;
-	bool retrans_done;
 };
 
 static int ctnetlink_dump_tuples_proto(struct sk_buff *skb,
@@ -1751,13 +1750,12 @@ static int ctnetlink_dump_one_entry(struct sk_buff *skb,
 }
 
 static int
-ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying)
+ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct ctnetlink_list_dump_ctx *ctx = (void *)cb->ctx;
 	struct nf_conn *ct, *last;
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
-	struct hlist_nulls_head *list;
 	struct net *net = sock_net(skb->sk);
 	int res, cpu;
 
@@ -1774,12 +1772,11 @@ ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying
 
 		pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
 		spin_lock_bh(&pcpu->lock);
-		list = dying ? &pcpu->dying : &pcpu->unconfirmed;
 restart:
-		hlist_nulls_for_each_entry(h, n, list, hnnode) {
+		hlist_nulls_for_each_entry(h, n, &pcpu->unconfirmed, hnnode) {
 			ct = nf_ct_tuplehash_to_ctrack(h);
 
-			res = ctnetlink_dump_one_entry(skb, cb, ct, dying);
+			res = ctnetlink_dump_one_entry(skb, cb, ct, false);
 			if (res < 0) {
 				ctx->cpu = cpu;
 				spin_unlock_bh(&pcpu->lock);
@@ -1810,9 +1807,6 @@ ctnetlink_dump_dying(struct sk_buff *skb, struct netlink_callback *cb)
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 
-	if (ctx->retrans_done)
-		return ctnetlink_dump_list(skb, cb, true);
-
 	ctx->last = NULL;
 	ecache_net = nf_conn_pernet_ecache(net);
 	spin_lock_bh(&ecache_net->dying_lock);
@@ -1836,11 +1830,10 @@ ctnetlink_dump_dying(struct sk_buff *skb, struct netlink_callback *cb)
 		last = NULL;
 	}
 
-	ctx->retrans_done = true;
 	spin_unlock_bh(&ecache_net->dying_lock);
 	nf_ct_put(last);
 
-	return ctnetlink_dump_list(skb, cb, true);
+	return skb->len;
 }
 
 static int ctnetlink_get_ct_dying(struct sk_buff *skb,
@@ -1858,12 +1851,6 @@ static int ctnetlink_get_ct_dying(struct sk_buff *skb,
 	return -EOPNOTSUPP;
 }
 
-static int
-ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb)
-{
-	return ctnetlink_dump_list(skb, cb, false);
-}
-
 static int ctnetlink_get_ct_unconfirmed(struct sk_buff *skb,
 					const struct nfnl_info *info,
 					const struct nlattr * const cda[])
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list
  2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
                   ` (6 preceding siblings ...)
  2022-02-24 16:44 ` [PATCH nf-next v2 7/7] netfilter: conntrack: remove the percpu dying list Florian Westphal
@ 2022-03-02 23:26 ` Pablo Neira Ayuso
  7 siblings, 0 replies; 9+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-02 23:26 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Thu, Feb 24, 2022 at 05:44:39PM +0100, Florian Westphal wrote:
> v2: fix EXPORT_SYMBOL_GPL related build failure in patch 6.
> No other changes.
> 
> This is part 1 of a series that aims to remove both the unconfirmed
> and dying lists.
> 
> This moves the dying list into the ecache infrastructure.
> Entries are placed on this list only if the delivery of the destroy
> event has failed (which implies that at least one userspace listener
> did request redelivery).
> 
> The percpu dying list is removed in the last patch as it has no
> functionality anymore.  This avoids the extra spinlock for conntrack
> removal.

Applied, thanks

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-03-03  0:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-24 16:44 [PATCH nf-next v2 0/7] netfilter: remove pcpu dying list Florian Westphal
2022-02-24 16:44 ` [PATCH nf-next v2 1/7] nfnetlink: handle already-released nl socket Florian Westphal
2022-02-24 16:44 ` [PATCH nf-next v2 2/7] netfilter: ctnetlink: make ecache event cb global again Florian Westphal
2022-02-24 16:44 ` [PATCH nf-next v2 3/7] netfilter: ecache: move to separate structure Florian Westphal
2022-02-24 16:44 ` [PATCH nf-next v2 4/7] netfilter: ecache: use dedicated list for event redelivery Florian Westphal
2022-02-24 16:44 ` [PATCH nf-next v2 5/7] netfilter: conntrack: split inner loop of list dumping to own function Florian Westphal
2022-02-24 16:44 ` [PATCH nf-next v2 6/7] netfilter: conntrack: include ecache dying list in dumps Florian Westphal
2022-02-24 16:44 ` [PATCH nf-next v2 7/7] netfilter: conntrack: remove the percpu dying list Florian Westphal
2022-03-02 23:26 ` [PATCH nf-next v2 0/7] netfilter: remove pcpu " Pablo Neira Ayuso

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.