* [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF
@ 2023-07-12 23:43 Daniel Xu
  2023-07-12 23:43 ` [PATCH bpf-next v4 1/6] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Daniel Xu @ 2023-07-12 23:43 UTC (permalink / raw)
  To: linux-kernel, bpf, coreteam, netfilter-devel, netdev,
	linux-kselftest, alexei.starovoitov, fw, daniel
  Cc: dsahern

=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:

1. Enforce policy on first fragment and accept all subsequent fragments.
   This works but may let in certain attacks or allow data exfiltration.

2. Enforce policy on first fragment and drop all subsequent fragments.
   This does not really work b/c some protocols may rely on
   fragmentation. For example, DNS may rely on oversized UDP packets for
   large responses.

So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.
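
To make the 5-tuple problem concrete, here is a minimal userspace sketch
(not part of this series). It reads the flags/fragment-offset field of an
IPv4 header and reports whether the L4 header, and therefore the ports,
can be present. The helper names and sample header bytes are made up for
illustration; only the field layout (bytes 6-7, a "more fragments" bit
plus a 13-bit offset in 8-byte units) comes from RFC 791.

```c
#include <stdbool.h>
#include <stdint.h>

#define IP_MF_FLAG     0x2000u  /* "more fragments" bit */
#define IP_OFFSET_MASK 0x1fffu  /* fragment offset, in 8-byte units */

/* Illustrative first 8 header bytes: first fragment, later fragment, DF-only. */
static const uint8_t first_frag[8] = { 0x45, 0x00, 0x00, 0x2c, 0x00, 0x01, 0x20, 0x00 };
static const uint8_t later_frag[8] = { 0x45, 0x00, 0x00, 0x2c, 0x00, 0x01, 0x20, 0x03 };
static const uint8_t whole_pkt[8]  = { 0x45, 0x00, 0x00, 0x2c, 0x00, 0x01, 0x40, 0x00 };

/* Read the 16-bit flags/fragment-offset field (bytes 6-7, network order). */
static uint16_t ipv4_frag_field(const uint8_t *iph)
{
	return (uint16_t)((iph[6] << 8) | iph[7]);
}

/* A packet is a fragment if MF is set or the offset is non-zero. */
static bool ipv4_is_fragment(const uint8_t *iph)
{
	return (ipv4_frag_field(iph) & (IP_MF_FLAG | IP_OFFSET_MASK)) != 0;
}

/* Only the fragment at offset 0 carries the L4 header (and the ports). */
static bool ipv4_has_l4_header(const uint8_t *iph)
{
	return (ipv4_frag_field(iph) & IP_OFFSET_MASK) == 0;
}

static unsigned int ipv4_frag_offset_bytes(const uint8_t *iph)
{
	return (ipv4_frag_field(iph) & IP_OFFSET_MASK) * 8u;
}
```

A stateless filter can match ports only when ipv4_has_l4_header() is
true; for every later fragment it has to guess, which is exactly the
accept-all vs drop-all choice above.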

=== BPF related bits ===

Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.

The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
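
As a rough mental model of the ordering (a userspace toy, not kernel
code), the sketch below runs two hooks in ascending priority. The defrag
stand-in holds back every packet that still has fragments pending, so the
prog stand-in only ever runs on the completed datagram. All names are
hypothetical, and the model assumes in-order arrival with the last
fragment completing the datagram.

```c
#include <stdbool.h>
#include <stddef.h>

enum verdict { V_ACCEPT, V_STOLEN };

struct pkt {
	bool more_frags;   /* more fragments follow this one */
	bool reassembled;  /* set by the defrag hook on the completed packet */
};

struct hook {
	int priority;                         /* lower runs first */
	enum verdict (*fn)(struct pkt *pkt);
};

static int prog_calls;     /* how often the "BPF prog" hook ran */
static int prog_saw_frag;  /* how often it saw an unreassembled packet */

/* Stand-in for nf_defrag: hold fragments, pass on only the whole packet. */
static enum verdict defrag_hook(struct pkt *pkt)
{
	if (pkt->more_frags)
		return V_STOLEN;   /* queued until the datagram completes */
	pkt->reassembled = true;
	return V_ACCEPT;
}

static enum verdict prog_hook(struct pkt *pkt)
{
	prog_calls++;
	if (!pkt->reassembled)
		prog_saw_frag++;
	return V_ACCEPT;
}

/* Hooks sorted by ascending priority; -400 is NF_IP_PRI_CONNTRACK_DEFRAG. */
static const struct hook chain[] = {
	{ .priority = -400, .fn = defrag_hook },
	{ .priority = -399, .fn = prog_hook },
};

static enum verdict run_chain(struct pkt *pkt)
{
	for (size_t i = 0; i < sizeof(chain) / sizeof(chain[0]); i++)
		if (chain[i].fn(pkt) == V_STOLEN)
			return V_STOLEN;
	return V_ACCEPT;
}

/* Feed an n-fragment datagram through the chain; return prog invocations. */
static int simulate(int nfrags)
{
	prog_calls = 0;
	prog_saw_frag = 0;
	for (int i = 0; i < nfrags; i++) {
		struct pkt p = { .more_frags = i < nfrags - 1 };

		run_chain(&p);
	}
	return prog_calls;
}
```

Three fragments in, one prog invocation out: the prog needs no fragment
handling of its own.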

=== Changelog ===

Changes from v3:
* Correctly initialize `addrlen` stack var for recvmsg()

Changes from v2:

* module_put() if ->enable() fails
* Fix CI build errors

Changes from v1:

* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active


[0]: https://datatracker.ietf.org/doc/html/rfc8900


Daniel Xu (6):
  netfilter: defrag: Add glue hooks for enabling/disabling defrag
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  netfilter: bpf: Prevent defrag module unload while link active
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests

 include/linux/netfilter.h                     |  15 +
 include/uapi/linux/bpf.h                      |   5 +
 net/ipv4/netfilter/nf_defrag_ipv4.c           |  17 +-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c     |  11 +
 net/netfilter/core.c                          |   6 +
 net/netfilter/nf_bpf_link.c                   | 150 +++++++++-
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 ++++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 ++++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 283 ++++++++++++++++++
 .../selftests/bpf/progs/ip_check_defrag.c     | 104 +++++++
 14 files changed, 754 insertions(+), 22 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

-- 
2.41.0



* [PATCH bpf-next v4 1/6] netfilter: defrag: Add glue hooks for enabling/disabling defrag
  2023-07-12 23:43 [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
@ 2023-07-12 23:43 ` Daniel Xu
  2023-07-12 23:43 ` [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Xu @ 2023-07-12 23:43 UTC (permalink / raw)
  To: fw, davem, pabeni, pablo, dsahern, edumazet, kuba, kadlec,
	alexei.starovoitov, daniel
  Cc: netfilter-devel, coreteam, linux-kernel, netdev, bpf

We want to be able to enable/disable IP packet defrag from core
bpf/netfilter code. In other words, execute code from core that could
possibly be built as a module.

To help avoid symbol resolution errors, use glue hooks that the modules
will register callbacks with during module init.
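
The glue hook pattern can be sketched in plain userspace C as below. This
is only an analogy: the "module" and "core" halves live in one file, a
plain pointer stands in for the RCU-protected one, and every name is
hypothetical. The point it shows is that the core side never references
the module's functions by symbol, only through a pointer that stays NULL
until the module registers.

```c
#include <errno.h>
#include <stddef.h>

struct defrag_hook {
	int (*enable)(void);
	void (*disable)(void);
};

/* Lives in "core": NULL until the module registers its callbacks. */
static const struct defrag_hook *defrag_hook_ptr;

/* --- "module" side --- */
static int mod_users;

static int mod_enable(void)
{
	mod_users++;
	return 0;
}

static void mod_disable(void)
{
	mod_users--;
}

static const struct defrag_hook mod_hook = {
	.enable = mod_enable,
	.disable = mod_disable,
};

/* Module init publishes the callbacks; module exit withdraws them. */
static int mod_init(void)
{
	defrag_hook_ptr = &mod_hook;
	return 0;
}

static void mod_exit(void)
{
	defrag_hook_ptr = NULL;
}

/* --- "core" side: never names mod_enable() directly --- */
static int core_enable_defrag(void)
{
	const struct defrag_hook *hook = defrag_hook_ptr;

	if (!hook)
		return -ENOENT;   /* module not loaded */
	return hook->enable();
}
```

Because the core only holds a data pointer, it links fine whether the
defrag code is built in, built as a module, or not loaded at all.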

Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/linux/netfilter.h                 | 12 ++++++++++++
 net/ipv4/netfilter/nf_defrag_ipv4.c       | 16 +++++++++++++++-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 10 ++++++++++
 net/netfilter/core.c                      |  6 ++++++
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index d4fed4c508ca..77a637b681f2 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -481,6 +481,18 @@ struct nfnl_ct_hook {
 };
 extern const struct nfnl_ct_hook __rcu *nfnl_ct_hook;
 
+struct nf_defrag_v4_hook {
+	int (*enable)(struct net *net);
+	void (*disable)(struct net *net);
+};
+extern const struct nf_defrag_v4_hook __rcu *nf_defrag_v4_hook;
+
+struct nf_defrag_v6_hook {
+	int (*enable)(struct net *net);
+	void (*disable)(struct net *net);
+};
+extern const struct nf_defrag_v6_hook __rcu *nf_defrag_v6_hook;
+
 /*
  * nf_skb_duplicated - TEE target has sent a packet
  *
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index e61ea428ea18..1f3e0e893b7a 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -7,6 +7,7 @@
 #include <linux/ip.h>
 #include <linux/netfilter.h>
 #include <linux/module.h>
+#include <linux/rcupdate.h>
 #include <linux/skbuff.h>
 #include <net/netns/generic.h>
 #include <net/route.h>
@@ -113,17 +114,30 @@ static void __net_exit defrag4_net_exit(struct net *net)
 	}
 }
 
+static const struct nf_defrag_v4_hook defrag_hook = {
+	.enable = nf_defrag_ipv4_enable,
+	.disable = nf_defrag_ipv4_disable,
+};
+
 static struct pernet_operations defrag4_net_ops = {
 	.exit = defrag4_net_exit,
 };
 
 static int __init nf_defrag_init(void)
 {
-	return register_pernet_subsys(&defrag4_net_ops);
+	int err;
+
+	err = register_pernet_subsys(&defrag4_net_ops);
+	if (err)
+		return err;
+
+	rcu_assign_pointer(nf_defrag_v4_hook, &defrag_hook);
+	return err;
 }
 
 static void __exit nf_defrag_fini(void)
 {
+	rcu_assign_pointer(nf_defrag_v4_hook, NULL);
 	unregister_pernet_subsys(&defrag4_net_ops);
 }
 
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index cb4eb1d2c620..f7c7ee31c472 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -10,6 +10,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/icmp.h>
+#include <linux/rcupdate.h>
 #include <linux/sysctl.h>
 #include <net/ipv6_frag.h>
 
@@ -96,6 +97,11 @@ static void __net_exit defrag6_net_exit(struct net *net)
 	}
 }
 
+static const struct nf_defrag_v6_hook defrag_hook = {
+	.enable = nf_defrag_ipv6_enable,
+	.disable = nf_defrag_ipv6_disable,
+};
+
 static struct pernet_operations defrag6_net_ops = {
 	.exit = defrag6_net_exit,
 };
@@ -114,6 +120,9 @@ static int __init nf_defrag_init(void)
 		pr_err("nf_defrag_ipv6: can't register pernet ops\n");
 		goto cleanup_frag6;
 	}
+
+	rcu_assign_pointer(nf_defrag_v6_hook, &defrag_hook);
+
 	return ret;
 
 cleanup_frag6:
@@ -124,6 +133,7 @@ static int __init nf_defrag_init(void)
 
 static void __exit nf_defrag_fini(void)
 {
+	rcu_assign_pointer(nf_defrag_v6_hook, NULL);
 	unregister_pernet_subsys(&defrag6_net_ops);
 	nf_ct_frag6_cleanup();
 }
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 5f76ae86a656..34845155bb85 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -680,6 +680,12 @@ EXPORT_SYMBOL_GPL(nfnl_ct_hook);
 const struct nf_ct_hook __rcu *nf_ct_hook __read_mostly;
 EXPORT_SYMBOL_GPL(nf_ct_hook);
 
+const struct nf_defrag_v4_hook __rcu *nf_defrag_v4_hook __read_mostly;
+EXPORT_SYMBOL_GPL(nf_defrag_v4_hook);
+
+const struct nf_defrag_v6_hook __rcu *nf_defrag_v6_hook __read_mostly;
+EXPORT_SYMBOL_GPL(nf_defrag_v6_hook);
+
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 u8 nf_ctnetlink_has_listener;
 EXPORT_SYMBOL_GPL(nf_ctnetlink_has_listener);
-- 
2.41.0



* [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-12 23:43 [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
  2023-07-12 23:43 ` [PATCH bpf-next v4 1/6] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
@ 2023-07-12 23:43 ` Daniel Xu
  2023-07-13  0:43   ` Alexei Starovoitov
  2023-07-12 23:43 ` [PATCH bpf-next v4 3/6] netfilter: bpf: Prevent defrag module unload while link active Daniel Xu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Daniel Xu @ 2023-07-12 23:43 UTC (permalink / raw)
  To: andrii, ast, fw, davem, pablo, pabeni, daniel, edumazet, kuba,
	kadlec, alexei.starovoitov
  Cc: martin.lau, song, yhs, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, bpf, linux-kernel, netfilter-devel, coreteam, netdev,
	dsahern

This commit adds support for enabling IP defrag using pre-existing
netfilter defrag support. Basically all the flag does is bump a refcnt
while the link is active. Checks are also added to ensure the prog
requesting defrag support is run _after_ netfilter defrag hooks.
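
The attach-time checks boil down to the standalone sketch below. The
constant values mirror the kernel's (NF_IP_PRI_FIRST, NF_IP_PRI_LAST,
NF_IP_PRI_CONNTRACK_DEFRAG and the new flag) but are redefined under
made-up names so the snippet builds without kernel headers, and
check_link_attrs() is a hypothetical name rather than the function in
the diff.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Values mirror the kernel's; redefined so the sketch builds standalone. */
#define PRI_FIRST            INT_MIN    /* NF_IP_PRI_FIRST */
#define PRI_CONNTRACK_DEFRAG (-400)     /* NF_IP_PRI_CONNTRACK_DEFRAG */
#define PRI_LAST             INT_MAX    /* NF_IP_PRI_LAST */
#define F_IP_DEFRAG          (1U << 0)  /* BPF_F_NETFILTER_IP_DEFRAG */

static int check_link_attrs(int prio, uint32_t flags)
{
	if (flags & ~F_IP_DEFRAG)
		return -EOPNOTSUPP;  /* unknown flag bits */
	if (prio == PRI_FIRST || prio == PRI_LAST)
		return -ERANGE;      /* reserved slots */
	if ((flags & F_IP_DEFRAG) && prio <= PRI_CONNTRACK_DEFRAG)
		return -ERANGE;      /* would run before, or tie with, defrag */
	return 0;
}
```

Note the `<=`: a prog asking for defrag at exactly -400 is rejected too,
since equal priority would not guarantee it runs after the defrag hook.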

Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/uapi/linux/bpf.h       |   5 ++
 net/netfilter/nf_bpf_link.c    | 129 ++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h |   5 ++
 3 files changed, 128 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 600d0caebbd8..c820076c38db 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1180,6 +1180,11 @@ enum bpf_perf_event_type {
  */
 #define BPF_F_KPROBE_MULTI_RETURN	(1U << 0)
 
+/* link_create.netfilter.flags used in LINK_CREATE command for
+ * BPF_PROG_TYPE_NETFILTER to enable IP packet defragmentation.
+ */
+#define BPF_F_NETFILTER_IP_DEFRAG (1U << 0)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * the following extensions:
  *
diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index c36da56d756f..5b72aa246577 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/kmod.h>
 #include <linux/netfilter.h>
 
 #include <net/netfilter/nf_bpf_link.h>
@@ -23,8 +24,98 @@ struct bpf_nf_link {
 	struct nf_hook_ops hook_ops;
 	struct net *net;
 	u32 dead;
+	bool defrag;
 };
 
+static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
+{
+	const struct nf_defrag_v4_hook __maybe_unused *v4_hook;
+	const struct nf_defrag_v6_hook __maybe_unused *v6_hook;
+	int err;
+
+	switch (link->hook_ops.pf) {
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+	case NFPROTO_IPV4:
+		rcu_read_lock();
+		v4_hook = rcu_dereference(nf_defrag_v4_hook);
+		if (!v4_hook) {
+			rcu_read_unlock();
+			err = request_module("nf_defrag_ipv4");
+			if (err)
+				return err < 0 ? err : -EINVAL;
+
+			rcu_read_lock();
+			v4_hook = rcu_dereference(nf_defrag_v4_hook);
+			if (!v4_hook) {
+				WARN_ONCE(1, "nf_defrag_ipv4 bad registration");
+				err = -ENOENT;
+				goto out_v4;
+			}
+		}
+
+		err = v4_hook->enable(link->net);
+out_v4:
+		rcu_read_unlock();
+		return err;
+#endif
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
+	case NFPROTO_IPV6:
+		rcu_read_lock();
+		v6_hook = rcu_dereference(nf_defrag_v6_hook);
+		if (!v6_hook) {
+			rcu_read_unlock();
+			err = request_module("nf_defrag_ipv6");
+			if (err)
+				return err < 0 ? err : -EINVAL;
+
+			rcu_read_lock();
+			v6_hook = rcu_dereference(nf_defrag_v6_hook);
+			if (!v6_hook) {
+				WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
+				err = -ENOENT;
+				goto out_v6;
+			}
+		}
+
+		err = v6_hook->enable(link->net);
+out_v6:
+		rcu_read_unlock();
+		return err;
+#endif
+	default:
+		return -EAFNOSUPPORT;
+	}
+}
+
+static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
+{
+	const struct nf_defrag_v4_hook __maybe_unused *v4_hook;
+	const struct nf_defrag_v6_hook __maybe_unused *v6_hook;
+
+	switch (link->hook_ops.pf) {
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+	case NFPROTO_IPV4:
+		rcu_read_lock();
+		v4_hook = rcu_dereference(nf_defrag_v4_hook);
+		if (v4_hook)
+			v4_hook->disable(link->net);
+		rcu_read_unlock();
+
+		break;
+#endif
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
+	case NFPROTO_IPV6:
+		rcu_read_lock();
+		v6_hook = rcu_dereference(nf_defrag_v6_hook);
+		if (v6_hook)
+			v6_hook->disable(link->net);
+		rcu_read_unlock();
+
+		break;
+#endif
+	}
+}
+
 static void bpf_nf_link_release(struct bpf_link *link)
 {
 	struct bpf_nf_link *nf_link = container_of(link, struct bpf_nf_link, link);
@@ -37,6 +128,9 @@ static void bpf_nf_link_release(struct bpf_link *link)
 	 */
 	if (!cmpxchg(&nf_link->dead, 0, 1))
 		nf_unregister_net_hook(nf_link->net, &nf_link->hook_ops);
+
+	if (nf_link->defrag)
+		bpf_nf_disable_defrag(nf_link);
 }
 
 static void bpf_nf_link_dealloc(struct bpf_link *link)
@@ -92,6 +186,8 @@ static const struct bpf_link_ops bpf_nf_link_lops = {
 
 static int bpf_nf_check_pf_and_hooks(const union bpf_attr *attr)
 {
+	int prio;
+
 	switch (attr->link_create.netfilter.pf) {
 	case NFPROTO_IPV4:
 	case NFPROTO_IPV6:
@@ -102,19 +198,18 @@ static int bpf_nf_check_pf_and_hooks(const union bpf_attr *attr)
 		return -EAFNOSUPPORT;
 	}
 
-	if (attr->link_create.netfilter.flags)
+	if (attr->link_create.netfilter.flags & ~BPF_F_NETFILTER_IP_DEFRAG)
 		return -EOPNOTSUPP;
 
-	/* make sure conntrack confirm is always last.
-	 *
-	 * In the future, if userspace can e.g. request defrag, then
-	 * "defrag_requested && prio before NF_IP_PRI_CONNTRACK_DEFRAG"
-	 * should fail.
-	 */
-	switch (attr->link_create.netfilter.priority) {
-	case NF_IP_PRI_FIRST: return -ERANGE; /* sabotage_in and other warts */
-	case NF_IP_PRI_LAST: return -ERANGE; /* e.g. conntrack confirm */
-	}
+	/* make sure conntrack confirm is always last */
+	prio = attr->link_create.netfilter.priority;
+	if (prio == NF_IP_PRI_FIRST)
+		return -ERANGE;  /* sabotage_in and other warts */
+	else if (prio == NF_IP_PRI_LAST)
+		return -ERANGE;  /* e.g. conntrack confirm */
+	else if ((attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) &&
+		 prio <= NF_IP_PRI_CONNTRACK_DEFRAG)
+		return -ERANGE;  /* cannot use defrag if prog runs before nf_defrag */
 
 	return 0;
 }
@@ -156,6 +251,18 @@ int bpf_nf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
 		return err;
 	}
 
+	if (attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) {
+		err = bpf_nf_enable_defrag(link);
+		if (err) {
+			bpf_link_cleanup(&link_primer);
+			return err;
+		}
+		/* only mark defrag enabled if enabling succeeds so cleanup path
+		 * doesn't disable without a corresponding enable
+		 */
+		link->defrag = true;
+	}
+
 	err = nf_register_net_hook(net, &link->hook_ops);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 600d0caebbd8..c820076c38db 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1180,6 +1180,11 @@ enum bpf_perf_event_type {
  */
 #define BPF_F_KPROBE_MULTI_RETURN	(1U << 0)
 
+/* link_create.netfilter.flags used in LINK_CREATE command for
+ * BPF_PROG_TYPE_NETFILTER to enable IP packet defragmentation.
+ */
+#define BPF_F_NETFILTER_IP_DEFRAG (1U << 0)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * the following extensions:
  *
-- 
2.41.0



* [PATCH bpf-next v4 3/6] netfilter: bpf: Prevent defrag module unload while link active
  2023-07-12 23:43 [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
  2023-07-12 23:43 ` [PATCH bpf-next v4 1/6] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
  2023-07-12 23:43 ` [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
@ 2023-07-12 23:43 ` Daniel Xu
  2023-07-12 23:43 ` [PATCH bpf-next v4 4/6] bpf: selftests: Support not connecting client socket Daniel Xu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Xu @ 2023-07-12 23:43 UTC (permalink / raw)
  To: fw, davem, pabeni, pablo, dsahern, edumazet, kuba, kadlec,
	alexei.starovoitov, daniel
  Cc: netfilter-devel, coreteam, linux-kernel, netdev, bpf

While in practice we could handle the module being unloaded while a
netfilter link (that requested defrag) was active, it's a better user
experience to prevent the defrag module from going away. It would
violate user expectations if fragmented packets started showing up if
some other part of the system tried to unload defrag module.
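
The pin/unpin discipline is sketched below with a toy refcount. The
fake_* helpers are hypothetical stand-ins for try_module_get() and
module_put(), and enable_err simulates the return value of the enable
callback. It shows the two properties this patch cares about: the module
cannot be unloaded while a link holds it, and a failed enable drops the
pin again.

```c
#include <errno.h>
#include <stdbool.h>

struct fake_module {
	int refcnt;
	bool going;   /* unload already in progress */
};

static struct fake_module defrag_mod;

static bool fake_try_module_get(struct fake_module *m)
{
	if (m->going)
		return false;
	m->refcnt++;
	return true;
}

static void fake_module_put(struct fake_module *m)
{
	m->refcnt--;
}

static bool fake_can_unload(const struct fake_module *m)
{
	return m->refcnt == 0;
}

/* enable_err stands in for the return value of hook->enable(). */
static int link_enable_defrag(struct fake_module *m, int enable_err)
{
	if (!fake_try_module_get(m))
		return -ENOENT;
	if (enable_err) {
		fake_module_put(m);   /* don't leak the pin on failure */
		return enable_err;
	}
	return 0;
}

/* Called on link release: drop the pin taken at enable time. */
static void link_disable_defrag(struct fake_module *m)
{
	fake_module_put(m);
}
```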

Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/linux/netfilter.h                 |  3 +++
 net/ipv4/netfilter/nf_defrag_ipv4.c       |  1 +
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |  1 +
 net/netfilter/nf_bpf_link.c               | 25 +++++++++++++++++++++--
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 77a637b681f2..a160dc1e23bf 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -11,6 +11,7 @@
 #include <linux/wait.h>
 #include <linux/list.h>
 #include <linux/static_key.h>
+#include <linux/module.h>
 #include <linux/netfilter_defs.h>
 #include <linux/netdevice.h>
 #include <linux/sockptr.h>
@@ -482,12 +483,14 @@ struct nfnl_ct_hook {
 extern const struct nfnl_ct_hook __rcu *nfnl_ct_hook;
 
 struct nf_defrag_v4_hook {
+	struct module *owner;
 	int (*enable)(struct net *net);
 	void (*disable)(struct net *net);
 };
 extern const struct nf_defrag_v4_hook __rcu *nf_defrag_v4_hook;
 
 struct nf_defrag_v6_hook {
+	struct module *owner;
 	int (*enable)(struct net *net);
 	void (*disable)(struct net *net);
 };
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 1f3e0e893b7a..fb133bf3131d 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -115,6 +115,7 @@ static void __net_exit defrag4_net_exit(struct net *net)
 }
 
 static const struct nf_defrag_v4_hook defrag_hook = {
+	.owner = THIS_MODULE,
 	.enable = nf_defrag_ipv4_enable,
 	.disable = nf_defrag_ipv4_disable,
 };
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index f7c7ee31c472..29d31721c9c0 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -98,6 +98,7 @@ static void __net_exit defrag6_net_exit(struct net *net)
 }
 
 static const struct nf_defrag_v6_hook defrag_hook = {
+	.owner = THIS_MODULE,
 	.enable = nf_defrag_ipv6_enable,
 	.disable = nf_defrag_ipv6_disable,
 };
diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index 5b72aa246577..77ffbf26ba3d 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -2,6 +2,7 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/kmod.h>
+#include <linux/module.h>
 #include <linux/netfilter.h>
 
 #include <net/netfilter/nf_bpf_link.h>
@@ -53,7 +54,15 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
 			}
 		}
 
+		/* Prevent defrag module from going away while in use */
+		if (!try_module_get(v4_hook->owner)) {
+			err = -ENOENT;
+			goto out_v4;
+		}
+
 		err = v4_hook->enable(link->net);
+		if (err)
+			module_put(v4_hook->owner);
 out_v4:
 		rcu_read_unlock();
 		return err;
@@ -77,7 +86,15 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
 			}
 		}
 
+		/* Prevent defrag module from going away while in use */
+		if (!try_module_get(v6_hook->owner)) {
+			err = -ENOENT;
+			goto out_v6;
+		}
+
 		err = v6_hook->enable(link->net);
+		if (err)
+			module_put(v6_hook->owner);
 out_v6:
 		rcu_read_unlock();
 		return err;
@@ -97,8 +114,10 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
 	case NFPROTO_IPV4:
 		rcu_read_lock();
 		v4_hook = rcu_dereference(nf_defrag_v4_hook);
-		if (v4_hook)
+		if (v4_hook) {
 			v4_hook->disable(link->net);
+			module_put(v4_hook->owner);
+		}
 		rcu_read_unlock();
 
 		break;
@@ -107,8 +126,10 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
 	case NFPROTO_IPV6:
 		rcu_read_lock();
 		v6_hook = rcu_dereference(nf_defrag_v6_hook);
-		if (v6_hook)
+		if (v6_hook) {
 			v6_hook->disable(link->net);
+			module_put(v6_hook->owner);
+		}
 		rcu_read_unlock();
 
 		break;
-- 
2.41.0



* [PATCH bpf-next v4 4/6] bpf: selftests: Support not connecting client socket
  2023-07-12 23:43 [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (2 preceding siblings ...)
  2023-07-12 23:43 ` [PATCH bpf-next v4 3/6] netfilter: bpf: Prevent defrag module unload while link active Daniel Xu
@ 2023-07-12 23:43 ` Daniel Xu
  2023-07-12 23:44 ` [PATCH bpf-next v4 5/6] bpf: selftests: Support custom type and proto for client sockets Daniel Xu
  2023-07-12 23:44 ` [PATCH bpf-next v4 6/6] bpf: selftests: Add defrag selftests Daniel Xu
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Xu @ 2023-07-12 23:43 UTC (permalink / raw)
  To: andrii, daniel, ast, shuah, alexei.starovoitov, fw
  Cc: martin.lau, song, yhs, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, bpf, linux-kselftest, linux-kernel,
	netfilter-devel, dsahern

For connectionless protocols or raw sockets we do not want to actually
connect() to the server.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/network_helpers.c | 5 +++--
 tools/testing/selftests/bpf/network_helpers.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index a105c0cd008a..d5c78c08903b 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -301,8 +301,9 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 		       strlen(opts->cc) + 1))
 		goto error_close;
 
-	if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
-		goto error_close;
+	if (!opts->noconnect)
+		if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
+			goto error_close;
 
 	return fd;
 
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 694185644da6..87894dc984dd 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -21,6 +21,7 @@ struct network_helper_opts {
 	const char *cc;
 	int timeout_ms;
 	bool must_fail;
+	bool noconnect;
 };
 
 /* ipv4 test vector */
-- 
2.41.0



* [PATCH bpf-next v4 5/6] bpf: selftests: Support custom type and proto for client sockets
  2023-07-12 23:43 [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (3 preceding siblings ...)
  2023-07-12 23:43 ` [PATCH bpf-next v4 4/6] bpf: selftests: Support not connecting client socket Daniel Xu
@ 2023-07-12 23:44 ` Daniel Xu
  2023-07-12 23:44 ` [PATCH bpf-next v4 6/6] bpf: selftests: Add defrag selftests Daniel Xu
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Xu @ 2023-07-12 23:44 UTC (permalink / raw)
  To: andrii, daniel, ast, shuah, alexei.starovoitov, fw
  Cc: martin.lau, song, yhs, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, bpf, linux-kselftest, linux-kernel,
	netfilter-devel, dsahern

Extend connect_to_fd_opts() to take optional type and protocol
parameters for the client socket. These parameters are useful when
opening a raw socket to send IP fragments.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/network_helpers.c | 21 +++++++++++++------
 tools/testing/selftests/bpf/network_helpers.h |  2 ++
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index d5c78c08903b..910d5d0470e6 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -270,14 +270,23 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 		opts = &default_opts;
 
 	optlen = sizeof(type);
-	if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
-		log_err("getsockopt(SOL_TYPE)");
-		return -1;
+
+	if (opts->type) {
+		type = opts->type;
+	} else {
+		if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
+			log_err("getsockopt(SOL_TYPE)");
+			return -1;
+		}
 	}
 
-	if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
-		log_err("getsockopt(SOL_PROTOCOL)");
-		return -1;
+	if (opts->proto) {
+		protocol = opts->proto;
+	} else {
+		if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
+			log_err("getsockopt(SOL_PROTOCOL)");
+			return -1;
+		}
 	}
 
 	addrlen = sizeof(addr);
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 87894dc984dd..5eccc67d1a99 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -22,6 +22,8 @@ struct network_helper_opts {
 	int timeout_ms;
 	bool must_fail;
 	bool noconnect;
+	int type;
+	int proto;
 };
 
 /* ipv4 test vector */
-- 
2.41.0



* [PATCH bpf-next v4 6/6] bpf: selftests: Add defrag selftests
  2023-07-12 23:43 [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (4 preceding siblings ...)
  2023-07-12 23:44 ` [PATCH bpf-next v4 5/6] bpf: selftests: Support custom type and proto for client sockets Daniel Xu
@ 2023-07-12 23:44 ` Daniel Xu
  5 siblings, 0 replies; 15+ messages in thread
From: Daniel Xu @ 2023-07-12 23:44 UTC (permalink / raw)
  To: andrii, daniel, ast, shuah, alexei.starovoitov, fw
  Cc: mykolal, martin.lau, song, yhs, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, linux-kernel, bpf, linux-kselftest,
	netfilter-devel, dsahern

These selftests test 2 major scenarios: that BPF based defragmentation
can successfully be done and that packet pointers are invalidated after
calls to the kfunc. The logic is similar for both ipv4 and ipv6.

In the first scenario, we create a UDP client and UDP echo server. The
server side is fairly straightforward: we attach the prog and simply
echo back the message.

On the client side, we send fragmented packets to the server and expect
the reassembled message back.
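
For readers who want a feel for what "reassembled" means here, a small
userspace sketch follows. It is not the kernel's algorithm and not part
of the selftest; struct frag, reassemble() and reassemble_demo() are
hypothetical. It splits the test message at 24-byte boundaries, delivers
the pieces out of order, and stitches them back together by offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct frag {
	size_t off;        /* byte offset into the original payload */
	const char *data;
	size_t len;
	bool more;         /* false only on the final fragment */
};

static const char msg[] = "THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME";

/*
 * Copy fragments into out[] by offset. Returns the payload length once
 * every byte up to the end of the final fragment is covered, else -1.
 */
static long reassemble(const struct frag *f, size_t n, char *out, size_t cap)
{
	unsigned char seen[256] = { 0 };
	size_t total = 0;
	bool have_last = false;

	if (cap > sizeof(seen))
		cap = sizeof(seen);

	for (size_t i = 0; i < n; i++) {
		if (f[i].off > cap || f[i].len > cap - f[i].off)
			return -1;          /* does not fit */
		memcpy(out + f[i].off, f[i].data, f[i].len);
		memset(seen + f[i].off, 1, f[i].len);
		if (!f[i].more) {
			have_last = true;
			total = f[i].off + f[i].len;
		}
	}
	if (!have_last)
		return -1;
	for (size_t i = 0; i < total; i++)
		if (!seen[i])
			return -1;          /* hole: a fragment is missing */
	return (long)total;
}

/* Use the first nfrags of an out-of-order set; -2 means wrong content. */
static long reassemble_demo(size_t nfrags)
{
	const size_t mlen = sizeof(msg) - 1;
	const struct frag frags[] = {
		{ .off = 48, .data = msg + 48, .len = mlen - 48, .more = false },
		{ .off = 0,  .data = msg,      .len = 24,        .more = true  },
		{ .off = 24, .data = msg + 24, .len = 24,        .more = true  },
	};
	char out[64];
	long n = reassemble(frags, nfrags, out, sizeof(out));

	if (n > 0 && memcmp(out, msg, (size_t)n) != 0)
		return -2;
	return n;
}
```

With all three pieces the demo returns the full message length; with the
middle piece withheld it reports a hole instead of handing back partial
data.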

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 ++++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 ++++
 .../bpf/prog_tests/ip_check_defrag.c          | 283 ++++++++++++++++++
 .../selftests/bpf/progs/ip_check_defrag.c     | 104 +++++++
 5 files changed, 536 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 882be03b179f..619df497fce5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -565,8 +565,8 @@ TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
 			 network_helpers.c testing_helpers.c		\
 			 btf_helpers.c flow_dissector_load.h		\
 			 cap_helpers.c test_loader.c xsk.c disasm.c	\
-			 json_writer.c unpriv_helpers.c
-
+			 json_writer.c unpriv_helpers.c 		\
+			 ip_check_defrag_frags.h
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
 		       $(OUTPUT)/xdp_synproxy				\
diff --git a/tools/testing/selftests/bpf/generate_udp_fragments.py b/tools/testing/selftests/bpf/generate_udp_fragments.py
new file mode 100755
index 000000000000..2b8a1187991c
--- /dev/null
+++ b/tools/testing/selftests/bpf/generate_udp_fragments.py
@@ -0,0 +1,90 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+This script helps generate fragmented UDP packets.
+
+While it is technically possible to dynamically generate
+fragmented packets in C, it is much harder to read and write
+said code. `scapy` is relatively industry standard and really
+easy to read / write.
+
+So we choose to write this script that generates a valid C
+header. Rerun script and commit generated file after any
+modifications.
+"""
+
+import argparse
+import os
+
+from scapy.all import *
+
+
+# These constants must stay in sync with `ip_check_defrag.c`
+VETH1_ADDR = "172.16.1.200"
+VETH0_ADDR6 = "fc00::100"
+VETH1_ADDR6 = "fc00::200"
+CLIENT_PORT = 48878
+SERVER_PORT = 48879
+MAGIC_MESSAGE = "THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME"
+
+
+def print_header(f):
+    f.write("// SPDX-License-Identifier: GPL-2.0\n")
+    f.write("/* DO NOT EDIT -- this file is generated */\n")
+    f.write("\n")
+    f.write("#ifndef _IP_CHECK_DEFRAG_FRAGS_H\n")
+    f.write("#define _IP_CHECK_DEFRAG_FRAGS_H\n")
+    f.write("\n")
+    f.write("#include <stdint.h>\n")
+    f.write("\n")
+
+
+def print_frags(f, frags, v6):
+    for idx, frag in enumerate(frags):
+        # 10 bytes per line to keep width in check
+        chunks = [frag[i : i + 10] for i in range(0, len(frag), 10)]
+        chunks_fmted = [", ".join([str(hex(b)) for b in chunk]) for chunk in chunks]
+        suffix = "6" if v6 else ""
+
+        f.write(f"static uint8_t frag{suffix}_{idx}[] = {{\n")
+        for chunk in chunks_fmted:
+            f.write(f"\t{chunk},\n")
+        f.write(f"}};\n")
+
+
+def print_trailer(f):
+    f.write("\n")
+    f.write("#endif /* _IP_CHECK_DEFRAG_FRAGS_H */\n")
+
+
+def main(f):
+    # srcip of 0 is filled in by IP_HDRINCL
+    sip = "0.0.0.0"
+    sip6 = VETH0_ADDR6
+    dip = VETH1_ADDR
+    dip6 = VETH1_ADDR6
+    sport = CLIENT_PORT
+    dport = SERVER_PORT
+    payload = MAGIC_MESSAGE.encode()
+
+    # Disable UDPv4 checksums to keep code simpler
+    pkt = IP(src=sip,dst=dip) / UDP(sport=sport,dport=dport,chksum=0) / Raw(load=payload)
+    # UDPv6 requires a checksum
+    # Also pin the ipv6 fragment header ID, otherwise it's a random value
+    pkt6 = IPv6(src=sip6,dst=dip6) / IPv6ExtHdrFragment(id=0xBEEF) / UDP(sport=sport,dport=dport) / Raw(load=payload)
+
+    frags = [f.build() for f in pkt.fragment(24)]
+    frags6 = [f.build() for f in fragment6(pkt6, 72)]
+
+    print_header(f)
+    print_frags(f, frags, False)
+    print_frags(f, frags6, True)
+    print_trailer(f)
+
+
+if __name__ == "__main__":
+    dir = os.path.dirname(os.path.realpath(__file__))
+    header = f"{dir}/ip_check_defrag_frags.h"
+    with open(header, "w") as f:
+        main(f)
diff --git a/tools/testing/selftests/bpf/ip_check_defrag_frags.h b/tools/testing/selftests/bpf/ip_check_defrag_frags.h
new file mode 100644
index 000000000000..70ab7e9fa22b
--- /dev/null
+++ b/tools/testing/selftests/bpf/ip_check_defrag_frags.h
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/* DO NOT EDIT -- this file is generated */
+
+#ifndef _IP_CHECK_DEFRAG_FRAGS_H
+#define _IP_CHECK_DEFRAG_FRAGS_H
+
+#include <stdint.h>
+
+static uint8_t frag_0[] = {
+	0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x0, 0x40, 0x11,
+	0xac, 0xe8, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+	0xbe, 0xee, 0xbe, 0xef, 0x0, 0x3a, 0x0, 0x0, 0x54, 0x48,
+	0x49, 0x53, 0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20,
+	0x4f, 0x52, 0x49, 0x47,
+};
+static uint8_t frag_1[] = {
+	0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x3, 0x40, 0x11,
+	0xac, 0xe5, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+	0x49, 0x4e, 0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41,
+	0x47, 0x45, 0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45,
+	0x20, 0x52, 0x45, 0x41,
+};
+static uint8_t frag_2[] = {
+	0x45, 0x0, 0x0, 0x1e, 0x0, 0x1, 0x0, 0x6, 0x40, 0x11,
+	0xcc, 0xf0, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+	0x53, 0x53, 0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45,
+};
+static uint8_t frag6_0[] = {
+	0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+	0x11, 0x0, 0x0, 0x1, 0x0, 0x0, 0xbe, 0xef, 0xbe, 0xee,
+	0xbe, 0xef, 0x0, 0x3a, 0xd0, 0xf8, 0x54, 0x48, 0x49, 0x53,
+	0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20, 0x4f, 0x52,
+	0x49, 0x47,
+};
+static uint8_t frag6_1[] = {
+	0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+	0x11, 0x0, 0x0, 0x19, 0x0, 0x0, 0xbe, 0xef, 0x49, 0x4e,
+	0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41, 0x47, 0x45,
+	0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45, 0x20, 0x52,
+	0x45, 0x41,
+};
+static uint8_t frag6_2[] = {
+	0x60, 0x0, 0x0, 0x0, 0x0, 0x12, 0x2c, 0x40, 0xfc, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+	0x11, 0x0, 0x0, 0x30, 0x0, 0x0, 0xbe, 0xef, 0x53, 0x53,
+	0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45,
+};
+
+#endif /* _IP_CHECK_DEFRAG_FRAGS_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
new file mode 100644
index 000000000000..57c814f5f6a7
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <net/if.h>
+#include <linux/netfilter.h>
+#include <network_helpers.h>
+#include "ip_check_defrag.skel.h"
+#include "ip_check_defrag_frags.h"
+
+/*
+ * This selftest spins up a client and an echo server, each in their own
+ * network namespace. The client will send a fragmented message to the server.
+ * The prog attached to the server will shoot down any fragments. Thus, if
+ * the server is able to correctly echo back the message to the client, we will
+ * have verified that netfilter is reassembling packets for us.
+ *
+ * Topology:
+ * =========
+ *           NS0         |         NS1
+ *                       |
+ *         client        |       server
+ *       ----------      |     ----------
+ *       |  veth0  | --------- |  veth1  |
+ *       ----------    peer    ----------
+ *                       |
+ *                       |       with bpf
+ */
+
+#define NS0		"defrag_ns0"
+#define NS1		"defrag_ns1"
+#define VETH0		"veth0"
+#define VETH1		"veth1"
+#define VETH0_ADDR	"172.16.1.100"
+#define VETH0_ADDR6	"fc00::100"
+/* The following constants must stay in sync with `generate_udp_fragments.py` */
+#define VETH1_ADDR	"172.16.1.200"
+#define VETH1_ADDR6	"fc00::200"
+#define CLIENT_PORT	48878
+#define SERVER_PORT	48879
+#define MAGIC_MESSAGE	"THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME"
+
+static int setup_topology(bool ipv6)
+{
+	bool up;
+	int i;
+
+	SYS(fail, "ip netns add " NS0);
+	SYS(fail, "ip netns add " NS1);
+	SYS(fail, "ip link add " VETH0 " netns " NS0 " type veth peer name " VETH1 " netns " NS1);
+	if (ipv6) {
+		SYS(fail, "ip -6 -net " NS0 " addr add " VETH0_ADDR6 "/64 dev " VETH0 " nodad");
+		SYS(fail, "ip -6 -net " NS1 " addr add " VETH1_ADDR6 "/64 dev " VETH1 " nodad");
+	} else {
+		SYS(fail, "ip -net " NS0 " addr add " VETH0_ADDR "/24 dev " VETH0);
+		SYS(fail, "ip -net " NS1 " addr add " VETH1_ADDR "/24 dev " VETH1);
+	}
+	SYS(fail, "ip -net " NS0 " link set dev " VETH0 " up");
+	SYS(fail, "ip -net " NS1 " link set dev " VETH1 " up");
+
+	/* Wait for up to 5s for links to come up */
+	for (i = 0; i < 5; ++i) {
+		if (ipv6)
+			up = !system("ip netns exec " NS0 " ping -6 -c 1 -W 1 " VETH1_ADDR6 " &>/dev/null");
+		else
+			up = !system("ip netns exec " NS0 " ping -c 1 -W 1 " VETH1_ADDR " &>/dev/null");
+
+		if (up)
+			break;
+	}
+
+	return 0;
+fail:
+	return -1;
+}
+
+static void cleanup_topology(void)
+{
+	SYS_NOFAIL("test -f /var/run/netns/" NS0 " && ip netns delete " NS0);
+	SYS_NOFAIL("test -f /var/run/netns/" NS1 " && ip netns delete " NS1);
+}
+
+static int attach(struct ip_check_defrag *skel, bool ipv6)
+{
+	LIBBPF_OPTS(bpf_netfilter_opts, opts,
+		    .pf = ipv6 ? NFPROTO_IPV6 : NFPROTO_IPV4,
+		    .priority = 42,
+		    .flags = BPF_F_NETFILTER_IP_DEFRAG);
+	struct nstoken *nstoken;
+	int err = -1;
+
+	nstoken = open_netns(NS1);
+
+	skel->links.defrag = bpf_program__attach_netfilter(skel->progs.defrag, &opts);
+	if (!ASSERT_OK_PTR(skel->links.defrag, "program attach"))
+		goto out;
+
+	err = 0;
+out:
+	close_netns(nstoken);
+	return err;
+}
+
+static int send_frags(int client)
+{
+	struct sockaddr_storage saddr;
+	struct sockaddr *saddr_p;
+	socklen_t saddr_len;
+	int err;
+
+	saddr_p = (struct sockaddr *)&saddr;
+	err = make_sockaddr(AF_INET, VETH1_ADDR, SERVER_PORT, &saddr, &saddr_len);
+	if (!ASSERT_OK(err, "make_sockaddr"))
+		return -1;
+
+	err = sendto(client, frag_0, sizeof(frag_0), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag_0"))
+		return -1;
+
+	err = sendto(client, frag_1, sizeof(frag_1), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag_1"))
+		return -1;
+
+	err = sendto(client, frag_2, sizeof(frag_2), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag_2"))
+		return -1;
+
+	return 0;
+}
+
+static int send_frags6(int client)
+{
+	struct sockaddr_storage saddr;
+	struct sockaddr *saddr_p;
+	socklen_t saddr_len;
+	int err;
+
+	saddr_p = (struct sockaddr *)&saddr;
+	/* Raw IPv6 sockets treat sin6_port as a protocol number, so leave it 0 */
+	err = make_sockaddr(AF_INET6, VETH1_ADDR6, 0, &saddr, &saddr_len);
+	if (!ASSERT_OK(err, "make_sockaddr"))
+		return -1;
+
+	err = sendto(client, frag6_0, sizeof(frag6_0), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag6_0"))
+		return -1;
+
+	err = sendto(client, frag6_1, sizeof(frag6_1), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag6_1"))
+		return -1;
+
+	err = sendto(client, frag6_2, sizeof(frag6_2), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag6_2"))
+		return -1;
+
+	return 0;
+}
+
+void test_bpf_ip_check_defrag_ok(bool ipv6)
+{
+	struct network_helper_opts rx_opts = {
+		.timeout_ms = 1000,
+		.noconnect = true,
+	};
+	struct network_helper_opts tx_ops = {
+		.timeout_ms = 1000,
+		.type = SOCK_RAW,
+		.proto = IPPROTO_RAW,
+		.noconnect = true,
+	};
+	struct sockaddr_storage caddr;
+	struct ip_check_defrag *skel;
+	struct nstoken *nstoken;
+	int client_tx_fd = -1;
+	int client_rx_fd = -1;
+	socklen_t caddr_len;
+	int srv_fd = -1;
+	char buf[1024];
+	int len, err;
+
+	skel = ip_check_defrag__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	if (!ASSERT_OK(setup_topology(ipv6), "setup_topology"))
+		goto out;
+
+	if (!ASSERT_OK(attach(skel, ipv6), "attach"))
+		goto out;
+
+	/* Start server in ns1 */
+	nstoken = open_netns(NS1);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns1"))
+		goto out;
+	srv_fd = start_server(ipv6 ? AF_INET6 : AF_INET, SOCK_DGRAM, NULL, SERVER_PORT, 0);
+	close_netns(nstoken);
+	if (!ASSERT_GE(srv_fd, 0, "start_server"))
+		goto out;
+
+	/* Open tx raw socket in ns0 */
+	nstoken = open_netns(NS0);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+		goto out;
+	client_tx_fd = connect_to_fd_opts(srv_fd, &tx_ops);
+	close_netns(nstoken);
+	if (!ASSERT_GE(client_tx_fd, 0, "connect_to_fd_opts"))
+		goto out;
+
+	/* Open rx socket in ns0 */
+	nstoken = open_netns(NS0);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+		goto out;
+	client_rx_fd = connect_to_fd_opts(srv_fd, &rx_opts);
+	close_netns(nstoken);
+	if (!ASSERT_GE(client_rx_fd, 0, "connect_to_fd_opts"))
+		goto out;
+
+	/* Bind rx socket to a predetermined port */
+	memset(&caddr, 0, sizeof(caddr));
+	nstoken = open_netns(NS0);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+		goto out;
+	if (ipv6) {
+		struct sockaddr_in6 *c = (struct sockaddr_in6 *)&caddr;
+
+		c->sin6_family = AF_INET6;
+		inet_pton(AF_INET6, VETH0_ADDR6, &c->sin6_addr);
+		c->sin6_port = htons(CLIENT_PORT);
+		err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c));
+	} else {
+		struct sockaddr_in *c = (struct sockaddr_in *)&caddr;
+
+		c->sin_family = AF_INET;
+		inet_pton(AF_INET, VETH0_ADDR, &c->sin_addr);
+		c->sin_port = htons(CLIENT_PORT);
+		err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c));
+	}
+	close_netns(nstoken);
+	if (!ASSERT_OK(err, "bind"))
+		goto out;
+
+	/* Send message in fragments */
+	if (ipv6) {
+		if (!ASSERT_OK(send_frags6(client_tx_fd), "send_frags6"))
+			goto out;
+	} else {
+		if (!ASSERT_OK(send_frags(client_tx_fd), "send_frags"))
+			goto out;
+	}
+
+	if (!ASSERT_EQ(skel->bss->shootdowns, 0, "shootdowns"))
+		goto out;
+
+	/* Receive reassembled msg on server and echo back to client */
+	caddr_len = sizeof(caddr);
+	len = recvfrom(srv_fd, buf, sizeof(buf), 0, (struct sockaddr *)&caddr, &caddr_len);
+	if (!ASSERT_GE(len, 0, "server recvfrom"))
+		goto out;
+	len = sendto(srv_fd, buf, len, 0, (struct sockaddr *)&caddr, caddr_len);
+	if (!ASSERT_GE(len, 0, "server sendto"))
+		goto out;
+
+	/* Expect reassembled message to be echoed back */
+	len = recvfrom(client_rx_fd, buf, sizeof(buf), 0, NULL, NULL);
+	if (!ASSERT_EQ(len, sizeof(MAGIC_MESSAGE) - 1, "client short read"))
+		goto out;
+
+out:
+	if (client_rx_fd != -1)
+		close(client_rx_fd);
+	if (client_tx_fd != -1)
+		close(client_tx_fd);
+	if (srv_fd != -1)
+		close(srv_fd);
+	cleanup_topology();
+	ip_check_defrag__destroy(skel);
+}
+
+void test_bpf_ip_check_defrag(void)
+{
+	if (test__start_subtest("v4"))
+		test_bpf_ip_check_defrag_ok(false);
+	if (test__start_subtest("v6"))
+		test_bpf_ip_check_defrag_ok(true);
+}
diff --git a/tools/testing/selftests/bpf/progs/ip_check_defrag.c b/tools/testing/selftests/bpf/progs/ip_check_defrag.c
new file mode 100644
index 000000000000..4259c6d59968
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/ip_check_defrag.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_tracing_net.h"
+
+#define NF_DROP			0
+#define NF_ACCEPT		1
+#define ETH_P_IP		0x0800
+#define ETH_P_IPV6		0x86DD
+#define IP_MF			0x2000
+#define IP_OFFSET		0x1FFF
+#define NEXTHDR_FRAGMENT	44
+
+extern int bpf_dynptr_from_skb(struct sk_buff *skb, __u64 flags,
+			       struct bpf_dynptr *ptr__uninit) __ksym;
+extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, uint32_t offset,
+			      void *buffer, uint32_t buffer__sz) __ksym;
+
+volatile int shootdowns = 0;
+
+static bool is_frag_v4(struct iphdr *iph)
+{
+	int offset;
+	int flags;
+
+	offset = bpf_ntohs(iph->frag_off);
+	flags = offset & ~IP_OFFSET;
+	offset &= IP_OFFSET;
+	offset <<= 3;
+
+	return (flags & IP_MF) || offset;
+}
+
+static bool is_frag_v6(struct ipv6hdr *ip6h)
+{
+	/* Simplifying assumption that there are no extension headers
+	 * between fixed header and fragmentation header. This assumption
+	 * is only valid in this test case. It saves us the hassle of
+	 * searching all potential extension headers.
+	 */
+	return ip6h->nexthdr == NEXTHDR_FRAGMENT;
+}
+
+static int handle_v4(struct sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	u8 iph_buf[20] = {};
+	struct iphdr *iph;
+
+	if (bpf_dynptr_from_skb(skb, 0, &ptr))
+		return NF_DROP;
+
+	iph = bpf_dynptr_slice(&ptr, 0, iph_buf, sizeof(iph_buf));
+	if (!iph)
+		return NF_DROP;
+
+	/* Shoot down any frags */
+	if (is_frag_v4(iph)) {
+		shootdowns++;
+		return NF_DROP;
+	}
+
+	return NF_ACCEPT;
+}
+
+static int handle_v6(struct sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	struct ipv6hdr *ip6h;
+	u8 ip6h_buf[40] = {};
+
+	if (bpf_dynptr_from_skb(skb, 0, &ptr))
+		return NF_DROP;
+
+	ip6h = bpf_dynptr_slice(&ptr, 0, ip6h_buf, sizeof(ip6h_buf));
+	if (!ip6h)
+		return NF_DROP;
+
+	/* Shoot down any frags */
+	if (is_frag_v6(ip6h)) {
+		shootdowns++;
+		return NF_DROP;
+	}
+
+	return NF_ACCEPT;
+}
+
+SEC("netfilter")
+int defrag(struct bpf_nf_ctx *ctx)
+{
+	struct sk_buff *skb = ctx->skb;
+
+	switch (bpf_ntohs(skb->protocol)) {
+	case ETH_P_IP:
+		return handle_v4(skb);
+	case ETH_P_IPV6:
+		return handle_v6(skb);
+	default:
+		return NF_ACCEPT;
+	}
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.41.0
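
As a rough cross-check of the generated header (not part of the patch), the
sketch below decodes the fragmentation fields the BPF prog inspects. It uses
the same masks as is_frag_v4() and the same "fragment header directly follows
the fixed header" assumption as is_frag_v6(). The helper names parse_v4() and
parse_v6() are hypothetical, and only the leading header bytes of each array
in ip_check_defrag_frags.h are copied here.

```python
import ipaddress
import struct

IP_MF = 0x2000          # "more fragments" flag in the IPv4 frag_off field
IP_OFFSET = 0x1FFF      # low 13 bits: fragment offset in 8-byte units
NEXTHDR_FRAGMENT = 44   # IPv6 next-header value for the fragment header


def parse_v4(pkt: bytes):
    """Return (more_fragments, byte_offset) from a raw IPv4 header."""
    (frag_off,) = struct.unpack_from("!H", pkt, 6)
    return bool(frag_off & IP_MF), (frag_off & IP_OFFSET) << 3


def parse_v6(pkt: bytes):
    """Return (more_fragments, byte_offset, ident), or None when the fixed
    header is not immediately followed by a fragment header."""
    if pkt[6] != NEXTHDR_FRAGMENT:
        return None
    # Fragment header: next header (1), reserved (1), offset+flags (2), id (4)
    _nh, _rsvd, off_flags, ident = struct.unpack_from("!BBHI", pkt, 40)
    # The top 13 bits hold the offset in 8-byte units; bit 0 is the M flag.
    return bool(off_flags & 0x1), off_flags & 0xFFF8, ident


# First 8 bytes of frag_0, frag_1 and frag_2
FRAG_0 = bytes.fromhex("4500002c00012000")
FRAG_1 = bytes.fromhex("4500002c00012003")
FRAG_2 = bytes.fromhex("4500001e00010006")

# Fixed IPv6 header plus fragment header of frag6_0, frag6_1 and frag6_2
ADDRS = (ipaddress.IPv6Address("fc00::100").packed
         + ipaddress.IPv6Address("fc00::200").packed)
FRAG6_0 = bytes.fromhex("6000000000202c40") + ADDRS + bytes.fromhex("110000010000beef")
FRAG6_1 = bytes.fromhex("6000000000202c40") + ADDRS + bytes.fromhex("110000190000beef")
FRAG6_2 = bytes.fromhex("6000000000122c40") + ADDRS + bytes.fromhex("110000300000beef")

if __name__ == "__main__":
    for name, pkt in (("frag_0", FRAG_0), ("frag_1", FRAG_1), ("frag_2", FRAG_2)):
        print(name, parse_v4(pkt))
    for name, pkt in (("frag6_0", FRAG6_0), ("frag6_1", FRAG6_1), ("frag6_2", FRAG6_2)):
        print(name, parse_v6(pkt))
```

In both families the fragments start at byte offsets 0, 24 and 48 of the
58-byte UDP datagram (8-byte UDP header plus the 50-byte message), and only
the last fragment has the more-fragments bit clear. The last fragment is still
caught by the prog: by its non-zero offset for IPv4, and by the mere presence
of the fragment header for IPv6.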


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-12 23:43 ` [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
@ 2023-07-13  0:43   ` Alexei Starovoitov
  2023-07-13  1:22     ` Daniel Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2023-07-13  0:43 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Andrii Nakryiko, Alexei Starovoitov, Florian Westphal,
	David S. Miller, Pablo Neira Ayuso, Paolo Abeni, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jozsef Kadlecsik, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	netfilter-devel, coreteam, Network Development, David Ahern

On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
> +       case NFPROTO_IPV6:
> +               rcu_read_lock();
> +               v6_hook = rcu_dereference(nf_defrag_v6_hook);
> +               if (!v6_hook) {
> +                       rcu_read_unlock();
> +                       err = request_module("nf_defrag_ipv6");
> +                       if (err)
> +                               return err < 0 ? err : -EINVAL;
> +
> +                       rcu_read_lock();
> +                       v6_hook = rcu_dereference(nf_defrag_v6_hook);
> +                       if (!v6_hook) {
> +                               WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
> +                               err = -ENOENT;
> +                               goto out_v6;
> +                       }
> +               }
> +
> +               err = v6_hook->enable(link->net);

I was about to apply, but luckily caught this issue in my local test:

[   18.462448] BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:283
[   18.463238] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid:
2042, name: test_progs
[   18.463927] preempt_count: 0, expected: 0
[   18.464249] RCU nest depth: 1, expected: 0
[   18.464631] CPU: 15 PID: 2042 Comm: test_progs Tainted: G
O       6.4.0-04319-g6f6ec4fa00dc #4896
[   18.465480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[   18.466531] Call Trace:
[   18.466767]  <TASK>
[   18.466975]  dump_stack_lvl+0x32/0x40
[   18.467325]  __might_resched+0x129/0x180
[   18.467691]  mutex_lock+0x1a/0x40
[   18.468057]  nf_defrag_ipv4_enable+0x16/0x70
[   18.468467]  bpf_nf_link_attach+0x141/0x300
[   18.468856]  __sys_bpf+0x133e/0x26d0

You cannot call mutex under rcu_read_lock.

Please make sure you have all kernel debug flags on in your testing.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-13  0:43   ` Alexei Starovoitov
@ 2023-07-13  1:22     ` Daniel Xu
  2023-07-13  1:26       ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Xu @ 2023-07-13  1:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Alexei Starovoitov, Florian Westphal,
	David S. Miller, Pablo Neira Ayuso, Paolo Abeni, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jozsef Kadlecsik, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	netfilter-devel, coreteam, Network Development, David Ahern

Hi Alexei,

On Wed, Jul 12, 2023 at 05:43:49PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
> > +       case NFPROTO_IPV6:
> > +               rcu_read_lock();
> > +               v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > +               if (!v6_hook) {
> > +                       rcu_read_unlock();
> > +                       err = request_module("nf_defrag_ipv6");
> > +                       if (err)
> > +                               return err < 0 ? err : -EINVAL;
> > +
> > +                       rcu_read_lock();
> > +                       v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > +                       if (!v6_hook) {
> > +                               WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
> > +                               err = -ENOENT;
> > +                               goto out_v6;
> > +                       }
> > +               }
> > +
> > +               err = v6_hook->enable(link->net);
> 
> I was about to apply, but luckily caught this issue in my local test:
> 
> [   18.462448] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:283
> [   18.463238] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid:
> 2042, name: test_progs
> [   18.463927] preempt_count: 0, expected: 0
> [   18.464249] RCU nest depth: 1, expected: 0
> [   18.464631] CPU: 15 PID: 2042 Comm: test_progs Tainted: G
> O       6.4.0-04319-g6f6ec4fa00dc #4896
> [   18.465480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> [   18.466531] Call Trace:
> [   18.466767]  <TASK>
> [   18.466975]  dump_stack_lvl+0x32/0x40
> [   18.467325]  __might_resched+0x129/0x180
> [   18.467691]  mutex_lock+0x1a/0x40
> [   18.468057]  nf_defrag_ipv4_enable+0x16/0x70
> [   18.468467]  bpf_nf_link_attach+0x141/0x300
> [   18.468856]  __sys_bpf+0x133e/0x26d0
> 
> You cannot call mutex under rcu_read_lock.

Whoops, my bad. I think this patch should fix it:

```
From 7e8927c44452db07ddd7cf0e30bb49215fc044ed Mon Sep 17 00:00:00 2001
Message-ID: <7e8927c44452db07ddd7cf0e30bb49215fc044ed.1689211250.git.dxu@dxuuu.xyz>
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Wed, 12 Jul 2023 19:17:35 -0600
Subject: [PATCH] netfilter: bpf: Don't hold rcu_read_lock during
 enable/disable

->enable()/->disable() take a mutex, which can sleep. Sleeping is not
allowed inside an RCU read-side critical section.

Our refcnt on the module keeps ->enable()/->disable() from going away
while we call them.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 net/netfilter/nf_bpf_link.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index 77ffbf26ba3d..79704cc596aa 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -60,9 +60,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
                        goto out_v4;
                }

+               rcu_read_unlock();
                err = v4_hook->enable(link->net);
                if (err)
                        module_put(v4_hook->owner);
+
+               return err;
 out_v4:
                rcu_read_unlock();
                return err;
@@ -92,9 +95,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
                        goto out_v6;
                }

+               rcu_read_unlock();
                err = v6_hook->enable(link->net);
                if (err)
                        module_put(v6_hook->owner);
+
+               return err;
 out_v6:
                rcu_read_unlock();
                return err;
@@ -114,11 +120,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
        case NFPROTO_IPV4:
                rcu_read_lock();
                v4_hook = rcu_dereference(nf_defrag_v4_hook);
+               rcu_read_unlock();
                if (v4_hook) {
                        v4_hook->disable(link->net);
                        module_put(v4_hook->owner);
                }
-               rcu_read_unlock();

                break;
 #endif
@@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
        case NFPROTO_IPV6:
                rcu_read_lock();
                v6_hook = rcu_dereference(nf_defrag_v6_hook);
+               rcu_read_unlock();
                if (v6_hook) {
                        v6_hook->disable(link->net);
                        module_put(v6_hook->owner);
                }
-               rcu_read_unlock();

                break;
        }
--
2.41.0
```

I'll send out a v5 tomorrow morning unless you feel like applying the
series + this patch today.

> 
> Please make sure you have all kernel debug flags on in your testing.
> 

Ack. Will make sure lockdep is on.


Thanks,
Daniel

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-13  1:22     ` Daniel Xu
@ 2023-07-13  1:26       ` Alexei Starovoitov
  2023-07-13  4:33         ` Daniel Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2023-07-13  1:26 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Andrii Nakryiko, Alexei Starovoitov, Florian Westphal,
	David S. Miller, Pablo Neira Ayuso, Paolo Abeni, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jozsef Kadlecsik, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	netfilter-devel, coreteam, Network Development, David Ahern

On Wed, Jul 12, 2023 at 6:22 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Hi Alexei,
>
> On Wed, Jul 12, 2023 at 05:43:49PM -0700, Alexei Starovoitov wrote:
> > On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > > +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
> > > +       case NFPROTO_IPV6:
> > > +               rcu_read_lock();
> > > +               v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > +               if (!v6_hook) {
> > > +                       rcu_read_unlock();
> > > +                       err = request_module("nf_defrag_ipv6");
> > > +                       if (err)
> > > +                               return err < 0 ? err : -EINVAL;
> > > +
> > > +                       rcu_read_lock();
> > > +                       v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > +                       if (!v6_hook) {
> > > +                               WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
> > > +                               err = -ENOENT;
> > > +                               goto out_v6;
> > > +                       }
> > > +               }
> > > +
> > > +               err = v6_hook->enable(link->net);
> >
> > I was about to apply, but luckily caught this issue in my local test:
> >
> > [   18.462448] BUG: sleeping function called from invalid context at
> > kernel/locking/mutex.c:283
> > [   18.463238] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid:
> > 2042, name: test_progs
> > [   18.463927] preempt_count: 0, expected: 0
> > [   18.464249] RCU nest depth: 1, expected: 0
> > [   18.464631] CPU: 15 PID: 2042 Comm: test_progs Tainted: G
> > O       6.4.0-04319-g6f6ec4fa00dc #4896
> > [   18.465480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > [   18.466531] Call Trace:
> > [   18.466767]  <TASK>
> > [   18.466975]  dump_stack_lvl+0x32/0x40
> > [   18.467325]  __might_resched+0x129/0x180
> > [   18.467691]  mutex_lock+0x1a/0x40
> > [   18.468057]  nf_defrag_ipv4_enable+0x16/0x70
> > [   18.468467]  bpf_nf_link_attach+0x141/0x300
> > [   18.468856]  __sys_bpf+0x133e/0x26d0
> >
> > You cannot call mutex under rcu_read_lock.
>
> Whoops, my bad. I think this patch should fix it:
>
> ```
> From 7e8927c44452db07ddd7cf0e30bb49215fc044ed Mon Sep 17 00:00:00 2001
> Message-ID: <7e8927c44452db07ddd7cf0e30bb49215fc044ed.1689211250.git.dxu@dxuuu.xyz>
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Wed, 12 Jul 2023 19:17:35 -0600
> Subject: [PATCH] netfilter: bpf: Don't hold rcu_read_lock during
>  enable/disable
>
> ->enable()/->disable() take a mutex, which can sleep. Sleeping is not
> allowed inside an RCU read-side critical section.
>
> Our refcnt on the module keeps ->enable()/->disable() from going away
> while we call them.
>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>  net/netfilter/nf_bpf_link.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
> index 77ffbf26ba3d..79704cc596aa 100644
> --- a/net/netfilter/nf_bpf_link.c
> +++ b/net/netfilter/nf_bpf_link.c
> @@ -60,9 +60,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
>                         goto out_v4;
>                 }
>
> +               rcu_read_unlock();
>                 err = v4_hook->enable(link->net);
>                 if (err)
>                         module_put(v4_hook->owner);
> +
> +               return err;
>  out_v4:
>                 rcu_read_unlock();
>                 return err;
> @@ -92,9 +95,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
>                         goto out_v6;
>                 }
>
> +               rcu_read_unlock();
>                 err = v6_hook->enable(link->net);
>                 if (err)
>                         module_put(v6_hook->owner);
> +
> +               return err;
>  out_v6:
>                 rcu_read_unlock();
>                 return err;
> @@ -114,11 +120,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
>         case NFPROTO_IPV4:
>                 rcu_read_lock();
>                 v4_hook = rcu_dereference(nf_defrag_v4_hook);
> +               rcu_read_unlock();
>                 if (v4_hook) {
>                         v4_hook->disable(link->net);
>                         module_put(v4_hook->owner);
>                 }
> -               rcu_read_unlock();
>
>                 break;
>  #endif
> @@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
>         case NFPROTO_IPV6:
>                 rcu_read_lock();
>                 v6_hook = rcu_dereference(nf_defrag_v6_hook);
> +               rcu_read_unlock();

No. v6_hook is gone as soon as you unlock it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-13  1:26       ` Alexei Starovoitov
@ 2023-07-13  4:33         ` Daniel Xu
  2023-07-13 23:10           ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Xu @ 2023-07-13  4:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Alexei Starovoitov, Florian Westphal,
	David S. Miller, Pablo Neira Ayuso, Paolo Abeni, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jozsef Kadlecsik, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	netfilter-devel, coreteam, Network Development, David Ahern

On Wed, Jul 12, 2023 at 06:26:13PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 12, 2023 at 6:22 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > Hi Alexei,
> >
> > On Wed, Jul 12, 2023 at 05:43:49PM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > > > +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
> > > > +       case NFPROTO_IPV6:
> > > > +               rcu_read_lock();
> > > > +               v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > > +               if (!v6_hook) {
> > > > +                       rcu_read_unlock();
> > > > +                       err = request_module("nf_defrag_ipv6");
> > > > +                       if (err)
> > > > +                               return err < 0 ? err : -EINVAL;
> > > > +
> > > > +                       rcu_read_lock();
> > > > +                       v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > > +                       if (!v6_hook) {
> > > > +                               WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
> > > > +                               err = -ENOENT;
> > > > +                               goto out_v6;
> > > > +                       }
> > > > +               }
> > > > +
> > > > +               err = v6_hook->enable(link->net);
> > >
> > > I was about to apply, but luckily caught this issue in my local test:
> > >
> > > [   18.462448] BUG: sleeping function called from invalid context at
> > > kernel/locking/mutex.c:283
> > > [   18.463238] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid:
> > > 2042, name: test_progs
> > > [   18.463927] preempt_count: 0, expected: 0
> > > [   18.464249] RCU nest depth: 1, expected: 0
> > > [   18.464631] CPU: 15 PID: 2042 Comm: test_progs Tainted: G
> > > O       6.4.0-04319-g6f6ec4fa00dc #4896
> > > [   18.465480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > > [   18.466531] Call Trace:
> > > [   18.466767]  <TASK>
> > > [   18.466975]  dump_stack_lvl+0x32/0x40
> > > [   18.467325]  __might_resched+0x129/0x180
> > > [   18.467691]  mutex_lock+0x1a/0x40
> > > [   18.468057]  nf_defrag_ipv4_enable+0x16/0x70
> > > [   18.468467]  bpf_nf_link_attach+0x141/0x300
> > > [   18.468856]  __sys_bpf+0x133e/0x26d0
> > >
> > > You cannot call mutex under rcu_read_lock.
> >
> > Whoops, my bad. I think this patch should fix it:
> >
> > ```
> > From 7e8927c44452db07ddd7cf0e30bb49215fc044ed Mon Sep 17 00:00:00 2001
> > Message-ID: <7e8927c44452db07ddd7cf0e30bb49215fc044ed.1689211250.git.dxu@dxuuu.xyz>
> > From: Daniel Xu <dxu@dxuuu.xyz>
> > Date: Wed, 12 Jul 2023 19:17:35 -0600
> > Subject: [PATCH] netfilter: bpf: Don't hold rcu_read_lock during
> >  enable/disable
> >
> > ->enable()/->disable() take a mutex, which can sleep. Sleeping is not
> > allowed inside an RCU read-side critical section.
> >
> > Our refcnt on the module keeps ->enable()/->disable() from going away
> > while we call them.
> >
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> >  net/netfilter/nf_bpf_link.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
> > index 77ffbf26ba3d..79704cc596aa 100644
> > --- a/net/netfilter/nf_bpf_link.c
> > +++ b/net/netfilter/nf_bpf_link.c
> > @@ -60,9 +60,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> >                         goto out_v4;
> >                 }
> >
> > +               rcu_read_unlock();
> >                 err = v4_hook->enable(link->net);
> >                 if (err)
> >                         module_put(v4_hook->owner);
> > +
> > +               return err;
> >  out_v4:
> >                 rcu_read_unlock();
> >                 return err;
> > @@ -92,9 +95,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> >                         goto out_v6;
> >                 }
> >
> > +               rcu_read_unlock();
> >                 err = v6_hook->enable(link->net);
> >                 if (err)
> >                         module_put(v6_hook->owner);
> > +
> > +               return err;
> >  out_v6:
> >                 rcu_read_unlock();
> >                 return err;
> > @@ -114,11 +120,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> >         case NFPROTO_IPV4:
> >                 rcu_read_lock();
> >                 v4_hook = rcu_dereference(nf_defrag_v4_hook);
> > +               rcu_read_unlock();
> >                 if (v4_hook) {
> >                         v4_hook->disable(link->net);
> >                         module_put(v4_hook->owner);
> >                 }
> > -               rcu_read_unlock();
> >
> >                 break;
> >  #endif
> > @@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> >         case NFPROTO_IPV6:
> >                 rcu_read_lock();
> >                 v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > +               rcu_read_unlock();
> 
> No. v6_hook is gone as soon as you unlock it.

I think we're protected here by the try_module_get() on the enable path.
And we only disable defrag if enabling succeeds. The module shouldn't
be able to deregister its hooks until we call the module_put() later.

I think READ_ONCE() would've been more appropriate but I wasn't sure if
that was ok given nf_defrag_v(4|6)_hook is written to by
rcu_assign_pointer() and I was assuming symmetry is necessary.

Does that sound right?

Thanks,
Daniel

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-13  4:33         ` Daniel Xu
@ 2023-07-13 23:10           ` Alexei Starovoitov
  2023-07-13 23:42             ` Daniel Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2023-07-13 23:10 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Andrii Nakryiko, Alexei Starovoitov, Florian Westphal,
	David S. Miller, Pablo Neira Ayuso, Paolo Abeni, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jozsef Kadlecsik, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	netfilter-devel, coreteam, Network Development, David Ahern

On Wed, Jul 12, 2023 at 9:33 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> On Wed, Jul 12, 2023 at 06:26:13PM -0700, Alexei Starovoitov wrote:
> > On Wed, Jul 12, 2023 at 6:22 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > >
> > > Hi Alexei,
> > >
> > > On Wed, Jul 12, 2023 at 05:43:49PM -0700, Alexei Starovoitov wrote:
> > > > On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > > > > +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
> > > > > +       case NFPROTO_IPV6:
> > > > > +               rcu_read_lock();
> > > > > +               v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > > > +               if (!v6_hook) {
> > > > > +                       rcu_read_unlock();
> > > > > +                       err = request_module("nf_defrag_ipv6");
> > > > > +                       if (err)
> > > > > +                               return err < 0 ? err : -EINVAL;
> > > > > +
> > > > > +                       rcu_read_lock();
> > > > > +                       v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > > > +                       if (!v6_hook) {
> > > > > +                               WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
> > > > > +                               err = -ENOENT;
> > > > > +                               goto out_v6;
> > > > > +                       }
> > > > > +               }
> > > > > +
> > > > > +               err = v6_hook->enable(link->net);
> > > >
> > > > I was about to apply, but luckily caught this issue in my local test:
> > > >
> > > > [   18.462448] BUG: sleeping function called from invalid context at
> > > > kernel/locking/mutex.c:283
> > > > [   18.463238] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid:
> > > > 2042, name: test_progs
> > > > [   18.463927] preempt_count: 0, expected: 0
> > > > [   18.464249] RCU nest depth: 1, expected: 0
> > > > [   18.464631] CPU: 15 PID: 2042 Comm: test_progs Tainted: G
> > > > O       6.4.0-04319-g6f6ec4fa00dc #4896
> > > > [   18.465480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > > > [   18.466531] Call Trace:
> > > > [   18.466767]  <TASK>
> > > > [   18.466975]  dump_stack_lvl+0x32/0x40
> > > > [   18.467325]  __might_resched+0x129/0x180
> > > > [   18.467691]  mutex_lock+0x1a/0x40
> > > > [   18.468057]  nf_defrag_ipv4_enable+0x16/0x70
> > > > [   18.468467]  bpf_nf_link_attach+0x141/0x300
> > > > [   18.468856]  __sys_bpf+0x133e/0x26d0
> > > >
> > > > You cannot call mutex under rcu_read_lock.
> > >
> > > Whoops, my bad. I think this patch should fix it:
> > >
> > > ```
> > > From 7e8927c44452db07ddd7cf0e30bb49215fc044ed Mon Sep 17 00:00:00 2001
> > > Message-ID: <7e8927c44452db07ddd7cf0e30bb49215fc044ed.1689211250.git.dxu@dxuuu.xyz>
> > > From: Daniel Xu <dxu@dxuuu.xyz>
> > > Date: Wed, 12 Jul 2023 19:17:35 -0600
> > > Subject: [PATCH] netfilter: bpf: Don't hold rcu_read_lock during
> > >  enable/disable
> > >
> > > ->enable()/->disable() takes a mutex which can sleep. You can't sleep
> > > during RCU read side critical section.
> > >
> > > Our refcnt on the module will protect us from ->enable()/->disable()
> > > from going away while we call it.
> > >
> > > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > > ---
> > >  net/netfilter/nf_bpf_link.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
> > > index 77ffbf26ba3d..79704cc596aa 100644
> > > --- a/net/netfilter/nf_bpf_link.c
> > > +++ b/net/netfilter/nf_bpf_link.c
> > > @@ -60,9 +60,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> > >                         goto out_v4;
> > >                 }
> > >
> > > +               rcu_read_unlock();
> > >                 err = v4_hook->enable(link->net);
> > >                 if (err)
> > >                         module_put(v4_hook->owner);
> > > +
> > > +               return err;
> > >  out_v4:
> > >                 rcu_read_unlock();
> > >                 return err;
> > > @@ -92,9 +95,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> > >                         goto out_v6;
> > >                 }
> > >
> > > +               rcu_read_unlock();
> > >                 err = v6_hook->enable(link->net);
> > >                 if (err)
> > >                         module_put(v6_hook->owner);
> > > +
> > > +               return err;
> > >  out_v6:
> > >                 rcu_read_unlock();
> > >                 return err;
> > > @@ -114,11 +120,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> > >         case NFPROTO_IPV4:
> > >                 rcu_read_lock();
> > >                 v4_hook = rcu_dereference(nf_defrag_v4_hook);
> > > +               rcu_read_unlock();
> > >                 if (v4_hook) {
> > >                         v4_hook->disable(link->net);
> > >                         module_put(v4_hook->owner);
> > >                 }
> > > -               rcu_read_unlock();
> > >
> > >                 break;
> > >  #endif
> > > @@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> > >         case NFPROTO_IPV6:
> > >                 rcu_read_lock();
> > >                 v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > +               rcu_read_unlock();
> >
> > No. v6_hook is gone as soon as you unlock it.
>
> I think we're protected here by the try_module_get() on the enable path.
> And we only disable defrag if enabling succeeds. The module shouldn't
> be able to deregister its hooks until we call the module_put() later.
>
> I think READ_ONCE() would've been more appropriate but I wasn't sure if
> that was ok given nf_defrag_v(4|6)_hook is written to by
> rcu_assign_pointer() and I was assuming symmetry is necessary.

Why is rcu_assign_pointer() used?
If it's not RCU protected, what is the point of rcu_*() accessors
and rcu_read_lock() ?

In general, the pattern:
rcu_read_lock();
ptr = rcu_dereference(...);
rcu_read_unlock();
ptr->..
is a bug. 100%.
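
One way to read the rule above: a pointer obtained under rcu_read_lock() may only be used after rcu_read_unlock() if something else, such as a module reference, was taken on it before unlocking. The following is a minimal userspace sketch of that ordering, not kernel code. Every type and every model_* function is a hypothetical stand-in invented for illustration, and pin_hook() is an assumed name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's implementations. */
struct module_model {
	int refcnt;
	bool live;
};

struct hook_model {
	struct module_model *owner;
};

static int rcu_depth;

static void model_rcu_read_lock(void)   { rcu_depth++; }
static void model_rcu_read_unlock(void) { rcu_depth--; }

/* Fails once the module has gone away, loosely like try_module_get(). */
static bool model_try_module_get(struct module_model *m)
{
	if (!m->live)
		return false;
	m->refcnt++;
	return true;
}

static void model_module_put(struct module_model *m)
{
	m->refcnt--;
}

/* In this model, unload is refused while anyone holds a reference. */
static bool model_module_unload(struct module_model *m)
{
	if (m->refcnt > 0)
		return false;
	m->live = false;
	return true;
}

/*
 * Take the reference while still inside the read-side section, then
 * unlock. The returned pointer stays usable because the reference is
 * held, not because it was once read under the RCU lock.
 */
static struct hook_model *pin_hook(struct hook_model *published)
{
	struct hook_model *h;

	model_rcu_read_lock();
	h = published;			/* stands in for rcu_dereference() */
	if (h && !model_try_module_get(h->owner))
		h = NULL;
	model_rcu_read_unlock();

	/* Assumes no nesting: callers may sleep from here on. */
	assert(rcu_depth == 0);
	return h;
}
```

In this model, a successful pin_hook() makes model_module_unload() fail until the matching model_module_put(), which is the property the buggy pattern lacks.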

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-13 23:10           ` Alexei Starovoitov
@ 2023-07-13 23:42             ` Daniel Xu
  2023-07-14  9:47               ` Florian Westphal
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Xu @ 2023-07-13 23:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Alexei Starovoitov, Florian Westphal,
	David S. Miller, Pablo Neira Ayuso, Paolo Abeni, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jozsef Kadlecsik, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	netfilter-devel, coreteam, Network Development, David Ahern

On Thu, Jul 13, 2023 at 04:10:03PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 12, 2023 at 9:33 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > On Wed, Jul 12, 2023 at 06:26:13PM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jul 12, 2023 at 6:22 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > > >
> > > > Hi Alexei,
> > > >
> > > > On Wed, Jul 12, 2023 at 05:43:49PM -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > > > > > +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
> > > > > > +       case NFPROTO_IPV6:
> > > > > > +               rcu_read_lock();
> > > > > > +               v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > > > > +               if (!v6_hook) {
> > > > > > +                       rcu_read_unlock();
> > > > > > +                       err = request_module("nf_defrag_ipv6");
> > > > > > +                       if (err)
> > > > > > +                               return err < 0 ? err : -EINVAL;
> > > > > > +
> > > > > > +                       rcu_read_lock();
> > > > > > +                       v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > > > > +                       if (!v6_hook) {
> > > > > > +                               WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
> > > > > > +                               err = -ENOENT;
> > > > > > +                               goto out_v6;
> > > > > > +                       }
> > > > > > +               }
> > > > > > +
> > > > > > +               err = v6_hook->enable(link->net);
> > > > >
> > > > > I was about to apply, but luckily caught this issue in my local test:
> > > > >
> > > > > [   18.462448] BUG: sleeping function called from invalid context at
> > > > > kernel/locking/mutex.c:283
> > > > > [   18.463238] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid:
> > > > > 2042, name: test_progs
> > > > > [   18.463927] preempt_count: 0, expected: 0
> > > > > [   18.464249] RCU nest depth: 1, expected: 0
> > > > > [   18.464631] CPU: 15 PID: 2042 Comm: test_progs Tainted: G
> > > > > O       6.4.0-04319-g6f6ec4fa00dc #4896
> > > > > [   18.465480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > > > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > > > > [   18.466531] Call Trace:
> > > > > [   18.466767]  <TASK>
> > > > > [   18.466975]  dump_stack_lvl+0x32/0x40
> > > > > [   18.467325]  __might_resched+0x129/0x180
> > > > > [   18.467691]  mutex_lock+0x1a/0x40
> > > > > [   18.468057]  nf_defrag_ipv4_enable+0x16/0x70
> > > > > [   18.468467]  bpf_nf_link_attach+0x141/0x300
> > > > > [   18.468856]  __sys_bpf+0x133e/0x26d0
> > > > >
> > > > > You cannot call mutex under rcu_read_lock.
> > > >
> > > > Whoops, my bad. I think this patch should fix it:
> > > >
> > > > ```
> > > > From 7e8927c44452db07ddd7cf0e30bb49215fc044ed Mon Sep 17 00:00:00 2001
> > > > Message-ID: <7e8927c44452db07ddd7cf0e30bb49215fc044ed.1689211250.git.dxu@dxuuu.xyz>
> > > > From: Daniel Xu <dxu@dxuuu.xyz>
> > > > Date: Wed, 12 Jul 2023 19:17:35 -0600
> > > > Subject: [PATCH] netfilter: bpf: Don't hold rcu_read_lock during
> > > >  enable/disable
> > > >
> > > > ->enable()/->disable() takes a mutex which can sleep. You can't sleep
> > > > during RCU read side critical section.
> > > >
> > > > Our refcnt on the module will protect us from ->enable()/->disable()
> > > > from going away while we call it.
> > > >
> > > > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > > > ---
> > > >  net/netfilter/nf_bpf_link.c | 10 ++++++++--
> > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
> > > > index 77ffbf26ba3d..79704cc596aa 100644
> > > > --- a/net/netfilter/nf_bpf_link.c
> > > > +++ b/net/netfilter/nf_bpf_link.c
> > > > @@ -60,9 +60,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> > > >                         goto out_v4;
> > > >                 }
> > > >
> > > > +               rcu_read_unlock();
> > > >                 err = v4_hook->enable(link->net);
> > > >                 if (err)
> > > >                         module_put(v4_hook->owner);
> > > > +
> > > > +               return err;
> > > >  out_v4:
> > > >                 rcu_read_unlock();
> > > >                 return err;
> > > > @@ -92,9 +95,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> > > >                         goto out_v6;
> > > >                 }
> > > >
> > > > +               rcu_read_unlock();
> > > >                 err = v6_hook->enable(link->net);
> > > >                 if (err)
> > > >                         module_put(v6_hook->owner);
> > > > +
> > > > +               return err;
> > > >  out_v6:
> > > >                 rcu_read_unlock();
> > > >                 return err;
> > > > @@ -114,11 +120,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> > > >         case NFPROTO_IPV4:
> > > >                 rcu_read_lock();
> > > >                 v4_hook = rcu_dereference(nf_defrag_v4_hook);
> > > > +               rcu_read_unlock();
> > > >                 if (v4_hook) {
> > > >                         v4_hook->disable(link->net);
> > > >                         module_put(v4_hook->owner);
> > > >                 }
> > > > -               rcu_read_unlock();
> > > >
> > > >                 break;
> > > >  #endif
> > > > @@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> > > >         case NFPROTO_IPV6:
> > > >                 rcu_read_lock();
> > > >                 v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > > > +               rcu_read_unlock();
> > >
> > > No. v6_hook is gone as soon as you unlock it.
> >
> > I think we're protected here by the try_module_get() on the enable path.
> > And we only disable defrag if enabling succeeds. The module shouldn't
> > be able to deregister its hooks until we call the module_put() later.
> >
> > I think READ_ONCE() would've been more appropriate but I wasn't sure if
> > that was ok given nf_defrag_v(4|6)_hook is written to by
> > rcu_assign_pointer() and I was assuming symmetry is necessary.
> 
> Why is rcu_assign_pointer() used?
> If it's not RCU protected, what is the point of rcu_*() accessors
> and rcu_read_lock() ?
> 
> In general, the pattern:
> rcu_read_lock();
> ptr = rcu_dereference(...);
> rcu_read_unlock();
> ptr->..
> is a bug. 100%.
> 

The reason I left it like this is b/c otherwise I think there is a race
with module unload and taking a refcnt. For example:

ptr = READ_ONCE(global_var)
                                             <module unload on other cpu>
// ptr invalid
try_module_get(ptr->owner) 

I think the synchronize_rcu() call in
kernel/module/main.c:free_module() protects against that race based on
my reading.

Maybe the ->enable() path can store a copy of the hook ptr in
struct bpf_nf_link to get rid of the odd rcu_dereference()?

Open to other ideas too -- would appreciate any hints.

Thanks,
Daniel

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-13 23:42             ` Daniel Xu
@ 2023-07-14  9:47               ` Florian Westphal
  2023-07-18 21:45                 ` Daniel Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Florian Westphal @ 2023-07-14  9:47 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Alexei Starovoitov, Andrii Nakryiko, Alexei Starovoitov,
	Florian Westphal, David S. Miller, Pablo Neira Ayuso,
	Paolo Abeni, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jozsef Kadlecsik, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf, LKML, netfilter-devel, coreteam, Network Development,
	David Ahern

Daniel Xu <dxu@dxuuu.xyz> wrote:
> On Thu, Jul 13, 2023 at 04:10:03PM -0700, Alexei Starovoitov wrote:
> > Why is rcu_assign_pointer() used?
> > If it's not RCU protected, what is the point of rcu_*() accessors
> > and rcu_read_lock() ?
> > 
> > In general, the pattern:
> > rcu_read_lock();
> > ptr = rcu_dereference(...);
> > rcu_read_unlock();
> > ptr->..
> > is a bug. 100%.

FWIW, I agree with Alexei, it does look... dodgy.

> The reason I left it like this is b/c otherwise I think there is a race
> with module unload and taking a refcnt. For example:
> 
> ptr = READ_ONCE(global_var)
>                                              <module unload on other cpu>
> // ptr invalid
> try_module_get(ptr->owner) 
>

Yes, I agree.

> I think the synchronize_rcu() call in
> kernel/module/main.c:free_module() protects against that race based on
> my reading.
> 
> Maybe the ->enable() path can store a copy of the hook ptr in
> struct bpf_nf_link to get rid of the odd rcu_dereference()?
> 
> Open to other ideas too -- would appreciate any hints.

I would suggest the following:

- Switch ordering of patches 2 and 3.
  What is currently patch 3 would add the .owner fields only.

Then, what is currently patch #2 would document the rcu/modref
interaction like this (omitting error checking for brevity):

rcu_read_lock();
v6_hook = rcu_dereference(nf_defrag_v6_hook);
if (!v6_hook) {
	rcu_read_unlock();
	err = request_module("nf_defrag_ipv6");
	if (err)
		return err < 0 ? err : -EINVAL;
	rcu_read_lock();
	v6_hook = rcu_dereference(nf_defrag_v6_hook);
}

if (v6_hook && try_module_get(v6_hook->owner))
	v6_hook = rcu_pointer_handoff(v6_hook);
else
	v6_hook = NULL;

rcu_read_unlock();

if (!v6_hook)
	err();
v6_hook->enable();


I'd store the v4/6_hook pointer in the nf bpf link struct. It's probably more
self-explanatory for the disable side, in that we picked up a module reference
that we still own at delete time, without the need for any rcu involvement.
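
As a rough illustration of the idea of keeping the pinned hook in the link, here is a userspace sketch. The struct layouts, the defrag_hook field name, and the link_*_defrag() helpers are all assumptions made for this example; they are not taken from the patch series.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the network namespace, module and hook. */
struct net_model { int defrag_users; };
struct module_model { int refcnt; };

struct hook_model {
	struct module_model *owner;
	int  (*enable)(struct net_model *net);
	void (*disable)(struct net_model *net);
};

/* Hypothetical link layout: the field name defrag_hook is an assumption. */
struct link_model {
	struct net_model *net;
	const struct hook_model *defrag_hook;	/* non-NULL only while pinned */
};

static int model_enable(struct net_model *net)
{
	net->defrag_users++;
	return 0;
}

static void model_disable(struct net_model *net)
{
	net->defrag_users--;
}

/* The caller has already taken a module reference on hook->owner. */
static int link_enable_defrag(struct link_model *link,
			      const struct hook_model *hook)
{
	int err;

	assert(link->defrag_hook == NULL);	/* not enabled twice */
	err = hook->enable(link->net);
	if (err) {
		hook->owner->refcnt--;		/* stands in for module_put() */
		return err;
	}
	link->defrag_hook = hook;		/* remember what we pinned */
	return 0;
}

/* No RCU here: the stored pointer is kept alive by the reference we own. */
static void link_disable_defrag(struct link_model *link)
{
	const struct hook_model *hook = link->defrag_hook;

	if (!hook)
		return;
	hook->disable(link->net);
	hook->owner->refcnt--;			/* stands in for module_put() */
	link->defrag_hook = NULL;
}
```

With the pointer stored in the link, the disable path no longer has to re-read the global hook variable at all.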

Because the above handoff is repetitive for ipv4 and ipv6,
I suggest adding an agnostic helper for this.

I know you added distinct structures for ipv4 and ipv6, but if they used
the same one you could add

static const struct nf_defrag_hook *get_proto_frag_hook(const struct nf_defrag_hook __rcu *hook,
							const char *modulename);

And then use it like:

v4_hook = get_proto_frag_hook(nf_defrag_v4_hook, "nf_defrag_ipv4");

Without a need to copy the modprobe and handoff part.

What do you think?
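
For illustration only, the helper could look roughly like the userspace model below. It is a sketch under several assumptions: the model takes a pointer to the published slot (rather than the hook value in the prototype above) so that the slot can be re-read after the module loads, and model_request_module(), the model_* RCU functions and all struct layouts are hypothetical stand-ins rather than kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct module_model { int refcnt; bool live; };
struct hook_model { struct module_model *owner; };

static int rcu_depth;

static void model_rcu_read_lock(void)   { rcu_depth++; }
static void model_rcu_read_unlock(void) { rcu_depth--; }

static bool model_try_module_get(struct module_model *m)
{
	if (!m->live)
		return false;
	m->refcnt++;
	return true;
}

/* Model state: a loadable "nf_defrag_ipv4" and the slot it publishes into. */
static struct module_model v4_module = { 0, true };
static struct hook_model v4_hook_impl = { &v4_module };
static struct hook_model *v4_slot;	/* NULL until the module "loads" */

/* Stand-in for request_module(): may sleep, so it must run outside RCU. */
static int model_request_module(const char *name)
{
	assert(rcu_depth == 0);
	if (strcmp(name, "nf_defrag_ipv4") != 0)
		return -1;
	v4_slot = &v4_hook_impl;	/* module init publishes its hook */
	return 0;
}

/* Returns a pinned hook, or NULL if it is unavailable. */
static struct hook_model *get_proto_frag_hook(struct hook_model **slot,
					      const char *modname)
{
	struct hook_model *hook;

	model_rcu_read_lock();
	hook = *slot;			/* stands in for rcu_dereference() */
	if (!hook) {
		/* Drop the read lock before anything that can sleep. */
		model_rcu_read_unlock();
		if (model_request_module(modname))
			return NULL;
		model_rcu_read_lock();
		hook = *slot;
	}

	/* Pin the owner before leaving the read-side section. */
	if (hook && !model_try_module_get(hook->owner))
		hook = NULL;
	model_rcu_read_unlock();

	return hook;
}
```

Both protocol families would then share the load-and-pin logic and differ only in the slot and module name they pass in.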

* Re: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-07-14  9:47               ` Florian Westphal
@ 2023-07-18 21:45                 ` Daniel Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel Xu @ 2023-07-18 21:45 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Alexei Starovoitov, Andrii Nakryiko, Alexei Starovoitov,
	David S. Miller, Pablo Neira Ayuso, Paolo Abeni, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jozsef Kadlecsik, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	netfilter-devel, coreteam, Network Development, David Ahern

Hi Florian,

On Fri, Jul 14, 2023 at 11:47:41AM +0200, Florian Westphal wrote:
> Daniel Xu <dxu@dxuuu.xyz> wrote:
> > On Thu, Jul 13, 2023 at 04:10:03PM -0700, Alexei Starovoitov wrote:
> > > Why is rcu_assign_pointer() used?
> > > If it's not RCU protected, what is the point of rcu_*() accessors
> > > and rcu_read_lock() ?
> > > 
> > > In general, the pattern:
> > > rcu_read_lock();
> > > ptr = rcu_dereference(...);
> > > rcu_read_unlock();
> > > ptr->..
> > > is a bug. 100%.
> 
> FWIW, I agree with Alexei, it does look... dodgy.
> 
> > The reason I left it like this is b/c otherwise I think there is a race
> > with module unload and taking a refcnt. For example:
> > 
> > ptr = READ_ONCE(global_var)
> >                                              <module unload on other cpu>
> > // ptr invalid
> > try_module_get(ptr->owner) 
> >
> 
> Yes, I agree.
> 
> > I think the synchronize_rcu() call in
> > kernel/module/main.c:free_module() protects against that race based on
> > my reading.
> > 
> > Maybe the ->enable() path can store a copy of the hook ptr in
> > struct bpf_nf_link to get rid of the odd rcu_dereference()?
> > 
> > Open to other ideas too -- would appreciate any hints.
> 
> I would suggest the following:
> 
> - Switch ordering of patches 2 and 3.
>   What is currently patch 3 would add the .owner fields only.
> 
> Then, what is currently patch #2 would document the rcu/modref
> interaction like this (omitting error checking for brevity):
> 
> rcu_read_lock();
> v6_hook = rcu_dereference(nf_defrag_v6_hook);
> if (!v6_hook) {
>         rcu_read_unlock();
>         err = request_module("nf_defrag_ipv6");
>         if (err)
>                  return err < 0 ? err : -EINVAL;
>         rcu_read_lock();
> 	v6_hook = rcu_dereference(nf_defrag_v6_hook);
> }
> 
> if (v6_hook && try_module_get(v6_hook->owner))
> 	v6_hook = rcu_pointer_handoff(v6_hook);
> else
> 	v6_hook = NULL;
> 
> rcu_read_unlock();
> 
> if (!v6_hook)
> 	err();
> v6_hook->enable();
> 
> 
> I'd store the v4/6_hook pointer in the nf bpf link struct, its probably more
> self-explanatory for the disable side in that we did pick up a module reference
> that we still own at delete time, without need for any rcu involvement.
> 
> Because above handoff is repetitive for ipv4 and ipv6,
> I suggest to add an agnostic helper for this.
> 
> I know you added distinct structures for ipv4 and ipv6 but if they would use
>  the same one you could add
> 
> static const struct nf_defrag_hook *get_proto_frag_hook(const struct nf_defrag_hook __rcu *hook,
> 							const char *modulename);
> 
> And then use it like:
> 
> v4_hook = get_proto_frag_hook(nf_defrag_v4_hook, "nf_defrag_ipv4");
> 
> Without a need to copy the modprobe and handoff part.
> 
> What do you think?

That sounds reasonable to me. I'll give it a shot. Thanks for the input!

Daniel

end of thread, other threads:[~2023-07-18 21:45 UTC | newest]

Thread overview: 15+ messages
2023-07-12 23:43 [PATCH bpf-next v4 0/6] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
2023-07-12 23:43 ` [PATCH bpf-next v4 1/6] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
2023-07-12 23:43 ` [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
2023-07-13  0:43   ` Alexei Starovoitov
2023-07-13  1:22     ` Daniel Xu
2023-07-13  1:26       ` Alexei Starovoitov
2023-07-13  4:33         ` Daniel Xu
2023-07-13 23:10           ` Alexei Starovoitov
2023-07-13 23:42             ` Daniel Xu
2023-07-14  9:47               ` Florian Westphal
2023-07-18 21:45                 ` Daniel Xu
2023-07-12 23:43 ` [PATCH bpf-next v4 3/6] netfilter: bpf: Prevent defrag module unload while link active Daniel Xu
2023-07-12 23:43 ` [PATCH bpf-next v4 4/6] bpf: selftests: Support not connecting client socket Daniel Xu
2023-07-12 23:44 ` [PATCH bpf-next v4 5/6] bpf: selftests: Support custom type and proto for client sockets Daniel Xu
2023-07-12 23:44 ` [PATCH bpf-next v4 6/6] bpf: selftests: Add defrag selftests Daniel Xu
