* [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
@ 2023-06-26 23:02 Daniel Xu
  2023-06-26 23:02 ` [PATCH bpf-next 1/7] tools: libbpf: add netfilter link attach helper Daniel Xu
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: bpf, netdev, linux-kernel, linux-kselftest, coreteam,
	netfilter-devel, fw, daniel
  Cc: dsahern

=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:

1. Enforce policy on first fragment and accept all subsequent fragments.
   This works but may let in certain attacks or allow data exfiltration.

2. Enforce policy on first fragment and drop all subsequent fragments.
   This does not really work because some protocols may rely on
   fragmentation. For example, DNS may rely on oversized UDP packets for
   large responses.

So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.
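
To make the "only the first fragment has the full 5-tuple" point
concrete, here is a minimal sketch of how a stateless filter could
classify IPv4 fragments from the flags/offset field. The helper and
macro names are made up for illustration, and the sketch assumes the
first fragment is large enough to hold the whole L4 header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of the 16-bit IPv4 "flags + fragment offset" field (RFC 791),
 * expressed in host byte order. */
#define IPV4_MF      0x2000u /* "more fragments" flag */
#define IPV4_OFFMASK 0x1fffu /* fragment offset, in 8-byte units */

/* True for any fragment: first, middle, or last. */
static inline bool ipv4_is_fragment(uint16_t frag_off_host)
{
	return (frag_off_host & (IPV4_MF | IPV4_OFFMASK)) != 0;
}

/* True only for the first fragment (MF set, offset zero). */
static inline bool ipv4_is_first_fragment(uint16_t frag_off_host)
{
	return (frag_off_host & IPV4_MF) &&
	       (frag_off_host & IPV4_OFFMASK) == 0;
}

/* The L4 header (and so the ports) sits at offset zero, so only
 * unfragmented packets and first fragments carry it. */
static inline bool ipv4_has_l4_header(uint16_t frag_off_host)
{
	return (frag_off_host & IPV4_OFFMASK) == 0;
}
```

A caller would pass ntohs() of the header's frag_off field. Every
fragment for which ipv4_has_l4_header() is false is exactly the case
where options 1 and 2 above have to guess.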

=== BPF related bits ===

Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.

The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
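
As a rough mental model of the refcnt idea (not the kernel's actual
data structures), the sketch below keeps defrag active for as long as
at least one user holds a reference. The struct and function names are
hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Toy stand-in for per-netns defrag state. */
struct defrag_state {
	unsigned int users;    /* links that currently request defrag */
	bool hooks_registered; /* whether the defrag hooks are installed */
};

/* Take a reference; the first user installs the hooks. */
static inline int defrag_enable(struct defrag_state *st)
{
	if (st->users++ == 0)
		st->hooks_registered = true;
	return 0;
}

/* Drop a reference; the last user removes the hooks. */
static inline void defrag_disable(struct defrag_state *st)
{
	if (st->users == 0)
		return; /* unbalanced disable: ignore in this toy model */
	if (--st->users == 0)
		st->hooks_registered = false;
}
```

In this model a link would call defrag_enable() on attach and
defrag_disable() on release, so reassembly stays on while any link
that asked for it is alive.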

=== Patchset details ===

There was an earlier attempt at providing defrag via kfuncs [1]. The
feedback was that we could end up doing too much stuff in prog execution
context (like sending ICMP error replies). However, I think there is
still some outstanding discussion w.r.t. performance when it comes to
netfilter vs the previous approach. I'll schedule some time during
office hours for this.

Patches 1 & 2 are stolen from Florian. Hopefully he doesn't mind. There
were some outstanding comments on the v2 [2] but it doesn't look like a
v3 was ever submitted. I've addressed the comments and put them in this
patchset because I needed them.

Finally, the new selftest seems to be a little flaky. I'm not quite
sure why the server will fail to `recvfrom()` occasionally. I'm fairly
sure it's a timing-related issue with creating veths. I'll keep
debugging but I didn't want that to hold up discussion on this patchset.


[0]: https://datatracker.ietf.org/doc/html/rfc8900
[1]: https://lore.kernel.org/bpf/cover.1677526810.git.dxu@dxuuu.xyz/
[2]: https://lore.kernel.org/bpf/20230525110100.8212-1-fw@strlen.de/

Daniel Xu (7):
  tools: libbpf: add netfilter link attach helper
  selftests/bpf: Add bpf_program__attach_netfilter helper test
  netfilter: defrag: Add glue hooks for enabling/disabling defrag
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests

 include/linux/netfilter.h                     |  12 +
 include/uapi/linux/bpf.h                      |   5 +
 net/ipv4/netfilter/nf_defrag_ipv4.c           |   8 +
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c     |  10 +
 net/netfilter/core.c                          |   6 +
 net/netfilter/nf_bpf_link.c                   | 108 ++++++-
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/lib/bpf/bpf.c                           |   8 +
 tools/lib/bpf/bpf.h                           |   6 +
 tools/lib/bpf/libbpf.c                        |  47 +++
 tools/lib/bpf/libbpf.h                        |  15 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 ++++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 ++++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 282 ++++++++++++++++++
 .../bpf/prog_tests/netfilter_basic.c          |  78 +++++
 .../selftests/bpf/progs/ip_check_defrag.c     | 104 +++++++
 .../bpf/progs/test_netfilter_link_attach.c    |  14 +
 21 files changed, 868 insertions(+), 21 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/netfilter_basic.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c

-- 
2.40.1



* [PATCH bpf-next 1/7] tools: libbpf: add netfilter link attach helper
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
@ 2023-06-26 23:02 ` Daniel Xu
  2023-06-27  0:11   ` Andrii Nakryiko
  2023-06-26 23:02 ` [PATCH bpf-next 2/7] selftests/bpf: Add bpf_program__attach_netfilter helper test Daniel Xu
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: daniel, ast, andrii, fw
  Cc: martin.lau, song, yhs, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, bpf, linux-kernel, netfilter-devel, dsahern,
	Andrii Nakryiko

Add new api function: bpf_program__attach_netfilter.

It takes a bpf program (netfilter type) and a pointer to an option
struct that contains the desired attachment (protocol family, priority,
hook location, ...).

It returns a pointer to a 'bpf_link' structure or NULL on error.
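
For readers unfamiliar with size-prefixed option structs, the following
is a simplified, standalone sketch of the pattern. The demo_* names are
hypothetical; libbpf's real OPTS_VALID/OPTS_GET macros are more
thorough (they also reject non-zero bytes past the fields the library
knows about).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical options struct mirroring the layout described above. */
struct demo_netfilter_opts {
	size_t sz; /* caller sets this to sizeof() of its own struct */
	uint32_t pf;
	uint32_t hooknum;
	int32_t priority;
	uint32_t flags;
};

/* A field is present if the caller's struct is big enough to hold it. */
#define DEMO_OPTS_HAS(opts, field)                                      \
	((opts) != NULL &&                                              \
	 (opts)->sz >= offsetof(struct demo_netfilter_opts, field) +    \
			       sizeof((opts)->field))

/* Read a field, or use a fallback if the caller's struct predates it. */
#define DEMO_OPTS_GET(opts, field, fallback) \
	(DEMO_OPTS_HAS(opts, field) ? (opts)->field : (fallback))
```

The point of the leading sz member is that a caller built against an
older header, whose struct lacks newer trailing fields, still works:
the missing fields simply read as their defaults.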

The next patch adds a new netfilter_basic test that uses this function
to attach a program to a few pf/hook/priority combinations.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/lib/bpf/bpf.c      |  8 +++++++
 tools/lib/bpf/bpf.h      |  6 +++++
 tools/lib/bpf/libbpf.c   | 47 ++++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h   | 15 +++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 77 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index ed86b37d8024..3b0da19715e1 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -741,6 +741,14 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, tracing))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_NETFILTER:
+		attr.link_create.netfilter.pf = OPTS_GET(opts, netfilter.pf, 0);
+		attr.link_create.netfilter.hooknum = OPTS_GET(opts, netfilter.hooknum, 0);
+		attr.link_create.netfilter.priority = OPTS_GET(opts, netfilter.priority, 0);
+		attr.link_create.netfilter.flags = OPTS_GET(opts, netfilter.flags, 0);
+		if (!OPTS_ZEROED(opts, netfilter))
+			return libbpf_err(-EINVAL);
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9aa0ee473754..c676295ab9bf 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -349,6 +349,12 @@ struct bpf_link_create_opts {
 		struct {
 			__u64 cookie;
 		} tracing;
+		struct {
+			__u32 pf;
+			__u32 hooknum;
+			__s32 priority;
+			__u32 flags;
+		} netfilter;
 	};
 	size_t :0;
 };
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 214f828ece6b..a8b9d5abb55f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -11811,6 +11811,53 @@ static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_l
 	return libbpf_get_error(*link);
 }
 
+struct bpf_link *bpf_program__attach_netfilter(const struct bpf_program *prog,
+					       const struct bpf_netfilter_opts *opts)
+{
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);
+	struct bpf_link *link;
+	int prog_fd, link_fd;
+
+	if (!OPTS_VALID(opts, bpf_netfilter_opts))
+		return libbpf_err_ptr(-EINVAL);
+
+	link_create_opts.netfilter.pf = OPTS_GET(opts, pf, 0);
+	link_create_opts.netfilter.hooknum = OPTS_GET(opts, hooknum, 0);
+	link_create_opts.netfilter.priority = OPTS_GET(opts, priority, 0);
+	link_create_opts.netfilter.flags = OPTS_GET(opts, flags, 0);
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		pr_warn("prog '%s': can't attach before loaded\n", prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return libbpf_err_ptr(-ENOMEM);
+	link->detach = &bpf_link__detach_fd;
+
+	link_fd = bpf_link_create(prog_fd, 0, BPF_NETFILTER, &link_create_opts);
+
+	link->fd = ensure_good_fd(link_fd);
+
+	if (link->fd < 0) {
+		char errmsg[STRERR_BUFSIZE];
+
+		link_fd = -errno;
+		free(link);
+		pr_warn("prog '%s': failed to attach to pf:%d,hooknum:%d:prio:%d: %s\n",
+			prog->name,
+			OPTS_GET(opts, pf, 0),
+			OPTS_GET(opts, hooknum, 0),
+			OPTS_GET(opts, priority, 0),
+			libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg)));
+		return libbpf_err_ptr(link_fd);
+	}
+
+	return link;
+}
+
 struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
 {
 	struct bpf_link *link = NULL;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 754da73c643b..10642ad69d76 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -718,6 +718,21 @@ LIBBPF_API struct bpf_link *
 bpf_program__attach_freplace(const struct bpf_program *prog,
 			     int target_fd, const char *attach_func_name);
 
+struct bpf_netfilter_opts {
+	/* size of this struct, for forward/backward compatibility */
+	size_t sz;
+
+	__u32 pf;
+	__u32 hooknum;
+	__s32 priority;
+	__u32 flags;
+};
+#define bpf_netfilter_opts__last_field flags
+
+LIBBPF_API struct bpf_link *
+bpf_program__attach_netfilter(const struct bpf_program *prog,
+			      const struct bpf_netfilter_opts *opts);
+
 struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7521a2fb7626..d9ec4407befa 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -395,4 +395,5 @@ LIBBPF_1.2.0 {
 LIBBPF_1.3.0 {
 	global:
 		bpf_obj_pin_opts;
+		bpf_program__attach_netfilter;
 } LIBBPF_1.2.0;
-- 
2.40.1



* [PATCH bpf-next 2/7] selftests/bpf: Add bpf_program__attach_netfilter helper test
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
  2023-06-26 23:02 ` [PATCH bpf-next 1/7] tools: libbpf: add netfilter link attach helper Daniel Xu
@ 2023-06-26 23:02 ` Daniel Xu
  2023-06-26 23:02 ` [PATCH bpf-next 3/7] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: ast, daniel, andrii, shuah, fw
  Cc: mykolal, martin.lau, song, yhs, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, linux-kernel, bpf, linux-kselftest,
	netfilter-devel, dsahern

Call bpf_program__attach_netfilter() with different
protocol/hook/priority combinations.

Test fails if supposedly-illegal attachments work
(e.g., bogus protocol family, illegal priority and so on)
or if a should-work attachment fails.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 .../bpf/prog_tests/netfilter_basic.c          | 78 +++++++++++++++++++
 .../bpf/progs/test_netfilter_link_attach.c    | 14 ++++
 2 files changed, 92 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/netfilter_basic.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c

diff --git a/tools/testing/selftests/bpf/prog_tests/netfilter_basic.c b/tools/testing/selftests/bpf/prog_tests/netfilter_basic.c
new file mode 100644
index 000000000000..357353fee19d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/netfilter_basic.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <netinet/in.h>
+#include <linux/netfilter.h>
+
+#include "test_progs.h"
+#include "test_netfilter_link_attach.skel.h"
+
+struct nf_hook_options {
+	__u32 pf;
+	__u32 hooknum;
+	__s32 priority;
+	__u32 flags;
+
+	bool expect_success;
+};
+
+struct nf_hook_options nf_hook_attach_tests[] = {
+	{  },
+	{ .pf = NFPROTO_NUMPROTO, },
+	{ .pf = NFPROTO_IPV4, .hooknum = 42, },
+	{ .pf = NFPROTO_IPV4, .priority = INT_MIN },
+	{ .pf = NFPROTO_IPV4, .priority = INT_MAX },
+	{ .pf = NFPROTO_IPV4, .flags = UINT_MAX },
+
+	{ .pf = NFPROTO_INET, .priority = 1, },
+
+	{ .pf = NFPROTO_IPV4, .priority = -10000, .expect_success = true },
+	{ .pf = NFPROTO_IPV6, .priority = 10001, .expect_success = true },
+};
+
+void test_netfilter_basic(void)
+{
+	struct test_netfilter_link_attach *skel;
+	LIBBPF_OPTS(bpf_netfilter_opts, opts);
+	struct bpf_program *prog;
+	int i;
+
+	skel = test_netfilter_link_attach__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_netfilter_link_attach__open_and_load"))
+		goto out;
+
+	prog = skel->progs.nf_link_attach_test;
+
+	for (i = 0; i < ARRAY_SIZE(nf_hook_attach_tests); i++) {
+		struct bpf_link *link;
+
+#define X(opts, m, i)	opts.m = nf_hook_attach_tests[(i)].m
+		X(opts, pf, i);
+		X(opts, hooknum, i);
+		X(opts, priority, i);
+		X(opts, flags, i);
+#undef X
+		link = bpf_program__attach_netfilter(prog, &opts);
+		if (nf_hook_attach_tests[i].expect_success) {
+			struct bpf_link *link2;
+
+			if (!ASSERT_OK_PTR(link, "program attach successful"))
+				continue;
+
+			link2 = bpf_program__attach_netfilter(prog, &opts);
+			ASSERT_ERR_PTR(link2, "attach program with same pf/hook/priority");
+
+			if (!ASSERT_OK(bpf_link__destroy(link), "link destroy"))
+				break;
+
+			link2 = bpf_program__attach_netfilter(prog, &opts);
+			if (!ASSERT_OK_PTR(link2, "program reattach successful"))
+				continue;
+			if (!ASSERT_OK(bpf_link__destroy(link2), "link destroy"))
+				break;
+		} else {
+			ASSERT_ERR_PTR(link, "program load failure");
+		}
+	}
+out:
+	test_netfilter_link_attach__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c b/tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c
new file mode 100644
index 000000000000..03a475160abe
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+#define NF_ACCEPT 1
+
+SEC("netfilter")
+int nf_link_attach_test(struct bpf_nf_ctx *ctx)
+{
+	return NF_ACCEPT;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.40.1



* [PATCH bpf-next 3/7] netfilter: defrag: Add glue hooks for enabling/disabling defrag
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
  2023-06-26 23:02 ` [PATCH bpf-next 1/7] tools: libbpf: add netfilter link attach helper Daniel Xu
  2023-06-26 23:02 ` [PATCH bpf-next 2/7] selftests/bpf: Add bpf_program__attach_netfilter helper test Daniel Xu
@ 2023-06-26 23:02 ` Daniel Xu
  2023-06-27 11:04   ` Florian Westphal
  2023-06-26 23:02 ` [PATCH bpf-next 4/7] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: edumazet, dsahern, kuba, fw, pabeni, pablo, davem, kadlec, daniel
  Cc: netfilter-devel, coreteam, linux-kernel, netdev, bpf

We want to be able to enable/disable IP packet defrag from core
bpf/netfilter code. In other words, execute code from core that could
possibly be built as a module.

To help avoid symbol resolution errors, use glue hooks that the modules
will register callbacks with during module init.
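
The glue-hook pattern can be sketched in plain C as below. This is a
userspace model with hypothetical demo_* names: RCU is left out, and
both the NULL check and the -ENOENT return value are assumptions of
the sketch rather than something this patch does.

```c
#include <errno.h>
#include <stddef.h>

/* Callback table that a (possibly modular) implementation provides. */
struct demo_defrag_hook {
	int (*enable)(void *net);
	void (*disable)(void *net);
};

/* Lives in "core" code; stays NULL until the module registers itself. */
static const struct demo_defrag_hook *demo_defrag_hook;

/* "Module" side: the real work plus the table that exposes it. */
static int demo_enable_calls;
static int demo_module_enable(void *net)
{
	(void)net;
	demo_enable_calls++;
	return 0;
}
static void demo_module_disable(void *net)
{
	(void)net;
	demo_enable_calls--;
}
static const struct demo_defrag_hook demo_module_hook = {
	.enable = demo_module_enable,
	.disable = demo_module_disable,
};

/* Called with the table at module init and with NULL at module exit. */
static inline void demo_register_hook(const struct demo_defrag_hook *h)
{
	demo_defrag_hook = h;
}

/* Core side: reaches the module only through the pointer, so core has
 * no link-time dependency on module symbols. */
static inline int demo_core_enable_defrag(void *net)
{
	const struct demo_defrag_hook *h = demo_defrag_hook;

	if (!h)
		return -ENOENT; /* module not loaded (assumed error code) */
	return h->enable(net);
}
```

Since the core code only ever dereferences the pointer, it builds and
links whether the implementation is built in, modular, or absent.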

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/linux/netfilter.h                 | 12 ++++++++++++
 net/ipv4/netfilter/nf_defrag_ipv4.c       |  8 ++++++++
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 10 ++++++++++
 net/netfilter/core.c                      |  6 ++++++
 4 files changed, 36 insertions(+)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 0762444e3767..1d68499de03e 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -481,6 +481,18 @@ struct nfnl_ct_hook {
 };
 extern const struct nfnl_ct_hook __rcu *nfnl_ct_hook;
 
+struct nf_defrag_v4_hook {
+	int (*enable)(struct net *net);
+	void (*disable)(struct net *net);
+};
+extern const struct nf_defrag_v4_hook __rcu *nf_defrag_v4_hook;
+
+struct nf_defrag_v6_hook {
+	int (*enable)(struct net *net);
+	void (*disable)(struct net *net);
+};
+extern const struct nf_defrag_v6_hook __rcu *nf_defrag_v6_hook;
+
 /**
  * nf_skb_duplicated - TEE target has sent a packet
  *
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index e61ea428ea18..436e629b0969 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -7,6 +7,7 @@
 #include <linux/ip.h>
 #include <linux/netfilter.h>
 #include <linux/module.h>
+#include <linux/rcupdate.h>
 #include <linux/skbuff.h>
 #include <net/netns/generic.h>
 #include <net/route.h>
@@ -113,17 +114,24 @@ static void __net_exit defrag4_net_exit(struct net *net)
 	}
 }
 
+static struct nf_defrag_v4_hook defrag_hook = {
+	.enable = nf_defrag_ipv4_enable,
+	.disable = nf_defrag_ipv4_disable,
+};
+
 static struct pernet_operations defrag4_net_ops = {
 	.exit = defrag4_net_exit,
 };
 
 static int __init nf_defrag_init(void)
 {
+	rcu_assign_pointer(nf_defrag_v4_hook, &defrag_hook);
 	return register_pernet_subsys(&defrag4_net_ops);
 }
 
 static void __exit nf_defrag_fini(void)
 {
+	rcu_assign_pointer(nf_defrag_v4_hook, NULL);
 	unregister_pernet_subsys(&defrag4_net_ops);
 }
 
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index cb4eb1d2c620..205fb692f524 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -10,6 +10,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/icmp.h>
+#include <linux/rcupdate.h>
 #include <linux/sysctl.h>
 #include <net/ipv6_frag.h>
 
@@ -96,6 +97,11 @@ static void __net_exit defrag6_net_exit(struct net *net)
 	}
 }
 
+static struct nf_defrag_v6_hook defrag_hook = {
+	.enable = nf_defrag_ipv6_enable,
+	.disable = nf_defrag_ipv6_disable,
+};
+
 static struct pernet_operations defrag6_net_ops = {
 	.exit = defrag6_net_exit,
 };
@@ -114,6 +120,9 @@ static int __init nf_defrag_init(void)
 		pr_err("nf_defrag_ipv6: can't register pernet ops\n");
 		goto cleanup_frag6;
 	}
+
+	rcu_assign_pointer(nf_defrag_v6_hook, &defrag_hook);
+
 	return ret;
 
 cleanup_frag6:
@@ -124,6 +133,7 @@ static int __init nf_defrag_init(void)
 
 static void __exit nf_defrag_fini(void)
 {
+	rcu_assign_pointer(nf_defrag_v6_hook, NULL);
 	unregister_pernet_subsys(&defrag6_net_ops);
 	nf_ct_frag6_cleanup();
 }
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 5f76ae86a656..34845155bb85 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -680,6 +680,12 @@ EXPORT_SYMBOL_GPL(nfnl_ct_hook);
 const struct nf_ct_hook __rcu *nf_ct_hook __read_mostly;
 EXPORT_SYMBOL_GPL(nf_ct_hook);
 
+const struct nf_defrag_v4_hook __rcu *nf_defrag_v4_hook __read_mostly;
+EXPORT_SYMBOL_GPL(nf_defrag_v4_hook);
+
+const struct nf_defrag_v6_hook __rcu *nf_defrag_v6_hook __read_mostly;
+EXPORT_SYMBOL_GPL(nf_defrag_v6_hook);
+
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 u8 nf_ctnetlink_has_listener;
 EXPORT_SYMBOL_GPL(nf_ctnetlink_has_listener);
-- 
2.40.1



* [PATCH bpf-next 4/7] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (2 preceding siblings ...)
  2023-06-26 23:02 ` [PATCH bpf-next 3/7] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
@ 2023-06-26 23:02 ` Daniel Xu
  2023-06-27 11:12   ` Florian Westphal
  2023-06-26 23:02 ` [PATCH bpf-next 5/7] bpf: selftests: Support not connecting client socket Daniel Xu
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: daniel, edumazet, kuba, fw, pabeni, pablo, andrii, davem, ast, kadlec
  Cc: martin.lau, song, yhs, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, bpf, linux-kernel, netfilter-devel, coreteam, netdev,
	dsahern

This commit adds support for enabling IP defrag using pre-existing
netfilter defrag support. Basically all the flag does is bump a refcnt
while the link is active. Checks are also added to ensure the prog
requesting defrag support is run _after_ netfilter defrag hooks.
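
The ordering check can be modeled as a standalone function, sketched
below. The DEMO_* constants are hypothetical stand-ins that mirror
NF_IP_PRI_FIRST (INT_MIN), NF_IP_PRI_CONNTRACK_DEFRAG (-400),
NF_IP_PRI_LAST (INT_MAX), and the new flag bit.

```c
#include <errno.h>
#include <limits.h>

/* Stand-ins for the kernel's netfilter hook priorities and flag. */
#define DEMO_PRI_FIRST            INT_MIN
#define DEMO_PRI_CONNTRACK_DEFRAG (-400)
#define DEMO_PRI_LAST             INT_MAX
#define DEMO_F_IP_DEFRAG          (1U << 0)

/* Returns 0 if the priority is acceptable, -ERANGE otherwise. */
static inline int demo_check_priority(int prio, unsigned int flags)
{
	/* The very first and very last slots are reserved. */
	if (prio == DEMO_PRI_FIRST || prio == DEMO_PRI_LAST)
		return -ERANGE;

	/* Lower priority values run earlier, so a prog that wants
	 * reassembled packets must sort strictly after the defrag hook. */
	if ((flags & DEMO_F_IP_DEFRAG) && prio <= DEMO_PRI_CONNTRACK_DEFRAG)
		return -ERANGE;

	return 0;
}
```

For example, a priority of -400 is fine without the flag but rejected
with it, while -399 is accepted either way.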

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/uapi/linux/bpf.h       |   5 ++
 net/netfilter/nf_bpf_link.c    | 108 +++++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h |   5 ++
 3 files changed, 107 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60a9d59beeab..04ac77481583 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1170,6 +1170,11 @@ enum bpf_link_type {
  */
 #define BPF_F_KPROBE_MULTI_RETURN	(1U << 0)
 
+/* link_create.netfilter.flags used in LINK_CREATE command for
+ * BPF_PROG_TYPE_NETFILTER to enable IP packet defragmentation.
+ */
+#define BPF_F_NETFILTER_IP_DEFRAG (1U << 0)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * the following extensions:
  *
diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index c36da56d756f..a8015dbce12a 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/kmod.h>
 #include <linux/netfilter.h>
 
 #include <net/netfilter/nf_bpf_link.h>
@@ -23,8 +24,77 @@ struct bpf_nf_link {
 	struct nf_hook_ops hook_ops;
 	struct net *net;
 	u32 dead;
+	bool defrag;
 };
 
+static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
+{
+	int err;
+
+	switch (link->hook_ops.pf) {
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+	case NFPROTO_IPV4:
+		const struct nf_defrag_v4_hook *v4_hook;
+
+		err = request_module("nf_defrag_ipv4");
+		if (err)
+			return err;
+
+		rcu_read_lock();
+		v4_hook = rcu_dereference(nf_defrag_v4_hook);
+		err = v4_hook->enable(link->net);
+		rcu_read_unlock();
+
+		return err;
+#endif
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
+	case NFPROTO_IPV6:
+		const struct nf_defrag_v6_hook *v6_hook;
+
+		err = request_module("nf_defrag_ipv6_hooks");
+		if (err)
+			return err;
+
+		rcu_read_lock();
+		v6_hook = rcu_dereference(nf_defrag_v6_hook);
+		err = v6_hook->enable(link->net);
+		rcu_read_unlock();
+
+		return err;
+#endif
+	default:
+		return -EAFNOSUPPORT;
+	}
+}
+
+static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
+{
+	switch (link->hook_ops.pf) {
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+	case NFPROTO_IPV4:
+		const struct nf_defrag_v4_hook *v4_hook;
+
+		rcu_read_lock();
+		v4_hook = rcu_dereference(nf_defrag_v4_hook);
+		v4_hook->disable(link->net);
+		rcu_read_unlock();
+
+		break;
+#endif
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
+	case NFPROTO_IPV6:
+		const struct nf_defrag_v6_hook *v6_hook;
+
+		rcu_read_lock();
+		v6_hook = rcu_dereference(nf_defrag_v6_hook);
+		v6_hook->disable(link->net);
+		rcu_read_unlock();
+
+		break;
+#endif
+	}
+}
+
 static void bpf_nf_link_release(struct bpf_link *link)
 {
 	struct bpf_nf_link *nf_link = container_of(link, struct bpf_nf_link, link);
@@ -37,6 +107,9 @@ static void bpf_nf_link_release(struct bpf_link *link)
 	 */
 	if (!cmpxchg(&nf_link->dead, 0, 1))
 		nf_unregister_net_hook(nf_link->net, &nf_link->hook_ops);
+
+	if (nf_link->defrag)
+		bpf_nf_disable_defrag(nf_link);
 }
 
 static void bpf_nf_link_dealloc(struct bpf_link *link)
@@ -92,6 +165,8 @@ static const struct bpf_link_ops bpf_nf_link_lops = {
 
 static int bpf_nf_check_pf_and_hooks(const union bpf_attr *attr)
 {
+	int prio;
+
 	switch (attr->link_create.netfilter.pf) {
 	case NFPROTO_IPV4:
 	case NFPROTO_IPV6:
@@ -102,19 +177,18 @@ static int bpf_nf_check_pf_and_hooks(const union bpf_attr *attr)
 		return -EAFNOSUPPORT;
 	}
 
-	if (attr->link_create.netfilter.flags)
+	if (attr->link_create.netfilter.flags & ~BPF_F_NETFILTER_IP_DEFRAG)
 		return -EOPNOTSUPP;
 
-	/* make sure conntrack confirm is always last.
-	 *
-	 * In the future, if userspace can e.g. request defrag, then
-	 * "defrag_requested && prio before NF_IP_PRI_CONNTRACK_DEFRAG"
-	 * should fail.
-	 */
-	switch (attr->link_create.netfilter.priority) {
-	case NF_IP_PRI_FIRST: return -ERANGE; /* sabotage_in and other warts */
-	case NF_IP_PRI_LAST: return -ERANGE; /* e.g. conntrack confirm */
-	}
+	/* make sure conntrack confirm is always last */
+	prio = attr->link_create.netfilter.priority;
+	if (prio == NF_IP_PRI_FIRST)
+		return -ERANGE;  /* sabotage_in and other warts */
+	else if (prio == NF_IP_PRI_LAST)
+		return -ERANGE;  /* e.g. conntrack confirm */
+	else if ((attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) &&
+		 (prio > NF_IP_PRI_FIRST && prio <= NF_IP_PRI_CONNTRACK_DEFRAG))
+		return -ERANGE;  /* cannot use defrag if prog runs before nf_defrag */
 
 	return 0;
 }
@@ -156,6 +230,18 @@ int bpf_nf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
 		return err;
 	}
 
+	if (attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) {
+		err = bpf_nf_enable_defrag(link);
+		if (err) {
+			bpf_link_cleanup(&link_primer);
+			return err;
+		}
+		/* only mark defrag enabled if enabling succeeds so cleanup path
+		 * doesn't disable without a corresponding enable
+		 */
+		link->defrag = true;
+	}
+
 	err = nf_register_net_hook(net, &link->hook_ops);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60a9d59beeab..04ac77481583 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1170,6 +1170,11 @@ enum bpf_link_type {
  */
 #define BPF_F_KPROBE_MULTI_RETURN	(1U << 0)
 
+/* link_create.netfilter.flags used in LINK_CREATE command for
+ * BPF_PROG_TYPE_NETFILTER to enable IP packet defragmentation.
+ */
+#define BPF_F_NETFILTER_IP_DEFRAG (1U << 0)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * the following extensions:
  *
-- 
2.40.1



* [PATCH bpf-next 5/7] bpf: selftests: Support not connecting client socket
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (3 preceding siblings ...)
  2023-06-26 23:02 ` [PATCH bpf-next 4/7] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
@ 2023-06-26 23:02 ` Daniel Xu
  2023-06-26 23:02 ` [PATCH bpf-next 6/7] bpf: selftests: Support custom type and proto for client sockets Daniel Xu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: ast, daniel, andrii, shuah, fw
  Cc: mykolal, martin.lau, song, yhs, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, bpf, linux-kselftest, linux-kernel,
	netfilter-devel, dsahern

For connectionless protocols or raw sockets we do not want to actually
connect() to the server.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/network_helpers.c | 5 +++--
 tools/testing/selftests/bpf/network_helpers.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index a105c0cd008a..d5c78c08903b 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -301,8 +301,9 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 		       strlen(opts->cc) + 1))
 		goto error_close;
 
-	if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
-		goto error_close;
+	if (!opts->noconnect)
+		if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
+			goto error_close;
 
 	return fd;
 
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 694185644da6..87894dc984dd 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -21,6 +21,7 @@ struct network_helper_opts {
 	const char *cc;
 	int timeout_ms;
 	bool must_fail;
+	bool noconnect;
 };
 
 /* ipv4 test vector */
-- 
2.40.1



* [PATCH bpf-next 6/7] bpf: selftests: Support custom type and proto for client sockets
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (4 preceding siblings ...)
  2023-06-26 23:02 ` [PATCH bpf-next 5/7] bpf: selftests: Support not connecting client socket Daniel Xu
@ 2023-06-26 23:02 ` Daniel Xu
  2023-06-26 23:02 ` [PATCH bpf-next 7/7] bpf: selftests: Add defrag selftests Daniel Xu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: daniel, ast, shuah, andrii, fw
  Cc: martin.lau, song, yhs, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, bpf, linux-kselftest, linux-kernel,
	netfilter-devel, dsahern

Extend connect_to_fd_opts() to take optional type and protocol
parameters for the client socket. These parameters are useful when
opening a raw socket to send IP fragments.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/network_helpers.c | 21 +++++++++++++------
 tools/testing/selftests/bpf/network_helpers.h |  2 ++
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index d5c78c08903b..910d5d0470e6 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -270,14 +270,23 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 		opts = &default_opts;
 
 	optlen = sizeof(type);
-	if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
-		log_err("getsockopt(SOL_TYPE)");
-		return -1;
+
+	if (opts->type) {
+		type = opts->type;
+	} else {
+		if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
+			log_err("getsockopt(SOL_TYPE)");
+			return -1;
+		}
 	}
 
-	if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
-		log_err("getsockopt(SOL_PROTOCOL)");
-		return -1;
+	if (opts->proto) {
+		protocol = opts->proto;
+	} else {
+		if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
+			log_err("getsockopt(SOL_PROTOCOL)");
+			return -1;
+		}
 	}
 
 	addrlen = sizeof(addr);
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 87894dc984dd..5eccc67d1a99 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -22,6 +22,8 @@ struct network_helper_opts {
 	int timeout_ms;
 	bool must_fail;
 	bool noconnect;
+	int type;
+	int proto;
 };
 
 /* ipv4 test vector */
-- 
2.40.1



* [PATCH bpf-next 7/7] bpf: selftests: Add defrag selftests
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (5 preceding siblings ...)
  2023-06-26 23:02 ` [PATCH bpf-next 6/7] bpf: selftests: Support custom type and proto for client sockets Daniel Xu
@ 2023-06-26 23:02 ` Daniel Xu
  2023-06-27 10:48 ` [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Florian Westphal
  2023-06-27 14:25 ` Toke Høiland-Jørgensen
  8 siblings, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-26 23:02 UTC (permalink / raw)
  To: ast, daniel, andrii, shuah, fw
  Cc: martin.lau, song, yhs, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, mykolal, linux-kernel, bpf, linux-kselftest,
	netfilter-devel, dsahern

These selftests test 2 major scenarios: that BPF-based defragmentation
can be done successfully, and that packet pointers are invalidated after
calls to the kfunc. The logic is similar for both ipv4 and ipv6.

In the first scenario, we create a UDP client and UDP echo server. The
server side is fairly straightforward: we attach the prog and simply
echo back the message.

On the client side, we send fragmented packets to the server and expect
the reassembled message back.
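
To make the first scenario concrete, here is a small Python sketch
(illustration only, not part of the patch) that decodes the three IPv4
fragments from the generated ip_check_defrag_frags.h and stitches them
back together. The fragment test mirrors is_frag_v4() in the BPF prog.
The function names frag_info, is_frag and reassemble are made up for
this sketch, and the reassembly assumes gap-free, non-overlapping
fragments, which is far simpler than what the kernel handles.

```python
import struct

IP_MF = 0x2000      # "more fragments" flag
IP_OFFSET = 0x1FFF  # fragment offset mask, offset is in 8-byte units

# frag_0, frag_1 and frag_2 copied from ip_check_defrag_frags.h
FRAGS = [
    bytes.fromhex(
        "4500002c000120004011" "ace800000000ac1001c8" "beeebeef003a00005448"
        "49532049532054484520" "4f524947"
    ),
    bytes.fromhex(
        "4500002c000120034011" "ace500000000ac1001c8" "494e414c204d45535341"
        "47452c20504c45415345" "20524541"
    ),
    bytes.fromhex(
        "4500001e000100064011" "ccf000000000ac1001c8" "5353454d424c45204d45"
    ),
]


def frag_info(pkt):
    """Return (header_len, more_fragments, byte_offset) for an IPv4 packet."""
    ihl = (pkt[0] & 0x0F) * 4
    (frag_off,) = struct.unpack_from("!H", pkt, 6)
    return ihl, bool(frag_off & IP_MF), (frag_off & IP_OFFSET) << 3


def is_frag(pkt):
    """Same test as is_frag_v4(): MF set or non-zero offset."""
    _, more, offset = frag_info(pkt)
    return more or offset != 0


def reassemble(frags):
    """Concatenate fragment payloads in offset order (no overlap handling)."""
    data = bytearray()
    for pkt in sorted(frags, key=lambda p: frag_info(p)[2]):
        ihl, _, offset = frag_info(pkt)
        if offset != len(data):
            raise ValueError("gap or overlap between fragments")
        data += pkt[ihl:]
    return bytes(data)


datagram = reassemble(FRAGS)
# Skip the 8-byte UDP header to get at the payload.
print(datagram[8:].decode())
```

All three packets count as fragments under this test (the last one has
MF clear but a non-zero offset), so the prog would drop every one of
them if netfilter handed it the packets before reassembly.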

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 ++++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 ++++
 .../bpf/prog_tests/ip_check_defrag.c          | 282 ++++++++++++++++++
 .../selftests/bpf/progs/ip_check_defrag.c     | 104 +++++++
 5 files changed, 535 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 538df8fb8c42..b47f20381d56 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -561,8 +561,8 @@ TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
 			 network_helpers.c testing_helpers.c		\
 			 btf_helpers.c flow_dissector_load.h		\
 			 cap_helpers.c test_loader.c xsk.c disasm.c	\
-			 json_writer.c unpriv_helpers.c
-
+			 json_writer.c unpriv_helpers.c 		\
+			 ip_check_defrag_frags.h
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
 		       $(OUTPUT)/xdp_synproxy				\
diff --git a/tools/testing/selftests/bpf/generate_udp_fragments.py b/tools/testing/selftests/bpf/generate_udp_fragments.py
new file mode 100755
index 000000000000..2b8a1187991c
--- /dev/null
+++ b/tools/testing/selftests/bpf/generate_udp_fragments.py
@@ -0,0 +1,90 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+This script helps generate fragmented UDP packets.
+
+While it is technically possible to dynamically generate
+fragmented packets in C, it is much harder to read and write
+said code. `scapy` is relatively industry standard and really
+easy to read / write.
+
+So we choose to write this script that generates a valid C
+header. Rerun script and commit generated file after any
+modifications.
+"""
+
+import argparse
+import os
+
+from scapy.all import *
+
+
+# These constants must stay in sync with `ip_check_defrag.c`
+VETH1_ADDR = "172.16.1.200"
+VETH0_ADDR6 = "fc00::100"
+VETH1_ADDR6 = "fc00::200"
+CLIENT_PORT = 48878
+SERVER_PORT = 48879
+MAGIC_MESSAGE = "THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME"
+
+
+def print_header(f):
+    f.write("// SPDX-License-Identifier: GPL-2.0\n")
+    f.write("/* DO NOT EDIT -- this file is generated */\n")
+    f.write("\n")
+    f.write("#ifndef _IP_CHECK_DEFRAG_FRAGS_H\n")
+    f.write("#define _IP_CHECK_DEFRAG_FRAGS_H\n")
+    f.write("\n")
+    f.write("#include <stdint.h>\n")
+    f.write("\n")
+
+
+def print_frags(f, frags, v6):
+    for idx, frag in enumerate(frags):
+        # 10 bytes per line to keep width in check
+        chunks = [frag[i : i + 10] for i in range(0, len(frag), 10)]
+        chunks_fmted = [", ".join([str(hex(b)) for b in chunk]) for chunk in chunks]
+        suffix = "6" if v6 else ""
+
+        f.write(f"static uint8_t frag{suffix}_{idx}[] = {{\n")
+        for chunk in chunks_fmted:
+            f.write(f"\t{chunk},\n")
+        f.write(f"}};\n")
+
+
+def print_trailer(f):
+    f.write("\n")
+    f.write("#endif /* _IP_CHECK_DEFRAG_FRAGS_H */\n")
+
+
+def main(f):
+    # srcip of 0 is filled in by IP_HDRINCL
+    sip = "0.0.0.0"
+    sip6 = VETH0_ADDR6
+    dip = VETH1_ADDR
+    dip6 = VETH1_ADDR6
+    sport = CLIENT_PORT
+    dport = SERVER_PORT
+    payload = MAGIC_MESSAGE.encode()
+
+    # Disable UDPv4 checksums to keep code simpler
+    pkt = IP(src=sip,dst=dip) / UDP(sport=sport,dport=dport,chksum=0) / Raw(load=payload)
+    # UDPv6 requires a checksum
+    # Also pin the ipv6 fragment header ID, otherwise it's a random value
+    pkt6 = IPv6(src=sip6,dst=dip6) / IPv6ExtHdrFragment(id=0xBEEF) / UDP(sport=sport,dport=dport) / Raw(load=payload)
+
+    frags = [f.build() for f in pkt.fragment(24)]
+    frags6 = [f.build() for f in fragment6(pkt6, 72)]
+
+    print_header(f)
+    print_frags(f, frags, False)
+    print_frags(f, frags6, True)
+    print_trailer(f)
+
+
+if __name__ == "__main__":
+    dir = os.path.dirname(os.path.realpath(__file__))
+    header = f"{dir}/ip_check_defrag_frags.h"
+    with open(header, "w") as f:
+        main(f)
diff --git a/tools/testing/selftests/bpf/ip_check_defrag_frags.h b/tools/testing/selftests/bpf/ip_check_defrag_frags.h
new file mode 100644
index 000000000000..70ab7e9fa22b
--- /dev/null
+++ b/tools/testing/selftests/bpf/ip_check_defrag_frags.h
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/* DO NOT EDIT -- this file is generated */
+
+#ifndef _IP_CHECK_DEFRAG_FRAGS_H
+#define _IP_CHECK_DEFRAG_FRAGS_H
+
+#include <stdint.h>
+
+static uint8_t frag_0[] = {
+	0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x0, 0x40, 0x11,
+	0xac, 0xe8, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+	0xbe, 0xee, 0xbe, 0xef, 0x0, 0x3a, 0x0, 0x0, 0x54, 0x48,
+	0x49, 0x53, 0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20,
+	0x4f, 0x52, 0x49, 0x47,
+};
+static uint8_t frag_1[] = {
+	0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x3, 0x40, 0x11,
+	0xac, 0xe5, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+	0x49, 0x4e, 0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41,
+	0x47, 0x45, 0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45,
+	0x20, 0x52, 0x45, 0x41,
+};
+static uint8_t frag_2[] = {
+	0x45, 0x0, 0x0, 0x1e, 0x0, 0x1, 0x0, 0x6, 0x40, 0x11,
+	0xcc, 0xf0, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+	0x53, 0x53, 0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45,
+};
+static uint8_t frag6_0[] = {
+	0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+	0x11, 0x0, 0x0, 0x1, 0x0, 0x0, 0xbe, 0xef, 0xbe, 0xee,
+	0xbe, 0xef, 0x0, 0x3a, 0xd0, 0xf8, 0x54, 0x48, 0x49, 0x53,
+	0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20, 0x4f, 0x52,
+	0x49, 0x47,
+};
+static uint8_t frag6_1[] = {
+	0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+	0x11, 0x0, 0x0, 0x19, 0x0, 0x0, 0xbe, 0xef, 0x49, 0x4e,
+	0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41, 0x47, 0x45,
+	0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45, 0x20, 0x52,
+	0x45, 0x41,
+};
+static uint8_t frag6_2[] = {
+	0x60, 0x0, 0x0, 0x0, 0x0, 0x12, 0x2c, 0x40, 0xfc, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+	0x11, 0x0, 0x0, 0x30, 0x0, 0x0, 0xbe, 0xef, 0x53, 0x53,
+	0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45,
+};
+
+#endif /* _IP_CHECK_DEFRAG_FRAGS_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
new file mode 100644
index 000000000000..5cd08d6e0ebc
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
@@ -0,0 +1,282 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <net/if.h>
+#include <linux/netfilter.h>
+#include <network_helpers.h>
+#include "ip_check_defrag.skel.h"
+#include "ip_check_defrag_frags.h"
+
+/*
+ * This selftest spins up a client and an echo server, each in their own
+ * network namespace. The client will send a fragmented message to the server.
+ * The prog attached to the server will shoot down any fragments. Thus, if
+ * the server is able to correctly echo back the message to the client, we will
+ * have verified that netfilter is reassembling packets for us.
+ *
+ * Topology:
+ * =========
+ *           NS0         |         NS1
+ *                       |
+ *         client        |       server
+ *       ----------      |     ----------
+ *       |  veth0  | --------- |  veth1  |
+ *       ----------    peer    ----------
+ *                       |
+ *                       |       with bpf
+ */
+
+#define NS0		"defrag_ns0"
+#define NS1		"defrag_ns1"
+#define VETH0		"veth0"
+#define VETH1		"veth1"
+#define VETH0_ADDR	"172.16.1.100"
+#define VETH0_ADDR6	"fc00::100"
+/* The following constants must stay in sync with `generate_udp_fragments.py` */
+#define VETH1_ADDR	"172.16.1.200"
+#define VETH1_ADDR6	"fc00::200"
+#define CLIENT_PORT	48878
+#define SERVER_PORT	48879
+#define MAGIC_MESSAGE	"THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME"
+
+static int setup_topology(bool ipv6)
+{
+	bool up;
+	int i;
+
+	SYS(fail, "ip netns add " NS0);
+	SYS(fail, "ip netns add " NS1);
+	SYS(fail, "ip link add " VETH0 " netns " NS0 " type veth peer name " VETH1 " netns " NS1);
+	if (ipv6) {
+		SYS(fail, "ip -6 -net " NS0 " addr add " VETH0_ADDR6 "/64 dev " VETH0 " nodad");
+		SYS(fail, "ip -6 -net " NS1 " addr add " VETH1_ADDR6 "/64 dev " VETH1 " nodad");
+	} else {
+		SYS(fail, "ip -net " NS0 " addr add " VETH0_ADDR "/24 dev " VETH0);
+		SYS(fail, "ip -net " NS1 " addr add " VETH1_ADDR "/24 dev " VETH1);
+	}
+	SYS(fail, "ip -net " NS0 " link set dev " VETH0 " up");
+	SYS(fail, "ip -net " NS1 " link set dev " VETH1 " up");
+
+	/* Wait for up to 5s for links to come up */
+	for (i = 0; i < 5; ++i) {
+		if (ipv6)
+			up = !system("ip netns exec " NS0 " ping -6 -c 1 -W 1 " VETH1_ADDR6 " >/dev/null 2>&1");
+		else
+			up = !system("ip netns exec " NS0 " ping -c 1 -W 1 " VETH1_ADDR " >/dev/null 2>&1");
+
+		if (up)
+			break;
+	}
+
+	return 0;
+fail:
+	return -1;
+}
+
+static void cleanup_topology(void)
+{
+	SYS_NOFAIL("test -f /var/run/netns/" NS0 " && ip netns delete " NS0);
+	SYS_NOFAIL("test -f /var/run/netns/" NS1 " && ip netns delete " NS1);
+}
+
+static int attach(struct ip_check_defrag *skel, bool ipv6)
+{
+	LIBBPF_OPTS(bpf_netfilter_opts, opts,
+		    .pf = ipv6 ? NFPROTO_IPV6 : NFPROTO_IPV4,
+		    .priority = 42,
+		    .flags = BPF_F_NETFILTER_IP_DEFRAG);
+	struct nstoken *nstoken;
+	int err = -1;
+
+	nstoken = open_netns(NS1);
+
+	skel->links.defrag = bpf_program__attach_netfilter(skel->progs.defrag, &opts);
+	if (!ASSERT_OK_PTR(skel->links.defrag, "program attach"))
+		goto out;
+
+	err = 0;
+out:
+	close_netns(nstoken);
+	return err;
+}
+
+static int send_frags(int client)
+{
+	struct sockaddr_storage saddr;
+	struct sockaddr *saddr_p;
+	socklen_t saddr_len;
+	int err;
+
+	saddr_p = (struct sockaddr *)&saddr;
+	err = make_sockaddr(AF_INET, VETH1_ADDR, SERVER_PORT, &saddr, &saddr_len);
+	if (!ASSERT_OK(err, "make_sockaddr"))
+		return -1;
+
+	err = sendto(client, frag_0, sizeof(frag_0), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag_0"))
+		return -1;
+
+	err = sendto(client, frag_1, sizeof(frag_1), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag_1"))
+		return -1;
+
+	err = sendto(client, frag_2, sizeof(frag_2), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag_2"))
+		return -1;
+
+	return 0;
+}
+
+static int send_frags6(int client)
+{
+	struct sockaddr_storage saddr;
+	struct sockaddr *saddr_p;
+	socklen_t saddr_len;
+	int err;
+
+	saddr_p = (struct sockaddr *)&saddr;
+	/* Port needs to be set to 0 for raw ipv6 socket for some reason */
+	err = make_sockaddr(AF_INET6, VETH1_ADDR6, 0, &saddr, &saddr_len);
+	if (!ASSERT_OK(err, "make_sockaddr"))
+		return -1;
+
+	err = sendto(client, frag6_0, sizeof(frag6_0), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag6_0"))
+		return -1;
+
+	err = sendto(client, frag6_1, sizeof(frag6_1), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag6_1"))
+		return -1;
+
+	err = sendto(client, frag6_2, sizeof(frag6_2), 0, saddr_p, saddr_len);
+	if (!ASSERT_GE(err, 0, "sendto frag6_2"))
+		return -1;
+
+	return 0;
+}
+
+void test_bpf_ip_check_defrag_ok(bool ipv6)
+{
+	struct network_helper_opts rx_opts = {
+		.timeout_ms = 1000,
+		.noconnect = true,
+	};
+	struct network_helper_opts tx_ops = {
+		.timeout_ms = 1000,
+		.type = SOCK_RAW,
+		.proto = IPPROTO_RAW,
+		.noconnect = true,
+	};
+	struct sockaddr_storage caddr;
+	struct ip_check_defrag *skel;
+	struct nstoken *nstoken;
+	int client_tx_fd = -1;
+	int client_rx_fd = -1;
+	socklen_t caddr_len = sizeof(caddr);
+	int srv_fd = -1;
+	char buf[1024];
+	int len, err;
+
+	skel = ip_check_defrag__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	if (!ASSERT_OK(setup_topology(ipv6), "setup_topology"))
+		goto out;
+
+	if (!ASSERT_OK(attach(skel, ipv6), "attach"))
+		goto out;
+
+	/* Start server in ns1 */
+	nstoken = open_netns(NS1);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns1"))
+		goto out;
+	srv_fd = start_server(ipv6 ? AF_INET6 : AF_INET, SOCK_DGRAM, NULL, SERVER_PORT, 0);
+	close_netns(nstoken);
+	if (!ASSERT_GE(srv_fd, 0, "start_server"))
+		goto out;
+
+	/* Open tx raw socket in ns0 */
+	nstoken = open_netns(NS0);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+		goto out;
+	client_tx_fd = connect_to_fd_opts(srv_fd, &tx_ops);
+	close_netns(nstoken);
+	if (!ASSERT_GE(client_tx_fd, 0, "connect_to_fd_opts"))
+		goto out;
+
+	/* Open rx socket in ns0 */
+	nstoken = open_netns(NS0);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+		goto out;
+	client_rx_fd = connect_to_fd_opts(srv_fd, &rx_opts);
+	close_netns(nstoken);
+	if (!ASSERT_GE(client_rx_fd, 0, "connect_to_fd_opts"))
+		goto out;
+
+	/* Bind rx socket to a premeditated port */
+	memset(&caddr, 0, sizeof(caddr));
+	nstoken = open_netns(NS0);
+	if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+		goto out;
+	if (ipv6) {
+		struct sockaddr_in6 *c = (struct sockaddr_in6 *)&caddr;
+
+		c->sin6_family = AF_INET6;
+		inet_pton(AF_INET6, VETH0_ADDR6, &c->sin6_addr);
+		c->sin6_port = htons(CLIENT_PORT);
+		err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c));
+	} else {
+		struct sockaddr_in *c = (struct sockaddr_in *)&caddr;
+
+		c->sin_family = AF_INET;
+		inet_pton(AF_INET, VETH0_ADDR, &c->sin_addr);
+		c->sin_port = htons(CLIENT_PORT);
+		err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c));
+	}
+	close_netns(nstoken);
+	if (!ASSERT_OK(err, "bind"))
+		goto out;
+
+	/* Send message in fragments */
+	if (ipv6) {
+		if (!ASSERT_OK(send_frags6(client_tx_fd), "send_frags6"))
+			goto out;
+	} else {
+		if (!ASSERT_OK(send_frags(client_tx_fd), "send_frags"))
+			goto out;
+	}
+
+	if (!ASSERT_EQ(skel->bss->shootdowns, 0, "shootdowns"))
+		goto out;
+
+	/* Receive reassembled msg on server and echo back to client */
+	len = recvfrom(srv_fd, buf, sizeof(buf), 0, (struct sockaddr *)&caddr, &caddr_len);
+	if (!ASSERT_GE(len, 0, "server recvfrom"))
+		goto out;
+	len = sendto(srv_fd, buf, len, 0, (struct sockaddr *)&caddr, caddr_len);
+	if (!ASSERT_GE(len, 0, "server sendto"))
+		goto out;
+
+	/* Expect reassembled message to be echoed back */
+	len = recvfrom(client_rx_fd, buf, sizeof(buf), 0, NULL, NULL);
+	if (!ASSERT_EQ(len, sizeof(MAGIC_MESSAGE) - 1, "client short read"))
+		goto out;
+
+out:
+	if (client_rx_fd != -1)
+		close(client_rx_fd);
+	if (client_tx_fd != -1)
+		close(client_tx_fd);
+	if (srv_fd != -1)
+		close(srv_fd);
+	cleanup_topology();
+	ip_check_defrag__destroy(skel);
+}
+
+void test_bpf_ip_check_defrag(void)
+{
+	if (test__start_subtest("v4"))
+		test_bpf_ip_check_defrag_ok(false);
+	if (test__start_subtest("v6"))
+		test_bpf_ip_check_defrag_ok(true);
+}
diff --git a/tools/testing/selftests/bpf/progs/ip_check_defrag.c b/tools/testing/selftests/bpf/progs/ip_check_defrag.c
new file mode 100644
index 000000000000..4259c6d59968
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/ip_check_defrag.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_tracing_net.h"
+
+#define NF_DROP			0
+#define NF_ACCEPT		1
+#define ETH_P_IP		0x0800
+#define ETH_P_IPV6		0x86DD
+#define IP_MF			0x2000
+#define IP_OFFSET		0x1FFF
+#define NEXTHDR_FRAGMENT	44
+
+extern int bpf_dynptr_from_skb(struct sk_buff *skb, __u64 flags,
+                               struct bpf_dynptr *ptr__uninit) __ksym;
+extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, uint32_t offset,
+			      void *buffer, uint32_t buffer__sz) __ksym;
+
+volatile int shootdowns = 0;
+
+static bool is_frag_v4(struct iphdr *iph)
+{
+	int offset;
+	int flags;
+
+	offset = bpf_ntohs(iph->frag_off);
+	flags = offset & ~IP_OFFSET;
+	offset &= IP_OFFSET;
+	offset <<= 3;
+
+	return (flags & IP_MF) || offset;
+}
+
+static bool is_frag_v6(struct ipv6hdr *ip6h)
+{
+	/* Simplifying assumption that there are no extension headers
+	 * between fixed header and fragmentation header. This assumption
+	 * is only valid in this test case. It saves us the hassle of
+	 * searching all potential extension headers.
+	 */
+	return ip6h->nexthdr == NEXTHDR_FRAGMENT;
+}
+
+static int handle_v4(struct sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	u8 iph_buf[20] = {};
+	struct iphdr *iph;
+
+	if (bpf_dynptr_from_skb(skb, 0, &ptr))
+		return NF_DROP;
+
+	iph = bpf_dynptr_slice(&ptr, 0, iph_buf, sizeof(iph_buf));
+	if (!iph)
+		return NF_DROP;
+
+	/* Shootdown any frags */
+	if (is_frag_v4(iph)) {
+		shootdowns++;
+		return NF_DROP;
+	}
+
+	return NF_ACCEPT;
+}
+
+static int handle_v6(struct sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	struct ipv6hdr *ip6h;
+	u8 ip6h_buf[40] = {};
+
+	if (bpf_dynptr_from_skb(skb, 0, &ptr))
+		return NF_DROP;
+
+	ip6h = bpf_dynptr_slice(&ptr, 0, ip6h_buf, sizeof(ip6h_buf));
+	if (!ip6h)
+		return NF_DROP;
+
+	/* Shootdown any frags */
+	if (is_frag_v6(ip6h)) {
+		shootdowns++;
+		return NF_DROP;
+	}
+
+	return NF_ACCEPT;
+}
+
+SEC("netfilter")
+int defrag(struct bpf_nf_ctx *ctx)
+{
+	struct sk_buff *skb = ctx->skb;
+
+	switch (bpf_ntohs(skb->protocol)) {
+	case ETH_P_IP:
+		return handle_v4(skb);
+	case ETH_P_IPV6:
+		return handle_v6(skb);
+	default:
+		return NF_ACCEPT;
+	}
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.40.1



* Re: [PATCH bpf-next 1/7] tools: libbpf: add netfilter link attach helper
  2023-06-26 23:02 ` [PATCH bpf-next 1/7] tools: libbpf: add netfilter link attach helper Daniel Xu
@ 2023-06-27  0:11   ` Andrii Nakryiko
  0 siblings, 0 replies; 22+ messages in thread
From: Andrii Nakryiko @ 2023-06-27  0:11 UTC (permalink / raw)
  To: Daniel Xu
  Cc: daniel, ast, andrii, fw, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, bpf, linux-kernel, netfilter-devel,
	dsahern

On Mon, Jun 26, 2023 at 4:02 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Add new api function: bpf_program__attach_netfilter.
>
> It takes a bpf program (netfilter type), and a pointer to an option struct
> that contains the desired attachment (protocol family, priority, hook
> location, ...).
>
> It returns a pointer to a 'bpf_link' structure or NULL on error.
>
> Next patch adds new netfilter_basic test that uses this function to
> attach a program to a few pf/hook/priority combinations.
>
> Co-developed-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>  tools/lib/bpf/bpf.c      |  8 +++++++
>  tools/lib/bpf/bpf.h      |  6 +++++
>  tools/lib/bpf/libbpf.c   | 47 ++++++++++++++++++++++++++++++++++++++++
>  tools/lib/bpf/libbpf.h   | 15 +++++++++++++
>  tools/lib/bpf/libbpf.map |  1 +
>  5 files changed, 77 insertions(+)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index ed86b37d8024..3b0da19715e1 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -741,6 +741,14 @@ int bpf_link_create(int prog_fd, int target_fd,
>                 if (!OPTS_ZEROED(opts, tracing))
>                         return libbpf_err(-EINVAL);
>                 break;
> +       case BPF_NETFILTER:
> +               attr.link_create.netfilter.pf = OPTS_GET(opts, netfilter.pf, 0);
> +               attr.link_create.netfilter.hooknum = OPTS_GET(opts, netfilter.hooknum, 0);
> +               attr.link_create.netfilter.priority = OPTS_GET(opts, netfilter.priority, 0);
> +               attr.link_create.netfilter.flags = OPTS_GET(opts, netfilter.flags, 0);
> +               if (!OPTS_ZEROED(opts, netfilter))
> +                       return libbpf_err(-EINVAL);
> +               break;
>         default:
>                 if (!OPTS_ZEROED(opts, flags))
>                         return libbpf_err(-EINVAL);
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 9aa0ee473754..c676295ab9bf 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -349,6 +349,12 @@ struct bpf_link_create_opts {
>                 struct {
>                         __u64 cookie;
>                 } tracing;
> +               struct {
> +                       __u32 pf;
> +                       __u32 hooknum;
> +                       __s32 priority;
> +                       __u32 flags;
> +               } netfilter;
>         };
>         size_t :0;
>  };
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 214f828ece6b..a8b9d5abb55f 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -11811,6 +11811,53 @@ static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_l
>         return libbpf_get_error(*link);
>  }
>
> +struct bpf_link *bpf_program__attach_netfilter(const struct bpf_program *prog,
> +                                              const struct bpf_netfilter_opts *opts)
> +{
> +       DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);

nit: let's use shorter LIBBPF_OPTS() macro

> +       struct bpf_link *link;
> +       int prog_fd, link_fd;
> +
> +       if (!OPTS_VALID(opts, bpf_netfilter_opts))
> +               return libbpf_err_ptr(-EINVAL);
> +
> +       link_create_opts.netfilter.pf = OPTS_GET(opts, pf, 0);
> +       link_create_opts.netfilter.hooknum = OPTS_GET(opts, hooknum, 0);
> +       link_create_opts.netfilter.priority = OPTS_GET(opts, priority, 0);
> +       link_create_opts.netfilter.flags = OPTS_GET(opts, flags, 0);
> +
> +       prog_fd = bpf_program__fd(prog);
> +       if (prog_fd < 0) {
> +               pr_warn("prog '%s': can't attach before loaded\n", prog->name);
> +               return libbpf_err_ptr(-EINVAL);
> +       }
> +
> +       link = calloc(1, sizeof(*link));
> +       if (!link)
> +               return libbpf_err_ptr(-ENOMEM);
> +       link->detach = &bpf_link__detach_fd;
> +
> +       link_fd = bpf_link_create(prog_fd, 0, BPF_NETFILTER, &link_create_opts);
> +
> +       link->fd = ensure_good_fd(link_fd);

bpf_link_create() does ensure_good_fd() already, no need to do it
here, just assign result directly


> +
> +       if (link->fd < 0) {
> +               char errmsg[STRERR_BUFSIZE];
> +
> +               link_fd = -errno;
> +               free(link);
> +               pr_warn("prog '%s': failed to attach to pf:%d,hooknum:%d:prio:%d: %s\n",

comma before prio? but also how necessary is it to emit all these? what
if we add another argument to opts, would we add it here as well?

I'd just go with "failed to attach netfilter" and keep it simple

> +                       prog->name,
> +                       OPTS_GET(opts, pf, 0),
> +                       OPTS_GET(opts, hooknum, 0),
> +                       OPTS_GET(opts, priority, 0),
> +                       libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg)));
> +               return libbpf_err_ptr(link_fd);
> +       }
> +
> +       return link;
> +}
> +
>  struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
>  {
>         struct bpf_link *link = NULL;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 754da73c643b..10642ad69d76 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -718,6 +718,21 @@ LIBBPF_API struct bpf_link *
>  bpf_program__attach_freplace(const struct bpf_program *prog,
>                              int target_fd, const char *attach_func_name);
>
> +struct bpf_netfilter_opts {
> +       /* size of this struct, for forward/backward compatibility */
> +       size_t sz;
> +
> +       __u32 pf;
> +       __u32 hooknum;
> +       __s32 priority;
> +       __u32 flags;
> +};
> +#define bpf_netfilter_opts__last_field flags
> +
> +LIBBPF_API struct bpf_link *
> +bpf_program__attach_netfilter(const struct bpf_program *prog,
> +                             const struct bpf_netfilter_opts *opts);
> +
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 7521a2fb7626..d9ec4407befa 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -395,4 +395,5 @@ LIBBPF_1.2.0 {
>  LIBBPF_1.3.0 {
>         global:
>                 bpf_obj_pin_opts;
> +               bpf_program__attach_netfilter;
>  } LIBBPF_1.2.0;
> --
> 2.40.1
>


* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (6 preceding siblings ...)
  2023-06-26 23:02 ` [PATCH bpf-next 7/7] bpf: selftests: Add defrag selftests Daniel Xu
@ 2023-06-27 10:48 ` Florian Westphal
  2023-06-27 14:18   ` Daniel Xu
  2023-06-27 14:25 ` Toke Høiland-Jørgensen
  8 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2023-06-27 10:48 UTC (permalink / raw)
  To: Daniel Xu
  Cc: bpf, netdev, linux-kernel, linux-kselftest, coreteam,
	netfilter-devel, fw, daniel, dsahern

Daniel Xu <dxu@dxuuu.xyz> wrote:
> Patches 1 & 2 are stolen from Florian. Hopefully he doesn't mind. There
> were some outstanding comments on the v2 [2] but it doesn't look like a
> v3 was ever submitted.  I've addressed the comments and put them in this
> patchset cuz I needed them.

I did not submit a v3 because I had to wait for the bpf -> bpf-next
merge to get "bpf: netfilter: Add BPF_NETFILTER bpf_attach_type".

Now that has been done so I will do v3 shortly.


* Re: [PATCH bpf-next 3/7] netfilter: defrag: Add glue hooks for enabling/disabling defrag
  2023-06-26 23:02 ` [PATCH bpf-next 3/7] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
@ 2023-06-27 11:04   ` Florian Westphal
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2023-06-27 11:04 UTC (permalink / raw)
  To: Daniel Xu
  Cc: edumazet, dsahern, kuba, fw, pabeni, pablo, davem, kadlec,
	daniel, netfilter-devel, coreteam, linux-kernel, netdev, bpf

Daniel Xu <dxu@dxuuu.xyz> wrote:
> diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
> index e61ea428ea18..436e629b0969 100644
> --- a/net/ipv4/netfilter/nf_defrag_ipv4.c
> +++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
> @@ -7,6 +7,7 @@
>  #include <linux/ip.h>
>  #include <linux/netfilter.h>
>  #include <linux/module.h>
> +#include <linux/rcupdate.h>
>  #include <linux/skbuff.h>
>  #include <net/netns/generic.h>
>  #include <net/route.h>
> @@ -113,17 +114,24 @@ static void __net_exit defrag4_net_exit(struct net *net)
>  	}
>  }
>  
> +static struct nf_defrag_v4_hook defrag_hook = {
> +	.enable = nf_defrag_ipv4_enable,
> +	.disable = nf_defrag_ipv4_disable,
> +};

Nit: static const, same for v6.

>  static struct pernet_operations defrag4_net_ops = {
>  	.exit = defrag4_net_exit,
>  };
>  
>  static int __init nf_defrag_init(void)
>  {
> +	rcu_assign_pointer(nf_defrag_v4_hook, &defrag_hook);
>  	return register_pernet_subsys(&defrag4_net_ops);

register_pernet failure results in nf_defrag_v4_hook pointing to
garbage.


* Re: [PATCH bpf-next 4/7] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-06-26 23:02 ` [PATCH bpf-next 4/7] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
@ 2023-06-27 11:12   ` Florian Westphal
  2023-06-27 15:35     ` Daniel Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2023-06-27 11:12 UTC (permalink / raw)
  To: Daniel Xu
  Cc: daniel, edumazet, kuba, fw, pabeni, pablo, andrii, davem, ast,
	kadlec, martin.lau, song, yhs, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, bpf, linux-kernel, netfilter-devel, coreteam,
	netdev, dsahern

Daniel Xu <dxu@dxuuu.xyz> wrote:
> +static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> +{
> +	int err;
> +
> +	switch (link->hook_ops.pf) {
> +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
> +	case NFPROTO_IPV4:
> +		const struct nf_defrag_v4_hook *v4_hook;
> +
> +		err = request_module("nf_defrag_ipv4");
> +		if (err)
> +			return err;
> +
> +		rcu_read_lock();
> +		v4_hook = rcu_dereference(nf_defrag_v4_hook);
> +		err = v4_hook->enable(link->net);
> +		rcu_read_unlock();

I'd reverse this: first try rcu_dereference(), then modprobe
if that returned NULL.

> +static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> +{
> +	switch (link->hook_ops.pf) {
> +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
> +	case NFPROTO_IPV4:
> +		const struct nf_defrag_v4_hook *v4_hook;
> +
> +		rcu_read_lock();
> +		v4_hook = rcu_dereference(nf_defrag_v4_hook);
> +		v4_hook->disable(link->net);
> +		rcu_read_unlock();

if (v4_hook)
	v4_hook->disable()

Else we get trouble on manual 'rmmod'.
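
A rough Python model of the two suggestions above, purely for
illustration: look the hook up first and only fall back to loading the
module when the lookup comes back empty, and tolerate a missing hook on
the disable path. The registry dict and the names load_module,
enable_defrag and disable_defrag are stand-ins invented here for the
RCU pointer, request_module() and the bpf_nf_*_defrag() helpers. The
-ENOENT return for a hook that is still missing after the load is an
assumption of this sketch. None of this is kernel code.

```python
import errno

# Stand-in for the RCU-protected nf_defrag_v4_hook pointer.
registry = {"v4_hook": None}
# Records every simulated module load so the ordering can be inspected.
load_calls = []


class Hook:
    """Stand-in for struct nf_defrag_v4_hook."""

    def __init__(self):
        self.users = 0

    def enable(self):
        self.users += 1
        return 0

    def disable(self):
        self.users -= 1


def load_module(name):
    """Stand-in for request_module(): loading registers the hook."""
    load_calls.append(name)
    registry["v4_hook"] = Hook()
    return 0


def enable_defrag():
    hook = registry["v4_hook"]
    if hook is None:
        # Only try to load the module when the hook is not there yet.
        err = load_module("nf_defrag_ipv4")
        if err:
            return err
        hook = registry["v4_hook"]
        if hook is None:
            return -errno.ENOENT
    return hook.enable()


def disable_defrag():
    hook = registry["v4_hook"]
    # The module may have been removed by hand in the meantime.
    if hook is not None:
        hook.disable()
```

In this model a second enable_defrag() call finds the hook already
registered and never reaches load_module(), and disable_defrag() is a
no-op once the hook has gone away.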

> +	/* make sure conntrack confirm is always last */
> +	prio = attr->link_create.netfilter.priority;
> +	if (prio == NF_IP_PRI_FIRST)
> +		return -ERANGE;  /* sabotage_in and other warts */
> +	else if (prio == NF_IP_PRI_LAST)
> +		return -ERANGE;  /* e.g. conntrack confirm */
> +	else if ((attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) &&
> +		 (prio > NF_IP_PRI_FIRST && prio <= NF_IP_PRI_CONNTRACK_DEFRAG))
> +		return -ERANGE;  /* cannot use defrag if prog runs before nf_defrag */

You could elide the 'prio > NF_IP_PRI_FIRST' check, it's already handled by
the first conditional.  Otherwise this looks good to me.
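
For what it's worth, the equivalence is easy to check outside the
kernel. The Python sketch below is illustrative only: the two function
names are made up, the priority constants are the values I recall from
include/uapi/linux/netfilter_ipv4.h (NF_IP_PRI_FIRST = INT_MIN,
NF_IP_PRI_CONNTRACK_DEFRAG = -400, NF_IP_PRI_LAST = INT_MAX), and the
flag value of 1 << 0 is an assumption.

```python
import errno

INT_MIN = -(2**31)
INT_MAX = 2**31 - 1

NF_IP_PRI_FIRST = INT_MIN
NF_IP_PRI_CONNTRACK_DEFRAG = -400
NF_IP_PRI_LAST = INT_MAX
BPF_F_NETFILTER_IP_DEFRAG = 1 << 0  # assumed value for this sketch


def check_prio_as_posted(prio, flags):
    """Priority check as written in the quoted hunk."""
    if prio == NF_IP_PRI_FIRST:
        return -errno.ERANGE
    if prio == NF_IP_PRI_LAST:
        return -errno.ERANGE
    if (flags & BPF_F_NETFILTER_IP_DEFRAG) and (
        prio > NF_IP_PRI_FIRST and prio <= NF_IP_PRI_CONNTRACK_DEFRAG
    ):
        return -errno.ERANGE
    return 0


def check_prio_simplified(prio, flags):
    """Same check with the redundant lower bound dropped."""
    if prio in (NF_IP_PRI_FIRST, NF_IP_PRI_LAST):
        return -errno.ERANGE
    if (flags & BPF_F_NETFILTER_IP_DEFRAG) and prio <= NF_IP_PRI_CONNTRACK_DEFRAG:
        return -errno.ERANGE
    return 0
```

Both versions reject a defrag-enabled prog at priority -400 or lower
and accept the selftest's priority of 42, since by the time the third
branch runs prio can no longer equal NF_IP_PRI_FIRST.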


* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-27 10:48 ` [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Florian Westphal
@ 2023-06-27 14:18   ` Daniel Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-27 14:18 UTC (permalink / raw)
  To: Florian Westphal
  Cc: bpf, netdev, linux-kernel, linux-kselftest, coreteam,
	netfilter-devel, daniel, dsahern

Hi Florian,

On Tue, Jun 27, 2023 at 12:48:20PM +0200, Florian Westphal wrote:
> Daniel Xu <dxu@dxuuu.xyz> wrote:
> > Patches 1 & 2 are stolen from Florian. Hopefully he doesn't mind. There
> > were some outstanding comments on the v2 [2] but it doesn't look like a
> > v3 was ever submitted.  I've addressed the comments and put them in this
> > patchset cuz I needed them.
> 
> I did not submit a v3 because I had to wait for the bpf -> bpf-next
> merge to get "bpf: netfilter: Add BPF_NETFILTER bpf_attach_type".
> 
> Now that has been done so I will do v3 shortly.

Ack. Will wait for your patches to go in before sending my v2.

Thanks,
Daniel

* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
                   ` (7 preceding siblings ...)
  2023-06-27 10:48 ` [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Florian Westphal
@ 2023-06-27 14:25 ` Toke Høiland-Jørgensen
  2023-06-27 14:51   ` Daniel Xu
  2023-06-27 15:44   ` Florian Westphal
  8 siblings, 2 replies; 22+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-27 14:25 UTC (permalink / raw)
  To: Daniel Xu, bpf, netdev, linux-kernel, linux-kselftest, coreteam,
	netfilter-devel, fw, daniel
  Cc: dsahern

> The basic idea is we bump a refcnt on the netfilter defrag module and
> then run the bpf prog after the defrag module runs. This allows bpf
> progs to transparently see full, reassembled packets. The nice thing
> about this is that progs don't have to carry around logic to detect
> fragments.

One high-level comment after glancing through the series: Instead of
allocating a flag specifically for the defrag module, why not support
loading (and holding) arbitrary netfilter modules in the UAPI? If we
need to allocate a new flag every time someone wants to use a netfilter
module along with BPF we'll run out of flags pretty quickly :)

-Toke


* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-27 14:25 ` Toke Høiland-Jørgensen
@ 2023-06-27 14:51   ` Daniel Xu
  2023-06-27 15:44   ` Florian Westphal
  1 sibling, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-27 14:51 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, netdev, linux-kernel, linux-kselftest, coreteam,
	netfilter-devel, fw, daniel, dsahern

Hi Toke,

Thanks for taking a look at the patchset.

On Tue, Jun 27, 2023 at 04:25:13PM +0200, Toke Høiland-Jørgensen wrote:
> > The basic idea is we bump a refcnt on the netfilter defrag module and
> > then run the bpf prog after the defrag module runs. This allows bpf
> > progs to transparently see full, reassembled packets. The nice thing
> > about this is that progs don't have to carry around logic to detect
> > fragments.
> 
> One high-level comment after glancing through the series: Instead of
> allocating a flag specifically for the defrag module, why not support
> loading (and holding) arbitrary netfilter modules in the UAPI? If we
> need to allocate a new flag every time someone wants to use a netfilter
> module along with BPF we'll run out of flags pretty quickly :)

I don't have enough context on netfilter in general to say if it'd be
generically useful -- perhaps Florian can comment on that.

However, I'm not sure such a mechanism removes the need for a flag. The
netfilter defrag modules still need to be called into to bump the refcnt.

The module could export some kfuncs to inc/dec the refcnt, but it'd be
rather odd for prog code to think about the lifetime of the attachment
(as inc/dec for _each_ prog execution seems wasteful and slow).  AFAIK
all the other resource acquire/release APIs are for a single prog
execution.

So a flag for link attach feels the most natural to me. We could always
add a flag2 field or something right?

[...]

Thanks,
Daniel

* Re: [PATCH bpf-next 4/7] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  2023-06-27 11:12   ` Florian Westphal
@ 2023-06-27 15:35     ` Daniel Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-27 15:35 UTC (permalink / raw)
  To: Florian Westphal
  Cc: daniel, edumazet, kuba, pabeni, pablo, andrii, davem, ast,
	kadlec, martin.lau, song, yhs, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, bpf, linux-kernel, netfilter-devel, coreteam,
	netdev, dsahern

On Tue, Jun 27, 2023 at 01:12:48PM +0200, Florian Westphal wrote:
> Daniel Xu <dxu@dxuuu.xyz> wrote:
> > +static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
> > +{
> > +	int err;
> > +
> > +	switch (link->hook_ops.pf) {
> > +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
> > +	case NFPROTO_IPV4:
> > +		const struct nf_defrag_v4_hook *v4_hook;
> > +
> > +		err = request_module("nf_defrag_ipv4");
> > +		if (err)
> > +			return err;
> > +
> > +		rcu_read_lock();
> > +		v4_hook = rcu_dereference(nf_defrag_v4_hook);
> > +		err = v4_hook->enable(link->net);
> > +		rcu_read_unlock();
> 
> I'd reverse this, first try rcu_dereference(), then modprobe
> if thats returned NULL.

Ack.

> 
> > +static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> > +{
> > +	switch (link->hook_ops.pf) {
> > +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
> > +	case NFPROTO_IPV4:
> > +		const struct nf_defrag_v4_hook *v4_hook;
> > +
> > +		rcu_read_lock();
> > +		v4_hook = rcu_dereference(nf_defrag_v4_hook);
> > +		v4_hook->disable(link->net);
> > +		rcu_read_unlock();
> 
> if (v4_hook)
> 	v4_hook->disable()
> 
> Else we get trouble on manual 'rmmod'.

Ah good catch, thanks.

> 
> > +	/* make sure conntrack confirm is always last */
> > +	prio = attr->link_create.netfilter.priority;
> > +	if (prio == NF_IP_PRI_FIRST)
> > +		return -ERANGE;  /* sabotage_in and other warts */
> > +	else if (prio == NF_IP_PRI_LAST)
> > +		return -ERANGE;  /* e.g. conntrack confirm */
> > +	else if ((attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) &&
> > +		 (prio > NF_IP_PRI_FIRST && prio <= NF_IP_PRI_CONNTRACK_DEFRAG))
> > +		return -ERANGE;  /* cannot use defrag if prog runs before nf_defrag */
> 
> You could elide the (prio > NF_IP_PRI_FIRST, its already handled by
> first conditional.  Otherwise this looks good to me.
> 

Ah, right. It's INT_MIN.


Thanks,
Daniel

* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-27 14:25 ` Toke Høiland-Jørgensen
  2023-06-27 14:51   ` Daniel Xu
@ 2023-06-27 15:44   ` Florian Westphal
  2023-06-29 12:16     ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2023-06-27 15:44 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Xu, bpf, netdev, linux-kernel, linux-kselftest, coreteam,
	netfilter-devel, fw, daniel, dsahern

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > The basic idea is we bump a refcnt on the netfilter defrag module and
> > then run the bpf prog after the defrag module runs. This allows bpf
> > progs to transparently see full, reassembled packets. The nice thing
> > about this is that progs don't have to carry around logic to detect
> > fragments.
> 
> One high-level comment after glancing through the series: Instead of
> allocating a flag specifically for the defrag module, why not support
> loading (and holding) arbitrary netfilter modules in the UAPI?

How would that work/look like?

defrag (and conntrack) need special handling because loading these
modules has no effect on the datapath.

Traditionally, yes, loading was enough, but now with netns being
ubiquitous we don't want these to get enabled unless needed.

Ignoring bpf, this happens when user adds nftables/iptables rules
that check for conntrack state, use some form of NAT or use e.g. tproxy.

For bpf a flag during link attachment seemed like the best way
to go.

At the moment I only see two flags for this, namely
"need defrag" and "need conntrack".

For conntrack, we MIGHT not need a flag at all; maybe the verifier
could "guess" based on the kfuncs used.

But for defrag, I don't think it's good to add a dummy do-nothing
kfunc just for expressing the dependency on the bpf prog side.

* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-27 15:44   ` Florian Westphal
@ 2023-06-29 12:16     ` Toke Høiland-Jørgensen
  2023-06-29 13:21       ` Florian Westphal
  0 siblings, 1 reply; 22+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-29 12:16 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Daniel Xu, bpf, netdev, linux-kernel, linux-kselftest, coreteam,
	netfilter-devel, fw, daniel, dsahern

Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> > The basic idea is we bump a refcnt on the netfilter defrag module and
>> > then run the bpf prog after the defrag module runs. This allows bpf
>> > progs to transparently see full, reassembled packets. The nice thing
>> > about this is that progs don't have to carry around logic to detect
>> > fragments.
>> 
>> One high-level comment after glancing through the series: Instead of
>> allocating a flag specifically for the defrag module, why not support
>> loading (and holding) arbitrary netfilter modules in the UAPI?
>
> How would that work/look like?
>
> defrag (and conntrack) need special handling because loading these
> modules has no effect on the datapath.
>
> Traditionally, yes, loading was enough, but now with netns being
> ubiquitous we don't want these to get enabled unless needed.
>
> Ignoring bpf, this happens when user adds nftables/iptables rules
> that check for conntrack state, use some form of NAT or use e.g. tproxy.
>
> For bpf a flag during link attachment seemed like the best way
> to go.

Right, I wasn't disputing that having a flag to load a module was a good
idea. On the contrary, I was thinking we'd need many more of these
if/when BPF wants to take advantage of more netfilter code. Say, if a
BPF module wants to call into TPROXY, that module would also need to be
loaded and kept around, no?

I was thinking something along the lines of just having a field
'netfilter_modules[]' where userspace could put an arbitrary number of
module names into, and we'd load all of them and put a ref into the
bpf_link. In principle, we could just have that be a string array of
module names, but that's probably a bit cumbersome (and, well, building
a generic module loader interface into the bpf_link API is not
desirable either). But maybe with an explicit ENUM?

> At the moment I only see two flags for this, namely
> "need defrag" and "need conntrack".
>
> For conntrack, we MIGHT be able to not need a flag but
> maybe verifier could "guess" based on kfuncs used.

If the verifier can just identify the modules from the kfuncs and do the
whole thing automatically, that would of course be even better from an
ease-of-use PoV. Not sure what that would take, though? I seem to recall
having discussions along these lines before that fell down on various
points.

> But for defrag, I don't think its good to add a dummy do-nothing
> kfunc just for expressing the dependency on bpf prog side.

Agreed.

-Toke


* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-29 12:16     ` Toke Høiland-Jørgensen
@ 2023-06-29 13:21       ` Florian Westphal
  2023-06-29 14:35         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2023-06-29 13:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, Daniel Xu, bpf, netdev, linux-kernel,
	linux-kselftest, coreteam, netfilter-devel, daniel, dsahern

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Florian Westphal <fw@strlen.de> writes:
> > For bpf a flag during link attachment seemed like the best way
> > to go.
> 
> Right, I wasn't disputing that having a flag to load a module was a good
> idea. On the contrary, I was thinking we'd need many more of these
> if/when BPF wants to take advantage of more netfilter code. Say, if a
> BPF module wants to call into TPROXY, that module would also need go be
> loaded and kept around, no?

That seems to be a different topic that has nothing to do with
either bpf_link or netfilter?

If the program calls into say, TPROXY, then I'd expect that this needs
to be handled via kfuncs, no? Or if I misunderstand, what do you mean
by "call into TPROXY"?

And if so, that's already handled at bpf_prog load time, not
at link creation time, or do I miss something here?

AFAIU, if prog uses such kfuncs, verifier will grab needed module ref
and if module isn't loaded the kfuncs won't be found and program load
fails.

> I was thinking something along the lines of just having a field
> 'netfilter_modules[]' where userspace could put an arbitrary number of
> module names into, and we'd load all of them and put a ref into the
> bpf_link.

Why?  I fail to understand the connection between bpf_link, netfilter
and modules.  What makes netfilter so special that we need such a
module array, and what does that have to do with bpf_link interface?

> In principle, we could just have that be a string array of
> module names, but that's probably a bit cumbersome (and, well, building
> a generic module loader interface into the bpf_like API is not
> desirable either). But maybe with an explicit ENUM?

What functionality does that provide? I can't think of a single module
where this functionality is needed.

Either we're talking about future kfuncs; then, as far as I understand
how kfuncs work, this is handled at bpf_prog load time, not when the
bpf_link is created.

Or we are talking about implicit dependencies, where program doesn't
call function X but needs functionality handled earlier in the pipeline?

The only two instances I know of where this is the case for netfilter
are defrag + conntrack.

> > For conntrack, we MIGHT be able to not need a flag but
> > maybe verifier could "guess" based on kfuncs used.
> 
> If the verifier can just identify the modules from the kfuncs and do the
> whole thing automatically, that would of course be even better from an
> ease-of-use PoV. Not sure what that would take, though? I seem to recall
> having discussions around these lines before that fell down on various
> points.

AFAICS the conntrack kfuncs are wired to nf_conntrack already, so I
would expect that the module has to be loaded already for the verifier
to accept the program.

Those kfuncs are not yet exposed to NETFILTER program types.
Once they are, all that would be needed is for the netfilter bpf_link
to be able to detect that the prog is calling into those kfuncs, and
then make the needed register/unregister calls to enable the conntrack
hooks.

Whether that's better than using an explicit "please turn on conntrack
for me" flag, I don't know.  Perhaps future bpf programs could access
skb->_nfct directly without kfuncs, so I'd say the flag is a better
approach from a uapi point of view.

* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-29 13:21       ` Florian Westphal
@ 2023-06-29 14:35         ` Toke Høiland-Jørgensen
  2023-06-29 14:53           ` Florian Westphal
  0 siblings, 1 reply; 22+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-29 14:35 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Florian Westphal, Daniel Xu, bpf, netdev, linux-kernel,
	linux-kselftest, coreteam, netfilter-devel, daniel, dsahern

Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> Florian Westphal <fw@strlen.de> writes:
>> > For bpf a flag during link attachment seemed like the best way
>> > to go.
>> 
>> Right, I wasn't disputing that having a flag to load a module was a good
>> idea. On the contrary, I was thinking we'd need many more of these
>> if/when BPF wants to take advantage of more netfilter code. Say, if a
>> BPF module wants to call into TPROXY, that module would also need go be
>> loaded and kept around, no?
>
> That seems to be a different topic that has nothing to do with
> either bpf_link or netfilter?
>
> If the program calls into say, TPROXY, then I'd expect that this needs
> to be handled via kfuncs, no? Or if I misunderstand, what do you mean
> by "call into TPROXY"?
>
> And if so, thats already handled at bpf_prog load time, not
> at link creation time, or do I miss something here?
>
> AFAIU, if prog uses such kfuncs, verifier will grab needed module ref
> and if module isn't loaded the kfuncs won't be found and program load
> fails.

...

> Or we are talking about implicit dependencies, where program doesn't
> call function X but needs functionality handled earlier in the pipeline?
>
> The only two instances I know where this is the case for netfilter
> is defrag + conntrack.

Well, I was kinda mixing the two cases above, sorry about that. The
"kfuncs locking the module" was not present in my mind when starting to
talk about that bit...

As for the original question, that's answered by your point above: If
those two modules are the only ones that are likely to need this, then a
flag for each is fine by me - that was the key piece I was missing (I'm
not a netfilter expert, as you well know).

Thanks for clarifying, and apologies for the muddled thinking! :)

-Toke


* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-29 14:35         ` Toke Høiland-Jørgensen
@ 2023-06-29 14:53           ` Florian Westphal
  2023-06-29 17:59             ` Daniel Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2023-06-29 14:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, Daniel Xu, bpf, netdev, linux-kernel,
	linux-kselftest, coreteam, netfilter-devel, daniel, dsahern

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Florian Westphal <fw@strlen.de> writes:
> As for the original question, that's answered by your point above: If
> those two modules are the only ones that are likely to need this, then a
> flag for each is fine by me - that was the key piece I was missing (I'm
> not a netfilter expert, as you well know).

No problem, I was worried I was missing an important piece of kfunc
plumbing :-)

You do raise a good point though.  With kfuncs, module is pinned.
So, should a "please turn on defrag for this bpf_link" pin
the defrag modules too?

For plain netfilter we don't do that, i.e. you can just do
"rmmod nf_defrag_ipv4".  But I suspect that for the new bpf-link
defrag we probably should grab a reference to prevent unwanted
functionality breakage of the bpf prog.
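
To make the idea concrete, here is a small userspace model of "the link
holds a module reference". None of this is kernel code: `fake_module`,
`fake_link_attach()`, `fake_link_detach()` and `fake_rmmod()` are
hypothetical names that only imitate the try_module_get()/module_put()
pattern, and the -EBUSY / -ENOENT return values are choices made for
the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of a loadable module with a reference count. */
struct fake_module {
	int refcnt;
	bool loaded;
};

/* Attaching a link with defrag enabled takes a reference on the module. */
static int fake_link_attach(struct fake_module *mod)
{
	if (!mod->loaded)
		return -ENOENT; /* assumed error for "module not there" */

	mod->refcnt++;
	return 0;
}

/* Destroying the link drops the reference again. */
static void fake_link_detach(struct fake_module *mod)
{
	mod->refcnt--;
}

/* Unloading is refused while any link still holds a reference. */
static int fake_rmmod(struct fake_module *mod)
{
	if (mod->refcnt > 0)
		return -EBUSY; /* assumed error for "module in use" */

	mod->loaded = false;
	return 0;
}
```

In this model an rmmod while a link is attached fails, and it only goes
through once the link is gone, which is the behaviour I'd expect users
of the flag to want.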

* Re: [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF
  2023-06-29 14:53           ` Florian Westphal
@ 2023-06-29 17:59             ` Daniel Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Xu @ 2023-06-29 17:59 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Toke Høiland-Jørgensen, bpf, netdev, linux-kernel,
	linux-kselftest, coreteam, netfilter-devel, daniel, dsahern

On Thu, Jun 29, 2023 at 04:53:15PM +0200, Florian Westphal wrote:
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > Florian Westphal <fw@strlen.de> writes:
> > As for the original question, that's answered by your point above: If
> > those two modules are the only ones that are likely to need this, then a
> > flag for each is fine by me - that was the key piece I was missing (I'm
> > not a netfilter expert, as you well know).
> 
> No problem, I was worried I was missing an important piece of kfunc
> plumbing :-)
> 
> You do raise a good point though.  With kfuncs, module is pinned.
> So, should a "please turn on defrag for this bpf_link" pin
> the defrag modules too?
> 
> For plain netfilter we don't do that, i.e. you can just do
> "rmmod nf_defrag_ipv4".  But I suspect that for the new bpf-link
> defrag we probably should grab a reference to prevent unwanted
> functionality breakage of the bpf prog.

Ack. Will add to v3.

Thanks,
Daniel

Thread overview: 22+ messages
2023-06-26 23:02 [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Daniel Xu
2023-06-26 23:02 ` [PATCH bpf-next 1/7] tools: libbpf: add netfilter link attach helper Daniel Xu
2023-06-27  0:11   ` Andrii Nakryiko
2023-06-26 23:02 ` [PATCH bpf-next 2/7] selftests/bpf: Add bpf_program__attach_netfilter helper test Daniel Xu
2023-06-26 23:02 ` [PATCH bpf-next 3/7] netfilter: defrag: Add glue hooks for enabling/disabling defrag Daniel Xu
2023-06-27 11:04   ` Florian Westphal
2023-06-26 23:02 ` [PATCH bpf-next 4/7] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link Daniel Xu
2023-06-27 11:12   ` Florian Westphal
2023-06-27 15:35     ` Daniel Xu
2023-06-26 23:02 ` [PATCH bpf-next 5/7] bpf: selftests: Support not connecting client socket Daniel Xu
2023-06-26 23:02 ` [PATCH bpf-next 6/7] bpf: selftests: Support custom type and proto for client sockets Daniel Xu
2023-06-26 23:02 ` [PATCH bpf-next 7/7] bpf: selftests: Add defrag selftests Daniel Xu
2023-06-27 10:48 ` [PATCH bpf-next 0/7] Support defragmenting IPv(4|6) packets in BPF Florian Westphal
2023-06-27 14:18   ` Daniel Xu
2023-06-27 14:25 ` Toke Høiland-Jørgensen
2023-06-27 14:51   ` Daniel Xu
2023-06-27 15:44   ` Florian Westphal
2023-06-29 12:16     ` Toke Høiland-Jørgensen
2023-06-29 13:21       ` Florian Westphal
2023-06-29 14:35         ` Toke Høiland-Jørgensen
2023-06-29 14:53           ` Florian Westphal
2023-06-29 17:59             ` Daniel Xu