All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf-next v3 0/2] Fix missing synack in BPF cgroup_skb filters
@ 2023-06-20 17:14 Kui-Feng Lee
  2023-06-20 17:14 ` [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress Kui-Feng Lee
  2023-06-20 17:14 ` [PATCH bpf-next v3 2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets Kui-Feng Lee
  0 siblings, 2 replies; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-20 17:14 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, daniel, yhs,
	kpsingh, shuah, john.fastabend, sdf, mykolal, linux-kselftest,
	jolsa, haoluo
  Cc: Kui-Feng Lee

TCP SYN/ACK packets of connections from processes/sockets outside a
cgroup on the same host are not received by the cgroup's installed
cgroup_skb filters.

There were two BPF cgroup_skb programs attached to a cgroup named
"my_cgroup".

    SEC("cgroup_skb/ingress")
    int ingress(struct __sk_buff *skb)
    {
        /* .... process skb ... */
        return 1;
    }

    SEC("cgroup_skb/egress")
    int egress(struct __sk_buff *skb)
    {
        /* .... process skb ... */
        return 1;
    
    }

We discovered that when running the command "nc -6 -l 8000" in
"my_group" and connecting to it from outside of "my_cgroup" with the
command "nc -6 localhost 8000", the egress filter did not detect the
SYN/ACK packet. However, we did observe the SYN/ACK packet at the
ingress when connecting from a socket in "my_cgroup" to a socket
outside of it.

We came across BPF_CGROUP_RUN_PROG_INET_EGRESS(). This macro is
responsible for calling BPF programs that are attached to the egress
hook of a cgroup and it skips programs if the sending socket is not the
owner of the skb. Specifically, in our situation, the SYN/ACK
skb is owned by a struct request_sock instance, but the sending
socket is the listener socket we use to receive incoming
connections. The request_sock is created to manage an incoming
connection.

It has been determined that checking the owner of a skb against
the sending socket is not required. Removing this check will allow the
filters to receive SYN/ACK packets.

To ensure that cgroup_skb filters can receive all signaling packets,
including SYN, SYN/ACK, ACK, FIN, and FIN/ACK. A new self-test has
been added as well.

Changes from v2:

 - Remove redundant blank lines.

Changes from v1:

 - Check the number of observed packets instead of just sleeping.

 - Use ASSERT_XXX() instead of CHECK()/

[v1] https://lore.kernel.org/all/20230612191641.441774-1-kuifeng@meta.com/
[v2] https://lore.kernel.org/all/20230617052756.640916-2-kuifeng@meta.com/

Kui-Feng Lee (2):
  net: bpf: Always call BPF cgroup filters for egress.
  selftests/bpf: Verify that the cgroup_skb filters receive expected
    packets.

 include/linux/bpf-cgroup.h                    |   2 +-
 tools/testing/selftests/bpf/cgroup_helpers.c  |  12 +
 tools/testing/selftests/bpf/cgroup_helpers.h  |   1 +
 tools/testing/selftests/bpf/cgroup_tcp_skb.h  |  35 ++
 .../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 399 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_tcp_skb.c      | 382 +++++++++++++++++
 6 files changed, 830 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-20 17:14 [PATCH bpf-next v3 0/2] Fix missing synack in BPF cgroup_skb filters Kui-Feng Lee
@ 2023-06-20 17:14 ` Kui-Feng Lee
  2023-06-22  3:37   ` Yonghong Song
  2023-06-20 17:14 ` [PATCH bpf-next v3 2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets Kui-Feng Lee
  1 sibling, 1 reply; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-20 17:14 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, daniel, yhs,
	kpsingh, shuah, john.fastabend, sdf, mykolal, linux-kselftest,
	jolsa, haoluo
  Cc: Kui-Feng Lee

Always call BPF filters if CGROUP BPF is enabled for EGRESS without
checking skb->sk against sk.

The filters were called only if skb is owned by the sock that the
skb is sent out through.  In another words, skb->sk should point to
the sock that it is sending through its egress.  However, the filters would
miss SYNACK skbs that they are owned by a request_sock but sent through
the listening sock, that is the socket listening incoming connections.
This is an unnecessary restrict.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf-cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 57e9e109257e..e656da531f9f 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
 ({									       \
 	int __ret = 0;							       \
-	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
+	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
 		typeof(sk) __sk = sk_to_full_sk(sk);			       \
 		if (sk_fullsock(__sk) &&				       \
 		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf-next v3 2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets.
  2023-06-20 17:14 [PATCH bpf-next v3 0/2] Fix missing synack in BPF cgroup_skb filters Kui-Feng Lee
  2023-06-20 17:14 ` [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress Kui-Feng Lee
@ 2023-06-20 17:14 ` Kui-Feng Lee
  2023-06-22  4:15   ` Yonghong Song
  1 sibling, 1 reply; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-20 17:14 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, daniel, yhs,
	kpsingh, shuah, john.fastabend, sdf, mykolal, linux-kselftest,
	jolsa, haoluo
  Cc: Kui-Feng Lee

This test case includes four scenarios:
1. Connect to the server from outside the cgroup and close the connection
   from outside the cgroup.
2. Connect to the server from outside the cgroup and close the connection
   from inside the cgroup.
3. Connect to the server from inside the cgroup and close the connection
   from outside the cgroup.
4. Connect to the server from inside the cgroup and close the connection
   from inside the cgroup.

The test case is to verify that cgroup_skb/{egress, ingress} filters
receive expected packets including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 tools/testing/selftests/bpf/cgroup_helpers.c  |  12 +
 tools/testing/selftests/bpf/cgroup_helpers.h  |   1 +
 tools/testing/selftests/bpf/cgroup_tcp_skb.h  |  35 ++
 .../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 399 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_tcp_skb.c      | 382 +++++++++++++++++
 5 files changed, 829 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 9e95b37a7dff..2caee8423ee0 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -277,6 +277,18 @@ int join_cgroup(const char *relative_path)
 	return join_cgroup_from_top(cgroup_path);
 }
 
+/**
+ * join_root_cgroup() - Join the root cgroup
+ *
+ * This function joins the root cgroup.
+ *
+ * On success, it returns 0, otherwise on failure it returns 1.
+ */
+int join_root_cgroup(void)
+{
+	return join_cgroup_from_top(CGROUP_MOUNT_PATH);
+}
+
 /**
  * join_parent_cgroup() - Join a cgroup in the parent process workdir
  * @relative_path: The cgroup path, relative to parent process workdir, to join
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index f099a166c94d..5c2cb9c8b546 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -22,6 +22,7 @@ void remove_cgroup(const char *relative_path);
 unsigned long long get_cgroup_id(const char *relative_path);
 
 int join_cgroup(const char *relative_path);
+int join_root_cgroup(void);
 int join_parent_cgroup(const char *relative_path);
 
 int setup_cgroup_environment(void);
diff --git a/tools/testing/selftests/bpf/cgroup_tcp_skb.h b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
new file mode 100644
index 000000000000..1054b3633983
--- /dev/null
+++ b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Facebook */
+
+/* Define states of a socket to tracking messages sending to and from the
+ * socket.
+ *
+ * These states are based on rfc9293 with some modifications to support
+ * tracking of messages sent out from a socket. For example, when a SYN is
+ * received, a new socket is transiting to the SYN_RECV state defined in
+ * rfc9293. But, we put it in SYN_RECV_SENDING_SYN_ACK state and when
+ * SYN-ACK is sent out, it moves to SYN_RECV state. With this modification,
+ * we can track the message sent out from a socket.
+ */
+
+#ifndef __CGROUP_TCP_SKB_H__
+#define __CGROUP_TCP_SKB_H__
+
+enum {
+	INIT,
+	CLOSED,
+	SYN_SENT,
+	SYN_RECV_SENDING_SYN_ACK,
+	SYN_RECV,
+	ESTABLISHED,
+	FIN_WAIT1,
+	FIN_WAIT2,
+	CLOSE_WAIT_SENDING_ACK,
+	CLOSE_WAIT,
+	CLOSING,
+	LAST_ACK,
+	TIME_WAIT_SENDING_ACK,
+	TIME_WAIT,
+};
+
+#endif /* __CGROUP_TCP_SKB_H__ */
diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
new file mode 100644
index 000000000000..1b78e8ab3f02
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
@@ -0,0 +1,399 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Facebook */
+#include <test_progs.h>
+#include <linux/in6.h>
+#include <sys/socket.h>
+#include <sched.h>
+#include <unistd.h>
+#include "cgroup_helpers.h"
+#include "testing_helpers.h"
+#include "cgroup_tcp_skb.skel.h"
+#include "cgroup_tcp_skb.h"
+
+#define CGROUP_TCP_SKB_PATH "/test_cgroup_tcp_skb"
+
+static int install_filters(int cgroup_fd,
+			   struct bpf_link **egress_link,
+			   struct bpf_link **ingress_link,
+			   struct bpf_program *egress_prog,
+			   struct bpf_program *ingress_prog,
+			   struct cgroup_tcp_skb *skel)
+{
+	/* Prepare filters */
+	skel->bss->g_sock_state = 0;
+	skel->bss->g_unexpected = 0;
+	*egress_link =
+		bpf_program__attach_cgroup(egress_prog,
+					   cgroup_fd);
+	if (!ASSERT_NEQ(*egress_link, NULL, "egress_link"))
+		return -1;
+	*ingress_link =
+		bpf_program__attach_cgroup(ingress_prog,
+					   cgroup_fd);
+	if (!ASSERT_NEQ(*ingress_link, NULL, "ingress_link"))
+		return -1;
+
+	return 0;
+}
+
+static void uninstall_filters(struct bpf_link **egress_link,
+			      struct bpf_link **ingress_link)
+{
+	bpf_link__destroy(*egress_link);
+	*egress_link = NULL;
+	bpf_link__destroy(*ingress_link);
+	*ingress_link = NULL;
+}
+
+static int create_client_sock_v6(void)
+{
+	int fd;
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return -1;
+	}
+
+	return fd;
+}
+
+static int create_server_sock_v6(void)
+{
+	struct sockaddr_in6 addr = {
+		.sin6_family = AF_INET6,
+		.sin6_port = htons(0),
+		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
+	};
+	int fd, err;
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return -1;
+	}
+
+	err = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
+	if (err < 0) {
+		perror("bind");
+		return -1;
+	}
+
+	err = listen(fd, 1);
+	if (err < 0) {
+		perror("listen");
+		return -1;
+	}
+
+	return fd;
+}
+
+static int get_sock_port_v6(int fd)
+{
+	struct sockaddr_in6 addr;
+	socklen_t len;
+	int err;
+
+	len = sizeof(addr);
+	err = getsockname(fd, (struct sockaddr *)&addr, &len);
+	if (err < 0) {
+		perror("getsockname");
+		return -1;
+	}
+
+	return ntohs(addr.sin6_port);
+}
+
+static int connect_client_server_v6(int client_fd, int listen_fd)
+{
+	struct sockaddr_in6 addr = {
+		.sin6_family = AF_INET6,
+		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
+	};
+	int err;
+
+	addr.sin6_port = htons(get_sock_port_v6(listen_fd));
+	if (addr.sin6_port < 0)
+		return -1;
+
+	err = connect(client_fd, (struct sockaddr *)&addr, sizeof(addr));
+	if (err < 0) {
+		perror("connect");
+		return -1;
+	}
+
+	return 0;
+}
+
+/* Connect to the server in a cgroup from the outside of the cgroup. */
+static int talk_to_cgroup(int *client_fd, int *listen_fd, int *service_fd,
+			  struct cgroup_tcp_skb *skel)
+{
+	int err, cp;
+	char buf[5];
+
+	/* Create client & server socket */
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	*client_fd = create_client_sock_v6();
+	if (!ASSERT_GE(*client_fd, 0, "client_fd"))
+		return -1;
+	err = join_cgroup(CGROUP_TCP_SKB_PATH);
+	if (!ASSERT_OK(err, "join_cgroup"))
+		return -1;
+	*listen_fd = create_server_sock_v6();
+	if (!ASSERT_GE(*listen_fd, 0, "listen_fd"))
+		return -1;
+	skel->bss->g_sock_port = get_sock_port_v6(*listen_fd);
+
+	/* Connect client to server */
+	err = connect_client_server_v6(*client_fd, *listen_fd);
+	if (!ASSERT_OK(err, "connect_client_server_v6"))
+		return -1;
+	*service_fd = accept(*listen_fd, NULL, NULL);
+	if (!ASSERT_GE(*service_fd, 0, "service_fd"))
+		return -1;
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	cp = write(*client_fd, "hello", 5);
+	if (!ASSERT_EQ(cp, 5, "write"))
+		return -1;
+	cp = read(*service_fd, buf, 5);
+	if (!ASSERT_EQ(cp, 5, "read"))
+		return -1;
+
+	return 0;
+}
+
+/* Connect to the server out of a cgroup from inside the cgroup. */
+static int talk_to_outside(int *client_fd, int *listen_fd, int *service_fd,
+			   struct cgroup_tcp_skb *skel)
+
+{
+	int err, cp;
+	char buf[5];
+
+	/* Create client & server socket */
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	*listen_fd = create_server_sock_v6();
+	if (!ASSERT_GE(*listen_fd, 0, "listen_fd"))
+		return -1;
+	err = join_cgroup(CGROUP_TCP_SKB_PATH);
+	if (!ASSERT_OK(err, "join_cgroup"))
+		return -1;
+	*client_fd = create_client_sock_v6();
+	if (!ASSERT_GE(*client_fd, 0, "client_fd"))
+		return -1;
+	err = join_root_cgroup();
+	if (!ASSERT_OK(err, "join_root_cgroup"))
+		return -1;
+	skel->bss->g_sock_port = get_sock_port_v6(*listen_fd);
+
+	/* Connect client to server */
+	err = connect_client_server_v6(*client_fd, *listen_fd);
+	if (!ASSERT_OK(err, "connect_client_server_v6"))
+		return -1;
+	*service_fd = accept(*listen_fd, NULL, NULL);
+	if (!ASSERT_GE(*service_fd, 0, "service_fd"))
+		return -1;
+	cp = write(*client_fd, "hello", 5);
+	if (!ASSERT_EQ(cp, 5, "write"))
+		return -1;
+	cp = read(*service_fd, buf, 5);
+	if (!ASSERT_EQ(cp, 5, "read"))
+		return -1;
+
+	return 0;
+}
+
+static int close_connection(int *closing_fd, int *peer_fd, int *listen_fd,
+			    struct cgroup_tcp_skb *skel)
+{
+	__u32 saved_packet_count = 0;
+	int err;
+	int i;
+
+	/* Wait for ACKs to be sent */
+	saved_packet_count = skel->bss->g_packet_count;
+	usleep(100000);		/* 0.1s */
+	while (skel->bss->g_packet_count != saved_packet_count) {
+		saved_packet_count = skel->bss->g_packet_count;
+		usleep(100000);	/* 0.1s */
+	}
+
+	skel->bss->g_packet_count = 0;
+	saved_packet_count = 0;
+
+	/* Half shutdown to make sure the closing socket having a chance to
+	 * receive a FIN from the peer.
+	 */
+	err = shutdown(*closing_fd, SHUT_WR);
+	if (!ASSERT_OK(err, "shutdown closing_fd"))
+		return -1;
+
+	/* Wait for FIN and the ACK of the FIN to be observed */
+	for (i = 0;
+	     skel->bss->g_packet_count < saved_packet_count + 2 && i < 10;
+	     i++) {
+		usleep(100000);	/* 0.1s */
+	}
+	if (!ASSERT_GE(skel->bss->g_packet_count, saved_packet_count + 2,
+		       "packet_count"))
+		return -1;
+
+	saved_packet_count = skel->bss->g_packet_count;
+
+	/* Fully shutdown the connection */
+	err = close(*peer_fd);
+	if (!ASSERT_OK(err, "close peer_fd"))
+		return -1;
+	*peer_fd = -1;
+
+	/* Wait for FIN and the ACK of the FIN to be observed */
+	for (i = 0;
+	     skel->bss->g_packet_count < saved_packet_count + 2 && i < 10;
+	     i++) {
+		usleep(100000);	/* 0.1s */
+	}
+	if (!ASSERT_GE(skel->bss->g_packet_count, saved_packet_count + 2,
+		       "packet_count"))
+		return -1;
+
+	err = close(*closing_fd);
+	if (!ASSERT_OK(err, "close closing_fd"))
+		return -1;
+	*closing_fd = -1;
+
+	close(*listen_fd);
+	*listen_fd = -1;
+
+	return 0;
+}
+
+/* This test case includes four scenarios:
+ * 1. Connect to the server from outside the cgroup and close the connection
+ *    from outside the cgroup.
+ * 2. Connect to the server from outside the cgroup and close the connection
+ *    from inside the cgroup.
+ * 3. Connect to the server from inside the cgroup and close the connection
+ *    from outside the cgroup.
+ * 4. Connect to the server from inside the cgroup and close the connection
+ *    from inside the cgroup.
+ *
+ * The test case is to verify that cgroup_skb/{egress,ingress} filters
+ * receive expected packets including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.
+ */
+void test_cgroup_tcp_skb(void)
+{
+	struct bpf_link *ingress_link = NULL;
+	struct bpf_link *egress_link = NULL;
+	int client_fd = -1, listen_fd = -1;
+	struct cgroup_tcp_skb *skel;
+	int service_fd = -1;
+	int cgroup_fd = -1;
+	int err;
+
+	skel = cgroup_tcp_skb__open_and_load();
+	if (!ASSERT_OK(!skel, "skel_open_load"))
+		return;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	cgroup_fd = create_and_get_cgroup(CGROUP_TCP_SKB_PATH);
+	if (!ASSERT_GE(cgroup_fd, 0, "cgroup_fd"))
+		goto cleanup;
+
+	/* Scenario 1 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.server_egress,
+			      skel->progs.server_ingress,
+			      skel);
+	if (!ASSERT_OK(err, "install_filters"))
+		goto cleanup;
+
+	err = talk_to_cgroup(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_cgroup"))
+		goto cleanup;
+
+	err = close_connection(&client_fd, &service_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, CLOSED, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+	/* Scenario 2 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.server_egress_srv,
+			      skel->progs.server_ingress_srv,
+			      skel);
+
+	err = talk_to_cgroup(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_cgroup"))
+		goto cleanup;
+
+	err = close_connection(&service_fd, &client_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, TIME_WAIT, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+	/* Scenario 3 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.client_egress_srv,
+			      skel->progs.client_ingress_srv,
+			      skel);
+
+	err = talk_to_outside(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_outside"))
+		goto cleanup;
+
+	err = close_connection(&service_fd, &client_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, CLOSED, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+	/* Scenario 4 */
+	err = install_filters(cgroup_fd, &egress_link, &ingress_link,
+			      skel->progs.client_egress,
+			      skel->progs.client_ingress,
+			      skel);
+
+	err = talk_to_outside(&client_fd, &listen_fd, &service_fd, skel);
+	if (!ASSERT_OK(err, "talk_to_outside"))
+		goto cleanup;
+
+	err = close_connection(&client_fd, &service_fd, &listen_fd, skel);
+	if (!ASSERT_OK(err, "close_connection"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->g_unexpected, 0, "g_unexpected");
+	ASSERT_EQ(skel->bss->g_sock_state, TIME_WAIT, "g_sock_state");
+
+	uninstall_filters(&egress_link, &ingress_link);
+
+cleanup:
+	close(client_fd);
+	close(listen_fd);
+	close(service_fd);
+	close(cgroup_fd);
+	bpf_link__destroy(egress_link);
+	bpf_link__destroy(ingress_link);
+	cleanup_cgroup_environment();
+	cgroup_tcp_skb__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c b/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
new file mode 100644
index 000000000000..372a1548798c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Facebook */
+#include <linux/bpf.h>
+#include <bpf/bpf_endian.h>
+#include <bpf/bpf_helpers.h>
+
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include "cgroup_tcp_skb.h"
+
+char _license[] SEC("license") = "GPL";
+
+__u16 g_sock_port = 0;
+__u32 g_sock_state = 0;
+int g_unexpected = 0;
+__u32 g_packet_count = 0;
+
+int needed_tcp_pkt(struct __sk_buff *skb, struct tcphdr *tcph)
+{
+	struct ipv6hdr ip6h;
+
+	if (skb->protocol != bpf_htons(ETH_P_IPV6))
+		return 0;
+	if (bpf_skb_load_bytes(skb, 0, &ip6h, sizeof(ip6h)))
+		return 0;
+
+	if (ip6h.nexthdr != IPPROTO_TCP)
+		return 0;
+
+	if (bpf_skb_load_bytes(skb, sizeof(ip6h), tcph, sizeof(*tcph)))
+		return 0;
+
+	if (tcph->source != bpf_htons(g_sock_port) &&
+	    tcph->dest != bpf_htons(g_sock_port))
+		return 0;
+
+	return 1;
+}
+
+/* Run accept() on a socket in the cgroup to receive a new connection. */
+static int egress_accept(struct tcphdr *tcph)
+{
+	if (g_sock_state ==  SYN_RECV_SENDING_SYN_ACK) {
+		if (tcph->fin || !tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = SYN_RECV;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int ingress_accept(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case INIT:
+		if (!tcph->syn || tcph->fin || tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = SYN_RECV_SENDING_SYN_ACK;
+		break;
+	case SYN_RECV:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = ESTABLISHED;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Run connect() on a socket in the cgroup to start a new connection. */
+static int egress_connect(struct tcphdr *tcph)
+{
+	if (g_sock_state == INIT) {
+		if (!tcph->syn || tcph->fin || tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = SYN_SENT;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int ingress_connect(struct tcphdr *tcph)
+{
+	if (g_sock_state == SYN_SENT) {
+		if (tcph->fin || !tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = ESTABLISHED;
+		return 1;
+	}
+
+	return 0;
+}
+
+/* The connection is closed by the peer outside the cgroup. */
+static int egress_close_remote(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		break;
+	case CLOSE_WAIT_SENDING_ACK:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = CLOSE_WAIT;
+		break;
+	case CLOSE_WAIT:
+		if (!tcph->fin)
+			g_unexpected++;
+		else
+			g_sock_state = LAST_ACK;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int ingress_close_remote(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		if (tcph->fin)
+			g_sock_state = CLOSE_WAIT_SENDING_ACK;
+		break;
+	case LAST_ACK:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = CLOSED;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+/* The connection is closed by the endpoint inside the cgroup. */
+static int egress_close_local(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		if (tcph->fin)
+			g_sock_state = FIN_WAIT1;
+		break;
+	case TIME_WAIT_SENDING_ACK:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = TIME_WAIT;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int ingress_close_local(struct tcphdr *tcph)
+{
+	switch (g_sock_state) {
+	case ESTABLISHED:
+		break;
+	case FIN_WAIT1:
+		if (tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = FIN_WAIT2;
+		break;
+	case FIN_WAIT2:
+		if (!tcph->fin || tcph->syn || !tcph->ack)
+			g_unexpected++;
+		else
+			g_sock_state = TIME_WAIT_SENDING_ACK;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Check the types of outgoing packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/egress")
+int server_egress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_accept(&tcph) || egress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/ingress")
+int server_ingress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_accept(&tcph) || ingress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of outgoing packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/egress")
+int server_egress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_accept(&tcph) || egress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a server socket to make sure they
+ * are consistent with the state of the server socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/ingress")
+int server_ingress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_accept(&tcph) || ingress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of outgoing packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/egress")
+int client_egress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_connect(&tcph) || egress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the server side.
+ */
+SEC("cgroup_skb/ingress")
+int client_ingress_srv(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_connect(&tcph) || ingress_close_remote(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of outgoing packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/egress")
+int client_egress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Egress of the server socket. */
+	if (egress_connect(&tcph) || egress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
+
+/* Check the types of incoming packets of a client socket to make sure they
+ * are consistent with the state of the client socket.
+ *
+ * The connection is closed by the client side.
+ */
+SEC("cgroup_skb/ingress")
+int client_ingress(struct __sk_buff *skb)
+{
+	struct tcphdr tcph;
+
+	if (!needed_tcp_pkt(skb, &tcph))
+		return 1;
+
+	g_packet_count++;
+
+	/* Ingress of the server socket. */
+	if (ingress_connect(&tcph) || ingress_close_local(&tcph))
+		return 1;
+
+	g_unexpected++;
+	return 1;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-20 17:14 ` [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress Kui-Feng Lee
@ 2023-06-22  3:37   ` Yonghong Song
  2023-06-22 15:34     ` Kui-Feng Lee
  2023-06-22 17:15     ` Kui-Feng Lee
  0 siblings, 2 replies; 14+ messages in thread
From: Yonghong Song @ 2023-06-22  3:37 UTC (permalink / raw)
  To: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii,
	daniel, yhs, kpsingh, shuah, john.fastabend, sdf, mykolal,
	linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
> checking skb->sk against sk.
> 
> The filters were called only if skb is owned by the sock that the
> skb is sent out through.  In another words, skb->sk should point to
> the sock that it is sending through its egress.  However, the filters would
> miss SYNACK skbs that they are owned by a request_sock but sent through
> the listening sock, that is the socket listening incoming connections.
> This is an unnecessary restrict.

The original patch which introduced 'sk == skb->sk' is
   3007098494be  cgroup: add support for eBPF programs
There are no mentioning in commit message why 'sk == skb->sk'
is needed. So it is possible that this is just restricted
for use cases at that moment. Now there are use cases
where 'sk != skb->sk' so removing this check can enable
the new use case. Maybe you can add this into your commit
message so people can understand the history of 'sk == skb->sk'.

> 
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>   include/linux/bpf-cgroup.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 57e9e109257e..e656da531f9f 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
>   ({									       \
>   	int __ret = 0;							       \
> -	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
> +	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
>   		typeof(sk) __sk = sk_to_full_sk(sk);			       \
>   		if (sk_fullsock(__sk) &&				       \
>   		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets.
  2023-06-20 17:14 ` [PATCH bpf-next v3 2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets Kui-Feng Lee
@ 2023-06-22  4:15   ` Yonghong Song
  2023-06-22 15:33     ` Kui-Feng Lee
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2023-06-22  4:15 UTC (permalink / raw)
  To: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii,
	daniel, yhs, kpsingh, shuah, john.fastabend, sdf, mykolal,
	linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
> This test case includes four scenarios:
> 1. Connect to the server from outside the cgroup and close the connection
>     from outside the cgroup.
> 2. Connect to the server from outside the cgroup and close the connection
>     from inside the cgroup.
> 3. Connect to the server from inside the cgroup and close the connection
>     from outside the cgroup.
> 4. Connect to the server from inside the cgroup and close the connection
>     from inside the cgroup.
> 
> The test case is to verify that cgroup_skb/{egress, ingress} filters
> receive expected packets including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.
> 
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>   tools/testing/selftests/bpf/cgroup_helpers.c  |  12 +
>   tools/testing/selftests/bpf/cgroup_helpers.h  |   1 +
>   tools/testing/selftests/bpf/cgroup_tcp_skb.h  |  35 ++
>   .../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 399 ++++++++++++++++++
>   .../selftests/bpf/progs/cgroup_tcp_skb.c      | 382 +++++++++++++++++
>   5 files changed, 829 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
>   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
> 
> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
> index 9e95b37a7dff..2caee8423ee0 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.c
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.c
> @@ -277,6 +277,18 @@ int join_cgroup(const char *relative_path)
>   	return join_cgroup_from_top(cgroup_path);
>   }
>   
> +/**
> + * join_root_cgroup() - Join the root cgroup
> + *
> + * This function joins the root cgroup.
> + *
> + * On success, it returns 0, otherwise on failure it returns 1.
> + */
> +int join_root_cgroup(void)
> +{
> +	return join_cgroup_from_top(CGROUP_MOUNT_PATH);
> +}
> +
>   /**
>    * join_parent_cgroup() - Join a cgroup in the parent process workdir
>    * @relative_path: The cgroup path, relative to parent process workdir, to join
> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
> index f099a166c94d..5c2cb9c8b546 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.h
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.h
> @@ -22,6 +22,7 @@ void remove_cgroup(const char *relative_path);
>   unsigned long long get_cgroup_id(const char *relative_path);
>   
>   int join_cgroup(const char *relative_path);
> +int join_root_cgroup(void);
>   int join_parent_cgroup(const char *relative_path);
>   
>   int setup_cgroup_environment(void);
> diff --git a/tools/testing/selftests/bpf/cgroup_tcp_skb.h b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
> new file mode 100644
> index 000000000000..1054b3633983
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Facebook */

/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */

> +
> +/* Define states of a socket to tracking messages sending to and from the
> + * socket.
> + *
> + * These states are based on rfc9293 with some modifications to support
> + * tracking of messages sent out from a socket. For example, when a SYN is
> + * received, a new socket is transiting to the SYN_RECV state defined in
> + * rfc9293. But, we put it in SYN_RECV_SENDING_SYN_ACK state and when
> + * SYN-ACK is sent out, it moves to SYN_RECV state. With this modification,
> + * we can track the message sent out from a socket.
> + */
> +
> +#ifndef __CGROUP_TCP_SKB_H__
> +#define __CGROUP_TCP_SKB_H__
> +
> +enum {
> +	INIT,
> +	CLOSED,
> +	SYN_SENT,
> +	SYN_RECV_SENDING_SYN_ACK,
> +	SYN_RECV,
> +	ESTABLISHED,
> +	FIN_WAIT1,
> +	FIN_WAIT2,
> +	CLOSE_WAIT_SENDING_ACK,
> +	CLOSE_WAIT,
> +	CLOSING,
> +	LAST_ACK,
> +	TIME_WAIT_SENDING_ACK,
> +	TIME_WAIT,
> +};
> +
> +#endif /* __CGROUP_TCP_SKB_H__ */
> diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
> new file mode 100644
> index 000000000000..1b78e8ab3f02
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
> @@ -0,0 +1,399 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Facebook */
> +#include <test_progs.h>
> +#include <linux/in6.h>
> +#include <sys/socket.h>
> +#include <sched.h>
> +#include <unistd.h>
> +#include "cgroup_helpers.h"
> +#include "testing_helpers.h"
> +#include "cgroup_tcp_skb.skel.h"
> +#include "cgroup_tcp_skb.h"
> +
> +#define CGROUP_TCP_SKB_PATH "/test_cgroup_tcp_skb"
> +
> +static int install_filters(int cgroup_fd,
> +			   struct bpf_link **egress_link,
> +			   struct bpf_link **ingress_link,
> +			   struct bpf_program *egress_prog,
> +			   struct bpf_program *ingress_prog,
> +			   struct cgroup_tcp_skb *skel)
> +{
> +	/* Prepare filters */
> +	skel->bss->g_sock_state = 0;
> +	skel->bss->g_unexpected = 0;
> +	*egress_link =
> +		bpf_program__attach_cgroup(egress_prog,
> +					   cgroup_fd);
> +	if (!ASSERT_NEQ(*egress_link, NULL, "egress_link"))
> +		return -1;

!ASSERT_OK_PTR(...)

> +	*ingress_link =
> +		bpf_program__attach_cgroup(ingress_prog,
> +					   cgroup_fd);
> +	if (!ASSERT_NEQ(*ingress_link, NULL, "ingress_link"))
> +		return -1;

!ASSERT_OK_PTR(...)

> +
> +	return 0;
> +}
> +
> +static void uninstall_filters(struct bpf_link **egress_link,
> +			      struct bpf_link **ingress_link)
> +{
> +	bpf_link__destroy(*egress_link);
> +	*egress_link = NULL;
> +	bpf_link__destroy(*ingress_link);
> +	*ingress_link = NULL;
> +}
> +
[...]
> +
> +static int close_connection(int *closing_fd, int *peer_fd, int *listen_fd,
> +			    struct cgroup_tcp_skb *skel)
> +{
> +	__u32 saved_packet_count = 0;
> +	int err;
> +	int i;
> +
> +	/* Wait for ACKs to be sent */
> +	saved_packet_count = skel->bss->g_packet_count;
> +	usleep(100000);		/* 0.1s */
> +	while (skel->bss->g_packet_count != saved_packet_count) {
> +		saved_packet_count = skel->bss->g_packet_count;
> +		usleep(100000);	/* 0.1s */
> +	}

Should we put a limitation in the number of loop iterations
just in case that something went wrong or too slow?

> +
> +	skel->bss->g_packet_count = 0;
> +	saved_packet_count = 0;
> +
> +	/* Half shutdown to make sure the closing socket having a chance to
> +	 * receive a FIN from the peer.
> +	 */
> +	err = shutdown(*closing_fd, SHUT_WR);
> +	if (!ASSERT_OK(err, "shutdown closing_fd"))
> +		return -1;
> +
> +	/* Wait for FIN and the ACK of the FIN to be observed */
> +	for (i = 0;
> +	     skel->bss->g_packet_count < saved_packet_count + 2 && i < 10;
> +	     i++) {
> +		usleep(100000);	/* 0.1s */
> +	}

'{...}' is not necessary.

> +	if (!ASSERT_GE(skel->bss->g_packet_count, saved_packet_count + 2,
> +		       "packet_count"))
> +		return -1;
> +
> +	saved_packet_count = skel->bss->g_packet_count;
> +
> +	/* Fully shutdown the connection */
> +	err = close(*peer_fd);
> +	if (!ASSERT_OK(err, "close peer_fd"))
> +		return -1;
> +	*peer_fd = -1;
> +
> +	/* Wait for FIN and the ACK of the FIN to be observed */
> +	for (i = 0;
> +	     skel->bss->g_packet_count < saved_packet_count + 2 && i < 10;
> +	     i++) {
> +		usleep(100000);	/* 0.1s */
> +	}

'{...}' is not necessary.

> +	if (!ASSERT_GE(skel->bss->g_packet_count, saved_packet_count + 2,
> +		       "packet_count"))
> +		return -1;
> +
> +	err = close(*closing_fd);
> +	if (!ASSERT_OK(err, "close closing_fd"))
> +		return -1;
> +	*closing_fd = -1;
> +
> +	close(*listen_fd);
> +	*listen_fd = -1;
> +
> +	return 0;
> +}
> +
[...]
> diff --git a/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c b/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
> new file mode 100644
> index 000000000000..372a1548798c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
> @@ -0,0 +1,382 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Facebook */

/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */

> +#include <linux/bpf.h>
> +#include <bpf/bpf_endian.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#include <linux/if_ether.h>
> +#include <linux/in.h>
> +#include <linux/in6.h>
> +#include <linux/ipv6.h>
> +#include <linux/tcp.h>
> +
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +
> +#include "cgroup_tcp_skb.h"
[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets.
  2023-06-22  4:15   ` Yonghong Song
@ 2023-06-22 15:33     ` Kui-Feng Lee
  0 siblings, 0 replies; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-22 15:33 UTC (permalink / raw)
  To: Yonghong Song, Kui-Feng Lee, bpf, ast, martin.lau, song,
	kernel-team, andrii, daniel, yhs, kpsingh, shuah, john.fastabend,
	sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/21/23 21:15, Yonghong Song wrote:
> 
> 
> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>> This test case includes four scenarios:
>> 1. Connect to the server from outside the cgroup and close the connection
>>     from outside the cgroup.
>> 2. Connect to the server from outside the cgroup and close the connection
>>     from inside the cgroup.
>> 3. Connect to the server from inside the cgroup and close the connection
>>     from outside the cgroup.
>> 4. Connect to the server from inside the cgroup and close the connection
>>     from inside the cgroup.
>>
>> The test case is to verify that cgroup_skb/{egress, ingress} filters
>> receive expected packets including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   tools/testing/selftests/bpf/cgroup_helpers.c  |  12 +
>>   tools/testing/selftests/bpf/cgroup_helpers.h  |   1 +
>>   tools/testing/selftests/bpf/cgroup_tcp_skb.h  |  35 ++
>>   .../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 399 ++++++++++++++++++
>>   .../selftests/bpf/progs/cgroup_tcp_skb.c      | 382 +++++++++++++++++
>>   5 files changed, 829 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
>>   create mode 100644 
>> tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
>>
>> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c 
>> b/tools/testing/selftests/bpf/cgroup_helpers.c
>> index 9e95b37a7dff..2caee8423ee0 100644
>> --- a/tools/testing/selftests/bpf/cgroup_helpers.c
>> +++ b/tools/testing/selftests/bpf/cgroup_helpers.c
>> @@ -277,6 +277,18 @@ int join_cgroup(const char *relative_path)
>>       return join_cgroup_from_top(cgroup_path);
>>   }
>> +/**
>> + * join_root_cgroup() - Join the root cgroup
>> + *
>> + * This function joins the root cgroup.
>> + *
>> + * On success, it returns 0, otherwise on failure it returns 1.
>> + */
>> +int join_root_cgroup(void)
>> +{
>> +    return join_cgroup_from_top(CGROUP_MOUNT_PATH);
>> +}
>> +
>>   /**
>>    * join_parent_cgroup() - Join a cgroup in the parent process workdir
>>    * @relative_path: The cgroup path, relative to parent process 
>> workdir, to join
>> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h 
>> b/tools/testing/selftests/bpf/cgroup_helpers.h
>> index f099a166c94d..5c2cb9c8b546 100644
>> --- a/tools/testing/selftests/bpf/cgroup_helpers.h
>> +++ b/tools/testing/selftests/bpf/cgroup_helpers.h
>> @@ -22,6 +22,7 @@ void remove_cgroup(const char *relative_path);
>>   unsigned long long get_cgroup_id(const char *relative_path);
>>   int join_cgroup(const char *relative_path);
>> +int join_root_cgroup(void);
>>   int join_parent_cgroup(const char *relative_path);
>>   int setup_cgroup_environment(void);
>> diff --git a/tools/testing/selftests/bpf/cgroup_tcp_skb.h 
>> b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
>> new file mode 100644
>> index 000000000000..1054b3633983
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/cgroup_tcp_skb.h
>> @@ -0,0 +1,35 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2023 Facebook */
> 
> /* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
> 
>> +
>> +/* Define states of a socket to tracking messages sending to and from 
>> the
>> + * socket.
>> + *
>> + * These states are based on rfc9293 with some modifications to support
>> + * tracking of messages sent out from a socket. For example, when a 
>> SYN is
>> + * received, a new socket is transiting to the SYN_RECV state defined in
>> + * rfc9293. But, we put it in SYN_RECV_SENDING_SYN_ACK state and when
>> + * SYN-ACK is sent out, it moves to SYN_RECV state. With this 
>> modification,
>> + * we can track the message sent out from a socket.
>> + */
>> +
>> +#ifndef __CGROUP_TCP_SKB_H__
>> +#define __CGROUP_TCP_SKB_H__
>> +
>> +enum {
>> +    INIT,
>> +    CLOSED,
>> +    SYN_SENT,
>> +    SYN_RECV_SENDING_SYN_ACK,
>> +    SYN_RECV,
>> +    ESTABLISHED,
>> +    FIN_WAIT1,
>> +    FIN_WAIT2,
>> +    CLOSE_WAIT_SENDING_ACK,
>> +    CLOSE_WAIT,
>> +    CLOSING,
>> +    LAST_ACK,
>> +    TIME_WAIT_SENDING_ACK,
>> +    TIME_WAIT,
>> +};
>> +
>> +#endif /* __CGROUP_TCP_SKB_H__ */
>> diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c 
>> b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
>> new file mode 100644
>> index 000000000000..1b78e8ab3f02
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
>> @@ -0,0 +1,399 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2023 Facebook */
>> +#include <test_progs.h>
>> +#include <linux/in6.h>
>> +#include <sys/socket.h>
>> +#include <sched.h>
>> +#include <unistd.h>
>> +#include "cgroup_helpers.h"
>> +#include "testing_helpers.h"
>> +#include "cgroup_tcp_skb.skel.h"
>> +#include "cgroup_tcp_skb.h"
>> +
>> +#define CGROUP_TCP_SKB_PATH "/test_cgroup_tcp_skb"
>> +
>> +static int install_filters(int cgroup_fd,
>> +               struct bpf_link **egress_link,
>> +               struct bpf_link **ingress_link,
>> +               struct bpf_program *egress_prog,
>> +               struct bpf_program *ingress_prog,
>> +               struct cgroup_tcp_skb *skel)
>> +{
>> +    /* Prepare filters */
>> +    skel->bss->g_sock_state = 0;
>> +    skel->bss->g_unexpected = 0;
>> +    *egress_link =
>> +        bpf_program__attach_cgroup(egress_prog,
>> +                       cgroup_fd);
>> +    if (!ASSERT_NEQ(*egress_link, NULL, "egress_link"))
>> +        return -1;
> 
> !ASSERT_OK_PTR(...)

Sure!

> 
>> +    *ingress_link =
>> +        bpf_program__attach_cgroup(ingress_prog,
>> +                       cgroup_fd);
>> +    if (!ASSERT_NEQ(*ingress_link, NULL, "ingress_link"))
>> +        return -1;
> 
> !ASSERT_OK_PTR(...)
> 
>> +
>> +    return 0;
>> +}
>> +
>> +static void uninstall_filters(struct bpf_link **egress_link,
>> +                  struct bpf_link **ingress_link)
>> +{
>> +    bpf_link__destroy(*egress_link);
>> +    *egress_link = NULL;
>> +    bpf_link__destroy(*ingress_link);
>> +    *ingress_link = NULL;
>> +}
>> +
> [...]
>> +
>> +static int close_connection(int *closing_fd, int *peer_fd, int 
>> *listen_fd,
>> +                struct cgroup_tcp_skb *skel)
>> +{
>> +    __u32 saved_packet_count = 0;
>> +    int err;
>> +    int i;
>> +
>> +    /* Wait for ACKs to be sent */
>> +    saved_packet_count = skel->bss->g_packet_count;
>> +    usleep(100000);        /* 0.1s */
>> +    while (skel->bss->g_packet_count != saved_packet_count) {
>> +        saved_packet_count = skel->bss->g_packet_count;
>> +        usleep(100000);    /* 0.1s */
>> +    }
> 
> Should we put a limitation in the number of loop iterations
> just in case that something went wrong or too slow?

Sure! It make sense to me.

[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-22  3:37   ` Yonghong Song
@ 2023-06-22 15:34     ` Kui-Feng Lee
  2023-06-22 17:15     ` Kui-Feng Lee
  1 sibling, 0 replies; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-22 15:34 UTC (permalink / raw)
  To: Yonghong Song, Kui-Feng Lee, bpf, ast, martin.lau, song,
	kernel-team, andrii, daniel, yhs, kpsingh, shuah, john.fastabend,
	sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/21/23 20:37, Yonghong Song wrote:
> 
> 
> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>> checking skb->sk against sk.
>>
>> The filters were called only if skb is owned by the sock that the
>> skb is sent out through.  In another words, skb->sk should point to
>> the sock that it is sending through its egress.  However, the filters 
>> would
>> miss SYNACK skbs that they are owned by a request_sock but sent through
>> the listening sock, that is the socket listening incoming connections.
>> This is an unnecessary restrict.
> 
> The original patch which introduced 'sk == skb->sk' is
>    3007098494be  cgroup: add support for eBPF programs
> There are no mentioning in commit message why 'sk == skb->sk'
> is needed. So it is possible that this is just restricted
> for use cases at that moment. Now there are use cases
> where 'sk != skb->sk' so removing this check can enable
> the new use case. Maybe you can add this into your commit
> message so people can understand the history of 'sk == skb->sk'.

I will put it down on the next version.

> 
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/linux/bpf-cgroup.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 57e9e109257e..e656da531f9f 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct 
>> sock *sk,
>>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)                   \
>>   ({                                           \
>>       int __ret = 0;                                   \
>> -    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == 
>> skb->sk) { \
>> +    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {               \
>>           typeof(sk) __sk = sk_to_full_sk(sk);                   \
>>           if (sk_fullsock(__sk) &&                       \
>>               cgroup_bpf_sock_enabled(__sk, 
>> CGROUP_INET_EGRESS))           \
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-22  3:37   ` Yonghong Song
  2023-06-22 15:34     ` Kui-Feng Lee
@ 2023-06-22 17:15     ` Kui-Feng Lee
  2023-06-22 18:28       ` Yonghong Song
  1 sibling, 1 reply; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-22 17:15 UTC (permalink / raw)
  To: Yonghong Song, Kui-Feng Lee, bpf, ast, martin.lau, song,
	kernel-team, andrii, daniel, yhs, kpsingh, shuah, john.fastabend,
	sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/21/23 20:37, Yonghong Song wrote:
> 
> 
> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>> checking skb->sk against sk.
>>
>> The filters were called only if skb is owned by the sock that the
>> skb is sent out through.  In another words, skb->sk should point to
>> the sock that it is sending through its egress.  However, the filters 
>> would
>> miss SYNACK skbs that they are owned by a request_sock but sent through
>> the listening sock, that is the socket listening incoming connections.
>> This is an unnecessary restrict.
> 
> The original patch which introduced 'sk == skb->sk' is
>    3007098494be  cgroup: add support for eBPF programs
> There are no mentioning in commit message why 'sk == skb->sk'
> is needed. So it is possible that this is just restricted
> for use cases at that moment. Now there are use cases
> where 'sk != skb->sk' so removing this check can enable
> the new use case. Maybe you can add this into your commit
> message so people can understand the history of 'sk == skb->sk'.

After checking the code and the Alexei's comment[1] again, this check
may be different from what I thought. In another post[2],
Daniel Borkmann mentioned

     Wouldn't that mean however, when you go through stacked devices that
     you'd run the same eBPF cgroup program for skb->sk multiple times?

I read this paragraph several times.
This check ensures the filters are only called for the device on
the top of a stack.  So, I probably should change the check to

     sk == skb_to_full_sk(skb)

instead of removing it.  If we remove the check, egress filters
could be called multiple times for a skb, just like what Daniel said.

Does that make sense?

[1] 
https://lore.kernel.org/all/CAADnVQKi0c=Mf3b=z43=b6n2xBVhwPw4QoV_au5+pFE29iLkaQ@mail.gmail.com/
[2] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/

> 
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/linux/bpf-cgroup.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 57e9e109257e..e656da531f9f 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct 
>> sock *sk,
>>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)                   \
>>   ({                                           \
>>       int __ret = 0;                                   \
>> -    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == 
>> skb->sk) { \
>> +    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {               \
>>           typeof(sk) __sk = sk_to_full_sk(sk);                   \
>>           if (sk_fullsock(__sk) &&                       \
>>               cgroup_bpf_sock_enabled(__sk, 
>> CGROUP_INET_EGRESS))           \
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-22 17:15     ` Kui-Feng Lee
@ 2023-06-22 18:28       ` Yonghong Song
  2023-06-22 20:06         ` Daniel Borkmann
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2023-06-22 18:28 UTC (permalink / raw)
  To: Kui-Feng Lee, Kui-Feng Lee, bpf, ast, martin.lau, song,
	kernel-team, andrii, daniel, yhs, kpsingh, shuah, john.fastabend,
	sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
> 
> 
> On 6/21/23 20:37, Yonghong Song wrote:
>>
>>
>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>> checking skb->sk against sk.
>>>
>>> The filters were called only if skb is owned by the sock that the
>>> skb is sent out through.  In another words, skb->sk should point to
>>> the sock that it is sending through its egress.  However, the filters 
>>> would
>>> miss SYNACK skbs that they are owned by a request_sock but sent through
>>> the listening sock, that is the socket listening incoming connections.
>>> This is an unnecessary restrict.
>>
>> The original patch which introduced 'sk == skb->sk' is
>>    3007098494be  cgroup: add support for eBPF programs
>> There are no mentioning in commit message why 'sk == skb->sk'
>> is needed. So it is possible that this is just restricted
>> for use cases at that moment. Now there are use cases
>> where 'sk != skb->sk' so removing this check can enable
>> the new use case. Maybe you can add this into your commit
>> message so people can understand the history of 'sk == skb->sk'.
> 
> After checking the code and the Alexei's comment[1] again, this check
> may be different from what I thought. In another post[2],
> Daniel Borkmann mentioned
> 
>      Wouldn't that mean however, when you go through stacked devices that
>      you'd run the same eBPF cgroup program for skb->sk multiple times?
> 
> I read this paragraph several times.
> This check ensures the filters are only called for the device on
> the top of a stack.  So, I probably should change the check to
> 
>      sk == skb_to_full_sk(skb)

I think this should work. It exactly covers your use case:
   they are owned by a request_sock but sent through
   the listening sock, that is the socket listening incoming connections
and sk == skb->sk for non request_sock/listening_sock case.

I originally though whether you could do
   sk == skb->sk || skb->sk->sk_state == TCP_NEW_SYN_RECV
but obviously your approach is better.

> 
> instead of removing it.  If we remove the check, egress filters
> could be called multiple times for a skb, just like what Daniel said.
> 
> Does that make sense?
> 
> [1] 
> https://lore.kernel.org/all/CAADnVQKi0c=Mf3b=z43=b6n2xBVhwPw4QoV_au5+pFE29iLkaQ@mail.gmail.com/
> [2] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/
> 
>>
>>>
>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>> ---
>>>   include/linux/bpf-cgroup.h | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>>> index 57e9e109257e..e656da531f9f 100644
>>> --- a/include/linux/bpf-cgroup.h
>>> +++ b/include/linux/bpf-cgroup.h
>>> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct 
>>> sock *sk,
>>>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)                   \
>>>   ({                                           \
>>>       int __ret = 0;                                   \
>>> -    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == 
>>> skb->sk) { \
>>> +    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {               \
>>>           typeof(sk) __sk = sk_to_full_sk(sk);                   \
>>>           if (sk_fullsock(__sk) &&                       \
>>>               cgroup_bpf_sock_enabled(__sk, 
>>> CGROUP_INET_EGRESS))           \
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-22 18:28       ` Yonghong Song
@ 2023-06-22 20:06         ` Daniel Borkmann
  2023-06-22 23:55           ` Kui-Feng Lee
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2023-06-22 20:06 UTC (permalink / raw)
  To: Yonghong Song, Kui-Feng Lee, Kui-Feng Lee, bpf, ast, martin.lau,
	song, kernel-team, andrii, yhs, kpsingh, shuah, john.fastabend,
	sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee

On 6/22/23 8:28 PM, Yonghong Song wrote:
> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>> On 6/21/23 20:37, Yonghong Song wrote:
>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>> checking skb->sk against sk.
>>>>
>>>> The filters were called only if skb is owned by the sock that the
>>>> skb is sent out through.  In another words, skb->sk should point to
>>>> the sock that it is sending through its egress.  However, the filters would
>>>> miss SYNACK skbs that they are owned by a request_sock but sent through
>>>> the listening sock, that is the socket listening incoming connections.
>>>> This is an unnecessary restrict.
>>>
>>> The original patch which introduced 'sk == skb->sk' is
>>>    3007098494be  cgroup: add support for eBPF programs
>>> There are no mentioning in commit message why 'sk == skb->sk'
>>> is needed. So it is possible that this is just restricted
>>> for use cases at that moment. Now there are use cases
>>> where 'sk != skb->sk' so removing this check can enable
>>> the new use case. Maybe you can add this into your commit
>>> message so people can understand the history of 'sk == skb->sk'.
>>
>> After checking the code and the Alexei's comment[1] again, this check
>> may be different from what I thought. In another post[2],
>> Daniel Borkmann mentioned
>>
>>      Wouldn't that mean however, when you go through stacked devices that
>>      you'd run the same eBPF cgroup program for skb->sk multiple times?
>>
>> I read this paragraph several times.
>> This check ensures the filters are only called for the device on
>> the top of a stack.  So, I probably should change the check to
>>
>>      sk == skb_to_full_sk(skb)
> 
> I think this should work. It exactly covers your use case:
>    they are owned by a request_sock but sent through
>    the listening sock, that is the socket listening incoming connections
> and sk == skb->sk for non request_sock/listening_sock case.

Just a thought, should the test look like the below?

         int __ret = 0;                                                         \
         if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {                    \
                 typeof(sk) __sk = sk_to_full_sk(sk);                           \
                 if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&        \
                     cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))         \
                         __ret = __cgroup_bpf_run_filter_skb(__sk, skb,         \
                                                       CGROUP_INET_EGRESS); \
         }                                                                      \

Iow, we do already convert __sk to full sk, so we should then also use that
for the test with skb_to_full_sk(skb).

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-22 20:06         ` Daniel Borkmann
@ 2023-06-22 23:55           ` Kui-Feng Lee
  2023-06-23  8:50             ` Daniel Borkmann
  0 siblings, 1 reply; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-22 23:55 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song, Kui-Feng Lee, bpf, ast,
	martin.lau, song, kernel-team, andrii, yhs, kpsingh, shuah,
	john.fastabend, sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/22/23 13:06, Daniel Borkmann wrote:
> On 6/22/23 8:28 PM, Yonghong Song wrote:
>> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>>> On 6/21/23 20:37, Yonghong Song wrote:
>>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>>> checking skb->sk against sk.
>>>>>
>>>>> The filters were called only if skb is owned by the sock that the
>>>>> skb is sent out through.  In another words, skb->sk should point to
>>>>> the sock that it is sending through its egress.  However, the 
>>>>> filters would
>>>>> miss SYNACK skbs that they are owned by a request_sock but sent 
>>>>> through
>>>>> the listening sock, that is the socket listening incoming connections.
>>>>> This is an unnecessary restrict.
>>>>
>>>> The original patch which introduced 'sk == skb->sk' is
>>>>    3007098494be  cgroup: add support for eBPF programs
>>>> There are no mentioning in commit message why 'sk == skb->sk'
>>>> is needed. So it is possible that this is just restricted
>>>> for use cases at that moment. Now there are use cases
>>>> where 'sk != skb->sk' so removing this check can enable
>>>> the new use case. Maybe you can add this into your commit
>>>> message so people can understand the history of 'sk == skb->sk'.
>>>
>>> After checking the code and the Alexei's comment[1] again, this check
>>> may be different from what I thought. In another post[2],
>>> Daniel Borkmann mentioned
>>>
>>>      Wouldn't that mean however, when you go through stacked devices 
>>> that
>>>      you'd run the same eBPF cgroup program for skb->sk multiple times?
>>>
>>> I read this paragraph several times.
>>> This check ensures the filters are only called for the device on
>>> the top of a stack.  So, I probably should change the check to
>>>
>>>      sk == skb_to_full_sk(skb)
>>
>> I think this should work. It exactly covers your use case:
>>    they are owned by a request_sock but sent through
>>    the listening sock, that is the socket listening incoming connections
>> and sk == skb->sk for non request_sock/listening_sock case.
> 
> Just a thought, should the test look like the below?
> 
>          int __ret = 
> 0;                                                         \
>          if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) 
> {                    \
>                  typeof(sk) __sk = 
> sk_to_full_sk(sk);                           \
>                  if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) 
> &&        \
>                      cgroup_bpf_sock_enabled(__sk, 
> CGROUP_INET_EGRESS))         \
>                          __ret = __cgroup_bpf_run_filter_skb(__sk, 
> skb,         \
>                                                        
> CGROUP_INET_EGRESS); \
>          
> }                                                                      \
> 
> Iow, we do already convert __sk to full sk, so we should then also use that
> for the test with skb_to_full_sk(skb).

Agree!

> 
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-22 23:55           ` Kui-Feng Lee
@ 2023-06-23  8:50             ` Daniel Borkmann
  2023-06-23 16:30               ` Kui-Feng Lee
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2023-06-23  8:50 UTC (permalink / raw)
  To: Kui-Feng Lee, Yonghong Song, Kui-Feng Lee, bpf, ast, martin.lau,
	song, kernel-team, andrii, yhs, kpsingh, shuah, john.fastabend,
	sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee

On 6/23/23 1:55 AM, Kui-Feng Lee wrote:
> On 6/22/23 13:06, Daniel Borkmann wrote:
>> On 6/22/23 8:28 PM, Yonghong Song wrote:
>>> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>>>> On 6/21/23 20:37, Yonghong Song wrote:
>>>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>>>> checking skb->sk against sk.
>>>>>>
>>>>>> The filters were called only if skb is owned by the sock that the
>>>>>> skb is sent out through.  In another words, skb->sk should point to
>>>>>> the sock that it is sending through its egress.  However, the filters would
>>>>>> miss SYNACK skbs that they are owned by a request_sock but sent through
>>>>>> the listening sock, that is the socket listening incoming connections.
>>>>>> This is an unnecessary restrict.
>>>>>
>>>>> The original patch which introduced 'sk == skb->sk' is
>>>>>    3007098494be  cgroup: add support for eBPF programs
>>>>> There are no mentioning in commit message why 'sk == skb->sk'
>>>>> is needed. So it is possible that this is just restricted
>>>>> for use cases at that moment. Now there are use cases
>>>>> where 'sk != skb->sk' so removing this check can enable
>>>>> the new use case. Maybe you can add this into your commit
>>>>> message so people can understand the history of 'sk == skb->sk'.
>>>>
>>>> After checking the code and the Alexei's comment[1] again, this check
>>>> may be different from what I thought. In another post[2],
>>>> Daniel Borkmann mentioned
>>>>
>>>>      Wouldn't that mean however, when you go through stacked devices that
>>>>      you'd run the same eBPF cgroup program for skb->sk multiple times?
>>>>
>>>> I read this paragraph several times.
>>>> This check ensures the filters are only called for the device on
>>>> the top of a stack.  So, I probably should change the check to
>>>>
>>>>      sk == skb_to_full_sk(skb)
>>>
>>> I think this should work. It exactly covers your use case:
>>>    they are owned by a request_sock but sent through
>>>    the listening sock, that is the socket listening incoming connections
>>> and sk == skb->sk for non request_sock/listening_sock case.
>>
>> Just a thought, should the test look like the below?
>>
>>          int __ret = 0;                                                         \
>>          if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {                    \
>>                  typeof(sk) __sk = sk_to_full_sk(sk);                           \
>>                  if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&        \
>>                      cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))         \
>>                          __ret = __cgroup_bpf_run_filter_skb(__sk, skb,         \
>> CGROUP_INET_EGRESS); \
>> }                                                                      \
>>
>> Iow, we do already convert __sk to full sk, so we should then also use that
>> for the test with skb_to_full_sk(skb).
> 
> Agree!

It would also be useful to do an in-depth analysis for the commit msg in which
cases the sk == skb->sk matches and sk was not a full sock (but __sk is) given
the __sk = sk_to_full_sk(sk) exists in the code to document which situation this
is covering in the existing code (... perhaps it used to work back then for
synack just that later changes altered it without anyone noticing until now).

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-23  8:50             ` Daniel Borkmann
@ 2023-06-23 16:30               ` Kui-Feng Lee
  0 siblings, 0 replies; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-23 16:30 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song, Kui-Feng Lee, bpf, ast,
	martin.lau, song, kernel-team, andrii, yhs, kpsingh, shuah,
	john.fastabend, sdf, mykolal, linux-kselftest, jolsa, haoluo
  Cc: Kui-Feng Lee



On 6/23/23 01:50, Daniel Borkmann wrote:
> On 6/23/23 1:55 AM, Kui-Feng Lee wrote:
>> On 6/22/23 13:06, Daniel Borkmann wrote:
>>> On 6/22/23 8:28 PM, Yonghong Song wrote:
>>>> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>>>>> On 6/21/23 20:37, Yonghong Song wrote:
>>>>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>>>>> checking skb->sk against sk.
>>>>>>>
>>>>>>> The filters were called only if skb is owned by the sock that the
>>>>>>> skb is sent out through.  In another words, skb->sk should point to
>>>>>>> the sock that it is sending through its egress.  However, the 
>>>>>>> filters would
>>>>>>> miss SYNACK skbs that they are owned by a request_sock but sent 
>>>>>>> through
>>>>>>> the listening sock, that is the socket listening incoming 
>>>>>>> connections.
>>>>>>> This is an unnecessary restrict.
>>>>>>
>>>>>> The original patch which introduced 'sk == skb->sk' is
>>>>>>    3007098494be  cgroup: add support for eBPF programs
>>>>>> There are no mentioning in commit message why 'sk == skb->sk'
>>>>>> is needed. So it is possible that this is just restricted
>>>>>> for use cases at that moment. Now there are use cases
>>>>>> where 'sk != skb->sk' so removing this check can enable
>>>>>> the new use case. Maybe you can add this into your commit
>>>>>> message so people can understand the history of 'sk == skb->sk'.
>>>>>
>>>>> After checking the code and the Alexei's comment[1] again, this check
>>>>> may be different from what I thought. In another post[2],
>>>>> Daniel Borkmann mentioned
>>>>>
>>>>>      Wouldn't that mean however, when you go through stacked 
>>>>> devices that
>>>>>      you'd run the same eBPF cgroup program for skb->sk multiple 
>>>>> times?
>>>>>
>>>>> I read this paragraph several times.
>>>>> This check ensures the filters are only called for the device on
>>>>> the top of a stack.  So, I probably should change the check to
>>>>>
>>>>>      sk == skb_to_full_sk(skb)
>>>>
>>>> I think this should work. It exactly covers your use case:
>>>>    they are owned by a request_sock but sent through
>>>>    the listening sock, that is the socket listening incoming 
>>>> connections
>>>> and sk == skb->sk for non request_sock/listening_sock case.
>>>
>>> Just a thought, should the test look like the below?
>>>
>>>          int __ret = 
>>> 0;                                                         \
>>>          if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) 
>>> {                    \
>>>                  typeof(sk) __sk = 
>>> sk_to_full_sk(sk);                           \
>>>                  if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) 
>>> &&        \
>>>                      cgroup_bpf_sock_enabled(__sk, 
>>> CGROUP_INET_EGRESS))         \
>>>                          __ret = __cgroup_bpf_run_filter_skb(__sk, 
>>> skb,         \
>>> CGROUP_INET_EGRESS); \
>>> }                                                                      \
>>>
>>> Iow, we do already convert __sk to full sk, so we should then also 
>>> use that
>>> for the test with skb_to_full_sk(skb).
>>
>> Agree!
> 
> It would also be useful to do an in-depth analysis for the commit msg in 
> which
> cases the sk == skb->sk matches and sk was not a full sock (but __sk is) 
> given
> the __sk = sk_to_full_sk(sk) exists in the code to document which 
> situation this
> is covering in the existing code (... perhaps it used to work back then for
> synack just that later changes altered it without anyone noticing until 
> now).

I did a test that trace how a packet going through L2TP
devices. I am going to include the analysis of the test and other
related links of discussions in the commit log.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.
  2023-06-20 21:32 [PATCH bpf-next v3 0/2] Fix missing synack in BPF cgroup_skb filters Kui-Feng Lee
@ 2023-06-20 21:32 ` Kui-Feng Lee
  0 siblings, 0 replies; 14+ messages in thread
From: Kui-Feng Lee @ 2023-06-20 21:32 UTC (permalink / raw)
  To: netdev; +Cc: kuba, Kui-Feng Lee

Always call BPF filters if CGROUP BPF is enabled for EGRESS without
checking skb->sk against sk.

The filters were called only if skb is owned by the sock that the
skb is sent out through.  In another words, skb->sk should point to
the sock that it is sending through its egress.  However, the filters would
miss SYNACK skbs that they are owned by a request_sock but sent through
the listening sock, that is the socket listening incoming connections.
This is an unnecessary restrict.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf-cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 57e9e109257e..e656da531f9f 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
 ({									       \
 	int __ret = 0;							       \
-	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
+	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
 		typeof(sk) __sk = sk_to_full_sk(sk);			       \
 		if (sk_fullsock(__sk) &&				       \
 		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-06-23 16:30 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-20 17:14 [PATCH bpf-next v3 0/2] Fix missing synack in BPF cgroup_skb filters Kui-Feng Lee
2023-06-20 17:14 ` [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress Kui-Feng Lee
2023-06-22  3:37   ` Yonghong Song
2023-06-22 15:34     ` Kui-Feng Lee
2023-06-22 17:15     ` Kui-Feng Lee
2023-06-22 18:28       ` Yonghong Song
2023-06-22 20:06         ` Daniel Borkmann
2023-06-22 23:55           ` Kui-Feng Lee
2023-06-23  8:50             ` Daniel Borkmann
2023-06-23 16:30               ` Kui-Feng Lee
2023-06-20 17:14 ` [PATCH bpf-next v3 2/2] selftests/bpf: Verify that the cgroup_skb filters receive expected packets Kui-Feng Lee
2023-06-22  4:15   ` Yonghong Song
2023-06-22 15:33     ` Kui-Feng Lee
2023-06-20 21:32 [PATCH bpf-next v3 0/2] Fix missing synack in BPF cgroup_skb filters Kui-Feng Lee
2023-06-20 21:32 ` [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress Kui-Feng Lee

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.