bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Patch bpf 0/3] sock_map: fix ->poll() and update selftests
@ 2021-09-24 22:05 Cong Wang
  2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

This patchset fixes ->poll() on sockmap sockets and update
selftests accordingly with select(). Please check each patch
for more details.

Cong Wang (2):
  skmsg: introduce sk_psock_get_checked()
  net: poll psock queues too for sockmap sockets

Yucong Sun (1):
  selftests/bpf: use recv_timeout() instead of retries

 include/linux/skmsg.h                         | 26 +++++++
 net/core/skmsg.c                              | 15 ++++
 net/core/sock_map.c                           | 22 +-----
 net/ipv4/tcp.c                                |  2 +
 net/ipv4/udp.c                                |  2 +
 net/unix/af_unix.c                            |  5 ++
 .../selftests/bpf/prog_tests/sockmap_listen.c | 75 +++++--------------
 7 files changed, 71 insertions(+), 76 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked()
  2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
@ 2021-09-24 22:05 ` Cong Wang
  2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
  2021-09-24 22:05 ` [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries Cong Wang
  2 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki,
	Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Although we have sk_psock_get(), it assumes the psock
retrieved from sk_user_data is for sockmap, this is not
sufficient if we call it outside of sockmap, for example,
reuseport_array.

Fortunately sock_map_psock_get_checked() is more strict
and checks for sock_map_close before using psock. So we can
refactor it and rename it to sk_psock_get_checked(), which
can be safely called outside of sockmap.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h | 20 ++++++++++++++++++++
 net/core/sock_map.c   | 22 +---------------------
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 14ab0c0bc924..d47097f2c8c0 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -452,6 +452,26 @@ static inline struct sk_psock *sk_psock_get(struct sock *sk)
 	return psock;
 }
 
+static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
+{
+	struct sk_psock *psock;
+
+	rcu_read_lock();
+	psock = sk_psock(sk);
+	if (psock) {
+		if (sk->sk_prot->close != sock_map_close) {
+			psock = ERR_PTR(-EBUSY);
+			goto out;
+		}
+
+		if (!refcount_inc_not_zero(&psock->refcnt))
+			psock = ERR_PTR(-EBUSY);
+	}
+out:
+	rcu_read_unlock();
+	return psock;
+}
+
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock);
 
 static inline void sk_psock_put(struct sock *sk, struct sk_psock *psock)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index e252b8ec2b85..6612bb0b95b5 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -191,26 +191,6 @@ static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
 	return sk->sk_prot->psock_update_sk_prot(sk, psock, false);
 }
 
-static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
-{
-	struct sk_psock *psock;
-
-	rcu_read_lock();
-	psock = sk_psock(sk);
-	if (psock) {
-		if (sk->sk_prot->close != sock_map_close) {
-			psock = ERR_PTR(-EBUSY);
-			goto out;
-		}
-
-		if (!refcount_inc_not_zero(&psock->refcnt))
-			psock = ERR_PTR(-EBUSY);
-	}
-out:
-	rcu_read_unlock();
-	return psock;
-}
-
 static int sock_map_link(struct bpf_map *map, struct sock *sk)
 {
 	struct sk_psock_progs *progs = sock_map_progs(map);
@@ -255,7 +235,7 @@ static int sock_map_link(struct bpf_map *map, struct sock *sk)
 		}
 	}
 
-	psock = sock_map_psock_get_checked(sk);
+	psock = sk_psock_get_checked(sk);
 	if (IS_ERR(psock)) {
 		ret = PTR_ERR(psock);
 		goto out_progs;
-- 
2.30.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Patch bpf 2/3] net: poll psock queues too for sockmap sockets
  2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
  2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
@ 2021-09-24 22:05 ` Cong Wang
  2021-09-27 18:07   ` John Fastabend
  2021-09-24 22:05 ` [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries Cong Wang
  2 siblings, 1 reply; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Cong Wang, Yucong Sun, John Fastabend, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Yucong noticed we can't poll() sockets in sockmap even
when they are the destination sockets of redirections.
This is because we never poll any psock queues in ->poll().
We can not overwrite ->poll() as it is in struct proto_ops,
not in struct proto.

So introduce sk_msg_poll() to poll psock ingress_msg queue
and let sockets which support sockmap invoke it directly.

Reported-by: Yucong Sun <sunyucong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h |  6 ++++++
 net/core/skmsg.c      | 15 +++++++++++++++
 net/ipv4/tcp.c        |  2 ++
 net/ipv4/udp.c        |  2 ++
 net/unix/af_unix.c    |  5 +++++
 5 files changed, 30 insertions(+)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index d47097f2c8c0..163b0cc1703a 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -128,6 +128,7 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			     struct sk_msg *msg, u32 bytes);
 int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		   int len, int flags);
+__poll_t sk_msg_poll(struct sock *sk);
 
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
 {
@@ -562,5 +563,10 @@ static inline void skb_bpf_redirect_clear(struct sk_buff *skb)
 {
 	skb->_sk_redir = 0;
 }
+#else
+static inline __poll_t sk_msg_poll(struct sock *sk)
+{
+	return 0;
+}
 #endif /* CONFIG_NET_SOCK_MSG */
 #endif /* _LINUX_SKMSG_H */
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2d6249b28928..8e6d7ea43eca 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -474,6 +474,21 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 }
 EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
 
+__poll_t sk_msg_poll(struct sock *sk)
+{
+	struct sk_psock *psock;
+	__poll_t mask = 0;
+
+	psock = sk_psock_get_checked(sk);
+	if (IS_ERR_OR_NULL(psock))
+		return 0;
+	if (!sk_psock_queue_empty(psock))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	sk_psock_put(sk, psock);
+	return mask;
+}
+EXPORT_SYMBOL_GPL(sk_msg_poll);
+
 static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 						  struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df73c85..2eb1a87ba056 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -280,6 +280,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <linux/skmsg.h>
 
 /* Track pending CMSGs. */
 enum {
@@ -563,6 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 		if (tcp_stream_is_readable(sk, target))
 			mask |= EPOLLIN | EPOLLRDNORM;
+		mask |= sk_msg_poll(sk);
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
 			if (__sk_stream_is_writeable(sk, 1)) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8851c9463b4b..fbc989d27388 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -97,6 +97,7 @@
 #include <linux/skbuff.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/skmsg.h>
 #include <net/net_namespace.h>
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
@@ -2866,6 +2867,7 @@ __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	    !(sk->sk_shutdown & RCV_SHUTDOWN) && first_packet_length(sk) == -1)
 		mask &= ~(EPOLLIN | EPOLLRDNORM);
 
+	mask |= sk_msg_poll(sk);
 	return mask;
 
 }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 92345c9bb60c..5d705541d082 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,6 +114,7 @@
 #include <linux/freezer.h>
 #include <linux/file.h>
 #include <linux/btf_ids.h>
+#include <linux/skmsg.h>
 
 #include "scm.h"
 
@@ -3015,6 +3016,8 @@ static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table *wa
 	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
+	mask |= sk_msg_poll(sk);
+
 	/* Connection-based need to check for termination and startup */
 	if ((sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) &&
 	    sk->sk_state == TCP_CLOSE)
@@ -3054,6 +3057,8 @@ static __poll_t unix_dgram_poll(struct file *file, struct socket *sock,
 	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
+	mask |= sk_msg_poll(sk);
+
 	/* Connection-based need to check for termination and startup */
 	if (sk->sk_type == SOCK_SEQPACKET) {
 		if (sk->sk_state == TCP_CLOSE)
-- 
2.30.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries
  2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
  2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
  2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
@ 2021-09-24 22:05 ` Cong Wang
  2 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Yucong Sun, John Fastabend, Daniel Borkmann, Jakub Sitnicki,
	Lorenz Bauer, Cong Wang

From: Yucong Sun <sunyucong@gmail.com>

We use non-blocking sockets in those tests, retrying for
EAGAIN is ugly because there is no upper bound for the packet
arrival time, at least in theory. After we fix ->poll() for
sockmap sockets, now we can switch to select()+recv().

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Yucong Sun <sunyucong@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 .../selftests/bpf/prog_tests/sockmap_listen.c | 75 +++++--------------
 1 file changed, 20 insertions(+), 55 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index 5c5979046523..d88bb65b74cc 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -949,7 +949,6 @@ static void redir_to_connected(int family, int sotype, int sock_mapfd,
 	int err, n;
 	u32 key;
 	char b;
-	int retries = 100;
 
 	zero_verdict_count(verd_mapfd);
 
@@ -1002,17 +1001,11 @@ static void redir_to_connected(int family, int sotype, int sock_mapfd,
 		goto close_peer1;
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
-again:
-	n = read(c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close_peer1:
 	xclose(p1);
@@ -1571,7 +1564,6 @@ static void unix_redir_to_connected(int sotype, int sock_mapfd,
 	const char *log_prefix = redir_mode_str(mode);
 	int c0, c1, p0, p1;
 	unsigned int pass;
-	int retries = 100;
 	int err, n;
 	int sfd[2];
 	u32 key;
@@ -1606,17 +1598,11 @@ static void unix_redir_to_connected(int sotype, int sock_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close:
 	xclose(c1);
@@ -1748,7 +1734,6 @@ static void udp_redir_to_connected(int family, int sock_mapfd, int verd_mapfd,
 	const char *log_prefix = redir_mode_str(mode);
 	int c0, c1, p0, p1;
 	unsigned int pass;
-	int retries = 100;
 	int err, n;
 	u32 key;
 	char b;
@@ -1781,17 +1766,11 @@ static void udp_redir_to_connected(int family, int sock_mapfd, int verd_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close_cli1:
 	xclose(c1);
@@ -1841,7 +1820,6 @@ static void inet_unix_redir_to_connected(int family, int type, int sock_mapfd,
 	const char *log_prefix = redir_mode_str(mode);
 	int c0, c1, p0, p1;
 	unsigned int pass;
-	int retries = 100;
 	int err, n;
 	int sfd[2];
 	u32 key;
@@ -1876,17 +1854,11 @@ static void inet_unix_redir_to_connected(int family, int type, int sock_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close_cli1:
 	xclose(c1);
@@ -1932,7 +1904,6 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
 	int sfd[2];
 	u32 key;
 	char b;
-	int retries = 100;
 
 	zero_verdict_count(verd_mapfd);
 
@@ -1963,17 +1934,11 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close:
 	xclose(c1);
-- 
2.30.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [Patch bpf 2/3] net: poll psock queues too for sockmap sockets
  2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
@ 2021-09-27 18:07   ` John Fastabend
  2021-09-27 19:29     ` Cong Wang
  0 siblings, 1 reply; 6+ messages in thread
From: John Fastabend @ 2021-09-27 18:07 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, Cong Wang, Yucong Sun, John Fastabend, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Yucong noticed we can't poll() sockets in sockmap even
> when they are the destination sockets of redirections.
> This is because we never poll any psock queues in ->poll().
> We can not overwrite ->poll() as it is in struct proto_ops,
> not in struct proto.
> 
> So introduce sk_msg_poll() to poll psock ingress_msg queue
> and let sockets which support sockmap invoke it directly.
> 
> Reported-by: Yucong Sun <sunyucong@gmail.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/linux/skmsg.h |  6 ++++++
>  net/core/skmsg.c      | 15 +++++++++++++++
>  net/ipv4/tcp.c        |  2 ++
>  net/ipv4/udp.c        |  2 ++
>  net/unix/af_unix.c    |  5 +++++
>  5 files changed, 30 insertions(+)
> 

[...]
  						  struct sk_buff *skb)
>  {
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e8b48df73c85..2eb1a87ba056 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -280,6 +280,7 @@
>  #include <linux/uaccess.h>
>  #include <asm/ioctls.h>
>  #include <net/busy_poll.h>
> +#include <linux/skmsg.h>
>  
>  /* Track pending CMSGs. */
>  enum {
> @@ -563,6 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>  
>  		if (tcp_stream_is_readable(sk, target))
>  			mask |= EPOLLIN | EPOLLRDNORM;
> +		mask |= sk_msg_poll(sk);
>  
>  		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
>  			if (__sk_stream_is_writeable(sk, 1)) {


For TCP we implement the stream_memory_read() hook which we implement in
tcp_bpf.c with tcp_bpf_stream_read. This just checks psock->ingress_msg
list which should cover any redirect from skmsg into the ingress side
of another socket.

And the tcp_poll logic is using tcp_stream_is_readable() which is 
checking for sk->sk_prot->stream_memory_read() and then calling it.

The straight receive path, e.g. not redirected from a sender should
be covered by the normal tcp_epollin_ready() checks because this
would be after TCP does the normal updates to rcv_nxt, copied_seq,
etc.

So above is not in the TCP case by my reading. Did I miss a
case? We also have done tests with Envoy which I thought were polling
so I'll check on that as well.

Thanks,
John

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Patch bpf 2/3] net: poll psock queues too for sockmap sockets
  2021-09-27 18:07   ` John Fastabend
@ 2021-09-27 19:29     ` Cong Wang
  0 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-27 19:29 UTC (permalink / raw)
  To: John Fastabend
  Cc: Linux Kernel Network Developers, bpf, Cong Wang, Yucong Sun,
	Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

On Mon, Sep 27, 2021 at 11:07 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Yucong noticed we can't poll() sockets in sockmap even
> > when they are the destination sockets of redirections.
> > This is because we never poll any psock queues in ->poll().
> > We can not overwrite ->poll() as it is in struct proto_ops,
> > not in struct proto.
> >
> > So introduce sk_msg_poll() to poll psock ingress_msg queue
> > and let sockets which support sockmap invoke it directly.
> >
> > Reported-by: Yucong Sun <sunyucong@gmail.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >  include/linux/skmsg.h |  6 ++++++
> >  net/core/skmsg.c      | 15 +++++++++++++++
> >  net/ipv4/tcp.c        |  2 ++
> >  net/ipv4/udp.c        |  2 ++
> >  net/unix/af_unix.c    |  5 +++++
> >  5 files changed, 30 insertions(+)
> >
>
> [...]
>                                                   struct sk_buff *skb)
> >  {
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index e8b48df73c85..2eb1a87ba056 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -280,6 +280,7 @@
> >  #include <linux/uaccess.h>
> >  #include <asm/ioctls.h>
> >  #include <net/busy_poll.h>
> > +#include <linux/skmsg.h>
> >
> >  /* Track pending CMSGs. */
> >  enum {
> > @@ -563,6 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> >
> >               if (tcp_stream_is_readable(sk, target))
> >                       mask |= EPOLLIN | EPOLLRDNORM;
> > +             mask |= sk_msg_poll(sk);
> >
> >               if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
> >                       if (__sk_stream_is_writeable(sk, 1)) {
>
>
> For TCP we implement the stream_memory_read() hook which we implement in
> tcp_bpf.c with tcp_bpf_stream_read. This just checks psock->ingress_msg
> list which should cover any redirect from skmsg into the ingress side
> of another socket.
>
> And the tcp_poll logic is using tcp_stream_is_readable() which is
> checking for sk->sk_prot->stream_memory_read() and then calling it.

Ah, I missed it. It is better to have such a hook in struct proto,
since we just can overwrite it with bpf hooks. Let me rename it
for non-TCP and implement it for UDP and AF_UNIX too.

>
> The straight receive path, e.g. not redirected from a sender should
> be covered by the normal tcp_epollin_ready() checks because this
> would be after TCP does the normal updates to rcv_nxt, copied_seq,
> etc.

Yes.

>
> So above is not in the TCP case by my reading. Did I miss a
> case? We also have done tests with Envoy which I thought were polling
> so I'll check on that as well.

Right, all of these selftests in patch 3/3 are non-TCP.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-09-27 19:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
2021-09-27 18:07   ` John Fastabend
2021-09-27 19:29     ` Cong Wang
2021-09-24 22:05 ` [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries Cong Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).