All of lore.kernel.org
 help / color / mirror / Atom feed
* [Patch bpf 0/3] sock_map: fix ->poll() and update selftests
@ 2021-09-24 22:05 Cong Wang
  2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

This patchset fixes ->poll() on sockmap sockets and update
selftests accordingly with select(). Please check each patch
for more details.

Cong Wang (2):
  skmsg: introduce sk_psock_get_checked()
  net: poll psock queues too for sockmap sockets

Yucong Sun (1):
  selftests/bpf: use recv_timeout() instead of retries

 include/linux/skmsg.h                         | 26 +++++++
 net/core/skmsg.c                              | 15 ++++
 net/core/sock_map.c                           | 22 +-----
 net/ipv4/tcp.c                                |  2 +
 net/ipv4/udp.c                                |  2 +
 net/unix/af_unix.c                            |  5 ++
 .../selftests/bpf/prog_tests/sockmap_listen.c | 75 +++++--------------
 7 files changed, 71 insertions(+), 76 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked()
  2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
@ 2021-09-24 22:05 ` Cong Wang
  2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
  2021-09-24 22:05 ` [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries Cong Wang
  2 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki,
	Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Although we have sk_psock_get(), it assumes the psock
retrieved from sk_user_data is for sockmap, this is not
sufficient if we call it outside of sockmap, for example,
reuseport_array.

Fortunately sock_map_psock_get_checked() is more strict
and checks for sock_map_close before using psock. So we can
refactor it and rename it to sk_psock_get_checked(), which
can be safely called outside of sockmap.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h | 20 ++++++++++++++++++++
 net/core/sock_map.c   | 22 +---------------------
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 14ab0c0bc924..d47097f2c8c0 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -452,6 +452,26 @@ static inline struct sk_psock *sk_psock_get(struct sock *sk)
 	return psock;
 }
 
+static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
+{
+	struct sk_psock *psock;
+
+	rcu_read_lock();
+	psock = sk_psock(sk);
+	if (psock) {
+		if (sk->sk_prot->close != sock_map_close) {
+			psock = ERR_PTR(-EBUSY);
+			goto out;
+		}
+
+		if (!refcount_inc_not_zero(&psock->refcnt))
+			psock = ERR_PTR(-EBUSY);
+	}
+out:
+	rcu_read_unlock();
+	return psock;
+}
+
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock);
 
 static inline void sk_psock_put(struct sock *sk, struct sk_psock *psock)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index e252b8ec2b85..6612bb0b95b5 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -191,26 +191,6 @@ static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
 	return sk->sk_prot->psock_update_sk_prot(sk, psock, false);
 }
 
-static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
-{
-	struct sk_psock *psock;
-
-	rcu_read_lock();
-	psock = sk_psock(sk);
-	if (psock) {
-		if (sk->sk_prot->close != sock_map_close) {
-			psock = ERR_PTR(-EBUSY);
-			goto out;
-		}
-
-		if (!refcount_inc_not_zero(&psock->refcnt))
-			psock = ERR_PTR(-EBUSY);
-	}
-out:
-	rcu_read_unlock();
-	return psock;
-}
-
 static int sock_map_link(struct bpf_map *map, struct sock *sk)
 {
 	struct sk_psock_progs *progs = sock_map_progs(map);
@@ -255,7 +235,7 @@ static int sock_map_link(struct bpf_map *map, struct sock *sk)
 		}
 	}
 
-	psock = sock_map_psock_get_checked(sk);
+	psock = sk_psock_get_checked(sk);
 	if (IS_ERR(psock)) {
 		ret = PTR_ERR(psock);
 		goto out_progs;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [Patch bpf 2/3] net: poll psock queues too for sockmap sockets
  2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
  2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
@ 2021-09-24 22:05 ` Cong Wang
  2021-09-27 18:07   ` John Fastabend
  2021-09-24 22:05 ` [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries Cong Wang
  2 siblings, 1 reply; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Cong Wang, Yucong Sun, John Fastabend, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Yucong noticed we can't poll() sockets in sockmap even
when they are the destination sockets of redirections.
This is because we never poll any psock queues in ->poll().
We can not overwrite ->poll() as it is in struct proto_ops,
not in struct proto.

So introduce sk_msg_poll() to poll psock ingress_msg queue
and let sockets which support sockmap invoke it directly.

Reported-by: Yucong Sun <sunyucong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h |  6 ++++++
 net/core/skmsg.c      | 15 +++++++++++++++
 net/ipv4/tcp.c        |  2 ++
 net/ipv4/udp.c        |  2 ++
 net/unix/af_unix.c    |  5 +++++
 5 files changed, 30 insertions(+)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index d47097f2c8c0..163b0cc1703a 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -128,6 +128,7 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			     struct sk_msg *msg, u32 bytes);
 int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		   int len, int flags);
+__poll_t sk_msg_poll(struct sock *sk);
 
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
 {
@@ -562,5 +563,10 @@ static inline void skb_bpf_redirect_clear(struct sk_buff *skb)
 {
 	skb->_sk_redir = 0;
 }
+#else
+static inline __poll_t sk_msg_poll(struct sock *sk)
+{
+	return 0;
+}
 #endif /* CONFIG_NET_SOCK_MSG */
 #endif /* _LINUX_SKMSG_H */
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2d6249b28928..8e6d7ea43eca 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -474,6 +474,21 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 }
 EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
 
+__poll_t sk_msg_poll(struct sock *sk)
+{
+	struct sk_psock *psock;
+	__poll_t mask = 0;
+
+	psock = sk_psock_get_checked(sk);
+	if (IS_ERR_OR_NULL(psock))
+		return 0;
+	if (!sk_psock_queue_empty(psock))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	sk_psock_put(sk, psock);
+	return mask;
+}
+EXPORT_SYMBOL_GPL(sk_msg_poll);
+
 static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 						  struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df73c85..2eb1a87ba056 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -280,6 +280,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <linux/skmsg.h>
 
 /* Track pending CMSGs. */
 enum {
@@ -563,6 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 		if (tcp_stream_is_readable(sk, target))
 			mask |= EPOLLIN | EPOLLRDNORM;
+		mask |= sk_msg_poll(sk);
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
 			if (__sk_stream_is_writeable(sk, 1)) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8851c9463b4b..fbc989d27388 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -97,6 +97,7 @@
 #include <linux/skbuff.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/skmsg.h>
 #include <net/net_namespace.h>
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
@@ -2866,6 +2867,7 @@ __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	    !(sk->sk_shutdown & RCV_SHUTDOWN) && first_packet_length(sk) == -1)
 		mask &= ~(EPOLLIN | EPOLLRDNORM);
 
+	mask |= sk_msg_poll(sk);
 	return mask;
 
 }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 92345c9bb60c..5d705541d082 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,6 +114,7 @@
 #include <linux/freezer.h>
 #include <linux/file.h>
 #include <linux/btf_ids.h>
+#include <linux/skmsg.h>
 
 #include "scm.h"
 
@@ -3015,6 +3016,8 @@ static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table *wa
 	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
+	mask |= sk_msg_poll(sk);
+
 	/* Connection-based need to check for termination and startup */
 	if ((sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) &&
 	    sk->sk_state == TCP_CLOSE)
@@ -3054,6 +3057,8 @@ static __poll_t unix_dgram_poll(struct file *file, struct socket *sock,
 	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
+	mask |= sk_msg_poll(sk);
+
 	/* Connection-based need to check for termination and startup */
 	if (sk->sk_type == SOCK_SEQPACKET) {
 		if (sk->sk_state == TCP_CLOSE)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries
  2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
  2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
  2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
@ 2021-09-24 22:05 ` Cong Wang
  2 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-24 22:05 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Yucong Sun, John Fastabend, Daniel Borkmann, Jakub Sitnicki,
	Lorenz Bauer, Cong Wang

From: Yucong Sun <sunyucong@gmail.com>

We use non-blocking sockets in those tests, retrying for
EAGAIN is ugly because there is no upper bound for the packet
arrival time, at least in theory. After we fix ->poll() for
sockmap sockets, now we can switch to select()+recv().

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Yucong Sun <sunyucong@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 .../selftests/bpf/prog_tests/sockmap_listen.c | 75 +++++--------------
 1 file changed, 20 insertions(+), 55 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index 5c5979046523..d88bb65b74cc 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -949,7 +949,6 @@ static void redir_to_connected(int family, int sotype, int sock_mapfd,
 	int err, n;
 	u32 key;
 	char b;
-	int retries = 100;
 
 	zero_verdict_count(verd_mapfd);
 
@@ -1002,17 +1001,11 @@ static void redir_to_connected(int family, int sotype, int sock_mapfd,
 		goto close_peer1;
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
-again:
-	n = read(c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close_peer1:
 	xclose(p1);
@@ -1571,7 +1564,6 @@ static void unix_redir_to_connected(int sotype, int sock_mapfd,
 	const char *log_prefix = redir_mode_str(mode);
 	int c0, c1, p0, p1;
 	unsigned int pass;
-	int retries = 100;
 	int err, n;
 	int sfd[2];
 	u32 key;
@@ -1606,17 +1598,11 @@ static void unix_redir_to_connected(int sotype, int sock_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close:
 	xclose(c1);
@@ -1748,7 +1734,6 @@ static void udp_redir_to_connected(int family, int sock_mapfd, int verd_mapfd,
 	const char *log_prefix = redir_mode_str(mode);
 	int c0, c1, p0, p1;
 	unsigned int pass;
-	int retries = 100;
 	int err, n;
 	u32 key;
 	char b;
@@ -1781,17 +1766,11 @@ static void udp_redir_to_connected(int family, int sock_mapfd, int verd_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close_cli1:
 	xclose(c1);
@@ -1841,7 +1820,6 @@ static void inet_unix_redir_to_connected(int family, int type, int sock_mapfd,
 	const char *log_prefix = redir_mode_str(mode);
 	int c0, c1, p0, p1;
 	unsigned int pass;
-	int retries = 100;
 	int err, n;
 	int sfd[2];
 	u32 key;
@@ -1876,17 +1854,11 @@ static void inet_unix_redir_to_connected(int family, int type, int sock_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close_cli1:
 	xclose(c1);
@@ -1932,7 +1904,6 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
 	int sfd[2];
 	u32 key;
 	char b;
-	int retries = 100;
 
 	zero_verdict_count(verd_mapfd);
 
@@ -1963,17 +1934,11 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
 	if (pass != 1)
 		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
 
-again:
-	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
-	if (n < 0) {
-		if (errno == EAGAIN && retries--) {
-			usleep(1000);
-			goto again;
-		}
-		FAIL_ERRNO("%s: read", log_prefix);
-	}
+	n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC);
+	if (n < 0)
+		FAIL_ERRNO("%s: recv_timeout", log_prefix);
 	if (n == 0)
-		FAIL("%s: incomplete read", log_prefix);
+		FAIL("%s: incomplete recv", log_prefix);
 
 close:
 	xclose(c1);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* RE: [Patch bpf 2/3] net: poll psock queues too for sockmap sockets
  2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
@ 2021-09-27 18:07   ` John Fastabend
  2021-09-27 19:29     ` Cong Wang
  0 siblings, 1 reply; 6+ messages in thread
From: John Fastabend @ 2021-09-27 18:07 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, Cong Wang, Yucong Sun, John Fastabend, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Yucong noticed we can't poll() sockets in sockmap even
> when they are the destination sockets of redirections.
> This is because we never poll any psock queues in ->poll().
> We can not overwrite ->poll() as it is in struct proto_ops,
> not in struct proto.
> 
> So introduce sk_msg_poll() to poll psock ingress_msg queue
> and let sockets which support sockmap invoke it directly.
> 
> Reported-by: Yucong Sun <sunyucong@gmail.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/linux/skmsg.h |  6 ++++++
>  net/core/skmsg.c      | 15 +++++++++++++++
>  net/ipv4/tcp.c        |  2 ++
>  net/ipv4/udp.c        |  2 ++
>  net/unix/af_unix.c    |  5 +++++
>  5 files changed, 30 insertions(+)
> 

[...]
  						  struct sk_buff *skb)
>  {
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e8b48df73c85..2eb1a87ba056 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -280,6 +280,7 @@
>  #include <linux/uaccess.h>
>  #include <asm/ioctls.h>
>  #include <net/busy_poll.h>
> +#include <linux/skmsg.h>
>  
>  /* Track pending CMSGs. */
>  enum {
> @@ -563,6 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>  
>  		if (tcp_stream_is_readable(sk, target))
>  			mask |= EPOLLIN | EPOLLRDNORM;
> +		mask |= sk_msg_poll(sk);
>  
>  		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
>  			if (__sk_stream_is_writeable(sk, 1)) {


For TCP we implement the stream_memory_read() hook which we implement in
tcp_bpf.c with tcp_bpf_stream_read. This just checks psock->ingress_msg
list which should cover any redirect from skmsg into the ingress side
of another socket.

And the tcp_poll logic is using tcp_stream_is_readable() which is 
checking for sk->sk_prot->stream_memory_read() and then calling it.

The straight receive path, e.g. not redirected from a sender should
be covered by the normal tcp_epollin_ready() checks because this
would be after TCP does the normal updates to rcv_nxt, copied_seq,
etc.

So above is not in the TCP case by my reading. Did I miss a
case? We also have done tests with Envoy which I thought were polling
so I'll check on that as well.

Thanks,
John

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Patch bpf 2/3] net: poll psock queues too for sockmap sockets
  2021-09-27 18:07   ` John Fastabend
@ 2021-09-27 19:29     ` Cong Wang
  0 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2021-09-27 19:29 UTC (permalink / raw)
  To: John Fastabend
  Cc: Linux Kernel Network Developers, bpf, Cong Wang, Yucong Sun,
	Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

On Mon, Sep 27, 2021 at 11:07 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Yucong noticed we can't poll() sockets in sockmap even
> > when they are the destination sockets of redirections.
> > This is because we never poll any psock queues in ->poll().
> > We can not overwrite ->poll() as it is in struct proto_ops,
> > not in struct proto.
> >
> > So introduce sk_msg_poll() to poll psock ingress_msg queue
> > and let sockets which support sockmap invoke it directly.
> >
> > Reported-by: Yucong Sun <sunyucong@gmail.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >  include/linux/skmsg.h |  6 ++++++
> >  net/core/skmsg.c      | 15 +++++++++++++++
> >  net/ipv4/tcp.c        |  2 ++
> >  net/ipv4/udp.c        |  2 ++
> >  net/unix/af_unix.c    |  5 +++++
> >  5 files changed, 30 insertions(+)
> >
>
> [...]
>                                                   struct sk_buff *skb)
> >  {
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index e8b48df73c85..2eb1a87ba056 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -280,6 +280,7 @@
> >  #include <linux/uaccess.h>
> >  #include <asm/ioctls.h>
> >  #include <net/busy_poll.h>
> > +#include <linux/skmsg.h>
> >
> >  /* Track pending CMSGs. */
> >  enum {
> > @@ -563,6 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> >
> >               if (tcp_stream_is_readable(sk, target))
> >                       mask |= EPOLLIN | EPOLLRDNORM;
> > +             mask |= sk_msg_poll(sk);
> >
> >               if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
> >                       if (__sk_stream_is_writeable(sk, 1)) {
>
>
> For TCP we implement the stream_memory_read() hook which we implement in
> tcp_bpf.c with tcp_bpf_stream_read. This just checks psock->ingress_msg
> list which should cover any redirect from skmsg into the ingress side
> of another socket.
>
> And the tcp_poll logic is using tcp_stream_is_readable() which is
> checking for sk->sk_prot->stream_memory_read() and then calling it.

Ah, I missed it. It is better to have such a hook in struct proto,
since we just can overwrite it with bpf hooks. Let me rename it
for non-TCP and implement it for UDP and AF_UNIX too.

>
> The straight receive path, e.g. not redirected from a sender should
> be covered by the normal tcp_epollin_ready() checks because this
> would be after TCP does the normal updates to rcv_nxt, copied_seq,
> etc.

Yes.

>
> So above is not in the TCP case by my reading. Did I miss a
> case? We also have done tests with Envoy which I thought were polling
> so I'll check on that as well.

Right, all of these selftests in patch 3/3 are non-TCP.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-09-27 19:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-24 22:05 [Patch bpf 0/3] sock_map: fix ->poll() and update selftests Cong Wang
2021-09-24 22:05 ` [Patch bpf 1/3] skmsg: introduce sk_psock_get_checked() Cong Wang
2021-09-24 22:05 ` [Patch bpf 2/3] net: poll psock queues too for sockmap sockets Cong Wang
2021-09-27 18:07   ` John Fastabend
2021-09-27 19:29     ` Cong Wang
2021-09-24 22:05 ` [Patch bpf 3/3] selftests/bpf: use recv_timeout() instead of retries Cong Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.