bpf.vger.kernel.org archive mirror
* [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP
@ 2021-03-02  2:37 Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 1/9] sock_map: introduce BPF_SK_SKB_VERDICT Cong Wang
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev; +Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services connected to a daemon on every host
via AF_UNIX dgram sockets. After they are moved into a VM, we have
to add a proxy to forward these communications from the VM to the
host, because rewriting thousands of them is not practical. This
proxy uses an AF_UNIX socket connected to the services and a UDP
socket connected to the host. It is inefficient because data is
copied between kernel space and user space twice, and we cannot use
splice(), which only supports TCP. Therefore, we want to use sockmap
to do the splicing without going to user space at all (after the
initial setup).

Currently sockmap only fully supports TCP; UDP is only partially
supported, as UDP sockets can merely be added to a sockmap. This
patchset, as the second part of the original large patchset, extends
sockmap with: 1) cross-protocol support via BPF_SK_SKB_VERDICT;
2) full UDP support.

At a high level, each protocol has to provide ->sendmsg_locked() and
->read_sock() to support sockmap redirection, and to update the sock
proto a new op, ->update_proto(), is introduced, which each protocol
is also required to implement. A BPF ->recvmsg() is also needed to
replace the original ->recvmsg() so that skmsg can be retrieved.
Please see each patch for more details.
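
For illustration only (not part of this series), the resulting
user-space setup looks roughly like the sketch below, using libbpf.
The object name "prog.o", the map name "sock_map" and the program
name "prog_skb_verdict" are placeholders; the BPF_SK_SKB_VERDICT
attach type comes from patch 1/9 and the selftest in patch 9/9
exercises the same flow.

#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Splice two already connect()ed UDP sockets through a sockmap. */
static int setup_udp_splice(int udp_a, int udp_b)
{
	struct bpf_object *obj;
	int map_fd, prog_fd, err;
	__u64 val;
	__u32 key;

	obj = bpf_object__open_file("prog.o", NULL);
	if (libbpf_get_error(obj))
		return -1;
	err = bpf_object__load(obj);
	if (err)
		return err;

	map_fd = bpf_object__find_map_fd_by_name(obj, "sock_map");
	prog_fd = bpf_program__fd(bpf_object__find_program_by_name(obj,
							"prog_skb_verdict"));

	/* Attach the skb_verdict program to the sockmap (patch 1/9). */
	err = bpf_prog_attach(prog_fd, map_fd, BPF_SK_SKB_VERDICT, 0);
	if (err)
		return err;

	key = 0;
	val = udp_a;
	err = bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST);
	if (err)
		return err;

	key = 1;
	val = udp_b;
	return bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST);
}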

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
This patchset is also available here:
https://github.com/congwang/linux/tree/sockmap2

---
v2: separate from the original large patchset
    rebase to the latest bpf-next
    split UDP test case
    move inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()

Cong Wang (9):
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk_prot->update_proto()
  udp: implement ->sendmsg_locked()
  udp: implement ->read_sock() for sockmap
  udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  sock_map: update sock type checks for UDP
  selftests/bpf: add a test case for udp sockmap

 include/linux/skmsg.h                         |  25 ++--
 include/net/ipv6.h                            |   1 +
 include/net/sock.h                            |   3 +
 include/net/tcp.h                             |   3 +-
 include/net/udp.h                             |   4 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/syscall.c                          |   1 +
 net/core/skmsg.c                              | 113 +++++++++++++-
 net/core/sock_map.c                           |  52 ++++---
 net/ipv4/af_inet.c                            |   2 +
 net/ipv4/tcp_bpf.c                            | 129 +++-------------
 net/ipv4/tcp_ipv4.c                           |   3 +
 net/ipv4/udp.c                                |  68 ++++++++-
 net/ipv4/udp_bpf.c                            |  78 +++++++++-
 net/ipv6/af_inet6.c                           |   2 +
 net/ipv6/tcp_ipv6.c                           |   3 +
 net/ipv6/udp.c                                |  30 +++-
 net/tls/tls_sw.c                              |   4 +-
 tools/bpf/bpftool/common.c                    |   1 +
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   1 +
 .../selftests/bpf/prog_tests/sockmap_listen.c | 140 ++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 23 files changed, 517 insertions(+), 170 deletions(-)

-- 
2.25.1



* [Patch bpf-next v2 1/9] sock_map: introduce BPF_SK_SKB_VERDICT
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto() Cong Wang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Reusing BPF_SK_SKB_STREAM_VERDICT would be possible, but its name is
confusing and, more importantly, we still want to distinguish the two
from user space. So reuse the stream verdict code, but introduce a
new type of eBPF program, skb_verdict. Users are not allowed to
attach stream_verdict and skb_verdict programs at the same time.
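
For reference, a minimal skb_verdict program has the same shape as
the stream_verdict ones; the sketch below is illustrative only (the
complete version lives in the selftest added by patch 9/9), and the
map name "sock_map" is a placeholder:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u64);
} sock_map SEC(".maps");

SEC("sk_skb/skb_verdict")
int prog_skb_verdict(struct __sk_buff *skb)
{
	__u32 idx = 0;

	/* Redirect every skb to the socket stored at index 0. */
	return bpf_sk_redirect_map(skb, &sock_map, idx, BPF_F_INGRESS);
}

char _license[] SEC("license") = "GPL";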

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h          |  3 +++
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           |  1 +
 net/core/skmsg.c               |  4 +++-
 net/core/sock_map.c            | 23 ++++++++++++++++++++++-
 tools/bpf/bpftool/common.c     |  1 +
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  1 +
 8 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 6c09d94be2e9..451530d41af7 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -58,6 +58,7 @@ struct sk_psock_progs {
 	struct bpf_prog			*msg_parser;
 	struct bpf_prog			*stream_parser;
 	struct bpf_prog			*stream_verdict;
+	struct bpf_prog			*skb_verdict;
 };
 
 enum sk_psock_state_bits {
@@ -442,6 +443,7 @@ static inline void psock_progs_drop(struct sk_psock_progs *progs)
 	psock_set_prog(&progs->msg_parser, NULL);
 	psock_set_prog(&progs->stream_parser, NULL);
 	psock_set_prog(&progs->stream_verdict, NULL);
+	psock_set_prog(&progs->skb_verdict, NULL);
 }
 
 int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb);
@@ -489,5 +491,6 @@ static inline void skb_bpf_redirect_clear(struct sk_buff *skb)
 {
 	skb->_sk_redir = 0;
 }
+
 #endif /* CONFIG_NET_SOCK_MSG */
 #endif /* _LINUX_SKMSG_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b89af20cfa19..1a08ab00a45e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -247,6 +247,7 @@ enum bpf_attach_type {
 	BPF_XDP_CPUMAP,
 	BPF_SK_LOOKUP,
 	BPF_XDP,
+	BPF_SK_SKB_VERDICT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c859bc46d06c..afa803a1553e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2941,6 +2941,7 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_SK_MSG;
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
+	case BPF_SK_SKB_VERDICT:
 		return BPF_PROG_TYPE_SK_SKB;
 	case BPF_LIRC_MODE2:
 		return BPF_PROG_TYPE_LIRC_MODE2;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 07f54015238a..5efd790f1b47 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -693,7 +693,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 	rcu_assign_sk_user_data(sk, NULL);
 	if (psock->progs.stream_parser)
 		sk_psock_stop_strp(sk, psock);
-	else if (psock->progs.stream_verdict)
+	else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
 		sk_psock_stop_verdict(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
 	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
@@ -1010,6 +1010,8 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
 	}
 	skb_set_owner_r(skb, sk);
 	prog = READ_ONCE(psock->progs.stream_verdict);
+	if (!prog)
+		prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
 		skb_dst_drop(skb);
 		skb_bpf_redirect_clear(skb);
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index dd53a7771d7e..3bddd9dd2da2 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -155,6 +155,8 @@ static void sock_map_del_link(struct sock *sk,
 				strp_stop = true;
 			if (psock->saved_data_ready && stab->progs.stream_verdict)
 				verdict_stop = true;
+			if (psock->saved_data_ready && stab->progs.skb_verdict)
+				verdict_stop = true;
 			list_del(&link->list);
 			sk_psock_free_link(link);
 		}
@@ -227,7 +229,7 @@ static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
 static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 			 struct sock *sk)
 {
-	struct bpf_prog *msg_parser, *stream_parser, *stream_verdict;
+	struct bpf_prog *msg_parser, *stream_parser, *stream_verdict, *skb_verdict;
 	struct sk_psock *psock;
 	int ret;
 
@@ -256,6 +258,15 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 		}
 	}
 
+	skb_verdict = READ_ONCE(progs->skb_verdict);
+	if (skb_verdict) {
+		skb_verdict = bpf_prog_inc_not_zero(skb_verdict);
+		if (IS_ERR(skb_verdict)) {
+			ret = PTR_ERR(skb_verdict);
+			goto out_put_msg_parser;
+		}
+	}
+
 	psock = sock_map_psock_get_checked(sk);
 	if (IS_ERR(psock)) {
 		ret = PTR_ERR(psock);
@@ -265,6 +276,7 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 	if (psock) {
 		if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
 		    (stream_parser  && READ_ONCE(psock->progs.stream_parser)) ||
+		    (skb_verdict && READ_ONCE(psock->progs.skb_verdict)) ||
 		    (stream_verdict && READ_ONCE(psock->progs.stream_verdict))) {
 			sk_psock_put(sk, psock);
 			ret = -EBUSY;
@@ -296,6 +308,9 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 	} else if (!stream_parser && stream_verdict && !psock->saved_data_ready) {
 		psock_set_prog(&psock->progs.stream_verdict, stream_verdict);
 		sk_psock_start_verdict(sk,psock);
+	} else if (!stream_verdict && skb_verdict && !psock->saved_data_ready) {
+		psock_set_prog(&psock->progs.skb_verdict, skb_verdict);
+		sk_psock_start_verdict(sk, psock);
 	}
 	write_unlock_bh(&sk->sk_callback_lock);
 	return 0;
@@ -304,6 +319,9 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 out_drop:
 	sk_psock_put(sk, psock);
 out_progs:
+	if (skb_verdict)
+		bpf_prog_put(skb_verdict);
+out_put_msg_parser:
 	if (msg_parser)
 		bpf_prog_put(msg_parser);
 out_put_stream_parser:
@@ -1468,6 +1486,9 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 	case BPF_SK_SKB_STREAM_VERDICT:
 		pprog = &progs->stream_verdict;
 		break;
+	case BPF_SK_SKB_VERDICT:
+		pprog = &progs->skb_verdict;
+		break;
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 65303664417e..1828bba19020 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -57,6 +57,7 @@ const char * const attach_type_name[__MAX_BPF_ATTACH_TYPE] = {
 
 	[BPF_SK_SKB_STREAM_PARSER]	= "sk_skb_stream_parser",
 	[BPF_SK_SKB_STREAM_VERDICT]	= "sk_skb_stream_verdict",
+	[BPF_SK_SKB_VERDICT]		= "sk_skb_verdict",
 	[BPF_SK_MSG_VERDICT]		= "sk_msg_verdict",
 	[BPF_LIRC_MODE2]		= "lirc_mode2",
 	[BPF_FLOW_DISSECTOR]		= "flow_dissector",
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index f2b915b20546..3f067d2d7584 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -76,6 +76,7 @@ enum dump_mode {
 static const char * const attach_type_strings[] = {
 	[BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
 	[BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+	[BPF_SK_SKB_VERDICT] = "skb_verdict",
 	[BPF_SK_MSG_VERDICT] = "msg_verdict",
 	[BPF_FLOW_DISSECTOR] = "flow_dissector",
 	[__MAX_BPF_ATTACH_TYPE] = NULL,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b89af20cfa19..1a08ab00a45e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -247,6 +247,7 @@ enum bpf_attach_type {
 	BPF_XDP_CPUMAP,
 	BPF_SK_LOOKUP,
 	BPF_XDP,
+	BPF_SK_SKB_VERDICT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.25.1



* [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 1/9] sock_map: introduce BPF_SK_SKB_VERDICT Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-02 16:22   ` Lorenz Bauer
  2021-03-02  2:37 ` [Patch bpf-next v2 3/9] udp: implement ->sendmsg_locked() Cong Wang
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Currently sockmap calls into each protocol to update and replace the
struct proto. This certainly won't work when the protocol is
implemented as a module, for example AF_UNIX.

Introduce a new op, sk->sk_prot->update_proto(), so each protocol can
implement its own way to replace the struct proto. This also helps
get rid of symbol dependencies on CONFIG_INET.
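
As a hedged sketch (the names are hypothetical), a protocol "foo"
built as a module would implement the new op roughly the same way
udp_bpf_update_proto() does below: swap in a BPF-aware copy of its
struct proto on install, and put back the proto saved in the psock
on restore.

static struct proto foo_bpf_prot;	/* copy of foo_prot with BPF hooks */

static int foo_bpf_update_proto(struct sock *sk, bool restore)
{
	struct sk_psock *psock = sk_psock(sk);

	if (restore) {
		sk->sk_write_space = psock->saved_write_space;
		/* Pairs with lockless read in sk_clone_lock() */
		WRITE_ONCE(sk->sk_prot, psock->sk_proto);
		return 0;
	}

	/* Pairs with lockless read in sk_clone_lock() */
	WRITE_ONCE(sk->sk_prot, &foo_bpf_prot);
	return 0;
}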

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h | 18 +++---------------
 include/net/sock.h    |  3 +++
 include/net/tcp.h     |  1 +
 include/net/udp.h     |  1 +
 net/core/skmsg.c      |  5 -----
 net/core/sock_map.c   | 24 ++++--------------------
 net/ipv4/tcp_bpf.c    | 23 ++++++++++++++++++++---
 net/ipv4/tcp_ipv4.c   |  3 +++
 net/ipv4/udp.c        |  3 +++
 net/ipv4/udp_bpf.c    | 14 ++++++++++++--
 net/ipv6/tcp_ipv6.c   |  3 +++
 net/ipv6/udp.c        |  3 +++
 12 files changed, 56 insertions(+), 45 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 451530d41af7..b5df69d5d397 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -98,6 +98,7 @@ struct sk_psock {
 	void (*saved_close)(struct sock *sk, long timeout);
 	void (*saved_write_space)(struct sock *sk);
 	void (*saved_data_ready)(struct sock *sk);
+	int  (*saved_update_proto)(struct sock *sk, bool restore);
 	struct proto			*sk_proto;
 	struct sk_psock_work_state	work_state;
 	struct work_struct		work;
@@ -350,25 +351,12 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
 	}
 }
 
-static inline void sk_psock_update_proto(struct sock *sk,
-					 struct sk_psock *psock,
-					 struct proto *ops)
-{
-	/* Pairs with lockless read in sk_clone_lock() */
-	WRITE_ONCE(sk->sk_prot, ops);
-}
-
 static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
 	sk->sk_prot->unhash = psock->saved_unhash;
-	if (inet_csk_has_ulp(sk)) {
-		tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
-	} else {
-		sk->sk_write_space = psock->saved_write_space;
-		/* Pairs with lockless read in sk_clone_lock() */
-		WRITE_ONCE(sk->sk_prot, psock->sk_proto);
-	}
+	if (psock->saved_update_proto)
+		psock->saved_update_proto(sk, true);
 }
 
 static inline void sk_psock_set_state(struct sk_psock *psock,
diff --git a/include/net/sock.h b/include/net/sock.h
index 636810ddcd9b..0e8577c917e8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1184,6 +1184,9 @@ struct proto {
 	void			(*unhash)(struct sock *sk);
 	void			(*rehash)(struct sock *sk);
 	int			(*get_port)(struct sock *sk, unsigned short snum);
+#ifdef CONFIG_BPF_SYSCALL
+	int			(*update_proto)(struct sock *sk, bool restore);
+#endif
 
 	/* Keeping track of sockets in use */
 #ifdef CONFIG_PROC_FS
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 075de26f449d..2efa4e5ea23d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2203,6 +2203,7 @@ struct sk_psock;
 
 #ifdef CONFIG_BPF_SYSCALL
 struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock);
+int tcp_bpf_update_proto(struct sock *sk, bool restore);
 void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 #endif /* CONFIG_BPF_SYSCALL */
 
diff --git a/include/net/udp.h b/include/net/udp.h
index d4d064c59232..df7cc1edc200 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -518,6 +518,7 @@ static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
 #ifdef CONFIG_BPF_SYSCALL
 struct sk_psock;
 struct proto *udp_bpf_get_proto(struct sock *sk, struct sk_psock *psock);
+int udp_bpf_update_proto(struct sock *sk, bool restore);
 #endif
 
 #endif	/* _UDP_H */
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 5efd790f1b47..7dbd8344ec89 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -563,11 +563,6 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 
 	write_lock_bh(&sk->sk_callback_lock);
 
-	if (inet_csk_has_ulp(sk)) {
-		psock = ERR_PTR(-EINVAL);
-		goto out;
-	}
-
 	if (sk->sk_user_data) {
 		psock = ERR_PTR(-EBUSY);
 		goto out;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3bddd9dd2da2..13d2af5bb81c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
 
 static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
 {
-	struct proto *prot;
-
-	switch (sk->sk_type) {
-	case SOCK_STREAM:
-		prot = tcp_bpf_get_proto(sk, psock);
-		break;
-
-	case SOCK_DGRAM:
-		prot = udp_bpf_get_proto(sk, psock);
-		break;
-
-	default:
+	if (!sk->sk_prot->update_proto)
 		return -EINVAL;
-	}
-
-	if (IS_ERR(prot))
-		return PTR_ERR(prot);
-
-	sk_psock_update_proto(sk, psock, prot);
-	return 0;
+	psock->saved_update_proto = sk->sk_prot->update_proto;
+	return sk->sk_prot->update_proto(sk, false);
 }
 
 static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
@@ -570,7 +554,7 @@ static bool sock_map_redirect_allowed(const struct sock *sk)
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
 {
-	return sk_is_tcp(sk) || sk_is_udp(sk);
+	return !!sk->sk_prot->update_proto;
 }
 
 static bool sock_map_sk_state_allowed(const struct sock *sk)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 17c322b875fd..737726c8138c 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -601,19 +601,36 @@ static int tcp_bpf_assert_proto_ops(struct proto *ops)
 	       ops->sendpage == tcp_sendpage ? 0 : -ENOTSUPP;
 }
 
-struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock)
+int tcp_bpf_update_proto(struct sock *sk, bool restore)
 {
+	struct sk_psock *psock = sk_psock(sk);
 	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
 	int config = psock->progs.msg_parser   ? TCP_BPF_TX   : TCP_BPF_BASE;
 
+	if (restore) {
+		if (inet_csk_has_ulp(sk)) {
+			tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
+		} else {
+			sk->sk_write_space = psock->saved_write_space;
+			/* Pairs with lockless read in sk_clone_lock() */
+			WRITE_ONCE(sk->sk_prot, psock->sk_proto);
+		}
+		return 0;
+	}
+
+	if (inet_csk_has_ulp(sk))
+		return -EINVAL;
+
 	if (sk->sk_family == AF_INET6) {
 		if (tcp_bpf_assert_proto_ops(psock->sk_proto))
-			return ERR_PTR(-EINVAL);
+			return -EINVAL;
 
 		tcp_bpf_check_v6_needs_rebuild(psock->sk_proto);
 	}
 
-	return &tcp_bpf_prots[family][config];
+	/* Pairs with lockless read in sk_clone_lock() */
+	WRITE_ONCE(sk->sk_prot, &tcp_bpf_prots[family][config]);
+	return 0;
 }
 
 /* If a child got cloned from a listening socket that had tcp_bpf
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index daad4f99db32..21c9e262d07c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2806,6 +2806,9 @@ struct proto tcp_prot = {
 	.hash			= inet_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.update_proto		= tcp_bpf_update_proto,
+#endif
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
 	.leave_memory_pressure	= tcp_leave_memory_pressure,
 	.stream_memory_free	= tcp_stream_memory_free,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4a0478b17243..dbd25b59ce0e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2849,6 +2849,9 @@ struct proto udp_prot = {
 	.unhash			= udp_lib_unhash,
 	.rehash			= udp_v4_rehash,
 	.get_port		= udp_v4_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.update_proto		= udp_bpf_update_proto,
+#endif
 	.memory_allocated	= &udp_memory_allocated,
 	.sysctl_mem		= sysctl_udp_mem,
 	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_udp_wmem_min),
diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 7a94791efc1a..595836088e85 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -41,12 +41,22 @@ static int __init udp_bpf_v4_build_proto(void)
 }
 core_initcall(udp_bpf_v4_build_proto);
 
-struct proto *udp_bpf_get_proto(struct sock *sk, struct sk_psock *psock)
+int udp_bpf_update_proto(struct sock *sk, bool restore)
 {
 	int family = sk->sk_family == AF_INET ? UDP_BPF_IPV4 : UDP_BPF_IPV6;
+	struct sk_psock *psock = sk_psock(sk);
+
+	if (restore) {
+		sk->sk_write_space = psock->saved_write_space;
+		/* Pairs with lockless read in sk_clone_lock() */
+		WRITE_ONCE(sk->sk_prot, psock->sk_proto);
+		return 0;
+	}
 
 	if (sk->sk_family == AF_INET6)
 		udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
 
-	return &udp_bpf_prots[family];
+	/* Pairs with lockless read in sk_clone_lock() */
+	WRITE_ONCE(sk->sk_prot, &udp_bpf_prots[family]);
+	return 0;
 }
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bd44ded7e50c..ea5be7e7fcb8 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2134,6 +2134,9 @@ struct proto tcpv6_prot = {
 	.hash			= inet6_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.update_proto		= tcp_bpf_update_proto,
+#endif
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
 	.leave_memory_pressure	= tcp_leave_memory_pressure,
 	.stream_memory_free	= tcp_stream_memory_free,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d25e5a9252fd..105ba0cf739d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1713,6 +1713,9 @@ struct proto udpv6_prot = {
 	.unhash			= udp_lib_unhash,
 	.rehash			= udp_v6_rehash,
 	.get_port		= udp_v6_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.update_proto		= udp_bpf_update_proto,
+#endif
 	.memory_allocated	= &udp_memory_allocated,
 	.sysctl_mem		= sysctl_udp_mem,
 	.sysctl_wmem_offset     = offsetof(struct net, ipv4.sysctl_udp_wmem_min),
-- 
2.25.1



* [Patch bpf-next v2 3/9] udp: implement ->sendmsg_locked()
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 1/9] sock_map: introduce BPF_SK_SKB_VERDICT Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto() Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 4/9] udp: implement ->read_sock() for sockmap Cong Wang
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

UDP already has udp_sendmsg(), which takes lock_sock() internally.
We have to build ->sendmsg_locked() on top of it by adding a new
parameter indicating whether the sock has already been locked.
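
For context, a hedged sketch of the caller side: with the sock lock
already held (as in the sockmap transmit path), data can reach the
new hook through the existing kernel_sendmsg_locked() wrapper, which
dispatches to ->sendmsg_locked(), i.e. udp_sendmsg_locked() for a UDP
socket. The function name below is illustrative:

static int foo_send_locked(struct sock *sk, void *data, size_t len)
{
	struct kvec iov = { .iov_base = data, .iov_len = len };
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };

	/* The caller must already hold lock_sock(sk). */
	return kernel_sendmsg_locked(sk, &msg, &iov, 1, len);
}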

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/net/udp.h  |  1 +
 net/ipv4/af_inet.c |  1 +
 net/ipv4/udp.c     | 30 +++++++++++++++++++++++-------
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index df7cc1edc200..5264ba1439f9 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -292,6 +292,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
 int udp_err(struct sk_buff *, u32);
 int udp_abort(struct sock *sk, int err);
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+int udp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
 int udp_push_pending_frames(struct sock *sk);
 void udp_flush_pending_frames(struct sock *sk);
 int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a02ce89b56b5..d8c73a848c53 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1071,6 +1071,7 @@ const struct proto_ops inet_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
+	.sendmsg_locked    = udp_sendmsg_locked,
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index dbd25b59ce0e..93db853601d7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1024,7 +1024,7 @@ int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size)
 }
 EXPORT_SYMBOL_GPL(udp_cmsg_send);
 
-int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+static int __udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len, bool locked)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct udp_sock *up = udp_sk(sk);
@@ -1063,15 +1063,18 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		 * There are pending frames.
 		 * The socket lock must be held while it's corked.
 		 */
-		lock_sock(sk);
+		if (!locked)
+			lock_sock(sk);
 		if (likely(up->pending)) {
 			if (unlikely(up->pending != AF_INET)) {
-				release_sock(sk);
+				if (!locked)
+					release_sock(sk);
 				return -EINVAL;
 			}
 			goto do_append_data;
 		}
-		release_sock(sk);
+		if (!locked)
+			release_sock(sk);
 	}
 	ulen += sizeof(struct udphdr);
 
@@ -1241,11 +1244,13 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto out;
 	}
 
-	lock_sock(sk);
+	if (!locked)
+		lock_sock(sk);
 	if (unlikely(up->pending)) {
 		/* The socket is already corked while preparing it. */
 		/* ... which is an evident application bug. --ANK */
-		release_sock(sk);
+		if (!locked)
+			release_sock(sk);
 
 		net_dbg_ratelimited("socket already corked\n");
 		err = -EINVAL;
@@ -1272,7 +1277,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = udp_push_pending_frames(sk);
 	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
 		up->pending = 0;
-	release_sock(sk);
+	if (!locked)
+		release_sock(sk);
 
 out:
 	ip_rt_put(rt);
@@ -1302,8 +1308,18 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	err = 0;
 	goto out;
 }
+
+int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	return __udp_sendmsg(sk, msg, len, false);
+}
 EXPORT_SYMBOL(udp_sendmsg);
 
+int udp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	return __udp_sendmsg(sk, msg, len, true);
+}
+
 int udp_sendpage(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags)
 {
-- 
2.25.1



* [Patch bpf-next v2 4/9] udp: implement ->read_sock() for sockmap
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (2 preceding siblings ...)
  2021-03-02  2:37 ` [Patch bpf-next v2 3/9] udp: implement ->sendmsg_locked() Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-03  6:26   ` Yonghong Song
  2021-03-02  2:37 ` [Patch bpf-next v2 5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6 Cong Wang
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/net/udp.h  |  2 ++
 net/ipv4/af_inet.c |  1 +
 net/ipv4/udp.c     | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/include/net/udp.h b/include/net/udp.h
index 5264ba1439f9..44a94cfc63b5 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -330,6 +330,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
 			       struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
+int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
+		  sk_read_actor_t recv_actor);
 
 /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
  * possibly multiple cache miss on dequeue()
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d8c73a848c53..df8e8e238756 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1072,6 +1072,7 @@ const struct proto_ops inet_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.sendmsg_locked    = udp_sendmsg_locked,
+	.read_sock	   = udp_read_sock,
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 93db853601d7..54f24b1d4f65 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1798,6 +1798,40 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
 }
 EXPORT_SYMBOL(__skb_recv_udp);
 
+int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
+		  sk_read_actor_t recv_actor)
+{
+	int copied = 0;
+
+	while (1) {
+		int offset = 0, err;
+		struct sk_buff *skb;
+
+		skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
+		if (!skb)
+			break;
+		if (offset < skb->len) {
+			int used;
+			size_t len;
+
+			len = skb->len - offset;
+			used = recv_actor(desc, skb, offset, len);
+			if (used <= 0) {
+				if (!copied)
+					copied = used;
+				break;
+			} else if (used <= len) {
+				copied += used;
+				offset += used;
+			}
+		}
+		if (!desc->count)
+			break;
+	}
+
+	return copied;
+}
+
 /*
  * 	This should be easy, if there is something there we
  * 	return it, otherwise we block.
-- 
2.25.1



* [Patch bpf-next v2 5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (3 preceding siblings ...)
  2021-03-02  2:37 ` [Patch bpf-next v2 4/9] udp: implement ->read_sock() for sockmap Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-02 16:23   ` Lorenz Bauer
  2021-03-02  2:37 ` [Patch bpf-next v2 6/9] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() Cong Wang
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Similarly, udpv6_sendmsg() takes lock_sock() internally too, so we
have to build ->sendmsg_locked() on top of it.

For ->read_sock(), we can just use udp_read_sock().

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/net/ipv6.h  |  1 +
 net/ipv4/udp.c      |  1 +
 net/ipv6/af_inet6.c |  2 ++
 net/ipv6/udp.c      | 27 +++++++++++++++++++++------
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index bd1f396cc9c7..48b6850dae85 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1119,6 +1119,7 @@ int inet6_hash_connect(struct inet_timewait_death_row *death_row,
 int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
 int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		  int flags);
+int udpv6_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
 
 /*
  * reassembly.c
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 54f24b1d4f65..717c543aaec3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1831,6 +1831,7 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
 
 	return copied;
 }
+EXPORT_SYMBOL(udp_read_sock);
 
 /*
  * 	This should be easy, if there is something there we
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 1fb75f01756c..634ab3a825d7 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -714,7 +714,9 @@ const struct proto_ops inet6_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,	/* ok		*/
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet6_sendmsg,		/* retpoline's sake */
+	.sendmsg_locked	   = udpv6_sendmsg_locked,
 	.recvmsg	   = inet6_recvmsg,		/* retpoline's sake */
+	.read_sock	   = udp_read_sock,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 	.set_peek_off	   = sk_set_peek_off,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 105ba0cf739d..4372597bc271 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1272,7 +1272,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
 	return err;
 }
 
-int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+static int __udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len, bool locked)
 {
 	struct ipv6_txoptions opt_space;
 	struct udp_sock *up = udp_sk(sk);
@@ -1361,7 +1361,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		 * There are pending frames.
 		 * The socket lock must be held while it's corked.
 		 */
-		lock_sock(sk);
+		if (!locked)
+			lock_sock(sk);
 		if (likely(up->pending)) {
 			if (unlikely(up->pending != AF_INET6)) {
 				release_sock(sk);
@@ -1370,7 +1371,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			dst = NULL;
 			goto do_append_data;
 		}
-		release_sock(sk);
+		if (!locked)
+			release_sock(sk);
 	}
 	ulen += sizeof(struct udphdr);
 
@@ -1533,11 +1535,13 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto out;
 	}
 
-	lock_sock(sk);
+	if (!locked)
+		lock_sock(sk);
 	if (unlikely(up->pending)) {
 		/* The socket is already corked while preparing it. */
 		/* ... which is an evident application bug. --ANK */
-		release_sock(sk);
+		if (!locked)
+			release_sock(sk);
 
 		net_dbg_ratelimited("udp cork app bug 2\n");
 		err = -EINVAL;
@@ -1562,7 +1566,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (err > 0)
 		err = np->recverr ? net_xmit_errno(err) : 0;
-	release_sock(sk);
+	if (!locked)
+		release_sock(sk);
 
 out:
 	dst_release(dst);
@@ -1593,6 +1598,16 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	goto out;
 }
 
+int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	return __udpv6_sendmsg(sk, msg, len, false);
+}
+
+int udpv6_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	return __udpv6_sendmsg(sk, msg, len, true);
+}
+
 void udpv6_destroy_sock(struct sock *sk)
 {
 	struct udp_sock *up = udp_sk(sk);
-- 
2.25.1



* [Patch bpf-next v2 6/9] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (4 preceding siblings ...)
  2021-03-02  2:37 ` [Patch bpf-next v2 5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6 Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 7/9] udp: implement udp_bpf_recvmsg() for sockmap Cong Wang
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Although these two functions are currently only used by TCP, they
are not specific to TCP at all: both operate on skmsg and
ingress_msg, so they fit very well in net/core/skmsg.c.

We will also need them for non-TCP protocols, so rename them, move
them to skmsg.c, and export them to modules.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h |   4 ++
 include/net/tcp.h     |   2 -
 net/core/skmsg.c      | 104 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_bpf.c    | 106 +-----------------------------------------
 net/tls/tls_sw.c      |   4 +-
 5 files changed, 112 insertions(+), 108 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index b5df69d5d397..8c24495d8d33 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -126,6 +126,10 @@ int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
 			      struct sk_msg *msg, u32 bytes);
 int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			     struct sk_msg *msg, u32 bytes);
+int sk_msg_wait_data(struct sock *sk, struct sk_psock *psock, int flags,
+		     long timeo, int *err);
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		   int len, int flags);
 
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
 {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2efa4e5ea23d..31b1696c62ba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2209,8 +2209,6 @@ void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 
 int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, u32 bytes,
 			  int flags);
-int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
-		      struct msghdr *msg, int len, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
 
 #if !defined(CONFIG_BPF_SYSCALL) || !defined(CONFIG_NET_SOCK_MSG)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 7dbd8344ec89..fa10d869a728 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -399,6 +399,110 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 }
 EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
 
+int sk_msg_wait_data(struct sock *sk, struct sk_psock *psock, int flags,
+		     long timeo, int *err)
+{
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
+	int ret = 0;
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		return 1;
+
+	if (!timeo)
+		return ret;
+
+	add_wait_queue(sk_sleep(sk), &wait);
+	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	ret = sk_wait_event(sk, &timeo,
+			    !list_empty(&psock->ingress_msg) ||
+			    !skb_queue_empty(&sk->sk_receive_queue), &wait);
+	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	remove_wait_queue(sk_sleep(sk), &wait);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sk_msg_wait_data);
+
+/* Receive sk_msg from psock->ingress_msg to @msg. */
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		   int len, int flags)
+{
+	struct iov_iter *iter = &msg->msg_iter;
+	int peek = flags & MSG_PEEK;
+	struct sk_msg *msg_rx;
+	int i, copied = 0;
+
+	msg_rx = list_first_entry_or_null(&psock->ingress_msg,
+					  struct sk_msg, list);
+
+	while (copied != len) {
+		struct scatterlist *sge;
+
+		if (unlikely(!msg_rx))
+			break;
+
+		i = msg_rx->sg.start;
+		do {
+			struct page *page;
+			int copy;
+
+			sge = sk_msg_elem(msg_rx, i);
+			copy = sge->length;
+			page = sg_page(sge);
+			if (copied + copy > len)
+				copy = len - copied;
+			copy = copy_page_to_iter(page, sge->offset, copy, iter);
+			if (!copy)
+				return copied ? copied : -EFAULT;
+
+			copied += copy;
+			if (likely(!peek)) {
+				sge->offset += copy;
+				sge->length -= copy;
+				if (!msg_rx->skb)
+					sk_mem_uncharge(sk, copy);
+				msg_rx->sg.size -= copy;
+
+				if (!sge->length) {
+					sk_msg_iter_var_next(i);
+					if (!msg_rx->skb)
+						put_page(page);
+				}
+			} else {
+				/* Lets not optimize peek case if copy_page_to_iter
+				 * didn't copy the entire length lets just break.
+				 */
+				if (copy != sge->length)
+					return copied;
+				sk_msg_iter_var_next(i);
+			}
+
+			if (copied == len)
+				break;
+		} while (i != msg_rx->sg.end);
+
+		if (unlikely(peek)) {
+			if (msg_rx == list_last_entry(&psock->ingress_msg,
+						      struct sk_msg, list))
+				break;
+			msg_rx = list_next_entry(msg_rx, list);
+			continue;
+		}
+
+		msg_rx->sg.start = i;
+		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
+			list_del(&msg_rx->list);
+			if (msg_rx->skb)
+				consume_skb(msg_rx->skb);
+			kfree(msg_rx);
+		}
+		msg_rx = list_first_entry_or_null(&psock->ingress_msg,
+						  struct sk_msg, list);
+	}
+
+	return copied;
+}
+EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
+
 static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 						  struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 737726c8138c..d2c5394bfb5e 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -10,86 +10,6 @@
 #include <net/inet_common.h>
 #include <net/tls.h>
 
-int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
-		      struct msghdr *msg, int len, int flags)
-{
-	struct iov_iter *iter = &msg->msg_iter;
-	int peek = flags & MSG_PEEK;
-	struct sk_msg *msg_rx;
-	int i, copied = 0;
-
-	msg_rx = list_first_entry_or_null(&psock->ingress_msg,
-					  struct sk_msg, list);
-
-	while (copied != len) {
-		struct scatterlist *sge;
-
-		if (unlikely(!msg_rx))
-			break;
-
-		i = msg_rx->sg.start;
-		do {
-			struct page *page;
-			int copy;
-
-			sge = sk_msg_elem(msg_rx, i);
-			copy = sge->length;
-			page = sg_page(sge);
-			if (copied + copy > len)
-				copy = len - copied;
-			copy = copy_page_to_iter(page, sge->offset, copy, iter);
-			if (!copy)
-				return copied ? copied : -EFAULT;
-
-			copied += copy;
-			if (likely(!peek)) {
-				sge->offset += copy;
-				sge->length -= copy;
-				if (!msg_rx->skb)
-					sk_mem_uncharge(sk, copy);
-				msg_rx->sg.size -= copy;
-
-				if (!sge->length) {
-					sk_msg_iter_var_next(i);
-					if (!msg_rx->skb)
-						put_page(page);
-				}
-			} else {
-				/* Lets not optimize peek case if copy_page_to_iter
-				 * didn't copy the entire length lets just break.
-				 */
-				if (copy != sge->length)
-					return copied;
-				sk_msg_iter_var_next(i);
-			}
-
-			if (copied == len)
-				break;
-		} while (i != msg_rx->sg.end);
-
-		if (unlikely(peek)) {
-			if (msg_rx == list_last_entry(&psock->ingress_msg,
-						      struct sk_msg, list))
-				break;
-			msg_rx = list_next_entry(msg_rx, list);
-			continue;
-		}
-
-		msg_rx->sg.start = i;
-		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
-			list_del(&msg_rx->list);
-			if (msg_rx->skb)
-				consume_skb(msg_rx->skb);
-			kfree(msg_rx);
-		}
-		msg_rx = list_first_entry_or_null(&psock->ingress_msg,
-						  struct sk_msg, list);
-	}
-
-	return copied;
-}
-EXPORT_SYMBOL_GPL(__tcp_bpf_recvmsg);
-
 static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
 			   struct sk_msg *msg, u32 apply_bytes, int flags)
 {
@@ -243,28 +163,6 @@ static bool tcp_bpf_stream_read(const struct sock *sk)
 	return !empty;
 }
 
-static int tcp_bpf_wait_data(struct sock *sk, struct sk_psock *psock,
-			     int flags, long timeo, int *err)
-{
-	DEFINE_WAIT_FUNC(wait, woken_wake_function);
-	int ret = 0;
-
-	if (sk->sk_shutdown & RCV_SHUTDOWN)
-		return 1;
-
-	if (!timeo)
-		return ret;
-
-	add_wait_queue(sk_sleep(sk), &wait);
-	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
-	ret = sk_wait_event(sk, &timeo,
-			    !list_empty(&psock->ingress_msg) ||
-			    !skb_queue_empty(&sk->sk_receive_queue), &wait);
-	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
-	remove_wait_queue(sk_sleep(sk), &wait);
-	return ret;
-}
-
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		    int nonblock, int flags, int *addr_len)
 {
@@ -284,13 +182,13 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}
 	lock_sock(sk);
 msg_bytes_ready:
-	copied = __tcp_bpf_recvmsg(sk, psock, msg, len, flags);
+	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
 	if (!copied) {
 		int data, err = 0;
 		long timeo;
 
 		timeo = sock_rcvtimeo(sk, nonblock);
-		data = tcp_bpf_wait_data(sk, psock, flags, timeo, &err);
+		data = sk_msg_wait_data(sk, psock, flags, timeo, &err);
 		if (data) {
 			if (!sk_psock_queue_empty(psock))
 				goto msg_bytes_ready;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 01d933ae5f16..1dcb34dfd56b 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1789,8 +1789,8 @@ int tls_sw_recvmsg(struct sock *sk,
 		skb = tls_wait_data(sk, psock, flags, timeo, &err);
 		if (!skb) {
 			if (psock) {
-				int ret = __tcp_bpf_recvmsg(sk, psock,
-							    msg, len, flags);
+				int ret = sk_msg_recvmsg(sk, psock, msg, len,
+							 flags);
 
 				if (ret > 0) {
 					decrypted += ret;
-- 
2.25.1



* [Patch bpf-next v2 7/9] udp: implement udp_bpf_recvmsg() for sockmap
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (5 preceding siblings ...)
  2021-03-02  2:37 ` [Patch bpf-next v2 6/9] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP Cong Wang
  2021-03-02  2:37 ` [Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap Cong Wang
  8 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

We have to implement udp_bpf_recvmsg() to replace the original
->recvmsg(), so that skmsg can be retrieved from ingress_msg.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/ipv4/udp_bpf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 595836088e85..9a37ba056575 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -4,6 +4,68 @@
 #include <linux/skmsg.h>
 #include <net/sock.h>
 #include <net/udp.h>
+#include <net/inet_common.h>
+
+#include "udp_impl.h"
+
+static struct proto *udpv6_prot_saved __read_mostly;
+
+static int sk_udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			  int noblock, int flags, int *addr_len)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		return udpv6_prot_saved->recvmsg(sk, msg, len, noblock, flags,
+						 addr_len);
+#endif
+	return udp_prot.recvmsg(sk, msg, len, noblock, flags, addr_len);
+}
+
+static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			   int nonblock, int flags, int *addr_len)
+{
+	struct sk_psock *psock;
+	int copied, ret;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return inet_recv_error(sk, msg, len, addr_len);
+
+	psock = sk_psock_get(sk);
+	if (unlikely(!psock))
+		return sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+
+	lock_sock(sk);
+	if (sk_psock_queue_empty(psock)) {
+		ret = sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+		goto out;
+	}
+
+msg_bytes_ready:
+	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	if (!copied) {
+		int data, err = 0;
+		long timeo;
+
+		timeo = sock_rcvtimeo(sk, nonblock);
+		data = sk_msg_wait_data(sk, psock, flags, timeo, &err);
+		if (data) {
+			if (!sk_psock_queue_empty(psock))
+				goto msg_bytes_ready;
+			ret = sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+			goto out;
+		}
+		if (err) {
+			ret = err;
+			goto out;
+		}
+		copied = -EAGAIN;
+	}
+	ret = copied;
+out:
+	release_sock(sk);
+	sk_psock_put(sk, psock);
+	return ret;
+}
 
 enum {
 	UDP_BPF_IPV4,
@@ -11,7 +73,6 @@ enum {
 	UDP_BPF_NUM_PROTS,
 };
 
-static struct proto *udpv6_prot_saved __read_mostly;
 static DEFINE_SPINLOCK(udpv6_prot_lock);
 static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
 
@@ -20,6 +81,7 @@ static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
 	*prot        = *base;
 	prot->unhash = sock_map_unhash;
 	prot->close  = sock_map_close;
+	prot->recvmsg = udp_bpf_recvmsg;
 }
 
 static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
-- 
2.25.1



* [Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (6 preceding siblings ...)
  2021-03-02  2:37 ` [Patch bpf-next v2 7/9] udp: implement udp_bpf_recvmsg() for sockmap Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-03  6:37   ` Yonghong Song
  2021-03-02  2:37 ` [Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap Cong Wang
  8 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Now that UDP supports sockmap and redirection, we can safely update
the sock type checks for it accordingly.
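
One user-visible consequence (sketch below, illustrative only): a UDP
socket must be connect()ed, i.e. in TCP_ESTABLISHED state, before it
can be used as a redirection target, just as the selftest in patch
9/9 does:

#include <sys/socket.h>
#include <netinet/in.h>

static int make_udp_redirect_target(const struct sockaddr_in *peer)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	/* An unconnected UDP socket would not pass sock_map_redirect_allowed(). */
	if (connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0)
		return -1;
	return fd;
}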

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/core/sock_map.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 13d2af5bb81c..f7eee4b7b994 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
 
 static bool sock_map_redirect_allowed(const struct sock *sk)
 {
-	return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
+	if (sk_is_tcp(sk))
+		return sk->sk_state != TCP_LISTEN;
+	else
+		return sk->sk_state == TCP_ESTABLISHED;
 }
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
-- 
2.25.1



* [Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap
  2021-03-02  2:37 [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (7 preceding siblings ...)
  2021-03-02  2:37 ` [Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP Cong Wang
@ 2021-03-02  2:37 ` Cong Wang
  2021-03-02 16:31   ` Lorenz Bauer
  8 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-02  2:37 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Add a test case to ensure redirection between two UDP sockets works.
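
For reference (not part of the patch), the new subtests should be
runnable with something like the following from
tools/testing/selftests/bpf:

	./test_progs -t sockmap_listen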

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 .../selftests/bpf/prog_tests/sockmap_listen.c | 140 ++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 2 files changed, 162 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index c26e6bf05e49..a549ebd3b5a6 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -1563,6 +1563,142 @@ static void test_redir(struct test_sockmap_listen *skel, struct bpf_map *map,
 	}
 }
 
+static void udp_redir_to_connected(int family, int sotype, int sock_mapfd,
+				   int verd_mapfd, enum redir_mode mode)
+{
+	const char *log_prefix = redir_mode_str(mode);
+	struct sockaddr_storage addr;
+	int c0, c1, p0, p1;
+	unsigned int pass;
+	socklen_t len;
+	int err, n;
+	u64 value;
+	u32 key;
+	char b;
+
+	zero_verdict_count(verd_mapfd);
+
+	p0 = socket_loopback(family, sotype | SOCK_NONBLOCK);
+	if (p0 < 0)
+		return;
+	len = sizeof(addr);
+	err = xgetsockname(p0, sockaddr(&addr), &len);
+	if (err)
+		goto close_peer0;
+
+	c0 = xsocket(family, sotype | SOCK_NONBLOCK, 0);
+	if (c0 < 0)
+		goto close_peer0;
+	err = xconnect(c0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+	err = xgetsockname(c0, sockaddr(&addr), &len);
+	if (err)
+		goto close_cli0;
+	err = xconnect(p0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+
+	p1 = socket_loopback(family, sotype | SOCK_NONBLOCK);
+	if (p1 < 0)
+		goto close_cli0;
+	err = xgetsockname(p1, sockaddr(&addr), &len);
+	if (err)
+		goto close_cli0;
+
+	c1 = xsocket(family, sotype | SOCK_NONBLOCK, 0);
+	if (c1 < 0)
+		goto close_peer1;
+	err = xconnect(c1, sockaddr(&addr), len);
+	if (err)
+		goto close_cli1;
+	err = xgetsockname(c1, sockaddr(&addr), &len);
+	if (err)
+		goto close_cli1;
+	err = xconnect(p1, sockaddr(&addr), len);
+	if (err)
+		goto close_cli1;
+
+	key = 0;
+	value = p0;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_cli1;
+
+	key = 1;
+	value = p1;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_cli1;
+
+	n = write(c1, "a", 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: write", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete write", log_prefix);
+	if (n < 1)
+		goto close_cli1;
+
+	key = SK_PASS;
+	err = xbpf_map_lookup_elem(verd_mapfd, &key, &pass);
+	if (err)
+		goto close_cli1;
+	if (pass != 1)
+		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
+
+	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: read", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete read", log_prefix);
+
+close_cli1:
+	xclose(c1);
+close_peer1:
+	xclose(p1);
+close_cli0:
+	xclose(c0);
+close_peer0:
+	xclose(p0);
+}
+
+static void udp_skb_redir_to_connected(struct test_sockmap_listen *skel,
+					   struct bpf_map *inner_map, int family,
+					   int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(inner_map);
+	int err;
+
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_SKB_VERDICT, 0);
+	if (err)
+		return;
+
+	skel->bss->test_ingress = false;
+	udp_redir_to_connected(family, sotype, sock_map, verdict_map,
+			       REDIR_EGRESS);
+	skel->bss->test_ingress = true;
+	udp_redir_to_connected(family, sotype, sock_map, verdict_map,
+			       REDIR_INGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_VERDICT);
+}
+
+static void test_udp_redir(struct test_sockmap_listen *skel, struct bpf_map *map,
+			   int family)
+{
+	const char *family_name, *map_name;
+	char s[MAX_TEST_NAME];
+
+	family_name = family_str(family);
+	map_name = map_type_str(map);
+	snprintf(s, sizeof(s), "%s %s %s", map_name, family_name, __func__);
+	if (!test__start_subtest(s))
+		return;
+	udp_skb_redir_to_connected(skel, map, family, SOCK_DGRAM);
+}
+
 static void test_reuseport(struct test_sockmap_listen *skel,
 			   struct bpf_map *map, int family, int sotype)
 {
@@ -1626,10 +1762,14 @@ void test_sockmap_listen(void)
 	skel->bss->test_sockmap = true;
 	run_tests(skel, skel->maps.sock_map, AF_INET);
 	run_tests(skel, skel->maps.sock_map, AF_INET6);
+	test_udp_redir(skel, skel->maps.sock_map, AF_INET);
+	test_udp_redir(skel, skel->maps.sock_map, AF_INET6);
 
 	skel->bss->test_sockmap = false;
 	run_tests(skel, skel->maps.sock_hash, AF_INET);
 	run_tests(skel, skel->maps.sock_hash, AF_INET6);
+	test_udp_redir(skel, skel->maps.sock_hash, AF_INET);
+	test_udp_redir(skel, skel->maps.sock_hash, AF_INET6);
 
 	test_sockmap_listen__destroy(skel);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
index fa221141e9c1..a39eba9f5201 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
@@ -29,6 +29,7 @@ struct {
 } verdict_map SEC(".maps");
 
 static volatile bool test_sockmap; /* toggled by user-space */
+static volatile bool test_ingress; /* toggled by user-space */
 
 SEC("sk_skb/stream_parser")
 int prog_stream_parser(struct __sk_buff *skb)
@@ -55,6 +56,27 @@ int prog_stream_verdict(struct __sk_buff *skb)
 	return verdict;
 }
 
+SEC("sk_skb/skb_verdict")
+int prog_skb_verdict(struct __sk_buff *skb)
+{
+	unsigned int *count;
+	__u32 zero = 0;
+	int verdict;
+
+	if (test_sockmap)
+		verdict = bpf_sk_redirect_map(skb, &sock_map, zero,
+					      test_ingress ? BPF_F_INGRESS : 0);
+	else
+		verdict = bpf_sk_redirect_hash(skb, &sock_hash, &zero,
+					       test_ingress ? BPF_F_INGRESS : 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
 SEC("sk_msg")
 int prog_msg_verdict(struct sk_msg_md *msg)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-02  2:37 ` [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto() Cong Wang
@ 2021-03-02 16:22   ` Lorenz Bauer
  2021-03-02 18:23     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Lorenz Bauer @ 2021-03-02 16:22 UTC (permalink / raw)
  To: Cong Wang
  Cc: Networking, bpf, duanxiongchun, wangdongdong.6, jiang.wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

...

> @@ -350,25 +351,12 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
>         }
>  }
>
> -static inline void sk_psock_update_proto(struct sock *sk,
> -                                        struct sk_psock *psock,
> -                                        struct proto *ops)
> -{
> -       /* Pairs with lockless read in sk_clone_lock() */
> -       WRITE_ONCE(sk->sk_prot, ops);
> -}
> -
>  static inline void sk_psock_restore_proto(struct sock *sk,
>                                           struct sk_psock *psock)
>  {
>         sk->sk_prot->unhash = psock->saved_unhash;

Not related to your patch set, but why do an extra restore of
sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
/ udp_bpf protos, so overwriting that seems wrong?

> -       if (inet_csk_has_ulp(sk)) {
> -               tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
> -       } else {
> -               sk->sk_write_space = psock->saved_write_space;
> -               /* Pairs with lockless read in sk_clone_lock() */
> -               WRITE_ONCE(sk->sk_prot, psock->sk_proto);
> -       }
> +       if (psock->saved_update_proto)
> +               psock->saved_update_proto(sk, true);
>  }
>
>  static inline void sk_psock_set_state(struct sk_psock *psock,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 636810ddcd9b..0e8577c917e8 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1184,6 +1184,9 @@ struct proto {
>         void                    (*unhash)(struct sock *sk);
>         void                    (*rehash)(struct sock *sk);
>         int                     (*get_port)(struct sock *sk, unsigned short snum);
> +#ifdef CONFIG_BPF_SYSCALL
> +       int                     (*update_proto)(struct sock *sk, bool restore);

Kind of a nit, but this name suggests that the callback is a lot more
generic than it really is. The only thing you can use it for is to
prep the socket to be sockmap ready since we hardwire sockmap_unhash,
etc. It's also not at all clear that this only works if sk has an
sk_psock associated with it. Calling it without one would crash the
kernel since the update_proto functions don't check for !sk_psock.

Might as well call it install_sockmap_hooks or something and have a
valid sk_psock be passed in to the callback. Additionally, I'd prefer
if the function returned a struct proto * like it does at the moment.
That way we keep sk->sk_prot manipulation confined to the sockmap code
and don't have to copy paste it into every proto.
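
Something like this is the rough shape I have in mind (just a sketch,
the callback name here is only a placeholder):

struct proto *(*install_sockmap_hooks)(struct sock *sk,
				       struct sk_psock *psock);

static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
{
	struct proto *prot;

	if (!sk->sk_prot->install_sockmap_hooks)
		return -EINVAL;

	prot = sk->sk_prot->install_sockmap_hooks(sk, psock);
	if (IS_ERR(prot))
		return PTR_ERR(prot);

	/* sk->sk_prot manipulation stays confined to sockmap code;
	 * pairs with the lockless read in sk_clone_lock()
	 */
	WRITE_ONCE(sk->sk_prot, prot);
	return 0;
}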

> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 3bddd9dd2da2..13d2af5bb81c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
>
>  static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
>  {
> -       struct proto *prot;
> -
> -       switch (sk->sk_type) {
> -       case SOCK_STREAM:
> -               prot = tcp_bpf_get_proto(sk, psock);
> -               break;
> -
> -       case SOCK_DGRAM:
> -               prot = udp_bpf_get_proto(sk, psock);
> -               break;
> -
> -       default:
> +       if (!sk->sk_prot->update_proto)
>                 return -EINVAL;
> -       }
> -
> -       if (IS_ERR(prot))
> -               return PTR_ERR(prot);
> -
> -       sk_psock_update_proto(sk, psock, prot);
> -       return 0;
> +       psock->saved_update_proto = sk->sk_prot->update_proto;
> +       return sk->sk_prot->update_proto(sk, false);

I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've
not been diligent about this so far, but I think it makes sense to be
careful in new code.
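
Concretely, something like this for the hunk above (sketch only):

static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
{
	/* annotate the lockless read; pairs with the WRITE_ONCE() writers */
	struct proto *prot = READ_ONCE(sk->sk_prot);

	if (!prot->update_proto)
		return -EINVAL;
	psock->saved_update_proto = prot->update_proto;
	return prot->update_proto(sk, false);
}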

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  2021-03-02  2:37 ` [Patch bpf-next v2 5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6 Cong Wang
@ 2021-03-02 16:23   ` Lorenz Bauer
  2021-03-02 17:59     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Lorenz Bauer @ 2021-03-02 16:23 UTC (permalink / raw)
  To: Cong Wang
  Cc: Networking, bpf, duanxiongchun, wangdongdong.6, jiang.wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, 2 Mar 2021 at 02:38, Cong Wang <xiyou.wangcong@gmail.com> wrote:

...

> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index bd1f396cc9c7..48b6850dae85 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -1119,6 +1119,7 @@ int inet6_hash_connect(struct inet_timewait_death_row *death_row,
>  int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
>  int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>                   int flags);
> +int udpv6_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
>
>  /*
>   * reassembly.c
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 54f24b1d4f65..717c543aaec3 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1831,6 +1831,7 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
>
>         return copied;
>  }
> +EXPORT_SYMBOL(udp_read_sock);

Should this be in the previous commit?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap
  2021-03-02  2:37 ` [Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap Cong Wang
@ 2021-03-02 16:31   ` Lorenz Bauer
  2021-03-02 18:05     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Lorenz Bauer @ 2021-03-02 16:31 UTC (permalink / raw)
  To: Cong Wang
  Cc: Networking, bpf, duanxiongchun, wangdongdong.6, jiang.wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, 2 Mar 2021 at 02:38, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> From: Cong Wang <cong.wang@bytedance.com>
>
> Add a test case to ensure redirection between two UDP sockets work.

I basically don't understand how splicing works, but watching from the
sidelines makes me think it'd be good to have more thorough tests.
tools/testing/selftests/bpf/test_sockmap.c has quite elaborate tests
for the TCP part, it'd be nice to get similar tests going for UDP. For
example:

* sendfile?
* sendmmsg
* Something Jakub mentioned: what happens when a connected, spliced
socket is disconnected via connect(AF_UNSPEC)? Seems like we don't
hook sk_prot->disconnect anywhere.
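
For the last one, a rough sketch of what such a test could look like,
purely hypothetical and untested, reusing the helpers from the selftest:

	/* disconnect a spliced UDP socket and make sure redirects stop
	 * and the psock is torn down cleanly
	 */
	struct sockaddr sa = { .sa_family = AF_UNSPEC };

	err = connect(c1, &sa, sizeof(sa));
	if (err)
		FAIL_ERRNO("%s: connect(AF_UNSPEC)", log_prefix);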

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  2021-03-02 16:23   ` Lorenz Bauer
@ 2021-03-02 17:59     ` Cong Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-02 17:59 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, Mar 2, 2021 at 8:23 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Tue, 2 Mar 2021 at 02:38, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 54f24b1d4f65..717c543aaec3 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1831,6 +1831,7 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> >
> >         return copied;
> >  }
> > +EXPORT_SYMBOL(udp_read_sock);
>
> Should this be in the previous commit?

No, exporting this symbol is unnecessary until a module starts to
use it, which is the IPv6 module in this patch. So it is perfectly fine
to export it here.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap
  2021-03-02 16:31   ` Lorenz Bauer
@ 2021-03-02 18:05     ` Cong Wang
  2021-03-03 10:20       ` Lorenz Bauer
  0 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-02 18:05 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, Mar 2, 2021 at 8:32 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Tue, 2 Mar 2021 at 02:38, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Add a test case to ensure redirection between two UDP sockets work.
>
> I basically don't understand how splicing works, but watching from the
> sidelines makes me think it'd be good to have more thorough tests.
> tools/testing/selftests/bpf/test_sockmap.c has quite elaborate tests
> for the TCP part, it'd be nice to get similar tests going for UDP. For

Sure, TCP supports more than just BPF_SK_SKB_VERDICT, hence
why it must have more tests than UDP. ;)

> example:
>
> * sendfile?
> * sendmmsg

Does UDP support any of these? I don't think so, at least not in my
patchset.

> * Something Jakub mentioned: what happens when a connected, spliced
> socket is disconnected via connect(AF_UNSPEC)? Seems like we don't
> hook sk_prot->disconnect anywhere.

But we hook ->unhash(), right?
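
IIRC connect(AF_UNSPEC) on a UDP socket ends up in __udp_disconnect(),
which roughly does (paraphrasing net/ipv4/udp.c from memory, not the
exact code):

	sk->sk_state = TCP_CLOSE;
	...
	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
		/* for a sockmap'ed socket this is the sockmap unhash handler */
		sk->sk_prot->unhash(sk);
		inet->inet_sport = 0;
	}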

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-02 16:22   ` Lorenz Bauer
@ 2021-03-02 18:23     ` Cong Wang
  2021-03-03  9:35       ` Lorenz Bauer
  2021-03-04 23:52       ` Cong Wang
  0 siblings, 2 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-02 18:23 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> ...
> >  static inline void sk_psock_restore_proto(struct sock *sk,
> >                                           struct sk_psock *psock)
> >  {
> >         sk->sk_prot->unhash = psock->saved_unhash;
>
> Not related to your patch set, but why do an extra restore of
> sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
> / udp_bpf protos, so overwriting that seems wrong?

Good catch. It seems you are right, but I need a double check. And
yes, it is completely unrelated to my patch, as the current code has
the same problem.

>
> > -       if (inet_csk_has_ulp(sk)) {
> > -               tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
> > -       } else {
> > -               sk->sk_write_space = psock->saved_write_space;
> > -               /* Pairs with lockless read in sk_clone_lock() */
> > -               WRITE_ONCE(sk->sk_prot, psock->sk_proto);
> > -       }
> > +       if (psock->saved_update_proto)
> > +               psock->saved_update_proto(sk, true);
> >  }
> >
> >  static inline void sk_psock_set_state(struct sk_psock *psock,
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 636810ddcd9b..0e8577c917e8 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1184,6 +1184,9 @@ struct proto {
> >         void                    (*unhash)(struct sock *sk);
> >         void                    (*rehash)(struct sock *sk);
> >         int                     (*get_port)(struct sock *sk, unsigned short snum);
> > +#ifdef CONFIG_BPF_SYSCALL
> > +       int                     (*update_proto)(struct sock *sk, bool restore);
>
> Kind of a nit, but this name suggests that the callback is a lot more
> generic than it really is. The only thing you can use it for is to
> prep the socket to be sockmap ready since we hardwire sockmap_unhash,
> etc. It's also not at all clear that this only works if sk has an
> sk_psock associated with it. Calling it without one would crash the
> kernel since the update_proto functions don't check for !sk_psock.
>
> Might as well call it install_sockmap_hooks or something and have a
> valid sk_psock be passed in to the callback. Additionally, I'd prefer

For the name, sure, I am always open to better names. Not sure if
'install_sockmap_hooks' is a good name; I also want to express that we
are overriding sk_prot. How about 'psock_update_sk_prot'?


> if the function returned a struct proto * like it does at the moment.
> That way we keep sk->sk_prot manipulation confined to the sockmap code
> and don't have to copy paste it into every proto.

Well, TCP seems too special to do this, as it could call tcp_update_ulp()
to update the proto.

>
> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > index 3bddd9dd2da2..13d2af5bb81c 100644
> > --- a/net/core/sock_map.c
> > +++ b/net/core/sock_map.c
> > @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
> >
> >  static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
> >  {
> > -       struct proto *prot;
> > -
> > -       switch (sk->sk_type) {
> > -       case SOCK_STREAM:
> > -               prot = tcp_bpf_get_proto(sk, psock);
> > -               break;
> > -
> > -       case SOCK_DGRAM:
> > -               prot = udp_bpf_get_proto(sk, psock);
> > -               break;
> > -
> > -       default:
> > +       if (!sk->sk_prot->update_proto)
> >                 return -EINVAL;
> > -       }
> > -
> > -       if (IS_ERR(prot))
> > -               return PTR_ERR(prot);
> > -
> > -       sk_psock_update_proto(sk, psock, prot);
> > -       return 0;
> > +       psock->saved_update_proto = sk->sk_prot->update_proto;
> > +       return sk->sk_prot->update_proto(sk, false);
>
> I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've
> not been diligent about this so far, but I think it makes sense to be
> careful in new code.

Hmm, there are many places not using READ_ONCE/WRITE_ONCE,
for a quick example:

void sock_map_unhash(struct sock *sk)
{
        void (*saved_unhash)(struct sock *sk);
        struct sk_psock *psock;

        rcu_read_lock();
        psock = sk_psock(sk);
        if (unlikely(!psock)) {
                rcu_read_unlock();
                if (sk->sk_prot->unhash)
                        sk->sk_prot->unhash(sk);
                return;
        }

        saved_unhash = psock->saved_unhash;
        sock_map_remove_links(sk, psock);
        rcu_read_unlock();
        saved_unhash(sk);
}

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 4/9] udp: implement ->read_sock() for sockmap
  2021-03-02  2:37 ` [Patch bpf-next v2 4/9] udp: implement ->read_sock() for sockmap Cong Wang
@ 2021-03-03  6:26   ` Yonghong Song
  0 siblings, 0 replies; 30+ messages in thread
From: Yonghong Song @ 2021-03-03  6:26 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer



On 3/1/21 6:37 PM, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

Even a simple commit message here would be preferred to an
empty commit message.

> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>   include/net/udp.h  |  2 ++
>   net/ipv4/af_inet.c |  1 +
>   net/ipv4/udp.c     | 34 ++++++++++++++++++++++++++++++++++
>   3 files changed, 37 insertions(+)
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 5264ba1439f9..44a94cfc63b5 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -330,6 +330,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
>   			       struct sk_buff *skb);
>   struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
>   				 __be16 sport, __be16 dport);
> +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> +		  sk_read_actor_t recv_actor);
>   
>   /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
>    * possibly multiple cache miss on dequeue()
[...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP
  2021-03-02  2:37 ` [Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP Cong Wang
@ 2021-03-03  6:37   ` Yonghong Song
  2021-03-03 18:02     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Yonghong Song @ 2021-03-03  6:37 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer



On 3/1/21 6:37 PM, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Now UDP supports sockmap and redirection, we can safely update
> the sock type checks for it accordingly.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>   net/core/sock_map.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 13d2af5bb81c..f7eee4b7b994 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
>   
>   static bool sock_map_redirect_allowed(const struct sock *sk)
>   {
> -	return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
> +	if (sk_is_tcp(sk))
> +		return sk->sk_state != TCP_LISTEN;
> +	else
> +		return sk->sk_state == TCP_ESTABLISHED;

Not a networking expert, so a dumb question: here we test
whether sk_is_tcp(sk) or not, and if not we compare
sk->sk_state == TCP_ESTABLISHED. Could this always be
false? Most likely I missed something; some comments
here would be good.

>   }
>   
>   static bool sock_map_sk_is_suitable(const struct sock *sk)
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-02 18:23     ` Cong Wang
@ 2021-03-03  9:35       ` Lorenz Bauer
  2021-03-03 18:20         ` Cong Wang
  2021-03-04 23:52       ` Cong Wang
  1 sibling, 1 reply; 30+ messages in thread
From: Lorenz Bauer @ 2021-03-03  9:35 UTC (permalink / raw)
  To: Cong Wang
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, 2 Mar 2021 at 18:23, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > if the function returned a struct proto * like it does at the moment.
> > That way we keep sk->sk_prot manipulation confined to the sockmap code
> > and don't have to copy paste it into every proto.
>
> Well, TCP seems too special to do this, as it could call tcp_update_ulp()
> to update the proto.

I had a quick look, tcp_bpf_update_proto is the only caller of tcp_update_ulp,
which in turn is the only caller of icsk_ulp_ops->update, which in turn is only
implemented as tls_update in tls_main.c. Turns out that tls_update
has another one of these calls:

} else {
    /* Pairs with lockless read in sk_clone_lock(). */
    WRITE_ONCE(sk->sk_prot, p);
    sk->sk_write_space = write_space;
}

Maybe it looks familiar? :o) I think it would be a worthwhile change.

>
> >
> > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > > index 3bddd9dd2da2..13d2af5bb81c 100644
> > > --- a/net/core/sock_map.c
> > > +++ b/net/core/sock_map.c
> > > @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
> > >
> > >  static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
> > >  {
> > > -       struct proto *prot;
> > > -
> > > -       switch (sk->sk_type) {
> > > -       case SOCK_STREAM:
> > > -               prot = tcp_bpf_get_proto(sk, psock);
> > > -               break;
> > > -
> > > -       case SOCK_DGRAM:
> > > -               prot = udp_bpf_get_proto(sk, psock);
> > > -               break;
> > > -
> > > -       default:
> > > +       if (!sk->sk_prot->update_proto)
> > >                 return -EINVAL;
> > > -       }
> > > -
> > > -       if (IS_ERR(prot))
> > > -               return PTR_ERR(prot);
> > > -
> > > -       sk_psock_update_proto(sk, psock, prot);
> > > -       return 0;
> > > +       psock->saved_update_proto = sk->sk_prot->update_proto;
> > > +       return sk->sk_prot->update_proto(sk, false);
> >
> > I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've
> > not been diligent about this so far, but I think it makes sense to be
> > careful in new code.
>
> Hmm, there are many places not using READ_ONCE/WRITE_ONCE,
> for a quick example:

I know! I'll defer to John and Jakub.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap
  2021-03-02 18:05     ` Cong Wang
@ 2021-03-03 10:20       ` Lorenz Bauer
  0 siblings, 0 replies; 30+ messages in thread
From: Lorenz Bauer @ 2021-03-03 10:20 UTC (permalink / raw)
  To: Cong Wang
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, 2 Mar 2021 at 18:05, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Mar 2, 2021 at 8:32 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > On Tue, 2 Mar 2021 at 02:38, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > Add a test case to ensure redirection between two UDP sockets work.
> >
> > I basically don't understand how splicing works, but watching from the
> > sidelines makes me think it'd be good to have more thorough tests.
> > tools/testing/selftests/bpf/test_sockmap.c has quite elaborate tests
> > for the TCP part, it'd be nice to get similar tests going for UDP. For
>
> Sure, TCP supports more than just BPF_SK_SKB_VERDICT, hence
> why it must have more tests than UDP. ;)
>
> > example:
> >
> > * sendfile?
> > * sendmmsg
>
> Does UDP support any of these? I don't think so, at least not in my
> patchset.

I have no idea, thanks for checking :)

>
> > * Something Jakub mentioned: what happens when a connected, spliced
> > socket is disconnected via connect(AF_UNSPEC)? Seems like we don't
> > hook sk_prot->disconnect anywhere.
>
> But we hook ->unhash(), right?

I wasn't aware that ->disconnect calls unhash, thanks!

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP
  2021-03-03  6:37   ` Yonghong Song
@ 2021-03-03 18:02     ` Cong Wang
  2021-03-03 18:50       ` Yonghong Song
  0 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-03 18:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, John Fastabend,
	Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

On Tue, Mar 2, 2021 at 10:37 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/1/21 6:37 PM, Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Now UDP supports sockmap and redirection, we can safely update
> > the sock type checks for it accordingly.
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >   net/core/sock_map.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > index 13d2af5bb81c..f7eee4b7b994 100644
> > --- a/net/core/sock_map.c
> > +++ b/net/core/sock_map.c
> > @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
> >
> >   static bool sock_map_redirect_allowed(const struct sock *sk)
> >   {
> > -     return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
> > +     if (sk_is_tcp(sk))
> > +             return sk->sk_state != TCP_LISTEN;
> > +     else
> > +             return sk->sk_state == TCP_ESTABLISHED;
>
> Not a networking expert, so a dumb question: here we test
> whether sk_is_tcp(sk) or not, and if not we compare
> sk->sk_state == TCP_ESTABLISHED. Could this always be
> false? Most likely I missed something; some comments
> here would be good.

No, dgram sockets also use TCP_ESTABLISHED as a valid
state. I know its name looks confusing, but it is already widely
used in networking:

net/appletalk/ddp.c:    sk->sk_state = TCP_ESTABLISHED;
net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
net/ax25/af_ax25.c:     sk->sk_state    = TCP_ESTABLISHED;
net/ax25/af_ax25.c:             case TCP_ESTABLISHED: /* connection
established */
net/ax25/af_ax25.c:     if (sk->sk_state == TCP_ESTABLISHED &&
sk->sk_type == SOCK_SEQPACKET) {
net/ax25/af_ax25.c:             sk->sk_state   = TCP_ESTABLISHED;
net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED && (flags
& O_NONBLOCK)) {
net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:     if (sk->sk_type == SOCK_SEQPACKET &&
sk->sk_state != TCP_ESTABLISHED) {
net/ax25/ax25_ds_in.c:                  ax25->sk->sk_state = TCP_ESTABLISHED;
net/ax25/ax25_in.c:             make->sk_state = TCP_ESTABLISHED;
net/ax25/ax25_std_in.c:                         ax25->sk->sk_state =
TCP_ESTABLISHED;
net/caif/caif_socket.c: CAIF_CONNECTED          = TCP_ESTABLISHED,
net/ceph/messenger.c:   case TCP_ESTABLISHED:
net/ceph/messenger.c:           dout("%s TCP_ESTABLISHED\n", __func__);
net/core/datagram.c:        !(sk->sk_state == TCP_ESTABLISHED ||
sk->sk_state == TCP_LISTEN))
...

Hence, I believe it is okay to use it as it is, otherwise we would need
to comment on every use of it. ;)
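
For UDP it is simply connect() that moves the socket into that state.
A minimal user-space sketch (addr setup omitted):

	/* a connected UDP socket sits in TCP_ESTABLISHED, which is what
	 * sock_map_redirect_allowed() checks for in the hunk above
	 */
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	connect(fd, (struct sockaddr *)&addr, sizeof(addr));
	/* kernel side: __ip4_datagram_connect() sets sk->sk_state = TCP_ESTABLISHED */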

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-03  9:35       ` Lorenz Bauer
@ 2021-03-03 18:20         ` Cong Wang
  2021-03-04  9:30           ` Lorenz Bauer
  0 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-03 18:20 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Wed, Mar 3, 2021 at 1:35 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Tue, 2 Mar 2021 at 18:23, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > > if the function returned a struct proto * like it does at the moment.
> > > That way we keep sk->sk_prot manipulation confined to the sockmap code
> > > and don't have to copy paste it into every proto.
> >
> > Well, TCP seems too special to do this, as it could call tcp_update_ulp()
> > to update the proto.
>
> I had a quick look, tcp_bpf_update_proto is the only caller of tcp_update_ulp,
> which in turn is the only caller of icsk_ulp_ops->update, which in turn is only
> implemented as tls_update in tls_main.c. Turns out that tls_update
> has another one of these calls:
>
> } else {
>     /* Pairs with lockless read in sk_clone_lock(). */
>     WRITE_ONCE(sk->sk_prot, p);
>     sk->sk_write_space = write_space;
> }
>
> Maybe it looks familiar? :o) I think it would be a worthwhile change.

Yeah, I am not surprised we can change tcp_update_ulp() too, but
why should I bother kTLS when I do not have to? What you suggest
could at most save us a bit of code size, not a big gain. So, I'd keep
its return value as it is, unless you see any other benefits.

BTW, I will rename it to 'psock_update_sk_prot', please let me know
if you have any better names.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP
  2021-03-03 18:02     ` Cong Wang
@ 2021-03-03 18:50       ` Yonghong Song
  0 siblings, 0 replies; 30+ messages in thread
From: Yonghong Song @ 2021-03-03 18:50 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, John Fastabend,
	Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer



On 3/3/21 10:02 AM, Cong Wang wrote:
> On Tue, Mar 2, 2021 at 10:37 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 3/1/21 6:37 PM, Cong Wang wrote:
>>> From: Cong Wang <cong.wang@bytedance.com>
>>>
>>> Now UDP supports sockmap and redirection, we can safely update
>>> the sock type checks for it accordingly.
>>>
>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>>> Cc: Jakub Sitnicki <jakub@cloudflare.com>
>>> Cc: Lorenz Bauer <lmb@cloudflare.com>
>>> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>>> ---
>>>    net/core/sock_map.c | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>> index 13d2af5bb81c..f7eee4b7b994 100644
>>> --- a/net/core/sock_map.c
>>> +++ b/net/core/sock_map.c
>>> @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
>>>
>>>    static bool sock_map_redirect_allowed(const struct sock *sk)
>>>    {
>>> -     return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
>>> +     if (sk_is_tcp(sk))
>>> +             return sk->sk_state != TCP_LISTEN;
>>> +     else
>>> +             return sk->sk_state == TCP_ESTABLISHED;
>>
>> Not a networking expert, so a dumb question: here we test
>> whether sk_is_tcp(sk) or not, and if not we compare
>> sk->sk_state == TCP_ESTABLISHED. Could this always be
>> false? Most likely I missed something; some comments
>> here would be good.
> 
> No, dgram sockets also use TCP_ESTABLISHED as a valid
> state. I know its name looks confusing, but it is already widely
> used in networking:
> 
> net/appletalk/ddp.c:    sk->sk_state = TCP_ESTABLISHED;
> net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
> net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
> net/ax25/af_ax25.c:     sk->sk_state    = TCP_ESTABLISHED;
> net/ax25/af_ax25.c:             case TCP_ESTABLISHED: /* connection
> established */
> net/ax25/af_ax25.c:     if (sk->sk_state == TCP_ESTABLISHED &&
> sk->sk_type == SOCK_SEQPACKET) {
> net/ax25/af_ax25.c:             sk->sk_state   = TCP_ESTABLISHED;
> net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED && (flags
> & O_NONBLOCK)) {
> net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:     if (sk->sk_type == SOCK_SEQPACKET &&
> sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/ax25_ds_in.c:                  ax25->sk->sk_state = TCP_ESTABLISHED;
> net/ax25/ax25_in.c:             make->sk_state = TCP_ESTABLISHED;
> net/ax25/ax25_std_in.c:                         ax25->sk->sk_state =
> TCP_ESTABLISHED;
> net/caif/caif_socket.c: CAIF_CONNECTED          = TCP_ESTABLISHED,
> net/ceph/messenger.c:   case TCP_ESTABLISHED:
> net/ceph/messenger.c:           dout("%s TCP_ESTABLISHED\n", __func__);
> net/core/datagram.c:        !(sk->sk_state == TCP_ESTABLISHED ||
> sk->sk_state == TCP_LISTEN))
> ...
> 
> Hence, I believe it is okay to use it as it is, otherwise we would need
> to comment on every use of it. ;)

That is okay. Thanks for the explanation!

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-03 18:20         ` Cong Wang
@ 2021-03-04  9:30           ` Lorenz Bauer
  0 siblings, 0 replies; 30+ messages in thread
From: Lorenz Bauer @ 2021-03-04  9:30 UTC (permalink / raw)
  To: Cong Wang
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Wed, 3 Mar 2021 at 18:21, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Yeah, I am not surprised we can change tcp_update_ulp() too, but
> why should I bother kTLS when I do not have to? What you suggest
> could at most save us a bit of code size, not a big gain. So, I'd keep
> its return value as it is, unless you see any other benefits.

I think the end result is code that is easier to understand and
therefore maintain. Keep it as it is if you prefer.

> BTW, I will rename it to 'psock_update_sk_prot', please let me know
> if you have any better names.

SGTM.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-02 18:23     ` Cong Wang
  2021-03-03  9:35       ` Lorenz Bauer
@ 2021-03-04 23:52       ` Cong Wang
  2021-03-06  0:27         ` John Fastabend
  1 sibling, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-04 23:52 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > ...
> > >  static inline void sk_psock_restore_proto(struct sock *sk,
> > >                                           struct sk_psock *psock)
> > >  {
> > >         sk->sk_prot->unhash = psock->saved_unhash;
> >
> > Not related to your patch set, but why do an extra restore of
> > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
> > / udp_bpf protos, so overwriting that seems wrong?
>
> Good catch. It seems you are right, but I need a double check. And
> yes, it is completely unrelated to my patch, as the current code has
> the same problem.

Looking at this again. I noticed

commit 4da6a196f93b1af7612340e8c1ad8ce71e18f955
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Sat Jan 11 06:11:59 2020 +0000

    bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop

intentionally fixed a bug in kTLS with overwriting this ->unhash.

I agree with you that it should not be updated for sockmap case,
however I don't know what to do with kTLS case, it seems the bug the
above commit fixed still exists if we just revert it.

Anyway, this should be targeted for -bpf as a bug fix, so it does not
belong to this patchset.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-04 23:52       ` Cong Wang
@ 2021-03-06  0:27         ` John Fastabend
  2021-03-06  0:57           ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: John Fastabend @ 2021-03-06  0:27 UTC (permalink / raw)
  To: Cong Wang, Lorenz Bauer
  Cc: Networking, bpf, duanxiongchun, Dongdong Wang, Jiang Wang,
	Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki

Cong Wang wrote:
> On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > >
> > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > ...
> > > >  static inline void sk_psock_restore_proto(struct sock *sk,
> > > >                                           struct sk_psock *psock)
> > > >  {
> > > >         sk->sk_prot->unhash = psock->saved_unhash;
> > >
> > > Not related to your patch set, but why do an extra restore of
> > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
> > > / udp_bpf protos, so overwriting that seems wrong?

"extra"? restore_proto should only be called when the psock ref count
is zero and we need to transition back to the original sock's proto
handlers. To trigger this we can simply delete a sock from the map.
In the case where we are deleting the psock, overwriting the tcp_bpf
protos is exactly what we want?

> >
> > Good catch. It seems you are right, but I need a double check. And
> > yes, it is completely unrelated to my patch, as the current code has
> > the same problem.
> 
> Looking at this again. I noticed
> 
> commit 4da6a196f93b1af7612340e8c1ad8ce71e18f955
> Author: John Fastabend <john.fastabend@gmail.com>
> Date:   Sat Jan 11 06:11:59 2020 +0000
> 
>     bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop
> 
> intentionally fixed a bug in kTLS with overwriting this ->unhash.
> 
> I agree with you that it should not be updated for sockmap case,
> however I don't know what to do with kTLS case, it seems the bug the
> above commit fixed still exists if we just revert it.
> 
> Anyway, this should be targeted for -bpf as a bug fix, so it does not
> belong to this patchset.
> 
> Thanks.

Hi,

I'm missing the error case here. The restore logic happens when the refcnt
hits 0 on the psock, indicating its time to garbage collect the psock. 

 sk_psock_put
   if (refcount_dec_and_test(&psock->refcnt))
    sk_psock_drop(sk, psock);
      sk_psock_restore_proto(sk, psock)
         sk->sk_prot->unhash = psock->saved_unhash

When sockets are initialized via sk_psock_init() we populate the unhash field:

 psock->saved_unhash = prot->unhash;

So we need to unwind this otherwise a future unhash() call would not call
the original protos unhash handler.

Care to give me some more context on what the bug is?

Thanks,
John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-06  0:27         ` John Fastabend
@ 2021-03-06  0:57           ` Cong Wang
  2021-03-06  1:55             ` John Fastabend
  0 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-06  0:57 UTC (permalink / raw)
  To: John Fastabend
  Cc: Lorenz Bauer, Networking, bpf, duanxiongchun, Dongdong Wang,
	Jiang Wang, Cong Wang, Daniel Borkmann, Jakub Sitnicki

On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > >
> > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > ...
> > > > >  static inline void sk_psock_restore_proto(struct sock *sk,
> > > > >                                           struct sk_psock *psock)
> > > > >  {
> > > > >         sk->sk_prot->unhash = psock->saved_unhash;
> > > >
> > > > Not related to your patch set, but why do an extra restore of
> > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
> > > > / udp_bpf protos, so overwriting that seems wrong?
>
> "extra"? restore_proto should only be called when the psock ref count
> is zero and we need to transition back to the original sock's proto
> handlers. To trigger this we can simply delete a sock from the map.
> In the case where we are deleting the psock, overwriting the tcp_bpf
> protos is exactly what we want?

Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting
tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not.
Because once you overwrite it, the next time you use it to replace
sk->sk_prot, it would be a different one rather than sock_map_unhash():

// tcp_bpf_prots->unhash == sock_map_unhash
sk_psock_restore_proto();
// Now  tcp_bpf_prots->unhash is inet_unhash
...
sk_psock_update_proto();
// sk->sk_proto is now tcp_bpf_prots again,
// so its ->unhash now is inet_unhash
// but it should be sock_map_unhash here

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-06  0:57           ` Cong Wang
@ 2021-03-06  1:55             ` John Fastabend
  2021-03-09 17:53               ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: John Fastabend @ 2021-03-06  1:55 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Lorenz Bauer, Networking, bpf, duanxiongchun, Dongdong Wang,
	Jiang Wang, Cong Wang, Daniel Borkmann, Jakub Sitnicki

Cong Wang wrote:
> On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > > >
> > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >
> > > > > ...
> > > > > >  static inline void sk_psock_restore_proto(struct sock *sk,
> > > > > >                                           struct sk_psock *psock)
> > > > > >  {
> > > > > >         sk->sk_prot->unhash = psock->saved_unhash;
> > > > >
> > > > > Not related to your patch set, but why do an extra restore of
> > > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
> > > > > / udp_bpf protos, so overwriting that seems wrong?
> >
> > "extra"? restore_proto should only be called when the psock ref count
> > is zero and we need to transition back to the original sock's proto
> > handlers. To trigger this we can simply delete a sock from the map.
> > In the case where we are deleting the psock, overwriting the tcp_bpf
> > protos is exactly what we want?
> 
> Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting
> tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not.
> Because once you overwrite it, the next time you use it to replace
> sk->sk_prot, it would be a different one rather than sock_map_unhash():
> 
> // tcp_bpf_prots->unhash == sock_map_unhash
> sk_psock_restore_proto();
> // Now  tcp_bpf_prots->unhash is inet_unhash
> ...
> sk_psock_update_proto();
> // sk->sk_proto is now tcp_bpf_prots again,
> // so its ->unhash now is inet_unhash
> // but it should be sock_map_unhash here

Right, we can fix this on the TLS side. I'll push a fix shortly.

> 
> Thanks.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-06  1:55             ` John Fastabend
@ 2021-03-09 17:53               ` Cong Wang
  2021-03-10  6:33                 ` John Fastabend
  0 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-09 17:53 UTC (permalink / raw)
  To: John Fastabend
  Cc: Lorenz Bauer, Networking, bpf, duanxiongchun, Dongdong Wang,
	Jiang Wang, Cong Wang, Daniel Borkmann, Jakub Sitnicki

On Fri, Mar 5, 2021 at 5:55 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Cong Wang wrote:
> > > > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >
> > > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > > > >
> > > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > > >
> > > > > > ...
> > > > > > >  static inline void sk_psock_restore_proto(struct sock *sk,
> > > > > > >                                           struct sk_psock *psock)
> > > > > > >  {
> > > > > > >         sk->sk_prot->unhash = psock->saved_unhash;
> > > > > >
> > > > > > Not related to your patch set, but why do an extra restore of
> > > > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
> > > > > > / udp_bpf protos, so overwriting that seems wrong?
> > >
> > > "extra"? restore_proto should only be called when the psock ref count
> > > is zero and we need to transition back to the original sock's proto
> > > handlers. To trigger this we can simply delete a sock from the map.
> > > In the case where we are deleting the psock, overwriting the tcp_bpf
> > > protos is exactly what we want?
> >
> > Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting
> > tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not.
> > Because once you overwrite it, the next time you use it to replace
> > sk->sk_prot, it would be a different one rather than sock_map_unhash():
> >
> > // tcp_bpf_prots->unhash == sock_map_unhash
> > sk_psock_restore_proto();
> > // Now  tcp_bpf_prots->unhash is inet_unhash
> > ...
> > sk_psock_update_proto();
> > // sk->sk_proto is now tcp_bpf_prots again,
> > // so its ->unhash now is inet_unhash
> > // but it should be sock_map_unhash here
>
> Right, we can fix this on the TLS side. I'll push a fix shortly.

Are you still working on this? If kTLS still needs it, then we can
have something like this:

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 8edbbf5f2f93..5eb617df7f48 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -349,8 +349,8 @@ static inline void sk_psock_update_proto(struct sock *sk,
 static inline void sk_psock_restore_proto(struct sock *sk,
                                          struct sk_psock *psock)
 {
-       sk->sk_prot->unhash = psock->saved_unhash;
        if (inet_csk_has_ulp(sk)) {
+               sk->sk_prot->unhash = psock->saved_unhash;
                tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
        } else {
                sk->sk_write_space = psock->saved_write_space;


Thanks.

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()
  2021-03-09 17:53               ` Cong Wang
@ 2021-03-10  6:33                 ` John Fastabend
  0 siblings, 0 replies; 30+ messages in thread
From: John Fastabend @ 2021-03-10  6:33 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Lorenz Bauer, Networking, bpf, duanxiongchun, Dongdong Wang,
	Jiang Wang, Cong Wang, Daniel Borkmann, Jakub Sitnicki

Cong Wang wrote:
> On Fri, Mar 5, 2021 at 5:55 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >

[...]

> > > // tcp_bpf_prots->unhash == sock_map_unhash
> > > sk_psock_restore_proto();
> > > // Now  tcp_bpf_prots->unhash is inet_unhash
> > > ...
> > > sk_psock_update_proto();
> > > // sk->sk_proto is now tcp_bpf_prots again,
> > > // so its ->unhash now is inet_unhash
> > > // but it should be sock_map_unhash here
> >
> > Right, we can fix this on the TLS side. I'll push a fix shortly.
> 
> Are you still working on this? If kTLS still needs it, then we can
> have something like this:

Testing a fix now, I will flush it out tomorrow. The below is not
really correct either; it just moves the issue so that it only impacts
TLS.

> 
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 8edbbf5f2f93..5eb617df7f48 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -349,8 +349,8 @@ static inline void sk_psock_update_proto(struct sock *sk,
>  static inline void sk_psock_restore_proto(struct sock *sk,
>                                           struct sk_psock *psock)
>  {
> -       sk->sk_prot->unhash = psock->saved_unhash;
>         if (inet_csk_has_ulp(sk)) {
> +               sk->sk_prot->unhash = psock->saved_unhash;
>                 tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
>         } else {
>                 sk->sk_write_space = psock->saved_write_space;
> 
> 
> Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread
