* [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP
@ 2021-03-28 20:20 Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 01/13] skmsg: lock ingress_skb when purging Cong Wang
                   ` (13 more replies)
  0 siblings, 14 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev; +Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services on every host connected to a local daemon
via AF_UNIX dgram sockets. After these services were moved into VMs, we
had to add a proxy to forward this traffic from the VM to the host,
because rewriting thousands of services is not practical. The proxy uses
an AF_UNIX socket connected to the services and a UDP socket connected
to the host. This is inefficient because data is copied between kernel
space and user space twice, and we can not use splice(), which only
supports TCP. Therefore, we want to use sockmap to do the splicing
without going to user space at all (after the initial setup).

Currently sockmap only fully supports TCP; UDP is partially supported,
as UDP sockets can only be added to a sockmap but not used for
redirection. This patchset, the second part of the original large
patchset, extends sockmap with: 1) cross-protocol support via
BPF_SK_SKB_VERDICT; 2) full UDP support.
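
For illustration, once both pieces are in place, the splicing itself is
just a verdict program attached to a sockmap, along the lines of this
minimal sketch (map layout, key and program name are illustrative only,
not part of this series):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_SOCKMAP);
          __uint(max_entries, 2);
          __type(key, __u32);
          __type(value, __u64);
  } sock_map SEC(".maps");

  SEC("sk_skb")
  int verdict_prog(struct __sk_buff *skb)
  {
          /* Splice every skb to the peer socket stored at key 0. */
          return bpf_sk_redirect_map(skb, &sock_map, 0, BPF_F_INGRESS);
  }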

At a high level, each protocol must implement ->read_sock() to support
sockmap redirection, and a new op ->psock_update_sk_prot() is
introduced so each protocol can update its own struct proto; it is
required as well. A BPF ->recvmsg() is also needed to replace the
original ->recvmsg(), in order to retrieve skmsg. To make life easier,
we get rid of lock_sock() in sk_psock_handle_skb(); otherwise we would
have to implement ->sendmsg_locked() on top of ->sendmsg() for each
protocol, which is ugly.

Please see each patch for more details.

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
This patchset is also available here:
https://github.com/congwang/linux/tree/sockmap2

---
v7: use work_mutex to protect psock->work
    return err in udp_read_sock()
    add patch 6/13
    clean up test case

v6: get rid of sk_psock_zap_ingress()
    add rcu work patch

v5: use INDIRECT_CALL_2() for function pointers
    use ingress_lock to fix a race condition found by Jakub
    rename two helper functions

v4: get rid of lock_sock() in sk_psock_handle_skb()
    get rid of udp_sendmsg_locked()
    remove an empty line
    update cover letter

v3: export tcp/udp_update_proto()
    rename sk->sk_prot->psock_update_sk_prot()
    improve changelogs

v2: separate from the original large patchset
    rebase to the latest bpf-next
    split UDP test case
    move inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()

Cong Wang (13):
  skmsg: lock ingress_skb when purging
  skmsg: introduce a spinlock to protect ingress_msg
  net: introduce skb_send_sock() for sock_map
  skmsg: avoid lock_sock() in sk_psock_backlog()
  skmsg: use rcu work for destroying psock
  skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg()
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk->sk_prot->psock_update_sk_prot()
  udp: implement ->read_sock() for sockmap
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  sock_map: update sock type checks for UDP
  selftests/bpf: add a test case for udp sockmap

 include/linux/skbuff.h                        |   1 +
 include/linux/skmsg.h                         |  77 ++++++--
 include/net/sock.h                            |   3 +
 include/net/tcp.h                             |   3 +-
 include/net/udp.h                             |   3 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/syscall.c                          |   1 +
 net/core/skbuff.c                             |  55 +++++-
 net/core/skmsg.c                              | 177 ++++++++++++++----
 net/core/sock_map.c                           |  53 +++---
 net/ipv4/af_inet.c                            |   1 +
 net/ipv4/tcp_bpf.c                            | 130 +++----------
 net/ipv4/tcp_ipv4.c                           |   3 +
 net/ipv4/udp.c                                |  38 ++++
 net/ipv4/udp_bpf.c                            |  79 +++++++-
 net/ipv6/af_inet6.c                           |   1 +
 net/ipv6/tcp_ipv6.c                           |   3 +
 net/ipv6/udp.c                                |   3 +
 net/tls/tls_sw.c                              |   4 +-
 tools/bpf/bpftool/common.c                    |   1 +
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   1 +
 .../selftests/bpf/prog_tests/sockmap_listen.c | 136 ++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 24 files changed, 601 insertions(+), 196 deletions(-)

-- 
2.25.1



* [Patch bpf-next v7 01/13] skmsg: lock ingress_skb when purging
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 02/13] skmsg: introduce a spinlock to protect ingress_msg Cong Wang
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki

From: Cong Wang <cong.wang@bytedance.com>

Currently we purge the ingress_skb queue only when the psock
refcnt goes down to 0, so locking the queue is not necessary.
But this path is about to be called during ->close() as well,
so we have to lock the queue here.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/core/skmsg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 07f54015238a..bebf84ed4e30 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -634,7 +634,7 @@ static void sk_psock_zap_ingress(struct sk_psock *psock)
 {
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue(&psock->ingress_skb)) != NULL) {
+	while ((skb = skb_dequeue(&psock->ingress_skb)) != NULL) {
 		skb_bpf_redirect_clear(skb);
 		kfree_skb(skb);
 	}
-- 
2.25.1



* [Patch bpf-next v7 02/13] skmsg: introduce a spinlock to protect ingress_msg
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 01/13] skmsg: lock ingress_skb when purging Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-29 19:11   ` John Fastabend
  2021-03-28 20:20 ` [Patch bpf-next v7 03/13] net: introduce skb_send_sock() for sock_map Cong Wang
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki

From: Cong Wang <cong.wang@bytedance.com>

Currently we rely on lock_sock to protect ingress_msg; it is
too big a lock for this. We can actually just use a spinlock
to protect this list, as we do for other skb queues.

__tcp_bpf_recvmsg() is still special because of peeking: it
keeps a pointer into the list across iterations, so it still
has to use lock_sock.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h | 46 +++++++++++++++++++++++++++++++++++++++++++
 net/core/skmsg.c      |  3 +++
 net/ipv4/tcp_bpf.c    | 18 ++++++-----------
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 6c09d94be2e9..f2d45a73b2b2 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -89,6 +89,7 @@ struct sk_psock {
 #endif
 	struct sk_buff_head		ingress_skb;
 	struct list_head		ingress_msg;
+	spinlock_t			ingress_lock;
 	unsigned long			state;
 	struct list_head		link;
 	spinlock_t			link_lock;
@@ -284,7 +285,45 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
 static inline void sk_psock_queue_msg(struct sk_psock *psock,
 				      struct sk_msg *msg)
 {
+	spin_lock_bh(&psock->ingress_lock);
 	list_add_tail(&msg->list, &psock->ingress_msg);
+	spin_unlock_bh(&psock->ingress_lock);
+}
+
+static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
+{
+	struct sk_msg *msg;
+
+	spin_lock_bh(&psock->ingress_lock);
+	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
+	if (msg)
+		list_del(&msg->list);
+	spin_unlock_bh(&psock->ingress_lock);
+	return msg;
+}
+
+static inline struct sk_msg *sk_psock_peek_msg(struct sk_psock *psock)
+{
+	struct sk_msg *msg;
+
+	spin_lock_bh(&psock->ingress_lock);
+	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
+	spin_unlock_bh(&psock->ingress_lock);
+	return msg;
+}
+
+static inline struct sk_msg *sk_psock_next_msg(struct sk_psock *psock,
+					       struct sk_msg *msg)
+{
+	struct sk_msg *ret;
+
+	spin_lock_bh(&psock->ingress_lock);
+	if (list_is_last(&msg->list, &psock->ingress_msg))
+		ret = NULL;
+	else
+		ret = list_next_entry(msg, list);
+	spin_unlock_bh(&psock->ingress_lock);
+	return ret;
 }
 
 static inline bool sk_psock_queue_empty(const struct sk_psock *psock)
@@ -292,6 +331,13 @@ static inline bool sk_psock_queue_empty(const struct sk_psock *psock)
 	return psock ? list_empty(&psock->ingress_msg) : true;
 }
 
+static inline void kfree_sk_msg(struct sk_msg *msg)
+{
+	if (msg->skb)
+		consume_skb(msg->skb);
+	kfree(msg);
+}
+
 static inline void sk_psock_report_error(struct sk_psock *psock, int err)
 {
 	struct sock *sk = psock->sk;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index bebf84ed4e30..305dddc51857 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -592,6 +592,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 
 	INIT_WORK(&psock->work, sk_psock_backlog);
 	INIT_LIST_HEAD(&psock->ingress_msg);
+	spin_lock_init(&psock->ingress_lock);
 	skb_queue_head_init(&psock->ingress_skb);
 
 	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
@@ -638,7 +639,9 @@ static void sk_psock_zap_ingress(struct sk_psock *psock)
 		skb_bpf_redirect_clear(skb);
 		kfree_skb(skb);
 	}
+	spin_lock_bh(&psock->ingress_lock);
 	__sk_psock_purge_ingress_msg(psock);
+	spin_unlock_bh(&psock->ingress_lock);
 }
 
 static void sk_psock_link_destroy(struct sk_psock *psock)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 17c322b875fd..ae980716d896 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -18,9 +18,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 	struct sk_msg *msg_rx;
 	int i, copied = 0;
 
-	msg_rx = list_first_entry_or_null(&psock->ingress_msg,
-					  struct sk_msg, list);
-
+	msg_rx = sk_psock_peek_msg(psock);
 	while (copied != len) {
 		struct scatterlist *sge;
 
@@ -68,22 +66,18 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		} while (i != msg_rx->sg.end);
 
 		if (unlikely(peek)) {
-			if (msg_rx == list_last_entry(&psock->ingress_msg,
-						      struct sk_msg, list))
+			msg_rx = sk_psock_next_msg(psock, msg_rx);
+			if (!msg_rx)
 				break;
-			msg_rx = list_next_entry(msg_rx, list);
 			continue;
 		}
 
 		msg_rx->sg.start = i;
 		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
-			list_del(&msg_rx->list);
-			if (msg_rx->skb)
-				consume_skb(msg_rx->skb);
-			kfree(msg_rx);
+			msg_rx = sk_psock_dequeue_msg(psock);
+			kfree_sk_msg(msg_rx);
 		}
-		msg_rx = list_first_entry_or_null(&psock->ingress_msg,
-						  struct sk_msg, list);
+		msg_rx = sk_psock_peek_msg(psock);
 	}
 
 	return copied;
-- 
2.25.1



* [Patch bpf-next v7 03/13] net: introduce skb_send_sock() for sock_map
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 01/13] skmsg: lock ingress_skb when purging Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 02/13] skmsg: introduce a spinlock to protect ingress_msg Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 04/13] skmsg: avoid lock_sock() in sk_psock_backlog() Cong Wang
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

We only have skb_send_sock_locked(), which requires callers
to hold lock_sock(). Introduce a variant skb_send_sock() which
handles the locking on its own, so callers no longer need to
lock the socket. This will save us from adding a
->sendmsg_locked for each protocol.

To reuse the code, factor the body out into __skb_send_sock(),
which takes function pointers, and build both skb_send_sock()
and skb_send_sock_locked() on top of it.
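
For the curious, INDIRECT_CALL_2() (include/linux/indirect_call_wrapper.h)
avoids a retpoline-protected indirect call for the two expected targets.
Its effect is roughly the following (a sketch of the macro's behavior,
not its exact definition):

  /* INDIRECT_CALL_2(f, f2, f1, args...) behaves like:
   *
   *      if (f == f2)
   *              f2(args...);
   *      else if (f == f1)
   *              f1(args...);
   *      else
   *              f(args...);
   *
   * so __skb_send_sock() calls the locked or unlocked helpers
   * directly in the common cases.
   */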

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 55 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c8def85fcc22..dbf820a50a39 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3626,6 +3626,7 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    unsigned int flags);
 int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 			 int len);
+int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len);
 void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
 int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e8320b5d651a..3ad9e8425ab2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2500,9 +2500,32 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 }
 EXPORT_SYMBOL_GPL(skb_splice_bits);
 
-/* Send skb data on a socket. Socket must be locked. */
-int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
-			 int len)
+static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg,
+			    struct kvec *vec, size_t num, size_t size)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (!sock)
+		return -EINVAL;
+	return kernel_sendmsg(sock, msg, vec, num, size);
+}
+
+static int sendpage_unlocked(struct sock *sk, struct page *page, int offset,
+			     size_t size, int flags)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (!sock)
+		return -EINVAL;
+	return kernel_sendpage(sock, page, offset, size, flags);
+}
+
+typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg,
+			    struct kvec *vec, size_t num, size_t size);
+typedef int (*sendpage_func)(struct sock *sk, struct page *page, int offset,
+			     size_t size, int flags);
+static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
+			   int len, sendmsg_func sendmsg, sendpage_func sendpage)
 {
 	unsigned int orig_len = len;
 	struct sk_buff *head = skb;
@@ -2522,7 +2545,8 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 		memset(&msg, 0, sizeof(msg));
 		msg.msg_flags = MSG_DONTWAIT;
 
-		ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
+		ret = INDIRECT_CALL_2(sendmsg, kernel_sendmsg_locked,
+				      sendmsg_unlocked, sk, &msg, &kv, 1, slen);
 		if (ret <= 0)
 			goto error;
 
@@ -2553,9 +2577,11 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 		slen = min_t(size_t, len, skb_frag_size(frag) - offset);
 
 		while (slen) {
-			ret = kernel_sendpage_locked(sk, skb_frag_page(frag),
-						     skb_frag_off(frag) + offset,
-						     slen, MSG_DONTWAIT);
+			ret = INDIRECT_CALL_2(sendpage, kernel_sendpage_locked,
+					      sendpage_unlocked, sk,
+					      skb_frag_page(frag),
+					      skb_frag_off(frag) + offset,
+					      slen, MSG_DONTWAIT);
 			if (ret <= 0)
 				goto error;
 
@@ -2587,8 +2613,23 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 error:
 	return orig_len == len ? ret : orig_len - len;
 }
+
+/* Send skb data on a socket. Socket must be locked. */
+int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
+			 int len)
+{
+	return __skb_send_sock(sk, skb, offset, len, kernel_sendmsg_locked,
+			       kernel_sendpage_locked);
+}
 EXPORT_SYMBOL_GPL(skb_send_sock_locked);
 
+/* Send skb data on a socket. Socket must be unlocked. */
+int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len)
+{
+	return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked,
+			       sendpage_unlocked);
+}
+
 /**
  *	skb_store_bits - store bits from kernel buffer to skb
  *	@skb: destination buffer
-- 
2.25.1



* [Patch bpf-next v7 04/13] skmsg: avoid lock_sock() in sk_psock_backlog()
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (2 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 03/13] net: introduce skb_send_sock() for sock_map Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-29 19:41   ` John Fastabend
  2021-03-28 20:20 ` [Patch bpf-next v7 05/13] skmsg: use rcu work for destroying psock Cong Wang
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

We do not have to lock the sock to avoid losing sk_socket;
instead we can purge all the ingress queues when we close
the socket. Sending or receiving packets after orphaning the
socket makes no sense.

We already purge these queues when the psock refcnt reaches
zero, but here we want to purge them explicitly in
sock_map_close(). There are also some nasty race conditions
between testing the SK_PSOCK_TX_ENABLED bit and
queuing/canceling the psock work; we can expand the scope of
psock->ingress_lock a bit to cover them too, as sketched below.

As noticed by John, we still have to serialize psock->work
with a mutex, because the same work item could otherwise run
concurrently on different CPUs.
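
Concretely, the window closed by taking ingress_lock around the state
test looks roughly like this:

  CPU0 (sock_map_close)                CPU1 (verdict path)
  ---------------------                -------------------
  sk_psock_stop():
    spin_lock_bh(ingress_lock)
    clear SK_PSOCK_TX_ENABLED
    purge ingress_skb
    spin_unlock_bh(ingress_lock)
                                       spin_lock_bh(ingress_lock)
                                       SK_PSOCK_TX_ENABLED is clear,
                                       so drop the skb instead of
                                       queuing it and scheduling work
                                       spin_unlock_bh(ingress_lock)
  cancel_work_sync(&psock->work)

Without the lock, CPU1 could queue an skb and schedule the work after
CPU0 had already purged the queue.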

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h |  2 ++
 net/core/skmsg.c      | 50 +++++++++++++++++++++++++++++--------------
 net/core/sock_map.c   |  1 +
 3 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index f2d45a73b2b2..7382c4b518d7 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -99,6 +99,7 @@ struct sk_psock {
 	void (*saved_write_space)(struct sock *sk);
 	void (*saved_data_ready)(struct sock *sk);
 	struct proto			*sk_proto;
+	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
 	struct work_struct		work;
 	union {
@@ -347,6 +348,7 @@ static inline void sk_psock_report_error(struct sk_psock *psock, int err)
 }
 
 struct sk_psock *sk_psock_init(struct sock *sk, int node);
+void sk_psock_stop(struct sk_psock *psock, bool wait);
 
 #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
 int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 305dddc51857..9c25020086a9 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -497,7 +497,7 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 	if (!ingress) {
 		if (!sock_writeable(psock->sk))
 			return -EAGAIN;
-		return skb_send_sock_locked(psock->sk, skb, off, len);
+		return skb_send_sock(psock->sk, skb, off, len);
 	}
 	return sk_psock_skb_ingress(psock, skb);
 }
@@ -511,8 +511,7 @@ static void sk_psock_backlog(struct work_struct *work)
 	u32 len, off;
 	int ret;
 
-	/* Lock sock to avoid losing sk_socket during loop. */
-	lock_sock(psock->sk);
+	mutex_lock(&psock->work_mutex);
 	if (state->skb) {
 		skb = state->skb;
 		len = state->len;
@@ -529,7 +528,7 @@ static void sk_psock_backlog(struct work_struct *work)
 		skb_bpf_redirect_clear(skb);
 		do {
 			ret = -EIO;
-			if (likely(psock->sk->sk_socket))
+			if (!sock_flag(psock->sk, SOCK_DEAD))
 				ret = sk_psock_handle_skb(psock, skb, off,
 							  len, ingress);
 			if (ret <= 0) {
@@ -553,7 +552,7 @@ static void sk_psock_backlog(struct work_struct *work)
 			kfree_skb(skb);
 	}
 end:
-	release_sock(psock->sk);
+	mutex_unlock(&psock->work_mutex);
 }
 
 struct sk_psock *sk_psock_init(struct sock *sk, int node)
@@ -591,6 +590,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	spin_lock_init(&psock->link_lock);
 
 	INIT_WORK(&psock->work, sk_psock_backlog);
+	mutex_init(&psock->work_mutex);
 	INIT_LIST_HEAD(&psock->ingress_msg);
 	spin_lock_init(&psock->ingress_lock);
 	skb_queue_head_init(&psock->ingress_skb);
@@ -631,7 +631,7 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
 	}
 }
 
-static void sk_psock_zap_ingress(struct sk_psock *psock)
+static void __sk_psock_zap_ingress(struct sk_psock *psock)
 {
 	struct sk_buff *skb;
 
@@ -639,9 +639,7 @@ static void sk_psock_zap_ingress(struct sk_psock *psock)
 		skb_bpf_redirect_clear(skb);
 		kfree_skb(skb);
 	}
-	spin_lock_bh(&psock->ingress_lock);
 	__sk_psock_purge_ingress_msg(psock);
-	spin_unlock_bh(&psock->ingress_lock);
 }
 
 static void sk_psock_link_destroy(struct sk_psock *psock)
@@ -654,6 +652,18 @@ static void sk_psock_link_destroy(struct sk_psock *psock)
 	}
 }
 
+void sk_psock_stop(struct sk_psock *psock, bool wait)
+{
+	spin_lock_bh(&psock->ingress_lock);
+	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
+	sk_psock_cork_free(psock);
+	__sk_psock_zap_ingress(psock);
+	spin_unlock_bh(&psock->ingress_lock);
+
+	if (wait)
+		cancel_work_sync(&psock->work);
+}
+
 static void sk_psock_done_strp(struct sk_psock *psock);
 
 static void sk_psock_destroy_deferred(struct work_struct *gc)
@@ -665,12 +675,12 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
 	sk_psock_done_strp(psock);
 
 	cancel_work_sync(&psock->work);
+	mutex_destroy(&psock->work_mutex);
 
 	psock_progs_drop(&psock->progs);
 
 	sk_psock_link_destroy(psock);
 	sk_psock_cork_free(psock);
-	sk_psock_zap_ingress(psock);
 
 	if (psock->sk_redir)
 		sock_put(psock->sk_redir);
@@ -688,8 +698,7 @@ static void sk_psock_destroy(struct rcu_head *rcu)
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-	sk_psock_cork_free(psock);
-	sk_psock_zap_ingress(psock);
+	sk_psock_stop(psock, false);
 
 	write_lock_bh(&sk->sk_callback_lock);
 	sk_psock_restore_proto(sk, psock);
@@ -699,7 +708,6 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 	else if (psock->progs.stream_verdict)
 		sk_psock_stop_verdict(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
-	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
 
 	call_rcu(&psock->rcu, sk_psock_destroy);
 }
@@ -770,14 +778,20 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
 	 * error that caused the pipe to break. We can't send a packet on
 	 * a socket that is in this state so we drop the skb.
 	 */
-	if (!psock_other || sock_flag(sk_other, SOCK_DEAD) ||
-	    !sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
+	if (!psock_other || sock_flag(sk_other, SOCK_DEAD)) {
+		kfree_skb(skb);
+		return;
+	}
+	spin_lock_bh(&psock_other->ingress_lock);
+	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
+		spin_unlock_bh(&psock_other->ingress_lock);
 		kfree_skb(skb);
 		return;
 	}
 
 	skb_queue_tail(&psock_other->ingress_skb, skb);
 	schedule_work(&psock_other->work);
+	spin_unlock_bh(&psock_other->ingress_lock);
 }
 
 static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int verdict)
@@ -845,8 +859,12 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
 			err = sk_psock_skb_ingress_self(psock, skb);
 		}
 		if (err < 0) {
-			skb_queue_tail(&psock->ingress_skb, skb);
-			schedule_work(&psock->work);
+			spin_lock_bh(&psock->ingress_lock);
+			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+				skb_queue_tail(&psock->ingress_skb, skb);
+				schedule_work(&psock->work);
+			}
+			spin_unlock_bh(&psock->ingress_lock);
 		}
 		break;
 	case __SK_REDIRECT:
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index dd53a7771d7e..e564fdeaada1 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1540,6 +1540,7 @@ void sock_map_close(struct sock *sk, long timeout)
 	saved_close = psock->saved_close;
 	sock_map_remove_links(sk, psock);
 	rcu_read_unlock();
+	sk_psock_stop(psock, true);
 	release_sock(sk);
 	saved_close(sk, timeout);
 }
-- 
2.25.1



* [Patch bpf-next v7 05/13] skmsg: use rcu work for destroying psock
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (3 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 04/13] skmsg: avoid lock_sock() in sk_psock_backlog() Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-29 19:42   ` John Fastabend
  2021-03-28 20:20 ` [Patch bpf-next v7 06/13] skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg() Cong Wang
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer, John Fastabend

From: Cong Wang <cong.wang@bytedance.com>

The RCU callback sk_psock_destroy() only queues the work
psock->gc, so we can just switch to rcu_work to simplify the
code, as sketched below.
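
For reference, the rcu_work helpers collapse the previous two-step
dance into one call:

  /* Before: grace period, then hand-rolled deferral to a workqueue:
   *      call_rcu(&psock->rcu, sk_psock_destroy);
   *        -> INIT_WORK(&psock->gc, sk_psock_destroy_deferred);
   *           schedule_work(&psock->gc);
   *
   * After: queue_rcu_work() waits for the grace period and then
   * runs the work item:
   *      INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
   *      queue_rcu_work(system_wq, &psock->rwork);
   */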

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h |  5 +----
 net/core/skmsg.c      | 17 +++++------------
 2 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 7382c4b518d7..e7aba150539d 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -102,10 +102,7 @@ struct sk_psock {
 	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
 	struct work_struct		work;
-	union {
-		struct rcu_head		rcu;
-		struct work_struct	gc;
-	};
+	struct rcu_work			rwork;
 };
 
 int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9c25020086a9..d43d43905d2c 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -666,10 +666,10 @@ void sk_psock_stop(struct sk_psock *psock, bool wait)
 
 static void sk_psock_done_strp(struct sk_psock *psock);
 
-static void sk_psock_destroy_deferred(struct work_struct *gc)
+static void sk_psock_destroy(struct work_struct *work)
 {
-	struct sk_psock *psock = container_of(gc, struct sk_psock, gc);
-
+	struct sk_psock *psock = container_of(to_rcu_work(work),
+					      struct sk_psock, rwork);
 	/* No sk_callback_lock since already detached. */
 
 	sk_psock_done_strp(psock);
@@ -688,14 +688,6 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
 	kfree(psock);
 }
 
-static void sk_psock_destroy(struct rcu_head *rcu)
-{
-	struct sk_psock *psock = container_of(rcu, struct sk_psock, rcu);
-
-	INIT_WORK(&psock->gc, sk_psock_destroy_deferred);
-	schedule_work(&psock->gc);
-}
-
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
 	sk_psock_stop(psock, false);
@@ -709,7 +701,8 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 		sk_psock_stop_verdict(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
 
-	call_rcu(&psock->rcu, sk_psock_destroy);
+	INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
+	queue_rcu_work(system_wq, &psock->rwork);
 }
 EXPORT_SYMBOL_GPL(sk_psock_drop);
 
-- 
2.25.1



* [Patch bpf-next v7 06/13] skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg()
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (4 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 05/13] skmsg: use rcu work for destroying psock Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-29 19:44   ` John Fastabend
  2021-03-28 20:20 ` [Patch bpf-next v7 07/13] sock_map: introduce BPF_SK_SKB_VERDICT Cong Wang
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

This function is only called in process context, where sleeping
is allowed, so GFP_KERNEL can be used instead of GFP_ATOMIC.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/core/skmsg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index d43d43905d2c..656eceab73bc 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -410,7 +410,7 @@ static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 	if (!sk_rmem_schedule(sk, skb, skb->truesize))
 		return NULL;
 
-	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
+	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_KERNEL);
 	if (unlikely(!msg))
 		return NULL;
 
-- 
2.25.1



* [Patch bpf-next v7 07/13] sock_map: introduce BPF_SK_SKB_VERDICT
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (5 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 06/13] skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg() Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-29 20:09   ` John Fastabend
  2021-03-28 20:20 ` [Patch bpf-next v7 08/13] sock: introduce sk->sk_prot->psock_update_sk_prot() Cong Wang
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Reusing BPF_SK_SKB_STREAM_VERDICT would be possible, but its name
is confusing and, more importantly, we still want to distinguish
the two attach types from user space. So we can just reuse the
stream verdict code but introduce a new type of eBPF program,
skb_verdict. Users are not allowed to attach stream_verdict and
skb_verdict programs at the same time.
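
From user space, the new program type is attached to a sockmap just
like a stream verdict program, only with the new attach type (a sketch
using libbpf; prog_fd/map_fd are placeholders):

  /* prog_fd: a loaded BPF_PROG_TYPE_SK_SKB program,
   * map_fd: a BPF_MAP_TYPE_SOCKMAP (or SOCKHASH) map.
   */
  err = bpf_prog_attach(prog_fd, map_fd, BPF_SK_SKB_VERDICT, 0);

Attaching both a stream_verdict and an skb_verdict program to the same
psock is rejected; see the checks in sock_map_link().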

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h          |  2 ++
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           |  1 +
 net/core/skmsg.c               |  4 +++-
 net/core/sock_map.c            | 23 ++++++++++++++++++++++-
 tools/bpf/bpftool/common.c     |  1 +
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  1 +
 8 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index e7aba150539d..c83dbc2d81d9 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -58,6 +58,7 @@ struct sk_psock_progs {
 	struct bpf_prog			*msg_parser;
 	struct bpf_prog			*stream_parser;
 	struct bpf_prog			*stream_verdict;
+	struct bpf_prog			*skb_verdict;
 };
 
 enum sk_psock_state_bits {
@@ -487,6 +488,7 @@ static inline void psock_progs_drop(struct sk_psock_progs *progs)
 	psock_set_prog(&progs->msg_parser, NULL);
 	psock_set_prog(&progs->stream_parser, NULL);
 	psock_set_prog(&progs->stream_verdict, NULL);
+	psock_set_prog(&progs->skb_verdict, NULL);
 }
 
 int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 598716742593..49371eba98ba 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -957,6 +957,7 @@ enum bpf_attach_type {
 	BPF_XDP_CPUMAP,
 	BPF_SK_LOOKUP,
 	BPF_XDP,
+	BPF_SK_SKB_VERDICT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9603de81811a..6428634da57e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2948,6 +2948,7 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_SK_MSG;
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
+	case BPF_SK_SKB_VERDICT:
 		return BPF_PROG_TYPE_SK_SKB;
 	case BPF_LIRC_MODE2:
 		return BPF_PROG_TYPE_LIRC_MODE2;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 656eceab73bc..a045812d7c78 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -697,7 +697,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 	rcu_assign_sk_user_data(sk, NULL);
 	if (psock->progs.stream_parser)
 		sk_psock_stop_strp(sk, psock);
-	else if (psock->progs.stream_verdict)
+	else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
 		sk_psock_stop_verdict(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
 
@@ -1024,6 +1024,8 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
 	}
 	skb_set_owner_r(skb, sk);
 	prog = READ_ONCE(psock->progs.stream_verdict);
+	if (!prog)
+		prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
 		skb_dst_drop(skb);
 		skb_bpf_redirect_clear(skb);
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index e564fdeaada1..c46709786a49 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -155,6 +155,8 @@ static void sock_map_del_link(struct sock *sk,
 				strp_stop = true;
 			if (psock->saved_data_ready && stab->progs.stream_verdict)
 				verdict_stop = true;
+			if (psock->saved_data_ready && stab->progs.skb_verdict)
+				verdict_stop = true;
 			list_del(&link->list);
 			sk_psock_free_link(link);
 		}
@@ -227,7 +229,7 @@ static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
 static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 			 struct sock *sk)
 {
-	struct bpf_prog *msg_parser, *stream_parser, *stream_verdict;
+	struct bpf_prog *msg_parser, *stream_parser, *stream_verdict, *skb_verdict;
 	struct sk_psock *psock;
 	int ret;
 
@@ -256,6 +258,15 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 		}
 	}
 
+	skb_verdict = READ_ONCE(progs->skb_verdict);
+	if (skb_verdict) {
+		skb_verdict = bpf_prog_inc_not_zero(skb_verdict);
+		if (IS_ERR(skb_verdict)) {
+			ret = PTR_ERR(skb_verdict);
+			goto out_put_msg_parser;
+		}
+	}
+
 	psock = sock_map_psock_get_checked(sk);
 	if (IS_ERR(psock)) {
 		ret = PTR_ERR(psock);
@@ -265,6 +276,7 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 	if (psock) {
 		if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
 		    (stream_parser  && READ_ONCE(psock->progs.stream_parser)) ||
+		    (skb_verdict && READ_ONCE(psock->progs.skb_verdict)) ||
 		    (stream_verdict && READ_ONCE(psock->progs.stream_verdict))) {
 			sk_psock_put(sk, psock);
 			ret = -EBUSY;
@@ -296,6 +308,9 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 	} else if (!stream_parser && stream_verdict && !psock->saved_data_ready) {
 		psock_set_prog(&psock->progs.stream_verdict, stream_verdict);
 		sk_psock_start_verdict(sk,psock);
+	} else if (!stream_verdict && skb_verdict && !psock->saved_data_ready) {
+		psock_set_prog(&psock->progs.skb_verdict, skb_verdict);
+		sk_psock_start_verdict(sk, psock);
 	}
 	write_unlock_bh(&sk->sk_callback_lock);
 	return 0;
@@ -304,6 +319,9 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 out_drop:
 	sk_psock_put(sk, psock);
 out_progs:
+	if (skb_verdict)
+		bpf_prog_put(skb_verdict);
+out_put_msg_parser:
 	if (msg_parser)
 		bpf_prog_put(msg_parser);
 out_put_stream_parser:
@@ -1468,6 +1486,9 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 	case BPF_SK_SKB_STREAM_VERDICT:
 		pprog = &progs->stream_verdict;
 		break;
+	case BPF_SK_SKB_VERDICT:
+		pprog = &progs->skb_verdict;
+		break;
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 65303664417e..1828bba19020 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -57,6 +57,7 @@ const char * const attach_type_name[__MAX_BPF_ATTACH_TYPE] = {
 
 	[BPF_SK_SKB_STREAM_PARSER]	= "sk_skb_stream_parser",
 	[BPF_SK_SKB_STREAM_VERDICT]	= "sk_skb_stream_verdict",
+	[BPF_SK_SKB_VERDICT]		= "sk_skb_verdict",
 	[BPF_SK_MSG_VERDICT]		= "sk_msg_verdict",
 	[BPF_LIRC_MODE2]		= "lirc_mode2",
 	[BPF_FLOW_DISSECTOR]		= "flow_dissector",
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index f2b915b20546..3f067d2d7584 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -76,6 +76,7 @@ enum dump_mode {
 static const char * const attach_type_strings[] = {
 	[BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
 	[BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+	[BPF_SK_SKB_VERDICT] = "skb_verdict",
 	[BPF_SK_MSG_VERDICT] = "msg_verdict",
 	[BPF_FLOW_DISSECTOR] = "flow_dissector",
 	[__MAX_BPF_ATTACH_TYPE] = NULL,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ab9f2233607c..69902603012c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -957,6 +957,7 @@ enum bpf_attach_type {
 	BPF_XDP_CPUMAP,
 	BPF_SK_LOOKUP,
 	BPF_XDP,
+	BPF_SK_SKB_VERDICT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.25.1



* [Patch bpf-next v7 08/13] sock: introduce sk->sk_prot->psock_update_sk_prot()
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (6 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 07/13] sock_map: introduce BPF_SK_SKB_VERDICT Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap Cong Wang
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Currently sockmap calls into each protocol to update and replace
the struct proto. This certainly won't work when the protocol is
implemented as a module, for example AF_UNIX.

Introduce a new op, sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto,
as sketched below. This also helps get rid of symbol dependencies
on CONFIG_INET.
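
A protocol then only needs to provide its own hook; a hypothetical
module protocol would wire it up roughly like this (the unix_* names
are made up for illustration, mirroring the udp_bpf_update_proto()
added below):

  static int unix_bpf_update_proto(struct sock *sk, bool restore)
  {
          struct sk_psock *psock = sk_psock(sk);

          if (restore) {
                  sk->sk_write_space = psock->saved_write_space;
                  /* Pairs with lockless read in sk_clone_lock() */
                  WRITE_ONCE(sk->sk_prot, psock->sk_proto);
                  return 0;
          }
          WRITE_ONCE(sk->sk_prot, &unix_bpf_prot);
          return 0;
  }

  struct proto unix_proto = {
          ...
  #ifdef CONFIG_BPF_SYSCALL
          .psock_update_sk_prot = unix_bpf_update_proto,
  #endif
  };

sock_map_init_proto() then calls this hook instead of hard-coding the
tcp/udp helpers.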

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h | 18 +++---------------
 include/net/sock.h    |  3 +++
 include/net/tcp.h     |  1 +
 include/net/udp.h     |  1 +
 net/core/skmsg.c      |  5 -----
 net/core/sock_map.c   | 24 ++++--------------------
 net/ipv4/tcp_bpf.c    | 24 +++++++++++++++++++++---
 net/ipv4/tcp_ipv4.c   |  3 +++
 net/ipv4/udp.c        |  3 +++
 net/ipv4/udp_bpf.c    | 15 +++++++++++++--
 net/ipv6/tcp_ipv6.c   |  3 +++
 net/ipv6/udp.c        |  3 +++
 12 files changed, 58 insertions(+), 45 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index c83dbc2d81d9..5e800ddc2dc6 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -99,6 +99,7 @@ struct sk_psock {
 	void (*saved_close)(struct sock *sk, long timeout);
 	void (*saved_write_space)(struct sock *sk);
 	void (*saved_data_ready)(struct sock *sk);
+	int  (*psock_update_sk_prot)(struct sock *sk, bool restore);
 	struct proto			*sk_proto;
 	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
@@ -395,25 +396,12 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
 	}
 }
 
-static inline void sk_psock_update_proto(struct sock *sk,
-					 struct sk_psock *psock,
-					 struct proto *ops)
-{
-	/* Pairs with lockless read in sk_clone_lock() */
-	WRITE_ONCE(sk->sk_prot, ops);
-}
-
 static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
 	sk->sk_prot->unhash = psock->saved_unhash;
-	if (inet_csk_has_ulp(sk)) {
-		tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
-	} else {
-		sk->sk_write_space = psock->saved_write_space;
-		/* Pairs with lockless read in sk_clone_lock() */
-		WRITE_ONCE(sk->sk_prot, psock->sk_proto);
-	}
+	if (psock->psock_update_sk_prot)
+		psock->psock_update_sk_prot(sk, true);
 }
 
 static inline void sk_psock_set_state(struct sk_psock *psock,
diff --git a/include/net/sock.h b/include/net/sock.h
index 0b6266fd6bf6..8b4155e756c2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1184,6 +1184,9 @@ struct proto {
 	void			(*unhash)(struct sock *sk);
 	void			(*rehash)(struct sock *sk);
 	int			(*get_port)(struct sock *sk, unsigned short snum);
+#ifdef CONFIG_BPF_SYSCALL
+	int			(*psock_update_sk_prot)(struct sock *sk, bool restore);
+#endif
 
 	/* Keeping track of sockets in use */
 #ifdef CONFIG_PROC_FS
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 075de26f449d..2efa4e5ea23d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2203,6 +2203,7 @@ struct sk_psock;
 
 #ifdef CONFIG_BPF_SYSCALL
 struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock);
+int tcp_bpf_update_proto(struct sock *sk, bool restore);
 void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 #endif /* CONFIG_BPF_SYSCALL */
 
diff --git a/include/net/udp.h b/include/net/udp.h
index d4d064c59232..df7cc1edc200 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -518,6 +518,7 @@ static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
 #ifdef CONFIG_BPF_SYSCALL
 struct sk_psock;
 struct proto *udp_bpf_get_proto(struct sock *sk, struct sk_psock *psock);
+int udp_bpf_update_proto(struct sock *sk, bool restore);
 #endif
 
 #endif	/* _UDP_H */
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index a045812d7c78..9fc83f7cc1a0 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -562,11 +562,6 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 
 	write_lock_bh(&sk->sk_callback_lock);
 
-	if (inet_csk_has_ulp(sk)) {
-		psock = ERR_PTR(-EINVAL);
-		goto out;
-	}
-
 	if (sk->sk_user_data) {
 		psock = ERR_PTR(-EBUSY);
 		goto out;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index c46709786a49..d04b98fc8104 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
 
 static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
 {
-	struct proto *prot;
-
-	switch (sk->sk_type) {
-	case SOCK_STREAM:
-		prot = tcp_bpf_get_proto(sk, psock);
-		break;
-
-	case SOCK_DGRAM:
-		prot = udp_bpf_get_proto(sk, psock);
-		break;
-
-	default:
+	if (!sk->sk_prot->psock_update_sk_prot)
 		return -EINVAL;
-	}
-
-	if (IS_ERR(prot))
-		return PTR_ERR(prot);
-
-	sk_psock_update_proto(sk, psock, prot);
-	return 0;
+	psock->psock_update_sk_prot = sk->sk_prot->psock_update_sk_prot;
+	return sk->sk_prot->psock_update_sk_prot(sk, false);
 }
 
 static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
@@ -570,7 +554,7 @@ static bool sock_map_redirect_allowed(const struct sock *sk)
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
 {
-	return sk_is_tcp(sk) || sk_is_udp(sk);
+	return !!sk->sk_prot->psock_update_sk_prot;
 }
 
 static bool sock_map_sk_state_allowed(const struct sock *sk)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index ae980716d896..ac8cfbaeacd2 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -595,20 +595,38 @@ static int tcp_bpf_assert_proto_ops(struct proto *ops)
 	       ops->sendpage == tcp_sendpage ? 0 : -ENOTSUPP;
 }
 
-struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock)
+int tcp_bpf_update_proto(struct sock *sk, bool restore)
 {
+	struct sk_psock *psock = sk_psock(sk);
 	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
 	int config = psock->progs.msg_parser   ? TCP_BPF_TX   : TCP_BPF_BASE;
 
+	if (restore) {
+		if (inet_csk_has_ulp(sk)) {
+			tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
+		} else {
+			sk->sk_write_space = psock->saved_write_space;
+			/* Pairs with lockless read in sk_clone_lock() */
+			WRITE_ONCE(sk->sk_prot, psock->sk_proto);
+		}
+		return 0;
+	}
+
+	if (inet_csk_has_ulp(sk))
+		return -EINVAL;
+
 	if (sk->sk_family == AF_INET6) {
 		if (tcp_bpf_assert_proto_ops(psock->sk_proto))
-			return ERR_PTR(-EINVAL);
+			return -EINVAL;
 
 		tcp_bpf_check_v6_needs_rebuild(psock->sk_proto);
 	}
 
-	return &tcp_bpf_prots[family][config];
+	/* Pairs with lockless read in sk_clone_lock() */
+	WRITE_ONCE(sk->sk_prot, &tcp_bpf_prots[family][config]);
+	return 0;
 }
+EXPORT_SYMBOL_GPL(tcp_bpf_update_proto);
 
 /* If a child got cloned from a listening socket that had tcp_bpf
  * protocol callbacks installed, we need to restore the callbacks to
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index daad4f99db32..dfc6d1c0e710 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2806,6 +2806,9 @@ struct proto tcp_prot = {
 	.hash			= inet_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.psock_update_sk_prot	= tcp_bpf_update_proto,
+#endif
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
 	.leave_memory_pressure	= tcp_leave_memory_pressure,
 	.stream_memory_free	= tcp_stream_memory_free,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4a0478b17243..38952aaee3a1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2849,6 +2849,9 @@ struct proto udp_prot = {
 	.unhash			= udp_lib_unhash,
 	.rehash			= udp_v4_rehash,
 	.get_port		= udp_v4_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.psock_update_sk_prot	= udp_bpf_update_proto,
+#endif
 	.memory_allocated	= &udp_memory_allocated,
 	.sysctl_mem		= sysctl_udp_mem,
 	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_udp_wmem_min),
diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 7a94791efc1a..6001f93cd3a0 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -41,12 +41,23 @@ static int __init udp_bpf_v4_build_proto(void)
 }
 core_initcall(udp_bpf_v4_build_proto);
 
-struct proto *udp_bpf_get_proto(struct sock *sk, struct sk_psock *psock)
+int udp_bpf_update_proto(struct sock *sk, bool restore)
 {
 	int family = sk->sk_family == AF_INET ? UDP_BPF_IPV4 : UDP_BPF_IPV6;
+	struct sk_psock *psock = sk_psock(sk);
+
+	if (restore) {
+		sk->sk_write_space = psock->saved_write_space;
+		/* Pairs with lockless read in sk_clone_lock() */
+		WRITE_ONCE(sk->sk_prot, psock->sk_proto);
+		return 0;
+	}
 
 	if (sk->sk_family == AF_INET6)
 		udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
 
-	return &udp_bpf_prots[family];
+	/* Pairs with lockless read in sk_clone_lock() */
+	WRITE_ONCE(sk->sk_prot, &udp_bpf_prots[family]);
+	return 0;
 }
+EXPORT_SYMBOL_GPL(udp_bpf_update_proto);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d0f007741e8e..bff22d6ef516 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2139,6 +2139,9 @@ struct proto tcpv6_prot = {
 	.hash			= inet6_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.psock_update_sk_prot	= tcp_bpf_update_proto,
+#endif
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
 	.leave_memory_pressure	= tcp_leave_memory_pressure,
 	.stream_memory_free	= tcp_stream_memory_free,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d25e5a9252fd..ef2c75bb4771 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1713,6 +1713,9 @@ struct proto udpv6_prot = {
 	.unhash			= udp_lib_unhash,
 	.rehash			= udp_v6_rehash,
 	.get_port		= udp_v6_get_port,
+#ifdef CONFIG_BPF_SYSCALL
+	.psock_update_sk_prot	= udp_bpf_update_proto,
+#endif
 	.memory_allocated	= &udp_memory_allocated,
 	.sysctl_mem		= sysctl_udp_mem,
 	.sysctl_wmem_offset     = offsetof(struct net, ipv4.sysctl_udp_wmem_min),
-- 
2.25.1



* [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (7 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 08/13] sock: introduce sk->sk_prot->psock_update_sk_prot() Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-29 20:54   ` John Fastabend
  2021-03-28 20:20 ` [Patch bpf-next v7 10/13] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() Cong Wang
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

This is similar to tcp_read_sock(), except that we do not need to
worry about connections; we just need to retrieve skbs from the
UDP receive queue.

Note that the return value of ->read_sock() is unused in
sk_psock_verdict_data_ready(), as shown below.
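
For context, the verdict path drives ->read_sock() from the data_ready
callback, roughly like this (simplified from net/core/skmsg.c):

  static void sk_psock_verdict_data_ready(struct sock *sk)
  {
          struct socket *sock = sk->sk_socket;
          read_descriptor_t desc;

          if (unlikely(!sock || !sock->ops || !sock->ops->read_sock))
                  return;

          desc.arg.data = sk;
          desc.error = 0;
          desc.count = 1;

          /* The return value is ignored, hence the note above. */
          sock->ops->read_sock(sk, &desc, sk_psock_verdict_recv);
  }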

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/net/udp.h   |  2 ++
 net/ipv4/af_inet.c  |  1 +
 net/ipv4/udp.c      | 35 +++++++++++++++++++++++++++++++++++
 net/ipv6/af_inet6.c |  1 +
 4 files changed, 39 insertions(+)

diff --git a/include/net/udp.h b/include/net/udp.h
index df7cc1edc200..347b62a753c3 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -329,6 +329,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
 			       struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
+int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
+		  sk_read_actor_t recv_actor);
 
 /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
  * possibly multiple cache miss on dequeue()
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1355e6c0d567..f17870ee558b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1070,6 +1070,7 @@ const struct proto_ops inet_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
+	.read_sock	   = udp_read_sock,
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 38952aaee3a1..04620e4d64ab 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1782,6 +1782,41 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
 }
 EXPORT_SYMBOL(__skb_recv_udp);
 
+int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
+		  sk_read_actor_t recv_actor)
+{
+	int copied = 0;
+
+	while (1) {
+		int offset = 0, err;
+		struct sk_buff *skb;
+
+		skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
+		if (!skb)
+			return err;
+		if (offset < skb->len) {
+			size_t len;
+			int used;
+
+			len = skb->len - offset;
+			used = recv_actor(desc, skb, offset, len);
+			if (used <= 0) {
+				if (!copied)
+					copied = used;
+				break;
+			} else if (used <= len) {
+				copied += used;
+				offset += used;
+			}
+		}
+		if (!desc->count)
+			break;
+	}
+
+	return copied;
+}
+EXPORT_SYMBOL(udp_read_sock);
+
 /*
  * 	This should be easy, if there is something there we
  * 	return it, otherwise we block.
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 802f5111805a..71de739b4a9e 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -714,6 +714,7 @@ const struct proto_ops inet6_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet6_sendmsg,		/* retpoline's sake */
 	.recvmsg	   = inet6_recvmsg,		/* retpoline's sake */
+	.read_sock	   = udp_read_sock,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 	.set_peek_off	   = sk_set_peek_off,
-- 
2.25.1



* [Patch bpf-next v7 10/13] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (8 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 11/13] udp: implement udp_bpf_recvmsg() for sockmap Cong Wang
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Although these two functions are currently only used by TCP, they
are not specific to TCP at all: both operate on skmsg and
ingress_msg, so they fit in net/core/skmsg.c very well.

We will also need them for non-TCP protocols, so rename them, move
them to skmsg.c, and export them to modules.
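
The intended non-TCP user is udp_bpf_recvmsg() in a later patch; the
shape of a protocol recvmsg built on these helpers is roughly the
following (a simplified sketch, error handling omitted):

  psock = sk_psock_get(sk);
  ...
  lock_sock(sk);
  msg_bytes_ready:
          copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
          if (!copied) {
                  long timeo = sock_rcvtimeo(sk, nonblock);

                  if (sk_msg_wait_data(sk, psock, flags, timeo, &err))
                          goto msg_bytes_ready;
                  copied = -EAGAIN;
          }
  release_sock(sk);
  sk_psock_put(sk, psock);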

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h |   4 ++
 include/net/tcp.h     |   2 -
 net/core/skmsg.c      |  98 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_bpf.c    | 100 +-----------------------------------------
 net/tls/tls_sw.c      |   4 +-
 5 files changed, 106 insertions(+), 102 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 5e800ddc2dc6..f78e90a04a69 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -125,6 +125,10 @@ int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
 			      struct sk_msg *msg, u32 bytes);
 int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			     struct sk_msg *msg, u32 bytes);
+int sk_msg_wait_data(struct sock *sk, struct sk_psock *psock, int flags,
+		     long timeo, int *err);
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		   int len, int flags);
 
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
 {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2efa4e5ea23d..31b1696c62ba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2209,8 +2209,6 @@ void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 
 int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, u32 bytes,
 			  int flags);
-int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
-		      struct msghdr *msg, int len, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
 
 #if !defined(CONFIG_BPF_SYSCALL) || !defined(CONFIG_NET_SOCK_MSG)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9fc83f7cc1a0..92a83c02562a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -399,6 +399,104 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 }
 EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
 
+int sk_msg_wait_data(struct sock *sk, struct sk_psock *psock, int flags,
+		     long timeo, int *err)
+{
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
+	int ret = 0;
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		return 1;
+
+	if (!timeo)
+		return ret;
+
+	add_wait_queue(sk_sleep(sk), &wait);
+	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	ret = sk_wait_event(sk, &timeo,
+			    !list_empty(&psock->ingress_msg) ||
+			    !skb_queue_empty(&sk->sk_receive_queue), &wait);
+	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	remove_wait_queue(sk_sleep(sk), &wait);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sk_msg_wait_data);
+
+/* Receive sk_msg from psock->ingress_msg to @msg. */
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		   int len, int flags)
+{
+	struct iov_iter *iter = &msg->msg_iter;
+	int peek = flags & MSG_PEEK;
+	struct sk_msg *msg_rx;
+	int i, copied = 0;
+
+	msg_rx = sk_psock_peek_msg(psock);
+	while (copied != len) {
+		struct scatterlist *sge;
+
+		if (unlikely(!msg_rx))
+			break;
+
+		i = msg_rx->sg.start;
+		do {
+			struct page *page;
+			int copy;
+
+			sge = sk_msg_elem(msg_rx, i);
+			copy = sge->length;
+			page = sg_page(sge);
+			if (copied + copy > len)
+				copy = len - copied;
+			copy = copy_page_to_iter(page, sge->offset, copy, iter);
+			if (!copy)
+				return copied ? copied : -EFAULT;
+
+			copied += copy;
+			if (likely(!peek)) {
+				sge->offset += copy;
+				sge->length -= copy;
+				if (!msg_rx->skb)
+					sk_mem_uncharge(sk, copy);
+				msg_rx->sg.size -= copy;
+
+				if (!sge->length) {
+					sk_msg_iter_var_next(i);
+					if (!msg_rx->skb)
+						put_page(page);
+				}
+			} else {
+				/* Let's not optimize the peek case: if
+				 * copy_page_to_iter() didn't copy the entire length, just break.
+				 */
+				if (copy != sge->length)
+					return copied;
+				sk_msg_iter_var_next(i);
+			}
+
+			if (copied == len)
+				break;
+		} while (i != msg_rx->sg.end);
+
+		if (unlikely(peek)) {
+			msg_rx = sk_psock_next_msg(psock, msg_rx);
+			if (!msg_rx)
+				break;
+			continue;
+		}
+
+		msg_rx->sg.start = i;
+		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
+			msg_rx = sk_psock_dequeue_msg(psock);
+			kfree_sk_msg(msg_rx);
+		}
+		msg_rx = sk_psock_peek_msg(psock);
+	}
+
+	return copied;
+}
+EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
+
 static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 						  struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index ac8cfbaeacd2..3d622a0d0753 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -10,80 +10,6 @@
 #include <net/inet_common.h>
 #include <net/tls.h>
 
-int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
-		      struct msghdr *msg, int len, int flags)
-{
-	struct iov_iter *iter = &msg->msg_iter;
-	int peek = flags & MSG_PEEK;
-	struct sk_msg *msg_rx;
-	int i, copied = 0;
-
-	msg_rx = sk_psock_peek_msg(psock);
-	while (copied != len) {
-		struct scatterlist *sge;
-
-		if (unlikely(!msg_rx))
-			break;
-
-		i = msg_rx->sg.start;
-		do {
-			struct page *page;
-			int copy;
-
-			sge = sk_msg_elem(msg_rx, i);
-			copy = sge->length;
-			page = sg_page(sge);
-			if (copied + copy > len)
-				copy = len - copied;
-			copy = copy_page_to_iter(page, sge->offset, copy, iter);
-			if (!copy)
-				return copied ? copied : -EFAULT;
-
-			copied += copy;
-			if (likely(!peek)) {
-				sge->offset += copy;
-				sge->length -= copy;
-				if (!msg_rx->skb)
-					sk_mem_uncharge(sk, copy);
-				msg_rx->sg.size -= copy;
-
-				if (!sge->length) {
-					sk_msg_iter_var_next(i);
-					if (!msg_rx->skb)
-						put_page(page);
-				}
-			} else {
-				/* Lets not optimize peek case if copy_page_to_iter
-				 * didn't copy the entire length lets just break.
-				 */
-				if (copy != sge->length)
-					return copied;
-				sk_msg_iter_var_next(i);
-			}
-
-			if (copied == len)
-				break;
-		} while (i != msg_rx->sg.end);
-
-		if (unlikely(peek)) {
-			msg_rx = sk_psock_next_msg(psock, msg_rx);
-			if (!msg_rx)
-				break;
-			continue;
-		}
-
-		msg_rx->sg.start = i;
-		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
-			msg_rx = sk_psock_dequeue_msg(psock);
-			kfree_sk_msg(msg_rx);
-		}
-		msg_rx = sk_psock_peek_msg(psock);
-	}
-
-	return copied;
-}
-EXPORT_SYMBOL_GPL(__tcp_bpf_recvmsg);
-
 static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
 			   struct sk_msg *msg, u32 apply_bytes, int flags)
 {
@@ -237,28 +163,6 @@ static bool tcp_bpf_stream_read(const struct sock *sk)
 	return !empty;
 }
 
-static int tcp_bpf_wait_data(struct sock *sk, struct sk_psock *psock,
-			     int flags, long timeo, int *err)
-{
-	DEFINE_WAIT_FUNC(wait, woken_wake_function);
-	int ret = 0;
-
-	if (sk->sk_shutdown & RCV_SHUTDOWN)
-		return 1;
-
-	if (!timeo)
-		return ret;
-
-	add_wait_queue(sk_sleep(sk), &wait);
-	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
-	ret = sk_wait_event(sk, &timeo,
-			    !list_empty(&psock->ingress_msg) ||
-			    !skb_queue_empty(&sk->sk_receive_queue), &wait);
-	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
-	remove_wait_queue(sk_sleep(sk), &wait);
-	return ret;
-}
-
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		    int nonblock, int flags, int *addr_len)
 {
@@ -278,13 +182,13 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}
 	lock_sock(sk);
 msg_bytes_ready:
-	copied = __tcp_bpf_recvmsg(sk, psock, msg, len, flags);
+	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
 	if (!copied) {
 		int data, err = 0;
 		long timeo;
 
 		timeo = sock_rcvtimeo(sk, nonblock);
-		data = tcp_bpf_wait_data(sk, psock, flags, timeo, &err);
+		data = sk_msg_wait_data(sk, psock, flags, timeo, &err);
 		if (data) {
 			if (!sk_psock_queue_empty(psock))
 				goto msg_bytes_ready;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 01d933ae5f16..1dcb34dfd56b 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1789,8 +1789,8 @@ int tls_sw_recvmsg(struct sock *sk,
 		skb = tls_wait_data(sk, psock, flags, timeo, &err);
 		if (!skb) {
 			if (psock) {
-				int ret = __tcp_bpf_recvmsg(sk, psock,
-							    msg, len, flags);
+				int ret = sk_msg_recvmsg(sk, psock, msg, len,
+							 flags);
 
 				if (ret > 0) {
 					decrypted += ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Patch bpf-next v7 11/13] udp: implement udp_bpf_recvmsg() for sockmap
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (9 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 10/13] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-28 20:20 ` [Patch bpf-next v7 12/13] sock_map: update sock type checks for UDP Cong Wang
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

We have to implement udp_bpf_recvmsg() to replace the original
->recvmsg(), so that skmsg can be retrieved from ingress_msg.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
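A note for reviewers: udp_bpf_recvmsg() only takes effect after the
sockmap code has switched the socket's proto. A rough sketch (untested,
assuming the ->psock_update_sk_prot() hook introduced in patch 08 and
the names used in this series) of that switch:

	/* Called when a UDP socket is added to / removed from a sockmap. */
	int udp_bpf_update_proto(struct sock *sk, bool restore)
	{
		int family = sk->sk_family == AF_INET ? UDP_BPF_IPV4 : UDP_BPF_IPV6;
		struct sk_psock *psock = sk_psock(sk);

		if (restore) {
			/* Put the original proto (and its recvmsg) back. */
			WRITE_ONCE(sk->sk_prot, psock->sk_proto);
			return 0;
		}

		if (sk->sk_family == AF_INET6)
			udp_bpf_check_v6_needs_rebuild(psock->sk_proto);

		/* From here on, recvmsg() goes through udp_bpf_recvmsg(). */
		WRITE_ONCE(sk->sk_prot, &udp_bpf_prots[family]);
		return 0;
	}
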
 net/ipv4/udp_bpf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 6001f93cd3a0..7d5c4ebf42fe 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -4,6 +4,68 @@
 #include <linux/skmsg.h>
 #include <net/sock.h>
 #include <net/udp.h>
+#include <net/inet_common.h>
+
+#include "udp_impl.h"
+
+static struct proto *udpv6_prot_saved __read_mostly;
+
+static int sk_udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			  int noblock, int flags, int *addr_len)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		return udpv6_prot_saved->recvmsg(sk, msg, len, noblock, flags,
+						 addr_len);
+#endif
+	return udp_prot.recvmsg(sk, msg, len, noblock, flags, addr_len);
+}
+
+static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			   int nonblock, int flags, int *addr_len)
+{
+	struct sk_psock *psock;
+	int copied, ret;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return inet_recv_error(sk, msg, len, addr_len);
+
+	psock = sk_psock_get(sk);
+	if (unlikely(!psock))
+		return sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+
+	lock_sock(sk);
+	if (sk_psock_queue_empty(psock)) {
+		ret = sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+		goto out;
+	}
+
+msg_bytes_ready:
+	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	if (!copied) {
+		int data, err = 0;
+		long timeo;
+
+		timeo = sock_rcvtimeo(sk, nonblock);
+		data = sk_msg_wait_data(sk, psock, flags, timeo, &err);
+		if (data) {
+			if (!sk_psock_queue_empty(psock))
+				goto msg_bytes_ready;
+			ret = sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+			goto out;
+		}
+		if (err) {
+			ret = err;
+			goto out;
+		}
+		copied = -EAGAIN;
+	}
+	ret = copied;
+out:
+	release_sock(sk);
+	sk_psock_put(sk, psock);
+	return ret;
+}
 
 enum {
 	UDP_BPF_IPV4,
@@ -11,7 +73,6 @@ enum {
 	UDP_BPF_NUM_PROTS,
 };
 
-static struct proto *udpv6_prot_saved __read_mostly;
 static DEFINE_SPINLOCK(udpv6_prot_lock);
 static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
 
@@ -20,6 +81,7 @@ static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
 	*prot        = *base;
 	prot->unhash = sock_map_unhash;
 	prot->close  = sock_map_close;
+	prot->recvmsg = udp_bpf_recvmsg;
 }
 
 static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Patch bpf-next v7 12/13] sock_map: update sock type checks for UDP
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (10 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 11/13] udp: implement udp_bpf_recvmsg() for sockmap Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-29 23:10   ` John Fastabend
  2021-03-28 20:20 ` [Patch bpf-next v7 13/13] selftests/bpf: add a test case for udp sockmap Cong Wang
  2021-03-28 23:27 ` [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Alexei Starovoitov
  13 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Now that UDP supports sockmap and redirection, we can safely update
the sock type checks for it accordingly.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
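A note for reviewers: for UDP, sk->sk_state == TCP_ESTABLISHED simply
means the socket is connected, so in practice user space has to
connect() a UDP socket before it can be used as a redirect target. A
minimal user-space sketch (untested; the map fd and peer address are
assumed to be set up elsewhere):

	#include <unistd.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <bpf/bpf.h>

	static int add_connected_udp(int map_fd, __u32 key,
				     const struct sockaddr_in *peer)
	{
		__u64 value;
		int fd;

		fd = socket(AF_INET, SOCK_DGRAM, 0);
		if (fd < 0)
			return -1;
		/* connect() moves a UDP socket to TCP_ESTABLISHED, which
		 * sock_map_redirect_allowed() now requires for !TCP.
		 */
		if (connect(fd, (const struct sockaddr *)peer, sizeof(*peer))) {
			close(fd);
			return -1;
		}
		value = fd;
		return bpf_map_update_elem(map_fd, &key, &value, BPF_NOEXIST);
	}
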
 net/core/sock_map.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d04b98fc8104..9ed040c7d9e7 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
 
 static bool sock_map_redirect_allowed(const struct sock *sk)
 {
-	return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
+	if (sk_is_tcp(sk))
+		return sk->sk_state != TCP_LISTEN;
+	else
+		return sk->sk_state == TCP_ESTABLISHED;
 }
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Patch bpf-next v7 13/13] selftests/bpf: add a test case for udp sockmap
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (11 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 12/13] sock_map: update sock type checks for UDP Cong Wang
@ 2021-03-28 20:20 ` Cong Wang
  2021-03-28 23:27 ` [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Alexei Starovoitov
  13 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-28 20:20 UTC (permalink / raw)
  To: netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

From: Cong Wang <cong.wang@bytedance.com>

Add a test case to ensure redirection between two UDP sockets works.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
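A note for reviewers: the new subtests should be runnable in isolation
with something like:

	cd tools/testing/selftests/bpf
	make
	./test_progs -t sockmap_listen
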
 .../selftests/bpf/prog_tests/sockmap_listen.c | 136 ++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 2 files changed, 158 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index c26e6bf05e49..648d9ae898d2 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -1603,6 +1603,141 @@ static void test_reuseport(struct test_sockmap_listen *skel,
 	}
 }
 
+static void udp_redir_to_connected(int family, int sotype, int sock_mapfd,
+				   int verd_mapfd, enum redir_mode mode)
+{
+	const char *log_prefix = redir_mode_str(mode);
+	struct sockaddr_storage addr;
+	int c0, c1, p0, p1;
+	unsigned int pass;
+	socklen_t len;
+	int err, n;
+	u64 value;
+	u32 key;
+	char b;
+
+	zero_verdict_count(verd_mapfd);
+
+	p0 = socket_loopback(family, sotype | SOCK_NONBLOCK);
+	if (p0 < 0)
+		return;
+	len = sizeof(addr);
+	err = xgetsockname(p0, sockaddr(&addr), &len);
+	if (err)
+		goto close_peer0;
+
+	c0 = xsocket(family, sotype | SOCK_NONBLOCK, 0);
+	if (c0 < 0)
+		goto close_peer0;
+	err = xconnect(c0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+	err = xgetsockname(c0, sockaddr(&addr), &len);
+	if (err)
+		goto close_cli0;
+	err = xconnect(p0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+
+	p1 = socket_loopback(family, sotype | SOCK_NONBLOCK);
+	if (p1 < 0)
+		goto close_cli0;
+	err = xgetsockname(p1, sockaddr(&addr), &len);
+	if (err)
+		goto close_cli0;
+
+	c1 = xsocket(family, sotype | SOCK_NONBLOCK, 0);
+	if (c1 < 0)
+		goto close_peer1;
+	err = xconnect(c1, sockaddr(&addr), len);
+	if (err)
+		goto close_cli1;
+	err = xgetsockname(c1, sockaddr(&addr), &len);
+	if (err)
+		goto close_cli1;
+	err = xconnect(p1, sockaddr(&addr), len);
+	if (err)
+		goto close_cli1;
+
+	key = 0;
+	value = p0;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_cli1;
+
+	key = 1;
+	value = p1;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_cli1;
+
+	n = write(c1, "a", 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: write", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete write", log_prefix);
+	if (n < 1)
+		goto close_cli1;
+
+	key = SK_PASS;
+	err = xbpf_map_lookup_elem(verd_mapfd, &key, &pass);
+	if (err)
+		goto close_cli1;
+	if (pass != 1)
+		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
+
+	n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: read", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete read", log_prefix);
+
+close_cli1:
+	xclose(c1);
+close_peer1:
+	xclose(p1);
+close_cli0:
+	xclose(c0);
+close_peer0:
+	xclose(p0);
+}
+
+static void udp_skb_redir_to_connected(struct test_sockmap_listen *skel,
+				       struct bpf_map *inner_map, int family)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(inner_map);
+	int err;
+
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_SKB_VERDICT, 0);
+	if (err)
+		return;
+
+	skel->bss->test_ingress = false;
+	udp_redir_to_connected(family, SOCK_DGRAM, sock_map, verdict_map,
+			       REDIR_EGRESS);
+	skel->bss->test_ingress = true;
+	udp_redir_to_connected(family, SOCK_DGRAM, sock_map, verdict_map,
+			       REDIR_INGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_VERDICT);
+}
+
+static void test_udp_redir(struct test_sockmap_listen *skel, struct bpf_map *map,
+			   int family)
+{
+	const char *family_name, *map_name;
+	char s[MAX_TEST_NAME];
+
+	family_name = family_str(family);
+	map_name = map_type_str(map);
+	snprintf(s, sizeof(s), "%s %s %s", map_name, family_name, __func__);
+	if (!test__start_subtest(s))
+		return;
+	udp_skb_redir_to_connected(skel, map, family);
+}
+
 static void run_tests(struct test_sockmap_listen *skel, struct bpf_map *map,
 		      int family)
 {
@@ -1611,6 +1746,7 @@ static void run_tests(struct test_sockmap_listen *skel, struct bpf_map *map,
 	test_redir(skel, map, family, SOCK_STREAM);
 	test_reuseport(skel, map, family, SOCK_STREAM);
 	test_reuseport(skel, map, family, SOCK_DGRAM);
+	test_udp_redir(skel, map, family);
 }
 
 void test_sockmap_listen(void)
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
index fa221141e9c1..a39eba9f5201 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
@@ -29,6 +29,7 @@ struct {
 } verdict_map SEC(".maps");
 
 static volatile bool test_sockmap; /* toggled by user-space */
+static volatile bool test_ingress; /* toggled by user-space */
 
 SEC("sk_skb/stream_parser")
 int prog_stream_parser(struct __sk_buff *skb)
@@ -55,6 +56,27 @@ int prog_stream_verdict(struct __sk_buff *skb)
 	return verdict;
 }
 
+SEC("sk_skb/skb_verdict")
+int prog_skb_verdict(struct __sk_buff *skb)
+{
+	unsigned int *count;
+	__u32 zero = 0;
+	int verdict;
+
+	if (test_sockmap)
+		verdict = bpf_sk_redirect_map(skb, &sock_map, zero,
+					      test_ingress ? BPF_F_INGRESS : 0);
+	else
+		verdict = bpf_sk_redirect_hash(skb, &sock_hash, &zero,
+					       test_ingress ? BPF_F_INGRESS : 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
 SEC("sk_msg")
 int prog_msg_verdict(struct sk_msg_md *msg)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP
  2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
                   ` (12 preceding siblings ...)
  2021-03-28 20:20 ` [Patch bpf-next v7 13/13] selftests/bpf: add a test case for udp sockmap Cong Wang
@ 2021-03-28 23:27 ` Alexei Starovoitov
  2021-03-29 15:03   ` John Fastabend
  13 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2021-03-28 23:27 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang

On Sun, Mar 28, 2021 at 01:20:00PM -0700, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> We have thousands of services connected to a daemon on every host
> via AF_UNIX dgram sockets, after they are moved into VM, we have to
> add a proxy to forward these communications from VM to host, because
> rewriting thousands of them is not practical. This proxy uses an
> AF_UNIX socket connected to services and a UDP socket to connect to
> the host. It is inefficient because data is copied between kernel
> space and user space twice, and we can not use splice() which only
> supports TCP. Therefore, we want to use sockmap to do the splicing
> without going to user-space at all (after the initial setup).
> 
> Currently sockmap only fully supports TCP, UDP is partially supported
> as it is only allowed to add into sockmap. This patchset, as the second
> part of the original large patchset, extends sockmap with:
> 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.
> 
> On the high level, ->read_sock() is required for each protocol to support
> sockmap redirection, and in order to do sock proto update, a new ops
> ->psock_update_sk_prot() is introduced, which is also required. And the
> BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
> retrieve skmsg. To make life easier, we have to get rid of lock_sock()
> in sk_psock_handle_skb(), otherwise we would have to implement
> ->sendmsg_locked() on top of ->sendmsg(), which is ugly.
> 
> Please see each patch for more details.
> 
> To see the big picture, the original patchset is available here:
> https://github.com/congwang/linux/tree/sockmap
> this patchset is also available:
> https://github.com/congwang/linux/tree/sockmap2
> 
> ---
> v7: use work_mutex to protect psock->work
>     return err in udp_read_sock()
>     add patch 6/13
>     clean up test case

The feature looks great to me.
I think the selftest is a bit light in terms of coverage, but it's acceptable.
I'd like to see the final Acks from John/Daniel and Jakub/Lorenz before merging.
Folks,
please prioritize the review of these patches.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP
  2021-03-28 23:27 ` [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Alexei Starovoitov
@ 2021-03-29 15:03   ` John Fastabend
  2021-03-29 16:57     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: John Fastabend @ 2021-03-29 15:03 UTC (permalink / raw)
  To: Alexei Starovoitov, Cong Wang
  Cc: netdev, bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang

Alexei Starovoitov wrote:
> On Sun, Mar 28, 2021 at 01:20:00PM -0700, Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> > 
> > We have thousands of services connected to a daemon on every host
> > via AF_UNIX dgram sockets, after they are moved into VM, we have to
> > add a proxy to forward these communications from VM to host, because
> > rewriting thousands of them is not practical. This proxy uses an
> > AF_UNIX socket connected to services and a UDP socket to connect to
> > the host. It is inefficient because data is copied between kernel
> > space and user space twice, and we can not use splice() which only
> > supports TCP. Therefore, we want to use sockmap to do the splicing
> > without going to user-space at all (after the initial setup).
> > 
> > Currently sockmap only fully supports TCP, UDP is partially supported
> > as it is only allowed to add into sockmap. This patchset, as the second
> > part of the original large patchset, extends sockmap with:
> > 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.
> > 
> > On the high level, ->read_sock() is required for each protocol to support
> > sockmap redirection, and in order to do sock proto update, a new ops
> > ->psock_update_sk_prot() is introduced, which is also required. And the
> > BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
> > retrieve skmsg. To make life easier, we have to get rid of lock_sock()
> > in sk_psock_handle_skb(), otherwise we would have to implement
> > ->sendmsg_locked() on top of ->sendmsg(), which is ugly.
> > 
> > Please see each patch for more details.
> > 
> > To see the big picture, the original patchset is available here:
> > https://github.com/congwang/linux/tree/sockmap
> > this patchset is also available:
> > https://github.com/congwang/linux/tree/sockmap2
> > 
> > ---
> > v7: use work_mutex to protect psock->work
> >     return err in udp_read_sock()
> >     add patch 6/13
> >     clean up test case
> 
> The feature looks great to me.
> I think the selftest is a bit light in terms of coverage, but it's acceptable.

+1

> I'd like to see the final Acks from John/Daniel and Jakub/Lorenz before merging.
> Folks,
> please prioritize the review of these patches.

This is getting really close. I'll take another pass over it today. Thanks

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP
  2021-03-29 15:03   ` John Fastabend
@ 2021-03-29 16:57     ` Cong Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-29 16:57 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Linux Kernel Network Developers, bpf,
	duanxiongchun, Dongdong Wang, Jiang Wang, Cong Wang

On Mon, Mar 29, 2021 at 8:03 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Alexei Starovoitov wrote:
> > On Sun, Mar 28, 2021 at 01:20:00PM -0700, Cong Wang wrote:
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > We have thousands of services connected to a daemon on every host
> > > via AF_UNIX dgram sockets, after they are moved into VM, we have to
> > > add a proxy to forward these communications from VM to host, because
> > > rewriting thousands of them is not practical. This proxy uses an
> > > AF_UNIX socket connected to services and a UDP socket to connect to
> > > the host. It is inefficient because data is copied between kernel
> > > space and user space twice, and we can not use splice() which only
> > > supports TCP. Therefore, we want to use sockmap to do the splicing
> > > without going to user-space at all (after the initial setup).
> > >
> > > Currently sockmap only fully supports TCP, UDP is partially supported
> > > as it is only allowed to add into sockmap. This patchset, as the second
> > > part of the original large patchset, extends sockmap with:
> > > 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.
> > >
> > > On the high level, ->read_sock() is required for each protocol to support
> > > sockmap redirection, and in order to do sock proto update, a new ops
> > > ->psock_update_sk_prot() is introduced, which is also required. And the
> > > BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
> > > retrieve skmsg. To make life easier, we have to get rid of lock_sock()
> > > in sk_psock_handle_skb(), otherwise we would have to implement
> > > ->sendmsg_locked() on top of ->sendmsg(), which is ugly.
> > >
> > > Please see each patch for more details.
> > >
> > > To see the big picture, the original patchset is available here:
> > > https://github.com/congwang/linux/tree/sockmap
> > > this patchset is also available:
> > > https://github.com/congwang/linux/tree/sockmap2
> > >
> > > ---
> > > v7: use work_mutex to protect psock->work
> > >     return err in udp_read_sock()
> > >     add patch 6/13
> > >     clean up test case
> >
> > The feature looks great to me.
> > I think the selftest is a bit light in terms of coverage, but it's acceptable.
>
> +1

Well, the first half of this patchset still focuses on the existing code,
which is already covered by existing test cases. The second half adds
BPF_SK_SKB_VERDICT and UDP support, which are already covered by my new
test case. And apparently UDP will never support as many sockmap program
types as TCP, for example BPF_SK_SKB_STREAM_PARSER, hence it naturally
has fewer test cases than TCP. A cross-protocol test case will be added
in the next patchset, when AF_UNIX comes in.

If I missed anything, please be specific. Just saying the test case is
light does not help me understand what I need to add.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [Patch bpf-next v7 02/13] skmsg: introduce a spinlock to protect ingress_msg
  2021-03-28 20:20 ` [Patch bpf-next v7 02/13] skmsg: introduce a spinlock to protect ingress_msg Cong Wang
@ 2021-03-29 19:11   ` John Fastabend
  0 siblings, 0 replies; 30+ messages in thread
From: John Fastabend @ 2021-03-29 19:11 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Currently we rely on lock_sock to protect ingress_msg,
> it is too big for this, we can actually just use a spinlock
> to protect this list like protecting other skb queues.
> 
> __tcp_bpf_recvmsg() is still special because of peeking,
> it still has to use lock_sock.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [Patch bpf-next v7 04/13] skmsg: avoid lock_sock() in sk_psock_backlog()
  2021-03-28 20:20 ` [Patch bpf-next v7 04/13] skmsg: avoid lock_sock() in sk_psock_backlog() Cong Wang
@ 2021-03-29 19:41   ` John Fastabend
  0 siblings, 0 replies; 30+ messages in thread
From: John Fastabend @ 2021-03-29 19:41 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> We do not have to lock the sock to avoid losing sk_socket,
> instead we can purge all the ingress queues when we close
> the socket. Sending or receiving packets after orphaning
> socket makes no sense.
> 
> We do purge these queues when psock refcnt reaches zero but
> here we want to purge them explicitly in sock_map_close().
> There are also some nasty race conditions on testing bit
> SK_PSOCK_TX_ENABLED and queuing/canceling the psock work,
> we can expand psock->ingress_lock a bit to protect them too.
> 
> As noticed by John, we still have to lock the psock->work,
> because the same work item could be running concurrently on
> different CPU's.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [Patch bpf-next v7 05/13] skmsg: use rcu work for destroying psock
  2021-03-28 20:20 ` [Patch bpf-next v7 05/13] skmsg: use rcu work for destroying psock Cong Wang
@ 2021-03-29 19:42   ` John Fastabend
  0 siblings, 0 replies; 30+ messages in thread
From: John Fastabend @ 2021-03-29 19:42 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer, John Fastabend

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> The RCU callback sk_psock_destroy() only queues work psock->gc,
> so we can just switch to rcu work to simplify the code.
> 
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

LGTM

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [Patch bpf-next v7 06/13] skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg()
  2021-03-28 20:20 ` [Patch bpf-next v7 06/13] skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg() Cong Wang
@ 2021-03-29 19:44   ` John Fastabend
  0 siblings, 0 replies; 30+ messages in thread
From: John Fastabend @ 2021-03-29 19:44 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> This function is only called in process context.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  net/core/skmsg.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index d43d43905d2c..656eceab73bc 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -410,7 +410,7 @@ static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
>  	if (!sk_rmem_schedule(sk, skb, skb->truesize))
>  		return NULL;
>  
> -	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
> +	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_KERNEL);
>  	if (unlikely(!msg))
>  		return NULL;
>  
> -- 
> 2.25.1
> 

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [Patch bpf-next v7 07/13] sock_map: introduce BPF_SK_SKB_VERDICT
  2021-03-28 20:20 ` [Patch bpf-next v7 07/13] sock_map: introduce BPF_SK_SKB_VERDICT Cong Wang
@ 2021-03-29 20:09   ` John Fastabend
  2021-03-30  1:27     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: John Fastabend @ 2021-03-29 20:09 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is
> confusing and more importantly we still want to distinguish them
> from user-space. So we can just reuse the stream verdict code but
> introduce a new type of eBPF program, skb_verdict. Users are not
> allowed to set stream_verdict and skb_verdict at the same time.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

[...]

> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 656eceab73bc..a045812d7c78 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -697,7 +697,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>  	rcu_assign_sk_user_data(sk, NULL);
>  	if (psock->progs.stream_parser)
>  		sk_psock_stop_strp(sk, psock);
> -	else if (psock->progs.stream_verdict)
> +	else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
>  		sk_psock_stop_verdict(sk, psock);
>  	write_unlock_bh(&sk->sk_callback_lock);
>  
> @@ -1024,6 +1024,8 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
>  	}
>  	skb_set_owner_r(skb, sk);
>  	prog = READ_ONCE(psock->progs.stream_verdict);
> +	if (!prog)
> +		prog = READ_ONCE(psock->progs.skb_verdict);

Trying to think through this case: a user attaches an skb_verdict
program to a map, then updates the map with a bunch of TCP sockets. The
above code will run the skb_verdict program on the TCP socket, as far
as I can tell.

This is OK because there really is no difference, other than by name,
between a skb_verdict and a stream_verdict program? Do we want something
to block adding TCP sockets to maps with stream_verdict programs? It
feels a bit odd in its current state to me.

>  	if (likely(prog)) {
>  		skb_dst_drop(skb);
>  		skb_bpf_redirect_clear(skb);
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index e564fdeaada1..c46709786a49 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -155,6 +155,8 @@ static void sock_map_del_link(struct sock *sk,
>  				strp_stop = true;
>  			if (psock->saved_data_ready && stab->progs.stream_verdict)
>  				verdict_stop = true;
> +			if (psock->saved_data_ready && stab->progs.skb_verdict)
> +				verdict_stop = true;
>  			list_del(&link->list);
>  			sk_psock_free_link(link);
>  		}
> @@ -227,7 +229,7 @@ static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
>  static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
>  			 struct sock *sk)
>  {
> -	struct bpf_prog *msg_parser, *stream_parser, *stream_verdict;
> +	struct bpf_prog *msg_parser, *stream_parser, *stream_verdict, *skb_verdict;
>  	struct sk_psock *psock;
>  	int ret;
>  
> @@ -256,6 +258,15 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
>  		}
>  	}
>  
> +	skb_verdict = READ_ONCE(progs->skb_verdict);
> +	if (skb_verdict) {
> +		skb_verdict = bpf_prog_inc_not_zero(skb_verdict);
> +		if (IS_ERR(skb_verdict)) {
> +			ret = PTR_ERR(skb_verdict);
> +			goto out_put_msg_parser;
> +		}
> +	}
> +
>  	psock = sock_map_psock_get_checked(sk);
>  	if (IS_ERR(psock)) {
>  		ret = PTR_ERR(psock);
> @@ -265,6 +276,7 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
>  	if (psock) {
>  		if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
>  		    (stream_parser  && READ_ONCE(psock->progs.stream_parser)) ||
> +		    (skb_verdict && READ_ONCE(psock->progs.skb_verdict)) ||
>  		    (stream_verdict && READ_ONCE(psock->progs.stream_verdict))) {
>  			sk_psock_put(sk, psock);
>  			ret = -EBUSY;

Do we need another test here,

   (skb_verdict && READ_ONCE(psock->progs.stream_verdict)

this way we return EBUSY and avoid having both stream_verdict and
skb_verdict attached on the same map?

From commit msg:
 "Users are not allowed to set stream_verdict and skb_verdict at
  the same time."

> @@ -296,6 +308,9 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
>  	} else if (!stream_parser && stream_verdict && !psock->saved_data_ready) {
>  		psock_set_prog(&psock->progs.stream_verdict, stream_verdict);
>  		sk_psock_start_verdict(sk,psock);
> +	} else if (!stream_verdict && skb_verdict && !psock->saved_data_ready) {
> +		psock_set_prog(&psock->progs.skb_verdict, skb_verdict);
> +		sk_psock_start_verdict(sk, psock);

Thanks,
John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap
  2021-03-28 20:20 ` [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap Cong Wang
@ 2021-03-29 20:54   ` John Fastabend
  2021-03-30  5:39     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: John Fastabend @ 2021-03-29 20:54 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> This is similar to tcp_read_sock(), except we do not need
> to worry about connections, we just need to retrieve skb
> from UDP receive queue.
> 
> Note, the return value of ->read_sock() is unused in
> sk_psock_verdict_data_ready().
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/net/udp.h   |  2 ++
>  net/ipv4/af_inet.c  |  1 +
>  net/ipv4/udp.c      | 35 +++++++++++++++++++++++++++++++++++
>  net/ipv6/af_inet6.c |  1 +
>  4 files changed, 39 insertions(+)
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index df7cc1edc200..347b62a753c3 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -329,6 +329,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
>  			       struct sk_buff *skb);
>  struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
>  				 __be16 sport, __be16 dport);
> +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> +		  sk_read_actor_t recv_actor);
>  
>  /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
>   * possibly multiple cache miss on dequeue()
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 1355e6c0d567..f17870ee558b 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1070,6 +1070,7 @@ const struct proto_ops inet_dgram_ops = {
>  	.setsockopt	   = sock_common_setsockopt,
>  	.getsockopt	   = sock_common_getsockopt,
>  	.sendmsg	   = inet_sendmsg,
> +	.read_sock	   = udp_read_sock,
>  	.recvmsg	   = inet_recvmsg,
>  	.mmap		   = sock_no_mmap,
>  	.sendpage	   = inet_sendpage,
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 38952aaee3a1..04620e4d64ab 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1782,6 +1782,41 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
>  }
>  EXPORT_SYMBOL(__skb_recv_udp);
>  
> +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> +		  sk_read_actor_t recv_actor)
> +{
> +	int copied = 0;
> +
> +	while (1) {
> +		int offset = 0, err;

Should this be

 int offset = sk_peek_offset()?

MSG_PEEK should work from recv side, at least it does on TCP side. If
its handled in some following patch a comment would be nice. I was
just reading udp_recvmsg() so maybe its not needed.

> +		struct sk_buff *skb;
> +
> +		skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> +		if (!skb)
> +			return err;
> +		if (offset < skb->len) {
> +			size_t len;
> +			int used;
> +
> +			len = skb->len - offset;
> +			used = recv_actor(desc, skb, offset, len);
> +			if (used <= 0) {
> +				if (!copied)
> +					copied = used;
> +				break;
> +			} else if (used <= len) {
> +				copied += used;
> +				offset += used;

The while loop is going to zero this? What are we trying to do
here with offset?

> +			}
> +		}
> +		if (!desc->count)
> +			break;
> +	}
> +
> +	return copied;
> +}
> +EXPORT_SYMBOL(udp_read_sock);
> +
>  /*
>   * 	This should be easy, if there is something there we
>   * 	return it, otherwise we block.
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 802f5111805a..71de739b4a9e 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -714,6 +714,7 @@ const struct proto_ops inet6_dgram_ops = {
>  	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
>  	.sendmsg	   = inet6_sendmsg,		/* retpoline's sake */
>  	.recvmsg	   = inet6_recvmsg,		/* retpoline's sake */
> +	.read_sock	   = udp_read_sock,
>  	.mmap		   = sock_no_mmap,
>  	.sendpage	   = sock_no_sendpage,
>  	.set_peek_off	   = sk_set_peek_off,
> -- 
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [Patch bpf-next v7 12/13] sock_map: update sock type checks for UDP
  2021-03-28 20:20 ` [Patch bpf-next v7 12/13] sock_map: update sock type checks for UDP Cong Wang
@ 2021-03-29 23:10   ` John Fastabend
  2021-03-30  5:47     ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: John Fastabend @ 2021-03-29 23:10 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: bpf, duanxiongchun, wangdongdong.6, jiang.wang, Cong Wang,
	John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Now UDP supports sockmap and redirection, we can safely update
> the sock type checks for it accordingly.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  net/core/sock_map.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index d04b98fc8104..9ed040c7d9e7 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
>  
>  static bool sock_map_redirect_allowed(const struct sock *sk)
>  {
> -	return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
> +	if (sk_is_tcp(sk))
> +		return sk->sk_state != TCP_LISTEN;
> +	else
> +		return sk->sk_state == TCP_ESTABLISHED;
>  }
>  
>  static bool sock_map_sk_is_suitable(const struct sock *sk)
> -- 
> 2.25.1
> 

I think it's a bit odd for TCP_ESTABLISHED to work with !tcp, but
that's not your invention, so LGTM.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 07/13] sock_map: introduce BPF_SK_SKB_VERDICT
  2021-03-29 20:09   ` John Fastabend
@ 2021-03-30  1:27     ` Cong Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-30  1:27 UTC (permalink / raw)
  To: John Fastabend
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

On Mon, Mar 29, 2021 at 1:10 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is
> > confusing and more importantly we still want to distinguish them
> > from user-space. So we can just reuse the stream verdict code but
> > introduce a new type of eBPF program, skb_verdict. Users are not
> > allowed to set stream_verdict and skb_verdict at the same time.
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
>
> [...]
>
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 656eceab73bc..a045812d7c78 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -697,7 +697,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
> >       rcu_assign_sk_user_data(sk, NULL);
> >       if (psock->progs.stream_parser)
> >               sk_psock_stop_strp(sk, psock);
> > -     else if (psock->progs.stream_verdict)
> > +     else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
> >               sk_psock_stop_verdict(sk, psock);
> >       write_unlock_bh(&sk->sk_callback_lock);
> >
> > @@ -1024,6 +1024,8 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
> >       }
> >       skb_set_owner_r(skb, sk);
> >       prog = READ_ONCE(psock->progs.stream_verdict);
> > +     if (!prog)
> > +             prog = READ_ONCE(psock->progs.skb_verdict);
>
> Trying to think through this case. User attachs skb_verdict program
> to map, then updates map with a bunch of TCP sockets. The above
> code will run the skb_verdict program with the TCP socket as far as
> I can tell.
>
> This is OK because there really is no difference, other than by name,
> between a skb_verdict and a stream_verdict program? Do we want something
> to block adding TCP sockets to maps with stream_verdict programs? It
> feels a bit odd in its current state to me.

Yes, it should work too. skb_verdict only extends stream_verdict beyond
TCP; it does not prohibit TCP.

>
> >       if (likely(prog)) {
> >               skb_dst_drop(skb);
> >               skb_bpf_redirect_clear(skb);
> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > index e564fdeaada1..c46709786a49 100644
> > --- a/net/core/sock_map.c
> > +++ b/net/core/sock_map.c
> > @@ -155,6 +155,8 @@ static void sock_map_del_link(struct sock *sk,
> >                               strp_stop = true;
> >                       if (psock->saved_data_ready && stab->progs.stream_verdict)
> >                               verdict_stop = true;
> > +                     if (psock->saved_data_ready && stab->progs.skb_verdict)
> > +                             verdict_stop = true;
> >                       list_del(&link->list);
> >                       sk_psock_free_link(link);
> >               }
> > @@ -227,7 +229,7 @@ static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
> >  static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
> >                        struct sock *sk)
> >  {
> > -     struct bpf_prog *msg_parser, *stream_parser, *stream_verdict;
> > +     struct bpf_prog *msg_parser, *stream_parser, *stream_verdict, *skb_verdict;
> >       struct sk_psock *psock;
> >       int ret;
> >
> > @@ -256,6 +258,15 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
> >               }
> >       }
> >
> > +     skb_verdict = READ_ONCE(progs->skb_verdict);
> > +     if (skb_verdict) {
> > +             skb_verdict = bpf_prog_inc_not_zero(skb_verdict);
> > +             if (IS_ERR(skb_verdict)) {
> > +                     ret = PTR_ERR(skb_verdict);
> > +                     goto out_put_msg_parser;
> > +             }
> > +     }
> > +
> >       psock = sock_map_psock_get_checked(sk);
> >       if (IS_ERR(psock)) {
> >               ret = PTR_ERR(psock);
> > @@ -265,6 +276,7 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
> >       if (psock) {
> >               if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
> >                   (stream_parser  && READ_ONCE(psock->progs.stream_parser)) ||
> > +                 (skb_verdict && READ_ONCE(psock->progs.skb_verdict)) ||
> >                   (stream_verdict && READ_ONCE(psock->progs.stream_verdict))) {
> >                       sk_psock_put(sk, psock);
> >                       ret = -EBUSY;
>
> Do we need another test here,
>
>    (skb_verdict && READ_ONCE(psock->progs.stream_verdict)
>
> this way we return EBUSY and avoid having both stream_verdict and
> skb_verdict attached on the same map?

Yes, good catch, we do need a check here. And I will see if I can add a small
test case for this too.
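
Something like this (untested), extending the existing psock check:

	if (psock) {
		if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
		    (stream_parser && READ_ONCE(psock->progs.stream_parser)) ||
		    (skb_verdict && READ_ONCE(psock->progs.skb_verdict)) ||
		    (skb_verdict && READ_ONCE(psock->progs.stream_verdict)) ||
		    (stream_verdict && READ_ONCE(psock->progs.stream_verdict)) ||
		    (stream_verdict && READ_ONCE(psock->progs.skb_verdict))) {
			sk_psock_put(sk, psock);
			ret = -EBUSY;
			goto out_progs;
		}
	}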

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap
  2021-03-29 20:54   ` John Fastabend
@ 2021-03-30  5:39     ` Cong Wang
  2021-03-30  6:23       ` John Fastabend
  0 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-30  5:39 UTC (permalink / raw)
  To: John Fastabend
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

On Mon, Mar 29, 2021 at 1:54 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > This is similar to tcp_read_sock(), except we do not need
> > to worry about connections, we just need to retrieve skb
> > from UDP receive queue.
> >
> > Note, the return value of ->read_sock() is unused in
> > sk_psock_verdict_data_ready().
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >  include/net/udp.h   |  2 ++
> >  net/ipv4/af_inet.c  |  1 +
> >  net/ipv4/udp.c      | 35 +++++++++++++++++++++++++++++++++++
> >  net/ipv6/af_inet6.c |  1 +
> >  4 files changed, 39 insertions(+)
> >
> > diff --git a/include/net/udp.h b/include/net/udp.h
> > index df7cc1edc200..347b62a753c3 100644
> > --- a/include/net/udp.h
> > +++ b/include/net/udp.h
> > @@ -329,6 +329,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
> >                              struct sk_buff *skb);
> >  struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
> >                                __be16 sport, __be16 dport);
> > +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> > +               sk_read_actor_t recv_actor);
> >
> >  /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
> >   * possibly multiple cache miss on dequeue()
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 1355e6c0d567..f17870ee558b 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1070,6 +1070,7 @@ const struct proto_ops inet_dgram_ops = {
> >       .setsockopt        = sock_common_setsockopt,
> >       .getsockopt        = sock_common_getsockopt,
> >       .sendmsg           = inet_sendmsg,
> > +     .read_sock         = udp_read_sock,
> >       .recvmsg           = inet_recvmsg,
> >       .mmap              = sock_no_mmap,
> >       .sendpage          = inet_sendpage,
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 38952aaee3a1..04620e4d64ab 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1782,6 +1782,41 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
> >  }
> >  EXPORT_SYMBOL(__skb_recv_udp);
> >
> > +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> > +               sk_read_actor_t recv_actor)
> > +{
> > +     int copied = 0;
> > +
> > +     while (1) {
> > +             int offset = 0, err;
>
> Should this be
>
>  int offset = sk_peek_offset()?

What are you really suggesting? sk_peek_offset() is just 0 unless we
have MSG_PEEK here, and we don't, because we really want to dequeue the
skb rather than peek at it.

Are you suggesting we should do peeking? I am afraid we can't. Please
be specific; guessing your intent is not an effective way to address
your reviews.

>
> MSG_PEEK should work from recv side, at least it does on TCP side. If
> its handled in some following patch a comment would be nice. I was
> just reading udp_recvmsg() so maybe its not needed.

Please explain why we need peeking in sockmap. At the very least, it
has nothing to do with my patchset.

I do not know why you want to use TCP as a "standard" here: TCP also
supports splice(), while UDP still doesn't, even with ->read_sock().
Of course they are very different.

>
> > +             struct sk_buff *skb;
> > +
> > +             skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> > +             if (!skb)
> > +                     return err;
> > +             if (offset < skb->len) {
> > +                     size_t len;
> > +                     int used;
> > +
> > +                     len = skb->len - offset;
> > +                     used = recv_actor(desc, skb, offset, len);
> > +                     if (used <= 0) {
> > +                             if (!copied)
> > +                                     copied = used;
> > +                             break;
> > +                     } else if (used <= len) {
> > +                             copied += used;
> > +                             offset += used;
>
> The while loop is going to zero this? What are we trying to do
> here with offset?

offset only matters for MSG_PEEK, and we do not support peeking in the
sockmap case, hence it is unnecessary here. I "use" it here just to
make the code as complete as possible.

To further answer your question, it is set to 0 when we return a
valid skb on line 201 inside __skb_try_recv_from_queue(), as
"_off" is set to 0 and won't change unless we have MSG_PEEK.

173         bool peek_at_off = false;
174         struct sk_buff *skb;
175         int _off = 0;
176
177         if (unlikely(flags & MSG_PEEK && *off >= 0)) {
178                 peek_at_off = true;
179                 _off = *off;
180         }
181
182         *last = queue->prev;
183         skb_queue_walk(queue, skb) {
184                 if (flags & MSG_PEEK) {
185                         if (peek_at_off && _off >= skb->len &&
186                             (_off || skb->peeked)) {
187                                 _off -= skb->len;
188                                 continue;
189                         }
190                         if (!skb->len) {
191                                 skb = skb_set_peeked(skb);
192                                 if (IS_ERR(skb)) {
193                                         *err = PTR_ERR(skb);
194                                         return NULL;
195                                 }
196                         }
197                         refcount_inc(&skb->users);
198                 } else {
199                         __skb_unlink(skb, queue);
200                 }
201                 *off = _off;
202                 return skb;

Of course, when we return NULL, we return immediately without
using offset:

1794                 skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
1795                 if (!skb)
1796                         return err;

This should not be hard to figure out. Hope it is clear now.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 12/13] sock_map: update sock type checks for UDP
  2021-03-29 23:10   ` John Fastabend
@ 2021-03-30  5:47     ` Cong Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Cong Wang @ 2021-03-30  5:47 UTC (permalink / raw)
  To: John Fastabend
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

On Mon, Mar 29, 2021 at 4:10 PM John Fastabend <john.fastabend@gmail.com> wrote:
> I think its a bit odd for TCP_ESTABLISHED to work with !tcp, but
> thats not your invention so LGTM.

It has been there for many years, so why is it suddenly a problem with
my patchset? More importantly, why not change it yourself if it looks
odd to you? Please go ahead and do whatever you want; your patches
are always welcome.
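For context, the reason a TCP-named state check means anything for !tcp
sockets at all is that a connected UDP socket also ends up in
TCP_ESTABLISHED: datagram connect() sets sk->sk_state to
TCP_ESTABLISHED on success. A hypothetical helper, only to illustrate
the semantics:

        /* "established" means connect()ed for UDP too, since datagram
         * connect() sets sk->sk_state = TCP_ESTABLISHED on success
         */
        static bool sk_is_established(const struct sock *sk)
        {
                return sk->sk_state == TCP_ESTABLISHED;
        }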

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap
  2021-03-30  5:39     ` Cong Wang
@ 2021-03-30  6:23       ` John Fastabend
  2021-03-30  6:36         ` Cong Wang
  0 siblings, 1 reply; 30+ messages in thread
From: John Fastabend @ 2021-03-30  6:23 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> On Mon, Mar 29, 2021 at 1:54 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > This is similar to tcp_read_sock(), except we do not need
> > > to worry about connections, we just need to retrieve skb
> > > from UDP receive queue.
> > >
> > > Note, the return value of ->read_sock() is unused in
> > > sk_psock_verdict_data_ready().
> > >
> > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > ---

[...]

> > >  }
> > >  EXPORT_SYMBOL(__skb_recv_udp);
> > >
> > > +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> > > +               sk_read_actor_t recv_actor)
> > > +{
> > > +     int copied = 0;
> > > +
> > > +     while (1) {
> > > +             int offset = 0, err;
> >
> > Should this be
> >
> >  int offset = sk_peek_offset()?
> 
> What are you really suggesting? sk_peek_offset() is just 0 unless
> we have MSG_PEEK here, and we don't, because we really want to
> dequeue the skb rather than peek at it.
>
> Are you suggesting we should do peeking? I am afraid we can't.
> Please be specific; guessing your intent is not an effective way to
> address your reviews.

I was only asking for further details because the offset addition
below struck me as odd.

> 
> >
> > MSG_PEEK should work from the recv side; at least it does on the TCP side. If
> > it's handled in some following patch, a comment would be nice. I was
> > just reading udp_recvmsg(), so maybe it's not needed.
> 
> Please explain why we need peeking in sockmap. At the very least,
> it has nothing to do with my patchset.

We need MSG_PEEK to work from the application side. From the sockmap
side I agree it's not needed.

> 
> I do not know why you want to use TCP as a "standard" here, TCP
> also supports splice(); UDP still doesn't, even with ->read_sock().
> Of course, they are very different.

Not claiming any "standard" here, only that a user application needs
to work correctly if it passes MSG_PEEK.

> 
> >
> > > +             struct sk_buff *skb;
> > > +
> > > +             skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> > > +             if (!skb)
> > > +                     return err;
> > > +             if (offset < skb->len) {
> > > +                     size_t len;
> > > +                     int used;
> > > +
> > > +                     len = skb->len - offset;
> > > +                     used = recv_actor(desc, skb, offset, len);
> > > +                     if (used <= 0) {
> > > +                             if (!copied)
> > > +                                     copied = used;
> > > +                             break;
> > > +                     } else if (used <= len) {
> > > +                             copied += used;
> > > +                             offset += used;
> >
> > The while loop is going to zero this? What are we trying to do
> > here with offset?
> 
> offset only matters for MSG_PEEK, and we do not support peeking
> in the sockmap case, so it is unnecessary here. I "use" it here just
> to make the code as complete as possible.

Huh? If it's not used, the addition is just confusing. Can we drop it?

> 
> To further answer your question, it is set to 0 when we return a
> valid skb on line 201 inside __skb_try_recv_from_queue(), as
> "_off" is set to 0 and won't change unless we have MSG_PEEK.
> 
> 173         bool peek_at_off = false;
> 174         struct sk_buff *skb;
> 175         int _off = 0;
> 176
> 177         if (unlikely(flags & MSG_PEEK && *off >= 0)) {
> 178                 peek_at_off = true;
> 179                 _off = *off;
> 180         }
> 181
> 182         *last = queue->prev;
> 183         skb_queue_walk(queue, skb) {
> 184                 if (flags & MSG_PEEK) {
> 185                         if (peek_at_off && _off >= skb->len &&
> 186                             (_off || skb->peeked)) {
> 187                                 _off -= skb->len;
> 188                                 continue;
> 189                         }
> 190                         if (!skb->len) {
> 191                                 skb = skb_set_peeked(skb);
> 192                                 if (IS_ERR(skb)) {
> 193                                         *err = PTR_ERR(skb);
> 194                                         return NULL;
> 195                                 }
> 196                         }
> 197                         refcount_inc(&skb->users);
> 198                 } else {
> 199                         __skb_unlink(skb, queue);
> 200                 }
> 201                 *off = _off;
> 202                 return skb;
> 
> Of course, when we return NULL, we return immediately without
> using offset:
> 
> 1794                 skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> 1795                 if (!skb)
> 1796                         return err;
> 
> This should not be hard to figure out. Hope it is clear now.
> 

Yes, but tracking offset only to clear it a couple lines later
is confusing.

> Thanks.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap
  2021-03-30  6:23       ` John Fastabend
@ 2021-03-30  6:36         ` Cong Wang
  2021-03-30  6:45           ` John Fastabend
  0 siblings, 1 reply; 30+ messages in thread
From: Cong Wang @ 2021-03-30  6:36 UTC (permalink / raw)
  To: John Fastabend
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

On Mon, Mar 29, 2021 at 11:23 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > On Mon, Mar 29, 2021 at 1:54 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Cong Wang wrote:
> > > > From: Cong Wang <cong.wang@bytedance.com>
> > > >
> > > > This is similar to tcp_read_sock(), except we do not need
> > > > to worry about connections, we just need to retrieve skb
> > > > from UDP receive queue.
> > > >
> > > > Note, the return value of ->read_sock() is unused in
> > > > sk_psock_verdict_data_ready().
> > > >
> > > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > > > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > > > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > > ---
>
> [...]
>
> > > >  }
> > > >  EXPORT_SYMBOL(__skb_recv_udp);
> > > >
> > > > +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> > > > +               sk_read_actor_t recv_actor)
> > > > +{
> > > > +     int copied = 0;
> > > > +
> > > > +     while (1) {
> > > > +             int offset = 0, err;
> > >
> > > Should this be
> > >
> > >  int offset = sk_peek_offset()?
> >
> > What are you really suggesting? sk_peek_offset() is just 0 unless
> > we have MSG_PEEK here, and we don't, because we really want to
> > dequeue the skb rather than peek at it.
> >
> > Are you suggesting we should do peeking? I am afraid we can't.
> > Please be specific; guessing your intent is not an effective way to
> > address your reviews.
>
> I was only asking for further details because the offset addition
> below struck me as odd.
>
> >
> > >
> > > MSG_PEEK should work from the recv side; at least it does on the TCP side. If
> > > it's handled in some following patch, a comment would be nice. I was
> > > just reading udp_recvmsg(), so maybe it's not needed.
> >
> > Please explain why we need peeking in sockmap. At the very least,
> > it has nothing to do with my patchset.
>
> We need MSG_PEEK to work from the application side. From the sockmap
> side I agree it's not needed.

How does the application reach udp_read_sock()? As I already mentioned,
UDP does not support splice(), because ->splice_read() is still missing.
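The only caller of ->read_sock() in the sockmap path is the verdict
data-ready callback; roughly, sk_psock_verdict_data_ready() in
net/core/skmsg.c looks like this (simplified sketch, which also shows
why the return value of ->read_sock() is unused):

        static void sk_psock_verdict_data_ready(struct sock *sk)
        {
                struct socket *sock = sk->sk_socket;
                read_descriptor_t desc;

                if (unlikely(!sock || !sock->ops || !sock->ops->read_sock))
                        return;

                desc.arg.data = sk;
                desc.error = 0;
                desc.count = 1;

                /* return value ignored: this is just a data-ready hook */
                sock->ops->read_sock(sk, &desc, sk_psock_verdict_recv);
        }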

>
> >
> > I do not know why you want to use TCP as a "standard" here, TCP
> > also supports splice(); UDP still doesn't, even with ->read_sock().
> > Of course, they are very different.
>
> Not claiming any "standard" here, only that a user application needs
> to work correctly if it passes MSG_PEEK.

I do not see how an application could pass any msg flag to
udp_read_sock().

>
> >
> > >
> > > > +             struct sk_buff *skb;
> > > > +
> > > > +             skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> > > > +             if (!skb)
> > > > +                     return err;
> > > > +             if (offset < skb->len) {
> > > > +                     size_t len;
> > > > +                     int used;
> > > > +
> > > > +                     len = skb->len - offset;
> > > > +                     used = recv_actor(desc, skb, offset, len);
> > > > +                     if (used <= 0) {
> > > > +                             if (!copied)
> > > > +                                     copied = used;
> > > > +                             break;
> > > > +                     } else if (used <= len) {
> > > > +                             copied += used;
> > > > +                             offset += used;
> > >
> > > The while loop is going to zero this? What are we trying to do
> > > here with offset?
> >
> > offset only matters for MSG_PEEK, and we do not support peeking
> > in the sockmap case, so it is unnecessary here. I "use" it here just
> > to make the code as complete as possible.
>
> Huh? If it's not used, the addition is just confusing. Can we drop it?

If you mean dropping this single line of code, yes. If you mean
dropping 'offset' completely, no, as both __skb_recv_udp() and
recv_actor() still need it. If you mean I should re-write
__skb_recv_udp() and recv_actor() just to drop 'offset', I am afraid
that is too much work for too little gain.

>
> >
> > To further answer your question, it is set to 0 when we return a
> > valid skb on line 201 inside __skb_try_recv_from_queue(), as
> > "_off" is set to 0 and won't change unless we have MSG_PEEK.
> >
> > 173         bool peek_at_off = false;
> > 174         struct sk_buff *skb;
> > 175         int _off = 0;
> > 176
> > 177         if (unlikely(flags & MSG_PEEK && *off >= 0)) {
> > 178                 peek_at_off = true;
> > 179                 _off = *off;
> > 180         }
> > 181
> > 182         *last = queue->prev;
> > 183         skb_queue_walk(queue, skb) {
> > 184                 if (flags & MSG_PEEK) {
> > 185                         if (peek_at_off && _off >= skb->len &&
> > 186                             (_off || skb->peeked)) {
> > 187                                 _off -= skb->len;
> > 188                                 continue;
> > 189                         }
> > 190                         if (!skb->len) {
> > 191                                 skb = skb_set_peeked(skb);
> > 192                                 if (IS_ERR(skb)) {
> > 193                                         *err = PTR_ERR(skb);
> > 194                                         return NULL;
> > 195                                 }
> > 196                         }
> > 197                         refcount_inc(&skb->users);
> > 198                 } else {
> > 199                         __skb_unlink(skb, queue);
> > 200                 }
> > 201                 *off = _off;
> > 202                 return skb;
> >
> > Of course, when we return NULL, we return immediately without
> > using offset:
> >
> > 1794                 skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> > 1795                 if (!skb)
> > 1796                         return err;
> >
> > This should not be hard to figure out. Hope it is clear now.
> >
>
> Yes, but tracking offset only to clear it a couple lines later
> is confusing.

Yeah, but that's __skb_recv_udp()'s fault, not mine. We can refactor
__skb_recv_udp() a bit for the !MSG_PEEK case, but I do not see
much gain here.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap
  2021-03-30  6:36         ` Cong Wang
@ 2021-03-30  6:45           ` John Fastabend
  0 siblings, 0 replies; 30+ messages in thread
From: John Fastabend @ 2021-03-30  6:45 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Linux Kernel Network Developers, bpf, duanxiongchun,
	Dongdong Wang, Jiang Wang, Cong Wang, Daniel Borkmann,
	Jakub Sitnicki, Lorenz Bauer

Cong Wang wrote:
> On Mon, Mar 29, 2021 at 11:23 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > On Mon, Mar 29, 2021 at 1:54 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >
> > > > Cong Wang wrote:
> > > > > From: Cong Wang <cong.wang@bytedance.com>
> > > > >
> > > > > This is similar to tcp_read_sock(), except we do not need
> > > > > to worry about connections, we just need to retrieve skb
> > > > > from UDP receive queue.
> > > > >
> > > > > Note, the return value of ->read_sock() is unused in
> > > > > sk_psock_verdict_data_ready().
> > > > >
> > > > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > > > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > > > > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > > > > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > > > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > > > ---
> >
> > [...]
> >
> > > > >  }
> > > > >  EXPORT_SYMBOL(__skb_recv_udp);
> > > > >
> > > > > +int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
> > > > > +               sk_read_actor_t recv_actor)
> > > > > +{
> > > > > +     int copied = 0;
> > > > > +
> > > > > +     while (1) {
> > > > > +             int offset = 0, err;
> > > >
> > > > Should this be
> > > >
> > > >  int offset = sk_peek_offset()?
> > >
> > > What are you really suggesting? sk_peek_offset() is just 0 unless
> > > we have MSG_PEEK here, and we don't, because we really want to
> > > dequeue the skb rather than peek at it.
> > >
> > > Are you suggesting we should do peeking? I am afraid we can't.
> > > Please be specific; guessing your intent is not an effective way to
> > > address your reviews.
> >
> > I was only asking for further details because the offset addition
> > below struck me as odd.
> >
> > >
> > > >
> > > > MSG_PEEK should work from the recv side; at least it does on the TCP side. If
> > > > it's handled in some following patch, a comment would be nice. I was
> > > > just reading udp_recvmsg(), so maybe it's not needed.
> > >
> > > Please explain why we need peeking in sockmap. At the very least,
> > > it has nothing to do with my patchset.
> >
> > We need MSG_PEEK to work from the application side. From the sockmap
> > side I agree it's not needed.
> 
> How does the application reach udp_read_sock()? As I already mentioned,
> UDP does not support splice(), because ->splice_read() is still missing.

It doesn't. All I was trying to say is that if an application calls
recvmsg(..., MSG_PEEK), it should work correctly. It wasn't a
comment about this specific patch.
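Concretely, the invariant is only that plain socket I/O like this keeps
working once the socket is in a sockmap (illustrative userspace
snippet; 'fd' is a hypothetical UDP socket):

        #include <sys/socket.h>

        /* peek at a datagram, then consume it; MSG_PEEK must leave the
         * data queued so the second recv() still sees it
         */
        static ssize_t peek_then_read(int fd, char *buf, size_t len)
        {
                ssize_t n = recv(fd, buf, len, MSG_PEEK);

                if (n < 0)
                        return n;
                return recv(fd, buf, len, 0);
        }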

> 
> >
> > >
> > > I do not know why you want to use TCP as a "standard" here, TCP
> > > also supports splice(); UDP still doesn't, even with ->read_sock().
> > > Of course, they are very different.
> >
> > Not claiming any "standard" here, only that a user application needs
> > to work correctly if it passes MSG_PEEK.
> 
> I do not see how an application could pass any msg flag to
> udp_read_sock().

Agree.

> 
> >
> > >
> > > >
> > > > > +             struct sk_buff *skb;
> > > > > +
> > > > > +             skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> > > > > +             if (!skb)
> > > > > +                     return err;
> > > > > +             if (offset < skb->len) {
> > > > > +                     size_t len;
> > > > > +                     int used;
> > > > > +
> > > > > +                     len = skb->len - offset;
> > > > > +                     used = recv_actor(desc, skb, offset, len);
> > > > > +                     if (used <= 0) {
> > > > > +                             if (!copied)
> > > > > +                                     copied = used;
> > > > > +                             break;
> > > > > +                     } else if (used <= len) {
> > > > > +                             copied += used;
> > > > > +                             offset += used;
> > > >
> > > > The while loop is going to zero this? What are we trying to do
> > > > here with offset?
> > >
> > > offset only matters for MSG_PEEK, and we do not support peeking
> > > in the sockmap case, so it is unnecessary here. I "use" it here just
> > > to make the code as complete as possible.
> >
> > Huh? If it's not used, the addition is just confusing. Can we drop it?
> 
> If you mean dropping this single line of code, yes. If you mean
> dropping 'offset' completely, no, as both __skb_recv_udp() and
> recv_actor() still need it. If you mean I should re-write
> __skb_recv_udp() and recv_actor() just to drop 'offset', I am afraid
> that is too much work for too little gain.

All I'm saying is: drop the single line of code above, this specific
one:

 'offset += used'

and add a comment in the commit msg that just says peeking is not
supported. I think we need at least one more respin of the patches
anyway to address a different small comment, so this should be easy.
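Something like this is the shape I have in mind for the respin
(untested sketch, based only on the hunk quoted above, with the tail
of the loop elided):

        int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
                          sk_read_actor_t recv_actor)
        {
                int copied = 0;

                while (1) {
                        struct sk_buff *skb;
                        int offset = 0, err;

                        /* no MSG_PEEK, so offset is always 0 on return */
                        skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
                        if (!skb)
                                return err;
                        if (offset < skb->len) {
                                size_t len = skb->len - offset;
                                int used = recv_actor(desc, skb, offset, len);

                                if (used <= 0) {
                                        if (!copied)
                                                copied = used;
                                        break;
                                } else if (used <= len) {
                                        copied += used;
                                }
                        }
                        /* ... consume the skb and re-check desc->count
                         * as in the rest of the patch ...
                         */
                }

                return copied;
        }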

> 
> >
> > >
> > > To further answer your question, it is set to 0 when we return a
> > > valid skb on line 201 inside __skb_try_recv_from_queue(), as
> > > "_off" is set to 0 and won't change unless we have MSG_PEEK.
> > >
> > > 173         bool peek_at_off = false;
> > > 174         struct sk_buff *skb;
> > > 175         int _off = 0;
> > > 176
> > > 177         if (unlikely(flags & MSG_PEEK && *off >= 0)) {
> > > 178                 peek_at_off = true;
> > > 179                 _off = *off;
> > > 180         }
> > > 181
> > > 182         *last = queue->prev;
> > > 183         skb_queue_walk(queue, skb) {
> > > 184                 if (flags & MSG_PEEK) {
> > > 185                         if (peek_at_off && _off >= skb->len &&
> > > 186                             (_off || skb->peeked)) {
> > > 187                                 _off -= skb->len;
> > > 188                                 continue;
> > > 189                         }
> > > 190                         if (!skb->len) {
> > > 191                                 skb = skb_set_peeked(skb);
> > > 192                                 if (IS_ERR(skb)) {
> > > 193                                         *err = PTR_ERR(skb);
> > > 194                                         return NULL;
> > > 195                                 }
> > > 196                         }
> > > 197                         refcount_inc(&skb->users);
> > > 198                 } else {
> > > 199                         __skb_unlink(skb, queue);
> > > 200                 }
> > > 201                 *off = _off;
> > > 202                 return skb;
> > >
> > > Of course, when we return NULL, we return immediately without
> > > using offset:
> > >
> > > 1794                 skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
> > > 1795                 if (!skb)
> > > 1796                         return err;
> > >
> > > This should not be hard to figure out. Hope it is clear now.
> > >
> >
> > Yes, but tracking offset only to clear it a couple lines later
> > is confusing.
> 
> Yeah, but that's __skb_recv_udp()'s fault, not mine. We can refactor
> __skb_recv_udp() a bit for the !MSG_PEEK case, but I do not see
> much gain here.

No, don't bother here. I don't see much gain in doing that either. If
you want, do it in another series, not this one.

> 
> Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2021-03-30  6:46 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-28 20:20 [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 01/13] skmsg: lock ingress_skb when purging Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 02/13] skmsg: introduce a spinlock to protect ingress_msg Cong Wang
2021-03-29 19:11   ` John Fastabend
2021-03-28 20:20 ` [Patch bpf-next v7 03/13] net: introduce skb_send_sock() for sock_map Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 04/13] skmsg: avoid lock_sock() in sk_psock_backlog() Cong Wang
2021-03-29 19:41   ` John Fastabend
2021-03-28 20:20 ` [Patch bpf-next v7 05/13] skmsg: use rcu work for destroying psock Cong Wang
2021-03-29 19:42   ` John Fastabend
2021-03-28 20:20 ` [Patch bpf-next v7 06/13] skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg() Cong Wang
2021-03-29 19:44   ` John Fastabend
2021-03-28 20:20 ` [Patch bpf-next v7 07/13] sock_map: introduce BPF_SK_SKB_VERDICT Cong Wang
2021-03-29 20:09   ` John Fastabend
2021-03-30  1:27     ` Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 08/13] sock: introduce sk->sk_prot->psock_update_sk_prot() Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 09/13] udp: implement ->read_sock() for sockmap Cong Wang
2021-03-29 20:54   ` John Fastabend
2021-03-30  5:39     ` Cong Wang
2021-03-30  6:23       ` John Fastabend
2021-03-30  6:36         ` Cong Wang
2021-03-30  6:45           ` John Fastabend
2021-03-28 20:20 ` [Patch bpf-next v7 10/13] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 11/13] udp: implement udp_bpf_recvmsg() for sockmap Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 12/13] sock_map: update sock type checks for UDP Cong Wang
2021-03-29 23:10   ` John Fastabend
2021-03-30  5:47     ` Cong Wang
2021-03-28 20:20 ` [Patch bpf-next v7 13/13] selftests/bpf: add a test case for udp sockmap Cong Wang
2021-03-28 23:27 ` [Patch bpf-next v7 00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Alexei Starovoitov
2021-03-29 15:03   ` John Fastabend
2021-03-29 16:57     ` Cong Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).