* [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG
@ 2024-04-09 20:52 zijianzhang
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: zijianzhang @ 2024-04-09 20:52 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang,
	xiaochun.lu, Zijian Zhang

From: Zijian Zhang <zijianzhang@bytedance.com>

The original notification mechanism requires poll + recvmsg, which is not
easy for applications to accommodate. It also incurs non-negligible
overhead, including extra system calls and optmem usage.

While reusing as much of the existing MSG_ZEROCOPY code as possible, this
patch set introduces a new zerocopy socket send flag, MSG_ZEROCOPY_UARG,
which provides a new notification method. Users of sendmsg pass a control
message as a placeholder for the incoming notifications. Upon return, the
kernel embeds the notifications directly into the user argument passed in.
This significantly reduces the complexity and overhead of managing
notifications. In the ideal pattern, the user keeps calling sendmsg with
the MSG_ZEROCOPY_UARG flag, and notifications are delivered as soon as
possible.
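
For illustration, a minimal userspace sketch of the intended flow,
distilled from the selftest added in patch 3 (fd, buf and len are assumed
to be set up by the caller, and SO_ZEROCOPY enabled on the socket
beforehand):

        struct tx_usr_zcopy_info zc_info = {0};
        void *zc_info_ptr = &zc_info;
        char cbuf[CMSG_SPACE(sizeof(void *))] = {0};
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = cbuf,
                .msg_controllen = CMSG_SPACE(sizeof(void *)),
        };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SO_ZEROCOPY_NOTIFICATION;  /* placeholder for completions */
        cm->cmsg_len = CMSG_LEN(sizeof(void *));
        memcpy(CMSG_DATA(cm), &zc_info_ptr, sizeof(zc_info_ptr));

        if (sendmsg(fd, &msg, MSG_ZEROCOPY_UARG) < 0)
                error(1, errno, "sendmsg");

        /* on return, zc_info.length entries in zc_info.info[] describe
         * completed sends (if any); see patch 1 for the uapi structs
         */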

MSG_ZEROCOPY_UARG does not need to queue skbs into the error queue, so
skbuffs allocated from optmem are not strictly required. In theory, a new
struct carrying the zerocopy information could be defined, along with its
own memory management code. However, the existing generic zerocopy code
assumes that information lives in an skbuff. Given the limited (possibly
zero) performance gain of such a change and the amount of existing code it
would touch, we keep using skbuffs allocated from optmem to carry the
zerocopy information.

* Performance

I extended selftests/net/msg_zerocopy.c to accommodate the new flag. The
results are below; the new flag performs ~7% better for TCP and ~4% better
for UDP.

cfg_notification_limit = 8
+---------------------+---------+---------+---------+---------+
| Test Type / Protocol| TCP v4  | TCP v6  | UDP v4  | UDP v6  |
+---------------------+---------+---------+---------+---------+
| Copy                | 5328    | 5159    | 8581    | 8457    |
+---------------------+---------+---------+---------+---------+
| ZCopy               | 5877    | 5568    | 10314   | 10091   |
+---------------------+---------+---------+---------+---------+
| New ZCopy           | 6254    | 5901    | 10674   | 10293   |
+---------------------+---------+---------+---------+---------+
| ZCopy / Copy        | 110.30% | 107.93% | 120.20% | 119.32% |
+---------------------+---------+---------+---------+---------+
| New ZCopy / Copy    | 117.38% | 114.38% | 124.39% | 121.71% |
+---------------------+---------+---------+---------+---------+


Zijian Zhang (3):
  sock: add MSG_ZEROCOPY_UARG
  selftests: fix OOM problem in msg_zerocopy selftest
  selftests: add msg_zerocopy_uarg test

 include/linux/skbuff.h                      |   7 +-
 include/linux/socket.h                      |   1 +
 include/linux/tcp.h                         |   3 +
 include/linux/udp.h                         |   3 +
 include/net/sock.h                          |  17 +++
 include/net/udp.h                           |   1 +
 include/uapi/asm-generic/socket.h           |   2 +
 include/uapi/linux/socket.h                 |  17 +++
 net/core/skbuff.c                           | 137 ++++++++++++++++--
 net/core/sock.c                             |  50 +++++++
 net/ipv4/ip_output.c                        |   6 +-
 net/ipv4/tcp.c                              |   7 +-
 net/ipv4/udp.c                              |   9 ++
 net/ipv6/ip6_output.c                       |   5 +-
 net/ipv6/udp.c                              |   9 ++
 net/vmw_vsock/virtio_transport_common.c     |   2 +-
 tools/testing/selftests/net/msg_zerocopy.c  | 150 ++++++++++++++++++--
 tools/testing/selftests/net/msg_zerocopy.sh |   1 +
 18 files changed, 397 insertions(+), 30 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 20:52 [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG zijianzhang
@ 2024-04-09 20:52 ` zijianzhang
  2024-04-09 21:18   ` Willem de Bruijn
                     ` (5 more replies)
  2024-04-09 20:52 ` [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest zijianzhang
                   ` (2 subsequent siblings)
  3 siblings, 6 replies; 24+ messages in thread
From: zijianzhang @ 2024-04-09 20:52 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang,
	xiaochun.lu, Zijian Zhang

From: Zijian Zhang <zijianzhang@bytedance.com>

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
However, zerocopy is not a free lunch. Apart from the management of user
pages, the combination of poll + recvmsg needed to receive notifications
incurs non-negligible overhead in applications, which can sometimes exceed
the CPU savings from zerocopy itself. We try to solve this problem with a
new option for TCP and UDP, MSG_ZEROCOPY_UARG.
This new mechanism reduces the notification overhead by embedding the
notifications directly into a user argument passed with each sendmsg
control message. This significantly reduces the complexity and overhead of
managing notifications. In the ideal pattern, the user keeps calling
sendmsg with the MSG_ZEROCOPY_UARG flag, and notifications are delivered
as soon as possible.
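
As a rough illustration of the consumer side (zc_info is the struct
tx_usr_zcopy_info whose address was passed in the control message;
mark_buffers_reusable() is a hypothetical application helper, not part of
this series):

        int i;

        for (i = 0; i < zc_info.length; i++) {
                __u32 lo = zc_info.info[i].lo;
                __u32 hi = zc_info.info[i].hi;

                /* sends with zerocopy ids lo..hi have completed;
                 * zerocopy == 0 means the kernel fell back to copying
                 */
                mark_buffers_reusable(lo, hi, zc_info.info[i].zerocopy);
        }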

Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
---
 include/linux/skbuff.h                  |   7 +-
 include/linux/socket.h                  |   1 +
 include/linux/tcp.h                     |   3 +
 include/linux/udp.h                     |   3 +
 include/net/sock.h                      |  17 +++
 include/net/udp.h                       |   1 +
 include/uapi/asm-generic/socket.h       |   2 +
 include/uapi/linux/socket.h             |  17 +++
 net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
 net/core/sock.c                         |  50 +++++++++
 net/ipv4/ip_output.c                    |   6 +-
 net/ipv4/tcp.c                          |   7 +-
 net/ipv4/udp.c                          |   9 ++
 net/ipv6/ip6_output.c                   |   5 +-
 net/ipv6/udp.c                          |   9 ++
 net/vmw_vsock/virtio_transport_common.c |   2 +-
 16 files changed, 258 insertions(+), 18 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03ea36a82cdd..19b94ba01007 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
 #endif
 
 struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
-				       struct ubuf_info *uarg);
+				       struct ubuf_info *uarg, bool user_args_notification);
 
 void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 
 void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
 			   bool success);
+void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
+			   bool success);
 
 int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 			    struct sk_buff *skb, struct iov_iter *from,
@@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
 static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
 	if (uarg) {
-		if (uarg->callback == msg_zerocopy_callback)
+		if (uarg->callback == msg_zerocopy_callback ||
+			uarg->callback == msg_zerocopy_uarg_callback)
 			msg_zerocopy_put_abort(uarg, have_uref);
 		else if (have_uref)
 			net_zcopy_put(uarg);
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 139c330ccf2c..de01392344e1 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -326,6 +326,7 @@ struct ucred {
 					  * plain text and require encryption
 					  */
 
+#define MSG_ZEROCOPY_UARG	0x2000000 /* MSG_ZEROCOPY with UARG notifications */
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_SPLICE_PAGES 0x8000000	/* Splice the pages from the iterator in sendmsg() */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 55399ee2a57e..e973f4990646 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -501,6 +501,9 @@ struct tcp_sock {
 	 */
 	struct request_sock __rcu *fastopen_rsk;
 	struct saved_syn *saved_syn;
+
+/* TCP MSG_ZEROCOPY_UARG related information */
+	struct tx_msg_zcopy_queue tx_zcopy_queue;
 };
 
 enum tsq_enum {
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 3748e82b627b..502b393eac67 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -95,6 +95,9 @@ struct udp_sock {
 
 	/* Cache friendly copy of sk->sk_peek_off >= 0 */
 	bool		peeking_with_offset;
+
+	/* This field is used by sendmsg zcopy user arg mode notification */
+	struct tx_msg_zcopy_queue tx_zcopy_queue;
 };
 
 #define udp_test_bit(nr, sk)			\
diff --git a/include/net/sock.h b/include/net/sock.h
index 2253eefe2848..f7c045e98213 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -544,6 +544,23 @@ struct sock {
 	netns_tracker		ns_tracker;
 };
 
+struct tx_msg_zcopy_node {
+	struct list_head node;
+	struct tx_msg_zcopy_info info;
+	struct sk_buff *skb;
+};
+
+struct tx_msg_zcopy_queue {
+	struct list_head head;
+	spinlock_t lock; /* protects head queue */
+};
+
+static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
+{
+	spin_lock_init(&q->lock);
+	INIT_LIST_HEAD(&q->head);
+}
+
 enum sk_pacing {
 	SK_PACING_NONE		= 0,
 	SK_PACING_NEEDED	= 1,
diff --git a/include/net/udp.h b/include/net/udp.h
index 488a6d2babcc..9e4d7b128de4 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
 	skb_queue_head_init(&up->reader_queue);
 	up->forward_threshold = sk->sk_rcvbuf >> 2;
 	set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
+	tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
 }
 
 /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 8ce8a39a1e5f..86aa4b5cb7f1 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -135,6 +135,8 @@
 #define SO_PASSPIDFD		76
 #define SO_PEERPIDFD		77
 
+#define SO_ZEROCOPY_NOTIFICATION 78
+
 #if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
index d3fcd3b5ec53..469ed8f4e6c8 100644
--- a/include/uapi/linux/socket.h
+++ b/include/uapi/linux/socket.h
@@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
 #define SOCK_TXREHASH_DISABLED	0
 #define SOCK_TXREHASH_ENABLED	1
 
+/*
+ * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
+ * back to user as soon as possible, 8 should be sufficient.
+ */
+#define SOCK_USR_ZC_INFO_MAX 8
+
+struct tx_msg_zcopy_info {
+	__u32 lo;
+	__u32 hi;
+	__u8 zerocopy;
+};
+
+struct tx_usr_zcopy_info {
+	int length;
+	struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
+};
+
 #endif /* _UAPI_LINUX_SOCKET_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2a5ce6667bbb..d939b2c14d55 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
 }
 EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
 
+static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
+{
+	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
+	uarg->len = 1;
+	uarg->bytelen = size;
+	uarg->zerocopy = 1;
+	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	refcount_set(&uarg->ubuf.refcnt, 1);
+}
+
 static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 {
 	struct ubuf_info_msgzc *uarg;
@@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 	}
 
 	uarg->ubuf.callback = msg_zerocopy_callback;
-	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
-	uarg->len = 1;
-	uarg->bytelen = size;
-	uarg->zerocopy = 1;
-	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
-	refcount_set(&uarg->ubuf.refcnt, 1);
+	init_ubuf_info_msgzc(uarg, sk, size);
+	sock_hold(sk);
+
+	return &uarg->ubuf;
+}
+
+static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
+{
+	struct sk_buff *skb;
+	struct ubuf_info_msgzc *uarg;
+	struct tx_msg_zcopy_node *zcopy_node_p;
+
+	WARN_ON_ONCE(!in_task());
+
+	skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
+	uarg = (void *)skb->cb;
+	uarg->mmp.user = NULL;
+	zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
+
+	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+		kfree_skb(skb);
+		return NULL;
+	}
+
+	INIT_LIST_HEAD(&zcopy_node_p->node);
+	zcopy_node_p->skb = skb;
+	uarg->ubuf.callback = msg_zerocopy_uarg_callback;
+	init_ubuf_info_msgzc(uarg, sk, size);
 	sock_hold(sk);
 
 	return &uarg->ubuf;
@@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
 }
 
 struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
-				       struct ubuf_info *uarg)
+				       struct ubuf_info *uarg, bool usr_arg_notification)
 {
 	if (uarg) {
 		struct ubuf_info_msgzc *uarg_zc;
@@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 		u32 bytelen, next;
 
 		/* there might be non MSG_ZEROCOPY users */
-		if (uarg->callback != msg_zerocopy_callback)
+		if (uarg->callback != msg_zerocopy_callback &&
+		    uarg->callback != msg_zerocopy_uarg_callback)
 			return NULL;
 
 		/* realloc only when socket is locked (TCP, UDP cork),
@@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 	}
 
 new_alloc:
+	if (usr_arg_notification)
+		return msg_zerocopy_uarg_alloc(sk, size);
 	return msg_zerocopy_alloc(sk, size);
 }
 EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
@@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
 }
 EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
 
+static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
+{
+	u32 old_lo, old_hi;
+	u64 sum_len;
+
+	old_lo = node->info.lo;
+	old_hi = node->info.hi;
+	sum_len = old_hi - old_lo + 1ULL + len;
+
+	if (sum_len >= (1ULL << 32))
+		return false;
+
+	if (lo != old_hi + 1)
+		return false;
+
+	node->info.hi += len;
+	return true;
+}
+
+static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
+{
+	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sock *sk = skb->sk;
+	struct tx_msg_zcopy_node *zcopy_node_p, *tail;
+	struct tx_msg_zcopy_queue *zcopy_queue;
+	unsigned long flags;
+	u32 lo, hi;
+	u16 len;
+
+	mm_unaccount_pinned_pages(&uarg->mmp);
+
+	/* if !len, there was only 1 call, and it was aborted
+	 * so do not queue a completion notification
+	 */
+	if (!uarg->len || sock_flag(sk, SOCK_DEAD))
+		goto release;
+
+	/* only support TCP and UDP currently */
+	if (sk_is_tcp(sk)) {
+		zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
+	} else if (sk_is_udp(sk)) {
+		zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
+	} else {
+		pr_warn("MSG_ZEROCOPY_UARG only supports TCP and UDP sockets\n");
+		goto release;
+	}
+
+	len = uarg->len;
+	lo = uarg->id;
+	hi = uarg->id + len - 1;
+
+	zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
+	zcopy_node_p->info.lo = lo;
+	zcopy_node_p->info.hi = hi;
+	zcopy_node_p->info.zerocopy = uarg->zerocopy;
+
+	spin_lock_irqsave(&zcopy_queue->lock, flags);
+	tail = list_empty(&zcopy_queue->head) ? NULL : list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
+	if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
+		list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
+		skb = NULL;
+	}
+	spin_unlock_irqrestore(&zcopy_queue->lock, flags);
+release:
+	consume_skb(skb);
+	sock_put(sk);
+}
+
+void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
+				bool success)
+{
+	struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
+
+	uarg_zc->zerocopy = uarg_zc->zerocopy & success;
+
+	if (refcount_dec_and_test(&uarg->refcnt))
+		__msg_zerocopy_uarg_callback(uarg_zc);
+}
+EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
+
 void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
 	struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
@@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 	uarg_to_msgzc(uarg)->len--;
 
 	if (have_uref)
-		msg_zerocopy_callback(NULL, uarg, true);
+		uarg->callback(NULL, uarg, true);
 }
 EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 5ed411231fc7..a00ebd71f6ed 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 	case SCM_RIGHTS:
 	case SCM_CREDENTIALS:
 		break;
+	case SO_ZEROCOPY_NOTIFICATION:
+		if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			int i = 0;
+			struct tx_usr_zcopy_info sys_zcopy_info;
+			struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
+			struct tx_msg_zcopy_queue *zcopy_queue;
+			struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
+			unsigned long flags;
+
+			if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
+				return -EINVAL;
+
+			if (sk_is_tcp(sk))
+				zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
+			else if (sk_is_udp(sk))
+				zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
+			else
+				return -EINVAL;
+
+			spin_lock_irqsave(&zcopy_queue->lock, flags);
+			list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
+				sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
+				sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
+				sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
+				list_del(&zcopy_node_p->node);
+				zcopy_node_ps[i++] = zcopy_node_p;
+				if (i == SOCK_USR_ZC_INFO_MAX)
+					break;
+			}
+			spin_unlock_irqrestore(&zcopy_queue->lock, flags);
+
+			if (i > 0) {
+				sys_zcopy_info.length = i;
+				if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
+							  &sys_zcopy_info,
+							  sizeof(sys_zcopy_info))
+					)) {
+					spin_lock_irqsave(&zcopy_queue->lock, flags);
+					while (i > 0)
+						list_add(&zcopy_node_ps[--i]->node,
+							 &zcopy_queue->head);
+					spin_unlock_irqrestore(&zcopy_queue->lock, flags);
+					return -EFAULT;
+				}
+
+				while (i > 0)
+					consume_skb(zcopy_node_ps[--i]->skb);
+			}
+		}
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1fe794967211..5adb737c4c01 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1005,7 +1005,7 @@ static int __ip_append_data(struct sock *sk,
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;
 
-	if ((flags & MSG_ZEROCOPY) && length) {
+	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
 		struct msghdr *msg = from;
 
 		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
@@ -1022,7 +1022,9 @@ static int __ip_append_data(struct sock *sk,
 				uarg = msg->msg_ubuf;
 			}
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
-			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			bool user_args = flags & MSG_ZEROCOPY_UARG;
+
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), user_args);
 			if (!uarg)
 				return -ENOBUFS;
 			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e767721b3a58..6254d0eef3af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)
 
 	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
 	sk_sockets_allocated_inc(sk);
+
+	tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
@@ -1050,14 +1052,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 	flags = msg->msg_flags;
 
-	if ((flags & MSG_ZEROCOPY) && size) {
+	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && size) {
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
 			if (sk->sk_route_caps & NETIF_F_SG)
 				zc = MSG_ZEROCOPY;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			bool zc_uarg = flags & MSG_ZEROCOPY_UARG;
 			skb = tcp_write_queue_tail(sk);
-			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), zc_uarg);
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 11460d751e73..6c62aacd74d6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1126,6 +1126,15 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (ipc.opt)
 			free = 1;
 		connected = 0;
+
+		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
+		 * it means this call just wants to get zcopy notifications
+		 * instead of sending packets. It is useful when users
+		 * finish sending and want to get trailing notifications.
+		 */
+		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
+		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
+			return 0;
 	}
 	if (!ipc.opt) {
 		struct ip_options_rcu *inet_opt;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b9dd3a66e423..891526ddd74c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1493,7 +1493,7 @@ static int __ip6_append_data(struct sock *sk,
 	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
 		csummode = CHECKSUM_PARTIAL;
 
-	if ((flags & MSG_ZEROCOPY) && length) {
+	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
 		struct msghdr *msg = from;
 
 		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
@@ -1510,7 +1510,8 @@ static int __ip6_append_data(struct sock *sk,
 				uarg = msg->msg_ubuf;
 			}
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
-			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb),
+						    flags & MSG_ZEROCOPY_UARG);
 			if (!uarg)
 				return -ENOBUFS;
 			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 2e4dc5e6137b..98f6905c5db9 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1490,6 +1490,15 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (!(opt->opt_nflen|opt->opt_flen))
 			opt = NULL;
 		connected = false;
+
+		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
+		 * it means this call just wants to get zcopy notifications
+		 * instead of sending packets. It is useful when users
+		 * finish sending and want to get trailing notifications.
+		 */
+		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
+		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
+			return 0;
 	}
 	if (!opt) {
 		opt = txopt_get(np);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..d6e6830f6ffe 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -84,7 +84,7 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
 
 		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
 					    iter->count,
-					    NULL);
+					    NULL, false);
 		if (!uarg)
 			return -1;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-09 20:52 [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG zijianzhang
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
@ 2024-04-09 20:52 ` zijianzhang
  2024-04-09 21:25   ` Willem de Bruijn
  2024-04-09 20:53 ` [PATCH net-next 3/3] selftests: add msg_zerocopy_uarg test zijianzhang
  2024-04-10  8:46 ` [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG Paolo Abeni
  3 siblings, 1 reply; 24+ messages in thread
From: zijianzhang @ 2024-04-09 20:52 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang,
	xiaochun.lu, Zijian Zhang

From: Zijian Zhang <zijianzhang@bytedance.com>

selftests/net/msg_zerocopy.c has a while loop that keeps calling sendmsg
on a socket and only receives the completion notifications once the socket
is no longer writable. Typically, the receiving starts after around 30+
sendmsg calls.

However, since commit dfa2f0483360
("tcp: get rid of sysctl_tcp_adv_win_scale"), the sender is always writable
and never gets a chance to receive the notifications. The selftest then
always exits with OUT_OF_MEMORY because the memory used by opt_skb exceeds
core.sysctl_optmem_max. We introduce "cfg_notification_limit" to force the
sender to receive notifications after a given number of sendmsg calls. In
addition, our setup updates skb_orphan_frags_rx to match skb_orphan_frags;
in that case notifications no longer arrive in order, so we introduce
"cfg_notification_order_check" to optionally skip the ordering check.
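
Schematically, the tx loop becomes (simplified from the diff below):

        do {
                do_sendmsg(fd, &msg, cfg_zerocopy, domain);
                sendmsg_counter++;

                /* bound optmem usage: drain completions after every
                 * cfg_notification_limit sends, even though the socket
                 * may never become unwritable
                 */
                if (cfg_zerocopy && sendmsg_counter == cfg_notification_limit) {
                        do_recv_completions(fd, domain);
                        sendmsg_counter = 0;
                }
        } while (gettimeofday_ms() < tstop);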

Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
---
 tools/testing/selftests/net/msg_zerocopy.c | 24 ++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index bdc03a2097e8..8e595216a0af 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -85,6 +85,8 @@ static bool cfg_rx;
 static int  cfg_runtime_ms	= 4200;
 static int  cfg_verbose;
 static int  cfg_waittime_ms	= 500;
+static bool cfg_notification_order_check;
+static int  cfg_notification_limit = 32;
 static bool cfg_zerocopy;
 
 static socklen_t cfg_alen;
@@ -435,7 +437,7 @@ static bool do_recv_completion(int fd, int domain)
 	/* Detect notification gaps. These should not happen often, if at all.
 	 * Gaps can occur due to drops, reordering and retransmissions.
 	 */
-	if (lo != next_completion)
+	if (cfg_notification_order_check && lo != next_completion)
 		fprintf(stderr, "gap: %u..%u does not append to %u\n",
 			lo, hi, next_completion);
 	next_completion = hi + 1;
@@ -489,7 +491,7 @@ static void do_tx(int domain, int type, int protocol)
 		struct iphdr iph;
 	} nh;
 	uint64_t tstop;
-	int fd;
+	int fd, sendmsg_counter = 0;
 
 	fd = do_setup_tx(domain, type, protocol);
 
@@ -548,10 +550,18 @@ static void do_tx(int domain, int type, int protocol)
 			do_sendmsg_corked(fd, &msg);
 		else
 			do_sendmsg(fd, &msg, cfg_zerocopy, domain);
+		sendmsg_counter++;
+
+		if (sendmsg_counter == cfg_notification_limit && cfg_zerocopy) {
+			do_recv_completions(fd, domain);
+			sendmsg_counter = 0;
+		}
 
 		while (!do_poll(fd, POLLOUT)) {
-			if (cfg_zerocopy)
+			if (cfg_zerocopy) {
 				do_recv_completions(fd, domain);
+				sendmsg_counter = 0;
+			}
 		}
 
 	} while (gettimeofday_ms() < tstop);
@@ -708,7 +718,7 @@ static void parse_opts(int argc, char **argv)
 
 	cfg_payload_len = max_payload_len;
 
-	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vz")) != -1) {
+	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vzol:")) != -1) {
 		switch (c) {
 		case '4':
 			if (cfg_family != PF_UNSPEC)
@@ -760,6 +770,12 @@ static void parse_opts(int argc, char **argv)
 		case 'z':
 			cfg_zerocopy = true;
 			break;
+		case 'o':
+			cfg_notification_order_check = true;
+			break;
+		case 'l':
+			cfg_notification_limit = strtoul(optarg, NULL, 0);
+			break;
 		}
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 3/3] selftests: add msg_zerocopy_uarg test
  2024-04-09 20:52 [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG zijianzhang
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
  2024-04-09 20:52 ` [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest zijianzhang
@ 2024-04-09 20:53 ` zijianzhang
  2024-04-10  8:46 ` [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG Paolo Abeni
  3 siblings, 0 replies; 24+ messages in thread
From: zijianzhang @ 2024-04-09 20:53 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang,
	xiaochun.lu, Zijian Zhang

From: Zijian Zhang <zijianzhang@bytedance.com>

We update selftests/net/msg_zerocopy.c to accommodate the new flag. The
original selftest only retrieves notifications once the socket is no
longer writable. To allow a fair comparison with the new flag, we
introduce a new config, "cfg_notification_limit", which forces the
application to receive notifications after a given number of sendmsg calls.
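
Since the new mode has no error queue to drain at the end of a run,
trailing notifications are flushed with zero-length sendmsg calls;
roughly (a sketch of what do_new_recv_remaining_completions in the diff
below does):

        msg.msg_iovlen = 0;     /* no data, only the notification cmsg */
        while (completions < expected_completions &&
               gettimeofday_ms() < tstop) {
                memset(&zc_info, 0, sizeof(zc_info));
                if (sendmsg(fd, &msg, MSG_DONTWAIT | MSG_ZEROCOPY_UARG) == -1)
                        break;
                do_recv_completion_user_arg(&zc_info);
        }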

Test results from selftests/net/msg_zerocopy.c with
cfg_notification_limit = 1, an unrealistic setting for MSG_ZEROCOPY that
approximately matches the semantics of MSG_ZEROCOPY_UARG. In this case,
the new flag yields around 15% CPU savings for TCP and 28% for UDP. The
numbers are in MB.
+---------------------+---------+---------+---------+---------+
| Test Type / Protocol| TCP v4  | TCP v6  | UDP v4  | UDP v6  |
+---------------------+---------+---------+---------+---------+
| Copy                | 5517    | 5345    | 9158    | 8767    |
+---------------------+---------+---------+---------+---------+
| ZCopy               | 5588    | 5439    | 8538    | 8169    |
+---------------------+---------+---------+---------+---------+
| New ZCopy           | 6517    | 6103    | 11000   | 10839   |
+---------------------+---------+---------+---------+---------+
| ZCopy / Copy        | 101.29% | 101.76% | 93.23%  | 93.18%  |
+---------------------+---------+---------+---------+---------+
| New ZCopy / Copy    | 118.13% | 114.18% | 120.11% | 123.63% |
+---------------------+---------+---------+---------+---------+

With cfg_notification_limit = 8, meaning less poll + recvmsg overhead for
MSG_ZEROCOPY, the new flag still performs ~7% better for TCP and ~4%
better for UDP. The numbers are in MB.
+---------------------+---------+---------+---------+---------+
| Test Type / Protocol| TCP v4  | TCP v6  | UDP v4  | UDP v6  |
+---------------------+---------+---------+---------+---------+
| Copy                | 5328    | 5159    | 8581    | 8457    |
+---------------------+---------+---------+---------+---------+
| ZCopy               | 5877    | 5568    | 10314   | 10091   |
+---------------------+---------+---------+---------+---------+
| New ZCopy           | 6254    | 5901    | 10674   | 10293   |
+---------------------+---------+---------+---------+---------+
| ZCopy / Copy        | 110.30% | 107.93% | 120.20% | 119.32% |
+---------------------+---------+---------+---------+---------+
| New ZCopy / Copy    | 117.38% | 114.38% | 124.39% | 121.71% |
+---------------------+---------+---------+---------+---------+

Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
---
 tools/testing/selftests/net/msg_zerocopy.c  | 132 ++++++++++++++++++--
 tools/testing/selftests/net/msg_zerocopy.sh |   1 +
 2 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 8e595216a0af..0ca5e8509032 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -1,4 +1,5 @@
-/* Evaluate MSG_ZEROCOPY
+// SPDX-License-Identifier: GPL-2.0
+/* Evaluate MSG_ZEROCOPY && MSG_ZEROCOPY_UARG
  *
  * Send traffic between two processes over one of the supported
  * protocols and modes:
@@ -66,14 +67,29 @@
 #define SO_ZEROCOPY	60
 #endif
 
+#ifndef SO_ZEROCOPY_NOTIFICATION
+#define SO_ZEROCOPY_NOTIFICATION	78
+#endif
+
 #ifndef SO_EE_CODE_ZEROCOPY_COPIED
 #define SO_EE_CODE_ZEROCOPY_COPIED	1
 #endif
 
+#ifndef MSG_ZEROCOPY_UARG
+#define MSG_ZEROCOPY_UARG	0x2000000
+#endif
+
 #ifndef MSG_ZEROCOPY
 #define MSG_ZEROCOPY	0x4000000
 #endif
 
+#ifndef SOCK_USR_ZC_INFO_MAX
+#define SOCK_USR_ZC_INFO_MAX	8
+#endif
+
+#define ZEROCOPY_MSGERR_NOTIFICATION 1
+#define ZEROCOPY_USER_ARG_NOTIFICATION 2
+
 static int  cfg_cork;
 static bool cfg_cork_mixed;
 static int  cfg_cpu		= -1;		/* default: pin to last cpu */
@@ -87,7 +103,7 @@ static int  cfg_verbose;
 static int  cfg_waittime_ms	= 500;
 static bool cfg_notification_order_check;
 static int  cfg_notification_limit = 32;
-static bool cfg_zerocopy;
+static int  cfg_zerocopy;           /* 1 for MSG_ZEROCOPY, 2 for MSG_ZEROCOPY_UARG */
 
 static socklen_t cfg_alen;
 static struct sockaddr_storage cfg_dst_addr;
@@ -169,6 +185,19 @@ static int do_accept(int fd)
 	return fd;
 }
 
+static void add_zcopy_user_arg(struct msghdr *msg, void *usr_addr)
+{
+	struct cmsghdr *cm;
+
+	if (!msg->msg_control)
+		error(1, errno, "NULL user arg");
+	cm = (void *)msg->msg_control;
+	cm->cmsg_len = CMSG_LEN(sizeof(void *));
+	cm->cmsg_level = SOL_SOCKET;
+	cm->cmsg_type = SO_ZEROCOPY_NOTIFICATION;
+	memcpy(CMSG_DATA(cm), &usr_addr, sizeof(usr_addr));
+}
+
 static void add_zcopy_cookie(struct msghdr *msg, uint32_t cookie)
 {
 	struct cmsghdr *cm;
@@ -182,18 +211,55 @@ static void add_zcopy_cookie(struct msghdr *msg, uint32_t cookie)
 	memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));
 }
 
-static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
+static void do_recv_completion_user_arg(void *p)
+{
+	int i;
+	__u32 hi, lo, range;
+	__u8 zerocopy;
+	struct tx_usr_zcopy_info *zc_info_p = (struct tx_usr_zcopy_info *)p;
+
+	for (i = 0; i < zc_info_p->length; ++i) {
+		struct tx_msg_zcopy_info elem = zc_info_p->info[i];
+
+		hi = elem.hi;
+		lo = elem.lo;
+		zerocopy = elem.zerocopy;
+		range = hi - lo + 1;
+
+		if (cfg_notification_order_check && lo != next_completion)
+			fprintf(stderr, "gap: %u..%u does not append to %u\n",
+				lo, hi, next_completion);
+		next_completion = hi + 1;
+
+		if (zerocopied == -1)
+			zerocopied = zerocopy;
+		else if (zerocopied != zerocopy) {
+			fprintf(stderr, "serr: inconsistent\n");
+			zerocopied = zerocopy;
+		}
+
+		completions += range;
+
+		if (cfg_verbose >= 2)
+			fprintf(stderr, "completed: %u (h=%u l=%u)\n",
+				range, hi, lo);
+	}
+}
+
+static bool do_sendmsg(int fd, struct msghdr *msg, int do_zerocopy, int domain)
 {
 	int ret, len, i, flags;
 	static uint32_t cookie;
-	char ckbuf[CMSG_SPACE(sizeof(cookie))];
+	/* ckbuf holds either a uint32_t cookie or a void * pointer */
+	char ckbuf[CMSG_SPACE(sizeof(void *))];
+	struct tx_usr_zcopy_info zc_info;
 
 	len = 0;
 	for (i = 0; i < msg->msg_iovlen; i++)
 		len += msg->msg_iov[i].iov_len;
 
 	flags = MSG_DONTWAIT;
-	if (do_zerocopy) {
+	if (do_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION) {
 		flags |= MSG_ZEROCOPY;
 		if (domain == PF_RDS) {
 			memset(&msg->msg_control, 0, sizeof(msg->msg_control));
@@ -201,6 +267,12 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
 			msg->msg_control = (struct cmsghdr *)ckbuf;
 			add_zcopy_cookie(msg, ++cookie);
 		}
+	} else if (do_zerocopy == ZEROCOPY_USER_ARG_NOTIFICATION) {
+		flags |= MSG_ZEROCOPY_UARG;
+		memset(&zc_info, 0, sizeof(zc_info));
+		msg->msg_controllen = CMSG_SPACE(sizeof(void *));
+		msg->msg_control = (struct cmsghdr *)ckbuf;
+		add_zcopy_user_arg(msg, &zc_info);
 	}
 
 	ret = sendmsg(fd, msg, flags);
@@ -211,13 +283,16 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
 	if (cfg_verbose && ret != len)
 		fprintf(stderr, "send: ret=%u != %u\n", ret, len);
 
+	if (do_zerocopy == ZEROCOPY_USER_ARG_NOTIFICATION)
+		do_recv_completion_user_arg(&zc_info);
+
 	if (len) {
 		packets++;
 		bytes += ret;
 		if (do_zerocopy && ret)
 			expected_completions++;
 	}
-	if (do_zerocopy && domain == PF_RDS) {
+	if (msg->msg_control) {
 		msg->msg_control = NULL;
 		msg->msg_controllen = 0;
 	}
@@ -480,6 +555,36 @@ static void do_recv_remaining_completions(int fd, int domain)
 			completions, expected_completions);
 }
 
+static void do_new_recv_remaining_completions(int fd, struct msghdr *msg)
+{
+	int ret, flags;
+	struct tx_usr_zcopy_info zc_info;
+	int64_t tstop = gettimeofday_ms() + cfg_waittime_ms;
+	char ckbuf[CMSG_SPACE(sizeof(void *))];
+
+	flags = MSG_DONTWAIT | MSG_ZEROCOPY_UARG;
+	msg->msg_iovlen = 0;
+	msg->msg_controllen = CMSG_SPACE(sizeof(void *));
+	msg->msg_control = (struct cmsghdr *)ckbuf;
+	add_zcopy_user_arg(msg, &zc_info);
+
+	while (completions < expected_completions &&
+			gettimeofday_ms() < tstop) {
+		memset(&zc_info, 0, sizeof(zc_info));
+		ret = sendmsg(fd, msg, flags);
+		if (ret == -1 && errno == EAGAIN)
+			return;
+		if (ret == -1)
+			error(1, errno, "send");
+
+		do_recv_completion_user_arg(&zc_info);
+	}
+
+	if (completions < expected_completions)
+		fprintf(stderr, "missing notifications: %lu < %lu\n",
+			completions, expected_completions);
+}
+
 static void do_tx(int domain, int type, int protocol)
 {
 	struct iovec iov[3] = { {0} };
@@ -552,13 +657,14 @@ static void do_tx(int domain, int type, int protocol)
 			do_sendmsg(fd, &msg, cfg_zerocopy, domain);
 		sendmsg_counter++;
 
-		if (sendmsg_counter == cfg_notification_limit && cfg_zerocopy) {
+		if (sendmsg_counter == cfg_notification_limit &&
+			cfg_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION) {
 			do_recv_completions(fd, domain);
 			sendmsg_counter = 0;
 		}
 
 		while (!do_poll(fd, POLLOUT)) {
-			if (cfg_zerocopy) {
+			if (cfg_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION) {
 				do_recv_completions(fd, domain);
 				sendmsg_counter = 0;
 			}
@@ -566,8 +672,10 @@ static void do_tx(int domain, int type, int protocol)
 
 	} while (gettimeofday_ms() < tstop);
 
-	if (cfg_zerocopy)
+	if (cfg_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION)
 		do_recv_remaining_completions(fd, domain);
+	else if (cfg_zerocopy == ZEROCOPY_USER_ARG_NOTIFICATION)
+		do_new_recv_remaining_completions(fd, &msg);
 
 	if (close(fd))
 		error(1, errno, "close");
@@ -718,7 +826,7 @@ static void parse_opts(int argc, char **argv)
 
 	cfg_payload_len = max_payload_len;
 
-	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vzol:")) != -1) {
+	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vzol:n")) != -1) {
 		switch (c) {
 		case '4':
 			if (cfg_family != PF_UNSPEC)
@@ -768,7 +876,7 @@ static void parse_opts(int argc, char **argv)
 			cfg_verbose++;
 			break;
 		case 'z':
-			cfg_zerocopy = true;
+			cfg_zerocopy = ZEROCOPY_MSGERR_NOTIFICATION;
 			break;
 		case 'o':
 			cfg_notification_order_check = true;
@@ -776,6 +884,9 @@ static void parse_opts(int argc, char **argv)
 		case 'l':
 			cfg_notification_limit = strtoul(optarg, NULL, 0);
 			break;
+		case 'n':
+			cfg_zerocopy = ZEROCOPY_USER_ARG_NOTIFICATION;
+			break;
 		}
 	}
 
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh b/tools/testing/selftests/net/msg_zerocopy.sh
index 89c22f5320e0..022a6936d86f 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -118,4 +118,5 @@ do_test() {
 
 do_test "${EXTRA_ARGS}"
 do_test "-z ${EXTRA_ARGS}"
+do_test "-n ${EXTRA_ARGS}"
 echo ok

-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
@ 2024-04-09 21:18   ` Willem de Bruijn
  2024-04-10  0:08     ` [External] " Zijian Zhang
  2024-04-09 21:18   ` Eric Dumazet
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2024-04-09 21:18 UTC (permalink / raw)
  To: zijianzhang, netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang,
	xiaochun.lu, Zijian Zhang

zijianzhang@ wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
> 
> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
> However, zerocopy is not a free lunch. Apart from the management of user
> pages, the combination of poll + recvmsg to receive notifications incurs
> unignorable overhead in the applications. The overhead of such sometimes
> might be more than the CPU savings from zerocopy. We try to solve this
> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
> This new mechanism aims to reduce the overhead associated with receiving
> notifications by embedding them directly into user arguments passed with
> each sendmsg control message. By doing so, we can significantly reduce
> the complexity and overhead for managing notifications. In an ideal
> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
> flag, and the notification will be delivered as soon as possible.
> 
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>

The idea to avoid POLLERR and errqueue receive cost by piggy-backing
onto the send() call is appealing.

I also implemented this when initially adding MSG_ZEROCOPY. At the
time, the modest performance improvement I measured did not warrant
my implementation. But this may be different.

The main reason my code was complex is that this requires changes
to net/socket.c: control data is intended to flow user->kernel on
send and kernel->user on recv, just like data. This feature breaks
that. You'll have to copy the control data back in ____sys_sendmsg.

I don't think it requires this much new infrastructure: a separate
queue in structs tcp_sock and udp_sock, a different uarg callback.

I would suggest an optimistic algorithm, whereby the send call
steals a notification skb from the error queue if present. This
could be a new mode set with a value 2 in SO_ZEROCOPY.
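
For illustration only, such a mode could be selected from userspace
roughly like this (the value 2 and its semantics are purely a suggestion
at this point, not an existing API):

        int mode = 2;   /* hypothetical: 1 = errqueue notifications, 2 = uarg mode */

        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &mode, sizeof(mode)))
                error(1, errno, "setsockopt SO_ZEROCOPY");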

> ---
>  include/linux/skbuff.h                  |   7 +-
>  include/linux/socket.h                  |   1 +
>  include/linux/tcp.h                     |   3 +
>  include/linux/udp.h                     |   3 +
>  include/net/sock.h                      |  17 +++
>  include/net/udp.h                       |   1 +
>  include/uapi/asm-generic/socket.h       |   2 +
>  include/uapi/linux/socket.h             |  17 +++
>  net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
>  net/core/sock.c                         |  50 +++++++++
>  net/ipv4/ip_output.c                    |   6 +-
>  net/ipv4/tcp.c                          |   7 +-
>  net/ipv4/udp.c                          |   9 ++
>  net/ipv6/ip6_output.c                   |   5 +-
>  net/ipv6/udp.c                          |   9 ++
>  net/vmw_vsock/virtio_transport_common.c |   2 +-
>  16 files changed, 258 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 03ea36a82cdd..19b94ba01007 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
>  #endif
>  
>  struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> -				       struct ubuf_info *uarg);
> +				       struct ubuf_info *uarg, bool user_args_notification);
>  
>  void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
>  
>  void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>  			   bool success);
> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> +			   bool success);
>  
>  int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>  			    struct sk_buff *skb, struct iov_iter *from,
> @@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
>  static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>  {
>  	if (uarg) {
> -		if (uarg->callback == msg_zerocopy_callback)
> +		if (uarg->callback == msg_zerocopy_callback ||
> +			uarg->callback == msg_zerocopy_uarg_callback)
>  			msg_zerocopy_put_abort(uarg, have_uref);
>  		else if (have_uref)
>  			net_zcopy_put(uarg);
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 139c330ccf2c..de01392344e1 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -326,6 +326,7 @@ struct ucred {
>  					  * plain text and require encryption
>  					  */
>  
> +#define MSG_ZEROCOPY_UARG	0x2000000 /* MSG_ZEROCOPY with UARG notifications */
>  #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
>  #define MSG_SPLICE_PAGES 0x8000000	/* Splice the pages from the iterator in sendmsg() */
>  #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 55399ee2a57e..e973f4990646 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -501,6 +501,9 @@ struct tcp_sock {
>  	 */
>  	struct request_sock __rcu *fastopen_rsk;
>  	struct saved_syn *saved_syn;
> +
> +/* TCP MSG_ZEROCOPY_UARG related information */
> +	struct tx_msg_zcopy_queue tx_zcopy_queue;
>  };
>  
>  enum tsq_enum {
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 3748e82b627b..502b393eac67 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -95,6 +95,9 @@ struct udp_sock {
>  
>  	/* Cache friendly copy of sk->sk_peek_off >= 0 */
>  	bool		peeking_with_offset;
> +
> +	/* This field is used by sendmsg zcopy user arg mode notification */
> +	struct tx_msg_zcopy_queue tx_zcopy_queue;
>  };
>  
>  #define udp_test_bit(nr, sk)			\
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2253eefe2848..f7c045e98213 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -544,6 +544,23 @@ struct sock {
>  	netns_tracker		ns_tracker;
>  };
>  
> +struct tx_msg_zcopy_node {
> +	struct list_head node;
> +	struct tx_msg_zcopy_info info;
> +	struct sk_buff *skb;
> +};
> +
> +struct tx_msg_zcopy_queue {
> +	struct list_head head;
> +	spinlock_t lock; /* protects head queue */
> +};
> +
> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
> +{
> +	spin_lock_init(&q->lock);
> +	INIT_LIST_HEAD(&q->head);
> +}
> +
>  enum sk_pacing {
>  	SK_PACING_NONE		= 0,
>  	SK_PACING_NEEDED	= 1,
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 488a6d2babcc..9e4d7b128de4 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
>  	skb_queue_head_init(&up->reader_queue);
>  	up->forward_threshold = sk->sk_rcvbuf >> 2;
>  	set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> +	tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
>  }
>  
>  /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
> index 8ce8a39a1e5f..86aa4b5cb7f1 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -135,6 +135,8 @@
>  #define SO_PASSPIDFD		76
>  #define SO_PEERPIDFD		77
>  
> +#define SO_ZEROCOPY_NOTIFICATION 78
> +
>  #if !defined(__KERNEL__)
>  
>  #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
> index d3fcd3b5ec53..469ed8f4e6c8 100644
> --- a/include/uapi/linux/socket.h
> +++ b/include/uapi/linux/socket.h
> @@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
>  #define SOCK_TXREHASH_DISABLED	0
>  #define SOCK_TXREHASH_ENABLED	1
>  
> +/*
> + * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
> + * back to user as soon as possible, 8 should be sufficient.
> + */
> +#define SOCK_USR_ZC_INFO_MAX 8
> +
> +struct tx_msg_zcopy_info {
> +	__u32 lo;
> +	__u32 hi;
> +	__u8 zerocopy;
> +};
> +
> +struct tx_usr_zcopy_info {
> +	int length;
> +	struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
> +};
> +
>  #endif /* _UAPI_LINUX_SOCKET_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2a5ce6667bbb..d939b2c14d55 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
>  }
>  EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
>  
> +static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
> +{
> +	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
> +	uarg->len = 1;
> +	uarg->bytelen = size;
> +	uarg->zerocopy = 1;
> +	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> +	refcount_set(&uarg->ubuf.refcnt, 1);
> +}
> +
>  static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>  {
>  	struct ubuf_info_msgzc *uarg;
> @@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>  	}
>  
>  	uarg->ubuf.callback = msg_zerocopy_callback;
> -	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
> -	uarg->len = 1;
> -	uarg->bytelen = size;
> -	uarg->zerocopy = 1;
> -	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> -	refcount_set(&uarg->ubuf.refcnt, 1);
> +	init_ubuf_info_msgzc(uarg, sk, size);
> +	sock_hold(sk);
> +
> +	return &uarg->ubuf;
> +}
> +
> +static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
> +{
> +	struct sk_buff *skb;
> +	struct ubuf_info_msgzc *uarg;
> +	struct tx_msg_zcopy_node *zcopy_node_p;
> +
> +	WARN_ON_ONCE(!in_task());
> +
> +	skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
> +	if (!skb)
> +		return NULL;
> +
> +	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
> +	uarg = (void *)skb->cb;
> +	uarg->mmp.user = NULL;
> +	zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
> +
> +	if (mm_account_pinned_pages(&uarg->mmp, size)) {
> +		kfree_skb(skb);
> +		return NULL;
> +	}
> +
> +	INIT_LIST_HEAD(&zcopy_node_p->node);
> +	zcopy_node_p->skb = skb;
> +	uarg->ubuf.callback = msg_zerocopy_uarg_callback;
> +	init_ubuf_info_msgzc(uarg, sk, size);
>  	sock_hold(sk);
>  
>  	return &uarg->ubuf;
> @@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
>  }
>  
>  struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> -				       struct ubuf_info *uarg)
> +				       struct ubuf_info *uarg, bool usr_arg_notification)
>  {
>  	if (uarg) {
>  		struct ubuf_info_msgzc *uarg_zc;
> @@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>  		u32 bytelen, next;
>  
>  		/* there might be non MSG_ZEROCOPY users */
> -		if (uarg->callback != msg_zerocopy_callback)
> +		if (uarg->callback != msg_zerocopy_callback &&
> +		    uarg->callback != msg_zerocopy_uarg_callback)
>  			return NULL;
>  
>  		/* realloc only when socket is locked (TCP, UDP cork),
> @@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>  	}
>  
>  new_alloc:
> +	if (usr_arg_notification)
> +		return msg_zerocopy_uarg_alloc(sk, size);
>  	return msg_zerocopy_alloc(sk, size);
>  }
>  EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
> @@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>  }
>  EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
>  
> +static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
> +{
> +	u32 old_lo, old_hi;
> +	u64 sum_len;
> +
> +	old_lo = node->info.lo;
> +	old_hi = node->info.hi;
> +	sum_len = old_hi - old_lo + 1ULL + len;
> +
> +	if (sum_len >= (1ULL << 32))
> +		return false;
> +
> +	if (lo != old_hi + 1)
> +		return false;
> +
> +	node->info.hi += len;
> +	return true;
> +}
> +
> +static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
> +{
> +	struct sk_buff *skb = skb_from_uarg(uarg);
> +	struct sock *sk = skb->sk;
> +	struct tx_msg_zcopy_node *zcopy_node_p, *tail;
> +	struct tx_msg_zcopy_queue *zcopy_queue;
> +	unsigned long flags;
> +	u32 lo, hi;
> +	u16 len;
> +
> +	mm_unaccount_pinned_pages(&uarg->mmp);
> +
> +	/* if !len, there was only 1 call, and it was aborted
> +	 * so do not queue a completion notification
> +	 */
> +	if (!uarg->len || sock_flag(sk, SOCK_DEAD))
> +		goto release;
> +
> +	/* only support TCP and UCP currently */
> +	if (sk_is_tcp(sk)) {
> +		zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
> +	} else if (sk_is_udp(sk)) {
> +		zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
> +	} else {
> +		pr_warn("MSG_ZEROCOPY_UARG only support TCP && UDP sockets");
> +		goto release;
> +	}
> +
> +	len = uarg->len;
> +	lo = uarg->id;
> +	hi = uarg->id + len - 1;
> +
> +	zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
> +	zcopy_node_p->info.lo = lo;
> +	zcopy_node_p->info.hi = hi;
> +	zcopy_node_p->info.zerocopy = uarg->zerocopy;
> +
> +	spin_lock_irqsave(&zcopy_queue->lock, flags);
> +	tail = list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
> +	if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
> +		list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
> +		skb = NULL;
> +	}
> +	spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> +release:
> +	consume_skb(skb);
> +	sock_put(sk);
> +}
> +
> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> +				bool success)
> +{
> +	struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
> +
> +	uarg_zc->zerocopy = uarg_zc->zerocopy & success;
> +
> +	if (refcount_dec_and_test(&uarg->refcnt))
> +		__msg_zerocopy_uarg_callback(uarg_zc);
> +}
> +EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
> +
>  void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>  {
>  	struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
> @@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>  	uarg_to_msgzc(uarg)->len--;
>  
>  	if (have_uref)
> -		msg_zerocopy_callback(NULL, uarg, true);
> +		uarg->callback(NULL, uarg, true);
>  }
>  EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 5ed411231fc7..a00ebd71f6ed 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>  	case SCM_RIGHTS:
>  	case SCM_CREDENTIALS:
>  		break;
> +	case SO_ZEROCOPY_NOTIFICATION:
> +		if (sock_flag(sk, SOCK_ZEROCOPY)) {
> +			int i = 0;
> +			struct tx_usr_zcopy_info sys_zcopy_info;
> +			struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
> +			struct tx_msg_zcopy_queue *zcopy_queue;
> +			struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
> +			unsigned long flags;
> +
> +			if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
> +				return -EINVAL;
> +
> +			if (sk_is_tcp(sk))
> +				zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
> +			else if (sk_is_udp(sk))
> +				zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
> +			else
> +				return -EINVAL;
> +
> +			spin_lock_irqsave(&zcopy_queue->lock, flags);
> +			list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
> +				sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
> +				sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
> +				sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
> +				list_del(&zcopy_node_p->node);
> +				zcopy_node_ps[i++] = zcopy_node_p;
> +				if (i == SOCK_USR_ZC_INFO_MAX)
> +					break;
> +			}
> +			spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> +
> +			if (i > 0) {
> +				sys_zcopy_info.length = i;
> +				if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
> +							  &sys_zcopy_info,
> +							  sizeof(sys_zcopy_info))
> +					)) {
> +					spin_lock_irqsave(&zcopy_queue->lock, flags);
> +					while (i > 0)
> +						list_add(&zcopy_node_ps[--i]->node,
> +							 &zcopy_queue->head);
> +					spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> +					return -EFAULT;
> +				}
> +
> +				while (i > 0)
> +					consume_skb(zcopy_node_ps[--i]->skb);
> +			}
> +		}
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 1fe794967211..5adb737c4c01 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1005,7 +1005,7 @@ static int __ip_append_data(struct sock *sk,
>  	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
>  		csummode = CHECKSUM_PARTIAL;
>  
> -	if ((flags & MSG_ZEROCOPY) && length) {
> +	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>  		struct msghdr *msg = from;
>  
>  		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
> @@ -1022,7 +1022,9 @@ static int __ip_append_data(struct sock *sk,
>  				uarg = msg->msg_ubuf;
>  			}
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> -			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
> +			bool user_args = flags & MSG_ZEROCOPY_UARG;
> +
> +			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), user_args);
>  			if (!uarg)
>  				return -ENOBUFS;
>  			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e767721b3a58..6254d0eef3af 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)
>  
>  	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>  	sk_sockets_allocated_inc(sk);
> +
> +	tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
>  }
>  EXPORT_SYMBOL(tcp_init_sock);
>  
> @@ -1050,14 +1052,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  
>  	flags = msg->msg_flags;
>  
> -	if ((flags & MSG_ZEROCOPY) && size) {
> +	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && size) {
>  		if (msg->msg_ubuf) {
>  			uarg = msg->msg_ubuf;
>  			if (sk->sk_route_caps & NETIF_F_SG)
>  				zc = MSG_ZEROCOPY;
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> +			bool zc_uarg = flags & MSG_ZEROCOPY_UARG;
>  			skb = tcp_write_queue_tail(sk);
> -			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
> +			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), zc_uarg);
>  			if (!uarg) {
>  				err = -ENOBUFS;
>  				goto out_err;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 11460d751e73..6c62aacd74d6 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1126,6 +1126,15 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		if (ipc.opt)
>  			free = 1;
>  		connected = 0;
> +
> +		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
> +		 * it means this call just wants to get zcopy notifications
> +		 * instead of sending packets. It is useful when users
> +		 * finish sending and want to get trailing notifications.
> +		 */
> +		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
> +		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
> +			return 0;
>  	}
>  	if (!ipc.opt) {
>  		struct ip_options_rcu *inet_opt;
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index b9dd3a66e423..891526ddd74c 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1493,7 +1493,7 @@ static int __ip6_append_data(struct sock *sk,
>  	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
>  		csummode = CHECKSUM_PARTIAL;
>  
> -	if ((flags & MSG_ZEROCOPY) && length) {
> +	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>  		struct msghdr *msg = from;
>  
>  		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
> @@ -1510,7 +1510,8 @@ static int __ip6_append_data(struct sock *sk,
>  				uarg = msg->msg_ubuf;
>  			}
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> -			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
> +			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb),
> +						    flags & MSG_ZEROCOPY_UARG);
>  			if (!uarg)
>  				return -ENOBUFS;
>  			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2e4dc5e6137b..98f6905c5db9 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1490,6 +1490,15 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		if (!(opt->opt_nflen|opt->opt_flen))
>  			opt = NULL;
>  		connected = false;
> +
> +		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
> +		 * it means this call just wants to get zcopy notifications
> +		 * instead of sending packets. It is useful when users
> +		 * finish sending and want to get trailing notifications.
> +		 */
> +		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
> +		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
> +			return 0;
>  	}
>  	if (!opt) {
>  		opt = txopt_get(np);
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 16ff976a86e3..d6e6830f6ffe 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -84,7 +84,7 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>  
>  		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>  					    iter->count,
> -					    NULL);
> +					    NULL, false);
>  		if (!uarg)
>  			return -1;
>  
> -- 
> 2.20.1
> 




* Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
  2024-04-09 21:18   ` Willem de Bruijn
@ 2024-04-09 21:18   ` Eric Dumazet
  2024-04-10 17:01     ` [External] " Zijian Zhang
  2024-04-09 21:23   ` Eric Dumazet
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-04-09 21:18 UTC (permalink / raw)
  To: zijianzhang
  Cc: netdev, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu

On Tue, Apr 9, 2024 at 10:53 PM <zijianzhang@bytedance.com> wrote:
>
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
> However, zerocopy is not a free lunch. Apart from the management of user
> pages, the combination of poll + recvmsg to receive notifications incurs
> non-negligible overhead in the applications. That overhead can sometimes
> outweigh the CPU savings from zerocopy. We try to solve this
> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
> This new mechanism aims to reduce the overhead associated with receiving
> notifications by embedding them directly into user arguments passed with
> each sendmsg control message. By doing so, we can significantly reduce
> the complexity and overhead for managing notifications. In an ideal
> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
> flag, and the notification will be delivered as soon as possible.
>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
> ---
>  include/linux/skbuff.h                  |   7 +-
>  include/linux/socket.h                  |   1 +
>  include/linux/tcp.h                     |   3 +
>  include/linux/udp.h                     |   3 +
>  include/net/sock.h                      |  17 +++
>  include/net/udp.h                       |   1 +
>  include/uapi/asm-generic/socket.h       |   2 +
>  include/uapi/linux/socket.h             |  17 +++
>  net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
>  net/core/sock.c                         |  50 +++++++++
>  net/ipv4/ip_output.c                    |   6 +-
>  net/ipv4/tcp.c                          |   7 +-
>  net/ipv4/udp.c                          |   9 ++
>  net/ipv6/ip6_output.c                   |   5 +-
>  net/ipv6/udp.c                          |   9 ++
>  net/vmw_vsock/virtio_transport_common.c |   2 +-
>  16 files changed, 258 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 03ea36a82cdd..19b94ba01007 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
>  #endif
>
>  struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> -                                      struct ubuf_info *uarg);
> +                                      struct ubuf_info *uarg, bool user_args_notification);
>
>  void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
>
>  void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>                            bool success);
> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> +                          bool success);
>
>  int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>                             struct sk_buff *skb, struct iov_iter *from,
> @@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
>  static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>  {
>         if (uarg) {
> -               if (uarg->callback == msg_zerocopy_callback)
> +               if (uarg->callback == msg_zerocopy_callback ||
> +                       uarg->callback == msg_zerocopy_uarg_callback)
>                         msg_zerocopy_put_abort(uarg, have_uref);
>                 else if (have_uref)
>                         net_zcopy_put(uarg);
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 139c330ccf2c..de01392344e1 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -326,6 +326,7 @@ struct ucred {
>                                           * plain text and require encryption
>                                           */
>
> +#define MSG_ZEROCOPY_UARG      0x2000000 /* MSG_ZEROCOPY with UARG notifications */
>  #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
>  #define MSG_SPLICE_PAGES 0x8000000     /* Splice the pages from the iterator in sendmsg() */
>  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 55399ee2a57e..e973f4990646 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -501,6 +501,9 @@ struct tcp_sock {
>          */
>         struct request_sock __rcu *fastopen_rsk;
>         struct saved_syn *saved_syn;
> +
> +/* TCP MSG_ZEROCOPY_UARG related information */
> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
>  };
>
>  enum tsq_enum {
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 3748e82b627b..502b393eac67 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -95,6 +95,9 @@ struct udp_sock {
>
>         /* Cache friendly copy of sk->sk_peek_off >= 0 */
>         bool            peeking_with_offset;
> +
> +       /* This field is used by sendmsg zcopy user arg mode notification */
> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
>  };
>
>  #define udp_test_bit(nr, sk)                   \
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2253eefe2848..f7c045e98213 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -544,6 +544,23 @@ struct sock {
>         netns_tracker           ns_tracker;
>  };
>
> +struct tx_msg_zcopy_node {
> +       struct list_head node;
> +       struct tx_msg_zcopy_info info;
> +       struct sk_buff *skb;
> +};
> +
> +struct tx_msg_zcopy_queue {
> +       struct list_head head;
> +       spinlock_t lock; /* protects head queue */
> +};
> +
> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
> +{
> +       spin_lock_init(&q->lock);
> +       INIT_LIST_HEAD(&q->head);
> +}
> +
>  enum sk_pacing {
>         SK_PACING_NONE          = 0,
>         SK_PACING_NEEDED        = 1,
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 488a6d2babcc..9e4d7b128de4 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
>         skb_queue_head_init(&up->reader_queue);
>         up->forward_threshold = sk->sk_rcvbuf >> 2;
>         set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> +       tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
>  }
>
>  /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
> index 8ce8a39a1e5f..86aa4b5cb7f1 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -135,6 +135,8 @@
>  #define SO_PASSPIDFD           76
>  #define SO_PEERPIDFD           77
>
> +#define SO_ZEROCOPY_NOTIFICATION 78
> +
>  #if !defined(__KERNEL__)
>
>  #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
> index d3fcd3b5ec53..469ed8f4e6c8 100644
> --- a/include/uapi/linux/socket.h
> +++ b/include/uapi/linux/socket.h
> @@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
>  #define SOCK_TXREHASH_DISABLED 0
>  #define SOCK_TXREHASH_ENABLED  1
>
> +/*
> + * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
> + * back to user as soon as possible, 8 should be sufficient.
> + */
> +#define SOCK_USR_ZC_INFO_MAX 8
> +
> +struct tx_msg_zcopy_info {
> +       __u32 lo;
> +       __u32 hi;
> +       __u8 zerocopy;
> +};
> +
> +struct tx_usr_zcopy_info {
> +       int length;
> +       struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
> +};
> +
>  #endif /* _UAPI_LINUX_SOCKET_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2a5ce6667bbb..d939b2c14d55 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
>  }
>  EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
>
> +static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
> +{
> +       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
> +       uarg->len = 1;
> +       uarg->bytelen = size;
> +       uarg->zerocopy = 1;
> +       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> +       refcount_set(&uarg->ubuf.refcnt, 1);
> +}
> +
>  static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>  {
>         struct ubuf_info_msgzc *uarg;
> @@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>         }
>
>         uarg->ubuf.callback = msg_zerocopy_callback;
> -       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
> -       uarg->len = 1;
> -       uarg->bytelen = size;
> -       uarg->zerocopy = 1;
> -       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> -       refcount_set(&uarg->ubuf.refcnt, 1);
> +       init_ubuf_info_msgzc(uarg, sk, size);
> +       sock_hold(sk);
> +
> +       return &uarg->ubuf;
> +}
> +
> +static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
> +{
> +       struct sk_buff *skb;
> +       struct ubuf_info_msgzc *uarg;
> +       struct tx_msg_zcopy_node *zcopy_node_p;
> +
> +       WARN_ON_ONCE(!in_task());
> +
> +       skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
> +       if (!skb)
> +               return NULL;
> +
> +       BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
> +       uarg = (void *)skb->cb;
> +       uarg->mmp.user = NULL;
> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
> +
> +       if (mm_account_pinned_pages(&uarg->mmp, size)) {
> +               kfree_skb(skb);
> +               return NULL;
> +       }
> +
> +       INIT_LIST_HEAD(&zcopy_node_p->node);
> +       zcopy_node_p->skb = skb;
> +       uarg->ubuf.callback = msg_zerocopy_uarg_callback;
> +       init_ubuf_info_msgzc(uarg, sk, size);
>         sock_hold(sk);
>
>         return &uarg->ubuf;
> @@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
>  }
>
>  struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> -                                      struct ubuf_info *uarg)
> +                                      struct ubuf_info *uarg, bool usr_arg_notification)
>  {
>         if (uarg) {
>                 struct ubuf_info_msgzc *uarg_zc;
> @@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>                 u32 bytelen, next;
>
>                 /* there might be non MSG_ZEROCOPY users */
> -               if (uarg->callback != msg_zerocopy_callback)
> +               if (uarg->callback != msg_zerocopy_callback &&
> +                   uarg->callback != msg_zerocopy_uarg_callback)
>                         return NULL;
>
>                 /* realloc only when socket is locked (TCP, UDP cork),
> @@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>         }
>
>  new_alloc:
> +       if (usr_arg_notification)
> +               return msg_zerocopy_uarg_alloc(sk, size);
>         return msg_zerocopy_alloc(sk, size);
>  }
>  EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
> @@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>  }
>  EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
>
> +static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
> +{
> +       u32 old_lo, old_hi;
> +       u64 sum_len;
> +
> +       old_lo = node->info.lo;
> +       old_hi = node->info.hi;
> +       sum_len = old_hi - old_lo + 1ULL + len;
> +
> +       if (sum_len >= (1ULL << 32))
> +               return false;
> +
> +       if (lo != old_hi + 1)
> +               return false;
> +
> +       node->info.hi += len;
> +       return true;
> +}
> +
> +static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
> +{
> +       struct sk_buff *skb = skb_from_uarg(uarg);
> +       struct sock *sk = skb->sk;
> +       struct tx_msg_zcopy_node *zcopy_node_p, *tail;
> +       struct tx_msg_zcopy_queue *zcopy_queue;
> +       unsigned long flags;
> +       u32 lo, hi;
> +       u16 len;
> +
> +       mm_unaccount_pinned_pages(&uarg->mmp);
> +
> +       /* if !len, there was only 1 call, and it was aborted
> +        * so do not queue a completion notification
> +        */
> +       if (!uarg->len || sock_flag(sk, SOCK_DEAD))
> +               goto release;
> +
> +       /* only support TCP and UDP currently */
> +       if (sk_is_tcp(sk)) {
> +               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
> +       } else if (sk_is_udp(sk)) {
> +               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
> +       } else {
> +               pr_warn("MSG_ZEROCOPY_UARG only support TCP && UDP sockets");
> +               goto release;
> +       }
> +
> +       len = uarg->len;
> +       lo = uarg->id;
> +       hi = uarg->id + len - 1;
> +
> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
> +       zcopy_node_p->info.lo = lo;
> +       zcopy_node_p->info.hi = hi;
> +       zcopy_node_p->info.zerocopy = uarg->zerocopy;
> +
> +       spin_lock_irqsave(&zcopy_queue->lock, flags);
> +       tail = list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
> +       if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
> +               list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
> +               skb = NULL;
> +       }
> +       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> +release:
> +       consume_skb(skb);
> +       sock_put(sk);
> +}
> +
> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> +                               bool success)
> +{
> +       struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
> +
> +       uarg_zc->zerocopy = uarg_zc->zerocopy & success;
> +
> +       if (refcount_dec_and_test(&uarg->refcnt))
> +               __msg_zerocopy_uarg_callback(uarg_zc);
> +}
> +EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
> +
>  void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>  {
>         struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
> @@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>         uarg_to_msgzc(uarg)->len--;
>
>         if (have_uref)
> -               msg_zerocopy_callback(NULL, uarg, true);
> +               uarg->callback(NULL, uarg, true);
>  }
>  EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 5ed411231fc7..a00ebd71f6ed 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>         case SCM_RIGHTS:
>         case SCM_CREDENTIALS:
>                 break;
> +       case SO_ZEROCOPY_NOTIFICATION:
> +               if (sock_flag(sk, SOCK_ZEROCOPY)) {
> +                       int i = 0;
> +                       struct tx_usr_zcopy_info sys_zcopy_info;
> +                       struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
> +                       struct tx_msg_zcopy_queue *zcopy_queue;
> +                       struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
> +                       unsigned long flags;
> +
> +                       if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
> +                               return -EINVAL;
> +
> +                       if (sk_is_tcp(sk))
> +                               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
> +                       else if (sk_is_udp(sk))
> +                               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
> +                       else
> +                               return -EINVAL;
> +
> +                       spin_lock_irqsave(&zcopy_queue->lock, flags);
> +                       list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
> +                               sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
> +                               sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
> +                               sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
> +                               list_del(&zcopy_node_p->node);
> +                               zcopy_node_ps[i++] = zcopy_node_p;
> +                               if (i == SOCK_USR_ZC_INFO_MAX)
> +                                       break;
> +                       }
> +                       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> +
> +                       if (i > 0) {
> +                               sys_zcopy_info.length = i;
> +                               if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),

This is going to break if user space is 32-bit and the kernel is 64-bit?

Also, SOCK_USR_ZC_INFO_MAX is set in stone; it cannot be changed in the future.

> +                                                         &sys_zcopy_info,
> +                                                         sizeof(sys_zcopy_info))

You are leaking part of the kernel stack to user space if
(i < SOCK_USR_ZC_INFO_MAX).
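
For illustration, one way to address both points (a sketch only, not code
from the patch: it assumes the uapi is changed so userspace passes the
destination address as a fixed-width __u64, and it zeroes the stack struct
so a partial fill never exposes stack bytes):

	struct tx_usr_zcopy_info sys_zcopy_info;
	void __user *usr_addr;
	__u64 addr64;

	/* fixed-width field keeps 32-bit userspace on a 64-bit kernel working */
	if (cmsg->cmsg_len != CMSG_LEN(sizeof(__u64)))
		return -EINVAL;
	memcpy(&addr64, CMSG_DATA(cmsg), sizeof(addr64));
	usr_addr = u64_to_user_ptr(addr64);

	/* no uninitialized stack bytes reach userspace when i < SOCK_USR_ZC_INFO_MAX */
	memset(&sys_zcopy_info, 0, sizeof(sys_zcopy_info));

	/* ... dequeue up to SOCK_USR_ZC_INFO_MAX notifications as in the patch ... */

	if (copy_to_user(usr_addr, &sys_zcopy_info, sizeof(sys_zcopy_info)))
		return -EFAULT;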

> +                                       )) {
> +                                       spin_lock_irqsave(&zcopy_queue->lock, flags);
> +                                       while (i > 0)
> +                                               list_add(&zcopy_node_ps[--i]->node,
> +                                                        &zcopy_queue->head);



> +                                       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> +                                       return -EFAULT;
> +                               }
> +
> +                               while (i > 0)
> +                                       consume_skb(zcopy_node_ps[--i]->skb);
> +                       }
> +               }
> +               break;
>         default:
>                 return -EINVAL;
>         }
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 1fe794967211..5adb737c4c01 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1005,7 +1005,7 @@ static int __ip_append_data(struct sock *sk,
>             (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
>                 csummode = CHECKSUM_PARTIAL;
>
> -       if ((flags & MSG_ZEROCOPY) && length) {
> +       if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>                 struct msghdr *msg = from;
>
>                 if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
> @@ -1022,7 +1022,9 @@ static int __ip_append_data(struct sock *sk,
>                                 uarg = msg->msg_ubuf;
>                         }
>                 } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> -                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
> +                       bool user_args = flags & MSG_ZEROCOPY_UARG;
> +
> +                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), user_args);
>                         if (!uarg)
>                                 return -ENOBUFS;
>                         extra_uref = !skb_zcopy(skb);   /* only ref on new uarg */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e767721b3a58..6254d0eef3af 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)
>
>         set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>         sk_sockets_allocated_inc(sk);
> +
> +       tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
>  }
>  EXPORT_SYMBOL(tcp_init_sock);
>
> @@ -1050,14 +1052,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>
>         flags = msg->msg_flags;
>
> -       if ((flags & MSG_ZEROCOPY) && size) {
> +       if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && size) {
>                 if (msg->msg_ubuf) {
>                         uarg = msg->msg_ubuf;
>                         if (sk->sk_route_caps & NETIF_F_SG)
>                                 zc = MSG_ZEROCOPY;
>                 } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> +                       bool zc_uarg = flags & MSG_ZEROCOPY_UARG;
>                         skb = tcp_write_queue_tail(sk);
> -                       uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
> +                       uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), zc_uarg);
>                         if (!uarg) {
>                                 err = -ENOBUFS;
>                                 goto out_err;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 11460d751e73..6c62aacd74d6 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1126,6 +1126,15 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>                 if (ipc.opt)
>                         free = 1;
>                 connected = 0;
> +
> +               /* If len is zero and flag MSG_ZEROCOPY_UARG is set,
> +                * it means this call just wants to get zcopy notifications
> +                * instead of sending packets. It is useful when users
> +                * finish sending and want to get trailing notifications.
> +                */
> +               if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
> +                   sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
> +                       return 0;
>         }
>         if (!ipc.opt) {
>                 struct ip_options_rcu *inet_opt;
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index b9dd3a66e423..891526ddd74c 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1493,7 +1493,7 @@ static int __ip6_append_data(struct sock *sk,
>             rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
>                 csummode = CHECKSUM_PARTIAL;
>
> -       if ((flags & MSG_ZEROCOPY) && length) {
> +       if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>                 struct msghdr *msg = from;
>
>                 if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
> @@ -1510,7 +1510,8 @@ static int __ip6_append_data(struct sock *sk,
>                                 uarg = msg->msg_ubuf;
>                         }
>                 } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> -                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
> +                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb),
> +                                                   flags & MSG_ZEROCOPY_UARG);
>                         if (!uarg)
>                                 return -ENOBUFS;
>                         extra_uref = !skb_zcopy(skb);   /* only ref on new uarg */
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2e4dc5e6137b..98f6905c5db9 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1490,6 +1490,15 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>                 if (!(opt->opt_nflen|opt->opt_flen))
>                         opt = NULL;
>                 connected = false;
> +
> +               /* If len is zero and flag MSG_ZEROCOPY_UARG is set,
> +                * it means this call just wants to get zcopy notifications
> +                * instead of sending packets. It is useful when users
> +                * finish sending and want to get trailing notifications.
> +                */
> +               if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
> +                   sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
> +                       return 0;
>         }
>         if (!opt) {
>                 opt = txopt_get(np);
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 16ff976a86e3..d6e6830f6ffe 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -84,7 +84,7 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>
>                 uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>                                             iter->count,
> -                                           NULL);
> +                                           NULL, false);
>                 if (!uarg)
>                         return -1;
>
> --
> 2.20.1
>


* Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
  2024-04-09 21:18   ` Willem de Bruijn
  2024-04-09 21:18   ` Eric Dumazet
@ 2024-04-09 21:23   ` Eric Dumazet
  2024-04-10  0:11     ` [External] " Zijian Zhang
  2024-04-10  9:18   ` kernel test robot
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-04-09 21:23 UTC (permalink / raw)
  To: zijianzhang
  Cc: netdev, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu

On Tue, Apr 9, 2024 at 10:53 PM <zijianzhang@bytedance.com> wrote:
>
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
> However, zerocopy is not a free lunch. Apart from the management of user
> pages, the combination of poll + recvmsg to receive notifications incurs
> non-negligible overhead in the applications. That overhead can sometimes
> outweigh the CPU savings from zerocopy. We try to solve this
> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
> This new mechanism aims to reduce the overhead associated with receiving
> notifications by embedding them directly into user arguments passed with
> each sendmsg control message. By doing so, we can significantly reduce
> the complexity and overhead for managing notifications. In an ideal
> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
> flag, and the notification will be delivered as soon as possible.
>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
> ---
>  include/linux/skbuff.h                  |   7 +-
>  include/linux/socket.h                  |   1 +
>  include/linux/tcp.h                     |   3 +
>  include/linux/udp.h                     |   3 +
>  include/net/sock.h                      |  17 +++
>  include/net/udp.h                       |   1 +
>  include/uapi/asm-generic/socket.h       |   2 +
>  include/uapi/linux/socket.h             |  17 +++

...

> +
> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
> +{
> +       spin_lock_init(&q->lock);
> +       INIT_LIST_HEAD(&q->head);
> +}
> +
>

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e767721b3a58..6254d0eef3af 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)
>
>         set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>         sk_sockets_allocated_inc(sk);
> +
> +       tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
>  }


FYI, tcp_init_sock() is not called for passive sockets.

syzbot would quite easily crash if zerocopy is used after accept()...
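
One possible fix is to also initialize the queue when the child socket is
created for an accepted connection; a sketch, assuming
tcp_create_openreq_child() in net/ipv4/tcp_minisocks.c is the right spot:

	/* net/ipv4/tcp_minisocks.c, tcp_create_openreq_child() (sketch) */
	struct tcp_sock *newtp = tcp_sk(newsk);

	/* accepted sockets never run tcp_init_sock(), so initialize the
	 * per-socket notification queue for them here as well
	 */
	tx_message_zcopy_queue_init(&newtp->tx_zcopy_queue);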


* Re: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-09 20:52 ` [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest zijianzhang
@ 2024-04-09 21:25   ` Willem de Bruijn
  2024-04-09 21:30     ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2024-04-09 21:25 UTC (permalink / raw)
  To: zijianzhang, netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang,
	xiaochun.lu, Zijian Zhang

zijianzhang@ wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
> 
> In selftests/net/msg_zerocopy.c, a while loop keeps calling sendmsg
> on a socket, and it receives the completion notifications when the socket
> is not writable. Typically, it will start the receiving process after
> around 30+ sendmsgs.
> 
> However, because of the commit dfa2f0483360
> ("tcp: get rid of sysctl_tcp_adv_win_scale"), the sender is always writable
> and does not get any chance to run recv notifications. The selftest always
> exits with OUT_OF_MEMORY because the memory used by opt_skb exceeds
> the core.sysctl_optmem_max. We introduce "cfg_notification_limit" to
> force sender to receive notifications after some number of sendmsgs.

No need for a new option. Existing test automation will not enable
that.

I have not observed this behavior in tests (so I wonder what is
different about the setups). But it is fine to unconditionally force
a call to do_recv_completions every few sends.
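
For example, something like this inside the do_tx() send loop, with a
hardcoded threshold instead of a new command-line option (a sketch; the
value 32 is arbitrary):

	int sends_since_reap = 0;

	/* in the send loop */
	do_sendmsg(fd, &msg, cfg_zerocopy, domain);
	if (cfg_zerocopy && ++sends_since_reap >= 32) {
		do_recv_completions(fd, domain);
		sends_since_reap = 0;
	}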

> Plus,
> in the selftest, we need to update skb_orphan_frags_rx to be the same as
> skb_orphan_frags.

To be able to test over loopback, I suppose?

> In this case, for some reason, notifications do not
> come in order now. We introduce "cfg_notification_order_check" to
> possibly ignore the checking for order.

Were you testing UDP?

I don't think this is needed. I wonder what you were doing to see
enough of these events to want to suppress the log output.
 
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
> ---
>  tools/testing/selftests/net/msg_zerocopy.c | 24 ++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
> index bdc03a2097e8..8e595216a0af 100644
> --- a/tools/testing/selftests/net/msg_zerocopy.c
> +++ b/tools/testing/selftests/net/msg_zerocopy.c
> @@ -85,6 +85,8 @@ static bool cfg_rx;
>  static int  cfg_runtime_ms	= 4200;
>  static int  cfg_verbose;
>  static int  cfg_waittime_ms	= 500;
> +static bool cfg_notification_order_check;
> +static int  cfg_notification_limit = 32;
>  static bool cfg_zerocopy;
>  
>  static socklen_t cfg_alen;
> @@ -435,7 +437,7 @@ static bool do_recv_completion(int fd, int domain)
>  	/* Detect notification gaps. These should not happen often, if at all.
>  	 * Gaps can occur due to drops, reordering and retransmissions.
>  	 */
> -	if (lo != next_completion)
> +	if (cfg_notification_order_check && lo != next_completion)
>  		fprintf(stderr, "gap: %u..%u does not append to %u\n",
>  			lo, hi, next_completion);
>  	next_completion = hi + 1;
> @@ -489,7 +491,7 @@ static void do_tx(int domain, int type, int protocol)
>  		struct iphdr iph;
>  	} nh;
>  	uint64_t tstop;
> -	int fd;
> +	int fd, sendmsg_counter = 0;
>  
>  	fd = do_setup_tx(domain, type, protocol);
>  
> @@ -548,10 +550,18 @@ static void do_tx(int domain, int type, int protocol)
>  			do_sendmsg_corked(fd, &msg);
>  		else
>  			do_sendmsg(fd, &msg, cfg_zerocopy, domain);
> +		sendmsg_counter++;
> +
> +		if (sendmsg_counter == cfg_notification_limit && cfg_zerocopy) {
> +			do_recv_completions(fd, domain);
> +			sendmsg_counter = 0;
> +		}
>  
>  		while (!do_poll(fd, POLLOUT)) {
> -			if (cfg_zerocopy)
> +			if (cfg_zerocopy) {
>  				do_recv_completions(fd, domain);
> +				sendmsg_counter = 0;
> +			}
>  		}
>  
>  	} while (gettimeofday_ms() < tstop);
> @@ -708,7 +718,7 @@ static void parse_opts(int argc, char **argv)
>  
>  	cfg_payload_len = max_payload_len;
>  
> -	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vz")) != -1) {
> +	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vzol:")) != -1) {
>  		switch (c) {
>  		case '4':
>  			if (cfg_family != PF_UNSPEC)
> @@ -760,6 +770,12 @@ static void parse_opts(int argc, char **argv)
>  		case 'z':
>  			cfg_zerocopy = true;
>  			break;
> +		case 'o':
> +			cfg_notification_order_check = true;
> +			break;
> +		case 'l':
> +			cfg_notification_limit = strtoul(optarg, NULL, 0);
> +			break;
>  		}
>  	}
>  
> -- 
> 2.20.1
> 




* Re: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-09 21:25   ` Willem de Bruijn
@ 2024-04-09 21:30     ` Eric Dumazet
  2024-04-09 23:57       ` [External] " Zijian Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-04-09 21:30 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: zijianzhang, netdev, davem, kuba, cong.wang, xiaochun.lu

On Tue, Apr 9, 2024 at 11:25 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> zijianzhang@ wrote:
> > From: Zijian Zhang <zijianzhang@bytedance.com>
> >
> > In selftests/net/msg_zerocopy.c, a while loop keeps calling sendmsg
> > on a socket, and it receives the completion notifications when the socket
> > is not writable. Typically, it will start the receiving process after
> > around 30+ sendmsgs.
> >
> > However, because of the commit dfa2f0483360
> > ("tcp: get rid of sysctl_tcp_adv_win_scale"), the sender is always writable
> > and does not get any chance to run recv notifications. The selftest always
> > exits with OUT_OF_MEMORY because the memory used by opt_skb exceeds
> > the core.sysctl_optmem_max. We introduce "cfg_notification_limit" to
> > force sender to receive notifications after some number of sendmsgs.
>
> No need for a new option. Existing test automation will not enable
> that.
>
> I have not observed this behavior in tests (so I wonder what is
> different about the setups). But it is fine to unconditionally force
> a call to do_recv_completions every few sends.

Maybe their kernel does not yet have:

commit 4944566706b27918ca15eda913889db296792415 ("net: increase
optmem_max default value")

???

>
> > Plus,
> > in the selftest, we need to update skb_orphan_frags_rx to be the same as
> > skb_orphan_frags.
>
> To be able to test over loopback, I suppose?
>
> > In this case, for some reason, notifications do not
> > come in order now. We introduce "cfg_notification_order_check" to
> > possibly ignore the checking for order.
>
> Were you testing UDP?
>
> I don't think this is needed. I wonder what you were doing to see
> enough of these events to want to suppress the log output.


* Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-09 21:30     ` Eric Dumazet
@ 2024-04-09 23:57       ` Zijian Zhang
  2024-04-10 23:06         ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Zijian Zhang @ 2024-04-09 23:57 UTC (permalink / raw)
  To: Eric Dumazet, Willem de Bruijn
  Cc: netdev, davem, kuba, cong.wang, xiaochun.lu

Firstly, thanks for your time and quick reply!

On 4/9/24 2:30 PM, Eric Dumazet wrote:
> On Tue, Apr 9, 2024 at 11:25 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>>
>> zijianzhang@ wrote:
>>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>>
>>> In selftests/net/msg_zerocopy.c, a while loop keeps calling sendmsg
>>> on a socket, and it receives the completion notifications when the socket
>>> is not writable. Typically, it will start the receiving process after
>>> around 30+ sendmsgs.
>>>
>>> However, because of the commit dfa2f0483360
>>> ("tcp: get rid of sysctl_tcp_adv_win_scale"), the sender is always writable
>>> and does not get any chance to run recv notifications. The selftest always
>>> exits with OUT_OF_MEMORY because the memory used by opt_skb exceeds
>>> the core.sysctl_optmem_max. We introduce "cfg_notification_limit" to
>>> force sender to receive notifications after some number of sendmsgs.
>>
>> No need for a new option. Existing test automation will not enable
>> that.
>>
>> I have not observed this behavior in tests (so I wonder what is
>> different about the setups). But it is fine to unconditionally force
>> a call to do_recv_completions every few sends.
> 
> Maybe their kernel does not have yet :
> 
> commit 4944566706b27918ca15eda913889db296792415    net: increase
> optmem_max default value
> 
> ???
> 

I ran the selftest on a qemu VM with a v6.8-rc3 kernel built from the
linux repo, so it should already include commit 4944566706b2 ("net:
increase optmem_max default value").

"
qemu-system-x86_64 \
     -enable-kvm \
     -nographic \
     -drive file=$HOME/guest.qcow2,if=virtio \
     -device vfio-pci,host=3b:00.2,multifunction=on \
     -m 32G \
     -smp 16 \
     -kernel $HOME/linux-master/arch/x86/boot/bzImage \
     -append 'root=/dev/vda1 console=ttyS0 earlyprintk=ttyS0 
net.ifnames=0 biosdevname=0'
"

I did it again just now with a clean image, and there was no problem...
Unfortunately, I did not save the image I tested before; I will give you
more information about my network configuration if I can restore it.

So this is not a bug, but a problem caused by my custom configuration.
Sorry about that; I will drop this patch in the next version.

>>
>>> Plus,
>>> in the selftest, we need to update skb_orphan_frags_rx to be the same as
>>> skb_orphan_frags.
>>
>> To be able to test over loopback, I suppose?
>>

Yes.

>>> In this case, for some reason, notifications do not
>>> come in order now. We introduce "cfg_notification_order_check" to
>>> possibly ignore the checking for order.
>>
>> Were you testing UDP?
>>
>> I don't think this is needed. I wonder what you were doing to see
>> enough of these events to want to suppress the log output.

I tested again on both TCP and UDP just now, and it happened with both of
them. For the TCP test, too many printfs delay the sending and thus
reduce the throughput.

ipv4 tcp -z -t 1
gap: 277..277 does not append to 276
gap: 276..276 does not append to 278
gap: 278..1112 does not append to 277
gap: 1114..1114 does not append to 1113
gap: 1113..1113 does not append to 1115
gap: 1115..2330 does not append to 1114
gap: 2332..2332 does not append to 2331
gap: 2331..2331 does not append to 2333
gap: 2333..2559 does not append to 2332
gap: 2562..2562 does not append to 2560
...
gap: 25841..25841 does not append to 25843
gap: 25843..25997 does not append to 25842

...

ipv6 udp -z -t 1
gap: 11632..11687 does not append to 11625
gap: 11625..11631 does not append to 11688
gap: 11688..54662 does not append to 11632


* Re: [External] Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 21:18   ` Willem de Bruijn
@ 2024-04-10  0:08     ` Zijian Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Zijian Zhang @ 2024-04-10  0:08 UTC (permalink / raw)
  To: Willem de Bruijn, netdev; +Cc: edumazet, davem, kuba, cong.wang, xiaochun.lu



On 4/9/24 2:18 PM, Willem de Bruijn wrote:
> zijianzhang@ wrote:
>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>
>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>> However, zerocopy is not a free lunch. Apart from the management of user
>> pages, the combination of poll + recvmsg to receive notifications incurs
>> non-negligible overhead in the applications. That overhead can sometimes
>> outweigh the CPU savings from zerocopy. We try to solve this
>> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
>> This new mechanism aims to reduce the overhead associated with receiving
>> notifications by embedding them directly into user arguments passed with
>> each sendmsg control message. By doing so, we can significantly reduce
>> the complexity and overhead for managing notifications. In an ideal
>> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
>> flag, and the notification will be delivered as soon as possible.
>>
>> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
>> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
> 
> The idea to avoid POLLERR and errqueue receive cost by piggy-backing
> onto the send() call is appealing.
> 
> I also implemented this when initially adding MSG_ZEROCOPY. At the
> time, the modest performance improvement I measured did not warrant
> my implementation. But this may be different.
> 
> The main reason my code was complex is that this requires changes
> to net/socket.c: controldata is intended to be user->kernel on
> send and kernel->user on recv, just like the data itself. This feature breaks
> that. You'll have to copy msg_controldata in ____sys_sendmsg.
> 
> I don't think it requires this much new infrastructure: a separate
> queue in structs tcp_sock and udp_sock, a different uarg callback.
> 
> I would suggest an optimistic algorithm, whereby the send call
> steals a notification skb from the error queue if present. This
> could be a new mode set with a value 2 in SO_ZEROCOPY.
> 

Thanks for the suggestion! I also think that the new infrastructure
requires too much code change; it would be nice to reuse the error queue.

And I agree that MSG_ZEROCOPY_UARG is not a good name; a new mode
set with a value of 2 in SO_ZEROCOPY is a better idea.

I will update in the next patch set.
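
For reference, a minimal userspace sketch of the notification flow as it
works in this revision (the struct, sockopt and flag names come from this
patch's headers; this is illustrative only and the uapi may well change in
the next version as discussed above):

#include <errno.h>
#include <error.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/socket.h>	/* struct tx_usr_zcopy_info, with this patch applied */

#ifndef MSG_ZEROCOPY_UARG
#define MSG_ZEROCOPY_UARG	0x2000000
#endif
#ifndef SO_ZEROCOPY_NOTIFICATION
#define SO_ZEROCOPY_NOTIFICATION 78
#endif

/* assumes setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, ...) was already done */
static void send_and_reap(int fd, char *buf, size_t len)
{
	/* length must start at 0: the kernel only writes it when
	 * completions were actually pending
	 */
	struct tx_usr_zcopy_info zc_info = { .length = 0 };
	struct tx_usr_zcopy_info *zc_ptr = &zc_info;
	char control[CMSG_SPACE(sizeof(void *))] = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {0};
	struct cmsghdr *cm;
	int i;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	/* pass a placeholder for the notifications in the control message */
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SO_ZEROCOPY_NOTIFICATION;
	cm->cmsg_len = CMSG_LEN(sizeof(void *));
	memcpy(CMSG_DATA(cm), &zc_ptr, sizeof(zc_ptr));

	if (sendmsg(fd, &msg, MSG_ZEROCOPY_UARG) < 0)
		error(1, errno, "sendmsg");

	/* any completed ranges were written into zc_info by the kernel */
	for (i = 0; i < zc_info.length; i++)
		fprintf(stderr, "complete: %u..%u zerocopy=%u\n",
			zc_info.info[i].lo, zc_info.info[i].hi,
			zc_info.info[i].zerocopy);
}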

>> ---
>>   include/linux/skbuff.h                  |   7 +-
>>   include/linux/socket.h                  |   1 +
>>   include/linux/tcp.h                     |   3 +
>>   include/linux/udp.h                     |   3 +
>>   include/net/sock.h                      |  17 +++
>>   include/net/udp.h                       |   1 +
>>   include/uapi/asm-generic/socket.h       |   2 +
>>   include/uapi/linux/socket.h             |  17 +++
>>   net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
>>   net/core/sock.c                         |  50 +++++++++
>>   net/ipv4/ip_output.c                    |   6 +-
>>   net/ipv4/tcp.c                          |   7 +-
>>   net/ipv4/udp.c                          |   9 ++
>>   net/ipv6/ip6_output.c                   |   5 +-
>>   net/ipv6/udp.c                          |   9 ++
>>   net/vmw_vsock/virtio_transport_common.c |   2 +-
>>   16 files changed, 258 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 03ea36a82cdd..19b94ba01007 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
>>   #endif
>>   
>>   struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>> -				       struct ubuf_info *uarg);
>> +				       struct ubuf_info *uarg, bool user_args_notification);
>>   
>>   void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
>>   
>>   void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>   			   bool success);
>> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>> +			   bool success);
>>   
>>   int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>>   			    struct sk_buff *skb, struct iov_iter *from,
>> @@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
>>   static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>   {
>>   	if (uarg) {
>> -		if (uarg->callback == msg_zerocopy_callback)
>> +		if (uarg->callback == msg_zerocopy_callback ||
>> +			uarg->callback == msg_zerocopy_uarg_callback)
>>   			msg_zerocopy_put_abort(uarg, have_uref);
>>   		else if (have_uref)
>>   			net_zcopy_put(uarg);
>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>> index 139c330ccf2c..de01392344e1 100644
>> --- a/include/linux/socket.h
>> +++ b/include/linux/socket.h
>> @@ -326,6 +326,7 @@ struct ucred {
>>   					  * plain text and require encryption
>>   					  */
>>   
>> +#define MSG_ZEROCOPY_UARG	0x2000000 /* MSG_ZEROCOPY with UARG notifications */
>>   #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
>>   #define MSG_SPLICE_PAGES 0x8000000	/* Splice the pages from the iterator in sendmsg() */
>>   #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index 55399ee2a57e..e973f4990646 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -501,6 +501,9 @@ struct tcp_sock {
>>   	 */
>>   	struct request_sock __rcu *fastopen_rsk;
>>   	struct saved_syn *saved_syn;
>> +
>> +/* TCP MSG_ZEROCOPY_UARG related information */
>> +	struct tx_msg_zcopy_queue tx_zcopy_queue;
>>   };
>>   
>>   enum tsq_enum {
>> diff --git a/include/linux/udp.h b/include/linux/udp.h
>> index 3748e82b627b..502b393eac67 100644
>> --- a/include/linux/udp.h
>> +++ b/include/linux/udp.h
>> @@ -95,6 +95,9 @@ struct udp_sock {
>>   
>>   	/* Cache friendly copy of sk->sk_peek_off >= 0 */
>>   	bool		peeking_with_offset;
>> +
>> +	/* This field is used by sendmsg zcopy user arg mode notification */
>> +	struct tx_msg_zcopy_queue tx_zcopy_queue;
>>   };
>>   
>>   #define udp_test_bit(nr, sk)			\
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 2253eefe2848..f7c045e98213 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -544,6 +544,23 @@ struct sock {
>>   	netns_tracker		ns_tracker;
>>   };
>>   
>> +struct tx_msg_zcopy_node {
>> +	struct list_head node;
>> +	struct tx_msg_zcopy_info info;
>> +	struct sk_buff *skb;
>> +};
>> +
>> +struct tx_msg_zcopy_queue {
>> +	struct list_head head;
>> +	spinlock_t lock; /* protects head queue */
>> +};
>> +
>> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
>> +{
>> +	spin_lock_init(&q->lock);
>> +	INIT_LIST_HEAD(&q->head);
>> +}
>> +
>>   enum sk_pacing {
>>   	SK_PACING_NONE		= 0,
>>   	SK_PACING_NEEDED	= 1,
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index 488a6d2babcc..9e4d7b128de4 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
>>   	skb_queue_head_init(&up->reader_queue);
>>   	up->forward_threshold = sk->sk_rcvbuf >> 2;
>>   	set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
>> +	tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
>>   }
>>   
>>   /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
>> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
>> index 8ce8a39a1e5f..86aa4b5cb7f1 100644
>> --- a/include/uapi/asm-generic/socket.h
>> +++ b/include/uapi/asm-generic/socket.h
>> @@ -135,6 +135,8 @@
>>   #define SO_PASSPIDFD		76
>>   #define SO_PEERPIDFD		77
>>   
>> +#define SO_ZEROCOPY_NOTIFICATION 78
>> +
>>   #if !defined(__KERNEL__)
>>   
>>   #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
>> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
>> index d3fcd3b5ec53..469ed8f4e6c8 100644
>> --- a/include/uapi/linux/socket.h
>> +++ b/include/uapi/linux/socket.h
>> @@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
>>   #define SOCK_TXREHASH_DISABLED	0
>>   #define SOCK_TXREHASH_ENABLED	1
>>   
>> +/*
>> + * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
>> + * back to user as soon as possible, 8 should be sufficient.
>> + */
>> +#define SOCK_USR_ZC_INFO_MAX 8
>> +
>> +struct tx_msg_zcopy_info {
>> +	__u32 lo;
>> +	__u32 hi;
>> +	__u8 zerocopy;
>> +};
>> +
>> +struct tx_usr_zcopy_info {
>> +	int length;
>> +	struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
>> +};
>> +
>>   #endif /* _UAPI_LINUX_SOCKET_H */
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 2a5ce6667bbb..d939b2c14d55 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
>>   }
>>   EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
>>   
>> +static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
>> +{
>> +	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
>> +	uarg->len = 1;
>> +	uarg->bytelen = size;
>> +	uarg->zerocopy = 1;
>> +	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>> +	refcount_set(&uarg->ubuf.refcnt, 1);
>> +}
>> +
>>   static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>>   {
>>   	struct ubuf_info_msgzc *uarg;
>> @@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>>   	}
>>   
>>   	uarg->ubuf.callback = msg_zerocopy_callback;
>> -	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
>> -	uarg->len = 1;
>> -	uarg->bytelen = size;
>> -	uarg->zerocopy = 1;
>> -	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>> -	refcount_set(&uarg->ubuf.refcnt, 1);
>> +	init_ubuf_info_msgzc(uarg, sk, size);
>> +	sock_hold(sk);
>> +
>> +	return &uarg->ubuf;
>> +}
>> +
>> +static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
>> +{
>> +	struct sk_buff *skb;
>> +	struct ubuf_info_msgzc *uarg;
>> +	struct tx_msg_zcopy_node *zcopy_node_p;
>> +
>> +	WARN_ON_ONCE(!in_task());
>> +
>> +	skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
>> +	if (!skb)
>> +		return NULL;
>> +
>> +	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
>> +	uarg = (void *)skb->cb;
>> +	uarg->mmp.user = NULL;
>> +	zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
>> +
>> +	if (mm_account_pinned_pages(&uarg->mmp, size)) {
>> +		kfree_skb(skb);
>> +		return NULL;
>> +	}
>> +
>> +	INIT_LIST_HEAD(&zcopy_node_p->node);
>> +	zcopy_node_p->skb = skb;
>> +	uarg->ubuf.callback = msg_zerocopy_uarg_callback;
>> +	init_ubuf_info_msgzc(uarg, sk, size);
>>   	sock_hold(sk);
>>   
>>   	return &uarg->ubuf;
>> @@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
>>   }
>>   
>>   struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>> -				       struct ubuf_info *uarg)
>> +				       struct ubuf_info *uarg, bool usr_arg_notification)
>>   {
>>   	if (uarg) {
>>   		struct ubuf_info_msgzc *uarg_zc;
>> @@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>   		u32 bytelen, next;
>>   
>>   		/* there might be non MSG_ZEROCOPY users */
>> -		if (uarg->callback != msg_zerocopy_callback)
>> +		if (uarg->callback != msg_zerocopy_callback &&
>> +		    uarg->callback != msg_zerocopy_uarg_callback)
>>   			return NULL;
>>   
>>   		/* realloc only when socket is locked (TCP, UDP cork),
>> @@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>   	}
>>   
>>   new_alloc:
>> +	if (usr_arg_notification)
>> +		return msg_zerocopy_uarg_alloc(sk, size);
>>   	return msg_zerocopy_alloc(sk, size);
>>   }
>>   EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
>> @@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>   }
>>   EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
>>   
>> +static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
>> +{
>> +	u32 old_lo, old_hi;
>> +	u64 sum_len;
>> +
>> +	old_lo = node->info.lo;
>> +	old_hi = node->info.hi;
>> +	sum_len = old_hi - old_lo + 1ULL + len;
>> +
>> +	if (sum_len >= (1ULL << 32))
>> +		return false;
>> +
>> +	if (lo != old_hi + 1)
>> +		return false;
>> +
>> +	node->info.hi += len;
>> +	return true;
>> +}
>> +
>> +static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
>> +{
>> +	struct sk_buff *skb = skb_from_uarg(uarg);
>> +	struct sock *sk = skb->sk;
>> +	struct tx_msg_zcopy_node *zcopy_node_p, *tail;
>> +	struct tx_msg_zcopy_queue *zcopy_queue;
>> +	unsigned long flags;
>> +	u32 lo, hi;
>> +	u16 len;
>> +
>> +	mm_unaccount_pinned_pages(&uarg->mmp);
>> +
>> +	/* if !len, there was only 1 call, and it was aborted
>> +	 * so do not queue a completion notification
>> +	 */
>> +	if (!uarg->len || sock_flag(sk, SOCK_DEAD))
>> +		goto release;
>> +
>> +	/* only support TCP and UDP currently */
>> +	if (sk_is_tcp(sk)) {
>> +		zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
>> +	} else if (sk_is_udp(sk)) {
>> +		zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
>> +	} else {
>> +		pr_warn("MSG_ZEROCOPY_UARG only support TCP && UDP sockets");
>> +		goto release;
>> +	}
>> +
>> +	len = uarg->len;
>> +	lo = uarg->id;
>> +	hi = uarg->id + len - 1;
>> +
>> +	zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
>> +	zcopy_node_p->info.lo = lo;
>> +	zcopy_node_p->info.hi = hi;
>> +	zcopy_node_p->info.zerocopy = uarg->zerocopy;
>> +
>> +	spin_lock_irqsave(&zcopy_queue->lock, flags);
>> +	tail = list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
>> +	if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
>> +		list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
>> +		skb = NULL;
>> +	}
>> +	spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>> +release:
>> +	consume_skb(skb);
>> +	sock_put(sk);
>> +}
>> +
>> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>> +				bool success)
>> +{
>> +	struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
>> +
>> +	uarg_zc->zerocopy = uarg_zc->zerocopy & success;
>> +
>> +	if (refcount_dec_and_test(&uarg->refcnt))
>> +		__msg_zerocopy_uarg_callback(uarg_zc);
>> +}
>> +EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
>> +
>>   void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>   {
>>   	struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
>> @@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>   	uarg_to_msgzc(uarg)->len--;
>>   
>>   	if (have_uref)
>> -		msg_zerocopy_callback(NULL, uarg, true);
>> +		uarg->callback(NULL, uarg, true);
>>   }
>>   EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
>>   
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 5ed411231fc7..a00ebd71f6ed 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>>   	case SCM_RIGHTS:
>>   	case SCM_CREDENTIALS:
>>   		break;
>> +	case SO_ZEROCOPY_NOTIFICATION:
>> +		if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> +			int i = 0;
>> +			struct tx_usr_zcopy_info sys_zcopy_info;
>> +			struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
>> +			struct tx_msg_zcopy_queue *zcopy_queue;
>> +			struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
>> +			unsigned long flags;
>> +
>> +			if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
>> +				return -EINVAL;
>> +
>> +			if (sk_is_tcp(sk))
>> +				zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
>> +			else if (sk_is_udp(sk))
>> +				zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
>> +			else
>> +				return -EINVAL;
>> +
>> +			spin_lock_irqsave(&zcopy_queue->lock, flags);
>> +			list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
>> +				sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
>> +				sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
>> +				sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
>> +				list_del(&zcopy_node_p->node);
>> +				zcopy_node_ps[i++] = zcopy_node_p;
>> +				if (i == SOCK_USR_ZC_INFO_MAX)
>> +					break;
>> +			}
>> +			spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>> +
>> +			if (i > 0) {
>> +				sys_zcopy_info.length = i;
>> +				if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
>> +							  &sys_zcopy_info,
>> +							  sizeof(sys_zcopy_info))
>> +					)) {
>> +					spin_lock_irqsave(&zcopy_queue->lock, flags);
>> +					while (i > 0)
>> +						list_add(&zcopy_node_ps[--i]->node,
>> +							 &zcopy_queue->head);
>> +					spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>> +					return -EFAULT;
>> +				}
>> +
>> +				while (i > 0)
>> +					consume_skb(zcopy_node_ps[--i]->skb);
>> +			}
>> +		}
>> +		break;
>>   	default:
>>   		return -EINVAL;
>>   	}
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index 1fe794967211..5adb737c4c01 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -1005,7 +1005,7 @@ static int __ip_append_data(struct sock *sk,
>>   	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
>>   		csummode = CHECKSUM_PARTIAL;
>>   
>> -	if ((flags & MSG_ZEROCOPY) && length) {
>> +	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>>   		struct msghdr *msg = from;
>>   
>>   		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
>> @@ -1022,7 +1022,9 @@ static int __ip_append_data(struct sock *sk,
>>   				uarg = msg->msg_ubuf;
>>   			}
>>   		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> -			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
>> +			bool user_args = flags & MSG_ZEROCOPY_UARG;
>> +
>> +			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), user_args);
>>   			if (!uarg)
>>   				return -ENOBUFS;
>>   			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index e767721b3a58..6254d0eef3af 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)
>>   
>>   	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>>   	sk_sockets_allocated_inc(sk);
>> +
>> +	tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
>>   }
>>   EXPORT_SYMBOL(tcp_init_sock);
>>   
>> @@ -1050,14 +1052,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>>   
>>   	flags = msg->msg_flags;
>>   
>> -	if ((flags & MSG_ZEROCOPY) && size) {
>> +	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && size) {
>>   		if (msg->msg_ubuf) {
>>   			uarg = msg->msg_ubuf;
>>   			if (sk->sk_route_caps & NETIF_F_SG)
>>   				zc = MSG_ZEROCOPY;
>>   		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> +			bool zc_uarg = flags & MSG_ZEROCOPY_UARG;
>>   			skb = tcp_write_queue_tail(sk);
>> -			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
>> +			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), zc_uarg);
>>   			if (!uarg) {
>>   				err = -ENOBUFS;
>>   				goto out_err;
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 11460d751e73..6c62aacd74d6 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -1126,6 +1126,15 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>   		if (ipc.opt)
>>   			free = 1;
>>   		connected = 0;
>> +
>> +		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
>> +		 * it means this call just wants to get zcopy notifications
>> +		 * instead of sending packets. It is useful when users
>> +		 * finish sending and want to get trailing notifications.
>> +		 */
>> +		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
>> +		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
>> +			return 0;
>>   	}
>>   	if (!ipc.opt) {
>>   		struct ip_options_rcu *inet_opt;
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index b9dd3a66e423..891526ddd74c 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -1493,7 +1493,7 @@ static int __ip6_append_data(struct sock *sk,
>>   	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
>>   		csummode = CHECKSUM_PARTIAL;
>>   
>> -	if ((flags & MSG_ZEROCOPY) && length) {
>> +	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>>   		struct msghdr *msg = from;
>>   
>>   		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
>> @@ -1510,7 +1510,8 @@ static int __ip6_append_data(struct sock *sk,
>>   				uarg = msg->msg_ubuf;
>>   			}
>>   		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> -			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
>> +			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb),
>> +						    flags & MSG_ZEROCOPY_UARG);
>>   			if (!uarg)
>>   				return -ENOBUFS;
>>   			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
>> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
>> index 2e4dc5e6137b..98f6905c5db9 100644
>> --- a/net/ipv6/udp.c
>> +++ b/net/ipv6/udp.c
>> @@ -1490,6 +1490,15 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>   		if (!(opt->opt_nflen|opt->opt_flen))
>>   			opt = NULL;
>>   		connected = false;
>> +
>> +		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
>> +		 * it means this call just wants to get zcopy notifications
>> +		 * instead of sending packets. It is useful when users
>> +		 * finish sending and want to get trailing notifications.
>> +		 */
>> +		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
>> +		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
>> +			return 0;
>>   	}
>>   	if (!opt) {
>>   		opt = txopt_get(np);
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 16ff976a86e3..d6e6830f6ffe 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -84,7 +84,7 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>   
>>   		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>   					    iter->count,
>> -					    NULL);
>> +					    NULL, false);
>>   		if (!uarg)
>>   			return -1;
>>   
>> -- 
>> 2.20.1
>>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 21:23   ` Eric Dumazet
@ 2024-04-10  0:11     ` Zijian Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Zijian Zhang @ 2024-04-10  0:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu



On 4/9/24 2:23 PM, Eric Dumazet wrote:
> On Tue, Apr 9, 2024 at 10:53 PM <zijianzhang@bytedance.com> wrote:
>>
>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>
>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>> However, zerocopy is not a free lunch. Apart from the management of user
>> pages, the combination of poll + recvmsg to receive notifications incurs
>> unignorable overhead in the applications. The overhead of such sometimes
>> might be more than the CPU savings from zerocopy. We try to solve this
>> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
>> This new mechanism aims to reduce the overhead associated with receiving
>> notifications by embedding them directly into user arguments passed with
>> each sendmsg control message. By doing so, we can significantly reduce
>> the complexity and overhead for managing notifications. In an ideal
>> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
>> flag, and the notification will be delivered as soon as possible.
>>
>> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
>> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
>> ---
>>   include/linux/skbuff.h                  |   7 +-
>>   include/linux/socket.h                  |   1 +
>>   include/linux/tcp.h                     |   3 +
>>   include/linux/udp.h                     |   3 +
>>   include/net/sock.h                      |  17 +++
>>   include/net/udp.h                       |   1 +
>>   include/uapi/asm-generic/socket.h       |   2 +
>>   include/uapi/linux/socket.h             |  17 +++
> 
> ...
> 
>> +
>> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
>> +{
>> +       spin_lock_init(&q->lock);
>> +       INIT_LIST_HEAD(&q->head);
>> +}
>> +
>>
> 
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index e767721b3a58..6254d0eef3af 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)
>>
>>          set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>>          sk_sockets_allocated_inc(sk);
>> +
>> +       tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
>>   }
> 
> 
> FYI,  tcp_init_sock() is not called for passive sockets.
> 
> syzbot would quite easily crash if zerocopy is used after accept()...

Thanks for the info. I will reuse the error queue in the next patch set,
and this line of code will be deleted.
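
For the record, a minimal user-space sequence that would have hit the problem
(sketch only; MSG_ZEROCOPY_UARG is the flag defined by this patch set,
everything else is the standard socket API):

```
#include <string.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY_UARG
#define MSG_ZEROCOPY_UARG 0x2000000	/* value from patch 1 */
#endif

void demo(int lfd)	/* lfd: TCP socket, already bound and listening */
{
	char buf[4096] = {};
	int one = 1;
	/* the child socket created by accept() does not go through
	 * tcp_init_sock(), so its tx_zcopy_queue is never initialized
	 */
	int cfd = accept(lfd, NULL, NULL);

	setsockopt(cfd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

	/* the first zerocopy completion would then walk the uninitialized
	 * queue in the callback
	 */
	send(cfd, buf, sizeof(buf), MSG_ZEROCOPY_UARG);
}
```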

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG
  2024-04-09 20:52 [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG zijianzhang
                   ` (2 preceding siblings ...)
  2024-04-09 20:53 ` [PATCH net-next 3/3] selftests: add msg_zerocopy_uarg test zijianzhang
@ 2024-04-10  8:46 ` Paolo Abeni
  2024-04-10 17:03   ` [External] " Zijian Zhang
  3 siblings, 1 reply; 24+ messages in thread
From: Paolo Abeni @ 2024-04-10  8:46 UTC (permalink / raw)
  To: zijianzhang, netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu

On Tue, 2024-04-09 at 20:52 +0000, zijianzhang@bytedance.com wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
> 
> Original notification mechanism needs poll + recvmmsg which is not
> easy for applications to accommodate. And, it also incurs unignorable
> overhead including extra system calls and usage of optmem.
> 
> While making maximum reuse of the existing MSG_ZEROCOPY related code,
> this patch set introduces zerocopy socket send flag MSG_ZEROCOPY_UARG.
> It provides a new notification method. Users of sendmsg pass a control
> message as a placeholder for the incoming notifications. Upon returning,
> kernel embeds notifications directly into user arguments passed in. By
> doing so, we can significantly reduce the complexity and overhead for
> managing notifications. In an ideal pattern, the user will keep calling
> sendmsg with MSG_ZEROCOPY_UARG flag, and the notification will be
> delivered as soon as possible.
> 
> MSG_ZEROCOPY_UARG does not need to queue skb into errqueue. Thus,
> skbuffs allocated from optmem are not a must. In theory, a new struct
> carrying the zcopy information should be defined along with its memory
> management code. However, existing zcopy generic code assumes the
> information is skbuff. Given the very limited performance gain or maybe
> no gain of this method, and the need to change a lot of existing code,
> we still use skbuffs allocated from optmem to carry zcopy information.
> 
> * Performance
> 
> I extend the selftests/msg_zerocopy.c to accommodate the new flag, test
> result is as follows, the new flag performs 7% better in TCP and 4%
> better in UDP.
> 
> cfg_notification_limit = 8
> +---------------------+---------+---------+---------+---------+
> > Test Type / Protocol| TCP v4  | TCP v6  | UDP v4  | UDP v6  |
> +---------------------+---------+---------+---------+---------+
> > Copy                | 5328    | 5159    | 8581    | 8457    |
> +---------------------+---------+---------+---------+---------+
> > ZCopy               | 5877    | 5568    | 10314   | 10091   |
> +---------------------+---------+---------+---------+---------+
> > New ZCopy           | 6254    | 5901    | 10674   | 10293   |
> +---------------------+---------+---------+---------+---------+
> > ZCopy / Copy        | 110.30% | 107.93% | 120.20% | 119.32% |
> +---------------------+---------+---------+---------+---------+
> > New ZCopy / Copy    | 117.38% | 114.38% | 124.39% | 121.71% |
> +---------------------+---------+---------+---------+---------+

Minor nit for a future revision: the relevant part here is the 'New
ZCopy / ZCopy' direct comparison, which is missing - even if inferable
from the above.  You should provide such data, and you could drop the
'ZCopy / Copy', 'New ZCopy / Copy', and possibly even 'Copy' lines.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
                     ` (2 preceding siblings ...)
  2024-04-09 21:23   ` Eric Dumazet
@ 2024-04-10  9:18   ` kernel test robot
  2024-04-10 12:11   ` kernel test robot
  2024-04-10 15:22   ` kernel test robot
  5 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-04-10  9:18 UTC (permalink / raw)
  To: zijianzhang, netdev
  Cc: oe-kbuild-all, edumazet, willemdebruijn.kernel, davem, kuba,
	cong.wang, xiaochun.lu, Zijian Zhang

Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/zijianzhang-bytedance-com/sock-add-MSG_ZEROCOPY_UARG/20240410-045616
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240409205300.1346681-2-zijianzhang%40bytedance.com
patch subject: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
config: alpha-defconfig (https://download.01.org/0day-ci/archive/20240410/202404101756.S6ltBayX-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240410/202404101756.S6ltBayX-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404101756.S6ltBayX-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/core/sock.c: In function '__sock_cmsg_send':
>> net/core/sock.c:2845:14: error: 'SO_ZEROCOPY_NOTIFICATION' undeclared (first use in this function)
    2845 |         case SO_ZEROCOPY_NOTIFICATION:
         |              ^~~~~~~~~~~~~~~~~~~~~~~~
   net/core/sock.c:2845:14: note: each undeclared identifier is reported only once for each function it appears in


vim +/SO_ZEROCOPY_NOTIFICATION +2845 net/core/sock.c

  2807	
  2808	int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
  2809			     struct sockcm_cookie *sockc)
  2810	{
  2811		u32 tsflags;
  2812	
  2813		switch (cmsg->cmsg_type) {
  2814		case SO_MARK:
  2815			if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
  2816			    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
  2817				return -EPERM;
  2818			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
  2819				return -EINVAL;
  2820			sockc->mark = *(u32 *)CMSG_DATA(cmsg);
  2821			break;
  2822		case SO_TIMESTAMPING_OLD:
  2823		case SO_TIMESTAMPING_NEW:
  2824			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
  2825				return -EINVAL;
  2826	
  2827			tsflags = *(u32 *)CMSG_DATA(cmsg);
  2828			if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK)
  2829				return -EINVAL;
  2830	
  2831			sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
  2832			sockc->tsflags |= tsflags;
  2833			break;
  2834		case SCM_TXTIME:
  2835			if (!sock_flag(sk, SOCK_TXTIME))
  2836				return -EINVAL;
  2837			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64)))
  2838				return -EINVAL;
  2839			sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg));
  2840			break;
  2841		/* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */
  2842		case SCM_RIGHTS:
  2843		case SCM_CREDENTIALS:
  2844			break;
> 2845		case SO_ZEROCOPY_NOTIFICATION:
  2846			if (sock_flag(sk, SOCK_ZEROCOPY)) {
  2847				int i = 0;
  2848				struct tx_usr_zcopy_info sys_zcopy_info;
  2849				struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
  2850				struct tx_msg_zcopy_queue *zcopy_queue;
  2851				struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
  2852				unsigned long flags;
  2853	
  2854				if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
  2855					return -EINVAL;
  2856	
  2857				if (sk_is_tcp(sk))
  2858					zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
  2859				else if (sk_is_udp(sk))
  2860					zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
  2861				else
  2862					return -EINVAL;
  2863	
  2864				spin_lock_irqsave(&zcopy_queue->lock, flags);
  2865				list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
  2866					sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
  2867					sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
  2868					sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
  2869					list_del(&zcopy_node_p->node);
  2870					zcopy_node_ps[i++] = zcopy_node_p;
  2871					if (i == SOCK_USR_ZC_INFO_MAX)
  2872						break;
  2873				}
  2874				spin_unlock_irqrestore(&zcopy_queue->lock, flags);
  2875	
  2876				if (i > 0) {
  2877					sys_zcopy_info.length = i;
  2878					if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
  2879								  &sys_zcopy_info,
  2880								  sizeof(sys_zcopy_info))
  2881						)) {
  2882						spin_lock_irqsave(&zcopy_queue->lock, flags);
  2883						while (i > 0)
  2884							list_add(&zcopy_node_ps[--i]->node,
  2885								 &zcopy_queue->head);
  2886						spin_unlock_irqrestore(&zcopy_queue->lock, flags);
  2887						return -EFAULT;
  2888					}
  2889	
  2890					while (i > 0)
  2891						consume_skb(zcopy_node_ps[--i]->skb);
  2892				}
  2893			}
  2894			break;
  2895		default:
  2896			return -EINVAL;
  2897		}
  2898		return 0;
  2899	}
  2900	EXPORT_SYMBOL(__sock_cmsg_send);
  2901	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
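
Not part of the robot report, but for context: alpha (like mips, parisc and
sparc) carries its own uapi socket.h instead of using
include/uapi/asm-generic/socket.h, so a likely follow-up is to mirror the new
define there with the same value as the generic header, roughly:

```
/* arch/alpha/include/uapi/asm/socket.h (sketch only, keep in sync with
 * the value added to include/uapi/asm-generic/socket.h in patch 1)
 */
#define SO_ZEROCOPY_NOTIFICATION 78
```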

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
                     ` (3 preceding siblings ...)
  2024-04-10  9:18   ` kernel test robot
@ 2024-04-10 12:11   ` kernel test robot
  2024-04-10 15:22   ` kernel test robot
  5 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-04-10 12:11 UTC (permalink / raw)
  To: zijianzhang, netdev
  Cc: oe-kbuild-all, edumazet, willemdebruijn.kernel, davem, kuba,
	cong.wang, xiaochun.lu, Zijian Zhang

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/zijianzhang-bytedance-com/sock-add-MSG_ZEROCOPY_UARG/20240410-045616
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240409205300.1346681-2-zijianzhang%40bytedance.com
patch subject: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
config: i386-randconfig-061-20240410 (https://download.01.org/0day-ci/archive/20240410/202404101954.FBZojOXG-lkp@intel.com/config)
compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240410/202404101954.FBZojOXG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404101954.FBZojOXG-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/core/sock.c:2878:37: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected void [noderef] __user *to @@     got void * @@
   net/core/sock.c:2878:37: sparse:     expected void [noderef] __user *to
   net/core/sock.c:2878:37: sparse:     got void *
   net/core/sock.c:2393:9: sparse: sparse: context imbalance in 'sk_clone_lock' - wrong count at exit
   net/core/sock.c:2397:6: sparse: sparse: context imbalance in 'sk_free_unlock_clone' - unexpected unlock
   net/core/sock.c:4083:13: sparse: sparse: context imbalance in 'proto_seq_start' - wrong count at exit
   net/core/sock.c:4095:13: sparse: sparse: context imbalance in 'proto_seq_stop' - wrong count at exit

vim +2878 net/core/sock.c

  2807	
  2808	int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
  2809			     struct sockcm_cookie *sockc)
  2810	{
  2811		u32 tsflags;
  2812	
  2813		switch (cmsg->cmsg_type) {
  2814		case SO_MARK:
  2815			if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
  2816			    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
  2817				return -EPERM;
  2818			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
  2819				return -EINVAL;
  2820			sockc->mark = *(u32 *)CMSG_DATA(cmsg);
  2821			break;
  2822		case SO_TIMESTAMPING_OLD:
  2823		case SO_TIMESTAMPING_NEW:
  2824			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
  2825				return -EINVAL;
  2826	
  2827			tsflags = *(u32 *)CMSG_DATA(cmsg);
  2828			if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK)
  2829				return -EINVAL;
  2830	
  2831			sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
  2832			sockc->tsflags |= tsflags;
  2833			break;
  2834		case SCM_TXTIME:
  2835			if (!sock_flag(sk, SOCK_TXTIME))
  2836				return -EINVAL;
  2837			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64)))
  2838				return -EINVAL;
  2839			sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg));
  2840			break;
  2841		/* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */
  2842		case SCM_RIGHTS:
  2843		case SCM_CREDENTIALS:
  2844			break;
  2845		case SO_ZEROCOPY_NOTIFICATION:
  2846			if (sock_flag(sk, SOCK_ZEROCOPY)) {
  2847				int i = 0;
  2848				struct tx_usr_zcopy_info sys_zcopy_info;
  2849				struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
  2850				struct tx_msg_zcopy_queue *zcopy_queue;
  2851				struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
  2852				unsigned long flags;
  2853	
  2854				if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
  2855					return -EINVAL;
  2856	
  2857				if (sk_is_tcp(sk))
  2858					zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
  2859				else if (sk_is_udp(sk))
  2860					zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
  2861				else
  2862					return -EINVAL;
  2863	
  2864				spin_lock_irqsave(&zcopy_queue->lock, flags);
  2865				list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
  2866					sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
  2867					sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
  2868					sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
  2869					list_del(&zcopy_node_p->node);
  2870					zcopy_node_ps[i++] = zcopy_node_p;
  2871					if (i == SOCK_USR_ZC_INFO_MAX)
  2872						break;
  2873				}
  2874				spin_unlock_irqrestore(&zcopy_queue->lock, flags);
  2875	
  2876				if (i > 0) {
  2877					sys_zcopy_info.length = i;
> 2878					if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
  2879								  &sys_zcopy_info,
  2880								  sizeof(sys_zcopy_info))
  2881						)) {
  2882						spin_lock_irqsave(&zcopy_queue->lock, flags);
  2883						while (i > 0)
  2884							list_add(&zcopy_node_ps[--i]->node,
  2885								 &zcopy_queue->head);
  2886						spin_unlock_irqrestore(&zcopy_queue->lock, flags);
  2887						return -EFAULT;
  2888					}
  2889	
  2890					while (i > 0)
  2891						consume_skb(zcopy_node_ps[--i]->skb);
  2892				}
  2893			}
  2894			break;
  2895		default:
  2896			return -EINVAL;
  2897		}
  2898		return 0;
  2899	}
  2900	EXPORT_SYMBOL(__sock_cmsg_send);
  2901	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
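
Not part of the robot report, but for context: the warning is about the
pointer read out of the cmsg being handed to copy_to_user() without a __user
annotation. A minimal sketch of the kind of change that would address it
(64-bit native-pointer case only):

```
/* net/core/sock.c, SO_ZEROCOPY_NOTIFICATION handling -- sketch only */
void __user *usr_addr = *(void __user **)CMSG_DATA(cmsg);

if (unlikely(copy_to_user(usr_addr, &sys_zcopy_info,
			  sizeof(sys_zcopy_info)))) {
	/* re-queue the nodes as in the patch, then */
	return -EFAULT;
}
```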

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
                     ` (4 preceding siblings ...)
  2024-04-10 12:11   ` kernel test robot
@ 2024-04-10 15:22   ` kernel test robot
  5 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-04-10 15:22 UTC (permalink / raw)
  To: zijianzhang, netdev
  Cc: oe-kbuild-all, edumazet, willemdebruijn.kernel, davem, kuba,
	cong.wang, xiaochun.lu, Zijian Zhang

Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/zijianzhang-bytedance-com/sock-add-MSG_ZEROCOPY_UARG/20240410-045616
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240409205300.1346681-2-zijianzhang%40bytedance.com
patch subject: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
config: x86_64-randconfig-r081-20240410 (https://download.01.org/0day-ci/archive/20240410/202404102247.qzN9N0Il-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240410/202404102247.qzN9N0Il-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404102247.qzN9N0Il-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from <built-in>:1:
   In file included from ./usr/include/linux/un.h:5:
>> usr/include/linux/socket.h:45:2: error: unknown type name '__u32'
      45 |         __u32 lo;
         |         ^
   usr/include/linux/socket.h:46:2: error: unknown type name '__u32'
      46 |         __u32 hi;
         |         ^
>> usr/include/linux/socket.h:47:2: error: unknown type name '__u8'
      47 |         __u8 zerocopy;
         |         ^
   3 errors generated.
--
   In file included from <built-in>:1:
>> ./usr/include/linux/socket.h:45:2: error: unknown type name '__u32'
      45 |         __u32 lo;
         |         ^
   ./usr/include/linux/socket.h:46:2: error: unknown type name '__u32'
      46 |         __u32 hi;
         |         ^
>> ./usr/include/linux/socket.h:47:2: error: unknown type name '__u8'
      47 |         __u8 zerocopy;
         |         ^
   3 errors generated.

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
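
Not part of the robot report, but for context: the headers_install check
fails because the new struct in include/uapi/linux/socket.h uses __u32/__u8
without pulling in the uapi type definitions. The usual fix is to include
<linux/types.h> in that header, roughly:

```
/* include/uapi/linux/socket.h -- sketch of the usual fix */
#include <linux/types.h>

struct tx_msg_zcopy_info {
	__u32 lo;
	__u32 hi;
	__u8 zerocopy;
};
```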

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-09 21:18   ` Eric Dumazet
@ 2024-04-10 17:01     ` Zijian Zhang
  2024-04-10 18:57       ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Zijian Zhang @ 2024-04-10 17:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu

On 4/9/24 2:18 PM, Eric Dumazet wrote:
> On Tue, Apr 9, 2024 at 10:53 PM <zijianzhang@bytedance.com> wrote:
>>
>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>
>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>> However, zerocopy is not a free lunch. Apart from the management of user
>> pages, the combination of poll + recvmsg to receive notifications incurs
>> unignorable overhead in the applications. The overhead of such sometimes
>> might be more than the CPU savings from zerocopy. We try to solve this
>> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
>> This new mechanism aims to reduce the overhead associated with receiving
>> notifications by embedding them directly into user arguments passed with
>> each sendmsg control message. By doing so, we can significantly reduce
>> the complexity and overhead for managing notifications. In an ideal
>> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
>> flag, and the notification will be delivered as soon as possible.
>>
>> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
>> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
>> ---
>>   include/linux/skbuff.h                  |   7 +-
>>   include/linux/socket.h                  |   1 +
>>   include/linux/tcp.h                     |   3 +
>>   include/linux/udp.h                     |   3 +
>>   include/net/sock.h                      |  17 +++
>>   include/net/udp.h                       |   1 +
>>   include/uapi/asm-generic/socket.h       |   2 +
>>   include/uapi/linux/socket.h             |  17 +++
>>   net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
>>   net/core/sock.c                         |  50 +++++++++
>>   net/ipv4/ip_output.c                    |   6 +-
>>   net/ipv4/tcp.c                          |   7 +-
>>   net/ipv4/udp.c                          |   9 ++
>>   net/ipv6/ip6_output.c                   |   5 +-
>>   net/ipv6/udp.c                          |   9 ++
>>   net/vmw_vsock/virtio_transport_common.c |   2 +-
>>   16 files changed, 258 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 03ea36a82cdd..19b94ba01007 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
>>   #endif
>>
>>   struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>> -                                      struct ubuf_info *uarg);
>> +                                      struct ubuf_info *uarg, bool user_args_notification);
>>
>>   void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
>>
>>   void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>                             bool success);
>> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>> +                          bool success);
>>
>>   int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>>                              struct sk_buff *skb, struct iov_iter *from,
>> @@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
>>   static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>   {
>>          if (uarg) {
>> -               if (uarg->callback == msg_zerocopy_callback)
>> +               if (uarg->callback == msg_zerocopy_callback ||
>> +                       uarg->callback == msg_zerocopy_uarg_callback)
>>                          msg_zerocopy_put_abort(uarg, have_uref);
>>                  else if (have_uref)
>>                          net_zcopy_put(uarg);
>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>> index 139c330ccf2c..de01392344e1 100644
>> --- a/include/linux/socket.h
>> +++ b/include/linux/socket.h
>> @@ -326,6 +326,7 @@ struct ucred {
>>                                            * plain text and require encryption
>>                                            */
>>
>> +#define MSG_ZEROCOPY_UARG      0x2000000 /* MSG_ZEROCOPY with UARG notifications */
>>   #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
>>   #define MSG_SPLICE_PAGES 0x8000000     /* Splice the pages from the iterator in sendmsg() */
>>   #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index 55399ee2a57e..e973f4990646 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -501,6 +501,9 @@ struct tcp_sock {
>>           */
>>          struct request_sock __rcu *fastopen_rsk;
>>          struct saved_syn *saved_syn;
>> +
>> +/* TCP MSG_ZEROCOPY_UARG related information */
>> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
>>   };
>>
>>   enum tsq_enum {
>> diff --git a/include/linux/udp.h b/include/linux/udp.h
>> index 3748e82b627b..502b393eac67 100644
>> --- a/include/linux/udp.h
>> +++ b/include/linux/udp.h
>> @@ -95,6 +95,9 @@ struct udp_sock {
>>
>>          /* Cache friendly copy of sk->sk_peek_off >= 0 */
>>          bool            peeking_with_offset;
>> +
>> +       /* This field is used by sendmsg zcopy user arg mode notification */
>> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
>>   };
>>
>>   #define udp_test_bit(nr, sk)                   \
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 2253eefe2848..f7c045e98213 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -544,6 +544,23 @@ struct sock {
>>          netns_tracker           ns_tracker;
>>   };
>>
>> +struct tx_msg_zcopy_node {
>> +       struct list_head node;
>> +       struct tx_msg_zcopy_info info;
>> +       struct sk_buff *skb;
>> +};
>> +
>> +struct tx_msg_zcopy_queue {
>> +       struct list_head head;
>> +       spinlock_t lock; /* protects head queue */
>> +};
>> +
>> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
>> +{
>> +       spin_lock_init(&q->lock);
>> +       INIT_LIST_HEAD(&q->head);
>> +}
>> +
>>   enum sk_pacing {
>>          SK_PACING_NONE          = 0,
>>          SK_PACING_NEEDED        = 1,
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index 488a6d2babcc..9e4d7b128de4 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
>>          skb_queue_head_init(&up->reader_queue);
>>          up->forward_threshold = sk->sk_rcvbuf >> 2;
>>          set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
>> +       tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
>>   }
>>
>>   /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
>> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
>> index 8ce8a39a1e5f..86aa4b5cb7f1 100644
>> --- a/include/uapi/asm-generic/socket.h
>> +++ b/include/uapi/asm-generic/socket.h
>> @@ -135,6 +135,8 @@
>>   #define SO_PASSPIDFD           76
>>   #define SO_PEERPIDFD           77
>>
>> +#define SO_ZEROCOPY_NOTIFICATION 78
>> +
>>   #if !defined(__KERNEL__)
>>
>>   #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
>> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
>> index d3fcd3b5ec53..469ed8f4e6c8 100644
>> --- a/include/uapi/linux/socket.h
>> +++ b/include/uapi/linux/socket.h
>> @@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
>>   #define SOCK_TXREHASH_DISABLED 0
>>   #define SOCK_TXREHASH_ENABLED  1
>>
>> +/*
>> + * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
>> + * back to user as soon as possible, 8 should be sufficient.
>> + */
>> +#define SOCK_USR_ZC_INFO_MAX 8
>> +
>> +struct tx_msg_zcopy_info {
>> +       __u32 lo;
>> +       __u32 hi;
>> +       __u8 zerocopy;
>> +};
>> +
>> +struct tx_usr_zcopy_info {
>> +       int length;
>> +       struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
>> +};
>> +
>>   #endif /* _UAPI_LINUX_SOCKET_H */
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 2a5ce6667bbb..d939b2c14d55 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
>>   }
>>   EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
>>
>> +static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
>> +{
>> +       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
>> +       uarg->len = 1;
>> +       uarg->bytelen = size;
>> +       uarg->zerocopy = 1;
>> +       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>> +       refcount_set(&uarg->ubuf.refcnt, 1);
>> +}
>> +
>>   static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>>   {
>>          struct ubuf_info_msgzc *uarg;
>> @@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>>          }
>>
>>          uarg->ubuf.callback = msg_zerocopy_callback;
>> -       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
>> -       uarg->len = 1;
>> -       uarg->bytelen = size;
>> -       uarg->zerocopy = 1;
>> -       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>> -       refcount_set(&uarg->ubuf.refcnt, 1);
>> +       init_ubuf_info_msgzc(uarg, sk, size);
>> +       sock_hold(sk);
>> +
>> +       return &uarg->ubuf;
>> +}
>> +
>> +static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
>> +{
>> +       struct sk_buff *skb;
>> +       struct ubuf_info_msgzc *uarg;
>> +       struct tx_msg_zcopy_node *zcopy_node_p;
>> +
>> +       WARN_ON_ONCE(!in_task());
>> +
>> +       skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
>> +       if (!skb)
>> +               return NULL;
>> +
>> +       BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
>> +       uarg = (void *)skb->cb;
>> +       uarg->mmp.user = NULL;
>> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
>> +
>> +       if (mm_account_pinned_pages(&uarg->mmp, size)) {
>> +               kfree_skb(skb);
>> +               return NULL;
>> +       }
>> +
>> +       INIT_LIST_HEAD(&zcopy_node_p->node);
>> +       zcopy_node_p->skb = skb;
>> +       uarg->ubuf.callback = msg_zerocopy_uarg_callback;
>> +       init_ubuf_info_msgzc(uarg, sk, size);
>>          sock_hold(sk);
>>
>>          return &uarg->ubuf;
>> @@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
>>   }
>>
>>   struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>> -                                      struct ubuf_info *uarg)
>> +                                      struct ubuf_info *uarg, bool usr_arg_notification)
>>   {
>>          if (uarg) {
>>                  struct ubuf_info_msgzc *uarg_zc;
>> @@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>                  u32 bytelen, next;
>>
>>                  /* there might be non MSG_ZEROCOPY users */
>> -               if (uarg->callback != msg_zerocopy_callback)
>> +               if (uarg->callback != msg_zerocopy_callback &&
>> +                   uarg->callback != msg_zerocopy_uarg_callback)
>>                          return NULL;
>>
>>                  /* realloc only when socket is locked (TCP, UDP cork),
>> @@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>          }
>>
>>   new_alloc:
>> +       if (usr_arg_notification)
>> +               return msg_zerocopy_uarg_alloc(sk, size);
>>          return msg_zerocopy_alloc(sk, size);
>>   }
>>   EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
>> @@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>   }
>>   EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
>>
>> +static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
>> +{
>> +       u32 old_lo, old_hi;
>> +       u64 sum_len;
>> +
>> +       old_lo = node->info.lo;
>> +       old_hi = node->info.hi;
>> +       sum_len = old_hi - old_lo + 1ULL + len;
>> +
>> +       if (sum_len >= (1ULL << 32))
>> +               return false;
>> +
>> +       if (lo != old_hi + 1)
>> +               return false;
>> +
>> +       node->info.hi += len;
>> +       return true;
>> +}
>> +
>> +static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
>> +{
>> +       struct sk_buff *skb = skb_from_uarg(uarg);
>> +       struct sock *sk = skb->sk;
>> +       struct tx_msg_zcopy_node *zcopy_node_p, *tail;
>> +       struct tx_msg_zcopy_queue *zcopy_queue;
>> +       unsigned long flags;
>> +       u32 lo, hi;
>> +       u16 len;
>> +
>> +       mm_unaccount_pinned_pages(&uarg->mmp);
>> +
>> +       /* if !len, there was only 1 call, and it was aborted
>> +        * so do not queue a completion notification
>> +        */
>> +       if (!uarg->len || sock_flag(sk, SOCK_DEAD))
>> +               goto release;
>> +
>> +       /* only support TCP and UDP currently */
>> +       if (sk_is_tcp(sk)) {
>> +               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
>> +       } else if (sk_is_udp(sk)) {
>> +               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
>> +       } else {
>> +               pr_warn("MSG_ZEROCOPY_UARG only support TCP && UDP sockets");
>> +               goto release;
>> +       }
>> +
>> +       len = uarg->len;
>> +       lo = uarg->id;
>> +       hi = uarg->id + len - 1;
>> +
>> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
>> +       zcopy_node_p->info.lo = lo;
>> +       zcopy_node_p->info.hi = hi;
>> +       zcopy_node_p->info.zerocopy = uarg->zerocopy;
>> +
>> +       spin_lock_irqsave(&zcopy_queue->lock, flags);
>> +       tail = list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
>> +       if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
>> +               list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
>> +               skb = NULL;
>> +       }
>> +       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>> +release:
>> +       consume_skb(skb);
>> +       sock_put(sk);
>> +}
>> +
>> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>> +                               bool success)
>> +{
>> +       struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
>> +
>> +       uarg_zc->zerocopy = uarg_zc->zerocopy & success;
>> +
>> +       if (refcount_dec_and_test(&uarg->refcnt))
>> +               __msg_zerocopy_uarg_callback(uarg_zc);
>> +}
>> +EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
>> +
>>   void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>   {
>>          struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
>> @@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>          uarg_to_msgzc(uarg)->len--;
>>
>>          if (have_uref)
>> -               msg_zerocopy_callback(NULL, uarg, true);
>> +               uarg->callback(NULL, uarg, true);
>>   }
>>   EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 5ed411231fc7..a00ebd71f6ed 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>>          case SCM_RIGHTS:
>>          case SCM_CREDENTIALS:
>>                  break;
>> +       case SO_ZEROCOPY_NOTIFICATION:
>> +               if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> +                       int i = 0;
>> +                       struct tx_usr_zcopy_info sys_zcopy_info;
>> +                       struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
>> +                       struct tx_msg_zcopy_queue *zcopy_queue;
>> +                       struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
>> +                       unsigned long flags;
>> +
>> +                       if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
>> +                               return -EINVAL;
>> +
>> +                       if (sk_is_tcp(sk))
>> +                               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
>> +                       else if (sk_is_udp(sk))
>> +                               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
>> +                       else
>> +                               return -EINVAL;
>> +
>> +                       spin_lock_irqsave(&zcopy_queue->lock, flags);
>> +                       list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
>> +                               sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
>> +                               sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
>> +                               sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
>> +                               list_del(&zcopy_node_p->node);
>> +                               zcopy_node_ps[i++] = zcopy_node_p;
>> +                               if (i == SOCK_USR_ZC_INFO_MAX)
>> +                                       break;
>> +                       }
>> +                       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>> +
>> +                       if (i > 0) {
>> +                               sys_zcopy_info.length = i;
>> +                               if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
> 
> This is going to break if user space is 32bit, and kernel 64bit ?
> 

Nice catch, thanks for pointing this out!

I may update the code like this?

```
if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)) &&
    cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
	return -EINVAL;

...

void __user *user;

if (cmsg->cmsg_len == CMSG_LEN(sizeof(void *))) {
	/* user space and kernel have the same pointer width */
	user = *(void __user **)CMSG_DATA(cmsg);
} else {
	/* 32-bit user space on a 64-bit kernel passes a u32 */
	u32 val = *(u32 *)CMSG_DATA(cmsg);

	user = (void __user *)(uintptr_t)val;
}
copy_to_user(user, ...)
```

I assume it is the user's responsibility to pass in the right address and
cmsg_len. A malicious or invalid address resulting from a wrong cmsg_len
will be caught by copy_to_user.
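
For illustration only (not part of the patch), the intended user-space call
pattern with this cmsg might look roughly like the sketch below; it assumes
the 64-bit native-pointer case, and SO_ZEROCOPY_NOTIFICATION,
MSG_ZEROCOPY_UARG and struct tx_usr_zcopy_info are taken from patch 1:

```
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/socket.h>	/* patched header: struct tx_usr_zcopy_info */

#ifndef MSG_ZEROCOPY_UARG
#define MSG_ZEROCOPY_UARG 0x2000000		/* value from patch 1 */
#endif
#ifndef SO_ZEROCOPY_NOTIFICATION
#define SO_ZEROCOPY_NOTIFICATION 78		/* value from patch 1 */
#endif

static ssize_t send_zc_uarg(int fd, struct iovec *iov,
			    struct tx_usr_zcopy_info *zc_info)
{
	char control[CMSG_SPACE(sizeof(void *))] = {};
	struct msghdr msg = {};
	struct cmsghdr *cm;
	void *ptr = zc_info;

	/* the kernel only writes the struct when notifications are pending,
	 * so clear the count before every call
	 */
	zc_info->length = 0;

	msg.msg_iov = iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SO_ZEROCOPY_NOTIFICATION;
	cm->cmsg_len = CMSG_LEN(sizeof(void *));
	memcpy(CMSG_DATA(cm), &ptr, sizeof(ptr));

	/* on return, zc_info->info[0 .. length - 1] holds the completed
	 * [lo, hi] ranges; zerocopy == 0 means the kernel fell back to
	 * copying for that range
	 */
	return sendmsg(fd, &msg, MSG_ZEROCOPY_UARG);
}
```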

> Also SOCK_USR_ZC_INFO_MAX is put in stone, it won't change in the future.
> 

In the scenario where users keep calling sendmsg, most of the time the
user will get one notification from the previous sendmsg; sometimes,
users get two notifications. Thus, we give SOCK_USR_ZC_INFO_MAX a small
but sufficient value of 8. We put it in stone here to avoid dynamic
allocation and memory management of the array.

If the memory management is worth it, we could change it to a sysctl
variable?

Or, how about we keep SOCK_USR_ZC_INFO_MAX as the maximum here, and let
users pass in the length of info they want to get (any number up to that
maximum)?
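
A rough sketch of that second idea, purely for illustration (not a concrete
proposal):

```
/* Hypothetical variant: the caller sizes the array and reports its
 * capacity in 'length'; the kernel clamps the capacity to
 * SOCK_USR_ZC_INFO_MAX and overwrites 'length' with the number of
 * entries actually filled.
 */
struct tx_usr_zcopy_info {
	int length;				/* in: capacity, out: filled */
	struct tx_msg_zcopy_info info[];	/* caller-allocated */
};
```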

>> +                                                         &sys_zcopy_info,
>> +                                                         sizeof(sys_zcopy_info))
> 
> You are leaking to user space part of kernel stack, if (i <
> SOCK_USR_ZC_INFO_MAX)
> 

Agreed, I should memset the whole struct first so that no uninitialized
stack bytes are copied to user space.
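
Something along these lines before filling the entries (sketch only):

```
memset(&sys_zcopy_info, 0, sizeof(sys_zcopy_info));
```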

>> +                                       )) {
>> +                                       spin_lock_irqsave(&zcopy_queue->lock, flags);
>> +                                       while (i > 0)
>> +                                               list_add(&zcopy_node_ps[--i]->node,
>> +                                                        &zcopy_queue->head);
> 
> 
> 
>> +                                       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>> +                                       return -EFAULT;
>> +                               }
>> +
>> +                               while (i > 0)
>> +                                       consume_skb(zcopy_node_ps[--i]->skb);
>> +                       }
>> +               }
>> +               break;
>>          default:
>>                  return -EINVAL;
>>          }
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index 1fe794967211..5adb737c4c01 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -1005,7 +1005,7 @@ static int __ip_append_data(struct sock *sk,
>>              (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
>>                  csummode = CHECKSUM_PARTIAL;
>>
>> -       if ((flags & MSG_ZEROCOPY) && length) {
>> +       if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>>                  struct msghdr *msg = from;
>>
>>                  if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
>> @@ -1022,7 +1022,9 @@ static int __ip_append_data(struct sock *sk,
>>                                  uarg = msg->msg_ubuf;
>>                          }
>>                  } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> -                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
>> +                       bool user_args = flags & MSG_ZEROCOPY_UARG;
>> +
>> +                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), user_args);
>>                          if (!uarg)
>>                                  return -ENOBUFS;
>>                          extra_uref = !skb_zcopy(skb);   /* only ref on new uarg */
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index e767721b3a58..6254d0eef3af 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)
>>
>>          set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>>          sk_sockets_allocated_inc(sk);
>> +
>> +       tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
>>   }
>>   EXPORT_SYMBOL(tcp_init_sock);
>>
>> @@ -1050,14 +1052,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>>
>>          flags = msg->msg_flags;
>>
>> -       if ((flags & MSG_ZEROCOPY) && size) {
>> +       if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && size) {
>>                  if (msg->msg_ubuf) {
>>                          uarg = msg->msg_ubuf;
>>                          if (sk->sk_route_caps & NETIF_F_SG)
>>                                  zc = MSG_ZEROCOPY;
>>                  } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> +                       bool zc_uarg = flags & MSG_ZEROCOPY_UARG;
>>                          skb = tcp_write_queue_tail(sk);
>> -                       uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
>> +                       uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), zc_uarg);
>>                          if (!uarg) {
>>                                  err = -ENOBUFS;
>>                                  goto out_err;
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 11460d751e73..6c62aacd74d6 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -1126,6 +1126,15 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>                  if (ipc.opt)
>>                          free = 1;
>>                  connected = 0;
>> +
>> +               /* If len is zero and flag MSG_ZEROCOPY_UARG is set,
>> +                * it means this call just wants to get zcopy notifications
>> +                * instead of sending packets. It is useful when users
>> +                * finish sending and want to get trailing notifications.
>> +                */
>> +               if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
>> +                   sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
>> +                       return 0;
>>          }
>>          if (!ipc.opt) {
>>                  struct ip_options_rcu *inet_opt;
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index b9dd3a66e423..891526ddd74c 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -1493,7 +1493,7 @@ static int __ip6_append_data(struct sock *sk,
>>              rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
>>                  csummode = CHECKSUM_PARTIAL;
>>
>> -       if ((flags & MSG_ZEROCOPY) && length) {
>> +       if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
>>                  struct msghdr *msg = from;
>>
>>                  if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
>> @@ -1510,7 +1510,8 @@ static int __ip6_append_data(struct sock *sk,
>>                                  uarg = msg->msg_ubuf;
>>                          }
>>                  } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> -                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
>> +                       uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb),
>> +                                                   flags & MSG_ZEROCOPY_UARG);
>>                          if (!uarg)
>>                                  return -ENOBUFS;
>>                          extra_uref = !skb_zcopy(skb);   /* only ref on new uarg */
>> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
>> index 2e4dc5e6137b..98f6905c5db9 100644
>> --- a/net/ipv6/udp.c
>> +++ b/net/ipv6/udp.c
>> @@ -1490,6 +1490,15 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>                  if (!(opt->opt_nflen|opt->opt_flen))
>>                          opt = NULL;
>>                  connected = false;
>> +
>> +               /* If len is zero and flag MSG_ZEROCOPY_UARG is set,
>> +                * it means this call just wants to get zcopy notifications
>> +                * instead of sending packets. It is useful when users
>> +                * finish sending and want to get trailing notifications.
>> +                */
>> +               if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
>> +                   sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
>> +                       return 0;
>>          }
>>          if (!opt) {
>>                  opt = txopt_get(np);
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 16ff976a86e3..d6e6830f6ffe 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -84,7 +84,7 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>
>>                  uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>                                              iter->count,
>> -                                           NULL);
>> +                                           NULL, false);
>>                  if (!uarg)
>>                          return -1;
>>
>> --
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG
  2024-04-10  8:46 ` [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG Paolo Abeni
@ 2024-04-10 17:03   ` Zijian Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Zijian Zhang @ 2024-04-10 17:03 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: edumazet, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu

On 4/10/24 1:46 AM, Paolo Abeni wrote:
> On Tue, 2024-04-09 at 20:52 +0000, zijianzhang@bytedance.com wrote:
>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>
>> Original notification mechanism needs poll + recvmmsg which is not
>> easy for applications to accommodate. And, it also incurs unignorable
>> overhead including extra system calls and usage of optmem.
>>
>> While making maximum reuse of the existing MSG_ZEROCOPY related code,
>> this patch set introduces zerocopy socket send flag MSG_ZEROCOPY_UARG.
>> It provides a new notification method. Users of sendmsg pass a control
>> message as a placeholder for the incoming notifications. Upon returning,
>> kernel embeds notifications directly into user arguments passed in. By
>> doing so, we can significantly reduce the complexity and overhead for
>> managing notifications. In an ideal pattern, the user will keep calling
>> sendmsg with MSG_ZEROCOPY_UARG flag, and the notification will be
>> delivered as soon as possible.
>>
>> MSG_ZEROCOPY_UARG does not need to queue skb into errqueue. Thus,
>> skbuffs allocated from optmem are not a must. In theory, a new struct
>> carrying the zcopy information should be defined along with its memory
>> management code. However, existing zcopy generic code assumes the
>> information is skbuff. Given the very limited performance gain or maybe
>> no gain of this method, and the need to change a lot of existing code,
>> we still use skbuffs allocated from optmem to carry zcopy information.
>>
>> * Performance
>>
>> I extend the selftests/msg_zerocopy.c to accommodate the new flag, test
>> result is as follows, the new flag performs 7% better in TCP and 4%
>> better in UDP.
>>
>> cfg_notification_limit = 8
>> +---------------------+---------+---------+---------+---------+
>> | Test Type / Protocol| TCP v4  | TCP v6  | UDP v4  | UDP v6  |
>> +---------------------+---------+---------+---------+---------+
>> | Copy                | 5328    | 5159    | 8581    | 8457    |
>> +---------------------+---------+---------+---------+---------+
>> | ZCopy               | 5877    | 5568    | 10314   | 10091   |
>> +---------------------+---------+---------+---------+---------+
>> | New ZCopy           | 6254    | 5901    | 10674   | 10293   |
>> +---------------------+---------+---------+---------+---------+
>> | ZCopy / Copy        | 110.30% | 107.93% | 120.20% | 119.32% |
>> +---------------------+---------+---------+---------+---------+
>> | New ZCopy / Copy    | 117.38% | 114.38% | 124.39% | 121.71% |
>> +---------------------+---------+---------+---------+---------+
> 
> Minor nit for a future revision: the relevant part here is the 'New
> ZCopy/ ZCopy' direct comparison, which is missing - even if inferable
> from the above.  You should provide such data, and you could drop the
> 'ZCopy / Copy' 'New ZCopy / Copy' and possibly even 'Copy' lines.
> 
> Thanks,
> 
> Paolo
> 

Thanks for the advice, I will update in the next version :)
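
For reference, computed from the table above, the direct New ZCopy / ZCopy
comparison works out to roughly 106.4% (TCP v4), 106.0% (TCP v6),
103.5% (UDP v4) and 102.0% (UDP v6).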

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-10 17:01     ` [External] " Zijian Zhang
@ 2024-04-10 18:57       ` Eric Dumazet
  2024-04-10 21:34         ` Zijian Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-04-10 18:57 UTC (permalink / raw)
  To: Zijian Zhang
  Cc: netdev, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu

On Wed, Apr 10, 2024 at 7:01 PM Zijian Zhang <zijianzhang@bytedance.com> wrote:
>
> On 4/9/24 2:18 PM, Eric Dumazet wrote:
> > On Tue, Apr 9, 2024 at 10:53 PM <zijianzhang@bytedance.com> wrote:
> >>
> >> From: Zijian Zhang <zijianzhang@bytedance.com>
> >>
> >> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
> >> However, zerocopy is not a free lunch. Apart from the management of user
> >> pages, the combination of poll + recvmsg to receive notifications incurs
> >> unignorable overhead in the applications. The overhead of such sometimes
> >> might be more than the CPU savings from zerocopy. We try to solve this
> >> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
> >> This new mechanism aims to reduce the overhead associated with receiving
> >> notifications by embedding them directly into user arguments passed with
> >> each sendmsg control message. By doing so, we can significantly reduce
> >> the complexity and overhead for managing notifications. In an ideal
> >> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
> >> flag, and the notification will be delivered as soon as possible.
> >>
> >> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> >> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
> >> ---
> >>   include/linux/skbuff.h                  |   7 +-
> >>   include/linux/socket.h                  |   1 +
> >>   include/linux/tcp.h                     |   3 +
> >>   include/linux/udp.h                     |   3 +
> >>   include/net/sock.h                      |  17 +++
> >>   include/net/udp.h                       |   1 +
> >>   include/uapi/asm-generic/socket.h       |   2 +
> >>   include/uapi/linux/socket.h             |  17 +++
> >>   net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
> >>   net/core/sock.c                         |  50 +++++++++
> >>   net/ipv4/ip_output.c                    |   6 +-
> >>   net/ipv4/tcp.c                          |   7 +-
> >>   net/ipv4/udp.c                          |   9 ++
> >>   net/ipv6/ip6_output.c                   |   5 +-
> >>   net/ipv6/udp.c                          |   9 ++
> >>   net/vmw_vsock/virtio_transport_common.c |   2 +-
> >>   16 files changed, 258 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 03ea36a82cdd..19b94ba01007 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
> >>   #endif
> >>
> >>   struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> >> -                                      struct ubuf_info *uarg);
> >> +                                      struct ubuf_info *uarg, bool user_args_notification);
> >>
> >>   void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
> >>
> >>   void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> >>                             bool success);
> >> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> >> +                          bool success);
> >>
> >>   int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> >>                              struct sk_buff *skb, struct iov_iter *from,
> >> @@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
> >>   static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
> >>   {
> >>          if (uarg) {
> >> -               if (uarg->callback == msg_zerocopy_callback)
> >> +               if (uarg->callback == msg_zerocopy_callback ||
> >> +                       uarg->callback == msg_zerocopy_uarg_callback)
> >>                          msg_zerocopy_put_abort(uarg, have_uref);
> >>                  else if (have_uref)
> >>                          net_zcopy_put(uarg);
> >> diff --git a/include/linux/socket.h b/include/linux/socket.h
> >> index 139c330ccf2c..de01392344e1 100644
> >> --- a/include/linux/socket.h
> >> +++ b/include/linux/socket.h
> >> @@ -326,6 +326,7 @@ struct ucred {
> >>                                            * plain text and require encryption
> >>                                            */
> >>
> >> +#define MSG_ZEROCOPY_UARG      0x2000000 /* MSG_ZEROCOPY with UARG notifications */
> >>   #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
> >>   #define MSG_SPLICE_PAGES 0x8000000     /* Splice the pages from the iterator in sendmsg() */
> >>   #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
> >> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> >> index 55399ee2a57e..e973f4990646 100644
> >> --- a/include/linux/tcp.h
> >> +++ b/include/linux/tcp.h
> >> @@ -501,6 +501,9 @@ struct tcp_sock {
> >>           */
> >>          struct request_sock __rcu *fastopen_rsk;
> >>          struct saved_syn *saved_syn;
> >> +
> >> +/* TCP MSG_ZEROCOPY_UARG related information */
> >> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
> >>   };
> >>
> >>   enum tsq_enum {
> >> diff --git a/include/linux/udp.h b/include/linux/udp.h
> >> index 3748e82b627b..502b393eac67 100644
> >> --- a/include/linux/udp.h
> >> +++ b/include/linux/udp.h
> >> @@ -95,6 +95,9 @@ struct udp_sock {
> >>
> >>          /* Cache friendly copy of sk->sk_peek_off >= 0 */
> >>          bool            peeking_with_offset;
> >> +
> >> +       /* This field is used by sendmsg zcopy user arg mode notification */
> >> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
> >>   };
> >>
> >>   #define udp_test_bit(nr, sk)                   \
> >> diff --git a/include/net/sock.h b/include/net/sock.h
> >> index 2253eefe2848..f7c045e98213 100644
> >> --- a/include/net/sock.h
> >> +++ b/include/net/sock.h
> >> @@ -544,6 +544,23 @@ struct sock {
> >>          netns_tracker           ns_tracker;
> >>   };
> >>
> >> +struct tx_msg_zcopy_node {
> >> +       struct list_head node;
> >> +       struct tx_msg_zcopy_info info;
> >> +       struct sk_buff *skb;
> >> +};
> >> +
> >> +struct tx_msg_zcopy_queue {
> >> +       struct list_head head;
> >> +       spinlock_t lock; /* protects head queue */
> >> +};
> >> +
> >> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
> >> +{
> >> +       spin_lock_init(&q->lock);
> >> +       INIT_LIST_HEAD(&q->head);
> >> +}
> >> +
> >>   enum sk_pacing {
> >>          SK_PACING_NONE          = 0,
> >>          SK_PACING_NEEDED        = 1,
> >> diff --git a/include/net/udp.h b/include/net/udp.h
> >> index 488a6d2babcc..9e4d7b128de4 100644
> >> --- a/include/net/udp.h
> >> +++ b/include/net/udp.h
> >> @@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
> >>          skb_queue_head_init(&up->reader_queue);
> >>          up->forward_threshold = sk->sk_rcvbuf >> 2;
> >>          set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> >> +       tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
> >>   }
> >>
> >>   /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
> >> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
> >> index 8ce8a39a1e5f..86aa4b5cb7f1 100644
> >> --- a/include/uapi/asm-generic/socket.h
> >> +++ b/include/uapi/asm-generic/socket.h
> >> @@ -135,6 +135,8 @@
> >>   #define SO_PASSPIDFD           76
> >>   #define SO_PEERPIDFD           77
> >>
> >> +#define SO_ZEROCOPY_NOTIFICATION 78
> >> +
> >>   #if !defined(__KERNEL__)
> >>
> >>   #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
> >> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
> >> index d3fcd3b5ec53..469ed8f4e6c8 100644
> >> --- a/include/uapi/linux/socket.h
> >> +++ b/include/uapi/linux/socket.h
> >> @@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
> >>   #define SOCK_TXREHASH_DISABLED 0
> >>   #define SOCK_TXREHASH_ENABLED  1
> >>
> >> +/*
> >> + * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
> >> + * back to user as soon as possible, 8 should be sufficient.
> >> + */
> >> +#define SOCK_USR_ZC_INFO_MAX 8
> >> +
> >> +struct tx_msg_zcopy_info {
> >> +       __u32 lo;
> >> +       __u32 hi;
> >> +       __u8 zerocopy;
> >> +};
> >> +
> >> +struct tx_usr_zcopy_info {
> >> +       int length;
> >> +       struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
> >> +};
> >> +
> >>   #endif /* _UAPI_LINUX_SOCKET_H */
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index 2a5ce6667bbb..d939b2c14d55 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
> >>   }
> >>   EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
> >>
> >> +static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
> >> +{
> >> +       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
> >> +       uarg->len = 1;
> >> +       uarg->bytelen = size;
> >> +       uarg->zerocopy = 1;
> >> +       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> >> +       refcount_set(&uarg->ubuf.refcnt, 1);
> >> +}
> >> +
> >>   static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
> >>   {
> >>          struct ubuf_info_msgzc *uarg;
> >> @@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
> >>          }
> >>
> >>          uarg->ubuf.callback = msg_zerocopy_callback;
> >> -       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
> >> -       uarg->len = 1;
> >> -       uarg->bytelen = size;
> >> -       uarg->zerocopy = 1;
> >> -       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> >> -       refcount_set(&uarg->ubuf.refcnt, 1);
> >> +       init_ubuf_info_msgzc(uarg, sk, size);
> >> +       sock_hold(sk);
> >> +
> >> +       return &uarg->ubuf;
> >> +}
> >> +
> >> +static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
> >> +{
> >> +       struct sk_buff *skb;
> >> +       struct ubuf_info_msgzc *uarg;
> >> +       struct tx_msg_zcopy_node *zcopy_node_p;
> >> +
> >> +       WARN_ON_ONCE(!in_task());
> >> +
> >> +       skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
> >> +       if (!skb)
> >> +               return NULL;
> >> +
> >> +       BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
> >> +       uarg = (void *)skb->cb;
> >> +       uarg->mmp.user = NULL;
> >> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
> >> +
> >> +       if (mm_account_pinned_pages(&uarg->mmp, size)) {
> >> +               kfree_skb(skb);
> >> +               return NULL;
> >> +       }
> >> +
> >> +       INIT_LIST_HEAD(&zcopy_node_p->node);
> >> +       zcopy_node_p->skb = skb;
> >> +       uarg->ubuf.callback = msg_zerocopy_uarg_callback;
> >> +       init_ubuf_info_msgzc(uarg, sk, size);
> >>          sock_hold(sk);
> >>
> >>          return &uarg->ubuf;
> >> @@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
> >>   }
> >>
> >>   struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> >> -                                      struct ubuf_info *uarg)
> >> +                                      struct ubuf_info *uarg, bool usr_arg_notification)
> >>   {
> >>          if (uarg) {
> >>                  struct ubuf_info_msgzc *uarg_zc;
> >> @@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> >>                  u32 bytelen, next;
> >>
> >>                  /* there might be non MSG_ZEROCOPY users */
> >> -               if (uarg->callback != msg_zerocopy_callback)
> >> +               if (uarg->callback != msg_zerocopy_callback &&
> >> +                   uarg->callback != msg_zerocopy_uarg_callback)
> >>                          return NULL;
> >>
> >>                  /* realloc only when socket is locked (TCP, UDP cork),
> >> @@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> >>          }
> >>
> >>   new_alloc:
> >> +       if (usr_arg_notification)
> >> +               return msg_zerocopy_uarg_alloc(sk, size);
> >>          return msg_zerocopy_alloc(sk, size);
> >>   }
> >>   EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
> >> @@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> >>   }
> >>   EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
> >>
> >> +static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
> >> +{
> >> +       u32 old_lo, old_hi;
> >> +       u64 sum_len;
> >> +
> >> +       old_lo = node->info.lo;
> >> +       old_hi = node->info.hi;
> >> +       sum_len = old_hi - old_lo + 1ULL + len;
> >> +
> >> +       if (sum_len >= (1ULL << 32))
> >> +               return false;
> >> +
> >> +       if (lo != old_hi + 1)
> >> +               return false;
> >> +
> >> +       node->info.hi += len;
> >> +       return true;
> >> +}
> >> +
> >> +static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
> >> +{
> >> +       struct sk_buff *skb = skb_from_uarg(uarg);
> >> +       struct sock *sk = skb->sk;
> >> +       struct tx_msg_zcopy_node *zcopy_node_p, *tail;
> >> +       struct tx_msg_zcopy_queue *zcopy_queue;
> >> +       unsigned long flags;
> >> +       u32 lo, hi;
> >> +       u16 len;
> >> +
> >> +       mm_unaccount_pinned_pages(&uarg->mmp);
> >> +
> >> +       /* if !len, there was only 1 call, and it was aborted
> >> +        * so do not queue a completion notification
> >> +        */
> >> +       if (!uarg->len || sock_flag(sk, SOCK_DEAD))
> >> +               goto release;
> >> +
> >> +       /* only support TCP and UCP currently */
> >> +       if (sk_is_tcp(sk)) {
> >> +               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
> >> +       } else if (sk_is_udp(sk)) {
> >> +               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
> >> +       } else {
> >> +               pr_warn("MSG_ZEROCOPY_UARG only support TCP && UDP sockets");
> >> +               goto release;
> >> +       }
> >> +
> >> +       len = uarg->len;
> >> +       lo = uarg->id;
> >> +       hi = uarg->id + len - 1;
> >> +
> >> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
> >> +       zcopy_node_p->info.lo = lo;
> >> +       zcopy_node_p->info.hi = hi;
> >> +       zcopy_node_p->info.zerocopy = uarg->zerocopy;
> >> +
> >> +       spin_lock_irqsave(&zcopy_queue->lock, flags);
> >> +       tail = list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
> >> +       if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
> >> +               list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
> >> +               skb = NULL;
> >> +       }
> >> +       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> >> +release:
> >> +       consume_skb(skb);
> >> +       sock_put(sk);
> >> +}
> >> +
> >> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
> >> +                               bool success)
> >> +{
> >> +       struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
> >> +
> >> +       uarg_zc->zerocopy = uarg_zc->zerocopy & success;
> >> +
> >> +       if (refcount_dec_and_test(&uarg->refcnt))
> >> +               __msg_zerocopy_uarg_callback(uarg_zc);
> >> +}
> >> +EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
> >> +
> >>   void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
> >>   {
> >>          struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
> >> @@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
> >>          uarg_to_msgzc(uarg)->len--;
> >>
> >>          if (have_uref)
> >> -               msg_zerocopy_callback(NULL, uarg, true);
> >> +               uarg->callback(NULL, uarg, true);
> >>   }
> >>   EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
> >>
> >> diff --git a/net/core/sock.c b/net/core/sock.c
> >> index 5ed411231fc7..a00ebd71f6ed 100644
> >> --- a/net/core/sock.c
> >> +++ b/net/core/sock.c
> >> @@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
> >>          case SCM_RIGHTS:
> >>          case SCM_CREDENTIALS:
> >>                  break;
> >> +       case SO_ZEROCOPY_NOTIFICATION:
> >> +               if (sock_flag(sk, SOCK_ZEROCOPY)) {
> >> +                       int i = 0;
> >> +                       struct tx_usr_zcopy_info sys_zcopy_info;
> >> +                       struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
> >> +                       struct tx_msg_zcopy_queue *zcopy_queue;
> >> +                       struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
> >> +                       unsigned long flags;
> >> +
> >> +                       if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
> >> +                               return -EINVAL;
> >> +
> >> +                       if (sk_is_tcp(sk))
> >> +                               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
> >> +                       else if (sk_is_udp(sk))
> >> +                               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
> >> +                       else
> >> +                               return -EINVAL;
> >> +
> >> +                       spin_lock_irqsave(&zcopy_queue->lock, flags);
> >> +                       list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
> >> +                               sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
> >> +                               sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
> >> +                               sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
> >> +                               list_del(&zcopy_node_p->node);
> >> +                               zcopy_node_ps[i++] = zcopy_node_p;
> >> +                               if (i == SOCK_USR_ZC_INFO_MAX)
> >> +                                       break;
> >> +                       }
> >> +                       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
> >> +
> >> +                       if (i > 0) {
> >> +                               sys_zcopy_info.length = i;
> >> +                               if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
> >
> > This is going to break if user space is 32bit, and kernel 64bit ?
> >
>
> Nice catch, thanks for pointing this out!
>
> I may update the code like this?
>
> ```
> if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)) &&
>      cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
>      return -EINVAL;
>
> ...

I would instead use a u64 convention, rather than "void *" or u32.
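
For illustration, a rough and untested sketch of that convention (the sender
puts the buffer address in the cmsg payload as a __u64, and the kernel
converts it back with u64_to_user_ptr()) could look like:

```
case SO_ZEROCOPY_NOTIFICATION: {
	__u64 addr;
	void __user *usr_addr;

	/* fixed-size payload, independent of the user pointer width */
	if (cmsg->cmsg_len != CMSG_LEN(sizeof(__u64)))
		return -EINVAL;

	/* CMSG_DATA() may be unaligned, so copy instead of dereferencing */
	memcpy(&addr, CMSG_DATA(cmsg), sizeof(addr));
	usr_addr = u64_to_user_ptr(addr);

	/* ... fill sys_zcopy_info as in the patch, then ... */
	if (unlikely(copy_to_user(usr_addr, &sys_zcopy_info,
				  sizeof(sys_zcopy_info))))
		return -EFAULT;
	break;
}
```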

>
> void __user *user;
> if (cmsg->cmsg_len == 8)
>      user = *(void **)CMSG_DATA(cmsg);
> else if (cmsg->cmsg_len == 4) {
>      u32 val = *(u32 *)CMSG_DATA(cmsg);
>      user = (void *)(uintptr_t)val;
> }
> copy_to_user(user, ...)
> ```
>
> I assume it's user's duty to pass in the right address and cmsg_len.
> Malicious address or invalid address because of wrong cmsg_len will be
> protected by copy_to_user.



>
> > Also SOCK_USR_ZC_INFO_MAX is put in stone, it won't change in the future.
> >
>
> In the scenario where users keep calling sendmsg, at most time, the user
> will get one notification from the previous sendmsg. Sometimes, users
> get two notifications. Thus, we give SOCK_USR_ZC_INFO_MAX a small but
> sufficient number 8. We put it in stone here to avoid dynamic allocation
> and memory management of the array.
>
> If the memory management worth it, we can change it to a sysctl
> variable?

The convention for user space -> kernel space is to pass not only a base
pointer, but also a length.

Of course the kernel implementation can still use a local/automatic
array on the stack with 8 elems, but this could change eventually.

A few system calls that did not follow this convention had to be
extended into new system calls.
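
For illustration, the cmsg payload could then carry both pieces explicitly,
e.g. something like this (names are hypothetical, untested sketch):

```
/* hypothetical uapi payload for SO_ZEROCOPY_NOTIFICATION */
struct tx_zcopy_info_req {
	__u64 info_addr;	/* user buffer holding struct tx_msg_zcopy_info[] */
	__u32 info_len;		/* capacity of that buffer, in entries */
	__u32 reserved;		/* pad to an 8-byte multiple */
};
```

and the kernel would copy at most info_len entries, whatever internal limit
it happens to use today.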

>
> Or, how about we keep SOCK_USR_ZC_INFO_MAX, the max value here, and let
> the users pass in the length (could be any number less than MAX) of info
> they want to get?
>
> >> +                                                         &sys_zcopy_info,
> >> +                                                         sizeof(sys_zcopy_info))
> >
> > Your are leaking to user space part of kernel stack, if (i <
> > SOCK_USR_ZC_INFO_MAX)
> >
>
> Agree, I should do memset to clear the array.

Or copy only the part that has been initialized; struct_size() might help ...
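
For illustration, copying only the initialized prefix could look roughly like
this (untested sketch; usr_addr stands for whatever user pointer the cmsg
carries, and struct_size() would fit naturally if info[] became a flexible
array member):

```
if (i > 0) {
	size_t copy_len;

	sys_zcopy_info.length = i;
	/* only the first i entries of info[] were written above;
	 * note: padding inside each written entry still needs zeroing
	 */
	copy_len = offsetof(struct tx_usr_zcopy_info, info) +
		   i * sizeof(sys_zcopy_info.info[0]);
	if (unlikely(copy_to_user(usr_addr, &sys_zcopy_info, copy_len)))
		return -EFAULT;
}
```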

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
  2024-04-10 18:57       ` Eric Dumazet
@ 2024-04-10 21:34         ` Zijian Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Zijian Zhang @ 2024-04-10 21:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, willemdebruijn.kernel, davem, kuba, cong.wang, xiaochun.lu

On 4/10/24 11:57 AM, Eric Dumazet wrote:
> On Wed, Apr 10, 2024 at 7:01 PM Zijian Zhang <zijianzhang@bytedance.com> wrote:
>>
>> On 4/9/24 2:18 PM, Eric Dumazet wrote:
>>> On Tue, Apr 9, 2024 at 10:53 PM <zijianzhang@bytedance.com> wrote:
>>>>
>>>> From: Zijian Zhang <zijianzhang@bytedance.com>
>>>>
>>>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>>>> However, zerocopy is not a free lunch. Apart from the management of user
>>>> pages, the combination of poll + recvmsg to receive notifications incurs
>>>> unignorable overhead in the applications. The overhead of such sometimes
>>>> might be more than the CPU savings from zerocopy. We try to solve this
>>>> problem with a new option for TCP and UDP, MSG_ZEROCOPY_UARG.
>>>> This new mechanism aims to reduce the overhead associated with receiving
>>>> notifications by embedding them directly into user arguments passed with
>>>> each sendmsg control message. By doing so, we can significantly reduce
>>>> the complexity and overhead for managing notifications. In an ideal
>>>> pattern, the user will keep calling sendmsg with MSG_ZEROCOPY_UARG
>>>> flag, and the notification will be delivered as soon as possible.
>>>>
>>>> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
>>>> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
>>>> ---
>>>>    include/linux/skbuff.h                  |   7 +-
>>>>    include/linux/socket.h                  |   1 +
>>>>    include/linux/tcp.h                     |   3 +
>>>>    include/linux/udp.h                     |   3 +
>>>>    include/net/sock.h                      |  17 +++
>>>>    include/net/udp.h                       |   1 +
>>>>    include/uapi/asm-generic/socket.h       |   2 +
>>>>    include/uapi/linux/socket.h             |  17 +++
>>>>    net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
>>>>    net/core/sock.c                         |  50 +++++++++
>>>>    net/ipv4/ip_output.c                    |   6 +-
>>>>    net/ipv4/tcp.c                          |   7 +-
>>>>    net/ipv4/udp.c                          |   9 ++
>>>>    net/ipv6/ip6_output.c                   |   5 +-
>>>>    net/ipv6/udp.c                          |   9 ++
>>>>    net/vmw_vsock/virtio_transport_common.c |   2 +-
>>>>    16 files changed, 258 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index 03ea36a82cdd..19b94ba01007 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
>>>>    #endif
>>>>
>>>>    struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>>> -                                      struct ubuf_info *uarg);
>>>> +                                      struct ubuf_info *uarg, bool user_args_notification);
>>>>
>>>>    void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
>>>>
>>>>    void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>>>                              bool success);
>>>> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>>> +                          bool success);
>>>>
>>>>    int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>>>>                               struct sk_buff *skb, struct iov_iter *from,
>>>> @@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
>>>>    static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>>>    {
>>>>           if (uarg) {
>>>> -               if (uarg->callback == msg_zerocopy_callback)
>>>> +               if (uarg->callback == msg_zerocopy_callback ||
>>>> +                       uarg->callback == msg_zerocopy_uarg_callback)
>>>>                           msg_zerocopy_put_abort(uarg, have_uref);
>>>>                   else if (have_uref)
>>>>                           net_zcopy_put(uarg);
>>>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>>>> index 139c330ccf2c..de01392344e1 100644
>>>> --- a/include/linux/socket.h
>>>> +++ b/include/linux/socket.h
>>>> @@ -326,6 +326,7 @@ struct ucred {
>>>>                                             * plain text and require encryption
>>>>                                             */
>>>>
>>>> +#define MSG_ZEROCOPY_UARG      0x2000000 /* MSG_ZEROCOPY with UARG notifications */
>>>>    #define MSG_ZEROCOPY   0x4000000       /* Use user data in kernel path */
>>>>    #define MSG_SPLICE_PAGES 0x8000000     /* Splice the pages from the iterator in sendmsg() */
>>>>    #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
>>>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>>>> index 55399ee2a57e..e973f4990646 100644
>>>> --- a/include/linux/tcp.h
>>>> +++ b/include/linux/tcp.h
>>>> @@ -501,6 +501,9 @@ struct tcp_sock {
>>>>            */
>>>>           struct request_sock __rcu *fastopen_rsk;
>>>>           struct saved_syn *saved_syn;
>>>> +
>>>> +/* TCP MSG_ZEROCOPY_UARG related information */
>>>> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
>>>>    };
>>>>
>>>>    enum tsq_enum {
>>>> diff --git a/include/linux/udp.h b/include/linux/udp.h
>>>> index 3748e82b627b..502b393eac67 100644
>>>> --- a/include/linux/udp.h
>>>> +++ b/include/linux/udp.h
>>>> @@ -95,6 +95,9 @@ struct udp_sock {
>>>>
>>>>           /* Cache friendly copy of sk->sk_peek_off >= 0 */
>>>>           bool            peeking_with_offset;
>>>> +
>>>> +       /* This field is used by sendmsg zcopy user arg mode notification */
>>>> +       struct tx_msg_zcopy_queue tx_zcopy_queue;
>>>>    };
>>>>
>>>>    #define udp_test_bit(nr, sk)                   \
>>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>>> index 2253eefe2848..f7c045e98213 100644
>>>> --- a/include/net/sock.h
>>>> +++ b/include/net/sock.h
>>>> @@ -544,6 +544,23 @@ struct sock {
>>>>           netns_tracker           ns_tracker;
>>>>    };
>>>>
>>>> +struct tx_msg_zcopy_node {
>>>> +       struct list_head node;
>>>> +       struct tx_msg_zcopy_info info;
>>>> +       struct sk_buff *skb;
>>>> +};
>>>> +
>>>> +struct tx_msg_zcopy_queue {
>>>> +       struct list_head head;
>>>> +       spinlock_t lock; /* protects head queue */
>>>> +};
>>>> +
>>>> +static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
>>>> +{
>>>> +       spin_lock_init(&q->lock);
>>>> +       INIT_LIST_HEAD(&q->head);
>>>> +}
>>>> +
>>>>    enum sk_pacing {
>>>>           SK_PACING_NONE          = 0,
>>>>           SK_PACING_NEEDED        = 1,
>>>> diff --git a/include/net/udp.h b/include/net/udp.h
>>>> index 488a6d2babcc..9e4d7b128de4 100644
>>>> --- a/include/net/udp.h
>>>> +++ b/include/net/udp.h
>>>> @@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
>>>>           skb_queue_head_init(&up->reader_queue);
>>>>           up->forward_threshold = sk->sk_rcvbuf >> 2;
>>>>           set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
>>>> +       tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
>>>>    }
>>>>
>>>>    /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
>>>> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
>>>> index 8ce8a39a1e5f..86aa4b5cb7f1 100644
>>>> --- a/include/uapi/asm-generic/socket.h
>>>> +++ b/include/uapi/asm-generic/socket.h
>>>> @@ -135,6 +135,8 @@
>>>>    #define SO_PASSPIDFD           76
>>>>    #define SO_PEERPIDFD           77
>>>>
>>>> +#define SO_ZEROCOPY_NOTIFICATION 78
>>>> +
>>>>    #if !defined(__KERNEL__)
>>>>
>>>>    #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
>>>> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
>>>> index d3fcd3b5ec53..469ed8f4e6c8 100644
>>>> --- a/include/uapi/linux/socket.h
>>>> +++ b/include/uapi/linux/socket.h
>>>> @@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
>>>>    #define SOCK_TXREHASH_DISABLED 0
>>>>    #define SOCK_TXREHASH_ENABLED  1
>>>>
>>>> +/*
>>>> + * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
>>>> + * back to user as soon as possible, 8 should be sufficient.
>>>> + */
>>>> +#define SOCK_USR_ZC_INFO_MAX 8
>>>> +
>>>> +struct tx_msg_zcopy_info {
>>>> +       __u32 lo;
>>>> +       __u32 hi;
>>>> +       __u8 zerocopy;
>>>> +};
>>>> +
>>>> +struct tx_usr_zcopy_info {
>>>> +       int length;
>>>> +       struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
>>>> +};
>>>> +
>>>>    #endif /* _UAPI_LINUX_SOCKET_H */
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 2a5ce6667bbb..d939b2c14d55 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
>>>>
>>>> +static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
>>>> +{
>>>> +       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
>>>> +       uarg->len = 1;
>>>> +       uarg->bytelen = size;
>>>> +       uarg->zerocopy = 1;
>>>> +       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>>>> +       refcount_set(&uarg->ubuf.refcnt, 1);
>>>> +}
>>>> +
>>>>    static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>>>>    {
>>>>           struct ubuf_info_msgzc *uarg;
>>>> @@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
>>>>           }
>>>>
>>>>           uarg->ubuf.callback = msg_zerocopy_callback;
>>>> -       uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
>>>> -       uarg->len = 1;
>>>> -       uarg->bytelen = size;
>>>> -       uarg->zerocopy = 1;
>>>> -       uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>>>> -       refcount_set(&uarg->ubuf.refcnt, 1);
>>>> +       init_ubuf_info_msgzc(uarg, sk, size);
>>>> +       sock_hold(sk);
>>>> +
>>>> +       return &uarg->ubuf;
>>>> +}
>>>> +
>>>> +static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
>>>> +{
>>>> +       struct sk_buff *skb;
>>>> +       struct ubuf_info_msgzc *uarg;
>>>> +       struct tx_msg_zcopy_node *zcopy_node_p;
>>>> +
>>>> +       WARN_ON_ONCE(!in_task());
>>>> +
>>>> +       skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
>>>> +       if (!skb)
>>>> +               return NULL;
>>>> +
>>>> +       BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
>>>> +       uarg = (void *)skb->cb;
>>>> +       uarg->mmp.user = NULL;
>>>> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
>>>> +
>>>> +       if (mm_account_pinned_pages(&uarg->mmp, size)) {
>>>> +               kfree_skb(skb);
>>>> +               return NULL;
>>>> +       }
>>>> +
>>>> +       INIT_LIST_HEAD(&zcopy_node_p->node);
>>>> +       zcopy_node_p->skb = skb;
>>>> +       uarg->ubuf.callback = msg_zerocopy_uarg_callback;
>>>> +       init_ubuf_info_msgzc(uarg, sk, size);
>>>>           sock_hold(sk);
>>>>
>>>>           return &uarg->ubuf;
>>>> @@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
>>>>    }
>>>>
>>>>    struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>>> -                                      struct ubuf_info *uarg)
>>>> +                                      struct ubuf_info *uarg, bool usr_arg_notification)
>>>>    {
>>>>           if (uarg) {
>>>>                   struct ubuf_info_msgzc *uarg_zc;
>>>> @@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>>>                   u32 bytelen, next;
>>>>
>>>>                   /* there might be non MSG_ZEROCOPY users */
>>>> -               if (uarg->callback != msg_zerocopy_callback)
>>>> +               if (uarg->callback != msg_zerocopy_callback &&
>>>> +                   uarg->callback != msg_zerocopy_uarg_callback)
>>>>                           return NULL;
>>>>
>>>>                   /* realloc only when socket is locked (TCP, UDP cork),
>>>> @@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>>>           }
>>>>
>>>>    new_alloc:
>>>> +       if (usr_arg_notification)
>>>> +               return msg_zerocopy_uarg_alloc(sk, size);
>>>>           return msg_zerocopy_alloc(sk, size);
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
>>>> @@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
>>>>
>>>> +static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
>>>> +{
>>>> +       u32 old_lo, old_hi;
>>>> +       u64 sum_len;
>>>> +
>>>> +       old_lo = node->info.lo;
>>>> +       old_hi = node->info.hi;
>>>> +       sum_len = old_hi - old_lo + 1ULL + len;
>>>> +
>>>> +       if (sum_len >= (1ULL << 32))
>>>> +               return false;
>>>> +
>>>> +       if (lo != old_hi + 1)
>>>> +               return false;
>>>> +
>>>> +       node->info.hi += len;
>>>> +       return true;
>>>> +}
>>>> +
>>>> +static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
>>>> +{
>>>> +       struct sk_buff *skb = skb_from_uarg(uarg);
>>>> +       struct sock *sk = skb->sk;
>>>> +       struct tx_msg_zcopy_node *zcopy_node_p, *tail;
>>>> +       struct tx_msg_zcopy_queue *zcopy_queue;
>>>> +       unsigned long flags;
>>>> +       u32 lo, hi;
>>>> +       u16 len;
>>>> +
>>>> +       mm_unaccount_pinned_pages(&uarg->mmp);
>>>> +
>>>> +       /* if !len, there was only 1 call, and it was aborted
>>>> +        * so do not queue a completion notification
>>>> +        */
>>>> +       if (!uarg->len || sock_flag(sk, SOCK_DEAD))
>>>> +               goto release;
>>>> +
>>>> +       /* only support TCP and UCP currently */
>>>> +       if (sk_is_tcp(sk)) {
>>>> +               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
>>>> +       } else if (sk_is_udp(sk)) {
>>>> +               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
>>>> +       } else {
>>>> +               pr_warn("MSG_ZEROCOPY_UARG only support TCP && UDP sockets");
>>>> +               goto release;
>>>> +       }
>>>> +
>>>> +       len = uarg->len;
>>>> +       lo = uarg->id;
>>>> +       hi = uarg->id + len - 1;
>>>> +
>>>> +       zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
>>>> +       zcopy_node_p->info.lo = lo;
>>>> +       zcopy_node_p->info.hi = hi;
>>>> +       zcopy_node_p->info.zerocopy = uarg->zerocopy;
>>>> +
>>>> +       spin_lock_irqsave(&zcopy_queue->lock, flags);
>>>> +       tail = list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
>>>> +       if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
>>>> +               list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
>>>> +               skb = NULL;
>>>> +       }
>>>> +       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>>>> +release:
>>>> +       consume_skb(skb);
>>>> +       sock_put(sk);
>>>> +}
>>>> +
>>>> +void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
>>>> +                               bool success)
>>>> +{
>>>> +       struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
>>>> +
>>>> +       uarg_zc->zerocopy = uarg_zc->zerocopy & success;
>>>> +
>>>> +       if (refcount_dec_and_test(&uarg->refcnt))
>>>> +               __msg_zerocopy_uarg_callback(uarg_zc);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
>>>> +
>>>>    void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>>>    {
>>>>           struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
>>>> @@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>>>>           uarg_to_msgzc(uarg)->len--;
>>>>
>>>>           if (have_uref)
>>>> -               msg_zerocopy_callback(NULL, uarg, true);
>>>> +               uarg->callback(NULL, uarg, true);
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
>>>>
>>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>>> index 5ed411231fc7..a00ebd71f6ed 100644
>>>> --- a/net/core/sock.c
>>>> +++ b/net/core/sock.c
>>>> @@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>>>>           case SCM_RIGHTS:
>>>>           case SCM_CREDENTIALS:
>>>>                   break;
>>>> +       case SO_ZEROCOPY_NOTIFICATION:
>>>> +               if (sock_flag(sk, SOCK_ZEROCOPY)) {
>>>> +                       int i = 0;
>>>> +                       struct tx_usr_zcopy_info sys_zcopy_info;
>>>> +                       struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
>>>> +                       struct tx_msg_zcopy_queue *zcopy_queue;
>>>> +                       struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
>>>> +                       unsigned long flags;
>>>> +
>>>> +                       if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
>>>> +                               return -EINVAL;
>>>> +
>>>> +                       if (sk_is_tcp(sk))
>>>> +                               zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
>>>> +                       else if (sk_is_udp(sk))
>>>> +                               zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
>>>> +                       else
>>>> +                               return -EINVAL;
>>>> +
>>>> +                       spin_lock_irqsave(&zcopy_queue->lock, flags);
>>>> +                       list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
>>>> +                               sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
>>>> +                               sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
>>>> +                               sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
>>>> +                               list_del(&zcopy_node_p->node);
>>>> +                               zcopy_node_ps[i++] = zcopy_node_p;
>>>> +                               if (i == SOCK_USR_ZC_INFO_MAX)
>>>> +                                       break;
>>>> +                       }
>>>> +                       spin_unlock_irqrestore(&zcopy_queue->lock, flags);
>>>> +
>>>> +                       if (i > 0) {
>>>> +                               sys_zcopy_info.length = i;
>>>> +                               if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
>>>
>>> This is going to break if user space is 32bit, and kernel 64bit ?
>>>
>>
>> Nice catch, thanks for pointing this out!
>>
>> I may update the code like this?
>>
>> ```
>> if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)) &&
>>       cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
>>       return -EINVAL;
>>
>> ...
> 
> I would instead use u64 convention, instead of "void *" or u32.
> 
>>
>> void __user *user;
>> if (cmsg->cmsg_len == 8)
>>       user = *(void **)CMSG_DATA(cmsg);
>> else if (cmsg->cmsg_len == 4) {
>>       u32 val = *(u32 *)CMSG_DATA(cmsg);
>>       user = (void *)(uintptr_t)val;
>> }
>> copy_to_user(user, ...)
>> ```
>>
>> I assume it's user's duty to pass in the right address and cmsg_len.
>> Malicious address or invalid address because of wrong cmsg_len will be
>> protected by copy_to_user.
> 
> 
> 
>>
>>> Also SOCK_USR_ZC_INFO_MAX is put in stone, it won't change in the future.
>>>
>>
>> In the scenario where users keep calling sendmsg, at most time, the user
>> will get one notification from the previous sendmsg. Sometimes, users
>> get two notifications. Thus, we give SOCK_USR_ZC_INFO_MAX a small but
>> sufficient number 8. We put it in stone here to avoid dynamic allocation
>> and memory management of the array.
>>
>> If the memory management worth it, we can change it to a sysctl
>> variable?
> 
> Convention for user space -> kernel space , is to pass not only a base
> pointer, but also a length.
> 
> Of course the kernel implementation can still use a local/automatic
> array in the stack with 8 elems,
> but this could change eventually
> 
> Few system calls that did not follow this convention had to be
> extended to other system calls.
> 
>>
>> Or, how about we keep SOCK_USR_ZC_INFO_MAX, the max value here, and let
>> the users pass in the length (could be any number less than MAX) of info
>> they want to get?
>>
>>>> +                                                         &sys_zcopy_info,
>>>> +                                                         sizeof(sys_zcopy_info))
>>>
>>> Your are leaking to user space part of kernel stack, if (i <
>>> SOCK_USR_ZC_INFO_MAX)
>>>
>>
>> Agree, I should do memset to clear the array.
> 
> Or copy only the part that has been initialized, struct_size() might help ...

Thanks for the suggestions, I will update in the next version :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-09 23:57       ` [External] " Zijian Zhang
@ 2024-04-10 23:06         ` Willem de Bruijn
  2024-04-12  0:26           ` Zijian Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2024-04-10 23:06 UTC (permalink / raw)
  To: Zijian Zhang, Eric Dumazet, Willem de Bruijn
  Cc: netdev, davem, kuba, cong.wang, xiaochun.lu

> >>> In this case, for some reason, notifications do not
> >>> come in order now. We introduce "cfg_notification_order_check" to
> >>> possibly ignore the checking for order.
> >>
> >> Were you testing UDP?
> >>
> >> I don't think this is needed. I wonder what you were doing to see
> >> enough of these events to want to suppress the log output.
> 
> I tested again on both TCP and UDP just now, and it happened to both of 
> them. For tcp test, too many printfs will delay the sending and thus 
> affect the throughput.
> 
> ipv4 tcp -z -t 1
> gap: 277..277 does not append to 276

There is something wrong here. 277 clearly appends to 276

> gap: 276..276 does not append to 278

This would be an actual reordering. But the above line already
indicates that 276 is the next expected value.

> gap: 278..1112 does not append to 277
> gap: 1114..1114 does not append to 1113
> gap: 1113..1113 does not append to 1115
> gap: 1115..2330 does not append to 1114
> gap: 2332..2332 does not append to 2331
> gap: 2331..2331 does not append to 2333
> gap: 2333..2559 does not append to 2332
> gap: 2562..2562 does not append to 2560
> ...
> gap: 25841..25841 does not append to 25843
> gap: 25843..25997 does not append to 25842
> 
> ...
> 
> ipv6 udp -z -t 1
> gap: 11632..11687 does not append to 11625
> gap: 11625..11631 does not append to 11688
> gap: 11688..54662 does not append to 11632

If you ran this on a kernel with a variety of changes, please repeat
this on a clean kernel with no other changes besides the
skb_orphan_frags_rx loopback change.

If this is a real issue, I don't mind moving this behind cfg_verbose,
and I prefer that approach over adding a new flag.
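
For illustration, assuming a cfg_verbose knob in the selftest (existing or
newly added), that would be a one-line change where the gap is reported
(untested sketch):

```
	/* only report out-of-order ranges when verbose output is requested */
	if (lo != next_completion && cfg_verbose)
		fprintf(stderr, "gap: %u..%u does not append to %u\n",
			lo, hi, next_completion);
```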

But I have never seen this before, and this kind of reordering is rare
with UDP and should not happen with TCP except in really rare edge cases:
the uarg is released only when both the skb was delivered and the ACK
response was received to free the clone on the retransmit queue.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-10 23:06         ` Willem de Bruijn
@ 2024-04-12  0:26           ` Zijian Zhang
  2024-04-12 15:36             ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Zijian Zhang @ 2024-04-12  0:26 UTC (permalink / raw)
  To: Willem de Bruijn, Eric Dumazet
  Cc: netdev, davem, kuba, cong.wang, xiaochun.lu

On 4/10/24 4:06 PM, Willem de Bruijn wrote:
>>>>> In this case, for some reason, notifications do not
>>>>> come in order now. We introduce "cfg_notification_order_check" to
>>>>> possibly ignore the checking for order.
>>>>
>>>> Were you testing UDP?
>>>>
>>>> I don't think this is needed. I wonder what you were doing to see
>>>> enough of these events to want to suppress the log output.
>>
>> I tested again on both TCP and UDP just now, and it happened to both of
>> them. For tcp test, too many printfs will delay the sending and thus
>> affect the throughput.
>>
>> ipv4 tcp -z -t 1
>> gap: 277..277 does not append to 276
> 
> There is something wrong here. 277 clearly appends to 276
> 

```
if (lo != next_completion)
     fprintf(stderr, "gap: %u..%u does not append to %u\n",
         lo, hi, next_completion);
```

According to the code, it expects the lo to be 276, but it's 277.

>> gap: 276..276 does not append to 278
> 
> This would be an actual reordering. But the above line already
> indicates that 276 is the next expected value.
> 
>> gap: 278..1112 does not append to 277
>> gap: 1114..1114 does not append to 1113
>> gap: 1113..1113 does not append to 1115
>> gap: 1115..2330 does not append to 1114
>> gap: 2332..2332 does not append to 2331
>> gap: 2331..2331 does not append to 2333
>> gap: 2333..2559 does not append to 2332
>> gap: 2562..2562 does not append to 2560
>> ...
>> gap: 25841..25841 does not append to 25843
>> gap: 25843..25997 does not append to 25842
>>
>> ...
>>
>> ipv6 udp -z -t 1
>> gap: 11632..11687 does not append to 11625
>> gap: 11625..11631 does not append to 11688
>> gap: 11688..54662 does not append to 11632
> 
> If you ran this on a kernel with a variety of changes, please repeat
> this on a clean kernel with no other changes besides the
> skb_orphan_frags_rx loopback change.
> 
> It this is a real issue, I don't mind moving this behind cfg_verbose.
> And prefer that approach over adding a new flag.
> 
> But I have never seen this before, and this kind of reordering is rare
> with UDP and should not happen with TCP except for really edge cases:
> the uarg is released only when both the skb was delivered and the ACK
> response was received to free the clone on the retransmit queue.

I found the setup where I encountered the OOM problem in the msg_zerocopy
selftest. I reproduced it on a clean kernel VM booted by qemu, at
dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") with only the
skb_orphan_frags_rx change.

Then I ran `make olddefconfig` and turned on some configurations for
virtualization, like VIRTIO_FS and VIRTIO_NET, plus some options like 9P_FS
to share folders. Let's call it "config"; here is the result I got:
```
./msg_zerocopy.sh
ipv4 tcp -z -t 1
./msg_zerocopy: send: No buffer space available
rx=564 (70 MB)
```

Since the TCP socket is always writable, do_poll always returns true, so
`do_recv_completions` never gets a chance to run.
```
while (!do_poll(fd, POLLOUT)) {
	if (cfg_zerocopy)
		do_recv_completions(fd, domain);
}
```
Finally, the total size of the sendmsg zerocopy notification skbs exceeds
the optmem limit, and I get "No buffer space available".


However, if I change the config by
```
  DEBUG_IRQFLAGS n -> y
  DEBUG_LOCK_ALLOC n -> y
  DEBUG_MUTEXES n -> y
  DEBUG_RT_MUTEXES n -> y
  DEBUG_RWSEMS n -> y
  DEBUG_SPINLOCK n -> y
  DEBUG_WW_MUTEX_SLOWPATH n -> y
  PROVE_LOCKING n -> y
+DEBUG_LOCKDEP y
+LOCKDEP y
+LOCKDEP_BITS 15
+LOCKDEP_CHAINS_BITS 16
+LOCKDEP_CIRCULAR_QUEUE_BITS 12
+LOCKDEP_STACK_TRACE_BITS 19
+LOCKDEP_STACK_TRACE_HASH_BITS 14
+PREEMPTIRQ_TRACEPOINTS y
+PROVE_RAW_LOCK_NESTING n
+PROVE_RCU y
+TRACE_IRQFLAGS y
+TRACE_IRQFLAGS_NMI y
```

Let's call it config-debug; with it the selftest runs to completion, but
with reordered notifications.
```
ipv4 tcp -z -t 1
gap: 2117..2117 does not append to 2115
gap: 2115..2116 does not append to 2118
gap: 2118..3144 does not append to 2117
gap: 3146..3146 does not append to 3145
gap: 3145..3145 does not append to 3147
gap: 3147..3768 does not append to 3146
...
gap: 34935..34935 does not append to 34937
gap: 34938..36409 does not append to 34936

rx=36097 (2272 MB)
missing notifications: 36410 < 36412
tx=36412 (2272 MB) txc=36410 zc=y
```
For the exact configs used to compile the kernel, please see
https://github.com/Sm0ckingBird/config


I also ran the selftest on 63c8778d9149 ("Merge branch
'net-mana-fix-doorbell-access-for-receive-queues'"), the parent of
dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale").

With config, the selftest works well:
```
ipv4 tcp -z -t 1
missing notifications: 223181 < 223188
tx=223188 (13927 MB) txc=223181 zc=y
rx=111592 (13927 MB)
```

With config-debug, the selftest completes, but with reordered notifications:
```
ipv4 tcp -z -t 1
...
gap: 30397..30404 does not append to 30389
gap: 30435..30442 does not append to 30427
gap: 30427..30434 does not append to 30443
gap: 30443..30450 does not append to 30435
gap: 30473..30480 does not append to 30465
gap: 30465..30472 does not append to 30481
gap: 30481..30488 does not append to 30473
tx=30491 (1902 MB) txc=30491 zc=y
rx=15245 (1902 MB)
```

I am not sure about the exact reason for this OOM problem, nor why
turning on DEBUG_LOCKDEP and PROVE_RAW_LOCK_NESTING makes it go away
(at the cost of reordered notifications)... If you have any thoughts or
comments, please feel free to share them with us.

If the problem does exist, I guess we can force `do_recv_completions`
after some number of sendmsg calls and move the "gap: ..." message behind
cfg_verbose?
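
For illustration, the forced draining could look roughly like this in the
send loop (untested sketch; the counter name and the threshold of 64 are
made up):

```
	int sends_since_completion = 0;

	/* after each successful sendmsg() in the send loop */
	if (cfg_zerocopy && ++sends_since_completion >= 64) {
		/* TCP stays POLLOUT-writable, so the existing poll loop
		 * never drains notifications; do it periodically instead
		 * to bound optmem usage
		 */
		do_recv_completions(fd, domain);
		sends_since_completion = 0;
	}
```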

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-12  0:26           ` Zijian Zhang
@ 2024-04-12 15:36             ` Willem de Bruijn
  2024-04-12 18:58               ` Zijian Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2024-04-12 15:36 UTC (permalink / raw)
  To: Zijian Zhang, Willem de Bruijn, Eric Dumazet
  Cc: netdev, davem, kuba, cong.wang, xiaochun.lu

Zijian Zhang wrote:
> On 4/10/24 4:06 PM, Willem de Bruijn wrote:
> >>>>> In this case, for some reason, notifications do not
> >>>>> come in order now. We introduce "cfg_notification_order_check" to
> >>>>> possibly ignore the checking for order.
> >>>>
> >>>> Were you testing UDP?
> >>>>
> >>>> I don't think this is needed. I wonder what you were doing to see
> >>>> enough of these events to want to suppress the log output.
> >>
> >> I tested again on both TCP and UDP just now, and it happened to both of
> >> them. For tcp test, too many printfs will delay the sending and thus
> >> affect the throughput.
> >>
> >> ipv4 tcp -z -t 1
> >> gap: 277..277 does not append to 276
> > 
> > There is something wrong here. 277 clearly appends to 276
> > 
> 
> ```
> if (lo != next_completion)
>      fprintf(stderr, "gap: %u..%u does not append to %u\n",
>          lo, hi, next_completion);
> ```
> 
> According to the code, it expects the lo to be 276, but it's 277.

Ack. I should have phrased that message better.
 
> > If you ran this on a kernel with a variety of changes, please repeat
> > this on a clean kernel with no other changes besides the
> > skb_orphan_frags_rx loopback change.
> > 
> > It this is a real issue, I don't mind moving this behind cfg_verbose.
> > And prefer that approach over adding a new flag.
> > 
> > But I have never seen this before, and this kind of reordering is rare
> > with UDP and should not happen with TCP except for really edge cases:
> > the uarg is released only when both the skb was delivered and the ACK
> > response was received to free the clone on the retransmit queue.
> 
> I found the set up where I encountered the OOM problem in msg_zerocopy
> selftest. I did it on a clean kernel vm booted by qemu, 
> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale") with only
> skb_orphan_frags_rx change.
> 
> Then, I do `make olddefconfig` and turn on some configurations for
> virtualization like VIRTIO_FS, VIRTIO_NET and some confs like 9P_FS
> to share folders. Let's call it config, here was the result I got,
> ```
> ./msg_zerocopy.sh
> ipv4 tcp -z -t 1
> ./msg_zerocopy: send: No buffer space available
> rx=564 (70 MB)
> ```
> 
> Since the TCP socket is always writable, the do_poll always return True.
> There is no any chance for `do_recv_completions` to run.
> ```
> while (!do_poll(fd, POLLOUT)) {
>      if (cfg_zerocopy)
>          do_recv_completions(fd, domain);
>      }
> ```
> Finally, the size of sendmsg zerocopy notification skbs exceeds the 
> opt_mem limit. I got "No buffer space available".
> 
> 
> However, if I change the config by
> ```
>   DEBUG_IRQFLAGS n -> y
>   DEBUG_LOCK_ALLOC n -> y
>   DEBUG_MUTEXES n -> y
>   DEBUG_RT_MUTEXES n -> y
>   DEBUG_RWSEMS n -> y
>   DEBUG_SPINLOCK n -> y
>   DEBUG_WW_MUTEX_SLOWPATH n -> y
>   PROVE_LOCKING n -> y
> +DEBUG_LOCKDEP y
> +LOCKDEP y
> +LOCKDEP_BITS 15
> +LOCKDEP_CHAINS_BITS 16
> +LOCKDEP_CIRCULAR_QUEUE_BITS 12
> +LOCKDEP_STACK_TRACE_BITS 19
> +LOCKDEP_STACK_TRACE_HASH_BITS 14
> +PREEMPTIRQ_TRACEPOINTS y
> +PROVE_RAW_LOCK_NESTING n
> +PROVE_RCU y
> +TRACE_IRQFLAGS y
> +TRACE_IRQFLAGS_NMI y
> ```
> 
> Let's call it config-debug, the selftest works well with reordered
> notifications.
> ```
> ipv4 tcp -z -t 1
> gap: 2117..2117 does not append to 2115
> gap: 2115..2116 does not append to 2118
> gap: 2118..3144 does not append to 2117
> gap: 3146..3146 does not append to 3145
> gap: 3145..3145 does not append to 3147
> gap: 3147..3768 does not append to 3146
> ...
> gap: 34935..34935 does not append to 34937
> gap: 34938..36409 does not append to 34936
> 
> rx=36097 (2272 MB)
> missing notifications: 36410 < 36412
> tx=36412 (2272 MB) txc=36410 zc=y
> ```
> For exact config to compile the kernel, please see
> https://github.com/Sm0ckingBird/config

Thanks for sharing the system configs. I'm quite surprised at these
reorderings *over loopback* with these debug settings, and no weird
qdiscs that would explain it. Can you check whether you see drops and
retransmits?

> 
> I also did selftest on 63c8778d9149("Merge branch 
> 'net-mana-fix-doorbell-access-for-receive-queues'"), the parent of 
> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale")
> 
> with config, selftest works well.
> ```
> ipv4 tcp -z -t 1
> missing notifications: 223181 < 223188
> tx=223188 (13927 MB) txc=223181 zc=y
> rx=111592 (13927 MB)
> ```
> 
> with config-debug, selftest works well with reordered notifications
> ```
> ipv4 tcp -z -t 1
> ...
> gap: 30397..30404 does not append to 30389
> gap: 30435..30442 does not append to 30427
> gap: 30427..30434 does not append to 30443
> gap: 30443..30450 does not append to 30435
> gap: 30473..30480 does not append to 30465
> gap: 30465..30472 does not append to 30481
> gap: 30481..30488 does not append to 30473
> tx=30491 (1902 MB) txc=30491 zc=y
> rx=15245 (1902 MB)
> ```
> 
> Not sure about the exact reason for this OOM problem, or why
> turning on DEBUG_LOCKDEP and PROVE_RAW_LOCK_NESTING makes it go
> away while introducing reordered notifications...

The debug config causes the notification reordering, right? But it
solves the OOM.

> If you have any thoughts or
> comments, please feel free to share them with us.
> 
> If the problem does exist, I guess we can force `do_recv_completions`
> after some number of sendmsgs and move the "gap: ..." message behind
> cfg_verbose?

I do want to understand the issue better. But I am not sure when I'll
find the time.

Both sound reasonable to me, yes.
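For illustration, the two changes being agreed on here could look roughly like the sketch below. This is only a sketch: the counter name and the threshold are made up, and the exact placement in the selftest's send loop is an assumption, not the final patch.

```
/* Sketch only, hypothetical names and threshold.
 *
 * 1) Drain completions every few sends even while the socket stays
 *    writable, so notification skbs cannot pile up against optmem.
 * 2) Report reordering gaps only when running verbose.
 */

/* in the send loop, after each successful sendmsg(): */
if (cfg_zerocopy && ++sends_since_done >= 8) {	/* 8 is arbitrary */
	do_recv_completions(fd, domain);
	sends_since_done = 0;
}

/* in do_recv_completions(), when a notification does not append: */
if (lo != next_completion && cfg_verbose)
	fprintf(stderr, "gap: %u..%u does not append to %u\n",
		lo, hi, next_completion);
```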

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [External] Re: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
  2024-04-12 15:36             ` Willem de Bruijn
@ 2024-04-12 18:58               ` Zijian Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Zijian Zhang @ 2024-04-12 18:58 UTC (permalink / raw)
  To: Willem de Bruijn, Eric Dumazet
  Cc: netdev, davem, kuba, cong.wang, xiaochun.lu

On 4/12/24 8:36 AM, Willem de Bruijn wrote:
> Zijian Zhang wrote:
>> On 4/10/24 4:06 PM, Willem de Bruijn wrote:
>>>>>>> In this case, for some reason, notifications do not
>>>>>>> come in order now. We introduce "cfg_notification_order_check" to
>>>>>>> optionally skip the ordering check.
>>>>>>
>>>>>> Were you testing UDP?
>>>>>>
>>>>>> I don't think this is needed. I wonder what you were doing to see
>>>>>> enough of these events to want to suppress the log output.
>>>>
>>>> I tested again on both TCP and UDP just now, and it happened to both of
>>>> them. For the TCP test, too many printfs delay the sending and thus
>>>> affect the throughput.
>>>>
>>>> ipv4 tcp -z -t 1
>>>> gap: 277..277 does not append to 276
>>>
>>> There is something wrong here. 277 clearly appends to 276
>>>
>>
>> ```
>> if (lo != next_completion)
>>       fprintf(stderr, "gap: %u..%u does not append to %u\n",
>>           lo, hi, next_completion);
>> ```
>>
>> According to the code, it expects lo to be 276, but it is 277.
> 
> Ack. I should have phrased that message better.
>   
>>> If you ran this on a kernel with a variety of changes, please repeat
>>> this on a clean kernel with no other changes besides the
>>> skb_orphan_frags_rx loopback change.
>>>
>>> If this is a real issue, I don't mind moving this behind cfg_verbose.
>>> And I prefer that approach over adding a new flag.
>>>
>>> But I have never seen this before, and this kind of reordering is rare
>>> with UDP and should not happen with TCP except in real edge cases:
>>> the uarg is released only when both the skb was delivered and the ACK
>>> response was received to free the clone on the retransmit queue.
>>
>> I found the setup where I encountered the OOM problem in the msg_zerocopy
>> selftest. I did it on a clean kernel VM booted by QEMU,
>> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale") with only
>> skb_orphan_frags_rx change.
>>
>> Then, I did `make olddefconfig` and turned on some configuration options
>> for virtualization like VIRTIO_FS and VIRTIO_NET, plus some options like
>> 9P_FS to share folders. Let's call it config; here is the result I got:
>> ```
>> ./msg_zerocopy.sh
>> ipv4 tcp -z -t 1
>> ./msg_zerocopy: send: No buffer space available
>> rx=564 (70 MB)
>> ```
>>
>> Since the TCP socket is always writable, do_poll always returns true,
>> so `do_recv_completions` never gets a chance to run.
>> ```
>> while (!do_poll(fd, POLLOUT)) {
>>       if (cfg_zerocopy)
>>           do_recv_completions(fd, domain);
>>       }
>> ```
>> Finally, the sendmsg zerocopy notification skbs exceed the optmem
>> limit, and I get "No buffer space available".
>>
>>
>> However, if I change the config as follows:
>> ```
>>    DEBUG_IRQFLAGS n -> y
>>    DEBUG_LOCK_ALLOC n -> y
>>    DEBUG_MUTEXES n -> y
>>    DEBUG_RT_MUTEXES n -> y
>>    DEBUG_RWSEMS n -> y
>>    DEBUG_SPINLOCK n -> y
>>    DEBUG_WW_MUTEX_SLOWPATH n -> y
>>    PROVE_LOCKING n -> y
>> +DEBUG_LOCKDEP y
>> +LOCKDEP y
>> +LOCKDEP_BITS 15
>> +LOCKDEP_CHAINS_BITS 16
>> +LOCKDEP_CIRCULAR_QUEUE_BITS 12
>> +LOCKDEP_STACK_TRACE_BITS 19
>> +LOCKDEP_STACK_TRACE_HASH_BITS 14
>> +PREEMPTIRQ_TRACEPOINTS y
>> +PROVE_RAW_LOCK_NESTING n
>> +PROVE_RCU y
>> +TRACE_IRQFLAGS y
>> +TRACE_IRQFLAGS_NMI y
>> ```
>>
>> Let's call it config-debug; the selftest completes, but with reordered
>> notifications:
>> ```
>> ipv4 tcp -z -t 1
>> gap: 2117..2117 does not append to 2115
>> gap: 2115..2116 does not append to 2118
>> gap: 2118..3144 does not append to 2117
>> gap: 3146..3146 does not append to 3145
>> gap: 3145..3145 does not append to 3147
>> gap: 3147..3768 does not append to 3146
>> ...
>> gap: 34935..34935 does not append to 34937
>> gap: 34938..36409 does not append to 34936
>>
>> rx=36097 (2272 MB)
>> missing notifications: 36410 < 36412
>> tx=36412 (2272 MB) txc=36410 zc=y
>> ```
>> For the exact config used to compile the kernel, please see
>> https://github.com/Sm0ckingBird/config
> 
> Thanks for sharing the system configs. I'm quite surprised at these
> reorderings *over loopback* with these debug settings, and no weird
> qdiscs that would explain it. Can you check whether you see drops and
> retransmits?
> 

No drops or retransmits are observed.

```
ip netns exec ns-djROUw1 netstat -s

Tcp:
     1 active connection openings
     0 passive connection openings
     0 failed connection attempts
     1 connection resets received
     0 connections established
     16158 segments received
     32311 segments sent out
     0 segments retransmitted
     0 bad segments received
     0 resets sent

ip netns exec ns-djROUw1 ip -s link show veth0

2: veth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc noqueue 
state UP mode DEFAULT group default qlen 1000
     link/ether 02:02:02:02:02:02 brd ff:ff:ff:ff:ff:ff link-netns 
ns-djROUw2
     RX: bytes  packets  errors  dropped overrun mcast
     1067646    16173    0       0       0       0
     TX: bytes  packets  errors  dropped carrier collsns
     2116207634 32325    0       0       0       0
```

>>
>> I also ran the selftest on 63c8778d9149("Merge branch
>> 'net-mana-fix-doorbell-access-for-receive-queues'"), the parent of
>> dfa2f0483360("tcp: get rid of sysctl_tcp_adv_win_scale").
>>
>> With config, the selftest works well:
>> ```
>> ipv4 tcp -z -t 1
>> missing notifications: 223181 < 223188
>> tx=223188 (13927 MB) txc=223181 zc=y
>> rx=111592 (13927 MB)
>> ```
>>
>> With config-debug, the selftest completes, but with reordered notifications:
>> ```
>> ipv4 tcp -z -t 1
>> ...
>> gap: 30397..30404 does not append to 30389
>> gap: 30435..30442 does not append to 30427
>> gap: 30427..30434 does not append to 30443
>> gap: 30443..30450 does not append to 30435
>> gap: 30473..30480 does not append to 30465
>> gap: 30465..30472 does not append to 30481
>> gap: 30481..30488 does not append to 30473
>> tx=30491 (1902 MB) txc=30491 zc=y
>> rx=15245 (1902 MB)
>> ```
>>
>> Not sure about the exact reason for this OOM problem, or why
>> turning on DEBUG_LOCKDEP and PROVE_RAW_LOCK_NESTING makes it go
>> away while introducing reordered notifications...
> 
> The debug config causes the notification reordering, right? But it
> solves the OOM.
> 

With config, the OOM is introduced by dfa2f0483360("tcp: get rid of
sysctl_tcp_adv_win_scale"). It can be solved by config-debug, but
reordered notifications are then observed.

With config, there is no OOM at 63c8778d9149("Merge branch
'net-mana-fix-doorbell-access-for-receive-queues'"); everything works
well. But with config-debug, reordered notifications are observed.

>> If you have any thoughts or
>> comments, please feel free to share them with us.
>>
>> If the problem does exist, I guess we can force `do_recv_completions`
>> after some number of sendmsgs and move the "gap: ..." message behind
>> cfg_verbose?
> 
> I do want to understand the issue better. But I am not sure when I'll
> find the time.
> 

I also want to understand this; I will look into it when I find time :)

> Both sound reasonable to me, yes.
Thanks for the review and suggestions; I will update in the next iteration.
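
As background for the optmem discussion above: each MSG_ZEROCOPY completion is
queued as a notification skb on the socket error queue and stays charged to
optmem until user space reads it, which is why never reaching
`do_recv_completions` eventually yields "No buffer space available". A minimal
sketch of draining the error queue (based on the generic MSG_ZEROCOPY API, not
the selftest's exact code) could look like:

```
/* Minimal sketch, not the selftest code: drain MSG_ZEROCOPY
 * completion notifications from the socket error queue.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

static void drain_zerocopy_completions(int fd)
{
	char control[128];
	struct msghdr msg = {0};
	struct cmsghdr *cm;
	struct sock_extended_err *serr;

	for (;;) {
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		/* Error queue reads do not block; -1/EAGAIN means empty. */
		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				perror("recvmsg MSG_ERRQUEUE");
			return;
		}

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			serr = (struct sock_extended_err *)CMSG_DATA(cm);
			if (serr->ee_errno != 0 ||
			    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
				continue;
			/* ee_info..ee_data is the completed send range. */
			fprintf(stderr, "completed %u..%u\n",
				serr->ee_info, serr->ee_data);
		}
	}
}
```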

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-04-12 18:59 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-09 20:52 [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG zijianzhang
2024-04-09 20:52 ` [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG zijianzhang
2024-04-09 21:18   ` Willem de Bruijn
2024-04-10  0:08     ` [External] " Zijian Zhang
2024-04-09 21:18   ` Eric Dumazet
2024-04-10 17:01     ` [External] " Zijian Zhang
2024-04-10 18:57       ` Eric Dumazet
2024-04-10 21:34         ` Zijian Zhang
2024-04-09 21:23   ` Eric Dumazet
2024-04-10  0:11     ` [External] " Zijian Zhang
2024-04-10  9:18   ` kernel test robot
2024-04-10 12:11   ` kernel test robot
2024-04-10 15:22   ` kernel test robot
2024-04-09 20:52 ` [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest zijianzhang
2024-04-09 21:25   ` Willem de Bruijn
2024-04-09 21:30     ` Eric Dumazet
2024-04-09 23:57       ` [External] " Zijian Zhang
2024-04-10 23:06         ` Willem de Bruijn
2024-04-12  0:26           ` Zijian Zhang
2024-04-12 15:36             ` Willem de Bruijn
2024-04-12 18:58               ` Zijian Zhang
2024-04-09 20:53 ` [PATCH net-next 3/3] selftests: add msg_zerocopy_uarg test zijianzhang
2024-04-10  8:46 ` [PATCH net-next 0/3] net: socket sendmsg MSG_ZEROCOPY_UARG Paolo Abeni
2024-04-10 17:03   ` [External] " Zijian Zhang
