netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v5 00/27] io_uring zerocopy send
@ 2022-07-12 20:52 Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc Pavel Begunkov
                   ` (27 more replies)
  0 siblings, 28 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

NOTE: Not to be picked directly. After getting necessary acks, I'll be
      working out merging with Jakub and Jens.

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers, mixing is allowed but not recommended. Apart from usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below),
which is delivered into io_uring's Completion Queue. Those "buffer-free"
notifications are not necessarily per request, but the userspace has control
over it and should explicitly attaching a number of requests to a single
notification. The series also adds some internal optimisations when used with
registered buffers like removing page referencing.

From the kernel networking perspective there are two main changes. The first
one is passing ubuf_info into the network layer from io_uring (inside of an
in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, but also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

Benchmarking UDP with an optimised version of the selftest (see [1]), which
sends a bunch of requests, waits for completions and repeats. "+ flush" column
posts one additional "buffer-free" notification per request, and just "zc"
doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting. There
is also an additional bunch of refcounting optimisations that was omitted from
the series for simplicity and as they don't change the picture drastically,
they will be sent as follow up, as well as flushing optimisations closing the
performance gap b/w two last columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
omitted optimisations will somewhat help, should look better for 4000,
but couldn't test properly because of setup problems.

Links:

  liburing (benchmark + tests):
  [1] https://github.com/isilence/liburing/tree/zc_v4

  kernel repo:
  [2] https://github.com/isilence/linux/tree/zc_v4

  RFC v1:
  [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

  RFC v2:
  https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

  Net patches based:
  git@github.com:isilence/linux.git zc_v4-net-base
  or
  https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

  The series introduces an io_uring concept of notifactors. From the userspace
  perspective it's an entity to which it can bind one or more requests and then
  requesting to flush it. Flushing a notifier makes it impossible to attach new
  requests to it, and instructs the notifier to post a completion once all
  requests attached to it are completed and the kernel doesn't need the buffers
  anymore.

  Notifications are stored in notification slots, which should be registered as
  an array in io_uring. Each slot stores only one notifier at any particular
  moment. Flushing removes it from the slot and the slot automatically replaces
  it with a new notifier. All operations with notifiers are done by specifying
  an index of a slot it's currently in.

  When registering a notification the userspace specifies a u64 tag for each
  slot, which will be copied in notification completion entries as
  cqe::user_data. cqe::res is 0 and cqe::flags is equal to wrap around u32
  sequence number counting notifiers of a slot.

Changelog:

  v4 -> v5
    remove ubuf_info checks from custom iov_iter callbacks to
    avoid disabling the page refs optimisations for TCP

  v3 -> v4
    custom iov_iter handling

  RFC v2 -> v3:
    mem accounting for non-registered buffers
    allow mixing registered and normal requests per notifier
    notification flushing via IORING_OP_RSRC_UPDATE
    TCP support
    fix buffer indexing
    fix io-wq ->uring_lock locking
    fix bugs when mixing with MSG_ZEROCOPY
    fix managed refs bugs in skbuff.c

  RFC -> RFC v2:
    remove additional overhead for non-zc from skb_release_data()
    avoid msg propagation, hide extra bits of non-zc overhead
    task_work based "buffer free" notifications
    improve io_uring's notification refcounting
    added 5/19, (no pfmemalloc tracking)
    added 8/19 and 9/19 preventing small copies with zc
    misc small changes

David Ahern (1):
  net: Allow custom iter handler in msghdr

Pavel Begunkov (26):
  ipv4: avoid partial copy for zc
  ipv6: avoid partial copy for zc
  skbuff: don't mix ubuf_info from different sources
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: carry external ubuf_info in msghdr
  net: introduce managed frags infrastructure
  net: introduce __skb_fill_page_desc_noacc
  ipv4/udp: support externally provided ubufs
  ipv6/udp: support externally provided ubufs
  tcp: support externally provided ubufs
  io_uring: initialise msghdr::msg_ubuf
  io_uring: export io_put_task()
  io_uring: add zc notification infrastructure
  io_uring: cache struct io_notif
  io_uring: complete notifiers in tw
  io_uring: add rsrc referencing for notifiers
  io_uring: add notification slot registration
  io_uring: wire send zc request type
  io_uring: account locked pages for non-fixed zc
  io_uring: allow to pass addr into sendzc
  io_uring: sendzc with fixed buffers
  io_uring: flush notifiers after sendzc
  io_uring: rename IORING_OP_FILES_UPDATE
  io_uring: add zc notification flush requests
  io_uring: enable managed frags with register buffers
  selftests/io_uring: test zerocopy send

 include/linux/io_uring_types.h                |  37 ++
 include/linux/skbuff.h                        |  66 +-
 include/linux/socket.h                        |   5 +
 include/uapi/linux/io_uring.h                 |  45 +-
 io_uring/Makefile                             |   2 +-
 io_uring/io_uring.c                           |  42 +-
 io_uring/io_uring.h                           |  22 +
 io_uring/net.c                                | 187 ++++++
 io_uring/net.h                                |   4 +
 io_uring/notif.c                              | 215 +++++++
 io_uring/notif.h                              |  87 +++
 io_uring/opdef.c                              |  24 +-
 io_uring/rsrc.c                               |  55 +-
 io_uring/rsrc.h                               |  16 +-
 io_uring/tctx.h                               |  26 -
 net/compat.c                                  |   1 +
 net/core/datagram.c                           |  14 +-
 net/core/skbuff.c                             |  37 +-
 net/ipv4/ip_output.c                          |  50 +-
 net/ipv4/tcp.c                                |  32 +-
 net/ipv6/ip6_output.c                         |  49 +-
 net/socket.c                                  |   3 +
 tools/testing/selftests/net/Makefile          |   1 +
 .../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
 .../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
 25 files changed, 1628 insertions(+), 128 deletions(-)
 create mode 100644 io_uring/notif.c
 create mode 100644 io_uring/notif.h
 create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
 create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

-- 
2.37.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-19  1:54   ` Jakub Kicinski
  2022-07-12 20:52 ` [PATCH net-next v5 02/27] ipv6: " Pavel Begunkov
                   ` (26 subsequent siblings)
  27 siblings, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles
on copy and iter manipulations and also misalignes potentially aligned
data. Avoid such coies. And as a bonus we can allocate smaller skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv4/ip_output.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 00b4bf26fd93..581d1e233260 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -969,7 +969,6 @@ static int __ip_append_data(struct sock *sk,
 	struct inet_sock *inet = inet_sk(sk);
 	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
-
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@ -977,6 +976,7 @@ static int __ip_append_data(struct sock *sk,
 	int copy;
 	int err;
 	int offset = 0;
+	bool zc = false;
 	unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
 	int csummode = CHECKSUM_NONE;
 	struct rtable *rt = (struct rtable *)cork->dst;
@@ -1025,6 +1025,7 @@ static int __ip_append_data(struct sock *sk,
 		if (rt->dst.dev->features & NETIF_F_SG &&
 		    csummode == CHECKSUM_PARTIAL) {
 			paged = true;
+			zc = true;
 		} else {
 			uarg->zerocopy = 0;
 			skb_zcopy_set(skb, uarg, &extra_uref);
@@ -1091,9 +1092,12 @@ static int __ip_append_data(struct sock *sk,
 				 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
 				  !(rt->dst.dev->features & NETIF_F_SG)))
 				alloclen = fraglen;
-			else {
+			else if (!zc) {
 				alloclen = min_t(int, fraglen, MAX_HEADER);
 				pagedlen = fraglen - alloclen;
+			} else {
+				alloclen = fragheaderlen + transhdrlen;
+				pagedlen = datalen - transhdrlen;
 			}
 
 			alloclen += alloc_extra;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 02/27] ipv6: avoid partial copy for zc
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 03/27] skbuff: don't mix ubuf_info from different sources Pavel Begunkov
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 128 bytes). It wastes CPU cycles
on copy and iter manipulations and also misalignes potentially aligned
data. Avoid such coies. And as a bonus we can allocate smaller skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv6/ip6_output.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 77e3f5970ce4..fc74ce3ed8cc 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1464,6 +1464,7 @@ static int __ip6_append_data(struct sock *sk,
 	int copy;
 	int err;
 	int offset = 0;
+	bool zc = false;
 	u32 tskey = 0;
 	struct rt6_info *rt = (struct rt6_info *)cork->dst;
 	struct ipv6_txoptions *opt = v6_cork->opt;
@@ -1549,6 +1550,7 @@ static int __ip6_append_data(struct sock *sk,
 		if (rt->dst.dev->features & NETIF_F_SG &&
 		    csummode == CHECKSUM_PARTIAL) {
 			paged = true;
+			zc = true;
 		} else {
 			uarg->zerocopy = 0;
 			skb_zcopy_set(skb, uarg, &extra_uref);
@@ -1630,9 +1632,12 @@ static int __ip6_append_data(struct sock *sk,
 				 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
 				  !(rt->dst.dev->features & NETIF_F_SG)))
 				alloclen = fraglen;
-			else {
+			else if (!zc) {
 				alloclen = min_t(int, fraglen, MAX_HEADER);
 				pagedlen = fraglen - alloclen;
+			} else {
+				alloclen = fragheaderlen + transhdrlen;
+				pagedlen = datalen - transhdrlen;
 			}
 			alloclen += alloc_extra;
 
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 03/27] skbuff: don't mix ubuf_info from different sources
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 02/27] ipv6: " Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 04/27] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

We should not append MSG_ZEROCOPY requests to skbuff with non
MSG_ZEROCOPY ubuf_info, they might be not compatible.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/core/skbuff.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b3559cb1d82..09f56bfa2771 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1212,6 +1212,10 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 		const u32 byte_limit = 1 << 19;		/* limit to a few TSO */
 		u32 bytelen, next;
 
+		/* there might be non MSG_ZEROCOPY users */
+		if (uarg->callback != msg_zerocopy_callback)
+			return NULL;
+
 		/* realloc only when socket is locked (TCP, UDP cork),
 		 * so uarg->len and sk_zckey access is serialized
 		 */
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 04/27] skbuff: add SKBFL_DONT_ORPHAN flag
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (2 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 03/27] skbuff: don't mix ubuf_info from different sources Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 05/27] skbuff: carry external ubuf_info in msghdr Pavel Begunkov
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

We don't want to list every single ubuf_info callback in
skb_orphan_frags(), add a flag controlling the behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 8 +++++---
 net/core/skbuff.c      | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d3d10556f0fa..8e12b3b9ad6c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -686,10 +686,13 @@ enum {
 	 * charged to the kernel memory.
 	 */
 	SKBFL_PURE_ZEROCOPY = BIT(2),
+
+	SKBFL_DONT_ORPHAN = BIT(3),
 };
 
 #define SKBFL_ZEROCOPY_FRAG	(SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
-#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY)
+#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
+				 SKBFL_DONT_ORPHAN)
 
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
@@ -3182,8 +3185,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
 	if (likely(!skb_zcopy(skb)))
 		return 0;
-	if (!skb_zcopy_is_nouarg(skb) &&
-	    skb_uarg(skb)->callback == msg_zerocopy_callback)
+	if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 09f56bfa2771..fc22b3d32052 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1193,7 +1193,7 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->len = 1;
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
-	uarg->flags = SKBFL_ZEROCOPY_FRAG;
+	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	refcount_set(&uarg->refcnt, 1);
 	sock_hold(sk);
 
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 05/27] skbuff: carry external ubuf_info in msghdr
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (3 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 04/27] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 06/27] net: Allow custom iter handler " Pavel Begunkov
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Make possible for network in-kernel callers like io_uring to pass in a
custom ubuf_info by setting it in a new field of struct msghdr.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/socket.h | 1 +
 net/compat.c           | 1 +
 net/socket.c           | 3 +++
 3 files changed, 5 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 17311ad9f9af..7bac9fc1cee0 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -69,6 +69,7 @@ struct msghdr {
 	unsigned int	msg_flags;	/* flags on received message */
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
 };
 
 struct user_msghdr {
diff --git a/net/compat.c b/net/compat.c
index 210fc3b4d0d8..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,7 @@ int __get_compat_msghdr(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*ptr = msg.msg_iov;
 	*len = msg.msg_iovlen;
 	return 0;
diff --git a/net/socket.c b/net/socket.c
index 2bc8773d9dc5..ed061609265e 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2106,6 +2106,7 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 	if (addr) {
 		err = move_addr_to_kernel(addr, addr_len, &address);
 		if (err < 0)
@@ -2171,6 +2172,7 @@ int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 	err = sock_recvmsg(sock, &msg, flags);
@@ -2409,6 +2411,7 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*uiov = msg.msg_iov;
 	*nsegs = msg.msg_iovlen;
 	return 0;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 06/27] net: Allow custom iter handler in msghdr
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (4 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 05/27] skbuff: carry external ubuf_info in msghdr Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 07/27] net: introduce managed frags infrastructure Pavel Begunkov
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

From: David Ahern <dsahern@kernel.org>

Add support for custom iov_iter handling to msghdr. The idea is that
in-kernel subsystems want control over how an SG is split.

Signed-off-by: David Ahern <dsahern@kernel.org>
[pavel: move callback into msghdr]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h |  7 ++++---
 include/linux/socket.h |  4 ++++
 net/core/datagram.c    | 14 ++++++++++----
 net/core/skbuff.c      |  2 +-
 4 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8e12b3b9ad6c..a8a2dd4cfdfd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1776,13 +1776,14 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
 			   bool success);
 
-int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length);
+int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
+			    struct sk_buff *skb, struct iov_iter *from,
+			    size_t length);
 
 static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
 					  struct msghdr *msg, int len)
 {
-	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
 }
 
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 7bac9fc1cee0..3c11ef18a9cf 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -14,6 +14,8 @@ struct file;
 struct pid;
 struct cred;
 struct socket;
+struct sock;
+struct sk_buff;
 
 #define __sockaddr_check_size(size)	\
 	BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage)))
@@ -70,6 +72,8 @@ struct msghdr {
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
 	struct ubuf_info *msg_ubuf;
+	int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length);
 };
 
 struct user_msghdr {
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 50f4faeea76c..28cdb79df74d 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -613,10 +613,16 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);
 
-int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length)
+int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
+			    struct sk_buff *skb, struct iov_iter *from,
+			    size_t length)
 {
-	int frag = skb_shinfo(skb)->nr_frags;
+	int frag;
+
+	if (msg && msg->sg_from_iter)
+		return msg->sg_from_iter(sk, skb, from, length);
+
+	frag = skb_shinfo(skb)->nr_frags;
 
 	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
@@ -702,7 +708,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
 		return -EFAULT;
 
-	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+	return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U);
 }
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fc22b3d32052..f5a3ebbc1f7e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1358,7 +1358,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 	if (orig_uarg && uarg != orig_uarg)
 		return -EEXIST;
 
-	err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
+	err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len);
 	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
 		struct sock *save_sk = skb->sk;
 
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 07/27] net: introduce managed frags infrastructure
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (5 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 06/27] net: Allow custom iter handler " Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 08/27] net: introduce __skb_fill_page_desc_noacc Pavel Begunkov
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Some users like io_uring can do page pinning more efficiently, so we
want a way to delegate referencing to other subsystems. For that add
a new flag called SKBFL_MANAGED_FRAG_REFS. When set, skb doesn't hold
page references and upper layers are responsivle to managing page
lifetime.

It's allowed to convert skbs from managed to normal by calling
skb_zcopy_downgrade_managed(). The function will take all needed
page references and clear the flag. It's needed, for instance,
to avoid mixing managed modes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 25 +++++++++++++++++++++++--
 net/core/skbuff.c      | 29 +++++++++++++++++++++++++++--
 2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a8a2dd4cfdfd..07004593d7ca 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -688,11 +688,16 @@ enum {
 	SKBFL_PURE_ZEROCOPY = BIT(2),
 
 	SKBFL_DONT_ORPHAN = BIT(3),
+
+	/* page references are managed by the ubuf_info, so it's safe to
+	 * use frags only up until ubuf_info is released
+	 */
+	SKBFL_MANAGED_FRAG_REFS = BIT(4),
 };
 
 #define SKBFL_ZEROCOPY_FRAG	(SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
 #define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
-				 SKBFL_DONT_ORPHAN)
+				 SKBFL_DONT_ORPHAN | SKBFL_MANAGED_FRAG_REFS)
 
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
@@ -1810,6 +1815,11 @@ static inline bool skb_zcopy_pure(const struct sk_buff *skb)
 	return skb_shinfo(skb)->flags & SKBFL_PURE_ZEROCOPY;
 }
 
+static inline bool skb_zcopy_managed(const struct sk_buff *skb)
+{
+	return skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAG_REFS;
+}
+
 static inline bool skb_pure_zcopy_same(const struct sk_buff *skb1,
 				       const struct sk_buff *skb2)
 {
@@ -1884,6 +1894,14 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy_success)
 	}
 }
 
+void __skb_zcopy_downgrade_managed(struct sk_buff *skb);
+
+static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+	if (unlikely(skb_zcopy_managed(skb)))
+		__skb_zcopy_downgrade_managed(skb);
+}
+
 static inline void skb_mark_not_on_list(struct sk_buff *skb)
 {
 	skb->next = NULL;
@@ -3499,7 +3517,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
  */
 static inline void skb_frag_unref(struct sk_buff *skb, int f)
 {
-	__skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	if (!skb_zcopy_managed(skb))
+		__skb_frag_unref(&shinfo->frags[f], skb->pp_recycle);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f5a3ebbc1f7e..cf4107d80bc4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -666,11 +666,18 @@ static void skb_release_data(struct sk_buff *skb)
 			      &shinfo->dataref))
 		goto exit;
 
-	skb_zcopy_clear(skb, true);
+	if (skb_zcopy(skb)) {
+		bool skip_unref = shinfo->flags & SKBFL_MANAGED_FRAG_REFS;
+
+		skb_zcopy_clear(skb, true);
+		if (skip_unref)
+			goto free_head;
+	}
 
 	for (i = 0; i < shinfo->nr_frags; i++)
 		__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
 
+free_head:
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
 
@@ -895,7 +902,10 @@ EXPORT_SYMBOL(skb_dump);
  */
 void skb_tx_error(struct sk_buff *skb)
 {
-	skb_zcopy_clear(skb, true);
+	if (skb) {
+		skb_zcopy_downgrade_managed(skb);
+		skb_zcopy_clear(skb, true);
+	}
 }
 EXPORT_SYMBOL(skb_tx_error);
 
@@ -1375,6 +1385,16 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
 
+void __skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+	int i;
+
+	skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAG_REFS;
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+		skb_frag_ref(skb, i);
+}
+EXPORT_SYMBOL_GPL(__skb_zcopy_downgrade_managed);
+
 static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
@@ -1692,6 +1712,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 
 	BUG_ON(skb_shared(skb));
 
+	skb_zcopy_downgrade_managed(skb);
+
 	size = SKB_DATA_ALIGN(size);
 
 	if (skb_pfmemalloc(skb))
@@ -3488,6 +3510,8 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
 	int pos = skb_headlen(skb);
 	const int zc_flags = SKBFL_SHARED_FRAG | SKBFL_PURE_ZEROCOPY;
 
+	skb_zcopy_downgrade_managed(skb);
+
 	skb_shinfo(skb1)->flags |= skb_shinfo(skb)->flags & zc_flags;
 	skb_zerocopy_clone(skb1, skb, 0);
 	if (len < pos)	/* Split line is inside header. */
@@ -3841,6 +3865,7 @@ int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
 	if (skb_can_coalesce(skb, i, page, offset)) {
 		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
 	} else if (i < MAX_SKB_FRAGS) {
+		skb_zcopy_downgrade_managed(skb);
 		get_page(page);
 		skb_fill_page_desc(skb, i, page, offset, size);
 	} else {
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 08/27] net: introduce __skb_fill_page_desc_noacc
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (6 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 07/27] net: introduce managed frags infrastructure Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 09/27] ipv4/udp: support externally provided ubufs Pavel Begunkov
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Managed pages contain pinned userspace pages and controlled by upper
layers, there is no need in tracking skb->pfmemalloc for them. Introduce
a helper for filling frags but ignoring page tracking, it'll be needed
later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 07004593d7ca..1111adefd906 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2550,6 +2550,22 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
+static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
+					      int i, struct page *page,
+					      int off, int size)
+{
+	skb_frag_t *frag = &shinfo->frags[i];
+
+	/*
+	 * Propagate page pfmemalloc to the skb if we can. The problem is
+	 * that not all callers have unique ownership of the page but rely
+	 * on page_is_pfmemalloc doing the right thing(tm).
+	 */
+	frag->bv_page		  = page;
+	frag->bv_offset		  = off;
+	skb_frag_size_set(frag, size);
+}
+
 /**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
@@ -2566,17 +2582,7 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 					struct page *page, int off, int size)
 {
-	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
-	/*
-	 * Propagate page pfmemalloc to the skb if we can. The problem is
-	 * that not all callers have unique ownership of the page but rely
-	 * on page_is_pfmemalloc doing the right thing(tm).
-	 */
-	frag->bv_page		  = page;
-	frag->bv_offset		  = off;
-	skb_frag_size_set(frag, size);
-
+	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
 	page = compound_head(page);
 	if (page_is_pfmemalloc(page))
 		skb->pfmemalloc	= true;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 09/27] ipv4/udp: support externally provided ubufs
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (7 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 08/27] net: introduce __skb_fill_page_desc_noacc Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 10/27] ipv6/udp: " Pavel Begunkov
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Teach ipv4/udp how to use external ubuf_info provided in msghdr and
also prepare it for managed frags by sprinkling
skb_zcopy_downgrade_managed() when it could mix managed and not managed
frags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv4/ip_output.c | 44 +++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 581d1e233260..df7f9dfbe8be 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1017,18 +1017,35 @@ static int __ip_append_data(struct sock *sk,
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;
 
-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-			zc = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+				return -EINVAL;
+
+			/* Leave uarg NULL if can't zerocopy, callers should
+			 * be able to handle it.
+			 */
+			if ((rt->dst.dev->features & NETIF_F_SG) &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+				uarg = msg->msg_ubuf;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
 		}
 	}
 
@@ -1192,13 +1209,14 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
 			if (!sk_page_frag_refill(sk, pfrag))
 				goto error;
 
+			skb_zcopy_downgrade_managed(skb);
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
 				err = -EMSGSIZE;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 10/27] ipv6/udp: support externally provided ubufs
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (8 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 09/27] ipv4/udp: support externally provided ubufs Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 11/27] tcp: " Pavel Begunkov
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Teach ipv6/udp how to use external ubuf_info provided in msghdr and
also prepare it for managed frags by sprinkling
skb_zcopy_downgrade_managed() when it could mix managed and not managed
frags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv6/ip6_output.c | 44 ++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index fc74ce3ed8cc..897ca4f9b791 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1542,18 +1542,35 @@ static int __ip6_append_data(struct sock *sk,
 	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
 		csummode = CHECKSUM_PARTIAL;
 
-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-			zc = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+				return -EINVAL;
+
+			/* Leave uarg NULL if can't zerocopy, callers should
+			 * be able to handle it.
+			 */
+			if ((rt->dst.dev->features & NETIF_F_SG) &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+				uarg = msg->msg_ubuf;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
 		}
 	}
 
@@ -1747,13 +1764,14 @@ static int __ip6_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
 			if (!sk_page_frag_refill(sk, pfrag))
 				goto error;
 
+			skb_zcopy_downgrade_managed(skb);
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
 				err = -EMSGSIZE;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 11/27] tcp: support externally provided ubufs
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (9 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 10/27] ipv6/udp: " Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 12/27] io_uring: initialise msghdr::msg_ubuf Pavel Begunkov
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Teach tcp how to use external ubuf_info provided in msghdr and
also prepare it for managed frags by sprinkling
skb_zcopy_downgrade_managed() when it could mix managed and not managed
frags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv4/tcp.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 390eb3dc53bd..a81f694af5e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1223,17 +1223,23 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 	flags = msg->msg_flags;
 
-	if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) {
+	if ((flags & MSG_ZEROCOPY) && size) {
 		skb = tcp_write_queue_tail(sk);
-		uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
-		if (!uarg) {
-			err = -ENOBUFS;
-			goto out_err;
-		}
 
-		zc = sk->sk_route_caps & NETIF_F_SG;
-		if (!zc)
-			uarg->zerocopy = 0;
+		if (msg->msg_ubuf) {
+			uarg = msg->msg_ubuf;
+			net_zcopy_get(uarg);
+			zc = sk->sk_route_caps & NETIF_F_SG;
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			if (!uarg) {
+				err = -ENOBUFS;
+				goto out_err;
+			}
+			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (!zc)
+				uarg->zerocopy = 0;
+		}
 	}
 
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1356,9 +1362,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);
 
-			if (tcp_downgrade_zcopy_pure(sk, skb))
-				goto wait_for_space;
-
+			if (unlikely(skb_zcopy_pure(skb) || skb_zcopy_managed(skb))) {
+				if (tcp_downgrade_zcopy_pure(sk, skb))
+					goto wait_for_space;
+				skb_zcopy_downgrade_managed(skb);
+			}
 			copy = tcp_wmem_schedule(sk, copy);
 			if (!copy)
 				goto wait_for_space;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 12/27] io_uring: initialise msghdr::msg_ubuf
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (10 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 11/27] tcp: " Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 13/27] io_uring: export io_put_task() Pavel Begunkov
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Initialise newly added ->msg_ubuf in io_recv() and io_send().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/io_uring/net.c b/io_uring/net.c
index cb08a4b62840..2dd61fcf91d8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -255,6 +255,7 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = sr->msg_flags;
 	if (issue_flags & IO_URING_F_NONBLOCK)
@@ -601,6 +602,7 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_flags = 0;
 	msg.msg_controllen = 0;
 	msg.msg_iocb = NULL;
+	msg.msg_ubuf = NULL;
 
 	flags = sr->msg_flags;
 	if (force_nonblock)
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 13/27] io_uring: export io_put_task()
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (11 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 12/27] io_uring: initialise msghdr::msg_ubuf Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 14/27] io_uring: add zc notification infrastructure Pavel Begunkov
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Make io_put_task() available to non-core parts of io_uring, we'll need
it for notification infrastructure.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h | 25 +++++++++++++++++++++++++
 io_uring/io_uring.c            | 11 +----------
 io_uring/io_uring.h            | 10 ++++++++++
 io_uring/tctx.h                | 26 --------------------------
 4 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 26ef11e978d4..d876a0367081 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -4,6 +4,7 @@
 #include <linux/blkdev.h>
 #include <linux/task_work.h>
 #include <linux/bitmap.h>
+#include <linux/llist.h>
 #include <uapi/linux/io_uring.h>
 
 struct io_wq_work_node {
@@ -43,6 +44,30 @@ struct io_hash_table {
 	unsigned		hash_bits;
 };
 
+/*
+ * Arbitrary limit, can be raised if need be
+ */
+#define IO_RINGFD_REG_MAX 16
+
+struct io_uring_task {
+	/* submission side */
+	int				cached_refs;
+	const struct io_ring_ctx 	*last;
+	struct io_wq			*io_wq;
+	struct file			*registered_rings[IO_RINGFD_REG_MAX];
+
+	struct xarray			xa;
+	struct wait_queue_head		wait;
+	atomic_t			in_idle;
+	atomic_t			inflight_tracked;
+	struct percpu_counter		inflight;
+
+	struct { /* task_work */
+		struct llist_head	task_list;
+		struct callback_head	task_work;
+	} ____cacheline_aligned_in_smp;
+};
+
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
 	u32 tail ____cacheline_aligned_in_smp;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index caf979cd4327..bb644b1b575a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -602,7 +602,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 	return ret;
 }
 
-static void __io_put_task(struct task_struct *task, int nr)
+void __io_put_task(struct task_struct *task, int nr)
 {
 	struct io_uring_task *tctx = task->io_uring;
 
@@ -612,15 +612,6 @@ static void __io_put_task(struct task_struct *task, int nr)
 	put_task_struct_many(task, nr);
 }
 
-/* must to be called somewhat shortly after putting a request */
-static inline void io_put_task(struct task_struct *task, int nr)
-{
-	if (likely(task == current))
-		task->io_uring->cached_refs += nr;
-	else
-		__io_put_task(task, nr);
-}
-
 static void io_task_refs_refill(struct io_uring_task *tctx)
 {
 	unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 868f45d55543..2379d9e70c10 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -66,6 +66,7 @@ void io_wq_submit_work(struct io_wq_work *work);
 
 void io_free_req(struct io_kiocb *req);
 void io_queue_next(struct io_kiocb *req);
+void __io_put_task(struct task_struct *task, int nr);
 
 bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
 			bool cancel_all);
@@ -253,4 +254,13 @@ static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
 		__io_commit_cqring_flush(ctx);
 }
 
+/* must to be called somewhat shortly after putting a request */
+static inline void io_put_task(struct task_struct *task, int nr)
+{
+	if (likely(task == current))
+		task->io_uring->cached_refs += nr;
+	else
+		__io_put_task(task, nr);
+}
+
 #endif
diff --git a/io_uring/tctx.h b/io_uring/tctx.h
index 8a33ff6e5d91..25974beed4d6 100644
--- a/io_uring/tctx.h
+++ b/io_uring/tctx.h
@@ -1,31 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
 
-#include <linux/llist.h>
-
-/*
- * Arbitrary limit, can be raised if need be
- */
-#define IO_RINGFD_REG_MAX 16
-
-struct io_uring_task {
-	/* submission side */
-	int				cached_refs;
-	const struct io_ring_ctx 	*last;
-	struct io_wq			*io_wq;
-	struct file			*registered_rings[IO_RINGFD_REG_MAX];
-
-	struct xarray			xa;
-	struct wait_queue_head		wait;
-	atomic_t			in_idle;
-	atomic_t			inflight_tracked;
-	struct percpu_counter		inflight;
-
-	struct { /* task_work */
-		struct llist_head	task_list;
-		struct callback_head	task_work;
-	} ____cacheline_aligned_in_smp;
-};
-
 struct io_tctx_node {
 	struct list_head	ctx_node;
 	struct task_struct	*task;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 14/27] io_uring: add zc notification infrastructure
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (12 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 13/27] io_uring: export io_put_task() Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 15/27] io_uring: cache struct io_notif Pavel Begunkov
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Add internal part of send zerocopy notifications. There are two main
structures, the first one is struct io_notif, which carries inside
struct ubuf_info and maps 1:1 to it. io_uring will be binding a number
of zerocopy send requests to it and ask to complete (aka flush) it. When
flushed and all attached requests and skbs complete, it'll generate one
and only one CQE. There are intended to be passed into the network layer
as struct msghdr::msg_ubuf.

The second concept is notification slots. The userspace will be able to
register an array of slots and subsequently addressing them by the index
in the array. Slots are independent of each other. Each slot can have
only one notifier at a time (called active notifier) but many notifiers
during the lifetime. When active, a notifier not going to post any
completion but the userspace can attach requests to it by specifying
the corresponding slot while issueing send zc requests. Eventually, the
userspace will want to "flush" the notifier losing any way to attach
new requests to it, however it can use the next atomatically added
notifier of this slot or of any other slot.

When the network layer is done with all enqueued skbs attached to a
notifier and doesn't need the specified in them user data, the flushed
notifier will post a CQE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h |   5 ++
 io_uring/Makefile              |   2 +-
 io_uring/io_uring.c            |   8 ++-
 io_uring/io_uring.h            |   2 +
 io_uring/notif.c               | 102 +++++++++++++++++++++++++++++++++
 io_uring/notif.h               |  64 +++++++++++++++++++++
 6 files changed, 179 insertions(+), 4 deletions(-)
 create mode 100644 io_uring/notif.c
 create mode 100644 io_uring/notif.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d876a0367081..95334e678586 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -34,6 +34,9 @@ struct io_file_table {
 	unsigned int alloc_hint;
 };
 
+struct io_notif;
+struct io_notif_slot;
+
 struct io_hash_bucket {
 	spinlock_t		lock;
 	struct hlist_head	list;
@@ -232,6 +235,8 @@ struct io_ring_ctx {
 		unsigned		nr_user_files;
 		unsigned		nr_user_bufs;
 		struct io_mapped_ubuf	**user_bufs;
+		struct io_notif_slot	*notif_slots;
+		unsigned		nr_notif_slots;
 
 		struct io_submit_state	submit_state;
 
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 466639c289be..8cc8e5387a75 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,5 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					openclose.o uring_cmd.o epoll.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
-					cancel.o kbuf.o rsrc.o rw.o opdef.o
+					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bb644b1b575a..ad816afe2345 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -89,6 +89,7 @@
 #include "kbuf.h"
 #include "rsrc.h"
 #include "cancel.h"
+#include "notif.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -726,9 +727,8 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx)
 	return &rings->cqes[off];
 }
 
-static bool io_fill_cqe_aux(struct io_ring_ctx *ctx,
-			    u64 user_data, s32 res, u32 cflags,
-			    bool allow_overflow)
+bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+		     bool allow_overflow)
 {
 	struct io_uring_cqe *cqe;
 
@@ -2496,6 +2496,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	}
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
+	WARN_ON_ONCE(ctx->notif_slots || ctx->nr_notif_slots);
 
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
@@ -2672,6 +2673,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 		io_unregister_personality(ctx, index);
 	if (ctx->rings)
 		io_poll_remove_all(ctx, NULL, true);
+	io_notif_unregister(ctx);
 	mutex_unlock(&ctx->uring_lock);
 
 	/* failed during ring init, it couldn't have issued any requests */
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2379d9e70c10..b8c858727dc8 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -33,6 +33,8 @@ void io_req_complete_post(struct io_kiocb *req);
 void __io_req_complete_post(struct io_kiocb *req);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
 		     bool allow_overflow);
+bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+		     bool allow_overflow);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
 struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
diff --git a/io_uring/notif.c b/io_uring/notif.c
new file mode 100644
index 000000000000..6ee948af6a49
--- /dev/null
+++ b/io_uring/notif.c
@@ -0,0 +1,102 @@
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/net.h>
+#include <linux/io_uring.h>
+
+#include "io_uring.h"
+#include "notif.h"
+
+static void __io_notif_complete_tw(struct callback_head *cb)
+{
+	struct io_notif *notif = container_of(cb, struct io_notif, task_work);
+	struct io_ring_ctx *ctx = notif->ctx;
+
+	io_cq_lock(ctx);
+	io_fill_cqe_aux(ctx, notif->tag, 0, notif->seq, true);
+	io_cq_unlock_post(ctx);
+
+	percpu_ref_put(&ctx->refs);
+	kfree(notif);
+}
+
+static inline void io_notif_complete(struct io_notif *notif)
+{
+	__io_notif_complete_tw(&notif->task_work);
+}
+
+static void io_notif_complete_wq(struct work_struct *work)
+{
+	struct io_notif *notif = container_of(work, struct io_notif, commit_work);
+
+	io_notif_complete(notif);
+}
+
+static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
+					  struct ubuf_info *uarg,
+					  bool success)
+{
+	struct io_notif *notif = container_of(uarg, struct io_notif, uarg);
+
+	if (!refcount_dec_and_test(&uarg->refcnt))
+		return;
+	INIT_WORK(&notif->commit_work, io_notif_complete_wq);
+	queue_work(system_unbound_wq, &notif->commit_work);
+}
+
+struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
+				struct io_notif_slot *slot)
+	__must_hold(&ctx->uring_lock)
+{
+	struct io_notif *notif;
+
+	notif = kzalloc(sizeof(*notif), GFP_ATOMIC | __GFP_ACCOUNT);
+	if (!notif)
+		return NULL;
+
+	notif->seq = slot->seq++;
+	notif->tag = slot->tag;
+	notif->ctx = ctx;
+	notif->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	notif->uarg.callback = io_uring_tx_zerocopy_callback;
+	/* master ref owned by io_notif_slot, will be dropped on flush */
+	refcount_set(&notif->uarg.refcnt, 1);
+	percpu_ref_get(&ctx->refs);
+	return notif;
+}
+
+static void io_notif_slot_flush(struct io_notif_slot *slot)
+	__must_hold(&ctx->uring_lock)
+{
+	struct io_notif *notif = slot->notif;
+
+	slot->notif = NULL;
+
+	if (WARN_ON_ONCE(in_interrupt()))
+		return;
+	/* drop slot's master ref */
+	if (refcount_dec_and_test(&notif->uarg.refcnt))
+		io_notif_complete(notif);
+}
+
+__cold int io_notif_unregister(struct io_ring_ctx *ctx)
+	__must_hold(&ctx->uring_lock)
+{
+	int i;
+
+	if (!ctx->notif_slots)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->nr_notif_slots; i++) {
+		struct io_notif_slot *slot = &ctx->notif_slots[i];
+
+		if (slot->notif)
+			io_notif_slot_flush(slot);
+	}
+
+	kvfree(ctx->notif_slots);
+	ctx->notif_slots = NULL;
+	ctx->nr_notif_slots = 0;
+	return 0;
+}
\ No newline at end of file
diff --git a/io_uring/notif.h b/io_uring/notif.h
new file mode 100644
index 000000000000..3d7a1d242e17
--- /dev/null
+++ b/io_uring/notif.h
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/net.h>
+#include <linux/uio.h>
+#include <net/sock.h>
+#include <linux/nospec.h>
+
+struct io_notif {
+	struct ubuf_info	uarg;
+	struct io_ring_ctx	*ctx;
+
+	/* cqe->user_data, io_notif_slot::tag if not overridden */
+	u64			tag;
+	/* see struct io_notif_slot::seq */
+	u32			seq;
+
+	union {
+		struct callback_head	task_work;
+		struct work_struct	commit_work;
+	};
+};
+
+struct io_notif_slot {
+	/*
+	 * Current/active notifier. A slot holds only one active notifier at a
+	 * time and keeps one reference to it. Flush releases the reference and
+	 * lazily replaces it with a new notifier.
+	 */
+	struct io_notif		*notif;
+
+	/*
+	 * Default ->user_data for this slot notifiers CQEs
+	 */
+	u64			tag;
+	/*
+	 * Notifiers of a slot live in generations, we create a new notifier
+	 * only after flushing the previous one. Track the sequential number
+	 * for all notifiers and copy it into notifiers's cqe->cflags
+	 */
+	u32			seq;
+};
+
+int io_notif_unregister(struct io_ring_ctx *ctx);
+
+struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
+				struct io_notif_slot *slot);
+
+static inline struct io_notif *io_get_notif(struct io_ring_ctx *ctx,
+					    struct io_notif_slot *slot)
+{
+	if (!slot->notif)
+		slot->notif = io_alloc_notif(ctx, slot);
+	return slot->notif;
+}
+
+static inline struct io_notif_slot *io_get_notif_slot(struct io_ring_ctx *ctx,
+						      int idx)
+	__must_hold(&ctx->uring_lock)
+{
+	if (idx >= ctx->nr_notif_slots)
+		return NULL;
+	idx = array_index_nospec(idx, ctx->nr_notif_slots);
+	return &ctx->notif_slots[idx];
+}
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 15/27] io_uring: cache struct io_notif
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (13 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 14/27] io_uring: add zc notification infrastructure Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 16/27] io_uring: complete notifiers in tw Pavel Begunkov
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

kmalloc'ing struct io_notif is too expensive when done frequently, cache
them as many other resources in io_uring. Keep two list, the first one
is from where we're getting notifiers, it's protected by ->uring_lock.
The second is protected by ->completion_lock, to which we queue released
notifiers. Then we splice one list into another when needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h |  7 +++++
 io_uring/io_uring.c            |  3 ++
 io_uring/notif.c               | 57 +++++++++++++++++++++++++++++-----
 io_uring/notif.h               |  5 +++
 4 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 95334e678586..66ab009e7a6b 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -244,6 +244,9 @@ struct io_ring_ctx {
 		struct xarray		io_bl_xa;
 		struct list_head	io_buffers_cache;
 
+		/* struct io_notif cache, protected by uring_lock */
+		struct list_head	notif_list;
+
 		struct io_hash_table	cancel_table_locked;
 		struct list_head	cq_overflow_list;
 		struct list_head	apoll_cache;
@@ -255,6 +258,10 @@ struct io_ring_ctx {
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
 
+	/* struct io_notif cache protected by completion_lock */
+	struct list_head	notif_list_locked;
+	unsigned int		notif_locked_nr;
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ad816afe2345..bdc5a2839d94 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -318,6 +318,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_WQ_LIST(&ctx->locked_free_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+	INIT_LIST_HEAD(&ctx->notif_list);
+	INIT_LIST_HEAD(&ctx->notif_list_locked);
 	return ctx;
 err:
 	kfree(ctx->dummy_ubuf);
@@ -2498,6 +2500,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 	WARN_ON_ONCE(ctx->notif_slots || ctx->nr_notif_slots);
 
+	io_notif_cache_purge(ctx);
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
 
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 6ee948af6a49..b257db2120b4 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -15,10 +15,12 @@ static void __io_notif_complete_tw(struct callback_head *cb)
 
 	io_cq_lock(ctx);
 	io_fill_cqe_aux(ctx, notif->tag, 0, notif->seq, true);
+
+	list_add(&notif->cache_node, &ctx->notif_list_locked);
+	ctx->notif_locked_nr++;
 	io_cq_unlock_post(ctx);
 
 	percpu_ref_put(&ctx->refs);
-	kfree(notif);
 }
 
 static inline void io_notif_complete(struct io_notif *notif)
@@ -45,21 +47,62 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 	queue_work(system_unbound_wq, &notif->commit_work);
 }
 
+static void io_notif_splice_cached(struct io_ring_ctx *ctx)
+	__must_hold(&ctx->uring_lock)
+{
+	spin_lock(&ctx->completion_lock);
+	list_splice_init(&ctx->notif_list_locked, &ctx->notif_list);
+	ctx->notif_locked_nr = 0;
+	spin_unlock(&ctx->completion_lock);
+}
+
+void io_notif_cache_purge(struct io_ring_ctx *ctx)
+	__must_hold(&ctx->uring_lock)
+{
+	io_notif_splice_cached(ctx);
+
+	while (!list_empty(&ctx->notif_list)) {
+		struct io_notif *notif = list_first_entry(&ctx->notif_list,
+						struct io_notif, cache_node);
+
+		list_del(&notif->cache_node);
+		kfree(notif);
+	}
+}
+
+static inline bool io_notif_has_cached(struct io_ring_ctx *ctx)
+	__must_hold(&ctx->uring_lock)
+{
+	if (likely(!list_empty(&ctx->notif_list)))
+		return true;
+	if (data_race(READ_ONCE(ctx->notif_locked_nr) <= IO_NOTIF_SPLICE_BATCH))
+		return false;
+	io_notif_splice_cached(ctx);
+	return !list_empty(&ctx->notif_list);
+}
+
 struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
 				struct io_notif_slot *slot)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_notif *notif;
 
-	notif = kzalloc(sizeof(*notif), GFP_ATOMIC | __GFP_ACCOUNT);
-	if (!notif)
-		return NULL;
+	if (likely(io_notif_has_cached(ctx))) {
+		notif = list_first_entry(&ctx->notif_list,
+					 struct io_notif, cache_node);
+		list_del(&notif->cache_node);
+	} else {
+		notif = kzalloc(sizeof(*notif), GFP_ATOMIC | __GFP_ACCOUNT);
+		if (!notif)
+			return NULL;
+		/* pre-initialise some fields */
+		notif->ctx = ctx;
+		notif->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+		notif->uarg.callback = io_uring_tx_zerocopy_callback;
+	}
 
 	notif->seq = slot->seq++;
 	notif->tag = slot->tag;
-	notif->ctx = ctx;
-	notif->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
-	notif->uarg.callback = io_uring_tx_zerocopy_callback;
 	/* master ref owned by io_notif_slot, will be dropped on flush */
 	refcount_set(&notif->uarg.refcnt, 1);
 	percpu_ref_get(&ctx->refs);
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 3d7a1d242e17..b23c9c0515bb 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -5,6 +5,8 @@
 #include <net/sock.h>
 #include <linux/nospec.h>
 
+#define IO_NOTIF_SPLICE_BATCH	32
+
 struct io_notif {
 	struct ubuf_info	uarg;
 	struct io_ring_ctx	*ctx;
@@ -13,6 +15,8 @@ struct io_notif {
 	u64			tag;
 	/* see struct io_notif_slot::seq */
 	u32			seq;
+	/* hook into ctx->notif_list and ctx->notif_list_locked */
+	struct list_head	cache_node;
 
 	union {
 		struct callback_head	task_work;
@@ -41,6 +45,7 @@ struct io_notif_slot {
 };
 
 int io_notif_unregister(struct io_ring_ctx *ctx);
+void io_notif_cache_purge(struct io_ring_ctx *ctx);
 
 struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
 				struct io_notif_slot *slot);
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 16/27] io_uring: complete notifiers in tw
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (14 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 15/27] io_uring: cache struct io_notif Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 17/27] io_uring: add rsrc referencing for notifiers Pavel Begunkov
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

We need a task context to post CQEs but using wq is too expensive.
Try to complete notifiers using task_work and fall back to wq if fails.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/notif.c | 22 +++++++++++++++++++---
 io_uring/notif.h |  3 +++
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/io_uring/notif.c b/io_uring/notif.c
index b257db2120b4..aec74f88fc33 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -13,6 +13,11 @@ static void __io_notif_complete_tw(struct callback_head *cb)
 	struct io_notif *notif = container_of(cb, struct io_notif, task_work);
 	struct io_ring_ctx *ctx = notif->ctx;
 
+	if (likely(notif->task)) {
+		io_put_task(notif->task, 1);
+		notif->task = NULL;
+	}
+
 	io_cq_lock(ctx);
 	io_fill_cqe_aux(ctx, notif->tag, 0, notif->seq, true);
 
@@ -43,6 +48,14 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 
 	if (!refcount_dec_and_test(&uarg->refcnt))
 		return;
+
+	if (likely(notif->task)) {
+		init_task_work(&notif->task_work, __io_notif_complete_tw);
+		if (likely(!task_work_add(notif->task, &notif->task_work,
+					  TWA_SIGNAL)))
+			return;
+	}
+
 	INIT_WORK(&notif->commit_work, io_notif_complete_wq);
 	queue_work(system_unbound_wq, &notif->commit_work);
 }
@@ -134,12 +147,15 @@ __cold int io_notif_unregister(struct io_ring_ctx *ctx)
 	for (i = 0; i < ctx->nr_notif_slots; i++) {
 		struct io_notif_slot *slot = &ctx->notif_slots[i];
 
-		if (slot->notif)
-			io_notif_slot_flush(slot);
+		if (!slot->notif)
+			continue;
+		if (WARN_ON_ONCE(slot->notif->task))
+			slot->notif->task = NULL;
+		io_notif_slot_flush(slot);
 	}
 
 	kvfree(ctx->notif_slots);
 	ctx->notif_slots = NULL;
 	ctx->nr_notif_slots = 0;
 	return 0;
-}
\ No newline at end of file
+}
diff --git a/io_uring/notif.h b/io_uring/notif.h
index b23c9c0515bb..23ca7620fff9 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -11,6 +11,9 @@ struct io_notif {
 	struct ubuf_info	uarg;
 	struct io_ring_ctx	*ctx;
 
+	/* complete via tw if ->task is non-NULL, fallback to wq otherwise */
+	struct task_struct	*task;
+
 	/* cqe->user_data, io_notif_slot::tag if not overridden */
 	u64			tag;
 	/* see struct io_notif_slot::seq */
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 17/27] io_uring: add rsrc referencing for notifiers
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (15 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 16/27] io_uring: complete notifiers in tw Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 18/27] io_uring: add notification slot registration Pavel Begunkov
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

In preparation to zerocopy sends with fixed buffers make notifiers to
reference the rsrc node to protect the used fixed buffers. We can't just
grab it for a send request as notifiers can likely outlive requests that
used it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/notif.c |  5 +++++
 io_uring/notif.h |  1 +
 io_uring/rsrc.h  | 12 +++++++++---
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/io_uring/notif.c b/io_uring/notif.c
index aec74f88fc33..0a2e98bd74f6 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -7,10 +7,12 @@
 
 #include "io_uring.h"
 #include "notif.h"
+#include "rsrc.h"
 
 static void __io_notif_complete_tw(struct callback_head *cb)
 {
 	struct io_notif *notif = container_of(cb, struct io_notif, task_work);
+	struct io_rsrc_node *rsrc_node = notif->rsrc_node;
 	struct io_ring_ctx *ctx = notif->ctx;
 
 	if (likely(notif->task)) {
@@ -25,6 +27,7 @@ static void __io_notif_complete_tw(struct callback_head *cb)
 	ctx->notif_locked_nr++;
 	io_cq_unlock_post(ctx);
 
+	io_rsrc_put_node(rsrc_node, 1);
 	percpu_ref_put(&ctx->refs);
 }
 
@@ -119,6 +122,8 @@ struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
 	/* master ref owned by io_notif_slot, will be dropped on flush */
 	refcount_set(&notif->uarg.refcnt, 1);
 	percpu_ref_get(&ctx->refs);
+	notif->rsrc_node = ctx->rsrc_node;
+	io_charge_rsrc_node(ctx);
 	return notif;
 }
 
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 23ca7620fff9..1dd48efb7744 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -10,6 +10,7 @@
 struct io_notif {
 	struct ubuf_info	uarg;
 	struct io_ring_ctx	*ctx;
+	struct io_rsrc_node	*rsrc_node;
 
 	/* complete via tw if ->task is non-NULL, fallback to wq otherwise */
 	struct task_struct	*task;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 87f58315b247..af342fd239d0 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -135,6 +135,13 @@ static inline void io_req_put_rsrc_locked(struct io_kiocb *req,
 	}
 }
 
+static inline void io_charge_rsrc_node(struct io_ring_ctx *ctx)
+{
+	ctx->rsrc_cached_refs--;
+	if (unlikely(ctx->rsrc_cached_refs < 0))
+		io_rsrc_refs_refill(ctx);
+}
+
 static inline void io_req_set_rsrc_node(struct io_kiocb *req,
 					struct io_ring_ctx *ctx,
 					unsigned int issue_flags)
@@ -144,9 +151,8 @@ static inline void io_req_set_rsrc_node(struct io_kiocb *req,
 
 		if (!(issue_flags & IO_URING_F_UNLOCKED)) {
 			lockdep_assert_held(&ctx->uring_lock);
-			ctx->rsrc_cached_refs--;
-			if (unlikely(ctx->rsrc_cached_refs < 0))
-				io_rsrc_refs_refill(ctx);
+
+			io_charge_rsrc_node(ctx);
 		} else {
 			percpu_ref_get(&req->rsrc_node->refs);
 		}
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 18/27] io_uring: add notification slot registration
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (16 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 17/27] io_uring: add rsrc referencing for notifiers Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 19/27] io_uring: wire send zc request type Pavel Begunkov
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Let the userspace to register and unregister notification slots.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h | 17 ++++++++++++++
 io_uring/io_uring.c           |  9 ++++++++
 io_uring/notif.c              | 43 +++++++++++++++++++++++++++++++++++
 io_uring/notif.h              |  3 +++
 4 files changed, 72 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e858dba2e6c9..f1ba8e934168 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -454,6 +454,10 @@ enum {
 	/* register a range of fixed file slots for automatic slot allocation */
 	IORING_REGISTER_FILE_ALLOC_RANGE	= 25,
 
+	/* zerocopy notification API */
+	IORING_REGISTER_NOTIFIERS		= 26,
+	IORING_UNREGISTER_NOTIFIERS		= 27,
+
 	/* this goes last */
 	IORING_REGISTER_LAST
 };
@@ -500,6 +504,19 @@ struct io_uring_rsrc_update2 {
 	__u32 resv2;
 };
 
+struct io_uring_notification_slot {
+	__u64 tag;
+	__u64 resv[3];
+};
+
+struct io_uring_notification_register {
+	__u32 nr_slots;
+	__u32 resv;
+	__u64 resv2;
+	__u64 data;
+	__u64 resv3;
+};
+
 /* Skip updating fd indexes set to this value in the fd table */
 #define IORING_REGISTER_FILES_SKIP	(-2)
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bdc5a2839d94..41ef98a43d32 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3875,6 +3875,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_file_alloc_range(ctx, arg);
 		break;
+	case IORING_REGISTER_NOTIFIERS:
+		ret = io_notif_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_NOTIFIERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_notif_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 0a2e98bd74f6..e6d98dc208c7 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -162,5 +162,48 @@ __cold int io_notif_unregister(struct io_ring_ctx *ctx)
 	kvfree(ctx->notif_slots);
 	ctx->notif_slots = NULL;
 	ctx->nr_notif_slots = 0;
+	io_notif_cache_purge(ctx);
+	return 0;
+}
+
+__cold int io_notif_register(struct io_ring_ctx *ctx,
+			     void __user *arg, unsigned int size)
+	__must_hold(&ctx->uring_lock)
+{
+	struct io_uring_notification_slot __user *slots;
+	struct io_uring_notification_slot slot;
+	struct io_uring_notification_register reg;
+	unsigned i;
+
+	if (ctx->nr_notif_slots)
+		return -EBUSY;
+	if (size != sizeof(reg))
+		return -EINVAL;
+	if (copy_from_user(&reg, arg, sizeof(reg)))
+		return -EFAULT;
+	if (!reg.nr_slots || reg.nr_slots > IORING_MAX_NOTIF_SLOTS)
+		return -EINVAL;
+	if (reg.resv || reg.resv2 || reg.resv3)
+		return -EINVAL;
+
+	slots = u64_to_user_ptr(reg.data);
+	ctx->notif_slots = kvcalloc(reg.nr_slots, sizeof(ctx->notif_slots[0]),
+				GFP_KERNEL_ACCOUNT);
+	if (!ctx->notif_slots)
+		return -ENOMEM;
+
+	for (i = 0; i < reg.nr_slots; i++, ctx->nr_notif_slots++) {
+		struct io_notif_slot *notif_slot = &ctx->notif_slots[i];
+
+		if (copy_from_user(&slot, &slots[i], sizeof(slot))) {
+			io_notif_unregister(ctx);
+			return -EFAULT;
+		}
+		if (slot.resv[0] | slot.resv[1] | slot.resv[2]) {
+			io_notif_unregister(ctx);
+			return -EINVAL;
+		}
+		notif_slot->tag = slot.tag;
+	}
 	return 0;
 }
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 1dd48efb7744..00efe164bdc4 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -6,6 +6,7 @@
 #include <linux/nospec.h>
 
 #define IO_NOTIF_SPLICE_BATCH	32
+#define IORING_MAX_NOTIF_SLOTS (1U << 10)
 
 struct io_notif {
 	struct ubuf_info	uarg;
@@ -48,6 +49,8 @@ struct io_notif_slot {
 	u32			seq;
 };
 
+int io_notif_register(struct io_ring_ctx *ctx,
+		      void __user *arg, unsigned int size);
 int io_notif_unregister(struct io_ring_ctx *ctx);
 void io_notif_cache_purge(struct io_ring_ctx *ctx);
 
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 19/27] io_uring: wire send zc request type
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (17 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 18/27] io_uring: add notification slot registration Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 20/27] io_uring: account locked pages for non-fixed zc Pavel Begunkov
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Add a new io_uring opcode IORING_OP_SENDZC. The main distinction from
IORING_OP_SEND is that the user should specify a notification slot
index in sqe::notification_idx and the buffers are safe to reuse only
when the used notification is flushed and completes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h |  5 ++
 io_uring/net.c                | 94 +++++++++++++++++++++++++++++++++++
 io_uring/net.h                |  4 ++
 io_uring/opdef.c              | 15 ++++++
 4 files changed, 118 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f1ba8e934168..dcef9d6e7f78 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -63,6 +63,10 @@ struct io_uring_sqe {
 	union {
 		__s32	splice_fd_in;
 		__u32	file_index;
+		struct {
+			__u16	notification_idx;
+			__u16	__pad;
+		};
 	};
 	union {
 		struct {
@@ -194,6 +198,7 @@ enum io_uring_op {
 	IORING_OP_GETXATTR,
 	IORING_OP_SOCKET,
 	IORING_OP_URING_CMD,
+	IORING_OP_SENDZC_NOTIF,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/net.c b/io_uring/net.c
index 2dd61fcf91d8..399267e8f1ef 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -13,6 +13,7 @@
 #include "io_uring.h"
 #include "kbuf.h"
 #include "net.h"
+#include "notif.h"
 
 #if defined(CONFIG_NET)
 struct io_shutdown {
@@ -58,6 +59,15 @@ struct io_sr_msg {
 	unsigned int			flags;
 };
 
+struct io_sendzc {
+	struct file			*file;
+	void __user			*buf;
+	size_t				len;
+	u16				slot_idx;
+	unsigned			msg_flags;
+	unsigned			flags;
+};
+
 #define IO_APOLL_MULTI_POLLED (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)
 
 int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -652,6 +662,90 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_sendzc *zc = io_kiocb_to_cmd(req);
+
+	if (READ_ONCE(sqe->addr2) || READ_ONCE(sqe->__pad2[0]) ||
+	    READ_ONCE(sqe->addr3))
+		return -EINVAL;
+
+	zc->flags = READ_ONCE(sqe->ioprio);
+	if (zc->flags & ~IORING_RECVSEND_POLL_FIRST)
+		return -EINVAL;
+
+	zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	zc->len = READ_ONCE(sqe->len);
+	zc->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
+	zc->slot_idx = READ_ONCE(sqe->notification_idx);
+	if (zc->msg_flags & MSG_DONTWAIT)
+		req->flags |= REQ_F_NOWAIT;
+#ifdef CONFIG_COMPAT
+	if (req->ctx->compat)
+		zc->msg_flags |= MSG_CMSG_COMPAT;
+#endif
+	return 0;
+}
+
+int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_sendzc *zc = io_kiocb_to_cmd(req);
+	struct io_notif_slot *notif_slot;
+	struct io_notif *notif;
+	struct msghdr msg;
+	struct iovec iov;
+	struct socket *sock;
+	unsigned msg_flags;
+	int ret, min_ret = 0;
+
+	if (!(req->flags & REQ_F_POLLED) &&
+	    (zc->flags & IORING_RECVSEND_POLL_FIRST))
+		return -EAGAIN;
+
+	if (issue_flags & IO_URING_F_UNLOCKED)
+		return -EAGAIN;
+	sock = sock_from_file(req->file);
+	if (unlikely(!sock))
+		return -ENOTSOCK;
+
+	notif_slot = io_get_notif_slot(ctx, zc->slot_idx);
+	if (!notif_slot)
+		return -EINVAL;
+	notif = io_get_notif(ctx, notif_slot);
+	if (!notif)
+		return -ENOMEM;
+
+	msg.msg_name = NULL;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_namelen = 0;
+
+	ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
+	if (unlikely(ret))
+		return ret;
+
+	msg_flags = zc->msg_flags | MSG_ZEROCOPY;
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		msg_flags |= MSG_DONTWAIT;
+	if (msg_flags & MSG_WAITALL)
+		min_ret = iov_iter_count(&msg.msg_iter);
+
+	msg.msg_flags = msg_flags;
+	msg.msg_ubuf = &notif->uarg;
+	msg.sg_from_iter = NULL;
+	ret = sock_sendmsg(sock, &msg);
+
+	if (unlikely(ret < min_ret)) {
+		if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
+			return -EAGAIN;
+		return ret == -ERESTARTSYS ? -EINTR : ret;
+	}
+
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
+
 int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_accept *accept = io_kiocb_to_cmd(req);
diff --git a/io_uring/net.h b/io_uring/net.h
index 81d71d164770..1dba8befebb3 100644
--- a/io_uring/net.h
+++ b/io_uring/net.h
@@ -40,4 +40,8 @@ int io_socket(struct io_kiocb *req, unsigned int issue_flags);
 int io_connect_prep_async(struct io_kiocb *req);
 int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_connect(struct io_kiocb *req, unsigned int issue_flags);
+
+int io_sendzc(struct io_kiocb *req, unsigned int issue_flags);
+int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+
 #endif
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a7b84b43e6c2..7ab19bbf3126 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -470,6 +470,21 @@ const struct io_op_def io_op_defs[] = {
 		.issue			= io_uring_cmd,
 		.prep_async		= io_uring_cmd_prep_async,
 	},
+	[IORING_OP_SENDZC_NOTIF] = {
+		.name			= "SENDZC_NOTIF",
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+#if defined(CONFIG_NET)
+		.prep			= io_sendzc_prep,
+		.issue			= io_sendzc,
+#else
+		.prep			= io_eopnotsupp_prep,
+#endif
+
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 20/27] io_uring: account locked pages for non-fixed zc
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (18 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 19/27] io_uring: wire send zc request type Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 21/27] io_uring: allow to pass addr into sendzc Pavel Begunkov
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Fixed buffers are RLIMIT_MEMLOCK accounted, however it doesn't cover iovec
based zerocopy sends. Do the accounting on the io_uring side.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/net.c   | 1 +
 io_uring/notif.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/io_uring/net.c b/io_uring/net.c
index 399267e8f1ef..69273d4f4ef0 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -724,6 +724,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 	ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
 	if (unlikely(ret))
 		return ret;
+	mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
 
 	msg_flags = zc->msg_flags | MSG_ZEROCOPY;
 	if (issue_flags & IO_URING_F_NONBLOCK)
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e6d98dc208c7..c5179e5c1cd6 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -14,7 +14,13 @@ static void __io_notif_complete_tw(struct callback_head *cb)
 	struct io_notif *notif = container_of(cb, struct io_notif, task_work);
 	struct io_rsrc_node *rsrc_node = notif->rsrc_node;
 	struct io_ring_ctx *ctx = notif->ctx;
+	struct mmpin *mmp = &notif->uarg.mmp;
 
+	if (mmp->user) {
+		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+		free_uid(mmp->user);
+		mmp->user = NULL;
+	}
 	if (likely(notif->task)) {
 		io_put_task(notif->task, 1);
 		notif->task = NULL;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 21/27] io_uring: allow to pass addr into sendzc
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (19 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 20/27] io_uring: account locked pages for non-fixed zc Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 22/27] io_uring: sendzc with fixed buffers Pavel Begunkov
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Allow to specify an address to zerocopy sends making it more like
sendto(2).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h |  2 +-
 io_uring/net.c                | 18 ++++++++++++++++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index dcef9d6e7f78..9303bf5236f7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,7 +65,7 @@ struct io_uring_sqe {
 		__u32	file_index;
 		struct {
 			__u16	notification_idx;
-			__u16	__pad;
+			__u16	addr_len;
 		};
 	};
 	union {
diff --git a/io_uring/net.c b/io_uring/net.c
index 69273d4f4ef0..2172cf3facd8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -66,6 +66,8 @@ struct io_sendzc {
 	u16				slot_idx;
 	unsigned			msg_flags;
 	unsigned			flags;
+	unsigned			addr_len;
+	void __user			*addr;
 };
 
 #define IO_APOLL_MULTI_POLLED (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)
@@ -666,8 +668,7 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_sendzc *zc = io_kiocb_to_cmd(req);
 
-	if (READ_ONCE(sqe->addr2) || READ_ONCE(sqe->__pad2[0]) ||
-	    READ_ONCE(sqe->addr3))
+	if (READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3))
 		return -EINVAL;
 
 	zc->flags = READ_ONCE(sqe->ioprio);
@@ -680,6 +681,10 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	zc->slot_idx = READ_ONCE(sqe->notification_idx);
 	if (zc->msg_flags & MSG_DONTWAIT)
 		req->flags |= REQ_F_NOWAIT;
+
+	zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	zc->addr_len = READ_ONCE(sqe->addr_len);
+
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
 		zc->msg_flags |= MSG_CMSG_COMPAT;
@@ -689,6 +694,7 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 {
+	struct sockaddr_storage address;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_sendzc *zc = io_kiocb_to_cmd(req);
 	struct io_notif_slot *notif_slot;
@@ -726,6 +732,14 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 	mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
 
+	if (zc->addr) {
+		ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
+		if (unlikely(ret < 0))
+			return ret;
+		msg.msg_name = (struct sockaddr *)&address;
+		msg.msg_namelen = zc->addr_len;
+	}
+
 	msg_flags = zc->msg_flags | MSG_ZEROCOPY;
 	if (issue_flags & IO_URING_F_NONBLOCK)
 		msg_flags |= MSG_DONTWAIT;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 22/27] io_uring: sendzc with fixed buffers
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (20 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 21/27] io_uring: allow to pass addr into sendzc Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 23/27] io_uring: flush notifiers after sendzc Pavel Begunkov
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Allow zerocopy sends to use fixed buffers. There is an optimisation for
this case, the network layer don't need to reference the pages, see
SKBFL_MANAGED_FRAG_REFS, so io_uring have to ensure validity of fixed
buffers until the notifier is released.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h |  6 +++++-
 io_uring/net.c                | 29 ++++++++++++++++++++++++-----
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9303bf5236f7..3f2305bc5c79 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -269,9 +269,13 @@ enum io_uring_op {
  * IORING_RECV_MULTISHOT	Multishot recv. Sets IORING_CQE_F_MORE if
  *				the handler will continue to report
  *				CQEs on behalf of the same SQE.
+ *
+ * IORING_RECVSEND_FIXED_BUF	Use registered buffers, the index is stored in
+ *				the buf_index field.
  */
 #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
-#define IORING_RECV_MULTISHOT	(1U << 1)
+#define IORING_RECV_MULTISHOT		(1U << 1)
+#define IORING_RECVSEND_FIXED_BUF	(1U << 2)
 
 /*
  * accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 2172cf3facd8..0259fbbad591 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -14,6 +14,7 @@
 #include "kbuf.h"
 #include "net.h"
 #include "notif.h"
+#include "rsrc.h"
 
 #if defined(CONFIG_NET)
 struct io_shutdown {
@@ -667,13 +668,23 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_sendzc *zc = io_kiocb_to_cmd(req);
+	struct io_ring_ctx *ctx = req->ctx;
 
 	if (READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3))
 		return -EINVAL;
 
 	zc->flags = READ_ONCE(sqe->ioprio);
-	if (zc->flags & ~IORING_RECVSEND_POLL_FIRST)
+	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF))
 		return -EINVAL;
+	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
+		unsigned idx = READ_ONCE(sqe->buf_index);
+
+		if (unlikely(idx >= ctx->nr_user_bufs))
+			return -EFAULT;
+		idx = array_index_nospec(idx, ctx->nr_user_bufs);
+		req->imu = READ_ONCE(ctx->user_bufs[idx]);
+		io_req_set_rsrc_node(req, ctx, 0);
+	}
 
 	zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	zc->len = READ_ONCE(sqe->len);
@@ -727,10 +738,18 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
 
-	ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
-	if (unlikely(ret))
-		return ret;
-	mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
+	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
+		ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
+					(u64)zc->buf, zc->len);
+		if (unlikely(ret))
+				return ret;
+	} else {
+		ret = import_single_range(WRITE, zc->buf, zc->len, &iov,
+					  &msg.msg_iter);
+		if (unlikely(ret))
+			return ret;
+		mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
+	}
 
 	if (zc->addr) {
 		ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 23/27] io_uring: flush notifiers after sendzc
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (21 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 22/27] io_uring: sendzc with fixed buffers Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 24/27] io_uring: rename IORING_OP_FILES_UPDATE Pavel Begunkov
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Allow to flush notifiers as a part of sendzc request by setting
IORING_SENDZC_FLUSH flag. When the sendzc request succeedes it will
flush the used [active] notifier.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h |  4 ++++
 io_uring/io_uring.c           | 11 +----------
 io_uring/io_uring.h           | 10 ++++++++++
 io_uring/net.c                |  5 ++++-
 io_uring/notif.c              |  2 +-
 io_uring/notif.h              | 11 +++++++++++
 6 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3f2305bc5c79..7d21fba54b62 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -272,10 +272,14 @@ enum io_uring_op {
  *
  * IORING_RECVSEND_FIXED_BUF	Use registered buffers, the index is stored in
  *				the buf_index field.
+ *
+ * IORING_RECVSEND_NOTIF_FLUSH	Flush a notification after a successful
+ *				successful. Only for zerocopy sends.
  */
 #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
 #define IORING_RECV_MULTISHOT		(1U << 1)
 #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
+#define IORING_RECVSEND_NOTIF_FLUSH	(1U << 3)
 
 /*
  * accept flags stored in sqe->ioprio
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 41ef98a43d32..e4f3a1ede2f4 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -615,7 +615,7 @@ void __io_put_task(struct task_struct *task, int nr)
 	put_task_struct_many(task, nr);
 }
 
-static void io_task_refs_refill(struct io_uring_task *tctx)
+void io_task_refs_refill(struct io_uring_task *tctx)
 {
 	unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
 
@@ -624,15 +624,6 @@ static void io_task_refs_refill(struct io_uring_task *tctx)
 	tctx->cached_refs += refill;
 }
 
-static inline void io_get_task_refs(int nr)
-{
-	struct io_uring_task *tctx = current->io_uring;
-
-	tctx->cached_refs -= nr;
-	if (unlikely(tctx->cached_refs < 0))
-		io_task_refs_refill(tctx);
-}
-
 static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 {
 	struct io_uring_task *tctx = task->io_uring;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index b8c858727dc8..d9f2f5c71481 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -69,6 +69,7 @@ void io_wq_submit_work(struct io_wq_work *work);
 void io_free_req(struct io_kiocb *req);
 void io_queue_next(struct io_kiocb *req);
 void __io_put_task(struct task_struct *task, int nr);
+void io_task_refs_refill(struct io_uring_task *tctx);
 
 bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
 			bool cancel_all);
@@ -265,4 +266,13 @@ static inline void io_put_task(struct task_struct *task, int nr)
 		__io_put_task(task, nr);
 }
 
+static inline void io_get_task_refs(int nr)
+{
+	struct io_uring_task *tctx = current->io_uring;
+
+	tctx->cached_refs -= nr;
+	if (unlikely(tctx->cached_refs < 0))
+		io_task_refs_refill(tctx);
+}
+
 #endif
diff --git a/io_uring/net.c b/io_uring/net.c
index 0259fbbad591..bf9916d5e50c 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -674,7 +674,8 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	zc->flags = READ_ONCE(sqe->ioprio);
-	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF))
+	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
+			  IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
 		return -EINVAL;
 	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
 		unsigned idx = READ_ONCE(sqe->buf_index);
@@ -776,6 +777,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 		return ret == -ERESTARTSYS ? -EINTR : ret;
 	}
 
+	if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH)
+		io_notif_slot_flush_submit(notif_slot, 0);
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
diff --git a/io_uring/notif.c b/io_uring/notif.c
index c5179e5c1cd6..a93887451bbb 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -133,7 +133,7 @@ struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
 	return notif;
 }
 
-static void io_notif_slot_flush(struct io_notif_slot *slot)
+void io_notif_slot_flush(struct io_notif_slot *slot)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_notif *notif = slot->notif;
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 00efe164bdc4..6cd73d7b965b 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -54,6 +54,7 @@ int io_notif_register(struct io_ring_ctx *ctx,
 int io_notif_unregister(struct io_ring_ctx *ctx);
 void io_notif_cache_purge(struct io_ring_ctx *ctx);
 
+void io_notif_slot_flush(struct io_notif_slot *slot);
 struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
 				struct io_notif_slot *slot);
 
@@ -74,3 +75,13 @@ static inline struct io_notif_slot *io_get_notif_slot(struct io_ring_ctx *ctx,
 	idx = array_index_nospec(idx, ctx->nr_notif_slots);
 	return &ctx->notif_slots[idx];
 }
+
+static inline void io_notif_slot_flush_submit(struct io_notif_slot *slot,
+					      unsigned int issue_flags)
+{
+	if (!(issue_flags & IO_URING_F_UNLOCKED)) {
+		slot->notif->task = current;
+		io_get_task_refs(1);
+	}
+	io_notif_slot_flush(slot);
+}
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 24/27] io_uring: rename IORING_OP_FILES_UPDATE
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (22 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 23/27] io_uring: flush notifiers after sendzc Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 25/27] io_uring: add zc notification flush requests Pavel Begunkov
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

IORING_OP_FILES_UPDATE will be a more generic opcode serving different
resource types, rename it into IORING_OP_RSRC_UPDATE and add subtype
handling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h | 12 +++++++++++-
 io_uring/opdef.c              |  9 +++++----
 io_uring/rsrc.c               | 17 +++++++++++++++--
 io_uring/rsrc.h               |  4 ++--
 4 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7d21fba54b62..37e8c104d31f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -171,7 +171,8 @@ enum io_uring_op {
 	IORING_OP_FALLOCATE,
 	IORING_OP_OPENAT,
 	IORING_OP_CLOSE,
-	IORING_OP_FILES_UPDATE,
+	IORING_OP_RSRC_UPDATE,
+	IORING_OP_FILES_UPDATE = IORING_OP_RSRC_UPDATE,
 	IORING_OP_STATX,
 	IORING_OP_READ,
 	IORING_OP_WRITE,
@@ -220,6 +221,7 @@ enum io_uring_op {
 #define IORING_TIMEOUT_ETIME_SUCCESS	(1U << 5)
 #define IORING_TIMEOUT_CLOCK_MASK	(IORING_TIMEOUT_BOOTTIME | IORING_TIMEOUT_REALTIME)
 #define IORING_TIMEOUT_UPDATE_MASK	(IORING_TIMEOUT_UPDATE | IORING_LINK_TIMEOUT_UPDATE)
+
 /*
  * sqe->splice_flags
  * extends splice(2) flags
@@ -286,6 +288,14 @@ enum io_uring_op {
  */
 #define IORING_ACCEPT_MULTISHOT	(1U << 0)
 
+
+/*
+ * IORING_OP_RSRC_UPDATE flags
+ */
+enum {
+	IORING_RSRC_UPDATE_FILES,
+};
+
 /*
  * IORING_OP_MSG_RING command types, stored in sqe->addr
  */
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 7ab19bbf3126..72dd2b2d8a9d 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -246,12 +246,13 @@ const struct io_op_def io_op_defs[] = {
 		.prep			= io_close_prep,
 		.issue			= io_close,
 	},
-	[IORING_OP_FILES_UPDATE] = {
+	[IORING_OP_RSRC_UPDATE] = {
 		.audit_skip		= 1,
 		.iopoll			= 1,
-		.name			= "FILES_UPDATE",
-		.prep			= io_files_update_prep,
-		.issue			= io_files_update,
+		.name			= "RSRC_UPDATE",
+		.prep			= io_rsrc_update_prep,
+		.issue			= io_rsrc_update,
+		.ioprio			= 1,
 	},
 	[IORING_OP_STATX] = {
 		.audit_skip		= 1,
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 1182cf0ea1fc..98ce8a93a816 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -21,6 +21,7 @@ struct io_rsrc_update {
 	u64				arg;
 	u32				nr_args;
 	u32				offset;
+	int				type;
 };
 
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
@@ -658,7 +659,7 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 	return -EINVAL;
 }
 
-int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+int io_rsrc_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rsrc_update *up = io_kiocb_to_cmd(req);
 
@@ -672,6 +673,7 @@ int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (!up->nr_args)
 		return -EINVAL;
 	up->arg = READ_ONCE(sqe->addr);
+	up->type = READ_ONCE(sqe->ioprio);
 	return 0;
 }
 
@@ -711,7 +713,7 @@ static int io_files_update_with_index_alloc(struct io_kiocb *req,
 	return ret;
 }
 
-int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
+static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rsrc_update *up = io_kiocb_to_cmd(req);
 	struct io_ring_ctx *ctx = req->ctx;
@@ -740,6 +742,17 @@ int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
+int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_rsrc_update *up = io_kiocb_to_cmd(req);
+
+	switch (up->type) {
+	case IORING_RSRC_UPDATE_FILES:
+		return io_files_update(req, issue_flags);
+	}
+	return -EINVAL;
+}
+
 int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
 			  struct io_rsrc_node *node, void *rsrc)
 {
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index af342fd239d0..21813a23215f 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -167,6 +167,6 @@ static inline u64 *io_get_tag_slot(struct io_rsrc_data *data, unsigned int idx)
 	return &data->tags[table_idx][off];
 }
 
-int io_files_update(struct io_kiocb *req, unsigned int issue_flags);
-int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags);
+int io_rsrc_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 #endif
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 25/27] io_uring: add zc notification flush requests
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (23 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 24/27] io_uring: rename IORING_OP_FILES_UPDATE Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 26/27] io_uring: enable managed frags with register buffers Pavel Begunkov
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Overlay notification control onto IORING_OP_RSRC_UPDATE (former
IORING_OP_FILES_UPDATE). It allows to flush a range of zc notifications
from slots with indexes [sqe->off, sqe->off+sqe->len). If sqe->arg is
not zero, it also copies sqe->arg as a new tag for all flushed
notifications.

Note, it doesn't flush a notification of a slot if there was no requests
attached to it (since last flush or registration).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h |  1 +
 io_uring/rsrc.c               | 38 +++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 37e8c104d31f..9a7aa25d09a1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -294,6 +294,7 @@ enum io_uring_op {
  */
 enum {
 	IORING_RSRC_UPDATE_FILES,
+	IORING_RSRC_UPDATE_NOTIF,
 };
 
 /*
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 98ce8a93a816..088a2dc32e2c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -15,6 +15,7 @@
 #include "io_uring.h"
 #include "openclose.h"
 #include "rsrc.h"
+#include "notif.h"
 
 struct io_rsrc_update {
 	struct file			*file;
@@ -742,6 +743,41 @@ static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
+static int io_notif_update(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_rsrc_update *up = io_kiocb_to_cmd(req);
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned len = up->nr_args;
+	unsigned idx_end, idx = up->offset;
+	int ret = 0;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	if (unlikely(check_add_overflow(idx, len, &idx_end))) {
+		ret = -EOVERFLOW;
+		goto out;
+	}
+	if (unlikely(idx_end > ctx->nr_notif_slots)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (; idx < idx_end; idx++) {
+		struct io_notif_slot *slot = &ctx->notif_slots[idx];
+
+		if (!slot->notif)
+			continue;
+		if (up->arg)
+			slot->tag = up->arg;
+		io_notif_slot_flush_submit(slot, issue_flags);
+	}
+out:
+	io_ring_submit_unlock(ctx, issue_flags);
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
+
 int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rsrc_update *up = io_kiocb_to_cmd(req);
@@ -749,6 +785,8 @@ int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags)
 	switch (up->type) {
 	case IORING_RSRC_UPDATE_FILES:
 		return io_files_update(req, issue_flags);
+	case IORING_RSRC_UPDATE_NOTIF:
+		return io_notif_update(req, issue_flags);
 	}
 	return -EINVAL;
 }
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 26/27] io_uring: enable managed frags with register buffers
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (24 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 25/27] io_uring: add zc notification flush requests Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-12 20:52 ` [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send Pavel Begunkov
  2022-07-20 12:46 ` (subset) [PATCH net-next v5 00/27] io_uring " Jens Axboe
  27 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

io_uring's registered buffers infra has a good performant way of pinning
pages, so let's use SKBFL_MANAGED_FRAG_REFS when our requests are purely
register buffer backed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/net.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index bf9916d5e50c..a4e863dce7ec 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -704,6 +704,60 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return 0;
 }
 
+static int io_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+			   struct iov_iter *from, size_t length)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	int frag = shinfo->nr_frags;
+	int ret = 0;
+	struct bvec_iter bi;
+	ssize_t copied = 0;
+	unsigned long truesize = 0;
+
+	if (!shinfo->nr_frags)
+		shinfo->flags |= SKBFL_MANAGED_FRAG_REFS;
+
+	if (!skb_zcopy_managed(skb) || !iov_iter_is_bvec(from)) {
+		skb_zcopy_downgrade_managed(skb);
+		return __zerocopy_sg_from_iter(NULL, sk, skb, from, length);
+	}
+
+	bi.bi_size = min(from->count, length);
+	bi.bi_bvec_done = from->iov_offset;
+	bi.bi_idx = 0;
+
+	while (bi.bi_size && frag < MAX_SKB_FRAGS) {
+		struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
+
+		copied += v.bv_len;
+		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+		__skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
+					   v.bv_offset, v.bv_len);
+		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+	}
+	if (bi.bi_size)
+		ret = -EMSGSIZE;
+
+	shinfo->nr_frags = frag;
+	from->bvec += bi.bi_idx;
+	from->nr_segs -= bi.bi_idx;
+	from->count = bi.bi_size;
+	from->iov_offset = bi.bi_bvec_done;
+
+	skb->data_len += copied;
+	skb->len += copied;
+	skb->truesize += truesize;
+
+	if (sk && sk->sk_type == SOCK_STREAM) {
+		sk_wmem_queued_add(sk, truesize);
+		if (!skb_zcopy_pure(skb))
+			sk_mem_charge(sk, truesize);
+	} else {
+		refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+	}
+	return ret;
+}
+
 int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct sockaddr_storage address;
@@ -768,7 +822,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 
 	msg.msg_flags = msg_flags;
 	msg.msg_ubuf = &notif->uarg;
-	msg.sg_from_iter = NULL;
+	msg.sg_from_iter = io_sg_from_iter;
 	ret = sock_sendmsg(sock, &msg);
 
 	if (unlikely(ret < min_ret)) {
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (25 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 26/27] io_uring: enable managed frags with register buffers Pavel Begunkov
@ 2022-07-12 20:52 ` Pavel Begunkov
  2022-07-27  8:01   ` dust.li
  2022-07-20 12:46 ` (subset) [PATCH net-next v5 00/27] io_uring " Jens Axboe
  27 siblings, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-12 20:52 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team,
	Pavel Begunkov

Add selftests for io_uring zerocopy sends and io_uring's notification
infrastructure. It's largely influenced by msg_zerocopy and uses it on
the receive side.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 tools/testing/selftests/net/Makefile          |   1 +
 .../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
 .../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
 3 files changed, 737 insertions(+)
 create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
 create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 7ea54af55490..51261483744e 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -59,6 +59,7 @@ TEST_GEN_FILES += toeplitz
 TEST_GEN_FILES += cmsg_sender
 TEST_GEN_FILES += stress_reuseport_listen
 TEST_PROGS += test_vxlan_vnifiltering.sh
+TEST_GEN_FILES += io_uring_zerocopy_tx
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
new file mode 100644
index 000000000000..9d64c560a2d6
--- /dev/null
+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
@@ -0,0 +1,605 @@
+/* SPDX-License-Identifier: MIT */
+/* based on linux-kernel/tools/testing/selftests/net/msg_zerocopy.c */
+#include <assert.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <linux/errqueue.h>
+#include <linux/if_packet.h>
+#include <linux/io_uring.h>
+#include <linux/ipv6.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+
+#define NOTIF_TAG 0xfffffffULL
+#define NONZC_TAG 0
+#define ZC_TAG 1
+
+enum {
+	MODE_NONZC	= 0,
+	MODE_ZC		= 1,
+	MODE_ZC_FIXED	= 2,
+	MODE_MIXED	= 3,
+};
+
+static bool cfg_flush		= false;
+static bool cfg_cork		= false;
+static int  cfg_mode		= MODE_ZC_FIXED;
+static int  cfg_nr_reqs		= 8;
+static int  cfg_family		= PF_UNSPEC;
+static int  cfg_payload_len;
+static int  cfg_port		= 8000;
+static int  cfg_runtime_ms	= 4200;
+
+static socklen_t cfg_alen;
+static struct sockaddr_storage cfg_dst_addr;
+
+static char payload[IP_MAXPACKET] __attribute__((aligned(4096)));
+
+struct io_sq_ring {
+	unsigned *head;
+	unsigned *tail;
+	unsigned *ring_mask;
+	unsigned *ring_entries;
+	unsigned *flags;
+	unsigned *array;
+};
+
+struct io_cq_ring {
+	unsigned *head;
+	unsigned *tail;
+	unsigned *ring_mask;
+	unsigned *ring_entries;
+	struct io_uring_cqe *cqes;
+};
+
+struct io_uring_sq {
+	unsigned *khead;
+	unsigned *ktail;
+	unsigned *kring_mask;
+	unsigned *kring_entries;
+	unsigned *kflags;
+	unsigned *kdropped;
+	unsigned *array;
+	struct io_uring_sqe *sqes;
+
+	unsigned sqe_head;
+	unsigned sqe_tail;
+
+	size_t ring_sz;
+};
+
+struct io_uring_cq {
+	unsigned *khead;
+	unsigned *ktail;
+	unsigned *kring_mask;
+	unsigned *kring_entries;
+	unsigned *koverflow;
+	struct io_uring_cqe *cqes;
+
+	size_t ring_sz;
+};
+
+struct io_uring {
+	struct io_uring_sq sq;
+	struct io_uring_cq cq;
+	int ring_fd;
+};
+
+#ifdef __alpha__
+# ifndef __NR_io_uring_setup
+#  define __NR_io_uring_setup		535
+# endif
+# ifndef __NR_io_uring_enter
+#  define __NR_io_uring_enter		536
+# endif
+# ifndef __NR_io_uring_register
+#  define __NR_io_uring_register	537
+# endif
+#else /* !__alpha__ */
+# ifndef __NR_io_uring_setup
+#  define __NR_io_uring_setup		425
+# endif
+# ifndef __NR_io_uring_enter
+#  define __NR_io_uring_enter		426
+# endif
+# ifndef __NR_io_uring_register
+#  define __NR_io_uring_register	427
+# endif
+#endif
+
+#if defined(__x86_64) || defined(__i386__)
+#define read_barrier()	__asm__ __volatile__("":::"memory")
+#define write_barrier()	__asm__ __volatile__("":::"memory")
+#else
+
+#define read_barrier()	__sync_synchronize()
+#define write_barrier()	__sync_synchronize()
+#endif
+
+static int io_uring_setup(unsigned int entries, struct io_uring_params *p)
+{
+	return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int io_uring_enter(int fd, unsigned int to_submit,
+			  unsigned int min_complete,
+			  unsigned int flags, sigset_t *sig)
+{
+	return syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
+			flags, sig, _NSIG / 8);
+}
+
+static int io_uring_register_buffers(struct io_uring *ring,
+				     const struct iovec *iovecs,
+				     unsigned nr_iovecs)
+{
+	int ret;
+
+	ret = syscall(__NR_io_uring_register, ring->ring_fd,
+		      IORING_REGISTER_BUFFERS, iovecs, nr_iovecs);
+	return (ret < 0) ? -errno : ret;
+}
+
+static int io_uring_register_notifications(struct io_uring *ring,
+					   unsigned nr,
+					   struct io_uring_notification_slot *slots)
+{
+	int ret;
+	struct io_uring_notification_register r = {
+		.nr_slots = nr,
+		.data = (unsigned long)slots,
+	};
+
+	ret = syscall(__NR_io_uring_register, ring->ring_fd,
+		      IORING_REGISTER_NOTIFIERS, &r, sizeof(r));
+	return (ret < 0) ? -errno : ret;
+}
+
+static int io_uring_mmap(int fd, struct io_uring_params *p,
+			 struct io_uring_sq *sq, struct io_uring_cq *cq)
+{
+	size_t size;
+	void *ptr;
+	int ret;
+
+	sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned);
+	ptr = mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
+	if (ptr == MAP_FAILED)
+		return -errno;
+	sq->khead = ptr + p->sq_off.head;
+	sq->ktail = ptr + p->sq_off.tail;
+	sq->kring_mask = ptr + p->sq_off.ring_mask;
+	sq->kring_entries = ptr + p->sq_off.ring_entries;
+	sq->kflags = ptr + p->sq_off.flags;
+	sq->kdropped = ptr + p->sq_off.dropped;
+	sq->array = ptr + p->sq_off.array;
+
+	size = p->sq_entries * sizeof(struct io_uring_sqe);
+	sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);
+	if (sq->sqes == MAP_FAILED) {
+		ret = -errno;
+err:
+		munmap(sq->khead, sq->ring_sz);
+		return ret;
+	}
+
+	cq->ring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
+	ptr = mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
+	if (ptr == MAP_FAILED) {
+		ret = -errno;
+		munmap(sq->sqes, p->sq_entries * sizeof(struct io_uring_sqe));
+		goto err;
+	}
+	cq->khead = ptr + p->cq_off.head;
+	cq->ktail = ptr + p->cq_off.tail;
+	cq->kring_mask = ptr + p->cq_off.ring_mask;
+	cq->kring_entries = ptr + p->cq_off.ring_entries;
+	cq->koverflow = ptr + p->cq_off.overflow;
+	cq->cqes = ptr + p->cq_off.cqes;
+	return 0;
+}
+
+static int io_uring_queue_init(unsigned entries, struct io_uring *ring,
+			       unsigned flags)
+{
+	struct io_uring_params p;
+	int fd, ret;
+
+	memset(ring, 0, sizeof(*ring));
+	memset(&p, 0, sizeof(p));
+	p.flags = flags;
+
+	fd = io_uring_setup(entries, &p);
+	if (fd < 0)
+		return fd;
+	ret = io_uring_mmap(fd, &p, &ring->sq, &ring->cq);
+	if (!ret)
+		ring->ring_fd = fd;
+	else
+		close(fd);
+	return ret;
+}
+
+static int io_uring_submit(struct io_uring *ring)
+{
+	struct io_uring_sq *sq = &ring->sq;
+	const unsigned mask = *sq->kring_mask;
+	unsigned ktail, submitted, to_submit;
+	int ret;
+
+	read_barrier();
+	if (*sq->khead != *sq->ktail) {
+		submitted = *sq->kring_entries;
+		goto submit;
+	}
+	if (sq->sqe_head == sq->sqe_tail)
+		return 0;
+
+	ktail = *sq->ktail;
+	to_submit = sq->sqe_tail - sq->sqe_head;
+	for (submitted = 0; submitted < to_submit; submitted++) {
+		read_barrier();
+		sq->array[ktail++ & mask] = sq->sqe_head++ & mask;
+	}
+	if (!submitted)
+		return 0;
+
+	if (*sq->ktail != ktail) {
+		write_barrier();
+		*sq->ktail = ktail;
+		write_barrier();
+	}
+submit:
+	ret = io_uring_enter(ring->ring_fd, submitted, 0,
+				IORING_ENTER_GETEVENTS, NULL);
+	return ret < 0 ? -errno : ret;
+}
+
+static inline void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd,
+				      const void *buf, size_t len, int flags)
+{
+	memset(sqe, 0, sizeof(*sqe));
+	sqe->opcode = (__u8) IORING_OP_SEND;
+	sqe->fd = sockfd;
+	sqe->addr = (unsigned long) buf;
+	sqe->len = len;
+	sqe->msg_flags = (__u32) flags;
+}
+
+static inline void io_uring_prep_sendzc(struct io_uring_sqe *sqe, int sockfd,
+				        const void *buf, size_t len, int flags,
+				        unsigned slot_idx, unsigned zc_flags)
+{
+	io_uring_prep_send(sqe, sockfd, buf, len, flags);
+	sqe->opcode = (__u8) IORING_OP_SENDZC_NOTIF;
+	sqe->notification_idx = slot_idx;
+	sqe->ioprio = zc_flags;
+}
+
+static struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
+{
+	struct io_uring_sq *sq = &ring->sq;
+
+	if (sq->sqe_tail + 1 - sq->sqe_head > *sq->kring_entries)
+		return NULL;
+	return &sq->sqes[sq->sqe_tail++ & *sq->kring_mask];
+}
+
+static int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr)
+{
+	struct io_uring_cq *cq = &ring->cq;
+	const unsigned mask = *cq->kring_mask;
+	unsigned head = *cq->khead;
+	int ret;
+
+	*cqe_ptr = NULL;
+	do {
+		read_barrier();
+		if (head != *cq->ktail) {
+			*cqe_ptr = &cq->cqes[head & mask];
+			break;
+		}
+		ret = io_uring_enter(ring->ring_fd, 0, 1,
+					IORING_ENTER_GETEVENTS, NULL);
+		if (ret < 0)
+			return -errno;
+	} while (1);
+
+	return 0;
+}
+
+static inline void io_uring_cqe_seen(struct io_uring *ring)
+{
+	*(&ring->cq)->khead += 1;
+	write_barrier();
+}
+
+static unsigned long gettimeofday_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void do_setsockopt(int fd, int level, int optname, int val)
+{
+	if (setsockopt(fd, level, optname, &val, sizeof(val)))
+		error(1, errno, "setsockopt %d.%d: %d", level, optname, val);
+}
+
+static int do_setup_tx(int domain, int type, int protocol)
+{
+	int fd;
+
+	fd = socket(domain, type, protocol);
+	if (fd == -1)
+		error(1, errno, "socket t");
+
+	do_setsockopt(fd, SOL_SOCKET, SO_SNDBUF, 1 << 21);
+
+	if (connect(fd, (void *) &cfg_dst_addr, cfg_alen))
+		error(1, errno, "connect");
+	return fd;
+}
+
+static void do_tx(int domain, int type, int protocol)
+{
+	struct io_uring_notification_slot b[1] = {{.tag = NOTIF_TAG}};
+	struct io_uring_sqe *sqe;
+	struct io_uring_cqe *cqe;
+	unsigned long packets = 0, bytes = 0;
+	struct io_uring ring;
+	struct iovec iov;
+	uint64_t tstop;
+	int i, fd, ret;
+	int compl_cqes = 0;
+
+	fd = do_setup_tx(domain, type, protocol);
+
+	ret = io_uring_queue_init(512, &ring, 0);
+	if (ret)
+		error(1, ret, "io_uring: queue init");
+
+	ret = io_uring_register_notifications(&ring, 1, b);
+	if (ret)
+		error(1, ret, "io_uring: tx ctx registration");
+
+	iov.iov_base = payload;
+	iov.iov_len = cfg_payload_len;
+
+	ret = io_uring_register_buffers(&ring, &iov, 1);
+	if (ret)
+		error(1, ret, "io_uring: buffer registration");
+
+	tstop = gettimeofday_ms() + cfg_runtime_ms;
+	do {
+		if (cfg_cork)
+			do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 1);
+
+		for (i = 0; i < cfg_nr_reqs; i++) {
+			unsigned zc_flags = 0;
+			unsigned buf_idx = 0;
+			unsigned slot_idx = 0;
+			unsigned mode = cfg_mode;
+			unsigned msg_flags = 0;
+
+			if (cfg_mode == MODE_MIXED)
+				mode = rand() % 3;
+
+			sqe = io_uring_get_sqe(&ring);
+
+			if (mode == MODE_NONZC) {
+				io_uring_prep_send(sqe, fd, payload,
+						   cfg_payload_len, msg_flags);
+				sqe->user_data = NONZC_TAG;
+			} else {
+				if (cfg_flush) {
+					zc_flags |= IORING_RECVSEND_NOTIF_FLUSH;
+					compl_cqes++;
+				}
+				io_uring_prep_sendzc(sqe, fd, payload,
+						     cfg_payload_len,
+						     msg_flags, slot_idx, zc_flags);
+				if (mode == MODE_ZC_FIXED) {
+					sqe->ioprio |= IORING_RECVSEND_FIXED_BUF;
+					sqe->buf_index = buf_idx;
+				}
+				sqe->user_data = ZC_TAG;
+			}
+		}
+
+		ret = io_uring_submit(&ring);
+		if (ret != cfg_nr_reqs)
+			error(1, ret, "submit");
+
+		for (i = 0; i < cfg_nr_reqs; i++) {
+			ret = io_uring_wait_cqe(&ring, &cqe);
+			if (ret)
+				error(1, ret, "wait cqe");
+
+			if (cqe->user_data == NOTIF_TAG) {
+				compl_cqes--;
+				i--;
+			} else if (cqe->user_data != NONZC_TAG &&
+				   cqe->user_data != ZC_TAG) {
+				error(1, cqe->res, "invalid user_data");
+			} else if (cqe->res <= 0 && cqe->res != -EAGAIN) {
+				error(1, cqe->res, "send failed");
+			} else {
+				if (cqe->res > 0) {
+					packets++;
+					bytes += cqe->res;
+				}
+				/* failed requests don't flush */
+				if (cfg_flush &&
+				    cqe->res <= 0 &&
+				    cqe->user_data == ZC_TAG)
+					compl_cqes--;
+			}
+			io_uring_cqe_seen(&ring);
+		}
+		if (cfg_cork)
+			do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0);
+	} while (gettimeofday_ms() < tstop);
+
+	if (close(fd))
+		error(1, errno, "close");
+
+	fprintf(stderr, "tx=%lu (MB=%lu), tx/s=%lu (MB/s=%lu)\n",
+			packets, bytes >> 20,
+			packets / (cfg_runtime_ms / 1000),
+			(bytes >> 20) / (cfg_runtime_ms / 1000));
+
+	while (compl_cqes) {
+		ret = io_uring_wait_cqe(&ring, &cqe);
+		if (ret)
+			error(1, ret, "wait cqe");
+		io_uring_cqe_seen(&ring);
+		compl_cqes--;
+	}
+}
+
+static void do_test(int domain, int type, int protocol)
+{
+	int i;
+
+	for (i = 0; i < IP_MAXPACKET; i++)
+		payload[i] = 'a' + (i % 26);
+	do_tx(domain, type, protocol);
+}
+
+static void usage(const char *filepath)
+{
+	error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
+		    "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	const int max_payload_len = sizeof(payload) -
+				    sizeof(struct ipv6hdr) -
+				    sizeof(struct tcphdr) -
+				    40 /* max tcp options */;
+	struct sockaddr_in6 *addr6 = (void *) &cfg_dst_addr;
+	struct sockaddr_in *addr4 = (void *) &cfg_dst_addr;
+	char *daddr = NULL;
+	int c;
+
+	if (argc <= 1)
+		usage(argv[0]);
+	cfg_payload_len = max_payload_len;
+
+	while ((c = getopt(argc, argv, "46D:p:s:t:n:fc:m:")) != -1) {
+		switch (c) {
+		case '4':
+			if (cfg_family != PF_UNSPEC)
+				error(1, 0, "Pass one of -4 or -6");
+			cfg_family = PF_INET;
+			cfg_alen = sizeof(struct sockaddr_in);
+			break;
+		case '6':
+			if (cfg_family != PF_UNSPEC)
+				error(1, 0, "Pass one of -4 or -6");
+			cfg_family = PF_INET6;
+			cfg_alen = sizeof(struct sockaddr_in6);
+			break;
+		case 'D':
+			daddr = optarg;
+			break;
+		case 'p':
+			cfg_port = strtoul(optarg, NULL, 0);
+			break;
+		case 's':
+			cfg_payload_len = strtoul(optarg, NULL, 0);
+			break;
+		case 't':
+			cfg_runtime_ms = 200 + strtoul(optarg, NULL, 10) * 1000;
+			break;
+		case 'n':
+			cfg_nr_reqs = strtoul(optarg, NULL, 0);
+			break;
+		case 'f':
+			cfg_flush = 1;
+			break;
+		case 'c':
+			cfg_cork = strtol(optarg, NULL, 0);
+			break;
+		case 'm':
+			cfg_mode = strtol(optarg, NULL, 0);
+			break;
+		}
+	}
+
+	switch (cfg_family) {
+	case PF_INET:
+		memset(addr4, 0, sizeof(*addr4));
+		addr4->sin_family = AF_INET;
+		addr4->sin_port = htons(cfg_port);
+		if (daddr &&
+		    inet_pton(AF_INET, daddr, &(addr4->sin_addr)) != 1)
+			error(1, 0, "ipv4 parse error: %s", daddr);
+		break;
+	case PF_INET6:
+		memset(addr6, 0, sizeof(*addr6));
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_port = htons(cfg_port);
+		if (daddr &&
+		    inet_pton(AF_INET6, daddr, &(addr6->sin6_addr)) != 1)
+			error(1, 0, "ipv6 parse error: %s", daddr);
+		break;
+	default:
+		error(1, 0, "illegal domain");
+	}
+
+	if (cfg_payload_len > max_payload_len)
+		error(1, 0, "-s: payload exceeds max (%d)", max_payload_len);
+	if (cfg_mode == MODE_NONZC && cfg_flush)
+		error(1, 0, "-f: only zerocopy modes support notifications");
+	if (optind != argc - 1)
+		usage(argv[0]);
+}
+
+int main(int argc, char **argv)
+{
+	const char *cfg_test = argv[argc - 1];
+
+	parse_opts(argc, argv);
+
+	if (!strcmp(cfg_test, "tcp"))
+		do_test(cfg_family, SOCK_STREAM, 0);
+	else if (!strcmp(cfg_test, "udp"))
+		do_test(cfg_family, SOCK_DGRAM, 0);
+	else
+		error(1, 0, "unknown cfg_test %s", cfg_test);
+	return 0;
+}
diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
new file mode 100755
index 000000000000..6a65e4437640
--- /dev/null
+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
@@ -0,0 +1,131 @@
+#!/bin/bash
+#
+# Send data between two processes across namespaces
+# Run twice: once without and once with zerocopy
+
+set -e
+
+readonly DEV="veth0"
+readonly DEV_MTU=65535
+readonly BIN_TX="./io_uring_zerocopy_tx"
+readonly BIN_RX="./msg_zerocopy"
+
+readonly RAND="$(mktemp -u XXXXXX)"
+readonly NSPREFIX="ns-${RAND}"
+readonly NS1="${NSPREFIX}1"
+readonly NS2="${NSPREFIX}2"
+
+readonly SADDR4='192.168.1.1'
+readonly DADDR4='192.168.1.2'
+readonly SADDR6='fd::1'
+readonly DADDR6='fd::2'
+
+readonly path_sysctl_mem="net.core.optmem_max"
+
+# No arguments: automated test
+if [[ "$#" -eq "0" ]]; then
+	IPs=( "4" "6" )
+	protocols=( "tcp" "udp" )
+
+	for IP in "${IPs[@]}"; do
+		for proto in "${protocols[@]}"; do
+			for mode in $(seq 1 3); do
+				$0 "$IP" "$proto" -m "$mode" -t 1 -n 32
+				$0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -f
+				$0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -c -f
+			done
+		done
+	done
+
+	echo "OK. All tests passed"
+	exit 0
+fi
+
+# Argument parsing
+if [[ "$#" -lt "2" ]]; then
+	echo "Usage: $0 [4|6] [tcp|udp|raw|raw_hdrincl|packet|packet_dgram] <args>"
+	exit 1
+fi
+
+readonly IP="$1"
+shift
+readonly TXMODE="$1"
+shift
+readonly EXTRA_ARGS="$@"
+
+# Argument parsing: configure addresses
+if [[ "${IP}" == "4" ]]; then
+	readonly SADDR="${SADDR4}"
+	readonly DADDR="${DADDR4}"
+elif [[ "${IP}" == "6" ]]; then
+	readonly SADDR="${SADDR6}"
+	readonly DADDR="${DADDR6}"
+else
+	echo "Invalid IP version ${IP}"
+	exit 1
+fi
+
+# Argument parsing: select receive mode
+#
+# This differs from send mode for
+# - packet:	use raw recv, because packet receives skb clones
+# - raw_hdrinc: use raw recv, because hdrincl is a tx-only option
+case "${TXMODE}" in
+'packet' | 'packet_dgram' | 'raw_hdrincl')
+	RXMODE='raw'
+	;;
+*)
+	RXMODE="${TXMODE}"
+	;;
+esac
+
+# Start of state changes: install cleanup handler
+save_sysctl_mem="$(sysctl -n ${path_sysctl_mem})"
+
+cleanup() {
+	ip netns del "${NS2}"
+	ip netns del "${NS1}"
+	sysctl -w -q "${path_sysctl_mem}=${save_sysctl_mem}"
+}
+
+trap cleanup EXIT
+
+# Configure system settings
+sysctl -w -q "${path_sysctl_mem}=1000000"
+
+# Create virtual ethernet pair between network namespaces
+ip netns add "${NS1}"
+ip netns add "${NS2}"
+
+ip link add "${DEV}" mtu "${DEV_MTU}" netns "${NS1}" type veth \
+  peer name "${DEV}" mtu "${DEV_MTU}" netns "${NS2}"
+
+# Bring the devices up
+ip -netns "${NS1}" link set "${DEV}" up
+ip -netns "${NS2}" link set "${DEV}" up
+
+# Set fixed MAC addresses on the devices
+ip -netns "${NS1}" link set dev "${DEV}" address 02:02:02:02:02:02
+ip -netns "${NS2}" link set dev "${DEV}" address 06:06:06:06:06:06
+
+# Add fixed IP addresses to the devices
+ip -netns "${NS1}" addr add 192.168.1.1/24 dev "${DEV}"
+ip -netns "${NS2}" addr add 192.168.1.2/24 dev "${DEV}"
+ip -netns "${NS1}" addr add       fd::1/64 dev "${DEV}" nodad
+ip -netns "${NS2}" addr add       fd::2/64 dev "${DEV}" nodad
+
+# Optionally disable sg or csum offload to test edge cases
+# ip netns exec "${NS1}" ethtool -K "${DEV}" sg off
+
+do_test() {
+	local readonly ARGS="$1"
+
+	echo "ipv${IP} ${TXMODE} ${ARGS}"
+	ip netns exec "${NS2}" "${BIN_RX}" "-${IP}" -t 2 -C 2 -S "${SADDR}" -D "${DADDR}" -r "${RXMODE}" &
+	sleep 0.2
+	ip netns exec "${NS1}" "${BIN_TX}" "-${IP}" -t 1 -D "${DADDR}" ${ARGS} "${TXMODE}"
+	wait
+}
+
+do_test "${EXTRA_ARGS}"
+echo ok
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc
  2022-07-12 20:52 ` [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc Pavel Begunkov
@ 2022-07-19  1:54   ` Jakub Kicinski
  2022-07-19  9:35     ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2022-07-19  1:54 UTC (permalink / raw)
  To: Pavel Begunkov, Willem de Bruijn
  Cc: io-uring, netdev, linux-kernel, David S . Miller, Jonathan Lemon,
	Jens Axboe, David Ahern, kernel-team

On Tue, 12 Jul 2022 21:52:25 +0100 Pavel Begunkov wrote:
> Even when zerocopy transmission is requested and possible,
> __ip_append_data() will still copy a small chunk of data just because it
> allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles
> on copy and iter manipulations and also misalignes potentially aligned
> data. Avoid such coies. And as a bonus we can allocate smaller skb.

s/coies/copies/ can fix when applying

> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  net/ipv4/ip_output.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 00b4bf26fd93..581d1e233260 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -969,7 +969,6 @@ static int __ip_append_data(struct sock *sk,
>  	struct inet_sock *inet = inet_sk(sk);
>  	struct ubuf_info *uarg = NULL;
>  	struct sk_buff *skb;
> -
>  	struct ip_options *opt = cork->opt;
>  	int hh_len;
>  	int exthdrlen;
> @@ -977,6 +976,7 @@ static int __ip_append_data(struct sock *sk,
>  	int copy;
>  	int err;
>  	int offset = 0;
> +	bool zc = false;
>  	unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
>  	int csummode = CHECKSUM_NONE;
>  	struct rtable *rt = (struct rtable *)cork->dst;
> @@ -1025,6 +1025,7 @@ static int __ip_append_data(struct sock *sk,
>  		if (rt->dst.dev->features & NETIF_F_SG &&
>  		    csummode == CHECKSUM_PARTIAL) {
>  			paged = true;
> +			zc = true;
>  		} else {
>  			uarg->zerocopy = 0;
>  			skb_zcopy_set(skb, uarg, &extra_uref);
> @@ -1091,9 +1092,12 @@ static int __ip_append_data(struct sock *sk,
>  				 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
>  				  !(rt->dst.dev->features & NETIF_F_SG)))
>  				alloclen = fraglen;
> -			else {
> +			else if (!zc) {
>  				alloclen = min_t(int, fraglen, MAX_HEADER);

Willem, I think this came in with your GSO work, is there a reason we
use MAX_HEADER here? I thought MAX_HEADER is for headers (i.e. more or
less to be reserved) not for the min amount of data to be included.

I wanna make sure we're not missing something about GSO here.

Otherwise I don't think we need the extra branch but that can 
be a follow up.

>  				pagedlen = fraglen - alloclen;
> +			} else {
> +				alloclen = fragheaderlen + transhdrlen;
> +				pagedlen = datalen - transhdrlen;
>  			}
>  
>  			alloclen += alloc_extra;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc
  2022-07-19  1:54   ` Jakub Kicinski
@ 2022-07-19  9:35     ` Willem de Bruijn
  2022-07-21 10:03       ` Pavel Begunkov
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2022-07-19  9:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Pavel Begunkov, io-uring, netdev, linux-kernel, David S . Miller,
	Jonathan Lemon, Jens Axboe, David Ahern, kernel-team

On Tue, Jul 19, 2022 at 3:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 12 Jul 2022 21:52:25 +0100 Pavel Begunkov wrote:
> > Even when zerocopy transmission is requested and possible,
> > __ip_append_data() will still copy a small chunk of data just because it
> > allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles
> > on copy and iter manipulations and also misalignes potentially aligned
> > data. Avoid such coies. And as a bonus we can allocate smaller skb.
>
> s/coies/copies/ can fix when applying
>
> >
> > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> > ---
> >  net/ipv4/ip_output.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 00b4bf26fd93..581d1e233260 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -969,7 +969,6 @@ static int __ip_append_data(struct sock *sk,
> >       struct inet_sock *inet = inet_sk(sk);
> >       struct ubuf_info *uarg = NULL;
> >       struct sk_buff *skb;
> > -
> >       struct ip_options *opt = cork->opt;
> >       int hh_len;
> >       int exthdrlen;
> > @@ -977,6 +976,7 @@ static int __ip_append_data(struct sock *sk,
> >       int copy;
> >       int err;
> >       int offset = 0;
> > +     bool zc = false;
> >       unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
> >       int csummode = CHECKSUM_NONE;
> >       struct rtable *rt = (struct rtable *)cork->dst;
> > @@ -1025,6 +1025,7 @@ static int __ip_append_data(struct sock *sk,
> >               if (rt->dst.dev->features & NETIF_F_SG &&
> >                   csummode == CHECKSUM_PARTIAL) {
> >                       paged = true;
> > +                     zc = true;
> >               } else {
> >                       uarg->zerocopy = 0;
> >                       skb_zcopy_set(skb, uarg, &extra_uref);
> > @@ -1091,9 +1092,12 @@ static int __ip_append_data(struct sock *sk,
> >                                (fraglen + alloc_extra < SKB_MAX_ALLOC ||
> >                                 !(rt->dst.dev->features & NETIF_F_SG)))
> >                               alloclen = fraglen;
> > -                     else {
> > +                     else if (!zc) {
> >                               alloclen = min_t(int, fraglen, MAX_HEADER);
>
> Willem, I think this came in with your GSO work, is there a reason we
> use MAX_HEADER here? I thought MAX_HEADER is for headers (i.e. more or
> less to be reserved) not for the min amount of data to be included.
>
> I wanna make sure we're not missing something about GSO here.
>
> Otherwise I don't think we need the extra branch but that can
> be a follow up.

The change was introduced for UDP GSO, to avoid copying most payload
on software segmentation:

"
commit 15e36f5b8e982debe43e425d2e12d34e022d51e9
Author: Willem de Bruijn <willemb@google.com>
Date:   Thu Apr 26 13:42:19 2018 -0400

    udp: paged allocation with gso

    When sending large datagrams that are later segmented, store data in
    page frags to avoid copying from linear in skb_segment.
"

and in code

-                       else
-                               alloclen = datalen + fragheaderlen;
+                       else if (!paged)
+                               alloclen = fraglen;
+                       else {
+                               alloclen = min_t(int, fraglen, MAX_HEADER);
+                               pagedlen = fraglen - alloclen;
+                       }


MAX_HEADER was a short-hand for the exact header length. "alloclen =
fragheaderlen + transhdrlen;" is probably a better choice indeed.

Whether with branch or without, the same change needs to be made to
__ip6_append_data, just as in the referenced commit. Let's keep the
stacks in sync.

This is tricky code. If in doubt, run the msg_zerocopy and udp_gso
tests from tools/testing/selftests/net, ideally with KASAN.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: (subset) [PATCH net-next v5 00/27] io_uring zerocopy send
  2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
                   ` (26 preceding siblings ...)
  2022-07-12 20:52 ` [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send Pavel Begunkov
@ 2022-07-20 12:46 ` Jens Axboe
  27 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2022-07-20 12:46 UTC (permalink / raw)
  To: netdev, asml.silence, io-uring, linux-kernel
  Cc: davem, kuba, willemb, jonathan.lemon, kernel-team, dsahern

On Tue, 12 Jul 2022 21:52:24 +0100, Pavel Begunkov wrote:
> NOTE: Not to be picked directly. After getting necessary acks, I'll be
>       working out merging with Jakub and Jens.
> 
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers, mixing is allowed but not recommended. Apart from usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> the userspace when buffers are freed and can be reused (see API design below),
> which is delivered into io_uring's Completion Queue. Those "buffer-free"
> notifications are not necessarily per request, but the userspace has control
> over it and should explicitly attaching a number of requests to a single
> notification. The series also adds some internal optimisations when used with
> registered buffers like removing page referencing.
> 
> [...]

Applied, thanks!

[12/27] io_uring: initialise msghdr::msg_ubuf
        commit: 06f241e2bf4ba2a3e77269be25d21c0196a57a4f
[13/27] io_uring: export io_put_task()
        commit: ba64c07a6ef9a05ca9eb09e13b70df7500e78cf8
[14/27] io_uring: add zc notification infrastructure
        commit: 6f322c753daee4b9d4ad494d4e8b05da610d804c
[15/27] io_uring: cache struct io_notif
        commit: cf49e2d47c49e547d4bc370efe73785fc82354e5
[16/27] io_uring: complete notifiers in tw
        commit: 9cc16ae447db07d210175d2ad2419784dd20f784
[17/27] io_uring: add rsrc referencing for notifiers
        commit: e133e289093ea35c1f7f940fe4c0ceb62037dc59
[18/27] io_uring: add notification slot registration
        commit: f20b817fd29b64ef6de24b83ef23e1f3fb273967
[19/27] io_uring: wire send zc request type
        commit: 480ec5ff9a5a75d68423c0bd02e57a9ee6325320
[20/27] io_uring: account locked pages for non-fixed zc
        commit: fcb98e61d0232cff7dd14ae85ad1c88d68f98273
[21/27] io_uring: allow to pass addr into sendzc
        commit: 7ab12997edc9aa3e2be4169f929c50a1fcd41004
[22/27] io_uring: sendzc with fixed buffers
        commit: bb4019de9ea11d21137b4a8ff01d9e338071d633
[23/27] io_uring: flush notifiers after sendzc
        commit: 95a70c191696da64a6ae235d52132a5c17866dae
[24/27] io_uring: rename IORING_OP_FILES_UPDATE
        commit: d488e605a45192f9f60c7624d46ba0b8c4d93aab
[25/27] io_uring: add zc notification flush requests
        commit: cb155defb9bf20a647c8825a085695f3f94fdb60
[26/27] io_uring: enable managed frags with register buffers
        commit: 04ae3dbe8a027cf10ab759456ffc4fb119486f74
[27/27] selftests/io_uring: test zerocopy send
        commit: 0c450de20ce7d6bc8a2f97c98387baf910454477

Best regards,
-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc
  2022-07-19  9:35     ` Willem de Bruijn
@ 2022-07-21 10:03       ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-21 10:03 UTC (permalink / raw)
  To: Willem de Bruijn, Jakub Kicinski
  Cc: io-uring, netdev, linux-kernel, David S . Miller, Jonathan Lemon,
	Jens Axboe, David Ahern, kernel-team

On 7/19/22 10:35, Willem de Bruijn wrote:
> On Tue, Jul 19, 2022 at 3:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 12 Jul 2022 21:52:25 +0100 Pavel Begunkov wrote:
>>> Even when zerocopy transmission is requested and possible,
>>> __ip_append_data() will still copy a small chunk of data just because it
>>> allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles
>>> on copy and iter manipulations and also misalignes potentially aligned
>>> data. Avoid such coies. And as a bonus we can allocate smaller skb.
>>
>> s/coies/copies/ can fix when applying
>>
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>>>   net/ipv4/ip_output.c | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>> index 00b4bf26fd93..581d1e233260 100644
>>> --- a/net/ipv4/ip_output.c
>>> +++ b/net/ipv4/ip_output.c
>>> @@ -969,7 +969,6 @@ static int __ip_append_data(struct sock *sk,
>>>        struct inet_sock *inet = inet_sk(sk);
>>>        struct ubuf_info *uarg = NULL;
>>>        struct sk_buff *skb;
>>> -
>>>        struct ip_options *opt = cork->opt;
>>>        int hh_len;
>>>        int exthdrlen;
>>> @@ -977,6 +976,7 @@ static int __ip_append_data(struct sock *sk,
>>>        int copy;
>>>        int err;
>>>        int offset = 0;
>>> +     bool zc = false;
>>>        unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
>>>        int csummode = CHECKSUM_NONE;
>>>        struct rtable *rt = (struct rtable *)cork->dst;
>>> @@ -1025,6 +1025,7 @@ static int __ip_append_data(struct sock *sk,
>>>                if (rt->dst.dev->features & NETIF_F_SG &&
>>>                    csummode == CHECKSUM_PARTIAL) {
>>>                        paged = true;
>>> +                     zc = true;
>>>                } else {
>>>                        uarg->zerocopy = 0;
>>>                        skb_zcopy_set(skb, uarg, &extra_uref);
>>> @@ -1091,9 +1092,12 @@ static int __ip_append_data(struct sock *sk,
>>>                                 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
>>>                                  !(rt->dst.dev->features & NETIF_F_SG)))
>>>                                alloclen = fraglen;
>>> -                     else {
>>> +                     else if (!zc) {
>>>                                alloclen = min_t(int, fraglen, MAX_HEADER);
>>
>> Willem, I think this came in with your GSO work, is there a reason we
>> use MAX_HEADER here? I thought MAX_HEADER is for headers (i.e. more or
>> less to be reserved) not for the min amount of data to be included.
>>
>> I wanna make sure we're not missing something about GSO here.
>>
>> Otherwise I don't think we need the extra branch but that can
>> be a follow up.

I brought it up before but left it for later as I don't know workloads
and there might be perf implications. I'll send a follow up.

> The change was introduced for UDP GSO, to avoid copying most payload
> on software segmentation:
> 
> "
> commit 15e36f5b8e982debe43e425d2e12d34e022d51e9
> Author: Willem de Bruijn <willemb@google.com>
> Date:   Thu Apr 26 13:42:19 2018 -0400
> 
>      udp: paged allocation with gso
> 
>      When sending large datagrams that are later segmented, store data in
>      page frags to avoid copying from linear in skb_segment.
> "
> 
> and in code
> 
> -                       else
> -                               alloclen = datalen + fragheaderlen;
> +                       else if (!paged)
> +                               alloclen = fraglen;
> +                       else {
> +                               alloclen = min_t(int, fraglen, MAX_HEADER);
> +                               pagedlen = fraglen - alloclen;
> +                       }
> 
> 
> MAX_HEADER was a short-hand for the exact header length. "alloclen =
> fragheaderlen + transhdrlen;" is probably a better choice indeed.

Great, thanks for taking a look!

> 
> Whether with branch or without, the same change needs to be made to
> __ip6_append_data, just as in the referenced commit. Let's keep the
> stacks in sync.

__ip6_append_data() is changed as well but in the following patch.
I had doubts whether it's preferable to keep ipv4 and ipv6 changes
separately.

> This is tricky code. If in doubt, run the msg_zerocopy and udp_gso
> tests from tools/testing/selftests/net, ideally with KASAN.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send
  2022-07-12 20:52 ` [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send Pavel Begunkov
@ 2022-07-27  8:01   ` dust.li
  2022-07-27  9:18     ` Pavel Begunkov
  0 siblings, 1 reply; 34+ messages in thread
From: dust.li @ 2022-07-27  8:01 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team

On Tue, Jul 12, 2022 at 09:52:51PM +0100, Pavel Begunkov wrote:
>Add selftests for io_uring zerocopy sends and io_uring's notification
>infrastructure. It's largely influenced by msg_zerocopy and uses it on
>the receive side.
>
>Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>---
> tools/testing/selftests/net/Makefile          |   1 +
> .../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
> .../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
> 3 files changed, 737 insertions(+)
> create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
> create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>
>diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
>index 7ea54af55490..51261483744e 100644
>--- a/tools/testing/selftests/net/Makefile
>+++ b/tools/testing/selftests/net/Makefile
>@@ -59,6 +59,7 @@ TEST_GEN_FILES += toeplitz
> TEST_GEN_FILES += cmsg_sender
> TEST_GEN_FILES += stress_reuseport_listen
> TEST_PROGS += test_vxlan_vnifiltering.sh
>+TEST_GEN_FILES += io_uring_zerocopy_tx
> 
> TEST_FILES := settings
> 
>diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
>new file mode 100644
>index 000000000000..9d64c560a2d6
>--- /dev/null
>+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
>@@ -0,0 +1,605 @@
>+/* SPDX-License-Identifier: MIT */
>+/* based on linux-kernel/tools/testing/selftests/net/msg_zerocopy.c */
>+#include <assert.h>
>+#include <errno.h>
>+#include <error.h>
>+#include <fcntl.h>
>+#include <limits.h>
>+#include <stdbool.h>
>+#include <stdint.h>
>+#include <stdio.h>
>+#include <stdlib.h>
>+#include <string.h>
>+#include <unistd.h>
>+
>+#include <arpa/inet.h>
>+#include <linux/errqueue.h>
>+#include <linux/if_packet.h>
>+#include <linux/io_uring.h>
>+#include <linux/ipv6.h>
>+#include <linux/socket.h>
>+#include <linux/sockios.h>
>+#include <net/ethernet.h>
>+#include <net/if.h>
>+#include <netinet/in.h>
>+#include <netinet/ip.h>
>+#include <netinet/ip6.h>
>+#include <netinet/tcp.h>
>+#include <netinet/udp.h>
>+#include <sys/ioctl.h>
>+#include <sys/mman.h>
>+#include <sys/resource.h>
>+#include <sys/socket.h>
>+#include <sys/stat.h>
>+#include <sys/time.h>
>+#include <sys/types.h>
>+#include <sys/un.h>
>+#include <sys/wait.h>
>+
>+#define NOTIF_TAG 0xfffffffULL
>+#define NONZC_TAG 0
>+#define ZC_TAG 1
>+

<...>

>+static void do_test(int domain, int type, int protocol)
>+{
>+	int i;
>+
>+	for (i = 0; i < IP_MAXPACKET; i++)
>+		payload[i] = 'a' + (i % 26);
>+	do_tx(domain, type, protocol);
>+}
>+
>+static void usage(const char *filepath)
>+{
>+	error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
>+		    "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);

A small flaw, the usage here doesn't match the real options in parse_opts().

Thanks

>+}
>+
>+static void parse_opts(int argc, char **argv)
>+{
>+	const int max_payload_len = sizeof(payload) -
>+				    sizeof(struct ipv6hdr) -
>+				    sizeof(struct tcphdr) -
>+				    40 /* max tcp options */;
>+	struct sockaddr_in6 *addr6 = (void *) &cfg_dst_addr;
>+	struct sockaddr_in *addr4 = (void *) &cfg_dst_addr;
>+	char *daddr = NULL;
>+	int c;
>+
>+	if (argc <= 1)
>+		usage(argv[0]);
>+	cfg_payload_len = max_payload_len;
>+
>+	while ((c = getopt(argc, argv, "46D:p:s:t:n:fc:m:")) != -1) {
>+		switch (c) {
>+		case '4':
>+			if (cfg_family != PF_UNSPEC)
>+				error(1, 0, "Pass one of -4 or -6");
>+			cfg_family = PF_INET;
>+			cfg_alen = sizeof(struct sockaddr_in);
>+			break;
>+		case '6':
>+			if (cfg_family != PF_UNSPEC)
>+				error(1, 0, "Pass one of -4 or -6");
>+			cfg_family = PF_INET6;
>+			cfg_alen = sizeof(struct sockaddr_in6);
>+			break;
>+		case 'D':
>+			daddr = optarg;
>+			break;
>+		case 'p':
>+			cfg_port = strtoul(optarg, NULL, 0);
>+			break;
>+		case 's':
>+			cfg_payload_len = strtoul(optarg, NULL, 0);
>+			break;
>+		case 't':
>+			cfg_runtime_ms = 200 + strtoul(optarg, NULL, 10) * 1000;
>+			break;
>+		case 'n':
>+			cfg_nr_reqs = strtoul(optarg, NULL, 0);
>+			break;
>+		case 'f':
>+			cfg_flush = 1;
>+			break;
>+		case 'c':
>+			cfg_cork = strtol(optarg, NULL, 0);
>+			break;
>+		case 'm':
>+			cfg_mode = strtol(optarg, NULL, 0);
>+			break;
>+		}
>+	}
>+
>+	switch (cfg_family) {
>+	case PF_INET:
>+		memset(addr4, 0, sizeof(*addr4));
>+		addr4->sin_family = AF_INET;
>+		addr4->sin_port = htons(cfg_port);
>+		if (daddr &&
>+		    inet_pton(AF_INET, daddr, &(addr4->sin_addr)) != 1)
>+			error(1, 0, "ipv4 parse error: %s", daddr);
>+		break;
>+	case PF_INET6:
>+		memset(addr6, 0, sizeof(*addr6));
>+		addr6->sin6_family = AF_INET6;
>+		addr6->sin6_port = htons(cfg_port);
>+		if (daddr &&
>+		    inet_pton(AF_INET6, daddr, &(addr6->sin6_addr)) != 1)
>+			error(1, 0, "ipv6 parse error: %s", daddr);
>+		break;
>+	default:
>+		error(1, 0, "illegal domain");
>+	}
>+
>+	if (cfg_payload_len > max_payload_len)
>+		error(1, 0, "-s: payload exceeds max (%d)", max_payload_len);
>+	if (cfg_mode == MODE_NONZC && cfg_flush)
>+		error(1, 0, "-f: only zerocopy modes support notifications");
>+	if (optind != argc - 1)
>+		usage(argv[0]);
>+}
>+
>+int main(int argc, char **argv)
>+{
>+	const char *cfg_test = argv[argc - 1];
>+
>+	parse_opts(argc, argv);
>+
>+	if (!strcmp(cfg_test, "tcp"))
>+		do_test(cfg_family, SOCK_STREAM, 0);
>+	else if (!strcmp(cfg_test, "udp"))
>+		do_test(cfg_family, SOCK_DGRAM, 0);
>+	else
>+		error(1, 0, "unknown cfg_test %s", cfg_test);
>+	return 0;
>+}
>diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>new file mode 100755
>index 000000000000..6a65e4437640
>--- /dev/null
>+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>@@ -0,0 +1,131 @@
>+#!/bin/bash
>+#
>+# Send data between two processes across namespaces
>+# Run twice: once without and once with zerocopy
>+
>+set -e
>+
>+readonly DEV="veth0"
>+readonly DEV_MTU=65535
>+readonly BIN_TX="./io_uring_zerocopy_tx"
>+readonly BIN_RX="./msg_zerocopy"
>+
>+readonly RAND="$(mktemp -u XXXXXX)"
>+readonly NSPREFIX="ns-${RAND}"
>+readonly NS1="${NSPREFIX}1"
>+readonly NS2="${NSPREFIX}2"
>+
>+readonly SADDR4='192.168.1.1'
>+readonly DADDR4='192.168.1.2'
>+readonly SADDR6='fd::1'
>+readonly DADDR6='fd::2'
>+
>+readonly path_sysctl_mem="net.core.optmem_max"
>+
>+# No arguments: automated test
>+if [[ "$#" -eq "0" ]]; then
>+	IPs=( "4" "6" )
>+	protocols=( "tcp" "udp" )
>+
>+	for IP in "${IPs[@]}"; do
>+		for proto in "${protocols[@]}"; do
>+			for mode in $(seq 1 3); do
>+				$0 "$IP" "$proto" -m "$mode" -t 1 -n 32
>+				$0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -f
>+				$0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -c -f
>+			done
>+		done
>+	done
>+
>+	echo "OK. All tests passed"
>+	exit 0
>+fi
>+
>+# Argument parsing
>+if [[ "$#" -lt "2" ]]; then
>+	echo "Usage: $0 [4|6] [tcp|udp|raw|raw_hdrincl|packet|packet_dgram] <args>"
>+	exit 1
>+fi
>+
>+readonly IP="$1"
>+shift
>+readonly TXMODE="$1"
>+shift
>+readonly EXTRA_ARGS="$@"
>+
>+# Argument parsing: configure addresses
>+if [[ "${IP}" == "4" ]]; then
>+	readonly SADDR="${SADDR4}"
>+	readonly DADDR="${DADDR4}"
>+elif [[ "${IP}" == "6" ]]; then
>+	readonly SADDR="${SADDR6}"
>+	readonly DADDR="${DADDR6}"
>+else
>+	echo "Invalid IP version ${IP}"
>+	exit 1
>+fi
>+
>+# Argument parsing: select receive mode
>+#
>+# This differs from send mode for
>+# - packet:	use raw recv, because packet receives skb clones
>+# - raw_hdrinc: use raw recv, because hdrincl is a tx-only option
>+case "${TXMODE}" in
>+'packet' | 'packet_dgram' | 'raw_hdrincl')
>+	RXMODE='raw'
>+	;;
>+*)
>+	RXMODE="${TXMODE}"
>+	;;
>+esac
>+
>+# Start of state changes: install cleanup handler
>+save_sysctl_mem="$(sysctl -n ${path_sysctl_mem})"
>+
>+cleanup() {
>+	ip netns del "${NS2}"
>+	ip netns del "${NS1}"
>+	sysctl -w -q "${path_sysctl_mem}=${save_sysctl_mem}"
>+}
>+
>+trap cleanup EXIT
>+
>+# Configure system settings
>+sysctl -w -q "${path_sysctl_mem}=1000000"
>+
>+# Create virtual ethernet pair between network namespaces
>+ip netns add "${NS1}"
>+ip netns add "${NS2}"
>+
>+ip link add "${DEV}" mtu "${DEV_MTU}" netns "${NS1}" type veth \
>+  peer name "${DEV}" mtu "${DEV_MTU}" netns "${NS2}"
>+
>+# Bring the devices up
>+ip -netns "${NS1}" link set "${DEV}" up
>+ip -netns "${NS2}" link set "${DEV}" up
>+
>+# Set fixed MAC addresses on the devices
>+ip -netns "${NS1}" link set dev "${DEV}" address 02:02:02:02:02:02
>+ip -netns "${NS2}" link set dev "${DEV}" address 06:06:06:06:06:06
>+
>+# Add fixed IP addresses to the devices
>+ip -netns "${NS1}" addr add 192.168.1.1/24 dev "${DEV}"
>+ip -netns "${NS2}" addr add 192.168.1.2/24 dev "${DEV}"
>+ip -netns "${NS1}" addr add       fd::1/64 dev "${DEV}" nodad
>+ip -netns "${NS2}" addr add       fd::2/64 dev "${DEV}" nodad
>+
>+# Optionally disable sg or csum offload to test edge cases
>+# ip netns exec "${NS1}" ethtool -K "${DEV}" sg off
>+
>+do_test() {
>+	local readonly ARGS="$1"
>+
>+	echo "ipv${IP} ${TXMODE} ${ARGS}"
>+	ip netns exec "${NS2}" "${BIN_RX}" "-${IP}" -t 2 -C 2 -S "${SADDR}" -D "${DADDR}" -r "${RXMODE}" &
>+	sleep 0.2
>+	ip netns exec "${NS1}" "${BIN_TX}" "-${IP}" -t 1 -D "${DADDR}" ${ARGS} "${TXMODE}"
>+	wait
>+}
>+
>+do_test "${EXTRA_ARGS}"
>+echo ok
>-- 
>2.37.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send
  2022-07-27  8:01   ` dust.li
@ 2022-07-27  9:18     ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2022-07-27  9:18 UTC (permalink / raw)
  To: dust.li, io-uring, netdev, linux-kernel
  Cc: David S . Miller, Jakub Kicinski, Jonathan Lemon,
	Willem de Bruijn, Jens Axboe, David Ahern, kernel-team

On 7/27/22 09:01, dust.li wrote:

>> +static void do_test(int domain, int type, int protocol)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < IP_MAXPACKET; i++)
>> +		payload[i] = 'a' + (i % 26);
>> +	do_tx(domain, type, protocol);
>> +}
>> +
>> +static void usage(const char *filepath)
>> +{
>> +	error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
>> +		    "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);
> 
> A small flaw, the usage here doesn't match the real options in parse_opts().

Indeed. I'll adjust it, thanks!

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-07-27  9:20 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-12 20:52 [PATCH net-next v5 00/27] io_uring zerocopy send Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc Pavel Begunkov
2022-07-19  1:54   ` Jakub Kicinski
2022-07-19  9:35     ` Willem de Bruijn
2022-07-21 10:03       ` Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 02/27] ipv6: " Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 03/27] skbuff: don't mix ubuf_info from different sources Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 04/27] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 05/27] skbuff: carry external ubuf_info in msghdr Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 06/27] net: Allow custom iter handler " Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 07/27] net: introduce managed frags infrastructure Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 08/27] net: introduce __skb_fill_page_desc_noacc Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 09/27] ipv4/udp: support externally provided ubufs Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 10/27] ipv6/udp: " Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 11/27] tcp: " Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 12/27] io_uring: initialise msghdr::msg_ubuf Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 13/27] io_uring: export io_put_task() Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 14/27] io_uring: add zc notification infrastructure Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 15/27] io_uring: cache struct io_notif Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 16/27] io_uring: complete notifiers in tw Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 17/27] io_uring: add rsrc referencing for notifiers Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 18/27] io_uring: add notification slot registration Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 19/27] io_uring: wire send zc request type Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 20/27] io_uring: account locked pages for non-fixed zc Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 21/27] io_uring: allow to pass addr into sendzc Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 22/27] io_uring: sendzc with fixed buffers Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 23/27] io_uring: flush notifiers after sendzc Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 24/27] io_uring: rename IORING_OP_FILES_UPDATE Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 25/27] io_uring: add zc notification flush requests Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 26/27] io_uring: enable managed frags with register buffers Pavel Begunkov
2022-07-12 20:52 ` [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send Pavel Begunkov
2022-07-27  8:01   ` dust.li
2022-07-27  9:18     ` Pavel Begunkov
2022-07-20 12:46 ` (subset) [PATCH net-next v5 00/27] io_uring " Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).